Fulvio Sanguigni, Davide Morelli, Marcella Cornia, Rita Cucchiara
In recent years, the fashion industry has increasingly adopted AI
technologies to enhance customer experience, driven by the proliferation of
e-commerce platforms and virtual applications. Among the various tasks, virtual
try-on and multimodal fashion image editing -- which utilizes diverse input
modalities such as text, garment sketches, and body poses -- have become a key
area of research. Diffusion models have emerged as a leading approach for such
generative tasks, offering superior image quality and diversity. However, most
existing virtual try-on methods rely on having a specific garment input, which
is often impractical in real-world scenarios where users may only provide
textual specifications. To address this limitation, in this work we introduce
Fashion Retrieval-Augmented Generation (Fashion-RAG), a novel method that
enables the customization of fashion items based on user preferences provided
in textual form. Our approach retrieves multiple garments that match the input
specifications and generates a personalized image by incorporating attributes
from the retrieved items. To achieve this, we employ textual inversion
techniques, where retrieved garment images are projected into the textual
embedding space of the Stable Diffusion text encoder, allowing seamless
integration of retrieved elements into the generative process. Experimental
results on the Dress Code dataset demonstrate that Fashion-RAG outperforms
existing methods both qualitatively and quantitatively, effectively capturing
fine-grained visual details from retrieved garments. To the best of our
knowledge, this is the first work to introduce a retrieval-augmented generation
approach specifically tailored for multimodal fashion image editing.
Authors' comments: IJCNN 2025
Nandan Thakur, Jimmy Lin, Sam Havens, Michael Carbin, Omar Khattab, Andrew Drozdov
We introduce FreshStack, a holistic framework for automatically building
information retrieval (IR) evaluation benchmarks by incorporating challenging
questions and answers. FreshStack conducts the following steps: (1) automatic
corpus collection from code and technical documentation, (2) nugget generation
from community-asked questions and answers, and (3) nugget-level support,
retrieving documents using a fusion of retrieval techniques and hybrid
architectures. We use FreshStack to build five datasets on fast-growing,
recent, and niche topics to ensure the tasks are sufficiently challenging. On
FreshStack, existing retrieval models, when applied out-of-the-box,
significantly underperform oracle approaches on all five topics, denoting
plenty of headroom to improve IR quality. In addition, we identify cases where
rerankers do not improve first-stage retrieval accuracy (two out of five
topics) and oracle context helps an LLM generator generate a high-quality RAG
answer. We hope FreshStack will facilitate future work toward constructing
realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.
Authors' comments: 21 pages, 4 figures, 8 tables
Yuxuan Zong, Benjamin Piwowarski
Late interaction neural IR models like ColBERT offer a competitive
effectiveness-efficiency trade-off across many benchmarks. However, they
require a huge memory space to store the contextual representation for all the
document tokens. Some works have proposed using either heuristics or
statistical techniques to prune tokens from each document. This, however,
does not guarantee that the removed tokens have no impact on the retrieval
score. Our work uses a principled approach to define how to prune tokens
without impacting the score between a document and a query. We introduce three
regularization losses that induce a solution with high pruning ratios, as well
as two pruning strategies. We study them experimentally (in-domain and
out-of-domain), showing that we can preserve ColBERT's performance while using
only 30% of the tokens.
Authors' comments: Accepted at SIGIR 2025 Full Paper Track
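As a rough illustration of the late-interaction scoring that the entry above prunes against, the sketch below computes a ColBERT-style MaxSim score and shows that a document token which is never the best match for any query token can be dropped without changing the score. The vectors are toy values, and the helper `unused_doc_tokens` is a hypothetical name. It only checks one given query, whereas the abstract describes pruning that leaves the score unchanged in general, so this is not the paper's pruning strategy.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query token takes its best-matching document
    token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def unused_doc_tokens(query_vecs, doc_vecs):
    """Indices of document tokens that are never the argmax for THIS query
    (hypothetical helper; a real criterion must hold for all queries)."""
    winners = {
        max(range(len(doc_vecs)), key=lambda i: dot(q, doc_vecs[i]))
        for q in query_vecs
    }
    return [i for i in range(len(doc_vecs)) if i not in winners]


# Toy 2-d token embeddings (assumed values, not real model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

unused = unused_doc_tokens(query, doc)
pruned = [d for i, d in enumerate(doc) if i not in unused]

full_score = maxsim_score(query, doc)
pruned_score = maxsim_score(query, pruned)
```

Here the third document token is dominated for both query tokens, so removing it shrinks the stored document by a third while the score stays the same.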
Yichao Feng, Shuai Zhao, Yueqiu Li, Luwei Xiao, Xiaobao Wu, Anh Tuan Luu
Aspect-based summarization aims to generate summaries tailored to specific aspects, addressing the resource constraints and limited generalizability of traditional summarization approaches. Recently, large language models have shown promise in this task without the need for training. However, they rely excessively on prompt engineering and face token limits and hallucination challenges, especially with in-context learning. To address these challenges, in this paper, we propose a novel framework for aspect-based summarization: Self-Aspect Retrieval Enhanced Summary Generation. Rather than relying solely on in-context learning, given an aspect, we employ an embedding-driven retrieval mechanism to identify its relevant text segments. This approach extracts the pertinent content while avoiding unnecessary details, thereby mitigating the challenge of token limits. Moreover, our framework optimizes token usage by deleting unrelated parts of the text and ensuring that the model generates output strictly based on the given aspect. With extensive experiments on benchmark datasets, we demonstrate that our framework not only achieves superior performance but also effectively mitigates the token limitation problem.
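The embedding-driven retrieval step described in the entry above can be sketched as follows. This is a minimal illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a learned embedding model, and `retrieve_segments` and the example segments are hypothetical, not taken from the paper.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_segments(aspect, segments, top_k=2):
    """Keep only the top_k segments most similar to the aspect, so that the
    summarizer sees less unrelated text."""
    a = embed(aspect)
    ranked = sorted(segments, key=lambda s: cosine(a, embed(s)), reverse=True)
    return ranked[:top_k]


segments = [
    "The battery lasts two days and battery charging is fast",
    "The screen is bright and sharp",
    "Battery life drops in cold weather",
    "Shipping took one week",
]
selected = retrieve_segments("battery life", segments, top_k=2)
```

Only the selected segments would then be passed to the language model, which is how retrieval of this kind can reduce the number of input tokens.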
Grigory Kovalev, Mikhail Tikhomirov, Evgeny Kozhevnikov, Max Kornilov, Natalia Loukachevitch
We introduce RusBEIR, a comprehensive benchmark designed for zero-shot evaluation of information retrieval (IR) models in the Russian language. Comprising 17 datasets from various domains, it integrates adapted, translated, and newly created datasets, enabling systematic comparison of lexical and neural models. Our study highlights the importance of preprocessing for lexical models in morphologically rich languages and confirms BM25 as a strong baseline for full-document retrieval. Neural models, such as mE5-large and BGE-M3, demonstrate superior performance on most datasets, but face challenges with long-document retrieval due to input size constraints. RusBEIR offers a unified, open-source framework that promotes research in Russian-language information retrieval.
Singon Kim, Gunho Jung, Seong-Whan Lee
Abstractive compression utilizes smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language model-based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform fine-tuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence. ACoRN excels on datasets with many accuracy-reducing documents, making it highly useful in real-world scenarios.
Elahe Khatibi, Ziyu Wang, Amir M. Rahmani
Retrieval-Augmented Generation (RAG) has significantly enhanced large language models (LLMs) in knowledge-intensive tasks by incorporating external knowledge retrieval. However, existing RAG frameworks primarily rely on semantic similarity and correlation-driven retrieval, limiting their ability to distinguish true causal relationships from spurious associations. This results in responses that may be factually grounded but fail to establish cause-and-effect mechanisms, leading to incomplete or misleading insights. To address this issue, we introduce Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation (CDF-RAG), a framework designed to improve causal consistency, factual accuracy, and explainability in generative reasoning. CDF-RAG iteratively refines queries, retrieves structured causal graphs, and enables multi-hop causal reasoning across interconnected knowledge sources. Additionally, it validates responses against causal pathways, ensuring logically coherent and factually grounded outputs. We evaluate CDF-RAG on four diverse datasets, demonstrating its ability to improve response accuracy and causal correctness over existing RAG-based methods. Our code is publicly available at https://github.com/elakhatibi/CDF-RAG.
Shangyu Liu, Zhenzhe Zheng, Xiaoyao Huang, Fan Wu, Jie Wu
Small language models (SLMs) support efficient deployments on resource-constrained edge devices, but their limited capacity compromises inference performance. Retrieval-augmented generation (RAG) is a promising solution to enhance model performance by integrating external databases, without requiring intensive on-device model retraining. However, large-scale public databases and user-specific private contextual documents are typically located on the cloud and the device separately, while existing RAG implementations are primarily centralized. To bridge this gap, we propose DRAGON, a distributed RAG framework to enhance on-device SLMs through both general and personal knowledge without the risk of leaking document privacy. Specifically, DRAGON decomposes multi-document RAG into multiple parallel token generation processes performed independently and locally on the cloud and the device, and employs a newly designed Speculative Aggregation, a dual-side speculative algorithm to avoid frequent output synchronization between the cloud and device. A new scheduling algorithm is further introduced to identify the optimal aggregation side based on real-time network conditions. Evaluations on a real-world hardware testbed demonstrate a significant performance improvement of DRAGON: up to 1.9x greater gains over standalone SLM compared to the centralized RAG, substantial reduction in per-token latency, and negligible Time to First Token (TTFT) overhead.
Peipei Song, Long Zhang, Long Lan, Weidong Chen, Dan Guo, Xun Yang, Meng Wang
Partially relevant video retrieval (PRVR) is a practical yet challenging task
in text-to-video retrieval, where videos are untrimmed and contain much
background content. The pursuit here is of both effective and efficient
solutions to capture the partial correspondence between text queries and
untrimmed videos. Existing PRVR methods, which typically focus on modeling
multi-scale clip representations, however, suffer from content independence and
information redundancy, impairing retrieval performance. To overcome these
limitations, we propose a simple yet effective approach with active moment
discovering (AMDNet). We are committed to discovering video moments that are
semantically consistent with their queries. By using learnable span anchors to
capture distinct moments and applying masked multi-moment attention to
emphasize salient moments while suppressing redundant backgrounds, we achieve
more compact and informative video representations. To further enhance moment
modeling, we introduce a moment diversity loss to encourage different moments
of distinct regions and a moment relevance loss to promote semantically
query-relevant moments, which cooperate with a partially relevant retrieval
loss for end-to-end optimization. Extensive experiments on two large-scale
video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority
and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller
(#parameters) while 6.0 points higher (SumR) than the up-to-date method
GMMFormer on TVR.
Authors' comments: Accepted by IEEE Transactions on Multimedia (TMM) on January 19,
2025. The code is available at https://github.com/songpipi/AMDNet
Borong Zhang, Qin Li, Zichao Wendy Di
We introduce MAGPIE (Multilevel-Adaptive-Guided Ptychographic Iterative Engine), a stochastic multigrid solver for the ptychographic phase-retrieval problem. The ptychographic phase-retrieval problem is inherently nonconvex and ill-posed. To address these challenges, we reformulate the original nonlinear and nonconvex inverse problem as the iterative minimization of a quadratic surrogate model that majorizes the original objective. This surrogate not only ensures favorable convergence properties but also generalizes the Ptychographic Iterative Engine (PIE) family of algorithms. By solving the surrogate model using a multigrid method, MAGPIE achieves substantial gains in convergence speed and reconstruction quality over traditional approaches.
Zhichao Xu, Aosong Feng, Yijun Tian, Haibo Ding, Lin Lee Cheong
In recent years, dense retrieval has been the focus of information retrieval (IR) research. While effective, dense retrieval produces uninterpretable dense vectors, and suffers from the drawback of large index size. Learned sparse retrieval (LSR) has emerged as a promising alternative, achieving competitive retrieval performance while also being able to leverage the classical inverted index data structure for efficient retrieval. However, few works have explored scaling LSR beyond BERT scale. In this work, we identify two challenges in training large language models (LLMs) for LSR: (1) training instability during the early stage of contrastive training; (2) suboptimal performance due to the pre-trained LLM's unidirectional attention. To address these challenges, we propose two corresponding techniques: (1) a lightweight adaptation training phase to eliminate training instability; (2) two model variants to enable bidirectional information. With these techniques, we are able to train LSR models with 8B-scale LLMs, and achieve competitive retrieval performance with reduced index size. Furthermore, we are among the first to analyze the performance-efficiency tradeoff of LLM-based LSR models through the lens of model quantization. Our findings provide insights into adapting LLMs for efficient retrieval modeling.
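To make concrete why sparse representations pair well with an inverted index, as the entry above notes, here is a minimal sketch. The term weights are made-up values; in learned sparse retrieval they would be produced by a trained model. The function names `build_inverted_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def search(index, query_vec, top_k=3):
    """Score documents by the dot product of sparse vectors, touching only
    the postings lists of terms that appear in the query."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Assumed sparse term weights (illustrative only).
docs = {
    "d1": {"retrieval": 1.5, "sparse": 2.0},
    "d2": {"retrieval": 1.0, "dense": 2.0},
    "d3": {"quantization": 1.0},
}
index = build_inverted_index(docs)
results = search(index, {"sparse": 1.0, "retrieval": 0.5})
```

Document `d3` shares no terms with the query, so it is never visited during scoring; this is the efficiency property an inverted index provides.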
The Tien Mai
This paper addresses the problem of sparse phase retrieval, a fundamental inverse problem in applied mathematics, physics, and engineering, where a signal needs to be reconstructed using only the magnitude of its transformation while phase information remains inaccessible. Leveraging the inherent sparsity of many real-world signals, we introduce a novel sparse quasi-Bayesian approach and provide the first theoretical guarantees for such an approach. Specifically, we employ a scaled Student distribution as a continuous shrinkage prior to enforce sparsity and analyze the method using the PAC-Bayesian inequality framework. Our results establish that the proposed Bayesian estimator achieves minimax-optimal convergence rates under sub-exponential noise, matching those of state-of-the-art frequentist methods. To ensure computational feasibility, we develop an efficient Langevin Monte Carlo sampling algorithm. Through numerical experiments, we demonstrate that our method performs comparably to existing frequentist techniques, highlighting its potential as a principled alternative for sparse phase retrieval in noisy settings.
Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, Jun Ma
While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework comprises a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems. Code is available at https://github.com/ocean-luna/HMRAG.
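The consistency-voting idea mentioned in the entry above can be illustrated with a simple majority vote over per-source answers. This is only a sketch under assumptions: the abstract does not specify the voting rule, so the normalization, the `threshold` parameter, and the `refine` callback (a stand-in for a refinement step) are all hypothetical.

```python
from collections import Counter


def consistency_vote(answers):
    """Return (most common normalized answer, share of sources that gave it)."""
    normalized = [a.strip().lower() for a in answers.values()]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


def decide(answers, refine, threshold=0.5):
    """Accept the majority answer, or hand off to `refine` when agreement
    among sources is at or below the (assumed) threshold."""
    winner, agreement = consistency_vote(answers)
    if agreement > threshold:
        return winner
    return refine(answers)  # hypothetical stand-in for a refinement step


# Answers keyed by retrieval source (toy values).
agreeing = {"vector": "B", "graph": "b ", "web": "C"}
split = {"vector": "A", "graph": "B", "web": "C"}


def fallback(answers):
    return "needs refinement"


majority_answer = decide(agreeing, fallback)
split_answer = decide(split, fallback)
```

With two of three sources agreeing, the majority answer is returned directly; when all three disagree, the decision is deferred to the fallback.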
Jasper Linders, Jakub M. Tomczak
Large Language Models (LLMs) and Knowledge Graphs (KGs) offer a promising approach to robust and explainable Question Answering (QA). While LLMs excel at natural language understanding, they suffer from knowledge gaps and hallucinations. KGs provide structured knowledge but lack natural language interaction. Ideally, an AI system should be both robust to missing facts and easy to communicate with. This paper proposes such a system that integrates LLMs and KGs without requiring training, ensuring adaptability across different KGs with minimal human effort. The resulting approach can be classified as a specific form of Retrieval Augmented Generation (RAG) with a KG; thus, it is dubbed Knowledge Graph-extended Retrieval Augmented Generation (KG-RAG). It includes a question decomposition module to enhance multi-hop information retrieval and answer explainability. Using In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting, it generates explicit reasoning chains processed separately to improve truthfulness. Experiments on the MetaQA benchmark show increased accuracy for multi-hop questions, though with a slight trade-off in single-hop performance compared to LLM with KG baselines. These findings demonstrate KG-RAG's potential to improve transparency in QA by bridging unstructured language understanding with structured knowledge retrieval.
Gregor Donabauer, Udo Kruschwitz
Legal retrieval is a widely studied area in Information Retrieval (IR) and a
key task in this domain is retrieving relevant cases based on a given query
case, often done by applying language models as encoders to model case
similarity. Recently, Tang et al. proposed CaseLink, a novel graph-based method
for legal case retrieval, which models both cases and legal charges as nodes in
a network, with edges representing relationships such as references and shared
semantics. This approach offers a new perspective on the task by capturing
higher-order relationships of cases going beyond the stand-alone level of
documents. However, while this shift in approaching legal case retrieval is a
promising direction in an understudied area of graph-based legal IR, challenges
in reproducing novel results have recently been highlighted, with multiple
studies reporting difficulties in reproducing previous findings. Thus, in this
work we reproduce CaseLink, a graph-based legal case retrieval method, to
support future research in this area of IR. In particular, we aim to assess its
reliability and generalizability by (i) first reproducing the original study
setup and (ii) applying the approach to an additional dataset. We then build
upon the original implementations by (iii) evaluating the approach's
performance when using a more sophisticated graph data representation and (iv)
using an open large language model (LLM) in the pipeline to address limitations
that are known to result from using closed models accessed via an API. Our
findings aim to improve the understanding of graph-based approaches in legal IR
and contribute to improving reproducibility in the field. To achieve this, we
share all our implementations and experimental artifacts with the community.
Authors' comments: Preprint accepted at SIGIR 2025
Arman Khaledian, Amirreza Ghadiridehkordi, Nariman Khaledian
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for
grounding large language models in external knowledge sources, improving the
precision of agents' responses. However, high-dimensional language model
embeddings, often in the range of hundreds to thousands of dimensions, can
present scalability challenges in terms of storage and latency, especially when
processing massive financial text corpora. This paper investigates the use of
Principal Component Analysis (PCA) to reduce embedding dimensionality, thereby
mitigating computational bottlenecks without incurring large accuracy losses.
We experiment with a real-world dataset and compare different similarity and
distance metrics under both full-dimensional and PCA-compressed embeddings. Our
results show that reducing vectors from 3,072 to 110 dimensions provides a
sizeable (up to 60x) speedup in retrieval operations and a roughly 28.6x
reduction in index size, with only moderate declines in correlation
metrics relative to human-annotated similarity scores. These findings
demonstrate that PCA-based compression offers a viable balance between
retrieval fidelity and resource efficiency, essential for real-time systems
such as Zanista AI's Newswitch platform. Ultimately, our study
underscores the practicality of leveraging classical dimensionality reduction
techniques to scale RAG architectures for knowledge-intensive applications in
finance and trading, where speed, memory efficiency, and accuracy must jointly
be optimized.
Authors' comments: 19 pages
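The PCA compression discussed in the entry above can be sketched in a few lines. The example below is an assumption-laden toy: it uses 3-dimensional vectors instead of real embeddings, and it finds principal components with a plain power iteration rather than a library routine, which is not necessarily what the paper uses. All function names are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract the per-dimension mean; return (centered rows, mean)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows], mean


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_components(cov, k, iters=200, seed=0):
    """Leading k eigenvectors via power iteration with deflation."""
    d = len(cov)
    cov = [row[:] for row in cov]  # work on a copy
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # remaining variance is numerically zero
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
        components.append(v)
        # Deflate: remove the variance explained by this component.
        for i in range(d):
            for j in range(d):
                cov[i][j] -= eigenvalue * v[i] * v[j]
    return components


def project(row, mean, components):
    """Coordinates of one vector in the reduced space."""
    shifted = [x - m for x, m in zip(row, mean)]
    return [sum(a * b for a, b in zip(shifted, c)) for c in components]


# Toy data: 3-d vectors that vary along a single direction, (1, 2, -1).
rows = [[float(t), 2.0 * t, -1.0 * t] for t in range(-5, 6)]
centered, mean = center(rows)
components = top_components(covariance(centered), k=1)
compressed = [project(r, mean, components) for r in rows]
```

Each vector is stored with one number instead of three here. The same projection applied to real embeddings trades some similarity fidelity for a smaller index, which is the balance the entry evaluates.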
Alireza Salemi, Chris Samarinas, Hamed Zamani
This paper studies the limitations of (retrieval-augmented) large language models (LLMs) in generating diverse and comprehensive responses, and introduces the Plan-and-Refine (P&R) framework based on a two-phase system design. In the global exploration phase, P&R generates a diverse set of plans for the given input, where each plan consists of a list of diverse query aspects with corresponding additional descriptions. This phase is followed by a local exploitation phase that generates a response proposal for the input query conditioned on each plan and iteratively refines the proposal to improve its quality. Finally, a reward model is employed to select the proposal with the highest factuality and coverage. We conduct our experiments based on the ICAT evaluation methodology, a recent approach for answer factuality and comprehensiveness evaluation. Experiments on two diverse information-seeking benchmarks adopted from non-factoid question answering and TREC search result diversification tasks demonstrate that P&R significantly outperforms baselines, achieving up to a 13.1% improvement on the ANTIQUE dataset and a 15.41% improvement on the TREC dataset. Furthermore, a smaller-scale user study confirms the substantial efficacy of the P&R framework.
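The two-phase control flow described in the entry above (plan generation, per-plan drafting and iterative refinement, then reward-based selection) can be sketched as follows. The four callables and the parameters `n_plans` and `n_rounds` are hypothetical placeholders; in the described framework they would be backed by language models and a reward model, while here they are toy functions so the loop can run.

```python
def plan_and_refine(query, generate_plans, draft, refine, reward,
                    n_plans=3, n_rounds=2):
    """Global exploration over plans, local refinement per plan, then
    selection of the highest-reward proposal."""
    proposals = []
    for plan in generate_plans(query, n_plans):
        proposal = draft(query, plan)
        for _ in range(n_rounds):
            proposal = refine(query, plan, proposal)
        proposals.append(proposal)
    return max(proposals, key=lambda p: reward(query, p))


# Toy stand-ins (assumptions): a plan is a list of aspects, and a proposal
# is the list of aspects it covers so far.
def toy_plans(query, n):
    return [["history"], ["history", "uses"], ["history", "uses", "risks"]][:n]


def toy_draft(query, plan):
    return [plan[0]]


def toy_refine(query, plan, proposal):
    missing = [aspect for aspect in plan if aspect not in proposal]
    return proposal + missing[:1]  # cover one more planned aspect per round


def toy_reward(query, proposal):
    return len(set(proposal))  # crude proxy for coverage


best = plan_and_refine("caffeine", toy_plans, toy_draft, toy_refine, toy_reward)
```

With these stand-ins, the proposal built from the three-aspect plan covers the most aspects after two refinement rounds and is therefore the one selected.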
Björn Engelmann, Fabian Haak, Philipp Schaer, Mani Erfanian Abdoust, Linus Netze, Meik Bittkowski
Retrieval test collections are essential for evaluating information retrieval systems, yet they often lack generalizability across tasks. To overcome this limitation, we introduce REANIMATOR, a versatile framework designed to enable the repurposing of existing test collections by enriching them with extracted and synthetic resources. REANIMATOR enhances test collections from PDF files by parsing full texts and machine-readable tables, as well as related contextual information. It then employs state-of-the-art large language models to produce synthetic relevance labels. Including an optional human-in-the-loop step can help validate the resources that have been extracted and generated. We demonstrate its potential with a revitalized version of the TREC-COVID test collection, showcasing the development of a retrieval-augmented generation system and evaluating the impact of tables on retrieval-augmented generation. REANIMATOR enables the reuse of test collections for new applications, lowering costs and broadening the utility of legacy resources.
João Alberto de Oliveira Lima
Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for
generating contextually accurate answers by integrating Large Language Models
(LLMs) with retrieval mechanisms. However, in legal contexts, users frequently
reference norms by their labels or nicknames (e.g., Article 5 of the
Constitution or Consumer Defense Code (CDC)), rather than by their content.
Furthermore, legal texts themselves heavily rely on
explicit cross-references (e.g., "pursuant to Article 34") that function as
pointers. Both scenarios pose challenges for traditional RAG approaches that
rely solely on semantic embeddings of text, often failing to retrieve the
necessary referenced content. This paper introduces Poly-Vector Retrieval, a
method assigning multiple distinct embeddings to each legal provision: one
embedding captures the content (the full text), another captures the label (the
identifier or proper name), and optionally additional embeddings capture
alternative denominations. Inspired by Frege's distinction between Sense and
Reference, this poly-vector retrieval approach treats labels, identifiers and
reference markers as rigid designators and content embeddings as carriers of
semantic substance. Experiments on the Brazilian Federal Constitution
demonstrate that Poly-Vector Retrieval significantly improves retrieval
accuracy for label-centric queries and shows potential to resolve internal and
external cross-references, without compromising performance on purely semantic
queries. The study discusses philosophical and practical implications of
explicitly separating reference from content in vector embeddings and proposes
future research directions for applying this approach to broader legal datasets
and other domains characterized by explicit reference identifiers.
Authors' comments: 39 pages, 5 figures
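The core idea of the entry above, several embeddings per provision with one for content and others for the label and alternative names, can be sketched as below. This is an illustration under assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the provision texts are invented placeholders rather than quotations from any statute, and scoring a provision by its best-matching vector is one plausible choice that the abstract does not confirm.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_index(provisions):
    """One entry per (provision, vector): content, label, and each alias."""
    entries = []
    for pid, p in provisions.items():
        entries.append((pid, "content", embed(p["content"])))
        entries.append((pid, "label", embed(p["label"])))
        for alias in p.get("aliases", []):
            entries.append((pid, "alias", embed(alias)))
    return entries


def search(entries, query):
    """Score each provision by its best-matching vector (assumed rule)."""
    q = embed(query)
    best = {}
    for pid, _kind, vec in entries:
        best[pid] = max(best.get(pid, 0.0), cosine(q, vec))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder provisions: labels follow the entry's examples, content is invented.
provisions = {
    "art5": {"label": "Article 5", "aliases": [],
             "content": "All persons are equal before the law"},
    "cdc": {"label": "Consumer Defense Code", "aliases": ["CDC"],
            "content": "Rules protecting buyers of goods and services"},
}
entries = build_index(provisions)

by_label = search(entries, "Article 5")[0][0]
by_alias = search(entries, "What does the CDC say")[0][0]
by_content = search(entries, "equal before the law")[0][0]
```

The alias query shares no words with the content of the matching provision, so a content-only index would miss it; the separate alias vector is what retrieves it here.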
Manuel Camúñez, Enrique García-Sánchez, David de Hevia
This note provides a characterization of the subspaces of a complex Banach lattice which do stable phase retrieval, in the spirit of the characterization of real stable phase retrieval established by D. Freeman, T. Oikhberg, B. Pineau and M. A. Taylor.