Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, Shirui Pan
Retrieval-augmented generation (RAG) has proven effective in integrating
knowledge into large language models (LLMs). However, conventional RAGs
struggle to capture complex relationships between pieces of knowledge, limiting
their performance in intricate reasoning that requires integrating knowledge
from multiple sources. Recent graph-enhanced retrieval augmented generation
(GraphRAG) builds graph structures to explicitly model these relationships,
enabling more effective and efficient retrievers. Nevertheless, its performance
is still hindered by the noise and incompleteness within the graph structure.
To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for
retrieval augmented generation. GFM-RAG is powered by an innovative graph
neural network that reasons over graph structure to capture complex
query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage
training process on large-scale datasets, comprising 60 knowledge graphs with
over 14M triples and 700k documents. This results in impressive performance and
generalizability for GFM-RAG, making it the first graph foundation model
applicable to unseen datasets for retrieval without any fine-tuning required.
Extensive experiments on three multi-hop QA datasets and seven domain-specific
RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance
while maintaining efficiency and alignment with neural scaling laws,
highlighting its potential for further improvement.
Authors' comments: 19 pages, 6 figures
Orion Weller, Benjamin Chang, Eugene Yang, Mahsa Yarmohammadi, Sam Barham, Sean MacAvaney, Arman Cohan, Luca Soldaini et al.
Retrieval systems generally focus on web-style queries that are short and
underspecified. However, advances in language models have facilitated the
nascent rise of retrieval models that can understand more complex queries with
diverse intents. Yet these efforts have focused exclusively on English;
therefore, we do not yet understand how they work across languages. We
introduce mFollowIR, a multilingual benchmark for measuring
instruction-following ability in retrieval models. mFollowIR builds upon the
TREC NeuCLIR narratives (or instructions) that span three diverse languages
(Russian, Chinese, Persian), giving both query and instruction to the retrieval
models. We make small changes to the narratives and isolate how well retrieval
models can follow these nuanced changes. We present results for both
multilingual (XX-XX) and cross-lingual (En-XX) performance. We see strong
cross-lingual performance with English-based retrievers that were trained using
instructions, but find a notable drop in performance in the multilingual
setting, indicating that more work is needed in developing data for
instruction-based multilingual retrievers.
Authors' comments: Accepted to ECIR 2025
Ali Naseh, Yuefeng Peng, Anshuman Suri, Harsh Chaudhari, Alina Oprea, Amir Houmansadr
Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to generate grounded responses by leveraging external knowledge databases without altering model parameters. Although the absence of weight tuning prevents leakage via model parameters, it introduces the risk of inference adversaries exploiting retrieved documents in the model's context. Existing methods for membership inference and data extraction often rely on jailbreaking or carefully crafted unnatural queries, which can be easily detected or thwarted with query rewriting techniques common in RAG systems. In this work, we present Interrogation Attack (IA), a membership inference technique targeting documents in the RAG datastore. By crafting natural-text queries that are answerable only with the target document's presence, our approach demonstrates successful inference with just 30 queries while remaining stealthy; straightforward detectors identify adversarial prompts from existing methods up to ~76x more frequently than those generated by our attack. We observe a 2x improvement in TPR@1%FPR over prior inference attacks across diverse RAG configurations, all while costing less than $0.02 per document inference.
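The TPR@1%FPR metric quoted above can be illustrated with a minimal sketch. It assumes each candidate document has already been assigned a membership score (higher meaning "more likely in the datastore"); the function name `tpr_at_fpr` and the score lists are hypothetical and not taken from the paper.

```python
import math

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """True-positive rate at a threshold whose false-positive rate
    does not exceed target_fpr (illustrative helper, not the paper's code)."""
    # Sort non-member scores from highest to lowest.
    negatives = sorted(nonmember_scores, reverse=True)
    # Number of non-members we may misclassify; the epsilon guards
    # against floating-point round-off in the product.
    allowed = math.floor(target_fpr * len(negatives) + 1e-9)
    # A sample is flagged as a member when its score is strictly above
    # the threshold, so at most `allowed` non-members are flagged.
    threshold = negatives[allowed] if allowed < len(negatives) else float("-inf")
    hits = sum(1 for s in member_scores if s > threshold)
    return hits / len(member_scores)
```

Reporting the true-positive rate at a fixed, low false-positive rate focuses the comparison on confident inferences rather than on average-case accuracy.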
Niklas Freymuth, Dong Liu, Thomas Ricatte, Saab Mansour
Dense retrieval methods typically target unstructured text data represented as flat strings. However, e-commerce catalogs often include structured information across multiple fields, such as brand, title, and description, which carry information that is potentially important for retrieval systems. We present the Cascading Hierarchical Attention Retrieval Model (CHARM), a novel framework designed to encode structured product data into hierarchical field-level representations with progressively finer detail. Utilizing a novel block-triangular attention mechanism, our method captures the interdependencies between product fields in a specified hierarchy, yielding field-level representations and aggregated vectors suitable for fast and efficient retrieval. Combining both representations enables a two-stage retrieval pipeline, in which the aggregated vectors support initial candidate selection, while more expressive field-level representations facilitate precise fine-tuning for downstream ranking. Experiments on publicly available large-scale e-commerce datasets demonstrate that CHARM matches or outperforms state-of-the-art baselines. Our analysis highlights the framework's ability to align different queries with appropriate product fields, enhancing retrieval accuracy and explainability.
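One plausible reading of a block-triangular attention pattern is sketched below. It assumes that fields are ordered by their place in the hierarchy and that a token may attend to tokens in its own field and in earlier fields; both the ordering rule and the helper name `block_triangular_mask` are assumptions for illustration, not the paper's implementation.

```python
def block_triangular_mask(field_lengths):
    """Build a boolean attention mask over tokens grouped into ordered fields.

    field_lengths: number of tokens in each field, listed in hierarchy order
    (hypothetical input format).
    """
    # Assign each token position the index of the field it belongs to.
    field_of = [f for f, n in enumerate(field_lengths) for _ in range(n)]
    # mask[q][k] is True when query token q may attend to key token k,
    # i.e. when k's field is the same as or earlier than q's field.
    return [[fk <= fq for fk in field_of] for fq in field_of]
```

For example, with one brand token followed by two title tokens, the brand token sees only itself, while each title token sees all three positions.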
Jan Luca Scheerer, Matei Zaharia, Christopher Potts, Gustavo Alonso, Omar Khattab
Multi-vector retrieval methods such as ColBERT and its recent variant, the
ConteXtualized Token Retriever (XTR), offer high accuracy but face efficiency
challenges at scale. To address this, we present WARP, a retrieval engine that
substantially improves the efficiency of retrievers trained with the XTR
objective through three key innovations: (1) WARP$_\text{SELECT}$ for dynamic
similarity imputation; (2) implicit decompression, avoiding costly vector
reconstruction during retrieval; and (3) a two-stage reduction process for
efficient score aggregation. Combined with highly-optimized C++ kernels, our
system reduces end-to-end latency compared to XTR's reference implementation by
41x, and achieves a 3x speedup over the ColBERTv2/PLAID engine, while
preserving retrieval quality.
Authors' comments: Accepted at SIGIR 2025
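To make the idea of similarity imputation concrete, here is a minimal sketch of late-interaction scoring in which only some document tokens were retrieved for each query token. The fallback used for missing entries (the smallest similarity retrieved for that query token) is an assumption chosen for illustration; it is not WARP$_\text{SELECT}$, and the function name and input layout are hypothetical.

```python
def score_with_imputation(retrieved):
    """Score documents from partially retrieved token similarities.

    retrieved: one dict per query token, mapping doc_id to a non-empty list
    of similarities for that document's retrieved tokens.
    """
    doc_ids = {d for hits in retrieved for d in hits}
    scores = {d: 0.0 for d in doc_ids}
    for hits in retrieved:
        # Fallback for documents with no retrieved token for this query
        # token: the smallest similarity retrieved for this query token.
        fallback = min(min(sims) for sims in hits.values()) if hits else 0.0
        for d in doc_ids:
            # Use the best retrieved similarity, or the imputed value.
            scores[d] += max(hits[d]) if d in hits else fallback
    return scores
```

Each document's score is the sum over query tokens of its best matching token similarity, with imputed values standing in where nothing was retrieved.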
Shreya Meel, Xiangliang Kong, Thomas Jacob Maranzatto, Itzhak Tamo, Sennur Ulukus
We consider the private information retrieval (PIR) problem for a multigraph-based replication system, where each set of $r$ files is stored on two of the servers according to an underlying $r$-multigraph. Our goal is to establish upper and lower bounds on the PIR capacity of the $r$-multigraph. Specifically, we first propose a construction for multigraph-based PIR systems that leverages the symmetry of the underlying graph-based PIR scheme, deriving a capacity lower bound for such multigraphs. Then, we establish a general upper bound using linear programming, expressed as a function of the underlying graph parameters. Our bounds are demonstrated to be tight for PIR systems on multipaths with an even number of vertices.
Shengyao Zhuang, Ekaterina Khramtsova, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon
Recent advancements in dense retrieval have introduced vision-language model (VLM)-based retrievers, such as DSE and ColPali, which leverage document screenshots embedded as vectors to enable effective search and offer a simplified pipeline over traditional text-only methods. In this study, we propose three pixel poisoning attack methods designed to compromise VLM-based retrievers and evaluate their effectiveness under various attack settings and parameter configurations. Our empirical results demonstrate that injecting even a single adversarial screenshot into the retrieval corpus can significantly disrupt search results, poisoning the top-10 retrieved documents for 41.9% of queries in the case of DSE and 26.4% for ColPali. These vulnerability rates notably exceed those observed with equivalent attacks on text-only retrievers. Moreover, when targeting a small set of known queries, the attack success rate rises, achieving complete success in certain cases. By exposing the vulnerabilities inherent in vision-language models, this work highlights the potential risks associated with their deployment.
Manish Acharya, Yifan Zhang, Yu Huang, Kevin Leach
Optimizing software performance through automated code refinement offers a promising avenue for enhancing execution speed and efficiency. Despite recent advancements in LLMs, a significant gap remains in their ability to perform in-depth program analysis. This study introduces AUTOPATCH, an in-context learning approach designed to bridge this gap by enabling LLMs to automatically generate optimized code. Inspired by how programmers learn and apply knowledge to optimize software, AUTOPATCH incorporates three key components: (1) an analogy-driven framework to align LLM optimization with human cognitive processes, (2) a unified approach that integrates historical code examples and CFG analysis for context-aware learning, and (3) an automated pipeline for generating optimized code through in-context prompting. Experimental results demonstrate that AUTOPATCH achieves a 7.3% improvement in execution efficiency over GPT-4o across common generated executable code, highlighting its potential to advance automated program runtime optimization.
Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, Stéphane Clinchant
Retrieval-augmented generation improves various aspects of large language
models (LLMs) generation, but suffers from computational overhead caused by
long contexts as well as the propagation of irrelevant retrieved information
into generated responses. Context pruning deals with both aspects, by removing
irrelevant parts of retrieved contexts before LLM generation. Existing context
pruning approaches are however limited, and do not provide a universal model
that would be both efficient and robust in a wide range of scenarios, e.g.,
when contexts contain a variable amount of relevant information or vary in
length, or when evaluated on various domains. In this work, we close this gap
and introduce Provence (Pruning and Reranking Of retrieVEd relevaNt ContExts),
an efficient and robust context pruner for Question Answering, which
dynamically detects the needed amount of pruning for a given context and can be
used out-of-the-box for various domains. The three key ingredients of Provence
are formulating the context pruning task as sequence labeling, unifying context
pruning capabilities with context reranking, and training on diverse data. Our
experimental results show that Provence enables context pruning with negligible
to no drop in performance, in various domains and settings, at almost no cost
in a standard RAG pipeline. We also conduct a deeper analysis alongside various
ablations to provide insights into training context pruners for future work.
Authors' comments: Accepted to ICLR 2025
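Framing context pruning as sequence labeling can be sketched as follows, assuming a labeler has already produced a keep-probability for every token. The sentence-level averaging rule, the threshold, and the name `prune_context` are assumptions for illustration rather than the Provence implementation.

```python
def prune_context(sentences, token_probs, threshold=0.5):
    """Keep sentences whose tokens are, on average, labeled as relevant.

    sentences: list of sentence strings from one retrieved context.
    token_probs: for each sentence, a list of per-token keep-probabilities
    (hypothetical output of a sequence labeler).
    """
    kept = []
    for sentence, probs in zip(sentences, token_probs):
        # Aggregate token-level labels into one sentence-level decision.
        if probs and sum(probs) / len(probs) > threshold:
            kept.append(sentence)
    return kept
```

Because the decision is made per sentence rather than by a fixed budget, the amount removed varies with how much of the context the labeler marks as relevant.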
Manish Singh, Manish Shrivastava
Long-context multiple-choice question answering tasks require robust reasoning over extensive text sources. Since most pre-trained transformer models are restricted to processing only a few hundred words at a time, successful completion of such tasks often relies on the identification of evidence spans, such as sentences, that provide supporting evidence for selecting the correct answer. Prior research in this domain has predominantly utilized pre-trained dense retrieval models, given the absence of supervision to fine-tune the retrieval process. This paper proposes a novel method called Options Aware Dense Retrieval (OADR) to address these challenges. OADR uses an innovative approach to fine-tuning retrieval by leveraging query-options embeddings, which aim to mimic the embeddings of the oracle query (i.e., the query paired with the correct answer) for enhanced identification of supporting evidence. Through experiments conducted on the QuALITY benchmark dataset, we demonstrate that our proposed model surpasses existing baselines in terms of performance and accuracy.
Maxime Louis, Hervé Déjean, Stéphane Clinchant
Retrieval-Augmented Generation (RAG) pipelines enhance Large Language Models (LLMs) by retrieving relevant documents, but they face scalability issues due to high inference costs and limited context size. Document compression is a practical solution, but current soft compression methods suffer from accuracy losses and require extensive pretraining. In this paper, we introduce PISCO, a novel method that achieves a 16x compression rate with minimal accuracy loss (0-3%) across diverse RAG-based question-answering (QA) tasks. Unlike existing approaches, PISCO requires no pretraining or annotated data, relying solely on sequence-level knowledge distillation from document-based questions. With the ability to fine-tune a 7-10B LLM in 48 hours on a single A100 GPU, PISCO offers a highly efficient and scalable solution. We present comprehensive experiments showing that PISCO outperforms existing compression models by 8% in accuracy.
Xiaohan Yu, Zhihan Yang, Chong Chen
Multimodal Retrieval Augmented Generation (MRAG) systems, while promising for enhancing Multimodal Large Language Models (MLLMs), often rely on rigid, single-step retrieval methods. This limitation hinders their ability to effectively address real-world scenarios that demand adaptive information acquisition and query refinement. To overcome this, we introduce the novel task of Multimodal Retrieval Augmented Generation Planning (MRAG Planning), focusing on optimizing MLLM performance while minimizing computational overhead. We present CogPlanner, a versatile framework inspired by human cognitive processes. CogPlanner iteratively refines queries and selects retrieval strategies, enabling both parallel and sequential modeling approaches. To rigorously evaluate MRAG Planning, we introduce CogBench, a new benchmark specifically designed for this task. CogBench facilitates the integration of lightweight CogPlanner with resource-efficient MLLMs. Our experimental findings demonstrate that CogPlanner surpasses existing MRAG baselines, achieving significant improvements in both accuracy and efficiency with minimal computational overhead.
Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang et al.
Retrieval-augmented generation (RAG) is extensively utilized to incorporate external, current knowledge into large language models, thereby minimizing hallucinations. A standard RAG pipeline may comprise several components, such as query rewriting, document retrieval, document filtering, and answer generation. However, these components are typically optimized separately through supervised fine-tuning, which can lead to misalignments between the objectives of individual modules and the overarching aim of generating accurate answers in question-answering (QA) tasks. Although recent efforts have explored reinforcement learning (RL) to optimize specific RAG components, these approaches often focus on overly simplistic pipelines with only two components or do not adequately address the complex interdependencies and collaborative interactions among the modules. To overcome these challenges, we propose treating the RAG pipeline as a multi-agent cooperative task, with each component regarded as an RL agent. Specifically, we present MMOA-RAG, a Multi-Module joint Optimization Algorithm for RAG, which employs multi-agent reinforcement learning to harmonize all agents' goals towards a unified reward, such as the F1 score of the final answer. Experiments conducted on various QA datasets demonstrate that MMOA-RAG improves the overall pipeline performance and outperforms existing baselines. Furthermore, comprehensive ablation studies validate the contributions of individual components and the adaptability of MMOA-RAG across different RAG components and datasets. The code of MMOA-RAG is available at https://github.com/chenyiqun/MMOA-RAG.
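A shared reward such as the F1 score of the final answer is commonly computed at the token level. The sketch below shows that common formulation; the whitespace tokenization, lowercasing, and the name `answer_f1` are assumptions for illustration and may differ from the reward used in the MMOA-RAG code.

```python
from collections import Counter

def answer_f1(prediction, reference):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred or not ref:
        return float(pred == ref)
    # Count tokens shared by both answers, respecting multiplicity.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Giving every module the same scalar of this kind is what lets the query rewriter, filter, and generator be optimized toward one objective.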
Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park
In this work, we introduce a novel approach called Scaling to Emphasize
Attention for Long-context retrieval (SEAL), which enhances the retrieval
performance of large language models (LLMs) over extended contexts. Previous
studies have shown that each attention head in LLMs has a unique functionality
and collectively contributes to the overall behavior of the model. Similarly,
we observe that specific heads are closely tied to long-context retrieval,
showing positive or negative correlation with retrieval scores. Built on this
insight, we propose a learning-based mechanism using zero-shot generated data
to emphasize these heads, improving the model's performance in long-context
retrieval tasks. By applying SEAL, we can achieve significant improvements in
in-domain retrieval performance, including document QA tasks from LongBench,
and considerable improvements in out-of-domain cases. Additionally, when
combined with existing training-free context extension techniques, SEAL extends
the context limits of LLMs while maintaining highly reliable outputs, opening
new avenues for research in this field.
Authors' comments: 15 pages
Parshin Shojaee, Sai Sree Harsha, Dan Luo, Akash Maharaj, Tong Yu, Yunyao Li
Recent advancements in Large Language Models and Retrieval-Augmented Generation have boosted interest in domain-specific question-answering for enterprise products. However, AI Assistants often face challenges in multi-product QA settings, requiring accurate responses across diverse domains. Existing multi-domain RAG-QA approaches either query all domains indiscriminately, increasing computational costs and LLM hallucinations, or rely on rigid resource selection, which can limit search results. We introduce MKP-QA, a novel multi-product knowledge-augmented QA framework with probabilistic federated search across domains and relevant knowledge. This method enhances multi-domain search quality by aggregating query-domain and query-passage probabilistic relevance. To address the lack of suitable benchmarks for multi-product QAs, we also present new datasets focused on three Adobe products: Adobe Experience Platform, Target, and Customer Journey Analytics. Our experiments show that MKP-QA significantly boosts multi-product RAG-QA performance in terms of both retrieval accuracy and response quality.
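The aggregation of query-domain and query-passage relevance can be sketched as a product of two probabilities, ranked across all domains. The product rule, the input dictionaries, and the name `federated_rank` are assumptions for illustration; the abstract does not specify the exact aggregation used by MKP-QA.

```python
def federated_rank(domain_probs, passage_probs, top_k=3):
    """Rank passages across domains by combined relevance.

    domain_probs: {domain: P(domain | query)} (hypothetical router output).
    passage_probs: {domain: {passage_id: P(passage | query, domain)}}.
    """
    scored = []
    for domain, p_domain in domain_probs.items():
        for passage, p_passage in passage_probs.get(domain, {}).items():
            # Combine the two relevance signals multiplicatively.
            scored.append((p_domain * p_passage, domain, passage))
    # Highest combined relevance first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

A strong passage from a less likely domain can still outrank a weak passage from the most likely one, which avoids both querying every domain equally and committing to a single domain.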
Weicai Yan, Ye Wang, Wang Lin, Zirun Guo, Zhou Zhao, Tao Jin
Research on continual learning in multi-modal tasks has been receiving increasing attention. However, most existing work overlooks the explicit cross-modal and cross-task interactions. In this paper, we propose Low-rank Prompt Interaction (LPI) to address this general problem of multi-modal understanding, which considers both cross-modal and cross-task interactions. Specifically, as for the former, we employ multi-modal correlation modules for corresponding Transformer layers. Considering that the training parameters scale with the number of layers and tasks, we propose low-rank interaction-augmented decomposition to avoid memory explosion while enhancing the cross-modal association through sharing and separating common-specific low-rank factors. In addition, due to the multi-modal semantic differences carried by the low-rank initialization, we adopt hierarchical low-rank contrastive learning to ensure training robustness. As for the latter, we initially employ a visual analysis and identify that different tasks have clear distinctions in proximity. Therefore, we introduce explicit task contrastive constraints in the prompt learning process based on task semantic distances. Experiments on two retrieval tasks show performance improvements with the introduction of a minimal number of parameters, demonstrating the effectiveness of our method. Code is available at https://github.com/Kelvin-ywc/LPI.
Libo Wang
To address the inability of current large language models to share memory
across dialogues, this research proposes a wormhole memory module (WMM) that
realizes memory as a Rubik's cube that can be arbitrarily retrieved between
different dialogues. Through simulation experiments, the researcher built an
experimental framework in a Python environment and set memory barriers to
simulate the current situation where memories are difficult to share between
LLM dialogues. The CoQA development data set was imported into the experiment,
the feasibility of cross-dialogue memory retrieval was verified for WMM's
nonlinear indexing and dynamic retrieval, and a comparative analysis was
conducted against the Titans and MemGPT memory modules. Experimental results
show that WMM demonstrated the ability to retrieve memory across dialogues and
stable quantitative indicators in eight experiments. It contributes new
technical approaches to optimizing memory management in LLMs and provides
experience for practical application in the future.
Authors' comments: The experimental process and code have been uploaded to a GitHub
repository:
https://github.com/brucewang123456789/GeniusTrail/tree/main/Wormhole%20Memory%20Module
Murugan Sankaradas, Ravi K. Rajendran, Srimat T. Chakradhar
Extracting real-time insights from multi-modal data streams from various
domains such as healthcare, intelligent transportation, and satellite remote
sensing remains a challenge. High computational demands and limited knowledge
scope restrict the applicability of Multi-Modal Large Language Models (MM-LLMs)
on these data streams. Traditional Retrieval-Augmented Generation (RAG) systems
address knowledge limitations of these models, but suffer from slow
preprocessing, making them unsuitable for real-time analysis. We propose
StreamingRAG, a novel RAG framework designed for streaming data. StreamingRAG
constructs evolving knowledge graphs capturing scene-object-entity
relationships in real-time. The knowledge graph achieves temporal-aware scene
representations using MM-LLMs and enables timely responses for specific events
or user queries. StreamingRAG addresses limitations in existing methods,
achieving significant improvements in real-time analysis (5-6x faster
throughput), contextual accuracy (through a temporal knowledge graph), and
resource consumption (reduced by 2-3x using lightweight models).
Authors' comments: Accepted and Presented at AI4Sys, HPDC 2024
T. Y. S. S. Santosh, Chen Jia, Patrick Goroncy, Matthias Grabmair
This paper addresses the task of legal summarization, which involves
distilling complex legal documents into concise, coherent summaries. Current
approaches often struggle with content theme deviation and inconsistent writing
styles due to their reliance solely on source documents. We propose RELexED, a
retrieval-augmented framework that utilizes exemplar summaries along with the
source document to guide the model. RELexED employs a two-stage exemplar
selection strategy, leveraging a determinantal point process to balance the
trade-off between similarity of exemplars to the query and diversity among
exemplars, with scores computed via influence functions. Experimental results
on two legal summarization datasets demonstrate that RELexED significantly
outperforms models that do not utilize exemplars and those that rely solely on
similarity-based exemplar selection.
Authors' comments: Accepted to NAACL 2025
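The trade-off between exemplar relevance and diversity under a determinantal point process can be illustrated with a greedy selection sketch. It assumes the common kernel form L[i][j] = quality[i] * similarity[i][j] * quality[j]; the quality scores here are placeholders (the paper derives its scores via influence functions), and `greedy_dpp` and `det` are hypothetical helper names.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            # A row swap flips the sign of the determinant.
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result

def greedy_dpp(quality, similarity, k):
    """Greedily pick up to k items that maximise the kernel determinant."""
    n = len(quality)
    selected = []
    for _ in range(min(k, n)):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            # Kernel submatrix restricted to the candidate set.
            sub = [[quality[a] * similarity[a][b] * quality[b] for b in idx]
                   for a in idx]
            d = det(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:
            break  # No remaining item adds positive volume.
        selected.append(best)
    return selected
```

With two near-duplicate high-quality exemplars and one distinct exemplar of slightly lower quality, the determinant favours pairing the best exemplar with the distinct one rather than with its near-duplicate.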
Yoshiki Masuyama, Gordon Wichern, François G. Germain, Christopher Ick, Jonathan Le Roux
Head-related transfer functions (HRTFs) with dense spatial grids are desired
for immersive binaural audio generation, but their recording is time-consuming.
Although HRTF spatial upsampling has shown remarkable progress with neural
fields, spatial upsampling only from a few measured directions, e.g., 3 or 5
measurements, is still challenging. To tackle this problem, we propose a
retrieval-augmented neural field (RANF). RANF retrieves a subject whose HRTFs
are close to those of the target subject from a dataset. The HRTF of the
retrieved subject at the desired direction is fed into the neural field in
addition to the sound source direction itself. Furthermore, we present a neural
network that can efficiently handle multiple retrieved subjects, inspired by a
multi-channel processing technique called transform-average-concatenate. Our
experiments confirm the benefits of RANF on the SONICOM dataset, and it is a
key component in the winning solution of Task 2 of the listener acoustic
personalization challenge 2024.
Authors' comments: Accepted to ICASSP 2025