Aleksander Smywiński-Pohl, Tomer Libal, Adam Kaczmarczyk, Magdalena Król
One of the elements of legal research is looking for cases where judges have extended the meaning of a legal concept by providing interpretations of what a concept means or does not mean. This allows legal professionals to use such interpretations as precedents and laymen to better understand the legal concept. The state-of-the-art approach for retrieving the most relevant interpretations for these concepts currently depends on the ranking of sentences and the training of language models over annotated examples. That manual annotation process can be quite expensive and needs to be repeated for each such concept, which has prompted recent research into automating it. In this paper, we highlight the results of various experiments conducted to determine the volume, scope and even the need for manual annotation. First, we determine the optimal number of annotations per legal concept. Second, we check whether the sentences for annotation can be drawn randomly or whether model performance improves when only the best candidates are annotated. Finally, we examine the outcome of automating the annotation process with the help of an LLM.
Sarthak Chaturvedi, Anurag Acharya, Rounak Meyur, Koby Hayashi, Sai Munikoti, Sameera Horawalavithana
Evaluation benchmark characteristics may distort the true benefits of domain adaptation in retrieval models. This creates misleading assessments that influence deployment decisions in specialized domains. We show that two benchmarks with drastically different features such as topic diversity, boundary overlap, and semantic complexity can influence the perceived benefits of fine-tuning. Using environmental regulatory document retrieval as a case study, we fine-tune the ColBERTv2 model on Environmental Impact Statements (EIS) from federal agencies. We evaluate these models across two benchmarks with different semantic structures. Our findings reveal that identical domain adaptation approaches show very different perceived benefits depending on evaluation methodology. On one benchmark, with clearly separated topic boundaries, domain adaptation shows small improvements (maximum 0.61% NDCG gain). However, on the other benchmark, with overlapping semantic structures, the same models demonstrate large improvements (up to 2.22% NDCG gain), a 3.6-fold difference in the performance benefit. We compare these benchmarks through topic diversity metrics, finding that the higher-performing benchmark shows 11% higher average cosine distances between contexts and 23% lower silhouette scores, directly contributing to the observed performance difference. These results demonstrate that benchmark selection strongly determines assessments of retrieval system effectiveness in specialized domains. Evaluation frameworks with well-separated topics regularly underestimate domain adaptation benefits, while those with overlapping semantic boundaries reveal improvements that better reflect real-world regulatory document complexity. Our findings have important implications for developing and deploying AI systems for interdisciplinary domains that integrate multiple topics.
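For reference, the NDCG metric behind the gains quoted above can be computed as in the following minimal sketch. It assumes the common linear-gain formulation with a log2 discount; the abstract does not say which variant or cutoff the authors use, and the names `dcg_at_k`, `ndcg_at_k` and the relevance labels are illustrative only. The last line recomputes the ratio between the two reported gains.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance labels for one query, in the order the retriever ranked them.
ranked_labels = [3, 2, 0, 1]
score = ndcg_at_k(ranked_labels, k=4)

# Ratio between the two NDCG gains reported in the abstract (2.22% vs. 0.61%).
fold_difference = 2.22 / 0.61
```

Averaging `ndcg_at_k` over all queries of a benchmark, before and after fine-tuning, gives the kind of gain the abstract compares.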
Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Houbing Herbert Song
Retrieval-Augmented Generation (RAG) systems address factual inconsistencies in Large Language Models by grounding generation in external knowledge, yet they face a fundamental efficiency problem: simple queries consume computational resources equivalent to complex multi-hop reasoning tasks. We present SymRAG, a neuro-symbolic framework that introduces adaptive query routing based on real-time complexity and system load assessments. SymRAG dynamically selects symbolic, neural, or hybrid processing paths to align resource use with query demands. Evaluated on 2,000 queries from HotpotQA and DROP using Llama-3.2-3B and Mistral-7B models, SymRAG achieves 97.6--100.0% exact match accuracy with significantly lower CPU utilization (3.6--6.2%) and processing time (0.985--3.165s). Disabling adaptive logic results in a 169--1151% increase in processing time, highlighting the framework's impact. These results underscore the potential of adaptive neuro-symbolic routing for scalable, sustainable AI systems.
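The routing idea described above can be illustrated with a small sketch. This is not SymRAG's implementation: the complexity heuristic, the thresholds in `RoutingConfig`, and the rule that a heavily loaded system falls back from the neural to the hybrid path are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class RoutingConfig:
    # Hypothetical thresholds; the abstract does not report the values SymRAG uses.
    low_complexity: float = 0.3
    high_complexity: float = 0.7
    max_load: float = 0.8

# Hypothetical cue words hinting at multi-hop or comparative questions.
CUE_WORDS = {"and", "both", "before", "after", "compare", "either"}

def estimate_complexity(query: str) -> float:
    """Toy score in [0, 1]: longer queries with more cue words count as more complex."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    length_score = min(len(words) / 30, 1.0)
    cue_score = min(sum(1 for w in words if w in CUE_WORDS) / 3, 1.0)
    return 0.5 * length_score + 0.5 * cue_score

def route(query: str, system_load: float, cfg: RoutingConfig = RoutingConfig()) -> str:
    """Pick a processing path from query complexity and current system load."""
    complexity = estimate_complexity(query)
    if complexity < cfg.low_complexity:
        return "symbolic"
    if complexity > cfg.high_complexity and system_load <= cfg.max_load:
        return "neural"
    return "hybrid"

simple_path = route("Who wrote Hamlet?", system_load=0.1)
```

The point of the sketch is only that the path is chosen per query, so a cheap question never pays for the most expensive pipeline.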
Jiale Zhang, Jiaxiang Chen, Zhucong Li, Jie Ding, Kui Zhao, Zenglin Xu, Xin Pang, Yinghui Xu
Retrieval-Augmented Generation (RAG) enhances language models by incorporating external knowledge at inference time. However, graph-based RAG systems often suffer from structural overhead and imprecise retrieval: they require costly pipelines for entity linking and relation extraction, yet frequently return subgraphs filled with loosely related or tangential content. This stems from a fundamental flaw -- semantic similarity does not imply semantic relevance. We introduce SlimRAG, a lightweight framework for retrieval without graphs. SlimRAG replaces structure-heavy components with a simple yet effective entity-aware mechanism. At indexing time, it constructs a compact entity-to-chunk table based on semantic embeddings. At query time, it identifies salient entities, retrieves and scores associated chunks, and assembles a concise, contextually relevant input -- without graph traversal or edge construction. To quantify retrieval efficiency, we propose Relative Index Token Utilization (RITU), a metric measuring the compactness of retrieved content. Experiments across multiple QA benchmarks show that SlimRAG outperforms strong flat and graph-based baselines in accuracy while reducing index size and RITU (e.g., 16.31 vs. 56+), highlighting the value of structure-free, entity-centric context selection. The code will be released soon. https://github.com/continue-ai-company/SlimRAG
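A minimal sketch of an entity-to-chunk table and the query-time lookup described above follows. It is a simplification under stated assumptions: SlimRAG builds its table from semantic embeddings, whereas this example swaps in plain substring matching, and the function names, the entity list and the count-based scoring are hypothetical.

```python
from collections import defaultdict

def build_entity_table(chunks, entities):
    """Indexing time: map each known entity to the ids of the chunks that mention it."""
    table = defaultdict(set)
    for chunk_id, text in enumerate(chunks):
        lowered = text.lower()
        for entity in entities:
            if entity.lower() in lowered:
                table[entity.lower()].add(chunk_id)
    return table

def retrieve(query, table, chunks, top_k=2):
    """Query time: find entities in the query, then score chunks by how many they share."""
    lowered = query.lower()
    salient = [entity for entity in table if entity in lowered]
    scores = defaultdict(int)
    for entity in salient:
        for chunk_id in table[entity]:
            scores[chunk_id] += 1
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [chunks[cid] for cid in ranked[:top_k]]

chunks = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "The Nobel Prize is awarded annually in Stockholm.",
    "Paris is the capital of France.",
]
entities = ["Marie Curie", "Nobel Prize", "Paris"]  # assumed to come from an upstream extractor
table = build_entity_table(chunks, entities)
context = retrieve("When did Marie Curie win the Nobel Prize?", table, chunks)
```

No graph is traversed at any point: the only index is the flat table, which is what keeps the retrieved context small.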
Zhuocheng Zhang, Yang Feng, Min Zhang
Retrieval-Augmented Generation (RAG) plays a pivotal role in modern large
language model applications, with numerous existing frameworks offering a wide
range of functionalities to facilitate the development of RAG systems. However,
we have identified several persistent challenges in these frameworks, including
difficulties in algorithm reproduction and sharing, lack of new techniques, and
high system overhead. To address these limitations, we introduce
FlexRAG, an open-source framework specifically designed for research
and prototyping. FlexRAG supports text-based, multimodal, and network-based
RAG, providing comprehensive lifecycle support alongside efficient asynchronous
processing and persistent caching capabilities. By offering a robust and
flexible solution, FlexRAG enables researchers to rapidly develop, deploy, and
share advanced RAG systems. Our toolkit and resources are available at
https://github.com/ictnlp/FlexRAG.
Authors' comments: Accepted by ACL 2025 Demo
Evan Becker, Benjamin Bowman, Matthew Trager, Tian Yu Liu, Luca Zancato, Wei Xia, Stefano Soatto
Given a query and dataset, the optimal way of answering the query is to make use of all the information available. Modern LLMs exhibit an impressive ability to memorize training data, but data not deemed important during training is forgotten, and information outside that training set cannot be used. Processing an entire dataset at inference time is infeasible due to the bounded nature of model resources (e.g. context size in transformers or states in state space models), meaning we must resort to external memory. This constraint naturally leads to the following problem: How can we decide, based on the present query and model, what among a virtually unbounded set of known data matters for inference? To minimize model uncertainty for a particular query at test-time, we introduce Retrieval In-Context Optimization (RICO), a retrieval method that uses gradients from the LLM itself to learn the optimal mixture of documents for answer generation. Unlike traditional retrieval-augmented generation (RAG), which relies on external heuristics for document retrieval, our approach leverages direct feedback from the model. Theoretically, we show that standard top-$k$ retrieval with model gradients can approximate our optimization procedure, and provide connections to the leave-one-out loss. We demonstrate empirically that by minimizing an unsupervised loss objective in the form of question perplexity, we can achieve comparable retriever metric performance to BM25 with \emph{no finetuning}. Furthermore, when evaluated on quality of the final prediction, our method often outperforms fine-tuned dense retrievers such as E5.
Prajwal Niraula, Julien de Wit, Robert Hargreaves, Iouli E. Gordon, Clara Sousa-Silva
Cassini's observations of Titan's atmosphere are exemplary benchmarks for
exoplanet atmospheric studies owing to (1) their precision and (2) our
independent knowledge of Titan. Leveraging these observations, we perform
retrievals (i.e., analyses) of Titan's transmission spectrum to investigate the
strengths/limitations of exoplanet atmospheric retrievals with a particular
focus on the underlying assumptions regarding the molecular species included in
the retrieval. We find that multiple hydrocarbons can be ``retrieved''
depending on the selection made ahead of a retrieval. More importantly, we find
that the estimates of other parameters such as the abundance of key absorbers
like methane can be biased by $\sim$0.5 dex (by a factor of $\sim$3) due to
such choices. This shows that beyond the possible misidentification of a
molecular feature (e.g., current debate surrounding dimethyl sulfide, DMS, in
K2-18 b), the implicit molecular detections made pre-retrieval to avoid
retrieving for hundreds of molecules at a time can bias a large range of
parameters. We thus recommend sensitivity analysis to assess the dependencies
of atmospheric inferences on such selections in tandem with complementary
information (e.g., chemistry models) to support any pre-retrieval selection.
Finally, we introduce an independent path to constrain the dominant atmospheric
constituent through the scale height, even when it lacks observable absorption
features (e.g., H$_2$ and N$_2$).
Authors' comments: Comments welcome
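Two quantities in the abstract above lend themselves to a short worked example: the conversion of a bias in dex to a multiplicative factor, and the dependence of the scale height on the mean molecular weight. The sketch below uses the standard isothermal scale height H = k_B T / (mu m_u g); the temperature and gravity are rough, Titan-like values chosen for illustration and are not taken from the paper.

```python
K_B = 1.380649e-23       # Boltzmann constant in J/K
M_U = 1.66053906660e-27  # atomic mass constant in kg

def dex_to_factor(dex):
    """A difference of `dex` in log10 space corresponds to a factor of 10**dex."""
    return 10.0 ** dex

def scale_height(temperature_k, mean_molecular_weight, gravity_m_s2):
    """Isothermal scale height H = k_B * T / (mu * m_u * g), returned in metres."""
    return K_B * temperature_k / (mean_molecular_weight * M_U * gravity_m_s2)

bias_factor = dex_to_factor(0.5)  # the ~0.5 dex bias quoted in the abstract

# Illustrative inputs only (assumed): T = 150 K, g = 1.35 m/s^2.
T_ASSUMED, G_ASSUMED = 150.0, 1.35
h_n2 = scale_height(T_ASSUMED, 28.0, G_ASSUMED)  # N2-dominated atmosphere
h_h2 = scale_height(T_ASSUMED, 2.0, G_ASSUMED)   # H2-dominated atmosphere
```

Because H scales as 1/mu, an N2-dominated and an H2-dominated atmosphere at the same temperature and gravity differ in scale height by the ratio of their molecular weights, which is why the scale height can constrain a dominant constituent that has no observable absorption feature.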
Yu Wang, Shiwan Zhao, Ming Fan, Zhihu Wang, Yubo Zhang, Xicheng Zhang, Zhengfan Wang, Heyuan Huang et al.
The integration of external knowledge through Retrieval-Augmented Generation (RAG) has become foundational in enhancing large language models (LLMs) for knowledge-intensive tasks. However, existing RAG paradigms often overlook the cognitive step of applying knowledge, leaving a gap between retrieved facts and task-specific reasoning. In this work, we introduce RAG+, a principled and modular extension that explicitly incorporates application-aware reasoning into the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and aligned application examples, created either manually or automatically, and retrieves both jointly during inference. This design enables LLMs not only to access relevant information but also to apply it within structured, goal-oriented reasoning processes. Experiments across mathematical, legal, and medical domains, conducted on multiple models, demonstrate that RAG+ consistently outperforms standard RAG variants, achieving average improvements of 3-5%, and peak gains up to 7.5% in complex scenarios. By bridging retrieval with actionable application, RAG+ advances a more cognitively grounded framework for knowledge integration, representing a step toward more interpretable and capable LLMs.
Xiaohan Yu, Pu Jian, Chong Chen
Retrieval-Augmented Generation (RAG) has demonstrated considerable
effectiveness in open-domain question answering. However, when applied to
heterogeneous documents, comprising both textual and tabular components,
existing RAG approaches exhibit critical limitations. The prevailing practice
of flattening tables and chunking strategies disrupts the intrinsic tabular
structure, leads to information loss, and undermines the reasoning capabilities
of LLMs in multi-hop, global queries. To address these challenges, we propose
TableRAG, a hybrid framework that unifies textual understanding and complex
manipulations over tabular data. TableRAG iteratively operates in four steps:
context-sensitive query decomposition, text retrieval, SQL programming and
execution, and compositional intermediate answer generation. We also develop
HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous
reasoning capabilities. Experimental results demonstrate that TableRAG
consistently outperforms existing baselines on both public datasets and our
HeteQA, establishing a new state-of-the-art for heterogeneous document question
answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
Authors' comments: Under review. Code is available at
https://github.com/yxh-y/TableRAG/tree/main
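The "SQL programming and execution" step in the TableRAG abstract can be illustrated with Python's built-in `sqlite3` module. This is only a sketch of the general idea of keeping a table relational instead of flattening it into text: the `sales` table, its rows and the hand-written SQL (standing in for a query an LLM would generate) are hypothetical, and none of it reflects the TableRAG code base.

```python
import sqlite3

# Load a small table into an in-memory database so its structure is preserved.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", 2023, 120.0),
        ("South", 2023, 95.5),
        ("North", 2024, 140.0),
        ("South", 2024, 110.5),
    ],
)

def answer_sub_query(sql, params=()):
    """Execute one SQL statement produced for a decomposed sub-question."""
    return conn.execute(sql, params).fetchall()

# Sub-question: "Which region had the highest total revenue?"
# The SQL below is written by hand; in a full pipeline a model would generate it.
rows = answer_sub_query(
    "SELECT region, SUM(revenue) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC"
)
top_region, top_revenue = rows[0]
```

A global aggregate like this one needs every row of the table, which is exactly the kind of query that chunked, flattened tables handle poorly.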
Linlin Wang, Tianqing Zhu, Laiqiao Qin, Longxiang Gao, Wanlei Zhou
Retrieval-Augmented Generation (RAG) systems can significantly enhance the performance of large language models by integrating external knowledge. However, RAG also introduces new security risks. Existing research focuses mainly on how poisoning attacks in RAG systems affect model output quality, overlooking their potential to amplify model biases. For example, when querying about domestic violence victims, a compromised RAG system might preferentially retrieve documents depicting women as victims, causing the model to generate outputs that perpetuate gender stereotypes even when the original query is gender neutral. To show the impact of the bias, this paper proposes a Bias Retrieval and Reward Attack (BRRA) framework, which systematically investigates attack pathways that amplify language model biases through RAG system manipulation. We design an adversarial document generation method based on multi-objective reward functions, employ subspace projection techniques to manipulate retrieval results, and construct a cyclic feedback mechanism for continuous bias amplification. Experiments on multiple mainstream large language models demonstrate that BRRA attacks can significantly enhance model biases in multiple dimensions. In addition, we explore a dual-stage defense mechanism to effectively mitigate the impacts of the attack. This study reveals that poisoning attacks in RAG systems directly amplify model output biases and clarifies the relationship between RAG system security and model fairness. This novel potential attack indicates that the fairness of RAG systems needs continued attention.
Jiaqi Samantha Zhan, Crystina Zhang, Shengyao Zhuang, Xueguang Ma, Jimmy Lin
Effective video retrieval remains challenging due to the complexity of integrating visual, auditory, and textual modalities. In this paper, we explore unified retrieval methods using OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, in the context of the MAGMaR shared task. Evaluated on the comprehensive MultiVENT 2.0 dataset, OmniEmbed generates unified embeddings for text, images, audio, and video, enabling robust multimodal retrieval. By finetuning OmniEmbed with the combined multimodal data (visual frames, audio tracks, and textual descriptions) provided in MultiVENT 2.0, we achieve substantial improvements in complex, multilingual video retrieval tasks. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025, highlighting the practical effectiveness of our unified multimodal retrieval approach. The model checkpoint from this work is open-sourced.
Val Andrei Fajardo, David B. Emerson, Amandeep Singh, Veronica Chatrath, Marcelo Lotif, Ravi Theja, Alex Cheung, Izuki Matsubi
Retrieval-augmented generation (RAG) systems have been shown to be effective
in addressing many of the drawbacks of relying solely on the parametric memory
of large language models. Recent work has demonstrated that RAG systems can be
improved via fine-tuning of their retriever and generator models. In this work,
we introduce FedRAG, a framework for fine-tuning RAG systems across centralized
and federated architectures. FedRAG supports state-of-the-art fine-tuning
methods, offering a simple and intuitive interface and a seamless conversion
from centralized to federated training tasks. FedRAG is also deeply integrated
with the modern RAG ecosystem, filling a critical gap in available tools.
Authors' comments: 9 pages, 4 figures, 2 tables. Accepted for the CODEML Workshop at
ICML 2025. Framework code available at
https://github.com/VectorInstitute/fed-rag
Arnav Yayavaram, Siddharth Yayavaram, Simran Khanuja, Michael Saxon, Graham Neubig
As text-to-image models become increasingly prevalent, ensuring their
equitable performance across diverse cultural contexts is critical. Efforts to
mitigate cross-cultural biases have been hampered by trade-offs, including a
loss in performance, factual inaccuracies, or offensive outputs. Despite
widespread recognition of these challenges, an inability to reliably measure
these biases has stalled progress. To address this gap, we introduce CAIRe, a
novel evaluation metric that assesses the degree of cultural relevance of an
image, given a user-defined set of labels. Our framework grounds entities and
concepts in the image to a knowledge base and uses factual information to give
independent graded judgments for each culture label. On a manually curated
dataset of culturally salient but rare items built using language models, CAIRe
surpasses all baselines by 28% F1 points. Additionally, we construct two
datasets for culturally universal concepts, one comprising T2I-generated
outputs and another retrieved from naturally occurring data. CAIRe achieves
Pearson's correlations of 0.56 and 0.66 with human ratings on these sets, based
on a 5-point Likert scale of cultural relevance. This demonstrates its strong
alignment with human judgment across diverse image sources.
Authors' comments: Preprint, under review
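The agreement figures in the CAIRe abstract are Pearson correlations between metric scores and human ratings. As a reminder of what that statistic measures, here is a small self-contained sketch; the scores and ratings are made-up placeholders, not data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance of xs and ys divided by the product of their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical metric outputs and 5-point Likert ratings for five images.
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
human_ratings = [1, 2, 3, 4, 5]
r = pearson_r(metric_scores, human_ratings)
```

A value near 1 means the metric orders and spaces the images much as human raters do; values such as the reported 0.56 and 0.66 indicate a moderate to fairly strong linear relationship.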
Fan Xu, Luis A. Leiva
Different machine learning models can represent the same underlying concept in different ways. This variability is particularly valuable for in-the-wild multimodal retrieval, where the objective is to identify the corresponding representation in one modality given another modality as input. This challenge can be effectively framed as a feature alignment problem. For example, given a sentence encoded by a language model, retrieve the most semantically aligned image based on features produced by an image encoder, or vice versa. In this work, we first investigate the geometric relationships between visual and textual embeddings derived from both vision-language models and combined unimodal models. We then align these representations using four standard similarity metrics as well as two learned ones, implemented via neural networks. Our findings indicate that the Wasserstein distance can serve as an informative measure of the modality gap, while cosine similarity consistently outperforms alternative metrics in feature alignment tasks. Furthermore, we observe that conventional architectures such as multilayer perceptrons are insufficient for capturing the complex interactions between image and text representations. Our study offers novel insights and practical considerations for researchers working in multimodal information retrieval, particularly in real-world, cross-modal applications.
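The retrieval setting described above, where a text embedding is matched against image embeddings by cosine similarity, can be sketched in a few lines. The example assumes the two encoders already map into a shared space of equal dimension; the three-dimensional vectors and file names are placeholders rather than real model outputs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_best(query_vec, candidates):
    """Return the key of the candidate embedding most similar to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query_vec, candidates[name]))

# Placeholder embeddings (assumed to be already aligned across modalities).
text_embedding = [0.9, 0.1, 0.0]
image_embeddings = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
    "tree.jpg": [0.1, 0.9, 0.1],
}
best_match = retrieve_best(text_embedding, image_embeddings)
```

Because cosine similarity ignores vector length, it compares only direction, which is one plausible reason it is a robust default when embeddings come from different encoders.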
Ke Wang, Bo Pan, Yingchaojie Feng, Yuwei Wu, Jieyi Chen, Minfeng Zhu, Wei Chen
Graph-based Retrieval-Augmented Generation (RAG) has shown great capability
in enhancing the answers of Large Language Models (LLMs) with an external knowledge
base. Compared to traditional RAG, it introduces a graph as an intermediate
representation to capture better structured relational knowledge in the corpus,
elevating the precision and comprehensiveness of generation results. However,
developers usually face challenges in analyzing the effectiveness of GraphRAG
on their dataset due to GraphRAG's complex information processing pipeline and
the overwhelming amount of LLM invocations involved during graph construction
and querying, which limits GraphRAG's interpretability and accessibility. This
research proposes a visual analysis framework that helps RAG developers
identify critical recalls of GraphRAG and trace these recalls through the
GraphRAG pipeline. Based on this framework, we develop XGraphRAG, a prototype
system incorporating a set of interactive visualizations that facilitate users'
analysis process, supporting the collection of failure cases and the
identification of improvement opportunities. Our evaluation demonstrates the effectiveness and
usability of our approach. Our work is open-sourced and available at
https://github.com/Gk0Wk/XGraphRAG.
Authors' comments: Accepted to IEEE Pacific Visualization Conference 2025
Animesh Bhandari
Fusion frames are extensively studied due to their effectiveness in
recovering signals from large-scale data. They are applicable in distributed
processing, wireless sensor networks, and packet encoding systems due to their
robustness and redundancy. Motivated by the foundational work of Bemrose et
al.\cite{Be16} and Balan\cite{Ba13}, this paper investigates the theoretical
properties and characterizations of phase retrievable weaving fusion frames.
These frames offer enhanced redundancy and stability in signal reconstruction.
We present key results that deepen the understanding of their structure and
behaviour. Lastly, an application involving probabilistic erasure is explored
to demonstrate their practical utility.
Authors' comments: arXiv admin note: text overlap with arXiv:2409.01288
Abdellah Ghassel, Ian Robinson, Gabriel Tanase, Hal Cooper, Bryan Thompson, Zhen Han, Vassilis N. Ioannidis, Soji Adeshina et al.
Retrieval-Augmented Generation (RAG) grounds large language models in
external evidence, yet it still falters when answers must be pieced together
across semantically distant documents. We close this gap with the Hierarchical
Lexical Graph (HLG), a three-tier index that (i) traces every atomic
proposition to its source, (ii) clusters propositions into latent topics, and
(iii) links entities and relations to expose cross-document paths. On top of
HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG,
which performs fine-grained entity-aware beam search over propositions for
high-precision factoid questions, and TopicGraphRAG, which selects coarse
topics before expanding along entity links to supply broad yet relevant context
for exploratory queries. Additionally, existing benchmarks lack the complexity
required to rigorously evaluate multi-hop summarization systems, often focusing
on single-document queries or limited datasets. To address this, we introduce a
synthetic dataset generation pipeline that curates realistic, multi-document
question-answer pairs, enabling robust evaluation of multi-hop retrieval
systems. Extensive experiments across five datasets demonstrate that our
methods outperform naive chunk-based RAG, achieving an average relative
improvement of 23.1% in retrieval recall and correctness. An open-source Python
library is available at https://github.com/awslabs/graphrag-toolkit.
Authors' comments: KDD '25
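The three tiers of the Hierarchical Lexical Graph described above (propositions traced to a source, propositions grouped into topics, and entity links across documents) can be pictured with a toy data structure. This sketch is not the graphrag-toolkit API: the class names, fields and the `cross_document_neighbours` helper are hypothetical, and topics and entities are supplied by hand instead of being extracted.

```python
from dataclasses import dataclass, field

@dataclass
class Proposition:
    text: str
    source_id: str                      # tier (i): every proposition keeps its source
    entities: set = field(default_factory=set)

@dataclass
class HierarchicalIndex:
    propositions: list = field(default_factory=list)
    topics: dict = field(default_factory=dict)        # tier (ii): topic -> proposition ids
    entity_links: dict = field(default_factory=dict)  # tier (iii): entity -> proposition ids

    def add(self, prop, topic):
        """Store a proposition and register it under its topic and entities."""
        idx = len(self.propositions)
        self.propositions.append(prop)
        self.topics.setdefault(topic, []).append(idx)
        for entity in prop.entities:
            self.entity_links.setdefault(entity, set()).add(idx)
        return idx

    def cross_document_neighbours(self, idx):
        """Propositions sharing an entity with `idx` but coming from another source."""
        origin = self.propositions[idx]
        linked = set()
        for entity in origin.entities:
            linked |= self.entity_links.get(entity, set())
        return sorted(i for i in linked if self.propositions[i].source_id != origin.source_id)

index = HierarchicalIndex()
a = index.add(Proposition("Ada Lovelace wrote notes on the Analytical Engine.", "doc1",
                          {"Ada Lovelace", "Analytical Engine"}), "computing history")
b = index.add(Proposition("The Analytical Engine was designed by Charles Babbage.", "doc2",
                          {"Analytical Engine", "Charles Babbage"}), "computing history")
c = index.add(Proposition("Charles Babbage was born in London.", "doc3",
                          {"Charles Babbage", "London"}), "biography")
hops_from_a = index.cross_document_neighbours(a)
```

Following entity links twice, from the first proposition to the second and then to the third, is the kind of cross-document path that chunk-level similarity search alone would not surface.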
Jinbao Zhu, Xiaohu Tang
The problem of $T$-colluding private information retrieval (PIR) enables the
user to retrieve one out of $M$ files from a distributed storage system with
$N$ servers without revealing anything about the index of the desired file to
any group of up to $T$ colluding servers. In the considered storage system, the
$M$ files are stored across the $N$ distributed servers in an $X$-secure
$K$-coded manner such that any group of up to $X$ colluding servers learns
nothing about the files; the storage overhead at each server is reduced by a
factor of $\frac{1}{K}$ compared to the total size of the files; and the files
can be reconstructed from any $K+X$ servers. However, in practical scenarios,
when the user retrieves the desired file from the distributed system, some
servers may respond to the user very slowly or not respond at all. These
servers are referred to as \emph{stragglers}, and particularly their identities
and numbers are unknown in advance and may change over time. This paper
considers the adaptive PIR problem, in which the scheme must tolerate the
presence of a varying number of stragglers. We propose a general coding method
for designing adaptive PIR schemes by introducing the concept of a
\emph{feasible PIR coding framework}. We demonstrate that any \emph{feasible
PIR coding framework} over a finite field $\mathbb{F}_q$ with size $q$ can be
used to construct an adaptive PIR scheme that achieves a retrieval rate of
$1-\frac{K+X+T-1}{N-S}$ simultaneously for all numbers of stragglers $0\leq
S\leq N-(K+X+T)$ over the same finite field. Additionally, we provide an
implementation of the \emph{feasible PIR coding framework}, ensuring that the
adaptive PIR scheme operates over any finite field $\mathbb{F}_q$ with size
$q\geq N+\max\{K, N-(K+X+T-1)\}$.
Authors' comments: Accepted by IEEE TIT
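The rate expression and field-size bound stated in the abstract above are easy to evaluate for concrete parameters. The sketch below does this with exact fractions; the parameter values N = 10, K = 2, X = 1, T = 1 are arbitrary examples, and the function names are illustrative.

```python
from fractions import Fraction

def retrieval_rate(N, K, X, T, S):
    """Rate 1 - (K+X+T-1)/(N-S) for S stragglers, valid for 0 <= S <= N-(K+X+T)."""
    max_stragglers = N - (K + X + T)
    if not 0 <= S <= max_stragglers:
        raise ValueError(f"S must lie in [0, {max_stragglers}]")
    return 1 - Fraction(K + X + T - 1, N - S)

def min_field_size(N, K, X, T):
    """Lower bound on q from the abstract: q >= N + max(K, N-(K+X+T-1))."""
    return N + max(K, N - (K + X + T - 1))

# Arbitrary example parameters (assumed for illustration).
N, K, X, T = 10, 2, 1, 1
rates = {S: retrieval_rate(N, K, X, T, S) for S in range(N - (K + X + T) + 1)}
q_bound = min_field_size(N, K, X, T)
```

The dictionary `rates` shows the rate falling as more servers straggle, from the value at S = 0 down to the value at the largest tolerable S. Note that q must also be a prime power for the field to exist, so the usable field size is the smallest prime power at or above `q_bound`.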
Jingyuan Qi, Zhiyang Xu, Qifan Wang, Lifu Huang
We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm
that enhances image generation by autoregressively incorporating k-nearest
neighbor retrievals at the patch level. Unlike prior methods that perform a
single, static retrieval before generation and condition the entire generation
on fixed reference images, AR-RAG performs context-aware retrievals at each
generation step, using prior-generated patches as queries to retrieve and
incorporate the most relevant patch-level visual references, enabling the model
to respond to evolving generation needs while avoiding limitations (e.g.,
over-copying, stylistic bias, etc.) prevalent in existing methods. To realize
AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in
Decoding (DAiD), a training-free plug-and-use decoding strategy that directly
merges the distribution of model-predicted patches with the distribution of
retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a
parameter-efficient fine-tuning method that progressively smooths the features
of retrieved patches via multi-scale convolution operations and leverages them
to augment the image generation process. We validate the effectiveness of
AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and
DPG-Bench, demonstrating significant performance gains over state-of-the-art
image generation models.
Authors' comments: Image Generation, Retrieval Augmented Generation
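The DAiD strategy above is described as merging the distribution of model-predicted patches with the distribution of retrieved patches at decoding time. One simple way to merge two distributions is linear interpolation, sketched below. The abstract does not say which merging rule DAiD uses, so the interpolation, the `weight` parameter and the three-token vocabulary are assumptions for illustration only.

```python
def merge_distributions(model_probs, retrieved_probs, weight=0.3):
    """Interpolate two probability distributions over the same patch-token vocabulary.

    `weight` is the share given to the retrieval-based distribution (hypothetical parameter).
    """
    if len(model_probs) != len(retrieved_probs):
        raise ValueError("distributions must cover the same vocabulary")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    merged = [(1 - weight) * p + weight * q for p, q in zip(model_probs, retrieved_probs)]
    total = sum(merged)
    return [m / total for m in merged]  # renormalise to guard against rounding drift

# Toy distributions over three candidate patch tokens.
model_probs = [0.5, 0.3, 0.2]      # what the generator predicts for the next patch
retrieved_probs = [0.1, 0.1, 0.8]  # derived from retrieved nearest-neighbour patches
merged = merge_distributions(model_probs, retrieved_probs, weight=0.5)
next_patch = max(range(len(merged)), key=merged.__getitem__)
```

Since nothing is trained, a rule of this kind can be applied to an existing generator at each decoding step, which matches the training-free, plug-and-use framing in the abstract.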
David Wan, Han Wang, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Online video web content is richly multimodal: a single video blends vision,
speech, ambient audio, and on-screen text. Retrieval systems typically treat
these modalities as independent retrieval sources, which can lead to noisy and
subpar retrieval. We explore multimodal video content retrieval, where
relevance can be scored from one particular modality or jointly across multiple
modalities simultaneously. Consequently, an effective retriever must
dynamically choose which modality (or set of modalities) best addresses the
query. We introduce CLaMR, a multimodal, late-interaction retriever that
jointly indexes 4 modalities: video frames, transcribed speech, on-screen text,
and metadata. CLaMR jointly encodes all modalities with a unified multimodal
backbone for improved contextualization and is trained to enhance dynamic
modality selection via two key innovations. First, given the lack of training
data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale
synthetic training dataset built on MultiVENT 2.0 (event-centric videos in
various languages paired with queries) with modality-targeted queries. Next, we
propose a modality-aware loss that jointly trains according to a standard
contrastive objective alongside an objective for learning correct modality
usage. On the test sets of MultiVENT 2.0++ and MSRVTT, conventional aggregation
strategies, such as averaging similarities for baseline retrievers, degrade
performance by introducing noise from irrelevant modalities. In contrast, CLaMR
consistently outperforms existing retrievers: on MultiVENT 2.0++, CLaMR
improves nDCG@10 by 25.6 over the best single-modality retriever and by 35.4
over the best multi-modality retriever. We illustrate CLaMR's downstream
utility on long-video QA, retrieving relevant frames and obtaining a 3.50%
boost over LanguageBind on Video-MME and 1.42% over dense sampling on
LongVideoBench.
Authors' comments: 18 pages. Code and data: https://github.com/meetdavidwan/clamr
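The modality-aware loss in the CLaMR abstract combines a standard contrastive objective with an objective for correct modality usage. A minimal sketch of such a combined loss is given below. The exact form used by CLaMR is not stated in the abstract, so the cross-entropy term over modalities, the weighting factor `alpha` and all numeric scores are assumptions.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]

def contrastive_loss(similarities, positive_index):
    """InfoNCE-style loss: negative log-probability of the positive video among candidates."""
    return -log_softmax(similarities)[positive_index]

def modality_loss(modality_scores, target_modality):
    """Cross-entropy on which modality the query targets (assumed form)."""
    return -log_softmax(modality_scores)[target_modality]

def modality_aware_loss(similarities, positive_index, modality_scores, target_modality, alpha=0.5):
    """Contrastive term plus a weighted modality term; `alpha` is a hypothetical weight."""
    return (contrastive_loss(similarities, positive_index)
            + alpha * modality_loss(modality_scores, target_modality))

MODALITIES = ["frames", "speech", "ocr", "metadata"]  # the four modalities named in the abstract
loss = modality_aware_loss(
    similarities=[2.0, 0.5, -1.0],          # query vs. one positive and two negative videos
    positive_index=0,
    modality_scores=[0.1, 1.5, 0.2, 0.0],   # per-modality relevance scores for the positive
    target_modality=MODALITIES.index("speech"),
)
```

The second term penalises the retriever when it ranks the right video but attributes relevance to the wrong modality, which is the behaviour the modality-targeted queries in the training data are meant to teach.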