Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Sedigheh Eslami, Scott Martens, Bo Wang et al.
We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding
model that unifies text and image representations through a novel architecture
supporting both single-vector and multi-vector embeddings in the late
interaction style. The model incorporates task-specific Low-Rank Adaptation
(LoRA) adapters to optimize performance across diverse retrieval scenarios,
including query-based information retrieval, cross-modal semantic similarity,
and programming code search. Comprehensive evaluations demonstrate that
jina-embeddings-v4 achieves state-of-the-art performance on both single-modal
and cross-modal retrieval tasks, with particular strength in processing
visually rich content such as tables, charts, diagrams, and mixed-media
formats. To facilitate evaluation of this capability, we also introduce
Jina-VDR, a novel benchmark specifically designed for visually rich image
retrieval.
Authors' comments: 22 pages, 1-10 main, 14-22 experimental results, benchmark tables
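The difference between single-vector and late-interaction (multi-vector) scoring can be illustrated with a small sketch. It assumes the common MaxSim formulation of late interaction (each query token is matched to its most similar document token and the maxima are summed); the abstract does not state the exact scoring function used by jina-embeddings-v4, and the toy vectors and function names below are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def late_interaction_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """MaxSim (assumed formulation): sum over query tokens of the best document-token match."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (illustrative values only).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]  # one token aligned with each query token
doc_b = [[1.0, 1.0]]              # a single "averaged" token, like a pooled vector

score_a = late_interaction_score(query_tokens, doc_a)
score_b = late_interaction_score(query_tokens, doc_b)
print(score_a, score_b)
```

In this toy case the multi-vector document `doc_a` scores higher than the pooled-style `doc_b`, because each query token finds an exact match instead of a compromise direction.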
Mihailo Stojnic
We study theoretical limits of \emph{descending} phase retrieval algorithms. Utilizing \emph{Random duality theory} (RDT) we develop a generic program that allows statistical characterization of various algorithmic performance metrics. Through these we identify the concepts of \emph{parametric manifold} and its \emph{funneling points} as key mathematical objects that govern the underlying algorithms' behavior. An isomorphism between single funneling point manifolds and global convergence of descending algorithms is established. The structure and shape of the parametric manifold as well as its dependence on the sample complexity are studied through both plain and lifted RDT. Emergence of a phase transition is observed. Namely, as sample complexity increases, the parametric manifold transitions from a multi to a single funneling point structure. This in turn corresponds to a transition from the scenarios where descending algorithms generically fail to the scenarios where they succeed in solving phase retrieval. We also develop and implement a practical algorithmic variant that in a hybrid alternating fashion combines a barrier and a plain gradient descent. Even though the theoretical results are obtained for infinite dimensional scenarios (and consequently non-jittery parametric manifolds), we observe a strong agreement between the theoretical and simulated phase transition predictions for fairly small dimensions on the order of a few hundred.
Ines Besrour, Jingbo He, Tobias Schreieder, Michael Färber
We present RAGentA, a multi-agent retrieval-augmented generation (RAG)
framework for attributed question answering (QA). With the goal of trustworthy
answer generation, RAGentA focuses on optimizing answer correctness, defined by
coverage and relevance to the question, and faithfulness, which measures the
extent to which answers are grounded in retrieved documents. RAGentA uses a
multi-agent architecture that iteratively filters retrieved documents,
generates attributed answers with in-line citations, and verifies completeness
through dynamic refinement. Central to the framework is a hybrid retrieval
strategy that combines sparse and dense methods, improving Recall@20 by 12.5%
compared to the best single retrieval model, resulting in more correct and
well-supported answers. Evaluated on a synthetic QA dataset derived from the
FineWeb index, RAGentA outperforms standard RAG baselines, achieving gains of
1.09% in correctness and 10.72% in faithfulness. These results demonstrate the
effectiveness of the multi-agent architecture and hybrid retrieval in advancing
trustworthy QA.
Authors' comments: Accepted at SIGIR 2025
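A hybrid sparse-plus-dense retriever and the Recall@k metric can be sketched as follows. The abstract does not say how RAGentA fuses the two rankings, so the weighted sum of min-max normalized scores used here is an assumption chosen for illustration; the document IDs, scores, and the `alpha` weight are hypothetical.

```python
from typing import Dict, List, Set


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse: Dict[str, float], dense: Dict[str, float], alpha: float = 0.5) -> List[str]:
    """Rank documents by alpha * sparse + (1 - alpha) * dense (assumed fusion rule)."""
    s, d = min_max(sparse), min_max(dense)
    docs = set(s) | set(d)
    fused = {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0) for doc in docs}
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(docs, key=lambda doc: (-fused[doc], doc))


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Hypothetical scores: BM25-style sparse scores and cosine-style dense scores.
sparse_scores = {"d1": 12.0, "d2": 7.0, "d3": 2.0}
dense_scores = {"d2": 0.9, "d3": 0.8, "d4": 0.1}

ranking = hybrid_rank(sparse_scores, dense_scores)
relevant_docs = {"d2", "d3"}
print(ranking, recall_at_k(ranking, relevant_docs, k=2))
```

A document that only one retriever finds (here `d1` or `d4`) still enters the fused ranking, which is the usual motivation for combining the two methods.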
Xin Jiang, Meiqi Cao, Hao Tang, Fei Shen, Zechao Li
Fine-Grained Image Retrieval~(FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, both regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This brings the FGIR task, performed with these adjusted features, closer to the task solved during pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.
Le Vu Anh, Nguyen Viet Anh, Mehmet Dik, Luong Van Nghia
Retrieval-augmented generation (RAG) has become a common strategy for
updating large language model (LLM) responses with current, external
information. However, models may still rely on memorized training data, bypass
the retrieved evidence, and produce contaminated outputs. We introduce
Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects
such behavior without requiring model access or retraining. RePCS compares two
inference paths: (i) a parametric path using only the query, and (ii) a
retrieval-augmented path using both the query and retrieved context, by
computing the Kullback-Leibler (KL) divergence between their output
distributions. A low divergence suggests that the retrieved context had minimal
impact, indicating potential memorization. This procedure is model-agnostic,
requires no gradient or internal state access, and adds only a single
additional forward pass. We further derive PAC-style guarantees that link the
KL threshold to user-defined false positive and false negative rates. On the
Prompt-WNQA benchmark, RePCS achieves a ROC-AUC of 0.918. This result
outperforms the strongest prior method by 6.5 percentage points while keeping
latency overhead below 4.7% on an NVIDIA T4 GPU. RePCS offers a lightweight,
black-box safeguard to verify whether a RAG system meaningfully leverages
retrieval, making it especially valuable in safety-critical applications.
Authors' comments: 11 pages, 7 figures, 5 tables
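The two-path comparison can be sketched as a KL divergence between two output distributions followed by a threshold test. This is a minimal illustration, not the authors' implementation: the direction of the divergence, the threshold value, and the toy distributions are assumptions, and the abstract's PAC-style calibration of the threshold is not reproduced.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def flag_possible_memorization(
    p_retrieval: Sequence[float], p_parametric: Sequence[float], threshold: float
) -> bool:
    """Low divergence means the retrieved context barely changed the output."""
    return kl_divergence(p_retrieval, p_parametric) < threshold


# Toy next-token distributions over three candidate answers (hypothetical).
parametric = [0.7, 0.2, 0.1]            # query only
retrieval_unchanged = [0.7, 0.2, 0.1]   # context had no effect
retrieval_shifted = [0.1, 0.2, 0.7]     # context changed the answer

THRESHOLD = 0.05  # assumed value for illustration only

kl_unchanged = kl_divergence(retrieval_unchanged, parametric)
kl_shifted = kl_divergence(retrieval_shifted, parametric)
print(kl_unchanged, kl_shifted)
print(flag_possible_memorization(retrieval_unchanged, parametric, THRESHOLD))
print(flag_possible_memorization(retrieval_shifted, parametric, THRESHOLD))
```

Only the two output distributions are needed, which matches the black-box setting described above: no gradients or internal states are read.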
Abu Hanif Muhammad Syarubany, Chang Dong Yoo
Enterprise deployments of large language models (LLMs) must serve continuously changing document collections with sub-second latency and predictable GPU cost, requirements that classical Retrieval-Augmented Generation (RAG) pipelines only partially satisfy. We present PentaRAG, a five-layer module that routes each query through two instant caches (fixed key-value and semantic), a memory-recall mode that exploits the LLM's own weights, an adaptive session memory, and a conventional retrieval-augmentation layer. Implemented with Mistral-8B, Milvus and vLLM, the system can answer most repeated or semantically similar questions from low-latency caches while retaining full retrieval for novel queries. On the TriviaQA domain, LoRA fine-tuning combined with the memory-recall layer raises answer similarity by approximately 8% and factual correctness by approximately 16% over the base model. Under a nine-session runtime simulation, cache warming reduces mean latency from several seconds to well below one second and shifts traffic toward the fast paths. Resource-efficiency tests show that PentaRAG cuts average GPU time to 0.248 seconds per query, roughly half that of a naive RAG baseline, and sustains an aggregate throughput of approximately 100,000 queries per second on our setup. These results demonstrate that a layered routing strategy can deliver freshness, speed, and efficiency simultaneously in production-grade RAG systems.
Authors' comments: Annual Conference of The Institute of Electronics and Information Engineers
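The layered routing idea can be sketched as a chain of handlers tried in order, where the first layer that returns an answer wins. Everything concrete here is a hypothetical stand-in: the layer order, the token-overlap test that substitutes for embedding similarity in the semantic cache, the 0.8 threshold, and the stubbed memory-recall and session-memory layers are assumptions, not details taken from the abstract.

```python
import re
from typing import Callable, List, Optional, Tuple

Layer = Tuple[str, Callable[[str], Optional[str]]]

# Hypothetical cache contents.
EXACT_CACHE = {"what is rag?": "Retrieval-augmented generation."}


def tokens(text: str) -> set:
    """Lowercased word tokens with punctuation removed."""
    return set(re.findall(r"\w+", text.lower()))


def fixed_cache(query: str) -> Optional[str]:
    """Exact key-value lookup after trimming and lowercasing."""
    return EXACT_CACHE.get(query.strip().lower())


def semantic_cache(query: str, threshold: float = 0.8) -> Optional[str]:
    """Toy semantic match: Jaccard overlap stands in for embedding similarity."""
    q = tokens(query)
    for cached_query, answer in EXACT_CACHE.items():
        c = tokens(cached_query)
        if q and len(q & c) / len(q | c) >= threshold:
            return answer
    return None


def not_implemented_layer(query: str) -> Optional[str]:
    """Placeholder for the memory-recall and session-memory layers."""
    return None


def full_retrieval(query: str) -> Optional[str]:
    """Placeholder for the conventional retrieval-augmentation layer."""
    return f"[retrieved answer for: {query}]"


LAYERS: List[Layer] = [
    ("fixed_cache", fixed_cache),
    ("semantic_cache", semantic_cache),
    ("memory_recall", not_implemented_layer),
    ("session_memory", not_implemented_layer),
    ("retrieval", full_retrieval),
]


def route(query: str, layers: List[Layer] = LAYERS) -> Tuple[str, str]:
    """Return (layer name, answer) from the first layer that can answer."""
    for name, handler in layers:
        answer = handler(query)
        if answer is not None:
            return name, answer
    raise LookupError("no layer produced an answer")


for q in ["What is RAG?", "what is rag", "Who wrote Hamlet?"]:
    print(q, "->", route(q)[0])
```

Placing the cheapest layers first means repeated or near-duplicate queries never reach the expensive retrieval path, which is the behavior the latency results above describe.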
Yanzhen Zou, Xianlin Zhao, Xinglu Pan, Bing Xie
Issue reports have been recognized to contain rich information for
retrieval-augmented code comment generation. However, how to minimize
hallucinations in the generated comments remains a significant challenge. In
this paper, we propose IsComment, an issue-based LLM retrieval and verification
approach for generating a method's design rationale, usage directives, and so on
as supplementary code comments. We first identify five main types of code
supplementary information that issue reports can provide through
code-comment-issue analysis. Next, we retrieve issue sentences containing these
types of supplementary information and generate candidate code comments. To
reduce hallucinations, we filter out those candidate comments that are
irrelevant to the code or unverifiable by the issue report, making the code
comment generation results more reliable. Our experiments indicate that
compared with LLMs, IsComment increases the coverage of manual supplementary
comments from 33.6% to 72.2% for ChatGPT, from 35.8% to 88.4% for GPT-4o, and
from 35.0% to 86.2% for DeepSeek-V3. Compared with existing work, IsComment can
generate richer and more useful supplementary code comments for program
understanding, which is quantitatively evaluated through the MESIA metric on
both methods with and without manual code comments.
Authors' comments: 12 pages, 8 figures
Dong Xu, Zhangfan Yang, Ka-chun Wong, Zexuan Zhu, Jiangqiang Li, Junkai Ji
Breakthroughs in high-accuracy protein structure prediction, such as
AlphaFold, have established receptor-based molecule design as a critical driver
for rapid early-phase drug discovery. However, most approaches still struggle
to balance pocket-specific geometric fit with strict valence and synthetic
constraints. To resolve this trade-off, a Retrieval-Enhanced Aligned Diffusion
termed READ is introduced, which is the first to merge molecular
Retrieval-Augmented Generation with an SE(3)-equivariant diffusion model.
Specifically, a contrastively pre-trained encoder aligns atom-level
representations during training, then retrieves graph embeddings of
pocket-matched scaffolds to guide each reverse-diffusion step at inference.
This single mechanism can inject real-world chemical priors exactly where
needed, producing valid, diverse, and shape-complementary ligands. Experimental
results demonstrate that READ can achieve very competitive performance in
CBGBench, surpassing state-of-the-art generative models and even native
ligands. That suggests retrieval and diffusion can be co-optimized for faster,
more reliable structure-based drug design.
Authors' comments: 13 pages, 5 figures
Aleksander Smywiński-Pohl, Tomer Libal, Adam Kaczmarczyk, Magdalena Król
One of the elements of legal research is looking for cases where judges have extended the meaning of a legal concept by providing interpretations of what a concept means or does not mean. This allows legal professionals to use such interpretations as precedents and helps laymen better understand the legal concept. The state-of-the-art approach for retrieving the most relevant interpretations for these concepts currently depends on the ranking of sentences and the training of language models over annotated examples. That manual annotation process can be quite expensive and needs to be repeated for each such concept, which has prompted recent research into automating this process. In this paper, we highlight the results of various experiments conducted to determine the volume, scope and even the need for manual annotation. First, we check what the optimal number of annotations per legal concept is. Second, we check whether the sentences for annotation can be drawn randomly or whether the model's performance improves when only the best candidates are annotated. Finally, we check the outcome of automating the annotation process with the help of an LLM.
Sarthak Chaturvedi, Anurag Acharya, Rounak Meyur, Koby Hayashi, Sai Munikoti, Sameera Horawalavithana
Evaluation benchmark characteristics may distort the true benefits of domain adaptation in retrieval models. This creates misleading assessments that influence deployment decisions in specialized domains. We show that two benchmarks with drastically different features such as topic diversity, boundary overlap, and semantic complexity can influence the perceived benefits of fine-tuning. Using environmental regulatory document retrieval as a case study, we fine-tune the ColBERTv2 model on Environmental Impact Statements (EIS) from federal agencies. We evaluate these models across two benchmarks with different semantic structures. Our findings reveal that identical domain adaptation approaches show very different perceived benefits depending on evaluation methodology. On one benchmark, with clearly separated topic boundaries, domain adaptation shows small improvements (maximum 0.61% NDCG gain). However, on the other benchmark with overlapping semantic structures, the same models demonstrate large improvements (up to 2.22% NDCG gain), a 3.6-fold difference in the performance benefit. We compare these benchmarks through topic diversity metrics, finding that the higher-performing benchmark shows 11% higher average cosine distances between contexts and 23% lower silhouette scores, directly contributing to the observed performance difference. These results demonstrate that benchmark selection strongly determines assessments of retrieval system effectiveness in specialized domains. Evaluation frameworks with well-separated topics regularly underestimate domain adaptation benefits, while those with overlapping semantic boundaries reveal improvements that better reflect real-world regulatory document complexity. Our findings have important implications for developing and deploying AI systems for interdisciplinary domains that integrate multiple topics.
Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Houbing Herbert Song
Retrieval-Augmented Generation (RAG) systems address factual inconsistencies in Large Language Models by grounding generation in external knowledge, yet they face a fundamental efficiency problem: simple queries consume computational resources equivalent to complex multi-hop reasoning tasks. We present SymRAG, a neuro-symbolic framework that introduces adaptive query routing based on real-time complexity and system load assessments. SymRAG dynamically selects symbolic, neural, or hybrid processing paths to align resource use with query demands. Evaluated on 2,000 queries from HotpotQA and DROP using Llama-3.2-3B and Mistral-7B models, SymRAG achieves 97.6--100.0% exact match accuracy with significantly lower CPU utilization (3.6--6.2%) and processing time (0.985--3.165s). Disabling adaptive logic results in a 169--1151% increase in processing time, highlighting the framework's impact. These results underscore the potential of adaptive neuro-symbolic routing for scalable, sustainable AI systems.
Jiale Zhang, Jiaxiang Chen, Zhucong Li, Jie Ding, Kui Zhao, Zenglin Xu, Xin Pang, Yinghui Xu
Retrieval-Augmented Generation (RAG) enhances language models by incorporating external knowledge at inference time. However, graph-based RAG systems often suffer from structural overhead and imprecise retrieval: they require costly pipelines for entity linking and relation extraction, yet frequently return subgraphs filled with loosely related or tangential content. This stems from a fundamental flaw -- semantic similarity does not imply semantic relevance. We introduce SlimRAG, a lightweight framework for retrieval without graphs. SlimRAG replaces structure-heavy components with a simple yet effective entity-aware mechanism. At indexing time, it constructs a compact entity-to-chunk table based on semantic embeddings. At query time, it identifies salient entities, retrieves and scores associated chunks, and assembles a concise, contextually relevant input -- without graph traversal or edge construction. To quantify retrieval efficiency, we propose Relative Index Token Utilization (RITU), a metric measuring the compactness of retrieved content. Experiments across multiple QA benchmarks show that SlimRAG outperforms strong flat and graph-based baselines in accuracy while reducing index size and RITU (e.g., 16.31 vs. 56+), highlighting the value of structure-free, entity-centric context selection. The code will be released soon. https://github.com/continue-ai-company/SlimRAG
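An entity-to-chunk table with graph-free lookup can be sketched in a few lines. This is a simplified illustration under stated assumptions: the abstract says the table is built from semantic embeddings, whereas the sketch uses case-insensitive substring matching as a stand-in, scores chunks by the number of matched entities, and uses hypothetical chunks and entity names.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_entity_table(chunks: Dict[str, str], entities: List[str]) -> Dict[str, Set[str]]:
    """Indexing time: map each entity to the IDs of chunks that mention it."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for chunk_id, text in chunks.items():
        lowered = text.lower()
        for entity in entities:
            if entity.lower() in lowered:
                table[entity.lower()].add(chunk_id)
    return dict(table)


def retrieve(query: str, table: Dict[str, Set[str]], top_k: int = 2) -> List[str]:
    """Query time: find entities in the query, then score chunks by entity hits."""
    lowered = query.lower()
    hits: Dict[str, int] = defaultdict(int)
    for entity, chunk_ids in table.items():
        if entity in lowered:
            for chunk_id in chunk_ids:
                hits[chunk_id] += 1
    # Highest hit count first; ties broken by chunk ID for determinism.
    return sorted(hits, key=lambda cid: (-hits[cid], cid))[:top_k]


# Hypothetical corpus and entity list.
chunks = {
    "c1": "Marie Curie won the Nobel Prize in Physics.",
    "c2": "The Nobel Prize is awarded in Stockholm.",
    "c3": "Paris is the capital of France.",
}
entities = ["Marie Curie", "Nobel Prize", "Paris"]

table = build_entity_table(chunks, entities)
selected = retrieve("Which Nobel Prize did Marie Curie win?", table)
print(selected)
```

No edges or traversal are involved: retrieval is a dictionary lookup plus a sort, which is where the reduction in index size and retrieved-context size would come from.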
Zhuocheng Zhang, Yang Feng, Min Zhang
Retrieval-Augmented Generation (RAG) plays a pivotal role in modern large
language model applications, with numerous existing frameworks offering a wide
range of functionalities to facilitate the development of RAG systems. However,
we have identified several persistent challenges in these frameworks, including
difficulties in algorithm reproduction and sharing, lack of new techniques, and
high system overhead. To address these limitations, we introduce
\textbf{FlexRAG}, an open-source framework specifically designed for research
and prototyping. FlexRAG supports text-based, multimodal, and network-based
RAG, providing comprehensive lifecycle support alongside efficient asynchronous
processing and persistent caching capabilities. By offering a robust and
flexible solution, FlexRAG enables researchers to rapidly develop, deploy, and
share advanced RAG systems. Our toolkit and resources are available at
\href{https://github.com/ictnlp/FlexRAG}{https://github.com/ictnlp/FlexRAG}.
Authors' comments: Accepted by ACL 2025 Demo
Evan Becker, Benjamin Bowman, Matthew Trager, Tian Yu Liu, Luca Zancato, Wei Xia, Stefano Soatto
Given a query and dataset, the optimal way of answering the query is to make use of all the information available. Modern LLMs exhibit impressive ability to memorize training data, but data not deemed important during training is forgotten, and information outside that training set cannot be made use of. Processing an entire dataset at inference time is infeasible due to the bounded nature of model resources (e.g. context size in transformers or states in state space models), meaning we must resort to external memory. This constraint naturally leads to the following problem: How can we decide, based on the present query and model, what among a virtually unbounded set of known data matters for inference? To minimize model uncertainty for a particular query at test-time, we introduce Retrieval In-Context Optimization (RICO), a retrieval method that uses gradients from the LLM itself to learn the optimal mixture of documents for answer generation. Unlike traditional retrieval-augmented generation (RAG), which relies on external heuristics for document retrieval, our approach leverages direct feedback from the model. Theoretically, we show that standard top-$k$ retrieval with model gradients can approximate our optimization procedure, and provide connections to the leave-one-out loss. We demonstrate empirically that by minimizing an unsupervised loss objective in the form of question perplexity, we can achieve comparable retriever metric performance to BM25 with \emph{no finetuning}. Furthermore, when evaluated on quality of the final prediction, our method often outperforms fine-tuned dense retrievers such as E5.
Prajwal Niraula, Julien de Wit, Robert Hargreaves, Iouli E. Gordon, Clara Sousa-Silva
Cassini's observations of Titan's atmosphere are exemplary benchmarks for
exoplanet atmospheric studies owing to (1) their precision and (2) our
independent knowledge of Titan. Leveraging these observations, we perform
retrievals (i.e., analyses) of Titan's transmission spectrum to investigate the
strengths/limitations of exoplanet atmospheric retrievals with a particular
focus on the underlying assumptions regarding the molecular species included in
the retrieval. We find that multiple hydrocarbons can be ``retrieved''
depending on the selection made ahead of a retrieval. More importantly, we find
that the estimates of other parameters such as the abundance of key absorbers
like methane can be biased by $\sim$0.5 dex (by a factor of $\sim$3) due to
such choices. This shows that beyond the possible misidentification of a
molecular feature (e.g., current debate surrounding dimethyl sulfide, DMS, in
K2-18 b), the implicit molecular detections made pre-retrieval to avoid
retrieving for hundreds of molecules at a time can bias a large range of
parameters. We thus recommend sensitivity analysis to assess the dependencies
of atmospheric inferences on such selections in tandem with complementary
information (e.g., chemistry models) to support any pre-retrieval selection.
Finally, we introduce an independent path to constrain the dominant atmospheric
constituent, even when lacking observable absorption features (e.g., H$_2$ and
N$_2$) through the scale height.
Authors' comments: Comments welcome
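Two quantities in this abstract lend themselves to a short worked example: the conversion from dex to a multiplicative factor, and the standard isothermal scale height $H = k_B T / (\mu m_u g)$, which depends on the mean molecular weight $\mu$ and therefore on the dominant constituent. The temperature and gravity below are illustrative round numbers for a Titan-like atmosphere, not values taken from the paper, and the function names are hypothetical.

```python
BOLTZMANN_J_PER_K = 1.380649e-23    # exact SI value
ATOMIC_MASS_KG = 1.66053906660e-27  # unified atomic mass unit


def dex_to_factor(dex: float) -> float:
    """A bias of x dex corresponds to a multiplicative factor of 10**x."""
    return 10.0 ** dex


def scale_height_m(temperature_k: float, mean_molecular_weight: float, gravity_m_s2: float) -> float:
    """Isothermal pressure scale height H = k_B * T / (mu * m_u * g), in metres."""
    return BOLTZMANN_J_PER_K * temperature_k / (
        mean_molecular_weight * ATOMIC_MASS_KG * gravity_m_s2
    )


# Illustrative inputs (assumptions, not from the abstract).
TEMPERATURE_K = 150.0
GRAVITY_M_S2 = 1.35

factor = dex_to_factor(0.5)
h_n2 = scale_height_m(TEMPERATURE_K, 28.0, GRAVITY_M_S2)  # N2-dominated
h_h2 = scale_height_m(TEMPERATURE_K, 2.0, GRAVITY_M_S2)   # H2-dominated

print(f"0.5 dex is a factor of about {factor:.2f}")
print(f"H(N2) = {h_n2 / 1000:.1f} km, H(H2) = {h_h2 / 1000:.1f} km")
```

At fixed temperature and gravity, an H$_2$-dominated atmosphere has a scale height 14 times larger than an N$_2$-dominated one, which is why the scale height can constrain the dominant constituent even when that gas has no observable absorption feature.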
Yu Wang, Shiwan Zhao, Ming Fan, Zhihu Wang, Yubo Zhang, Xicheng Zhang, Zhengfan Wang, Heyuan Huang et al.
The integration of external knowledge through Retrieval-Augmented Generation (RAG) has become foundational in enhancing large language models (LLMs) for knowledge-intensive tasks. However, existing RAG paradigms often overlook the cognitive step of applying knowledge, leaving a gap between retrieved facts and task-specific reasoning. In this work, we introduce RAG+, a principled and modular extension that explicitly incorporates application-aware reasoning into the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and aligned application examples, created either manually or automatically, and retrieves both jointly during inference. This design enables LLMs not only to access relevant information but also to apply it within structured, goal-oriented reasoning processes. Experiments across mathematical, legal, and medical domains, conducted on multiple models, demonstrate that RAG+ consistently outperforms standard RAG variants, achieving average improvements of 3-5%, and peak gains up to 7.5% in complex scenarios. By bridging retrieval with actionable application, RAG+ advances a more cognitively grounded framework for knowledge integration, representing a step toward more interpretable and capable LLMs.
Xiaohan Yu, Pu Jian, Chong Chen
Retrieval-Augmented Generation (RAG) has demonstrated considerable
effectiveness in open-domain question answering. However, when applied to
heterogeneous documents, comprising both textual and tabular components,
existing RAG approaches exhibit critical limitations. The prevailing practice
of flattening tables and chunking strategies disrupts the intrinsic tabular
structure, leads to information loss, and undermines the reasoning capabilities
of LLMs in multi-hop, global queries. To address these challenges, we propose
TableRAG, a hybrid framework that unifies textual understanding and complex
manipulations over tabular data. TableRAG iteratively operates in four steps:
context-sensitive query decomposition, text retrieval, SQL programming and
execution, and compositional intermediate answer generation. We also develop
HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous
reasoning capabilities. Experimental results demonstrate that TableRAG
consistently outperforms existing baselines on both public datasets and our
HeteQA, establishing a new state-of-the-art for heterogeneous document question
answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
Authors' comments: Under review. Code is available at
https://github.com/yxh-y/TableRAG/tree/main
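The "SQL programming and execution" step can be illustrated with Python's built-in `sqlite3` module. This is a sketch of the general idea only: the table schema, the rows, and the hard-coded query are hypothetical, and in a system like the one described the SQL would be generated from a decomposed sub-query, not written by hand.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical table that would otherwise be flattened into text chunks.
rows = [
    ("Alice", "EU", 120.0),
    ("Bob", "EU", 80.0),
    ("Chen", "APAC", 200.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def run_sql(connection: sqlite3.Connection, query: str) -> List[Tuple]:
    """Execute a SQL statement against the table and return all result rows."""
    return connection.execute(query).fetchall()


# A "global" question (total revenue per region) needs every row of the table,
# which chunk-level retrieval over flattened text cannot guarantee.
sql = "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
totals = run_sql(conn, sql)
print(totals)
```

Executing the aggregation in the database keeps the tabular structure intact, and the result rows can then be passed back as an intermediate answer for the next reasoning step.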
Linlin Wang, Tianqing Zhu, Laiqiao Qin, Longxiang Gao, Wanlei Zhou
Retrieval-Augmented Generation (RAG) systems can significantly enhance the performance of large language models by integrating external knowledge. However, RAG also introduces new security risks. Existing research focuses mainly on how poisoning attacks in RAG systems affect model output quality, overlooking their potential to amplify model biases. For example, when querying about domestic violence victims, a compromised RAG system might preferentially retrieve documents depicting women as victims, causing the model to generate outputs that perpetuate gender stereotypes even when the original query is gender neutral. To show the impact of the bias, this paper proposes a Bias Retrieval and Reward Attack (BRRA) framework, which systematically investigates attack pathways that amplify language model biases through RAG system manipulation. We design an adversarial document generation method based on multi-objective reward functions, employ subspace projection techniques to manipulate retrieval results, and construct a cyclic feedback mechanism for continuous bias amplification. Experiments on multiple mainstream large language models demonstrate that BRRA attacks can significantly enhance model biases across multiple dimensions. In addition, we explore a dual-stage defense mechanism to effectively mitigate the impacts of the attack. This study reveals that poisoning attacks in RAG systems directly amplify model output biases and clarifies the relationship between RAG system security and model fairness. This novel potential attack indicates that the fairness of RAG systems requires close attention.
Jiaqi Samantha Zhan, Crystina Zhang, Shengyao Zhuang, Xueguang Ma, Jimmy Lin
Effective video retrieval remains challenging due to the complexity of integrating visual, auditory, and textual modalities. In this paper, we explore unified retrieval methods using OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, in the context of the MAGMaR shared task. Evaluated on the comprehensive MultiVENT 2.0 dataset, OmniEmbed generates unified embeddings for text, images, audio, and video, enabling robust multimodal retrieval. By finetuning OmniEmbed with the combined multimodal data (visual frames, audio tracks, and textual descriptions) provided in MultiVENT 2.0, we achieve substantial improvements in complex, multilingual video retrieval tasks. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025, highlighting the practical effectiveness of our unified multimodal retrieval approach. The model checkpoint from this work is open-sourced.
Val Andrei Fajardo, David B. Emerson, Amandeep Singh, Veronica Chatrath, Marcelo Lotif, Ravi Theja, Alex Cheung, Izuki Matsubi
Retrieval-augmented generation (RAG) systems have been shown to be effective
in addressing many of the drawbacks of relying solely on the parametric memory
of large language models. Recent work has demonstrated that RAG systems can be
improved via fine-tuning of their retriever and generator models. In this work,
we introduce FedRAG, a framework for fine-tuning RAG systems across centralized
and federated architectures. FedRAG supports state-of-the-art fine-tuning
methods, offering a simple and intuitive interface and a seamless conversion
from centralized to federated training tasks. FedRAG is also deeply integrated
with the modern RAG ecosystem, filling a critical gap in available tools.
Authors' comments: 9 pages, 4 figures, 2 tables. Accepted for the CODEML Workshop at
ICML 2025. Framework code available at
https://github.com/VectorInstitute/fed-rag