Elizabeth Jenkins, Aneesh V. Manohar, Mark B. Wise
In the large $N_c$ limit, the $\Lambda_b$ and $\Lambda_c$ can be treated as
bound states of chiral solitons and mesons containing a heavy quark. We show
that the soliton and heavy meson are bound in an attractive harmonic oscillator
potential. The Isgur-Wise function for $\Lambda_b\rightarrow\Lambda_c\,
e^-\,\bar\nu_e$ decay is computed in the large $N_c$ limit. Corrections to the
form factor which depend on $m_N/ m_Q$ can be summed exactly ($m_N$ and $m_Q$
are the nucleon and heavy quark masses). We find that this symmetry breaking
correction at zero recoil is only $1\%$.
Authors' comments: 18 pages, 0 figures, uses harvmac
B. Blok, M. Shifman
We discuss the Isgur-Wise function $\xi (y)$ in the small velocity (SV) limit
within the QCD sum rule method. The behavior of $\xi (y)$ in the SV limit is
sensitive to the particular form of the duality relations used to decontaminate
the sum rule predictions from the continuum contribution. Peculiarities of the
duality relations in the problem at hand are revealed. It is shown that the
proper requirements of duality and angular isotropy for S wave states lead to
an unambiguous form of the sum rules for the Isgur-Wise function. We illustrate
the constraints due to these requirements using a toy model of the harmonic
oscillator. The slope parameter and the shape of $\xi (y)$ are determined.
Authors' comments: 32 pages + 8 figs
Deniz Qian, Hung-Ting Chen, Eunsol Choi
Comprehensively retrieving diverse documents is crucial to address queries that admit a wide range of valid answers. We introduce retrieve-verify-retrieve (RVR), a multi-round retrieval framework designed to maximize answer coverage. Initially, a retriever takes the original query and returns a candidate document set, followed by a verifier that identifies a high-quality subset. For subsequent rounds, the query is augmented with previously verified documents to uncover answers that are not yet covered in previous rounds. RVR is effective even with off-the-shelf retrievers, and fine-tuning retrievers for our inference procedure brings further gains. Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI). We also see consistent gains on two out-of-domain datasets (QUEST and WebQuestionsSP) across different base retrievers. Our work presents a promising iterative approach for comprehensive answer recall leveraging a verifier and adapting retrievers to a new inference scenario.
Authors' comments: 18 pages, 12 figures, 12 tables
Hiren Madhu, Ngoc Bui, Ali Maatouk, Leandros Tassiulas, Smita Krishnaswamy, Menglin Yang, Sukanta Ganguly, Kiran Srinivasan et al.
Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure from broad topics to specific entities that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with over 20% radial increase from general to specific concepts, a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems.
Fangzheng Tian, Debasis Ganguly, Craig Macdonald
The quality of answers generated by large language models (LLMs) in retrieval-augmented generation (RAG) is largely influenced by the contextual information contained in the retrieved documents. A key challenge for improving RAG is to predict both the utility of retrieved documents -- quantified as the performance gain from using context over generation without context -- and the quality of the final answers in terms of correctness and relevance. In this paper, we define two prediction tasks within RAG. The first is retrieval performance prediction (RPP), which estimates the utility of retrieved documents. The second is generation performance prediction (GPP), which estimates the final answer quality. We hypothesise that in RAG, the topical relevance of retrieved documents correlates with their utility, suggesting that query performance prediction (QPP) approaches can be adapted for RPP and GPP. Beyond these retriever-centric signals, we argue that reader-centric features, such as the LLM's perplexity of the retrieved context conditioned on the input query, can further enhance prediction accuracy for both RPP and GPP. Finally, we propose that features reflecting query-agnostic document quality and readability can also provide useful signals to the predictions. We train linear regression models with the above categories of predictors for both RPP and GPP. Experiments on the Natural Questions (NQ) dataset show that combining predictors from multiple feature categories yields the most accurate estimates of RAG performance.
Authors' comments: 18 pages (including reference), 3 figures, 2 table, 61 references; this paper has been accepted by ECIR'26 as a full paper
Xin Sun, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing large language models' question-answering capabilities through the integration of external knowledge. However, when adapting RAG systems to specialized domains, challenges arise from distribution shifts, resulting in suboptimal generalization performance. In this work, we propose TTARAG, a test-time adaptation method that dynamically updates the language model's parameters during inference to improve RAG system performance in specialized domains. Our method introduces a simple yet effective approach where the model learns to predict retrieved content, enabling automatic parameter adjustment to the target domain. Through extensive experiments across six specialized domains, we demonstrate that TTARAG achieves substantial performance improvements over baseline RAG systems. Code available at https://github.com/sunxin000/TTARAG.
Eugene Yang, Andrew Yates, Dawn Lawrie, James Mayfield, Trevor Adriaanse
Retrieval models are key components of Retrieval-Augmented Generation (RAG) systems, which generate search queries, process the documents returned, and generate a response. RAG systems are often dynamic and may involve multiple rounds of retrieval. While many state-of-the-art retrieval methods are available through academic IR platforms, these platforms are typically designed for the Cranfield paradigm in which all queries are known up front and can be batch processed offline. This simplification accelerates research but leaves state-of-the-art retrieval models unable to support downstream applications that require online services, such as arbitrary dynamic RAG pipelines that involve looping, feedback, or even self-organizing agents. In this work, we introduce RoutIR, a Python package that provides a simple and efficient HTTP API that wraps arbitrary retrieval methods, including first stage retrieval, reranking, query expansion, and result fusion. By providing a minimal JSON configuration file specifying the retrieval models to serve, RoutIR can be used to construct and query retrieval pipelines on-the-fly using any permutation of available models (e.g., fusing the results of several first-stage retrieval methods followed by reranking). The API automatically performs asynchronous query batching and caches results by default. While many state-of-the-art retrieval methods are already supported by the package, RoutIR is also easily expandable by implementing the Engine abstract class. The package is open-sourced and publicly available on GitHub: http://github.com/hltcoe/routir.
Authors' comments: 17 pages, 1 figure
Rishita Agarwal, Himanshu Singhal, Peter Baile Chen, Manan Roy Choudhury, Dan Roth, Vivek Gupta
Answering natural language queries over relational data often requires
retrieving and reasoning over multiple tables, yet most retrievers optimize
only for query-table relevance and ignore table table compatibility. We
introduce REAR (Retrieve, Expand and Refine), a three-stage, LLM-free framework
that separates semantic relevance from structural joinability for efficient,
high-fidelity multi-table retrieval. REAR (i) retrieves query-aligned tables,
(ii) expands these with structurally joinable tables via fast, precomputed
column-embedding comparisons, and (iii) refines them by pruning noisy or weakly
related candidates. Empirically, REAR is retriever-agnostic and consistently
improves dense/sparse retrievers on complex table QA datasets (BIRD, MMQA, and
Spider) by improving both multi-table retrieval quality and downstream SQL
execution. Despite being LLM-free, it delivers performance competitive with
state-of-the-art LLM-augmented retrieval systems (e.g.,ARM) while achieving
much lower latency and cost. Ablations confirm complementary gains from
expansion and refinement, underscoring REAR as a practical, scalable building
block for table-based downstream tasks (e.g., Text-to-SQL).
Authors' comments: 13 pages, 2 figures, 8 tables
Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li
Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicting between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level time-sensitive finance, economics, and policy documents and 48295 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our findings reveal that RAG systems are unexpectedly sensitive to perturbations. Moreover, they consistently demonstrate lower robustness on multi-hop queries compared to single-hop queries across all domains.
Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for
enhancing large language models (LLMs) by retrieving relevant documents from an
external corpus. However, existing RAG systems primarily focus on unimodal text
documents, and often fall short in real-world scenarios where both queries and
documents may contain mixed modalities (such as text and images). In this
paper, we address the challenge of Universal Retrieval-Augmented Generation
(URAG), which involves retrieving and reasoning over mixed-modal information to
improve vision-language generation. To this end, we propose Nyx, a unified
mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate
the scarcity of realistic mixed-modal data, we introduce a four-stage automated
pipeline for generation and filtering, leveraging web documents to construct
NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that
better reflect real-world information needs. Building on this high-quality
dataset, we adopt a two-stage training framework for Nyx: we first perform
pre-training on NyxQA along with a variety of open-source retrieval datasets,
followed by supervised fine-tuning using feedback from downstream
vision-language models (VLMs) to align retrieval outputs with generative
preferences. Experimental results demonstrate that Nyx not only performs
competitively on standard text-only RAG benchmarks, but also excels in the more
general and realistic URAG setting, significantly improving generation quality
in vision-language tasks.
Authors' comments: This work is in progress
Helia Hashemi, Victor Rühle, Saravan Rajmohan
Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to the overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training of efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains, without compromising effectiveness. In fact, we observed that the model latency decreases by ~16-20% across datasets, while its effectiveness increases by ~5% on average, in terms of exact match.
Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Generative retrieval (GR) has emerged as a new paradigm in neural information retrieval, offering an alternative to dense retrieval (DR) by directly generating identifiers of relevant documents. In this paper, we theoretically and empirically investigate how GR fundamentally diverges from DR in both learning objectives and representational capacity. GR performs globally normalized maximum-likelihood optimization and encodes corpus and relevance information directly in the model parameters, whereas DR adopts locally normalized objectives and represents the corpus with external embeddings before computing similarity via a bilinear interaction. Our analysis suggests that, under scaling, GR can overcome the inherent limitations of DR, yielding two major benefits. First, with larger corpora, GR avoids the sharp performance degradation caused by the optimization drift induced by DR's local normalization. Second, with larger models, GR's representational capacity scales with parameter size, unconstrained by the global low-rank structure that limits DR. We validate these theoretical insights through controlled experiments on the Natural Questions and MS MARCO datasets, across varying negative sampling strategies, embedding dimensions, and model scales. But despite its theoretical advantages, GR does not universally outperform DR in practice. We outline directions to bridge the gap between GR's theoretical potential and practical performance, providing guidance for future research in scalable and robust generative retrieval.
Kaishuai Xu, Wenjun Hou, Yi Cheng, Wenjie Li
Large Language Models (LLMs) have shown promising performance on diverse medical benchmarks, highlighting their potential in supporting real-world clinical tasks. Retrieval-Augmented Generation (RAG) has emerged as a key approach for mitigating knowledge gaps and hallucinations by incorporating external medical information. However, RAG still struggles with complex medical questions that require intensive reasoning, as surface-level input often fails to reflect the true knowledge needs of the task. Existing methods typically focus on refining queries without explicitly modeling the reasoning process, limiting their ability to retrieve and integrate clinically relevant knowledge. In this work, we propose RAR$^2$, a joint learning framework that improves both Reasoning-Augmented Retrieval and Retrieval-Augmented Reasoning. RAR$^2$ constructs a thought process to uncover implicit knowledge requirements and uses it to guide retrieval and answer generation. We build a training dataset of mixed preference pairs and apply Direct Preference Optimization (DPO) to train the model. Moreover, we design two test-time scaling strategies to explore the boundaries of our framework. Experiments demonstrate the effectiveness of RAR$^2$ across several biomedical question answering datasets, outperforming RAG baselines with or without fine-tuning.
Authors' comments: Accepted by EMNLP 2025 Findings
Julian Killingback, Hamed Zamani
Large language models (LLMs) are incredible and versatile tools for text-based tasks that have enabled countless, previously unimaginable, applications. Retrieval models, in contrast, have not yet seen such capable general-purpose models emerge. To achieve this goal, retrieval models must be able to perform complex retrieval tasks, where queries contain multiple parts, constraints, or requirements in natural language. These tasks represent a natural progression from the simple, single-aspect queries that are used in the vast majority of existing, commonly used evaluation sets. Complex queries naturally arise as people expect search systems to handle more specific and often ambitious information requests, as is demonstrated by how people use LLM-based information systems. Despite the growing desire for retrieval models to expand their capabilities in complex retrieval tasks, there exist limited resources to assess the ability of retrieval models on a comprehensive set of diverse complex tasks. The few resources that do exist feature a limited scope and often lack realistic settings making it hard to know the true capabilities of retrieval models on complex real-world retrieval tasks. To address this shortcoming and spur innovation in next-generation retrieval models, we construct a diverse and realistic set of complex retrieval tasks and benchmark a representative set of state-of-the-art retrieval models. Additionally, we explore the impact of LLM-based query expansion and rewriting on retrieval quality. Our results show that even the best models struggle to produce high-quality retrieval results with the highest average nDCG@10 of only 0.346 and R@100 of only 0.587 across all tasks. Although LLM augmentation can help weaker models, the strongest model has decreased performance across all metrics with all rewriting techniques.
Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le
Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.
Authors' comments: ACM Multimedia 2025
Leqian Li, Dianxi Shi, Jialu Zhou, Xinyu Wei, Mingyue Yang, Songchang Jin, Shaowu Yang
Large Language Models (LLMs) have shown remarkable capabilities across diverse tasks, yet they face inherent limitations such as constrained parametric knowledge and high retraining costs. Retrieval-Augmented Generation (RAG) augments the generation process by retrieving externally stored knowledge absent from the models internal parameters. However, RAG methods face challenges such as information loss and redundant retrievals during multi-round queries, accompanying the difficulties in precisely characterizing knowledge gaps for complex tasks. To address these problems, we propose Retrieval Feedback and Memory Retrieval Augmented Generation(RFM-RAG), which transforms the stateless retrieval of previous methods into stateful continuous knowledge management by constructing a dynamic evidence pool. Specifically, our method generates refined queries describing the models knowledge gaps using relational triples from questions and evidence from the dynamic evidence pool; Retrieves critical external knowledge to iteratively update this evidence pool; Employs a R-Feedback Model to evaluate evidence completeness until convergence. Compared to traditional RAG methods, our approach enables persistent storage of retrieved passages and effectively distills key information from passages to construct clearly new queries. Experiments on three public QA benchmarks demonstrate that RFM-RAG outperforms previous methods and improves overall system accuracy.
Julien Guinot, Elio Quinton, György Fazekas
Multimodal contrastive models have achieved strong performance in text-audio
retrieval and zero-shot settings, but improving joint embedding spaces remains
an active research area. Less attention has been given to making these systems
controllable and interactive for users. In text-music retrieval, the ambiguity
of freeform language creates a many-to-many mapping, often resulting in
inflexible or unsatisfying results.
We introduce Generative Diffusion Retriever (GDR), a novel framework that
leverages diffusion models to generate queries in a retrieval-optimized latent
space. This enables controllability through generative tools such as negative
prompting and denoising diffusion implicit models (DDIM) inversion, opening a
new direction in retrieval control. GDR improves retrieval performance over
contrastive teacher models and supports retrieval in audio-only latent spaces
using non-jointly trained encoders. Finally, we demonstrate that GDR enables
effective post-hoc manipulation of retrieval behavior, enhancing interactive
control for text-music retrieval tasks.
Authors' comments: Accepted to ISMIR 2025
Chaeeun Kim, Seungone Kim
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in
multi-step reasoning and calling search engines at appropriate steps. However,
existing retrieval-augmented reasoning approaches rely on separate retrieval
models, limiting the LRM's role in retrieval to deciding when to retrieve and
how to query. This separation not only increases hardware and operational costs
but also leads to errors in the retrieval process due to the representation
bottleneck, a phenomenon where the retriever's embedding space is not
expressive enough to meet the generator's requirements. To address this, we
shift our perspective from sequence-to-sequence matching to locating the
answer-containing paths within the corpus, and propose a novel framework called
FREESON (Retriever-FREE Retrieval-Augmented ReaSONing). This framework enables
LRMs to retrieve relevant knowledge on their own by acting as both a generator
and retriever. To achieve this, we introduce a variant of the MCTS algorithm
specialized for the retrieval task, which we call CT-MCTS (Corpus-Traversing
Monte Carlo Tree Search). In this algorithm, LRMs traverse through the corpus
toward answer-containing regions. Our results on five open-domain QA
benchmarks, including single-hop and multi-hop questions, show that FREESON
achieves an average improvement of 14.4% in EM and F1 over four multi-step
reasoning models with a separate retriever, and it also performs comparably to
the strongest baseline, surpassing it by 3% on PopQA and 2WikiMultihopQA.
Authors' comments: Work In Progress
Arthur Satouf, Gabriel Ben Zenou, Benjamin Piwowarski, Habiboulaye Amadou Boubacar, Pablo Piantanida
Current sparse neural information retrieval (IR) methods, and to a lesser
extent more traditional models such as BM25, do not take into account the
document collection and the complex interplay between different term weights
when representing a single document. In this paper, we show how the Rational
Speech Acts (RSA), a linguistics framework used to minimize the number of
features to be communicated when identifying an object in a set, can be adapted
to the IR case -- and in particular to the high number of potential features
(here, tokens). RSA dynamically modulates token-document interactions by
considering the influence of other documents in the dataset, better contrasting
document representations. Experiments show that incorporating RSA consistently
improves multiple sparse retrieval models and achieves state-of-the-art
performance on out-of-domain datasets from the BEIR benchmark.
https://github.com/arthur-75/Rational-Retrieval-Acts
Authors' comments: 6 pages - 2 figures - conference: accepted at SIGIR 2025
Parishad BehnamGhader, Nicholas Meade, Siva Reddy
Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.