Bruno Korbar, Andrew Zisserman
The goal of this paper is to be able to retrieve images using a compound
query that combines object instance information from an image, with a natural
text description of what that object is doing or where it is. For example, to
retrieve an image of "Fluffy the unicorn (specified by an image) on someone's
head". To achieve this we design a mapping network that can "translate" from a
local image embedding (of the object instance) to a text token, such that the
combination of the token and a natural language query is suitable for CLIP
style text encoding, and image retrieval. Generating a text token in this
manner involves a simple training procedure, that only needs to be performed
once for each object instance. We show that our approach of using a trainable
mapping network, termed pi-map, together with frozen CLIP text and image
encoders, improves the state of the art on two benchmarks designed to assess
personalized retrieval.
Authors' comments: Published as an oral in CBMI2025
Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz et al.
Large language models (LLMs) often fail to scale their performance on
long-context tasks performance in line with the context lengths they support.
This gap is commonly attributed to retrieval failures -- the models' inability
to identify relevant information in the long inputs. Accordingly, recent
efforts often focus on evaluating and improving LLMs' retrieval performance: if
retrieval is perfect, a model should, in principle, perform just as well on a
long input as it does on a short one -- or should it? This paper presents
findings that the answer to this question may be negative. Our systematic
experiments across 5 open- and closed-source LLMs on math, question answering,
and coding tasks reveal that, even when models can perfectly retrieve all
relevant information, their performance still degrades substantially
(13.9%--85%) as input length increases but remains well within the models'
claimed lengths. This failure occurs even when the irrelevant tokens are
replaced with minimally distracting whitespace, and, more surprisingly, when
they are all masked and the models are forced to attend only to the relevant
tokens. A similar performance drop is observed when all relevant evidence is
placed immediately before the question. Our findings reveal a
previously-unrealized limitation: the sheer length of the input alone can hurt
LLM performance, independent of retrieval quality and without any distraction.
They motivate our simple, model-agnostic mitigation strategy that transforms a
long-context task into a short-context one by prompting the model to recite the
retrieved evidence before attempting to solve the problem. On RULER, we observe
a consistent improvement of GPT-4o up to 4% on an already strong baseline.
Authors' comments: 18 pages (9 pages of main content), 5 figures, accepted at the
Findings of EMNLP 2025
Omri Uzan, Asaf Yehudai, Roi pony, Eyal Shnarch, Ariel Gera
Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model's representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval
Yongan Yu, Xianda Du, Qingchen Hu, Jiahao Liang, Jingwei Ni, Dan Qiang, Kaiyu Huang, Grant McKenzie et al.
Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable for climate scientists to understand societal responses. However, their vast scale, noisy digitized quality, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system's ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems from archival contexts. The constructed dataset and evaluation framework are publicly available at https://anonymous.4open.science/r/WeatherArchive-Bench/.
Faisal Hamman, Chenyang Zhu, Anoop Kumar, Xujun Peng, Sanghamitra Dutta, Daben Liu, Alfy Samuel
RAG systems are increasingly deployed in high-stakes domains where users
expect outputs to be consistent across semantically equivalent queries.
However, existing systems often exhibit significant inconsistencies due to
variability in both the retriever and generator (LLM), undermining trust and
reliability. In this work, we focus on information consistency, i.e., the
requirement that outputs convey the same core content across semantically
equivalent inputs. We introduce a principled evaluation framework that
decomposes RAG consistency into retriever-level, generator-level, and
end-to-end components, helping identify inconsistency sources. To improve
consistency, we propose Paraphrased Set Group Relative Policy Optimization
(PS-GRPO), an RL approach that leverages multiple rollouts across paraphrased
set to assign group similarity rewards. We leverage PS-GRPO to achieve
Information Consistent RAG (Con-RAG), training the generator to produce
consistent outputs across paraphrased queries and remain robust to
retrieval-induced variability. Because exact reward computation over paraphrase
sets is computationally expensive, we also introduce a scalable approximation
method that retains effectiveness while enabling efficient, large-scale
training. Empirical evaluations across short-form, multi-hop, and long-form QA
benchmarks demonstrate that Con-RAG significantly improves both consistency and
accuracy over strong baselines, even in the absence of explicit ground-truth
supervision. Our work provides practical solutions for evaluating and building
reliable RAG systems for safety-critical deployments.
Authors' comments: Accepted at NeurIPS 2025 Workshop on Reliable ML from Unreliable Data
Lingnan Xu, Chong Feng, Kaiyuan Zhang, Liu Zhengyong, Wenqiang Xu, Fanqing Meng
While large language models (LLMs) demonstrate impressive capabilities, their
reliance on parametric knowledge often leads to factual inaccuracies.
Retrieval-Augmented Generation (RAG) mitigates this by leveraging external
documents, yet existing approaches treat retrieved passages as isolated chunks,
ignoring valuable structure that is crucial for document organization.
Motivated by this gap, we propose Retrieve-DocumentRoute-Read (RDR2), a novel
framework that explicitly incorporates structural information throughout the
RAG process. RDR2 employs an LLM-based router to dynamically navigate document
structure trees, jointly evaluating content relevance and hierarchical
relationships to assemble optimal evidence. Our key innovation lies in
formulating document routing as a trainable task, with automatic action
curation and structure-aware passage selection inspired by human reading
strategies. Through comprehensive evaluation on five challenging datasets, RDR2
achieves state-of-the-art performance, demonstrating that explicit structural
awareness significantly enhances RAG systems' ability to acquire and utilize
knowledge, particularly in complex scenarios requiring multi-document
synthesis.
Authors' comments: EMNLP2025 Findings
Simon Lupart, Daniël van Dijk, Eric Langezaal, Ian van Dort, Mohammad Aliannejadi
Personalized Conversational Information Retrieval (CIR) has seen rapid
progress in recent years, driven by the development of Large Language Models
(LLMs). Personalized CIR aims to enhance document retrieval by leveraging
user-specific information, such as preferences, knowledge, or constraints, to
tailor responses to individual needs. A key resource for this task is the TREC
iKAT 2023 dataset, designed to evaluate personalization in CIR pipelines.
Building on this resource, Mo et al. explored several strategies for
incorporating Personal Textual Knowledge Bases (PTKB) into LLM-based query
reformulation. Their findings suggested that personalization from PTKBs could
be detrimental and that human annotations were often noisy. However, these
conclusions were based on single-run experiments using the GPT-3.5 Turbo model,
raising concerns about output variability and repeatability. In this
reproducibility study, we rigorously reproduce and extend their work, focusing
on LLM output variability and model generalization. We apply the original
methods to the new TREC iKAT 2024 dataset and evaluate a diverse range of
models, including Llama (1B-70B), Qwen-7B, GPT-4o-mini. Our results show that
human-selected PTKBs consistently enhance retrieval performance, while
LLM-based selection methods do not reliably outperform manual choices. We
further compare variance across datasets and observe higher variability on iKAT
than on CAsT, highlighting the challenges of evaluating personalized CIR.
Notably, recall-oriented metrics exhibit lower variance than precision-oriented
ones, a critical insight for first-stage retrievers. Finally, we underscore the
need for multi-run evaluations and variance reporting when assessing LLM-based
CIR systems. By broadening evaluation across models, datasets, and metrics, our
study contributes to more robust and generalizable practices for personalized
CIR.
Authors' comments: 11 pages, 5 figures, SIGIR-AP'25 Proceedings of the 2025 Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval in the Asia Pacific Region (SIGIR-AP 2025), December 7--10, 2025,
Xi'an, China
Aneesha Sampath, Oya Aran, Emily Mower Provost
We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.
Luca Collorone, Matteo Gioia, Massimiliano Pappa, Paolo Leoni, Giovanni Ficarra, Or Litany, Indro Spinelli, Fabio Galasso
Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR outperforms trimodal models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR's latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models are available at github.com/colloroneluca/MonSTeR.
Kaito Hori, Chihiro Tsutake, Keita Takahashi, Toshiaki Fujii
To time-efficiently and stably acquire the intensity information for phase retrieval under a coherent illumination, we leverage an event-based vision sensor (EVS) that can detect changes in logarithmic intensity at the pixel level with a wide dynamic range. In our optical system, we translate the EVS along the optical axis, where the EVS records the intensity changes induced by defocus as events. To recover phase distributions, we formulate a partial differential equation, referred to as the transport of event equation, which presents a linear relationship between the defocus events and the phase distribution. We demonstrate through experiments that the EVS is more advantageous than the conventional image sensor for rapidly and stably detecting the intensity information, defocus events, which enables accurate phase retrieval, particularly under low-lighting conditions.
Yu He, Yifei Chen, Yiming Li, Shuo Shao, Leyi Qi, Boheng Li, Dacheng Tao, Zhan Qin
In recent years, RAG has emerged as a key paradigm for enhancing large language models (LLMs). By integrating externally retrieved information, RAG alleviates issues like outdated knowledge and, crucially, insufficient domain expertise. While effective, RAG introduces new risks of external data extraction attacks (EDEAs), where sensitive or copyrighted data in its knowledge base may be extracted verbatim. These risks are particularly acute when RAG is used to customize specialized LLM applications with private knowledge bases. Despite initial studies exploring these risks, they often lack a formalized framework, robust attack performance, and comprehensive evaluation, leaving critical questions about real-world EDEA feasibility unanswered. In this paper, we present the first comprehensive study to formalize EDEAs against retrieval-augmented LLMs. We first formally define EDEAs and propose a unified framework decomposing their design into three components: extraction instruction, jailbreak operator, and retrieval trigger, under which prior attacks can be considered instances within our framework. Guided by this framework, we develop SECRET: a Scalable and EffeCtive exteRnal data Extraction aTtack. Specifically, SECRET incorporates (1) an adaptive optimization process using LLMs as optimizers to generate specialized jailbreak prompts for EDEAs, and (2) cluster-focused triggering, an adaptive strategy that alternates between global exploration and local exploitation to efficiently generate effective retrieval triggers. Extensive evaluations across 4 models reveal that SECRET significantly outperforms previous attacks, and is highly effective against all 16 tested RAG instances. Notably, SECRET successfully extracts 35% of the data from RAG powered by Claude 3.7 Sonnet for the first time, whereas other attacks yield 0% extraction. Our findings call for attention to this emerging threat.
Hung-Ting Chen, Fangyuan Xu, Shane Arora, Eunsol Choi
How retrieved documents are used in language models (LMs) for long-form generation task is understudied. We present two controlled studies on retrieval-augmented LM for long-form question answering (LFQA): one fixing the LM and varying evidence documents and the other fixing evidence documents and varying the LMs. We study various attributes of generated answers (e.g., fluency, length, variance), with an emphasis on the attribution of generated answers to in-context evidence documents. We collect a dataset (SALAD) containing human annotations of sentence-level answer attribution in LFQA and evaluate existing methods for automatically judging attribution. We find that while LMs can leverage relevant in-context documents, the generated answer is only partially attributable towards the documents, especially for LMs trained without retrieval augmentation. Together, our analysis reveals how retrieval augmentation impacts long knowledge-rich text generation and provide directions for future work.
Authors' comments: COLM 2024 Camera Ready Version
Daniel Gwon, Nour Jedidi, Jimmy Lin
Promptagator demonstrated that Large Language Models (LLMs) with few-shot
prompts can be used as task-specific query generators for fine-tuning
domain-specialized dense retrieval models. However, the original Promptagator
approach relied on proprietary and large-scale LLMs which users may not have
access to or may be prohibited from using with sensitive data. In this work, we
study the impact of open-source LLMs at accessible scales ($\leq$14B
parameters) as an alternative. Our results demonstrate that open-source LLMs as
small as 3B parameters can serve as effective Promptagator-style query
generators. We hope our work will inform practitioners with reliable
alternatives for synthetic data generation and give insights to maximize
fine-tuning results for domain-specific applications.
Authors' comments: CIKM 2025 short research paper
Linh Tran, Yulong Li, Radu Florian, Wei Sun
The strong zero-shot and long-context capabilities of recent Large Language Models (LLMs) have paved the way for highly effective re-ranking systems. Attention-based re-rankers leverage attention weights from transformer heads to produce relevance scores, but not all heads are created equally: many contribute noise and redundancy, thus limiting performance. To address this, we introduce CoRe heads, a small set of retrieval heads identified via a contrastive scoring metric that explicitly rewards high attention heads that correlate with relevant documents, while downplaying nodes with higher attention that correlate with irrelevant documents. This relative ranking criterion isolates the most discriminative heads for re-ranking and yields a state-of-the-art list-wise re-ranker. Extensive experiments with three LLMs show that aggregated signals from CoRe heads, constituting less than 1% of all heads, substantially improve re-ranking accuracy over strong baselines. We further find that CoRe heads are concentrated in middle layers, and pruning the computation of final 50% of model layers preserves accuracy while significantly reducing inference time and memory usage.
Yifan Xu, Vipul Gupta, Rohit Aggarwal, Varsha Mahadevan, Bhaskar Krishnamachari
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by pulling in external material, document, code, manuals, from vast and ever-growing corpora, to effectively answer user queries. The effectiveness of RAG depends significantly on aligning the number of retrieved documents with query characteristics: narrowly focused queries typically require fewer, highly relevant documents, whereas broader or ambiguous queries benefit from retrieving more extensive supporting information. However, the common static top-k retrieval approach fails to adapt to this variability, resulting in either insufficient context from too few documents or redundant information from too many. Motivated by these challenges, we introduce Cluster-based Adaptive Retrieval (CAR), an algorithm that dynamically determines the optimal number of documents by analyzing the clustering patterns of ordered query-document similarity distances. CAR detects the transition point within similarity distances, where tightly clustered, highly relevant documents shift toward less pertinent candidates, establishing an adaptive cut-off that scales with query complexity. On Coinbase's CDP corpus and the public MultiHop-RAG benchmark, CAR consistently picks the optimal retrieval depth and achieves the highest TES score, outperforming every fixed top-k baseline. In downstream RAG evaluations, CAR cuts LLM token usage by 60%, trims end-to-end latency by 22%, and reduces hallucinations by 10% while fully preserving answer relevance. Since integrating CAR into Coinbase's virtual assistant, we've seen user engagement jump by 200%.
Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, Andrew Yates
Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. We introduce MILCO, an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training to provide representation transparency and effectiveness while mitigating semantic collapse. Motivated by the observation that uncommon entities are often lost when projected into English, we propose a new LexEcho head, which enhances robustness by augmenting the English lexical representation with a source-language view obtained through a special [ECHO] token. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when using mass-based pruning to reduce document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions.
Leon Garza, Anantaa Kotal, Michael A. Grasso, Emre Umucu
The increasing complexity of clinical decision-making, alongside the rapid expansion of electronic health records (EHR), presents both opportunities and challenges for delivering data-informed care. This paper proposes a clinical decision support system powered by Large Language Models (LLMs) to assist prescribing clinicians. The system generates therapeutic suggestions by analyzing historical EHR data, including patient demographics, presenting complaints, clinical symptoms, diagnostic information, and treatment histories. The framework integrates natural language processing with structured clinical inputs to produce contextually relevant recommendations. Rather than replacing clinician judgment, it is designed to augment decision-making by retrieving and synthesizing precedent cases with comparable characteristics, drawing on local datasets or federated sources where applicable. At its core, the system employs a retrieval-augmented generation (RAG) pipeline that harmonizes unstructured narratives and codified data to support LLM-based inference. We outline the system's technical components, including representation representation alignment and generation strategies. Preliminary evaluations, conducted with de-identified and synthetic clinical datasets, examine the clinical plausibility and consistency of the model's outputs. Early findings suggest that LLM-based tools may provide valuable decision support in prescribing workflows when appropriately constrained and rigorously validated. This work represents an initial step toward integration of generative AI into real-world clinical decision-making with an emphasis on transparency, safety, and alignment with established practices.
Nima Sheikholeslami, Erfan Hosseini, Patrice Bechard, Srivatsava Daruru, Sai Rajeswar
Dual-encoder retrievers depend on the principle that relevant documents should score higher than irrelevant ones for a given query. Yet the dominant Noise Contrastive Estimation (NCE) objective, which underpins Contrastive Loss, optimizes a softened ranking surrogate that we rigorously prove is fundamentally oblivious to score separation quality and unrelated to AUC. This mismatch leads to poor calibration and suboptimal performance in downstream tasks like retrieval-augmented generation (RAG). To address this fundamental limitation, we introduce the MW loss, a new training objective that maximizes the Mann-Whitney U statistic, which is mathematically equivalent to the Area under the ROC Curve (AUC). MW loss encourages each positive-negative pair to be correctly ranked by minimizing binary cross entropy over score differences. We provide theoretical guarantees that MW loss directly upper-bounds the AoC, better aligning optimization with retrieval goals. We further promote ROC curves and AUC as natural threshold free diagnostics for evaluating retriever calibration and ranking quality. Empirically, retrievers trained with MW loss consistently outperform contrastive counterparts in AUC and standard retrieval metrics. Our experiments show that MW loss is an empirically superior alternative to Contrastive Loss, yielding better-calibrated and more discriminative retrievers for high-stakes applications like RAG.
Andrei C. Coman, Ionut-Teodor Sorodoc, Leonardo F. R. Ribeiro, Bill Byrne, James Henderson, Adrià de Gispert
Existing Reward Models (RMs), typically trained on general preference data,
struggle in Retrieval Augmented Generation (RAG) settings, which require
judging responses for faithfulness to retrieved context, relevance to the user
query, appropriate refusals when context is insufficient, completeness and
conciseness of information. To address the lack of publicly available
RAG-centric preference datasets and specialised RMs, we introduce RAGferee, a
methodology that repurposes question-answering (QA) datasets into preference
pairs that prioritise groundedness over stylistic features, enabling the
training of contextual RMs better suited to judging RAG responses. Using
RAGferee, we curate a small preference dataset of 4K samples and fine-tune RMs
ranging from 7B to 24B parameters. Our RAG-centric RMs achieve state-of-the-art
performance on ContextualJudgeBench, surpassing existing 70B+ RMs trained on
much larger (up to 2.4M samples) general corpora, with an absolute improvement
of +15.5%.
Authors' comments: Accepted to EMNLP 2025 Main
Boyoung Kim, Dosung Lee, Sumin An, Jinseong Jeong, Paul Hongsuck Seo
Recent advances in question answering have led to substantial progress in
tasks such as multi-hop reasoning. However, global sensemaking-answering
questions by synthesizing information from an entire corpus remains a
significant challenge. A prior graph-based approach to global sensemaking lacks
retrieval mechanisms, topic specificity, and incurs high inference costs. To
address these limitations, we propose ReTAG, a Retrieval-Enhanced,
Topic-Augmented Graph framework that constructs topic-specific subgraphs and
retrieves the relevant summaries for response generation. Experiments show that
ReTAG improves response quality while significantly reducing inference time
compared to the baseline. Our code is available at
https://github.com/bykimby/retag.
Authors' comments: 9 pages, 5 figures, EMNLP 2025 Findings