Xuan Lu, Sifan Liu, Bochao Yin, Yongqi Li, Xinghao Chen, Hui Su, Yaohui Jin, Wenjun Zeng et al.
In this paper, we introduce MultiConIR, the first benchmark designed to evaluate retrieval models in multi-condition scenarios. Unlike existing datasets that primarily focus on single-condition queries from search engines, MultiConIR captures real-world complexity by incorporating five diverse domains: books, movies, people, medical cases, and legal documents. We propose three tasks to systematically assess retrieval and reranking models on multi-condition robustness, monotonic relevance ranking, and query format sensitivity. Our findings reveal that existing retrieval and reranking models struggle with multi-condition retrieval, with rerankers suffering severe performance degradation as query complexity increases. We further investigate the performance gap between retrieval and reranking models, exploring potential reasons for these discrepancies, and analyze the impact of different pooling strategies on condition placement sensitivity. Finally, we highlight the strengths of GritLM and Nv-Embed, which demonstrate enhanced adaptability to multi-condition queries, offering insights for future retrieval models. The code and datasets are available at https://github.com/EIT-NLP/MultiConIR.
Jianhui Wang, Zhifei Yang, Yangfan He, Huixiong Zhang, Yuxuan Chen, Jingwei Huang
Accurate material retrieval is critical for creating realistic 3D assets. Existing methods rely on datasets that capture shape-invariant and lighting-varied representations of materials, which are scarce and face challenges due to limited diversity and inadequate real-world generalization. Most current approaches adopt traditional image search techniques. They fall short in capturing the unique properties of material spaces, leading to suboptimal performance in retrieval tasks. Addressing these challenges, we introduce MaRI, a framework designed to bridge the feature space gap between synthetic and real-world materials. MaRI constructs a shared embedding space that harmonizes visual and material attributes through a contrastive learning strategy by jointly training an image and a material encoder, bringing similar materials and images closer while separating dissimilar pairs within the feature space. To support this, we construct a comprehensive dataset comprising high-quality synthetic materials rendered with controlled shape variations and diverse lighting conditions, along with real-world materials processed and standardized using material transfer techniques. Extensive experiments demonstrate the superior performance, accuracy, and generalization capabilities of MaRI across diverse and complex material retrieval tasks, outperforming existing methods.
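The contrastive objective described in the abstract above can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE loss over paired image and material embeddings; the abstract does not state the exact loss, and the function name `info_nce` and the temperature value are hypothetical.

```python
import math

def info_nce(image_embs, material_embs, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings (assumed loss, not confirmed by the source)."""
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    imgs = [normalise(v) for v in image_embs]
    mats = [normalise(v) for v in material_embs]
    n = len(imgs)
    # Cosine-similarity logits; entry [i][j] compares image i with material j.
    logits = [[sum(a * b for a, b in zip(imgs[i], mats[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # Matching pairs sit on the diagonal, so the target of row i is index i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the image-to-material and material-to-image directions.
    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))
```

Minimizing such a loss pulls each image towards its paired material and pushes it away from the other materials in the batch, which is the behavior the abstract describes.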
Justus-Jonas Erker, Nils Reimers, Iryna Gurevych
Decomposition-based multi-hop retrieval methods rely on many autoregressive
steps to break down complex queries, which breaks end-to-end differentiability
and is computationally expensive. Decomposition-free methods tackle this, but
current decomposition-free approaches struggle with longer multi-hop problems
and generalization to out-of-distribution data. To address these challenges, we
introduce GRITHopper-7B, a novel multi-hop dense retrieval model that achieves
state-of-the-art performance on both in-distribution and out-of-distribution
benchmarks. GRITHopper combines generative and representational instruction
tuning by integrating causal language modeling with dense retrieval training.
Through controlled studies, we find that incorporating additional context after
the retrieval process, referred to as post-retrieval language modeling,
enhances dense retrieval performance. By including elements such as final
answers during training, the model learns to better contextualize and retrieve
relevant information. GRITHopper-7B offers a robust, scalable, and
generalizable solution for multi-hop dense retrieval, and we release it to the
community for future research and applications requiring multi-hop reasoning
and retrieval capabilities.
Authors' comments: Under Review at ACL Rolling Review (ARR)
Meng Zheng, Jiajin Zhang, Benjamin Planche, Zhongpai Gao, Terrence Chen, Ziyan Wu
Image-Text Retrieval (ITR) finds broad applications in healthcare, aiding
clinicians and radiologists by automatically retrieving relevant patient cases
in the database given the query image and/or report, for more efficient
clinical diagnosis and treatment, especially for rare diseases. However,
conventional ITR systems typically only rely on global image or text
representations for measuring patient image/report similarities, which overlook
local distinctiveness across patient cases. This often results in suboptimal
retrieval performance. In this paper, we propose an Anatomical
Location-Conditioned Image-Text Retrieval (ALC-ITR) framework, which, given a
query image and the associated suspicious anatomical region(s), aims to
retrieve similar patient cases exhibiting the same disease or symptoms in the
same anatomical region. To perform location-conditioned multimodal retrieval,
we learn a medical Relevance-Region-Aligned Vision Language (RRA-VL) model with
semantic global-level and region-/word-level alignment to produce
generalizable, well-aligned multi-modal representations. Additionally, we
perform location-conditioned contrastive learning to further utilize cross-pair
region-level contrastiveness for improved multi-modal retrieval. We show that
our proposed RRA-VL achieves state-of-the-art localization performance in
phrase-grounding tasks, and satisfactory multi-modal retrieval performance with or
without location conditioning. Finally, we thoroughly investigate the
generalizability and explainability of our proposed ALC-ITR system in providing
explanations and preliminary diagnosis reports given retrieved patient cases
(conditioned on anatomical regions), with proper off-the-shelf LLM prompts.
Authors' comments: 16 pages, 10 figures
Gabriele Bizzarri, Miranda Parisi, Mylenne Manrique, Ilaria Gianani, Andrea Chiuri, Matteo Rosati, Vittorio Giovannetti, Matteo G. A. Paris et al.
The description of complex systems requires a progressively larger number of parameters. However, in practice it often happens that a small subset of parameters suffices to describe the dynamics of the system itself in terms of stiff combinations, while the remaining sloppy combinations provide no information on the system. While this effect can reduce model complexity, it can also limit the estimation precision when the stiff and sloppy combinations are unknown to the experimenter, and one is forced to estimate the potentially sloppy model parameters. We explored how such sloppy behavior can be controlled and counteracted via quantum weak measurements in the estimation of two sequential phases. We showed that the introduction of a weak measurement of variable strength in between the two phases allows one to switch from a fully sloppy setup to a fully determined one where both phases can be estimated with quantum-limited precision. Our work provides an important insight into sloppiness detection in quantum systems, with promising applications in quantum metrology and imaging, as well as in quantum security and quantum monitoring.
Quan Mai, Susan Gauch, Douglas Adams
We present Boolean-aware attention (BoolAttn), a novel attention mechanism that dynamically adjusts token focus based on Boolean operators (e.g., and, or, not). Our model employs specialized Boolean experts, each tailored to amplify or suppress attention for operator-specific contexts. A predefined gating mechanism activates the corresponding experts based on the detected Boolean type. Experiments on Boolean retrieval datasets demonstrate that integrating BoolAttn with BERT greatly enhances the model's capability to process Boolean queries.
Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Mohammed Ali, Adam Jatowt
Knowledge-intensive tasks, particularly open-domain question answering
(ODQA), document reranking, and retrieval-augmented language modeling, require
a balance between retrieval accuracy and generative flexibility. Traditional
retrieval models, such as BM25 and Dense Passage Retrieval (DPR), efficiently
retrieve from large corpora but often lack semantic depth. Generative models
like GPT-4o provide richer contextual understanding but face challenges in
maintaining factual consistency. In this work, we conduct a systematic
evaluation of retrieval-based, generation-based, and hybrid models, with a
primary focus on their performance in ODQA and related retrieval-augmented
tasks. Our results show that dense retrievers, particularly DPR, achieve strong
performance in ODQA with a top-1 accuracy of 50.17% on NQ, while hybrid models
improve nDCG@10 scores on BEIR from 43.42 (BM25) to 52.59, demonstrating their
strength in document reranking. Additionally, we analyze language modeling
tasks using WikiText-103, showing that retrieval-based approaches like BM25
achieve lower perplexity compared to generative and hybrid methods,
highlighting their utility in retrieval-augmented generation. By providing
detailed comparisons and practical insights into the conditions where each
approach excels, we aim to facilitate future optimizations in retrieval,
reranking, and generative models for ODQA and related knowledge-intensive
applications.
Authors' comments: Work in progress
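For readers unfamiliar with the nDCG@10 metric reported in the abstract above, a minimal sketch of the computation follows. It assumes the linear-gain variant and that the ideal ranking is built only from the relevance labels passed in; the function names are illustrative and are not taken from the paper.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the same labels sorted in the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists documents in decreasing relevance scores 1.0, and moving a relevant document down the list lowers the score through the logarithmic discount.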
Rachid Guerraoui, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos
Large language models (LLMs) have demonstrated remarkable capabilities across
various domains but remain susceptible to hallucinations and inconsistencies,
limiting their reliability. Retrieval-augmented generation (RAG) mitigates
these issues by grounding model responses in external knowledge sources.
Existing RAG workflows often leverage a single vector database, which is
impractical in the common setting where information is distributed across
multiple repositories. We introduce RAGRoute, a novel mechanism for federated
RAG search. RAGRoute dynamically selects relevant data sources at query time
using a lightweight neural network classifier. By not querying every data
source, this approach significantly reduces query overhead, improves retrieval
efficiency, and minimizes the retrieval of irrelevant information. We evaluate
RAGRoute using the MIRAGE and MMLU benchmarks and demonstrate its effectiveness
in retrieving relevant documents while reducing the number of queries. RAGRoute
reduces the total number of queries by up to 77.5% and communication volume by up to
76.2%.
Authors' comments: To appear in the proceedings of EuroMLSys'25
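The query-time source selection described in the abstract above might look roughly like the following sketch. It stands in a single logistic layer per data source for the paper's lightweight neural network classifier; the function name `route_query`, the weight layout, and the threshold are all hypothetical.

```python
import math

def route_query(query_emb, source_weights, threshold=0.5):
    """Return the names of data sources whose predicted relevance passes the threshold."""
    selected = []
    for name, (weights, bias) in source_weights.items():
        # Hypothetical per-source logistic scorer over the query embedding.
        logit = sum(q * w for q, w in zip(query_emb, weights)) + bias
        score = 1.0 / (1.0 + math.exp(-logit))
        if score >= threshold:
            selected.append(name)
    return selected

# Toy weights for three imaginary repositories.
sources = {
    "medical": ([2.0, -1.0], 0.0),
    "legal": ([-1.0, 2.0], 0.0),
    "general": ([0.5, 0.5], -2.0),
}
print(route_query([1.0, 0.0], sources))  # only the selected sources would be queried
```

Only the selected repositories are then queried, which is where the reduction in query count and communication volume would come from.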
Laura Perez-Beltrachini, Mirella Lapata
Retrieval augmented Question Answering (QA) helps QA models overcome knowledge gaps by incorporating retrieved evidence, typically a set of passages, alongside the question at test time. Previous studies show that this approach improves QA performance and reduces hallucinations, without, however, assessing whether the retrieved passages are indeed useful for answering correctly. In this work, we propose to quantify the uncertainty of a QA model by estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information-theoretic metrics can predict answer correctness to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods. Code and data are available at https://github.com/lauhaide/ragu.
Gabriele Berton, Carlo Masone
Retrieving images from the same location as a given query is an important
component of multiple computer vision tasks, like Visual Place Recognition,
Landmark Retrieval, Visual Localization, 3D reconstruction, and SLAM. However,
existing solutions are built to specifically work for one of these tasks, and
are known to fail when the requirements slightly change or when they meet
out-of-distribution data. In this paper we combine a variety of existing
methods, training techniques, and datasets to train a retrieval model, called
MegaLoc, that is performant on multiple tasks. We find that MegaLoc (1)
achieves state of the art on a large number of Visual Place Recognition
datasets, (2) achieves impressive results on common Landmark Retrieval datasets, and (3)
sets a new state of the art for Visual Localization on the LaMAR datasets,
where we only changed the retrieval method of the existing localization
pipeline. The code for MegaLoc is available at
https://github.com/gmberton/MegaLoc
Authors' comments: Tech Report
Milan Gritta, Huiyin Xue, Gerasimos Lampouras
Speculative decoding (SD) accelerates Large Language Model (LLM) generation
by using an efficient draft model to propose the next few tokens, which are
verified by the LLM in a single forward call, reducing latency while preserving
its outputs. We focus on retrieval-based SD where the draft model retrieves the
next tokens from a non-parametric datastore. Sparse retrieval (REST), which
operates on the surface form of strings, is currently the dominant paradigm due
to its simplicity and scalability. However, its effectiveness is limited due to
the usage of short contexts and exact string matching. Instead, we introduce
Dense Retrieval for Speculative Decoding (DReSD), a novel framework that uses
approximate nearest neighbour search with contextualised token embeddings to
retrieve the most semantically relevant token sequences for SD. Extensive
experiments show that DReSD achieves (on average) 87% higher acceptance rates,
65% longer accepted tokens and 19% faster generation speeds compared to sparse
retrieval (REST).
Authors' comments: Under Review
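The draft-then-verify loop of retrieval-based speculative decoding described above can be sketched as follows. This toy version uses exact suffix lookup in a dictionary, which is closer to the sparse style; a dense variant would replace `retrieve_draft` with a nearest-neighbour search over token embeddings. All names are hypothetical, and the target model is called token by token for clarity, whereas a real system verifies the whole draft in one forward pass.

```python
def retrieve_draft(context, datastore, max_suffix=3):
    """Look up a continuation for the longest context suffix present in the datastore."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        key = tuple(context[-n:])
        if key in datastore:
            return list(datastore[key])
    return []

def speculative_step(context, datastore, target_next_token):
    """Draft tokens by retrieval, then keep the prefix the target model agrees with."""
    draft = retrieve_draft(context, datastore)
    accepted = []
    for token in draft:
        # Stop at the first drafted token the target model would not have produced.
        if target_next_token(context + accepted) != token:
            break
        accepted.append(token)
    # The target model always contributes one token, so decoding makes progress.
    accepted.append(target_next_token(context + accepted))
    return accepted
```

Because every accepted token matches what the target model would have generated greedily, the output is unchanged; the speed-up depends on how many drafted tokens are accepted per step.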
Coen van den Elsen, Francien Barkhof, Thijmen Nijdam, Simon Lupart, Mohammad Aliannejadi
Negation is a fundamental aspect of human communication, yet it remains a
challenge for Language Models (LMs) in Information Retrieval (IR). Despite the
heavy reliance of modern neural IR systems on LMs, little attention has been
given to their handling of negation. In this study, we reproduce and extend the
findings of NevIR, a benchmark study that revealed most IR models perform at or
below the level of random ranking when dealing with negation. We replicate
NevIR's original experiments and evaluate newly developed state-of-the-art IR
models. Our findings show that a recently emerging category, listwise Large
Language Model (LLM) re-rankers, outperforms other models but still
falls short of human performance. Additionally, we leverage ExcluIR, a benchmark
dataset designed for exclusionary queries with extensive negation, to assess
the generalisability of negation understanding. Our findings suggest that
fine-tuning on one dataset does not reliably improve performance on the other,
indicating notable differences in their data distributions. Furthermore, we
observe that only cross-encoders and listwise LLM re-rankers achieve reasonable
performance across both negation tasks.
Authors' comments: 9 pages, 4 figures. Accepted at SIGIR 2025 as a reproducibility paper
Xuemeng Song, Haoqiang Lin, Haokun Wen, Bohan Hou, Mingzhu Xu, Liqiang Nie
Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user's desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR to provide a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR. In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration. The curated collection of related works is maintained and continuously updated at https://github.com/haokunwen/Awesome-Composed-Image-Retrieval.
Magdalen Dobson Manohar, Taekseung Kim, Guy E. Blelloch
Retrieving points based on proximity in a high-dimensional vector space is a crucial step in information retrieval applications. The approximate nearest neighbor search (ANNS) problem, which identifies the $k$ nearest neighbors for a query (approximately, since exactly is hard), has been extensively studied in recent years. However, comparatively little attention has been paid to the related problem of finding all points within a given distance of a query, the range retrieval problem, despite its applications in areas such as duplicate detection, plagiarism checking, and facial recognition. In this paper, we present a set of algorithms for range retrieval on graph-based vector indices, which are known to achieve excellent performance on ANNS queries. Since a range query may have anywhere from no matching results to thousands of matching results in the database, we introduce a set of range retrieval algorithms based on modifications of the standard graph search that adapt to terminate quickly on queries in the former group, and to put more resources into finding results for the latter group. Due to the lack of existing benchmarks for range retrieval, we also undertake a comprehensive study of range characteristics of existing embedding datasets, and select a suitable range retrieval radius for eight existing datasets with up to 100 million points in addition to the one existing benchmark. We test our algorithms on these datasets, and find up to 100x improvement in query throughput over a naive baseline approach, with 5-10x improvement on average, and strong performance up to 100 million data points.
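One way to picture range retrieval on a graph index with early termination is the sketch below. It is an illustration under simplifying assumptions (a plain greedy walk instead of beam search, Euclidean distance, a hypothetical `range_search` helper) and is not the authors' algorithm.

```python
import math
from collections import deque

def range_search(points, graph, query, radius, entry=0):
    """Return ids of points within `radius` of `query`, found by walking `graph`."""
    def dist(i):
        return math.dist(points[i], query)

    # Phase 1: greedy walk from the entry point towards the query.
    current = entry
    while True:
        best = min(graph[current], key=dist, default=current)
        if dist(best) >= dist(current):
            break
        current = best

    # Early termination: the closest point reached is already out of range.
    if dist(current) > radius:
        return []

    # Phase 2: spend extra work only when there are results, expanding
    # through in-range neighbours until no new ones are found.
    results, queue, seen = [], deque([current]), {current}
    while queue:
        node = queue.popleft()
        results.append(node)
        for nb in graph[node]:
            if nb not in seen and dist(nb) <= radius:
                seen.add(nb)
                queue.append(nb)
    return sorted(results)
```

Queries with no matches stop after the short greedy phase, while queries with many matches keep expanding, which mirrors the adaptive behavior the abstract describes. As with any graph-based approximate search, results can be missed when in-range points are not connected through other in-range points.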
Giorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko, Pavel Šuma, Nikolaos-Antonios Ypsilantis, Nikos Efthymiadis, Zakaria Laskar, Jiří Matas et al.
This work introduces ILIAS, a new test dataset for Instance-Level Image
retrieval At Scale. It is designed to evaluate the ability of current and
future foundation models and retrieval techniques to recognize particular
objects. The key benefits over existing datasets include large scale, domain
diversity, accurate ground truth, and a performance that is far from saturated.
ILIAS includes query and positive images for 1,000 object instances, manually
collected to capture challenging conditions and diverse domains. Large-scale
retrieval is conducted against 100 million distractor images from YFCC100M. To
avoid false negatives without extra annotation effort, we include only query
objects confirmed to have emerged after 2014, i.e. the compilation date of
YFCC100M. Extensive benchmarking is performed with the following
observations: i) models fine-tuned on specific domains, such as landmarks or
products, excel in that domain but fail on ILIAS; ii) learning a linear
adaptation layer using multi-domain class supervision results in performance
improvements, especially for vision-language models; iii) local descriptors in
retrieval re-ranking are still a key ingredient, especially in the presence of
severe background clutter; iv) the text-to-image performance of the
vision-language foundation models is surprisingly close to the corresponding
image-to-image case. Website: https://vrg.fel.cvut.cz/ilias/
Authors' comments: CVPR 2025
Papa Abdou Karim Karou Diallo, Amal Zouaq
Recent advancements in Natural Language Processing have significantly improved the extraction of structured semantic representations from unstructured text, especially through Frame Semantic Role Labeling (FSRL). Despite this progress, the potential of Retrieval-Augmented Generation (RAG) models for frame detection remains under-explored. In this paper, we present the first RAG-based approach for frame detection, called RCIF (Retrieve Candidates and Identify Frames). RCIF is also the first approach to operate without the need for an explicit target span and comprises three main stages: (1) generation of frame embeddings from various representations; (2) retrieval of candidate frames given an input text; and (3) identification of the most suitable frames. We conducted extensive experiments across multiple configurations, including zero-shot, few-shot, and fine-tuning settings. Our results show that our retrieval component significantly reduces the complexity of the task by narrowing the search space, thus allowing the frame identifier to refine and complete the set of candidates. Our approach achieves state-of-the-art performance on FrameNet 1.5 and 1.7, demonstrating its robustness in scenarios where only raw text is provided. Furthermore, we leverage the structured representation obtained through this method as a proxy to enhance generalization across lexical variations in the task of translating natural language questions into SPARQL queries.
Maximilian Jaritz, Matthieu Guillaumin, Sabine Sternig, Loris Bazzani
Multimodal retrieval methods have limitations in handling complex, compositional queries that require reasoning about the visual content of both the query and the retrieved entities. On the other hand, Large Multimodal Models (LMMs) can answer more complex visual questions in natural language, but lack the inherent ability to retrieve relevant entities to support their answers. We aim to address these limitations with UniCoRN, a Unified Commented Retrieval Network that combines the strengths of composed multimodal retrieval methods and generative language approaches, going beyond Retrieval-Augmented Generation (RAG). We introduce an entity adapter module to inject the retrieved multimodal entities back into the LMM, so it can attend to them while generating answers and comments. By keeping the base LMM frozen, UniCoRN preserves its original capabilities while being able to perform both retrieval and text generation tasks under a single integrated framework. To assess these new abilities, we introduce the Commented Retrieval task (CoR) and a corresponding dataset, with the goal of retrieving an image that accurately answers a given question and generating an additional textual response that provides further clarification and details about the visual information. We demonstrate the effectiveness of UniCoRN on several datasets, showing improvements of +4.5% recall over the state of the art for composed multimodal retrieval and of +14.9% METEOR / +18.4% BEM over RAG for commenting in CoR.
Ruin Yan, Zheng Liu, Defu Lian
The growing power of large language models (LLMs) has revolutionized how people access and utilize information. Notably, LLMs excel at performing fine-grained data representation, which facilitates precise retrieval of information. They also generate high-quality answers based on external references, enabling the production of useful knowledge. The recent introduction of reasoning models, like OpenAI O1 and DeepSeek R1, marks another leap forward, highlighting LLMs' ability to think progressively before delivering final answers. This breakthrough significantly improves the ability to address complex tasks, e.g., coding and math proofs. Inspired by this progress, we aim to develop similar capabilities for retrieval models, which hold great promise for tackling critical challenges in the field, including multi-task retrieval, zero-shot retrieval, and tasks requiring intensive reasoning over complex relationships. With this motivation, we propose a novel approach called O1 Embedder, which generates useful thoughts for the input query before retrieving the target documents. To realize this objective, we overcome two technical difficulties. First, we design a data synthesis workflow, creating training signals for O1 Embedder by generating initial thoughts from an LLM-expert and subsequently refining them using a retrieval committee. Second, we optimize the training process, enabling a pre-trained model to be jointly fine-tuned to generate retrieval thoughts via behavior cloning and perform dense retrieval through contrastive learning. Our approach is evaluated by comprehensive experiments, where substantial improvements are achieved across 12 popular datasets, spanning both in-domain and out-of-domain scenarios. These results highlight O1 Embedder's remarkable accuracy and generalizability, paving the way for the development of next-generation IR foundation models.
Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu
Retrieval-augmented generation (RAG) has emerged as a promising technology
for addressing hallucination issues in the responses generated by large
language models (LLMs). Existing studies on RAG primarily focus on applying
semantic-based approaches to retrieve isolated relevant chunks, which ignore
their intrinsic relationships. In this paper, we propose a novel Knowledge
Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes
knowledge graphs (KGs) to provide fact-level relationships between chunks,
improving the diversity and coherence of the retrieved results. Specifically,
after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG
employs a KG-guided chunk expansion process and a KG-based chunk organization
process to deliver relevant and important knowledge in well-organized
paragraphs. Extensive experiments conducted on the HotpotQA dataset and its
variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based
approaches, in terms of both response quality and retrieval quality.
Authors' comments: Accepted in the 2025 Annual Conference of the Nations of the Americas
Chapter of the ACL (NAACL 2025)
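The KG-guided chunk expansion step mentioned in the abstract above can be illustrated with a small sketch. It assumes one-hop expansion over undirected entity edges and a precomputed mapping from chunks to the entities they mention; the helper `expand_chunks` and its data layout are hypothetical and may differ from the paper's procedure.

```python
def expand_chunks(seed_chunks, chunk_entities, kg_edges):
    """Add chunks mentioning entities that a KG edge links to an entity in a seed chunk."""
    seed_entities = set()
    for chunk in seed_chunks:
        seed_entities |= chunk_entities[chunk]

    # One-hop neighbours of the seed entities, treating edges as undirected (assumption).
    neighbours = set()
    for head, tail in kg_edges:
        if head in seed_entities:
            neighbours.add(tail)
        if tail in seed_entities:
            neighbours.add(head)

    # Keep the seeds first, then append every other chunk that touches a neighbour entity.
    expanded = list(seed_chunks)
    for chunk, entities in chunk_entities.items():
        if chunk not in expanded and entities & neighbours:
            expanded.append(chunk)
    return expanded
```

For example, if a seed chunk mentions "Paris" and the graph holds an edge between "Paris" and "France", a chunk mentioning "France" is pulled in even when its embedding is not close to the query.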
Sagnik Anupam, Alexander Shypula, Osbert Bastani
With the advent of large language models (LLMs), there has been a great deal of interest in applying them to solve difficult programming tasks. Recent work has demonstrated their potential for program optimization, a key challenge in programming languages research. We propose a blackbox adaptation method called Retrieval Augmented Search (RAS) that performs beam search over candidate optimizations; at each step, it retrieves in-context examples from a given training dataset of slow-fast program pairs to guide the LLM. Critically, we find that performing contextual retrieval based on an LLM-generated natural language description significantly outperforms retrieval based on the source code. In addition, we propose a method called AEGIS for improving interpretability by decomposing training examples into "atomic edits" that are significantly more incremental in nature. We show that RAS performs 1.8$\times$ better than prior state-of-the-art blackbox adaptation strategies, and that AEGIS performs 1.37$\times$ better while performing significantly smaller edits.