Coen van den Elsen, Francien Barkhof, Thijmen Nijdam, Simon Lupart, Mohammad Aliannejadi
Negation is a fundamental aspect of human communication, yet it remains a
challenge for Language Models (LMs) in Information Retrieval (IR). Despite the
heavy reliance of modern neural IR systems on LMs, little attention has been
given to their handling of negation. In this study, we reproduce and extend the
findings of NevIR, a benchmark study that revealed most IR models perform at or
below the level of random ranking when dealing with negation. We replicate
NevIR's original experiments and evaluate newly developed state-of-the-art IR
models. Our findings show that a recently emerging category-listwise Large
Language Model (LLM) re-rankers-outperforms other models but still
underperforms human performance. Additionally, we leverage ExcluIR, a benchmark
dataset designed for exclusionary queries with extensive negation, to assess
the generalisability of negation understanding. Our findings suggest that
fine-tuning on one dataset does not reliably improve performance on the other,
indicating notable differences in their data distributions. Furthermore, we
observe that only cross-encoders and listwise LLM re-rankers achieve reasonable
performance across both negation tasks.
Authors' comments: 9 pages, 4 figures. Accepted at SIGIR 2025 as a reproducibility paper
Xuemeng Song, Haoqiang Lin, Haokun Wen, Bohan Hou, Mingzhu Xu, Liqiang Nie
Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user's desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR to provide a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration. The curated collection of related works is maintained and continuously updated in https://github.com/haokunwen/Awesome-Composed-Image-Retrieval.
Magdalen Dobson Manohar, Taekseung Kim, Guy E. Blelloch
Retrieving points based on proximity in a high-dimensional vector space is a crucial step in information retrieval applications. The approximate nearest neighbor search (ANNS) problem, which identifies the $k$ nearest neighbors for a query (approximately, since exactly is hard), has been extensively studied in recent years. However, comparatively little attention has been paid to the related problem of finding all points within a given distance of a query, the range retrieval problem, despite its applications in areas such as duplicate detection, plagiarism checking, and facial recognition. In this paper, we present a set of algorithms for range retrieval on graph-based vector indices, which are known to achieve excellent performance on ANNS queries. Since a range query may have anywhere from no matching results to thousands of matching results in the database, we introduce a set of range retrieval algorithms based on modifications of the standard graph search that adapt to terminate quickly on queries in the former group, and to put more resources into finding results for the latter group. Due to the lack of existing benchmarks for range retrieval, we also undertake a comprehensive study of range characteristics of existing embedding datasets, and select a suitable range retrieval radius for eight existing datasets with up to 100 million points in addition to the one existing benchmark. We test our algorithms on these datasets, and find up to 100x improvement in query throughput over a naive baseline approach, with 5-10x improvement on average, and strong performance up to 100 million data points.
Giorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko, Pavel Šuma, Nikolaos-Antonios Ypsilantis, Nikos Efthymiadis, Zakaria Laskar, Jiří Matas et al.
This work introduces ILIAS, a new test dataset for Instance-Level Image
retrieval At Scale. It is designed to evaluate the ability of current and
future foundation models and retrieval techniques to recognize particular
objects. The key benefits over existing datasets include large scale, domain
diversity, accurate ground truth, and a performance that is far from saturated.
ILIAS includes query and positive images for 1,000 object instances, manually
collected to capture challenging conditions and diverse domains. Large-scale
retrieval is conducted against 100 million distractor images from YFCC100M. To
avoid false negatives without extra annotation effort, we include only query
objects confirmed to have emerged after 2014, i.e. the compilation date of
YFCC100M. An extensive benchmarking is performed with the following
observations: i) models fine-tuned on specific domains, such as landmarks or
products, excel in that domain but fail on ILIAS ii) learning a linear
adaptation layer using multi-domain class supervision results in performance
improvements, especially for vision-language models iii) local descriptors in
retrieval re-ranking are still a key ingredient, especially in the presence of
severe background clutter iv) the text-to-image performance of the
vision-language foundation models is surprisingly close to the corresponding
image-to-image case. website: https://vrg.fel.cvut.cz/ilias/
Authors' comments: CVPR 2025
Papa Abdou Karim Karou Diallo, Amal Zouaq
Recent advancements in Natural Language Processing have significantly improved the extraction of structured semantic representations from unstructured text, especially through Frame Semantic Role Labeling (FSRL). Despite this progress, the potential of Retrieval-Augmented Generation (RAG) models for frame detection remains under-explored. In this paper, we present the first RAG-based approach for frame detection called RCIF (Retrieve Candidates and Identify Frames). RCIF is also the first approach to operate without the need for explicit target span and comprises three main stages: (1) generation of frame embeddings from various representations ; (2) retrieval of candidate frames given an input text; and (3) identification of the most suitable frames. We conducted extensive experiments across multiple configurations, including zero-shot, few-shot, and fine-tuning settings. Our results show that our retrieval component significantly reduces the complexity of the task by narrowing the search space thus allowing the frame identifier to refine and complete the set of candidates. Our approach achieves state-of-the-art performance on FrameNet 1.5 and 1.7, demonstrating its robustness in scenarios where only raw text is provided. Furthermore, we leverage the structured representation obtained through this method as a proxy to enhance generalization across lexical variations in the task of translating natural language questions into SPARQL queries.
Maximilian Jaritz, Matthieu Guillaumin, Sabine Sternig, Loris Bazzani
Multimodal retrieval methods have limitations in handling complex, compositional queries that require reasoning about the visual content of both the query and the retrieved entities. On the other hand, Large Multimodal Models (LMMs) can answer with language to more complex visual questions, but without the inherent ability to retrieve relevant entities to support their answers. We aim to address these limitations with UniCoRN, a Unified Commented Retrieval Network that combines the strengths of composed multimodal retrieval methods and generative language approaches, going beyond Retrieval-Augmented Generation (RAG). We introduce an entity adapter module to inject the retrieved multimodal entities back into the LMM, so it can attend to them while generating answers and comments. By keeping the base LMM frozen, UniCoRN preserves its original capabilities while being able to perform both retrieval and text generation tasks under a single integrated framework. To assess these new abilities, we introduce the Commented Retrieval task (CoR) and a corresponding dataset, with the goal of retrieving an image that accurately answers a given question and generate an additional textual response that provides further clarification and details about the visual information. We demonstrate the effectiveness of UniCoRN on several datasets showing improvements of +4.5% recall over the state of the art for composed multimodal retrieval and of +14.9% METEOR / +18.4% BEM over RAG for commenting in CoR.
Ruin Yan, Zheng Liu, Defu Lian
The growing power of large language models (LLMs) has revolutionized how people access and utilize information. Notably, the LLMs excel at performing fine-grained data representation, which facilitates precise retrieval of information. They also generate high-quality answers based on external references, enabling the production of useful knowledge. The recent introduction of reasoning models, like OpenAI O1 and DeepSeek R1, marks another leap forward, highlighting LLMs' ability to think progressively before delivering final answers. This breakthrough significantly improves the ability to address complex tasks, e.g., coding and math proofs. Inspired by this progress, we aim to develop similar capabilities for retrieval models, which hold great promise for tackling critical challenges in the field, including multi-task retrieval, zero-shot retrieval, and tasks requiring intensive reasoning of complex relationships. With this motivation, we propose a novel approach called O1 Embedder, which generates useful thoughts for the input query before making retrieval for the target documents. To realize this objective, we conquer two technical difficulties. First, we design a data synthesis workflow, creating training signals for O1 Embedder by generating initial thoughts from an LLM-expert and subsequently refining them using a retrieval committee. Second, we optimize the training process, enabling a pre-trained model to be jointly fine-tuned to generate retrieval thoughts via behavior cloning and perform dense retrieval through contrastive learning. Our approach is evaluated by comprehensive experiments, where substantial improvements are achieved across 12 popular datasets, spanning both in-domain and out-of-domain scenarios. These results highlight O1 Embedder's remarkable accuracy and generalizability, paving the way for the development of next-generation IR foundation models.
Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu
Retrieval-augmented generation (RAG) has emerged as a promising technology
for addressing hallucination issues in the responses generated by large
language models (LLMs). Existing studies on RAG primarily focus on applying
semantic-based approaches to retrieve isolated relevant chunks, which ignore
their intrinsic relationships. In this paper, we propose a novel Knowledge
Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes
knowledge graphs (KGs) to provide fact-level relationships between chunks,
improving the diversity and coherence of the retrieved results. Specifically,
after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG
employs a KG-guided chunk expansion process and a KG-based chunk organization
process to deliver relevant and important knowledge in well-organized
paragraphs. Extensive experiments conducted on the HotpotQA dataset and its
variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based
approaches, in terms of both response quality and retrieval quality.
Authors' comments: Accepted in the 2025 Annual Conference of the Nations of the Americas
Chapter of the ACL (NAACL 2025)
Sagnik Anupam, Alexander Shypula, Osbert Bastani
With the advent of large language models (LLMs), there has been a great deal of interest in applying them to solve difficult programming tasks. Recent work has demonstrated their potential at program optimization, a key challenge in programming languages research. We propose a blackbox adaptation method called Retrieval Augmented Search (RAS) that performs beam search over candidate optimizations; at each step, it retrieves in-context examples from a given training dataset of slow-fast program pairs to guide the LLM. Critically, we find that performing contextual retrieval based on an LLM-generated natural language description significantly outperforms retrieval based on the source code. In addition, we propose a method called AEGIS for improving interpretability by decomposing training examples into "atomic edits" that are significantly more incremental in nature. We show that RAS performs 1.8$\times$ better than prior state-of-the-art blackbox adaptation strategies, and that AEGIS performs 1.37$\times$ better while performing significantly smaller edits.
Zheng Zheng, Xinyi Ni, Pengyu Hong
A Retrieval-Augmented Generation (RAG) model powered by a large language model (LLM) provides a faster and more cost-effective solution for adapting to new data and knowledge. It also delivers more specialized responses compared to pre-trained LLMs. However, most existing approaches rely on retrieving prefix-sized chunks as references to support question-answering (Q/A). This approach is often deployed to address information needs at a single level of abstraction, as it struggles to generate answers across multiple levels of abstraction. In an RAG setting, while LLMs can summarize and answer questions effectively when provided with sufficient details, retrieving excessive information often leads to the 'lost in the middle' problem and exceeds token limitations. We propose a novel RAG approach that uses chunks of multiple abstraction levels (MAL), including multi-sentence-level, paragraph-level, section-level, and document-level. The effectiveness of our approach is demonstrated in an under-explored scientific domain of Glycoscience. Compared to traditional single-level RAG approaches, our approach improves AI evaluated answer correctness of Q/A by 25.739\% on Glyco-related papers.
Goksenin Yuksel, Jaap Kamps
Dense retrievers have demonstrated significant potential for neural information retrieval; however, they exhibit a lack of robustness to domain shifts, thereby limiting their efficacy in zero-shot settings across diverse domains. Previous research has investigated unsupervised domain adaptation techniques to adapt dense retrievers to target domains. However, these studies have not focused on explainability analysis to understand how such adaptations alter the model's behavior. In this paper, we propose utilizing the integrated gradients framework to develop an interpretability method that provides both instance-based and ranking-based explanations for dense retrievers. To generate these explanations, we introduce a novel baseline that reveals both query and document attributions. This method is used to analyze the effects of domain adaptation on input attributions for query and document tokens across two datasets: the financial question answering dataset (FIQA) and the biomedical information retrieval dataset (TREC-COVID). Our visualizations reveal that domain-adapted models focus more on in-domain terminology compared to non-adapted models, exemplified by terms such as "hedge," "gold," "corona," and "disease." This research addresses how unsupervised domain adaptation techniques influence the behavior of dense retrievers when adapted to new domains. Additionally, we demonstrate that integrated gradients are a viable choice for explaining and analyzing the internal mechanisms of these opaque neural models.
Mandeep Rathee, Sean MacAvaney, Avishek Anand
Large Language Models (LLMs) have shown strong promise as rerankers,
especially in ``listwise'' settings where an LLM is prompted to rerank several
search results at once. However, this ``cascading'' retrieve-and-rerank
approach is limited by the bounded recall problem: relevant documents not
retrieved initially are permanently excluded from the final ranking. Adaptive
retrieval techniques address this problem, but do not work with listwise
rerankers because they assume a document's score is computed independently from
other documents. In this paper, we propose an adaptation of an existing
adaptive retrieval method that supports the listwise setting and helps guide
the retrieval process itself (thereby overcoming the bounded recall problem for
LLM rerankers). Specifically, our proposed algorithm merges results both from
the initial ranking and feedback documents provided by the most relevant
documents seen up to that point. Through extensive experiments across diverse
LLM rerankers, first stage retrievers, and feedback sources, we demonstrate
that our method can improve nDCG@10 by up to 13.23% and recall by 28.02%--all
while keeping the total number of LLM inferences constant and overheads due to
the adaptive process minimal. The work opens the door to leveraging LLM-based
search in settings where the initial pool of results is limited, e.g., by
legacy systems, or by the cost of deploying a semantic first-stage.
Authors' comments: 16 pages, 2 figures, 3 tables
Ramin Farshchian, Rajab Ali Kamyabi-Gol, Fahimeh Arabyani-Neyshaburi, Fatemeh Esmaeelzadeh
This paper investigates the properties of continuous frames, with a particular focus on phase retrieval and norm retrieval in the context of Hilbert spaces. We introduce the concept of continuous near-Riesz bases and prove their invariance under invertible operators. Some equivalent conditions for phase and norm retrieval property of continuous frames are presented. We study the stability of phase retrieval under perturbations. Furthermore, tensor product frames for separable Hilbert spaces are studied, and we establish the equivalence of phase retrieval and norm retrieval properties between components and their tensor products.
Feiteng Mu, Liwen Zhang, Yong Jiang, Wenjie Li, Zhen Zhang, Pengjun Xie, Fei Huang
Query routing for retrieval-augmented generation aims to assign an input query to the most suitable search engine. Existing works rely heavily on supervised datasets that require extensive manual annotation, resulting in high costs and limited scalability, as well as poor generalization to out-of-distribution scenarios. To address these challenges, we introduce a novel unsupervised method that constructs the "upper-bound" response to evaluate the quality of retrieval-augmented responses. This evaluation enables the decision of the most suitable search engine for a given query. By eliminating manual annotations, our approach can automatically process large-scale real user queries and create training data. We conduct extensive experiments across five datasets, demonstrating that our method significantly enhances scalability and generalization capabilities.
Dong Liu, Esther Lopez Ramos
Semantic retrieval (also known as dense retrieval) based on textual data has
been extensively studied for both web search and product search application
fields, where the relevance of a query and a potential target document is
computed by their dense vector representation comparison. Product image is
crucial for e-commerce search interactions and is a key factor for customers at
product explorations. However, its impact on semantic retrieval has not been
well studied yet. In this research, we build a multimodal representation for
product items in e-commerce search in contrast to pure-text representation of
products, and investigate the impact of such representations. The models are
developed and evaluated on e-commerce datasets. We demonstrate that a
multimodal representation scheme for a product can show improvement either on
purchase recall or relevance accuracy in semantic retrieval. Additionally, we
provide numerical analysis for exclusive matches retrieved by a multimodal
semantic retrieval model versus a text-only semantic retrieval model, to
demonstrate the validation of multimodal solutions.
Authors' comments: Accepted at EReL@MIR WWW 2025
Tianyu Fan, Jingyuan Wang, Xubin Ren, Chao Huang
The growing demand for efficient and lightweight Retrieval-Augmented Generation (RAG) systems has highlighted significant challenges when deploying Small Language Models (SLMs) in existing RAG frameworks. Current approaches face severe performance degradation due to SLMs' limited semantic understanding and text processing capabilities, creating barriers for widespread adoption in resource-constrained scenarios. To address these fundamental limitations, we present MiniRAG, a novel RAG system designed for extreme simplicity and efficiency. MiniRAG introduces two key technical innovations: (1) a semantic-aware heterogeneous graph indexing mechanism that combines text chunks and named entities in a unified structure, reducing reliance on complex semantic understanding, and (2) a lightweight topology-enhanced retrieval approach that leverages graph structures for efficient knowledge discovery without requiring advanced language capabilities. Our extensive experiments demonstrate that MiniRAG achieves comparable performance to LLM-based methods even when using SLMs while requiring only 25\% of the storage space. Additionally, we contribute a comprehensive benchmark dataset for evaluating lightweight RAG systems under realistic on-device scenarios with complex queries. We fully open-source our implementation and datasets at: https://github.com/HKUDS/MiniRAG.
Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions losing multimodal richness. To tackle these, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance with queries but also utilizes both visual and textual information. The operation of VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. Also, inspired by that the context size of LVLMs may not be sufficient to process all frames in extremely long videos and not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.
Te-Lun Yang, Jyi-Shane Liu, Yuen-Hsien Tseng, Jyh-Shing Roger Jang
This study develops a question-answering system based on Retrieval-Augmented
Generation (RAG) using Chinese Wikipedia and Lawbank as retrieval sources.
Using TTQA and TMMLU+ as evaluation datasets, the system employs BGE-M3 for
dense vector retrieval to obtain highly relevant search results and
BGE-reranker to reorder these results based on query relevance. The most
pertinent retrieval outcomes serve as reference knowledge for a Large Language
Model (LLM), enhancing its ability to answer questions and establishing a
knowledge retrieval system grounded in generative AI.
The system's effectiveness is assessed through a two-stage evaluation:
automatic and assisted performance evaluations. The automatic evaluation
calculates accuracy by comparing the model's auto-generated labels with ground
truth answers, measuring performance under standardized conditions without
human intervention. The assisted performance evaluation involves 20
finance-related multiple-choice questions answered by 20 participants without
financial backgrounds. Initially, participants answer independently. Later,
they receive system-generated reference information to assist in answering,
examining whether the system improves accuracy when assistance is provided.
The main contributions of this research are: (1) Enhanced LLM Capability: By
integrating BGE-M3 and BGE-reranker, the system retrieves and reorders highly
relevant results, reduces hallucinations, and dynamically accesses authorized
or public knowledge sources. (2) Improved Data Privacy: A customized RAG
architecture enables local operation of the LLM, eliminating the need to send
private data to external servers. This approach enhances data security, reduces
reliance on commercial services, lowers operational costs, and mitigates
privacy risks.
Authors' comments: 8 pages, 13 figures, 1 table
Hanna Zubkova, Ji-Hoon Park, Seong-Whan Lee
Bearing in mind the limited parametric knowledge of Large Language Models
(LLMs), retrieval-augmented generation (RAG) which supplies them with the
relevant external knowledge has served as an approach to mitigate the issue of
hallucinations to a certain extent. However, uniformly retrieving supporting
context makes response generation source-inefficient, as triggering the
retriever is not always necessary, or even inaccurate, when a model gets
distracted by noisy retrieved content and produces an unhelpful answer.
Motivated by these issues, we introduce Semantic Uncertainty Guided Adaptive
Retrieval (SUGAR), where we leverage context-based entropy to actively decide
whether to retrieve and to further determine between single-step and multi-step
retrieval. Our empirical results show that selective retrieval guided by
semantic uncertainty estimation improves the performance across diverse
question answering tasks, as well as achieves a more efficient inference.
Authors' comments: ICASSP2025
Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A. Rossi et al.
Retrieval-augmented generation (RAG) is a powerful technique that enhances downstream task execution by retrieving additional information, such as knowledge, skills, and tools from external sources. Graph, by its intrinsic "nodes connected by edges" nature, encodes massive heterogeneous and relational information, making it a golden resource for RAG in tremendous real-world applications. As a result, we have recently witnessed increasing attention on equipping RAG with Graph, i.e., GraphRAG. However, unlike conventional RAG, where the retriever, generator, and external data sources can be uniformly designed in the neural-embedding space, the uniqueness of graph-structured data, such as diverse-formatted and domain-specific relational knowledge, poses unique and significant challenges when designing GraphRAG for different domains. Given the broad applicability, the associated design challenges, and the recent surge in GraphRAG, a systematic and up-to-date survey of its key concepts and techniques is urgently desired. Following this motivation, we present a comprehensive and up-to-date survey on GraphRAG. Our survey first proposes a holistic GraphRAG framework by defining its key components, including query processor, retriever, organizer, generator, and data source. Furthermore, recognizing that graphs in different domains exhibit distinct relational patterns and require dedicated designs, we review GraphRAG techniques uniquely tailored to each domain. Finally, we discuss research challenges and brainstorm directions to inspire cross-disciplinary opportunities. Our survey repository is publicly maintained at https://github.com/Graph-RAG/GraphRAG/.