Yilong Xu, Jinhua Gao, Xiaoming Yu, Yuanhai Xue, Baolong Bi, Huawei Shen, Xueqi Cheng
Retrieval-Augmented Language Models boost task performance, owing to the
retriever that provides external knowledge. Although crucial, the retriever
primarily focuses on semantic relevance, which may not always be effective for
generation. Thus, utility-based retrieval has emerged as a promising topic,
prioritizing passages that provide valid benefits for downstream tasks.
However, because passage utility is still poorly understood, capturing it
accurately remains underexplored. This work proposes SCARLet, a framework for
training utility-based retrievers in RALMs, which incorporates two key factors,
multi-task generalization and inter-passage interaction. First, SCARLet
constructs shared context on which training data for various tasks is
synthesized. This mitigates semantic bias from context differences, allowing
retrievers to focus on learning task-specific utility for better task
generalization. Next, SCARLet uses a perturbation-based attribution method to
estimate passage-level utility for shared context, which reflects interactions
between passages and provides more accurate feedback. We evaluate our approach
on ten datasets across various tasks, both in-domain and out-of-domain, showing
that retrievers trained by SCARLet consistently improve the overall performance
of RALMs.
Authors' comments: 20 pages, 9 figures. Code will be released after review
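The perturbation-based attribution idea in the SCARLet abstract can be illustrated with a leave-one-out sketch: each passage's utility is the drop in a task score when that passage alone is removed from the shared context. Leave-one-out is only one simple form of perturbation, chosen here for illustration; the abstract does not specify the exact attribution method. The `coverage_score` function and the fact-set passages are hypothetical stand-ins for a real RALM evaluation.

```python
from typing import Callable, List, Set


def leave_one_out_utility(
    passages: List[Set[str]],
    score: Callable[[List[Set[str]]], float],
) -> List[float]:
    """Utility of passage i = score(all passages) - score(all passages except i)."""
    full = score(passages)
    return [
        full - score(passages[:i] + passages[i + 1:])
        for i in range(len(passages))
    ]


def coverage_score(passages: List[Set[str]]) -> float:
    """Hypothetical task score: number of needed facts covered by the context."""
    needed = {"x", "y"}
    covered = set().union(*passages)
    return float(len(covered & needed))


# Two redundant passages carry fact "x"; only one passage carries fact "y".
context = [{"x"}, {"x"}, {"y"}]
utilities = leave_one_out_utility(context, coverage_score)
```

In this toy context the two redundant passages each receive zero utility because the other one still covers the same fact, while the third passage is irreplaceable. This is the kind of inter-passage interaction that scoring each passage in isolation would miss.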
Hsin-Ling Hsu, Jengnan Tzeng
Hybrid retrieval techniques in Retrieval-Augmented Generation (RAG) systems enhance information retrieval by combining dense and sparse (e.g., BM25-based) retrieval methods. However, existing approaches struggle with adaptability, as fixed weighting schemes fail to adjust to different queries. To address this, we propose DAT (Dynamic Alpha Tuning), a novel hybrid retrieval framework that dynamically balances dense retrieval and BM25 for each query. DAT leverages a large language model (LLM) to evaluate the effectiveness of the top-1 results from both retrieval methods, assigning an effectiveness score to each. It then calibrates the optimal weighting factor through effectiveness score normalization, ensuring a more adaptive and query-aware weighting between the two approaches. Empirical results show that DAT consistently and significantly outperforms fixed-weighting hybrid retrieval methods across various evaluation metrics. Even on smaller models, DAT delivers strong performance, highlighting its efficiency and adaptability.
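As a rough illustration of the per-query weighting described in the DAT abstract, the sketch below derives a weight alpha from two effectiveness scores and uses it to fuse min-max normalized dense and BM25 scores. The function names, the ratio-based normalization, and the equal-weight fallback are assumptions made for this example; the abstract does not give the exact calibration rule, and the LLM scoring step is replaced here by hard-coded numbers.

```python
from typing import Dict, List


def dynamic_alpha(dense_eff: float, bm25_eff: float) -> float:
    """Dense-retrieval weight from two non-negative effectiveness scores (assumed rule)."""
    total = dense_eff + bm25_eff
    if total == 0:
        return 0.5  # neither top-1 result looks useful: fall back to equal weights
    return dense_eff / total


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and BM25 scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(dense: Dict[str, float], bm25: Dict[str, float], alpha: float) -> List[str]:
    """Rank documents by alpha * dense + (1 - alpha) * bm25 on normalized scores."""
    d, b = min_max(dense), min_max(bm25)
    fused = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in set(d) | set(b)
    }
    # Sort by descending fused score; break ties by document id for determinism.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


dense_scores = {"a": 0.9, "b": 0.2}
bm25_scores = {"a": 1.0, "b": 7.0, "c": 3.0}
# Hypothetical LLM judgments of the two top-1 results for this query.
ranking = hybrid_rank(dense_scores, bm25_scores, dynamic_alpha(4, 1))
```

With effectiveness scores of 4 (dense) and 1 (BM25), alpha is 0.8 and the dense favorite "a" ranks first. If the dense score were 0, alpha would be 0 and the ranking would follow BM25 alone.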
Wenchao Gu, Juntao Chen, Yanlin Wang, Tianyue Jiang, Xingzhe Li, Mingwei Liu, Xilin Liu, Yuchi Ma et al.
Repository-level code generation remains challenging due to complex code dependencies and the limitations of large language models (LLMs) in processing long contexts. While retrieval-augmented generation (RAG) frameworks are widely adopted, the effectiveness of different retrieved information sources (contextual code, APIs, and similar snippets) has not been rigorously analyzed. Through an empirical study on two benchmarks, we demonstrate that in-context code and potential API information significantly enhance LLM performance, whereas retrieved similar code often introduces noise, degrading results by up to 15%. Based on the preliminary results, we propose AllianceCoder, a novel context-integrated method that employs chain-of-thought prompting to decompose user queries into implementation steps and retrieves APIs via semantic description matching. Through extensive experiments on CoderEval and RepoExec, AllianceCoder achieves state-of-the-art performance, improving Pass@1 by up to 20% over existing approaches.
Sejong Kim, Hyunseo Song, Hyunwoo Seo, Hyunjun Kim
Retrieval-Augmented Generation (RAG) has emerged as a promising framework to
mitigate hallucinations in Large Language Models (LLMs), yet its overall
performance is dependent on the underlying retrieval system. In the finance
domain, documents such as 10-K reports pose distinct challenges due to
domain-specific vocabulary and multi-hierarchical tabular data. In this work,
we introduce an efficient, end-to-end RAG pipeline that enhances retrieval for
financial documents through a three-phase approach: pre-retrieval, retrieval,
and post-retrieval. In the pre-retrieval phase, various query and corpus
preprocessing techniques are employed to enrich input data. During the
retrieval phase, we fine-tuned state-of-the-art (SOTA) embedding models with
domain-specific knowledge and implemented a hybrid retrieval strategy that
combines dense and sparse representations. Finally, the post-retrieval phase
leverages Direct Preference Optimization (DPO) training and document selection
methods to further refine the results. Evaluations on seven financial question
answering datasets (FinDER, FinQABench, FinanceBench, TATQA, FinQA, ConvFinQA,
and MultiHiertt) demonstrate substantial improvements in retrieval performance,
leading to more accurate and contextually appropriate generation. These
findings highlight the critical role of tailored retrieval techniques in
advancing the effectiveness of RAG systems for financial applications. A fully
replicable pipeline is available on GitHub:
https://github.com/seohyunwoo-0407/GAR.
Authors' comments: 15 pages, 3 figures, 11 tables. Accepted at ICLR 2025 Workshop on
Advances in Financial AI. Code available at
https://github.com/seohyunwoo-0407/GAR
Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.
Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, Wei Hu
In this paper, we identify a critical problem, "lost-in-retrieval", in
retrieval-augmented multi-hop question answering (QA): the key entities are
missed in LLMs' sub-question decomposition. "Lost-in-retrieval" significantly
degrades the retrieval performance, which disrupts the reasoning chain and
leads to incorrect answers. To resolve this problem, we propose a
progressive retrieval and rewriting method, namely ChainRAG, which sequentially
handles each sub-question by completing missing key entities and retrieving
relevant sentences from a sentence graph for answer generation. Each step in
our retrieval and rewriting process builds upon the previous one, creating a
seamless chain that leads to accurate retrieval and answers. Finally, all
retrieved sentences and sub-question answers are integrated to generate a
comprehensive answer to the original question. We evaluate ChainRAG on three
multi-hop QA datasets - MuSiQue, 2Wiki, and HotpotQA - using three large
language models: GPT4o-mini, Qwen2.5-72B, and GLM-4-Plus. Empirical results
demonstrate that ChainRAG consistently outperforms baselines in both
effectiveness and efficiency.
Authors' comments: Accepted in the 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025)
Yingrui Yang, Parker Carlson, Yifan Qiao, Wentai Xie, Shanxiu He, Tao Yang
This paper studies fast fusion of dense retrieval and sparse lexical
retrieval, and proposes a cluster-based selective dense retrieval method called
CluSD guided by sparse lexical retrieval. CluSD takes a lightweight
cluster-based approach and exploits the overlap of sparse retrieval results and
embedding clusters in a two-stage selection process with an LSTM model to
quickly identify relevant clusters while incurring limited extra memory space
overhead. CluSD triggers partial dense retrieval and performs cluster-based
block disk I/O if needed. This paper evaluates CluSD and compares it with
several baselines for searching in-memory and on-disk MS MARCO and BEIR
datasets.
Authors' comments: This paper is accepted by ECIR'25
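The cluster-selection idea in the CluSD abstract can be sketched with a simple overlap count: clusters that contain more of the top sparse-retrieval results are visited first by the dense retriever. This is a deliberately simplified stand-in. The abstract describes a two-stage selection with an LSTM model, which is replaced here by a plain counting heuristic, and the names `select_clusters` and `doc_to_cluster` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def select_clusters(
    sparse_top_docs: List[str],
    doc_to_cluster: Dict[str, int],
    max_clusters: int,
) -> List[int]:
    """Pick the clusters that overlap most with the top sparse-retrieval results."""
    counts = Counter(
        doc_to_cluster[doc] for doc in sparse_top_docs if doc in doc_to_cluster
    )
    # Sort by descending overlap; break ties by cluster id for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [cluster for cluster, _ in ranked[:max_clusters]]


doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1, "d4": 2, "d5": 2, "d6": 2}
sparse_top = ["d1", "d4", "d5", "d3", "d6"]
selected = select_clusters(sparse_top, doc_to_cluster, max_clusters=2)
```

Only the embeddings in the selected clusters would then need to be scored (and, for an on-disk index, read), which is where the savings over full dense retrieval would come from.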
Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt
Retrieval, re-ranking, and retrieval-augmented generation (RAG) are critical
components of modern applications in information retrieval, question answering,
or knowledge-based text generation. However, existing solutions are often
fragmented, lacking a unified framework that easily integrates these essential
processes. The absence of a standardized implementation, coupled with the
complexity of retrieval and re-ranking workflows, makes it challenging for
researchers to compare and evaluate different approaches in a consistent
environment. While existing toolkits such as Rerankers and RankLLM provide
general-purpose reranking pipelines, they often lack the flexibility required
for fine-grained experimentation and benchmarking. In response to these
challenges, we introduce Rankify, a powerful and modular open-source toolkit
designed to unify retrieval, re-ranking, and RAG within a cohesive framework.
Rankify supports a wide range of retrieval techniques, including dense and
sparse retrievers, while incorporating state-of-the-art re-ranking models to
enhance retrieval quality. Additionally, Rankify includes a collection of
pre-retrieved datasets to facilitate benchmarking, available at Hugging Face
(https://huggingface.co/datasets/abdoelsayed/reranking-datasets-light). To
encourage adoption and ease of integration, we provide comprehensive
documentation (http://rankify.readthedocs.io/), an open-source implementation
on GitHub (https://github.com/DataScienceUIBK/rankify), and a PyPI package for
easy installation (https://pypi.org/project/rankify/). As a unified and
lightweight framework, Rankify allows researchers and practitioners to advance
retrieval and re-ranking methodologies while ensuring consistency, scalability,
and ease of use.
Authors' comments: Work in Progress
Peter Baile Chen, Yi Zhang, Michael Cafarella, Dan Roth
Real-world open-domain questions can be complicated, particularly when answering them involves information from multiple information sources. LLMs have demonstrated impressive performance in decomposing complex tasks into simpler steps, and previous work has used this capability for better retrieval in support of complex questions. However, LLMs' decomposition of questions is unaware of what data is available and how data is organized, often leading to sub-optimal retrieval performance. Recent efforts in agentic RAG propose to perform retrieval in an iterative fashion, where a follow-up query is derived as an action based on previous rounds of retrieval. While this provides one way of interacting with the data collection, agentic RAG's exploration of data is inefficient because successive queries depend on previous results rather than being guided by the organization of available data in the collection. To address this problem, we propose an LLM-based retrieval method, ARM, that aims to better align the question with the organization of the data collection by exploring relationships among data objects beyond matching the utterance of the query, thus leading to a retrieve-all-at-once solution for complex queries. We evaluated ARM on two datasets, Bird and OTT-QA. On Bird, it outperforms standard RAG with query decomposition by up to 5.2 pt in execution accuracy and agentic RAG (ReAct) by up to 15.9 pt. On OTT-QA, it achieves up to 5.5 pt and 19.3 pt higher F1 match scores compared to these approaches.
Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved from a knowledge base. However, its effectiveness is fundamentally constrained by the reliability of both the retriever and the knowledge base. In real-world scenarios, imperfections in these components often lead to the retrieval of noisy, irrelevant, or misleading counterfactual information, ultimately undermining the trustworthiness of RAG systems. To address this challenge, we propose Robust Fine-Tuning (RbFT), a method designed to enhance the resilience of LLMs against retrieval defects through two targeted fine-tuning tasks. Experimental results demonstrate that RbFT significantly improves the robustness of RAG systems across diverse retrieval conditions, surpassing existing methods while maintaining high inference efficiency and compatibility with other robustness techniques.
Long Nguyen, Huy Nguyen, Bao Khuu, Huy Luu, Huy Le, Tuan Nguyen, Tho Quan
Retrieving events from videos using text queries has become increasingly
challenging due to the rapid growth of multimedia content. Existing methods for
text-based video event retrieval often focus heavily on object-level
descriptions, overlooking the crucial role of contextual information. This
limitation is especially apparent when queries lack sufficient context, such as
missing location details or ambiguous background elements. To address these
challenges, we propose a novel system called RAPID (Retrieval-Augmented
Parallel Inference Drafting), which leverages advancements in Large Language
Models (LLMs) and prompt-based learning to semantically correct and enrich user
queries with relevant contextual information. These enriched queries are then
processed through parallel retrieval, followed by an evaluation step to select
the most relevant results based on their alignment with the original query.
Through extensive experiments on our custom-developed dataset, we demonstrate
that RAPID significantly outperforms traditional retrieval methods,
particularly for contextually incomplete queries. Our system was validated for
both speed and accuracy through participation in the Ho Chi Minh City AI
Challenge 2024, where it successfully retrieved events from over 300 hours of
video. Further evaluation comparing RAPID with the baseline proposed by the
competition organizers demonstrated its superior effectiveness, highlighting
the strength and robustness of our approach.
Authors' comments: Under review at SoICT'24
Hao Fang, Xiaohang Sui, Hongyao Yu, Kuofeng Gao, Jiawei Kong, Sijin Yu, Bin Chen, Hao Wu et al.
Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To address these issues, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary database, RAG enhances diffusion models' generation and generalization ability while significantly reducing model parameters. Despite the great success, RAG may introduce novel security issues that warrant further investigation. In this paper, we reveal that the RDM is susceptible to backdoor attacks by proposing a multimodal contrastive attack approach named BadRDM. Our framework fully considers RAG's characteristics and is devised to manipulate the retrieved items for given text triggers, thereby further controlling the generated contents. Specifically, we first insert a tiny portion of images into the retrieval database as target toxicity surrogates. Subsequently, a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates. Furthermore, we enhance the attacks through novel entropy-based selection and generative augmentation strategies that can derive better toxicity surrogates. Extensive experiments on two mainstream tasks demonstrate that the proposed BadRDM achieves outstanding attack effects while preserving the model's benign utility.
Gabriel de Jesus, Sérgio Nunes
Searching for information on the internet and digital platforms to satisfy an
information need requires effective retrieval solutions. However, such
solutions are not yet available for Tetun, making it challenging to find
relevant documents for text-based search queries in this language. To address
these challenges, we investigate Tetun text retrieval with a focus on the
ad-hoc retrieval task. The study begins by developing essential language
resources -- including a list of stopwords, a stemmer, and a test collection --
which serve as foundational components for solutions tailored to Tetun text
retrieval. Various strategies are investigated using both document titles and
content to evaluate retrieval effectiveness. The results demonstrate that
retrieving document titles, after removing hyphens and apostrophes without
applying stemming, significantly improves retrieval performance compared to the
baseline. Efficiency increases by 31.37%, while effectiveness achieves an
average relative gain of +9.40% in MAP@10 and +30.35% in NDCG@10 with DFR BM25.
Beyond the top-10 cutoff point, Hiemstra LM shows strong performance across
various retrieval strategies and evaluation metrics. Contributions of this work
include the development of Labadain-Stopwords (a list of 160 Tetun stopwords),
Labadain-Stemmer (a Tetun stemmer with three variants), and
Labadain-Avaliadór (a Tetun test collection containing 59 topics, 33,550
documents, and 5,900 qrels). We make all resources publicly accessible to
facilitate future research in Tetun information retrieval.
Authors' comments: Version 3
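The title preprocessing mentioned in the Tetun abstract (removing hyphens and apostrophes, with no stemming) might look like the sketch below. It assumes that "removing" means deleting the characters rather than replacing them with spaces, and that both the straight and the typographic apostrophe should be handled. The function name `normalize_title` and the sample title are illustrative and not taken from the paper's resources.

```python
# Characters to delete: hyphen, straight apostrophe, typographic apostrophe.
_REMOVE = str.maketrans("", "", "-'\u2019")


def normalize_title(title: str) -> str:
    """Delete hyphens and apostrophes from a title; no stemming is applied."""
    return title.translate(_REMOVE)


sample = normalize_title("Ema sira-nia lian ne'e")
```

The same function would be applied to both indexed titles and incoming queries so that the two sides match on identical token forms.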
Weijie Chen, Ting Bai, Jinbo Su, Jian Luan, Wei Liu, Chuan Shi
Large language models with retrieval-augmented generation encounter a pivotal challenge in intricate retrieval tasks, e.g., multi-hop question answering, which requires the model to navigate across multiple documents and generate comprehensive responses based on fragmented information. To tackle this challenge, we introduce a novel Knowledge Graph-based RAG framework with a hierarchical knowledge retriever, termed KG-Retriever. The retrieval indexing in KG-Retriever is constructed on a hierarchical index graph that consists of a knowledge graph layer and a collaborative document layer. The associative nature of graph structures is fully utilized to strengthen intra-document and inter-document connectivity, thereby fundamentally alleviating the information fragmentation problem and meanwhile improving the retrieval efficiency in cross-document retrieval of LLMs. With the coarse-grained collaborative information from neighboring documents and concise information from the knowledge graph, KG-Retriever achieves marked improvements on five public QA datasets, showing the effectiveness and efficiency of our proposed RAG framework.
Suyuan Huang, Chao Zhang, Yuanyuan Wu, Haoxin Zhang, Yuan Wang, Maolin Wang, Shaosheng Cao, Tong Xu et al.
Dense retrieval in most industries employs dual-tower architectures to retrieve query-relevant documents. Due to online deployment requirements, existing real-world dense retrieval systems mainly enhance performance by designing negative sampling strategies, overlooking the advantages of scaling up. Recently, Large Language Models (LLMs) have exhibited superior performance that can be leveraged for scaling up dense retrieval. However, scaling up retrieval models significantly increases online query latency. To address this challenge, we propose ScalingNote, a two-stage method to exploit the scaling potential of LLMs for retrieval while maintaining online query latency. The first stage is training dual towers, both initialized from the same LLM, to unlock the potential of LLMs for dense retrieval. Then, we distill only the query tower using mean squared error loss and cosine similarity to reduce online costs. Through theoretical analysis and comprehensive offline and online experiments, we show the effectiveness and efficiency of ScalingNote. Our two-stage scaling method outperforms end-to-end models and verifies the scaling law of dense retrieval with LLMs in industrial scenarios, enabling cost-effective scaling of dense retrieval systems. Our online method incorporating ScalingNote significantly enhances the relevance between retrieved documents and queries.
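The query-tower distillation objective in the ScalingNote abstract combines mean squared error with cosine similarity. A minimal sketch of such a loss for one pair of embeddings is given below. The abstract does not say how the two terms are weighted or combined, so the additive form and the `cos_weight` parameter are assumptions, and plain Python lists stand in for real model outputs.

```python
import math
from typing import Sequence


def distill_loss(
    student: Sequence[float],
    teacher: Sequence[float],
    cos_weight: float = 1.0,
) -> float:
    """MSE plus (1 - cosine similarity) between student and teacher query embeddings."""
    if len(student) != len(teacher) or not teacher:
        raise ValueError("embeddings must be non-empty and of equal length")
    mse = sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)
    dot = sum(s * t for s, t in zip(student, teacher))
    norm = math.sqrt(sum(s * s for s in student)) * math.sqrt(sum(t * t for t in teacher))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    cosine = dot / norm
    return mse + cos_weight * (1.0 - cosine)


loss_same = distill_loss([3.0, 4.0], [3.0, 4.0])
loss_orthogonal = distill_loss([1.0, 0.0], [0.0, 1.0])
```

The MSE term pulls the student embedding toward the teacher in both direction and magnitude, while the cosine term penalizes only directional mismatch. Distilling the query tower alone leaves document embeddings to be computed offline by the larger model.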
Aniket Deroy, Subhankar Maity
Code-mixing, the integration of lexical and grammatical elements from
multiple languages within a single sentence, is a widespread linguistic
phenomenon, particularly prevalent in multilingual societies. In India, social
media users frequently engage in code-mixed conversations using the Roman
script, especially among migrant communities who form online groups to share
relevant local information. This paper focuses on the challenges of extracting
relevant information from code-mixed conversations, specifically within Roman
transliterated Bengali mixed with English. This study presents a novel approach
to address these challenges by developing a mechanism to automatically identify
the most relevant answers from code-mixed conversations. We have experimented
with a dataset comprising queries and documents from Facebook, and Query
Relevance files (QRels) to aid in this task. Our results demonstrate the
effectiveness of our approach in extracting pertinent information from complex,
code-mixed digital conversations, contributing to the broader field of natural
language processing in multilingual and informal text environments. We use
GPT-3.5 Turbo via prompting, along with the sequential nature of relevant
documents, to frame a mathematical model that helps detect the relevant
documents for a query.
Authors' comments: Final and Updated version
Yuhang Liu, Xueyu Hu, Shengyu Zhang, Jingyuan Chen, Fan Wu, Fei Wu
Retrieval-Augmented Generation (RAG) has proven to be an effective method for
mitigating hallucination issues inherent in large language models (LLMs).
Previous approaches typically train retrievers based on semantic similarity,
lacking optimization for RAG. More recent works have proposed aligning
retrievers with the preference signals of LLMs. However, these preference
signals are often difficult for dense retrievers, which typically have weaker
language capabilities, to understand and learn effectively. Drawing inspiration
from pedagogical theories like Guided Discovery Learning, we propose a novel
framework, FiGRet (Fine-grained Guidance for Retrievers), which leverages the
language capabilities of LLMs to construct examples from a more granular,
information-centric perspective to guide the learning of retrievers.
Specifically, our method utilizes LLMs to construct easy-to-understand examples
from samples where the retriever performs poorly, focusing on three learning
objectives highly relevant to the RAG scenario: relevance, comprehensiveness,
and purity. These examples serve as scaffolding to ultimately align the
retriever with the LLM's preferences. Furthermore, we employ a dual curriculum
learning strategy and leverage the reciprocal feedback between LLM and
retriever to further enhance the performance of the RAG system. A series of
experiments demonstrate that our proposed framework enhances the performance of
RAG systems equipped with different retrievers and is applicable to various
LLMs.
Authors' comments: 13 pages, 4 figures
Qingfei Zhao, Ruobing Wang, Xin Wang, Daren Zha, Nan Mu
Retrieval-Augmented Generation (RAG) has emerged as a reliable external
knowledge augmentation technique to mitigate hallucination issues and
parameterized knowledge limitations in Large Language Models (LLMs). Existing
Adaptive RAG (ARAG) systems struggle to effectively explore multiple retrieval
sources due to their inability to select the right source at the right time. To
address this, we propose a multi-source ARAG framework, termed MSPR, which
synergizes reasoning and preference-driven retrieval to adaptively decide "when
and what to retrieve" and "which retrieval source to use". To better adapt to
retrieval sources of differing characteristics, we also employ retrieval action
adjustment and answer feedback strategies. They enable our framework to fully
explore the high-quality primary source while supplementing it with secondary
sources at the right time. Extensive and multi-dimensional experiments
conducted on three datasets demonstrate the superiority and effectiveness of
MSPR.
Authors' comments: 5 pages, 1 figure
Ziting Wang, Haitao Yuan, Wei Dong, Gao Cong, Feifei Li
Large Language Models (LLMs) have demonstrated remarkable generation capabilities but often struggle to access up-to-date information, which can lead to hallucinations. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating knowledge from external databases, enabling more accurate and relevant responses. Due to the context window constraints of LLMs, it is impractical to input the entire external database context directly into the model. Instead, only the most relevant information, referred to as chunks, is selectively retrieved. However, current RAG research faces three key challenges. First, existing solutions often select each chunk independently, overlooking potential correlations among them. Second, in practice the utility of chunks is non-monotonic, meaning that adding more chunks can decrease overall utility. Traditional methods emphasize maximizing the number of included chunks, which can inadvertently compromise performance. Third, each type of user query possesses unique characteristics that require tailored handling, an aspect that current approaches do not fully consider. To overcome these challenges, we propose a cost-constrained retrieval optimization system, CORAG, for retrieval-augmented generation. We employ a Monte Carlo Tree Search (MCTS) based policy framework to find optimal chunk combinations sequentially, allowing for a comprehensive consideration of correlations among chunks. Additionally, rather than viewing budget exhaustion as a termination condition, we integrate budget constraints into the optimization of chunk combinations, effectively addressing the non-monotonicity of chunk utility.
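The non-monotonic utility and budget constraint described in the CORAG abstract can be made concrete with a toy example. The sketch below searches all chunk combinations that fit a cost budget and keeps the one with the highest utility. Exhaustive search is used only because the example is tiny; it is not the MCTS-based policy the abstract proposes. The utility function (distinct facts covered minus a per-chunk noise penalty), the chunk fields, and the names `best_combination` and `toy_utility` are assumptions for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Chunk = Dict[str, object]


def toy_utility(chunks: Sequence[Chunk]) -> float:
    """Hypothetical utility: distinct facts covered minus 0.5 per included chunk."""
    facts = set()
    for chunk in chunks:
        facts |= chunk["facts"]
    return len(facts) - 0.5 * len(chunks)


def best_combination(
    chunks: List[Chunk],
    budget: float,
    utility: Callable[[Sequence[Chunk]], float],
) -> Tuple[Tuple[int, ...], float]:
    """Return (indices, utility) of the best chunk subset whose total cost fits the budget."""
    best: Tuple[int, ...] = ()
    best_value = utility([])
    for size in range(1, len(chunks) + 1):
        for combo in combinations(range(len(chunks)), size):
            if sum(chunks[i]["cost"] for i in combo) > budget:
                continue  # over budget: skip this combination
            value = utility([chunks[i] for i in combo])
            if value > best_value:
                best, best_value = combo, value
    return best, best_value


chunks = [
    {"facts": {"a", "b"}, "cost": 2},
    {"facts": {"a"}, "cost": 1},  # redundant with the first chunk
    {"facts": {"c"}, "cost": 1},
]
selection, value = best_combination(chunks, budget=4, utility=toy_utility)
```

All three chunks fit within the budget of 4, yet the best selection leaves out the redundant chunk. Spending the whole budget is therefore not the same as maximizing utility, which is the reason for treating the budget as a constraint inside the optimization rather than as a stopping rule.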
Zijia Zhao, Longteng Guo, Tongtian Yue, Erdong Hu, Shuai Shao, Zehuan Yuan, Hua Huang, Jing Liu
In this paper, we investigate the task of general conversational image retrieval on open-domain images. The objective is to search for images based on interactive conversations between humans and computers. To advance this task, we curate a dataset called ChatSearch. This dataset includes a multi-round multimodal conversational context query for each target image, thereby requiring the retrieval system to find the accurate image from the database. Simultaneously, we propose a generative retrieval model named ChatSearcher, which is trained end-to-end to accept/produce interleaved image-text inputs/outputs. ChatSearcher exhibits strong capability in reasoning with multimodal context and can leverage world knowledge to yield visual retrieval results. It demonstrates superior performance on the ChatSearch dataset and also achieves competitive results on other image retrieval tasks and visual conversation tasks. We anticipate that this work will inspire further research on interactive multimodal retrieval systems. Our dataset will be available at https://github.com/joez17/ChatSearch.