Martin Böckling, Heiko Paulheim, Andreea Iana
Large Language Models (LLMs) have showcased impressive reasoning abilities,
but often suffer from hallucinations or outdated knowledge. Knowledge Graph
(KG)-based Retrieval-Augmented Generation (RAG) remedies these shortcomings by
grounding LLM responses in structured external information from a knowledge
base. However, many KG-based RAG approaches struggle with (i) aligning KG and
textual representations, (ii) balancing retrieval accuracy and efficiency, and
(iii) adapting to dynamically updated KGs. In this work, we introduce
Walk&Retrieve, a simple yet effective KG-based framework that leverages
walk-based graph traversal and knowledge verbalization for corpus generation
for zero-shot RAG. Built around efficient KG walks, our method does not require
fine-tuning on domain-specific data, enabling seamless adaptation to KG
updates, reducing computational overhead, and allowing integration with any
off-the-shelf backbone LLM. Despite its simplicity, Walk&Retrieve performs
competitively, often outperforming existing RAG systems in response accuracy
and hallucination reduction. Moreover, it demonstrates lower query latency and
robust scalability to large KGs, highlighting the potential of lightweight
retrieval strategies as strong baselines for future RAG research.
Authors' comments: Accepted at the Information Retrieval's Role in RAG Systems (IR-RAG
2025) in conjunction with SIGIR 2025
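As a rough illustration of the walk-then-verbalize idea described in the abstract above, the following minimal sketch samples a walk over a toy knowledge graph and turns it into a sentence that could be added to a retrieval corpus. The triples, the function names (`build_adjacency`, `random_walk`, `verbalize`), and the use of uniform random walks are assumptions made for illustration; the abstract does not specify the walk strategy or the verbalization template.

```python
import random

# Toy knowledge graph as (subject, relation, object) triples.
# The contents are illustrative only and do not come from the paper.
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
    ("Inception", "released_in", "2010"),
]


def build_adjacency(triples):
    """Map each subject to its outgoing (relation, object) edges."""
    adj = {}
    for subj, rel, obj in triples:
        adj.setdefault(subj, []).append((rel, obj))
    return adj


def random_walk(adj, start, max_hops, rng):
    """Follow up to max_hops randomly chosen outgoing edges from start."""
    walk, node = [], start
    for _ in range(max_hops):
        edges = adj.get(node)
        if not edges:
            break  # dead end: no outgoing edges
        rel, nxt = rng.choice(edges)
        walk.append((node, rel, nxt))
        node = nxt
    return walk


def verbalize(walk):
    """Turn a walk into plain text (hypothetical template: 'subject relation object')."""
    if not walk:
        return ""
    parts = [f"{s} {r.replace('_', ' ')} {o}" for s, r, o in walk]
    return ". ".join(parts) + "."


adj = build_adjacency(TRIPLES)
print(verbalize(random_walk(adj, "Inception", 2, random.Random(0))))
```

Because the corpus is built from walks alone, regenerating the walks that touch changed nodes would be enough to reflect a KG update in this sketch, with no model retraining involved.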
Jinyu Guo, Xunlei Chen, Qiyang Xia, Zhaokun Wang, Jie Ou, Libo Qin, Shunyu Yao, Wenhong Tian
Retrieval-Augmented Generation (RAG) encounters efficiency challenges when
scaling to massive knowledge bases while preserving contextual relevance. We
propose Hash-RAG, a framework that integrates deep hashing techniques with
systematic optimizations to address these limitations. Our queries directly
learn binary hash codes from knowledge base code, eliminating intermediate
feature extraction steps and significantly reducing storage and computational
overhead. Building upon this hash-based efficient retrieval framework, we
establish the foundation for fine-grained chunking. Consequently, we design a
Prompt-Guided Chunk-to-Context (PGCC) module that leverages retrieved
hash-indexed propositions and their original document segments through prompt
engineering to enhance the LLM's contextual awareness. Experimental evaluations
on NQ, TriviaQA, and HotpotQA datasets demonstrate that our approach achieves a
90% reduction in retrieval time compared to conventional methods while
maintaining considerable recall performance. Additionally, the proposed system
outperforms retrieval/non-retrieval baselines by 1.4-4.3% in EM scores.
Authors' comments: Accepted at Findings of ACL 2025
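To make the efficiency argument above concrete, here is a minimal sketch of retrieval over binary hash codes using Hamming distance. It is not the Hash-RAG method: the paper learns its codes with deep hashing, whereas this sketch assumes a simple sign-based binarization (`sign_hash`) as a stand-in, and all names and data are hypothetical.

```python
def sign_hash(vector):
    """Binarize a real-valued vector by sign into an integer bit code.

    Stand-in for a learned hash function (an assumption, not the paper's model).
    """
    code = 0
    for value in vector:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def retrieve(query_code, index, k):
    """Return the k document ids whose codes are closest to the query code."""
    ranked = sorted(
        index.items(),
        key=lambda item: (hamming(query_code, item[1]), item[0]),
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical 4-bit index; real systems would use much longer codes.
index = {"a": 0b1010, "b": 0b0101, "c": 0b1000}
print(retrieve(sign_hash([0.5, -1.0, 2.0, -0.1]), index, 2))
```

Each document is stored as a handful of bits and compared with XOR plus a bit count, which is where the storage and retrieval-time savings of hashing-based retrieval generally come from.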
Jie Ou, Jinyu Guo, Shuaihong Jiang, Zhaokun Wang, Libo Qin, Shunyu Yao, Wenhong Tian
Retrieval-augmented generation (RAG) has emerged as a pivotal method for
expanding the knowledge of large language models. To handle complex queries
more effectively, researchers developed Adaptive-RAG (A-RAG) to enhance the
generated quality through multiple interactions with external knowledge bases.
Despite its effectiveness, A-RAG exacerbates the pre-existing efficiency
challenges inherent in RAG, which are attributable to its reliance on multiple
iterations of generation. Existing A-RAG approaches process all retrieved
contents from scratch. However, they ignore the situation where there is a
significant overlap in the content of the retrieval results across rounds. The
overlapping content is redundantly represented, which leads to a large
proportion of repeated computations, thus affecting the overall efficiency. To
address this issue, this paper introduces a model-agnostic approach that can be
generally applied to A-RAG methods, which is dedicated to reducing the
redundant representation process caused by the overlapping of retrieval
results. Specifically, we use cache access and parallel generation to speed up
the prefilling and decoding stages respectively. Additionally, we propose
an instruction-driven module that guides the model to attend more effectively
to each part of the content in a way better suited to LLMs. Experiments
show that our approach achieves average speedups of 2.79x and 2.33x for
prefilling and decoding, respectively, while maintaining equal
generation quality.
Authors' comments: Accepted at Findings of ACL 2025
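The redundancy described above can be illustrated with a minimal cache sketch: passages retrieved again in a later round reuse their stored representation instead of being re-encoded. The class `PassageCache`, the helper `encode_round`, and the placeholder `encode` callable are hypothetical. In a real LLM the cached object would be the passage's prefill states, and reusing them across different contexts needs extra care that this sketch does not model.

```python
class PassageCache:
    """Cache per-passage representations across retrieval rounds."""

    def __init__(self, encode):
        self._encode = encode   # expensive function, e.g. a prefill pass (assumed)
        self._store = {}
        self.misses = 0         # number of times encode was actually called

    def get(self, doc_id, text):
        if doc_id not in self._store:
            self._store[doc_id] = self._encode(text)
            self.misses += 1
        return self._store[doc_id]


def encode_round(cache, retrieved):
    """Encode one round of (doc_id, text) results, reusing cached entries."""
    return [cache.get(doc_id, text) for doc_id, text in retrieved]


# Placeholder encoder; a real system would run the model here.
cache = PassageCache(encode=lambda text: text.upper())
round_1 = [("d1", "alpha"), ("d2", "beta")]
round_2 = [("d2", "beta"), ("d3", "gamma")]   # d2 overlaps with round 1
encode_round(cache, round_1)
encode_round(cache, round_2)
print(cache.misses)  # d2 is encoded only once
```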
Andrei-Laurentiu Bornea, Fadhel Ayed, Antonio De Domenico, Nicola Piovesan, Tareq Si Salem, Ali Maatouk
Artificial intelligence will be one of the key pillars of the next generation
of mobile networks (6G), as it is expected to provide novel added-value
services and improve network performance. In this context, large language
models have the potential to revolutionize the telecom landscape through intent
comprehension, intelligent knowledge retrieval, coding proficiency, and
cross-domain orchestration capabilities. This paper presents Telco-oRAG, an
open-source Retrieval-Augmented Generation (RAG) framework optimized for
answering technical questions in the telecommunications domain, with a
particular focus on 3GPP standards. Telco-oRAG introduces a hybrid retrieval
strategy that combines 3GPP domain-specific retrieval with web search,
supported by glossary-enhanced query refinement and a neural router for
memory-efficient retrieval. Our results show that Telco-oRAG improves the
accuracy in answering 3GPP-related questions by up to 17.6% and achieves a
10.6% improvement in lexicon queries compared to baselines. Furthermore,
Telco-oRAG reduces memory usage by 45% through targeted retrieval of relevant
3GPP series compared to baseline RAG, and enables open-source LLMs to reach
GPT-4-level accuracy on telecom benchmarks.
Authors' comments: 12 pages, 10 figures, 4 tables
Zongyuan Li, Pengfei Li, Runnan Qi, Yanan Ni, Lumin Jiang, Hui Wu, Xuebo Zhang, Kuihua Huang et al.
The lack of domain-specific data in the pre-training of Large Language Models (LLMs) severely limits LLM-based decision systems in specialized applications, while post-training a model for such scenarios requires significant computational resources. In this paper, we present Retrial-Augmented Learning (RAL), a reward-free self-supervised learning framework for LLMs that operates without model training. By developing Retrieval-Augmented Generation (RAG) into a module for organizing intermediate data, we realize a three-stage autonomous knowledge generation process: proposing a hypothesis, validating the hypothesis, and generating the knowledge. The method is evaluated in the LLM-PySC2 environment, a representative decision-making platform that combines sufficient complexity with domain-specific knowledge requirements. Experiments demonstrate that the proposed method effectively reduces hallucination by generating and utilizing validated knowledge, and increases decision-making performance at an extremely low cost. The approach also shows potential in out-of-distribution (OOD) tasks, robustness, and transferability, making it a low-cost yet effective solution for decision-making problems and autonomous knowledge generation.
Aarush Sinha
Training effective dense retrieval models often relies on hard negative (HN) examples mined from the document corpus via methods like BM25 or cross-encoders (CE), processes that can be computationally demanding and require full corpus access. This paper introduces a different approach, an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage, and then generates a hard negative example using only that query text. This corpus-free negative generation contrasts with standard mining techniques. We evaluated this LLM Query → LLM HN approach against traditional LLM Query → BM25 HN and LLM Query → CE HN pipelines using E5-Base and GTE-Base models on several BEIR benchmark datasets. Our results show the proposed all-LLM pipeline achieves performance identical to both the BM25 and the computationally intensive CE baselines across nDCG@10, Precision@10, and Recall@100 metrics. This demonstrates that our corpus-free negative generation method matches the effectiveness of complex, corpus-dependent mining techniques, offering a potentially simpler and more efficient pathway for training high-performance retrievers without sacrificing results. We make the dataset including the queries and the hard-negatives for all three methods publicly available at https://huggingface.co/collections/chungimungi/arxiv-hard-negatives-68027bbc601ff6cc8eb1f449.
Quanyu Long, Jianda Chen, Zhengyuan Liu, Nancy F. Chen, Wenya Wang, Sinno Jialin Pan
Large Language Models (LLMs) have demonstrated remarkable capabilities across
numerous tasks, yet they often rely on external context to handle complex
tasks. While retrieval-augmented frameworks traditionally focus on selecting
top-ranked documents in a single pass, many real-world scenarios demand
compositional retrieval, where multiple sources must be combined in a
coordinated manner. In this work, we propose a tri-encoder sequential retriever
that models this process as a Markov Decision Process (MDP), decomposing the
probability of retrieving a set of elements into a sequence of conditional
probabilities and allowing each retrieval step to be conditioned on previously
selected examples. We train the retriever in two stages: first, we efficiently
construct supervised sequential data for initial policy training; we then
refine the policy to align with the LLM's preferences using a reward grounded
in the structural correspondence of generated programs. Experimental results
show that our method consistently and significantly outperforms baselines,
underscoring the importance of explicitly modeling inter-example dependencies.
These findings highlight the potential of compositional retrieval for tasks
requiring multiple pieces of evidence or examples.
Authors' comments: 19 pages, 8 figures
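The decomposition mentioned in the abstract above is the chain rule of probability applied to an ordered selection. As a sketch, with the notation assumed here rather than taken from the paper (x is the query, z_1, ..., z_k are the retrieved elements in selection order):

```latex
p(z_1, \dots, z_k \mid x) \;=\; \prod_{i=1}^{k} p\!\left(z_i \mid x,\, z_1, \dots, z_{i-1}\right)
```

Each factor corresponds to one retrieval step conditioned on the query and on the elements already selected, which is what lets the process be treated as a sequential decision problem.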
Abraham Itzhak Weinberg
Image retrieval remains a challenging task due to the complex interaction between human visual perception, memory, and computational processes. Current image search engines often struggle to efficiently retrieve images based on natural language descriptions, as they rely on time-consuming preprocessing, tagging, and machine learning pipelines. This paper introduces the Human-Oriented Retrieval Search Engine for Images (HORSE), a novel approach that leverages neuro-symbolic indexing to improve image retrieval by focusing on human-oriented indexing. By integrating cognitive science insights with advanced computational techniques, HORSE enhances the retrieval process, making it more aligned with how humans perceive, store, and recall visual information. The neuro-symbolic framework combines the strengths of neural networks and symbolic reasoning, mitigating their individual limitations. The proposed system optimizes image retrieval, offering a more intuitive and efficient solution for users. We discuss the design and implementation of HORSE, highlight its potential applications in fields such as design error detection and knowledge management, and suggest future directions for research to further refine the system's metrics and capabilities.
Ming Pang, Chunyuan Yuan, Xiaoyu He, Zheng Fang, Donghao Xie, Fanyi Qu, Xue Jiang, Changping Peng et al.
Traditional sparse and dense retrieval methods struggle to leverage general
world knowledge and often fail to capture the nuanced features of queries and
products. With the advent of large language models (LLMs), industrial search
systems have started to employ LLMs to generate identifiers for product
retrieval. Commonly used identifiers include (1) static/semantic IDs and (2)
product term sets. The first approach requires creating a product ID system
from scratch, missing out on the world knowledge embedded within LLMs. While
the second approach leverages this general knowledge, the significant
difference in word distribution between queries and products means that
product-based identifiers often do not align well with user search queries,
leading to missed product recalls. Furthermore, when queries contain numerous
attributes, these algorithms generate a large number of identifiers, making it
difficult to assess their quality, which results in low overall recall
efficiency.
To address these challenges, this paper introduces a novel e-commerce
retrieval paradigm: the Generative Retrieval and Alignment Model (GRAM). GRAM
employs joint training on text information from both queries and products to
generate shared text identifier codes, effectively bridging the gap between
queries and products. This approach not only enhances the connection between
queries and products but also improves inference efficiency. The model uses a
co-alignment strategy to generate codes optimized for maximizing retrieval
efficiency. Additionally, it introduces a query-product scoring mechanism to
compare product values across different codes, further boosting retrieval
efficiency. Extensive offline and online A/B testing demonstrates that GRAM
significantly outperforms traditional models and the latest generative
retrieval models, confirming its effectiveness and practicality.
Authors' comments: Accepted by WWW2025
Yilong Xu, Jinhua Gao, Xiaoming Yu, Yuanhai Xue, Baolong Bi, Huawei Shen, Xueqi Cheng
Retrieval-Augmented Language Models boost task performance, owing to the
retriever that provides external knowledge. Although crucial, the retriever
primarily focuses on semantic relevance, which may not always be effective for
generation. Thus, utility-based retrieval has emerged as a promising topic,
prioritizing passages that provide valid benefits for downstream tasks.
However, due to insufficient understanding, capturing passage utility
accurately remains unexplored. This work proposes SCARLet, a framework for
training utility-based retrievers in RALMs, which incorporates two key factors,
multi-task generalization and inter-passage interaction. First, SCARLet
constructs shared context on which training data for various tasks is
synthesized. This mitigates semantic bias from context differences, allowing
retrievers to focus on learning task-specific utility for better task
generalization. Next, SCARLet uses a perturbation-based attribution method to
estimate passage-level utility for shared context, which reflects interactions
between passages and provides more accurate feedback. We evaluate our approach
on ten datasets across various tasks, both in-domain and out-of-domain, showing
that retrievers trained by SCARLet consistently improve the overall performance
of RALMs.
Authors' comments: 20 pages, 9 figures. Code will be released after review
Hsin-Ling Hsu, Jengnan Tzeng
Hybrid retrieval techniques in Retrieval-Augmented Generation (RAG) systems enhance information retrieval by combining dense and sparse (e.g., BM25-based) retrieval methods. However, existing approaches struggle with adaptability, as fixed weighting schemes fail to adjust to different queries. To address this, we propose DAT (Dynamic Alpha Tuning), a novel hybrid retrieval framework that dynamically balances dense retrieval and BM25 for each query. DAT leverages a large language model (LLM) to evaluate the effectiveness of the top-1 results from both retrieval methods, assigning an effectiveness score to each. It then calibrates the optimal weighting factor through effectiveness score normalization, ensuring a more adaptive and query-aware weighting between the two approaches. Empirical results show that DAT consistently and significantly outperforms fixed-weighting hybrid retrieval methods across various evaluation metrics. Even on smaller models, DAT delivers strong performance, highlighting its efficiency and adaptability.
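A minimal sketch of per-query weighting in the spirit of the abstract above: two effectiveness scores (assumed to come from an LLM judging each retriever's top-1 result) are normalized into a weight alpha, which then blends min-max-normalized dense and BM25 scores. The normalization `dense / (dense + bm25)`, the 0.5 fallback, and every function name here are assumptions for illustration; the abstract does not give the exact formula.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1]; constant maps become all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def dynamic_alpha(dense_eff, bm25_eff):
    """Turn two effectiveness scores into a dense-retrieval weight (assumed rule)."""
    total = dense_eff + bm25_eff
    if total == 0:
        return 0.5  # no signal from either side: fall back to equal weighting
    return dense_eff / total


def fuse(dense, bm25, alpha):
    """Rank documents by alpha * dense + (1 - alpha) * bm25 on normalized scores."""
    dense_n, bm25_n = min_max(dense), min_max(bm25)
    docs = set(dense_n) | set(bm25_n)
    fused = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * bm25_n.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Hypothetical scores: the judge rates the dense top-1 at 3 and the BM25 top-1 at 1.
alpha = dynamic_alpha(3, 1)
print(fuse({"a": 0.9, "b": 0.1}, {"a": 2.0, "b": 8.0}, alpha))
```

With a fixed alpha the ranking would be the same for every query; here a query where BM25's top result is judged better shifts the weight toward the sparse scores.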
Wenchao Gu, Juntao Chen, Yanlin Wang, Tianyue Jiang, Xingzhe Li, Mingwei Liu, Xilin Liu, Yuchi Ma et al.
Repository-level code generation remains challenging due to complex code dependencies and the limitations of large language models (LLMs) in processing long contexts. While retrieval-augmented generation (RAG) frameworks are widely adopted, the effectiveness of different retrieved information sources (contextual code, APIs, and similar snippets) has not been rigorously analyzed. Through an empirical study on two benchmarks, we demonstrate that in-context code and potential API information significantly enhance LLM performance, whereas retrieved similar code often introduces noise, degrading results by up to 15%. Based on these preliminary results, we propose AllianceCoder, a novel context-integrated method that employs chain-of-thought prompting to decompose user queries into implementation steps and retrieves APIs via semantic description matching. Through extensive experiments on CoderEval and RepoExec, AllianceCoder achieves state-of-the-art performance, improving Pass@1 by up to 20% over existing approaches.
Sejong Kim, Hyunseo Song, Hyunwoo Seo, Hyunjun Kim
Retrieval-Augmented Generation (RAG) has emerged as a promising framework to
mitigate hallucinations in Large Language Models (LLMs), yet its overall
performance is dependent on the underlying retrieval system. In the finance
domain, documents such as 10-K reports pose distinct challenges due to
domain-specific vocabulary and multi-hierarchical tabular data. In this work,
we introduce an efficient, end-to-end RAG pipeline that enhances retrieval for
financial documents through a three-phase approach: pre-retrieval, retrieval,
and post-retrieval. In the pre-retrieval phase, various query and corpus
preprocessing techniques are employed to enrich input data. During the
retrieval phase, we fine-tuned state-of-the-art (SOTA) embedding models with
domain-specific knowledge and implemented a hybrid retrieval strategy that
combines dense and sparse representations. Finally, the post-retrieval phase
leverages Direct Preference Optimization (DPO) training and document selection
methods to further refine the results. Evaluations on seven financial question
answering datasets (FinDER, FinQABench, FinanceBench, TATQA, FinQA, ConvFinQA,
and MultiHiertt) demonstrate substantial improvements in retrieval performance,
leading to more accurate and contextually appropriate generation. These
findings highlight the critical role of tailored retrieval techniques in
advancing the effectiveness of RAG systems for financial applications. A fully
replicable pipeline is available on GitHub:
https://github.com/seohyunwoo-0407/GAR.
Authors' comments: 15 pages, 3 figures, 11 tables. Accepted at ICLR 2025 Workshop on
Advances in Financial AI. Code available at
https://github.com/seohyunwoo-0407/GAR
Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.
Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, Wei Hu
In this paper, we identify a critical problem, "lost-in-retrieval", in
retrieval-augmented multi-hop question answering (QA): the key entities are
missed in LLMs' sub-question decomposition. "Lost-in-retrieval" significantly
degrades retrieval performance, which disrupts the reasoning chain and
leads to incorrect answers. To resolve this problem, we propose a
progressive retrieval and rewriting method, namely ChainRAG, which sequentially
handles each sub-question by completing missing key entities and retrieving
relevant sentences from a sentence graph for answer generation. Each step in
our retrieval and rewriting process builds upon the previous one, creating a
seamless chain that leads to accurate retrieval and answers. Finally, all
retrieved sentences and sub-question answers are integrated to generate a
comprehensive answer to the original question. We evaluate ChainRAG on three
multi-hop QA datasets (MuSiQue, 2Wiki, and HotpotQA) using three large
language models: GPT4o-mini, Qwen2.5-72B, and GLM-4-Plus. Empirical results
demonstrate that ChainRAG consistently outperforms baselines in both
effectiveness and efficiency.
Authors' comments: Accepted in the 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025)
Yingrui Yang, Parker Carlson, Yifan Qiao, Wentai Xie, Shanxiu He, Tao Yang
This paper studies fast fusion of dense retrieval and sparse lexical
retrieval, and proposes a cluster-based selective dense retrieval method called
CluSD guided by sparse lexical retrieval. CluSD takes a lightweight
cluster-based approach and exploits the overlap of sparse retrieval results and
embedding clusters in a two-stage selection process with an LSTM model to
quickly identify relevant clusters while incurring limited extra memory space
overhead. CluSD triggers partial dense retrieval and performs cluster-based
block disk I/O if needed. This paper evaluates CluSD and compares it with
several baselines for searching in-memory and on-disk MS MARCO and BEIR
datasets.
Authors' comments: This paper is accepted by ECIR'25
Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt
Retrieval, re-ranking, and retrieval-augmented generation (RAG) are critical
components of modern applications in information retrieval, question answering,
or knowledge-based text generation. However, existing solutions are often
fragmented, lacking a unified framework that easily integrates these essential
processes. The absence of a standardized implementation, coupled with the
complexity of retrieval and re-ranking workflows, makes it challenging for
researchers to compare and evaluate different approaches in a consistent
environment. While existing toolkits such as Rerankers and RankLLM provide
general-purpose reranking pipelines, they often lack the flexibility required
for fine-grained experimentation and benchmarking. In response to these
challenges, we introduce Rankify, a powerful and modular open-source toolkit
designed to unify retrieval, re-ranking, and RAG within a cohesive framework.
Rankify supports a wide range of retrieval techniques, including dense and
sparse retrievers, while incorporating state-of-the-art re-ranking models to
enhance retrieval quality. Additionally, Rankify includes a collection of
pre-retrieved datasets to facilitate benchmarking, available at Huggingface
(https://huggingface.co/datasets/abdoelsayed/reranking-datasets-light). To
encourage adoption and ease of integration, we provide comprehensive
documentation (http://rankify.readthedocs.io/), an open-source implementation
on GitHub (https://github.com/DataScienceUIBK/rankify), and a PyPI package for
easy installation (https://pypi.org/project/rankify/). As a unified and
lightweight framework, Rankify allows researchers and practitioners to advance
retrieval and re-ranking methodologies while ensuring consistency, scalability,
and ease of use.
Authors' comments: Work in Progress
Peter Baile Chen, Yi Zhang, Michael Cafarella, Dan Roth
Real-world open-domain questions can be complicated, particularly when answering them requires information from multiple sources. LLMs have demonstrated impressive performance in decomposing complex tasks into simpler steps, and previous work has used this ability to improve retrieval for complex questions. However, LLMs' decomposition of questions is unaware of what data is available and how it is organized, often leading to sub-optimal retrieval performance. Recent efforts in agentic RAG propose performing retrieval iteratively, where a follow-up query is derived as an action based on previous rounds of retrieval. While this provides one way of interacting with the data collection, agentic RAG's exploration of data is inefficient because successive queries depend on previous results rather than being guided by the organization of available data in the collection. To address this problem, we propose ARM, an LLM-based retrieval method that aims to better align the question with the organization of the data collection by exploring relationships among data objects beyond matching the utterance of the query, thus leading to a retrieve-all-at-once solution for complex queries. We evaluated ARM on two datasets, Bird and OTT-QA. On Bird, it outperforms standard RAG with query decomposition by up to 5.2 pt in execution accuracy and agentic RAG (ReAct) by up to 15.9 pt. On OTT-QA, it achieves up to 5.5 pt and 19.3 pt higher F1 match scores compared to these approaches.
Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved from a knowledge base. However, its effectiveness is fundamentally constrained by the reliability of both the retriever and the knowledge base. In real-world scenarios, imperfections in these components often lead to the retrieval of noisy, irrelevant, or misleading counterfactual information, ultimately undermining the trustworthiness of RAG systems. To address this challenge, we propose Robust Fine-Tuning (RbFT), a method designed to enhance the resilience of LLMs against retrieval defects through two targeted fine-tuning tasks. Experimental results demonstrate that RbFT significantly improves the robustness of RAG systems across diverse retrieval conditions, surpassing existing methods while maintaining high inference efficiency and compatibility with other robustness techniques.
Long Nguyen, Huy Nguyen, Bao Khuu, Huy Luu, Huy Le, Tuan Nguyen, Tho Quan
Retrieving events from videos using text queries has become increasingly
challenging due to the rapid growth of multimedia content. Existing methods for
text-based video event retrieval often focus heavily on object-level
descriptions, overlooking the crucial role of contextual information. This
limitation is especially apparent when queries lack sufficient context, such as
missing location details or ambiguous background elements. To address these
challenges, we propose a novel system called RAPID (Retrieval-Augmented
Parallel Inference Drafting), which leverages advancements in Large Language
Models (LLMs) and prompt-based learning to semantically correct and enrich user
queries with relevant contextual information. These enriched queries are then
processed through parallel retrieval, followed by an evaluation step to select
the most relevant results based on their alignment with the original query.
Through extensive experiments on our custom-developed dataset, we demonstrate
that RAPID significantly outperforms traditional retrieval methods,
particularly for contextually incomplete queries. Our system was validated for
both speed and accuracy through participation in the Ho Chi Minh City AI
Challenge 2024, where it successfully retrieved events from over 300 hours of
video. Further evaluation comparing RAPID with the baseline proposed by the
competition organizers demonstrated its superior effectiveness, highlighting
the strength and robustness of our approach.
Authors' comments: Under review at SoICT'24