Yuguo Yin, Yuxin Xie, Wenyuan Yang, Dongchao Yang, Jinghan Ru, Xianwei Zhuang, Liming Liang, Yuexian Zou
Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to retrieve audio clips or multilingual texts from databases. However, existing ML-ATR schemes suffer from inconsistencies for instance similarity matching across languages. We theoretically analyze the inconsistency in terms of both multilingual modal alignment direction error and weight error, and propose the theoretical weight error upper bound for quantifying the inconsistency. Based on the analysis of the weight error upper bound, we find that the inconsistency problem stems from the data distribution error caused by random sampling of languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive learning and audio-English co-anchor contrastive learning, aiming to mitigate the negative impact of data distribution error on recall and consistency in ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets show that our scheme achieves state-of-the-art performance on recall and consistency metrics for eight mainstream languages, including English. Our code will be available at https://github.com/ATRI-ACL/ATRI-ACL.
Mingyi Jia, Zhihao Jia, Junwen Duan, Yan Song, Jianxin Wang
Retrieval-Augmented Large Language Models (LLMs), which integrate external knowledge, have shown remarkable performance in medical domains, including clinical diagnosis. However, existing RAG methods often struggle to tailor retrieval strategies to diagnostic difficulty and input sample informativeness. This limitation leads to excessive and often unnecessary retrieval, impairing computational efficiency and increasing the risk of introducing noise that can degrade diagnostic accuracy. To address this, we propose ICA-RAG (Information Completeness Guided Adaptive Retrieval-Augmented Generation), a novel framework for enhancing RAG reliability in disease diagnosis. ICA-RAG utilizes an adaptive control module to assess the necessity of retrieval based on the input's information completeness. By optimizing retrieval and incorporating knowledge filtering, ICA-RAG better aligns retrieval operations with clinical requirements. Experiments on three Chinese electronic medical record datasets demonstrate that ICA-RAG significantly outperforms baseline methods, highlighting its effectiveness in clinical diagnosis.
Feiyuan Zhang, Dezhi Zhu, James Ming, Yilun Jin, Di Chai, Liu Yang, Han Tian, Zhaoxin Fan et al.
Retrieval-Augmented Generation (RAG) systems have shown substantial benefits in applications such as question answering and multi-turn dialogue \citep{lewis2020retrieval}. However, traditional RAG methods, while leveraging static knowledge bases, often overlook the potential of dynamic historical information in ongoing conversations. To bridge this gap, we introduce DH-RAG, a Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue. DH-RAG is inspired by human cognitive processes that utilize both long-term memory and immediate historical context in conversational responses \citep{stafford1987conversational}. DH-RAG is structured around two principal components: a History-Learning based Query Reconstruction Module, designed to generate effective queries by synthesizing current and prior interactions, and a Dynamic History Information Updating Module, which continually refreshes historical context throughout the dialogue. The center of DH-RAG is a Dynamic Historical Information database, which is further refined by three strategies within the Query Reconstruction Module: Historical Query Clustering, Hierarchical Matching, and Chain of Thought Tracking. Experimental evaluations show that DH-RAG significantly surpasses conventional models on several benchmarks, enhancing response relevance, coherence, and dialogue quality.
Pengfei He, Shaowei Wang, Tse-Hsun Chen
Retrieval-Augmented Generation (RAG) enhances coding tasks by incorporating
retrieved code examples into prompts. However, lengthy prompts, often exceeding
tens of thousands of tokens, introduce challenges related to limited context
windows of language models (LMs) and high computational costs. Existing prompt
compression techniques focus on natural language, lacking tailored solutions
for code. To address the gap, we propose CodePromptZip, a framework that
compresses code examples before integrating them into RAG workflows. Our framework
employs a type-aware, priority-driven strategy to construct training samples
for training a code compression model. By using program analysis, we identify
token types (e.g., Identifier) and perform ablation analysis to rank their
removal priorities based on their impact on task performance. We then train a
small LM as the compressor on these samples, enabling flexible compression
conditioned on specified ratios while minimizing performance degradation.
Specifically, the compressor is augmented with a copy mechanism, allowing tokens
to be directly copied from the original code snippets. Evaluation results show
that CodePromptZip surpasses SOTA entropy-based and distillation-based
baselines, improving by 23.4%, 28.7%, and 8.7% over the best baseline for
Assertion Generation, Bugs2Fix, and Code Suggestion, respectively.
Authors' comments: 14 pages, 14 figures
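The type-aware, priority-driven removal described in the CodePromptZip abstract above can be illustrated with a minimal sketch. It only mimics the idea of dropping tokens by type until a target ratio is reached, which the abstract says is used to build training samples; it is not the trained compressor. The priority order, the token-type labels other than `Identifier`, and the function name `compress` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical removal priority: types listed first are removed first.
# The paper derives its ranking from an ablation study; this order is invented.
REMOVAL_PRIORITY = ["Symbol", "Keyword", "Identifier"]


def compress(tokens: List[Tuple[str, str]], ratio: float) -> List[str]:
    """Remove about `ratio` of the (text, type) tokens, most removable types first."""
    budget = int(len(tokens) * ratio)  # number of tokens we are allowed to drop
    dropped = set()
    for token_type in REMOVAL_PRIORITY:
        for index, (_, kind) in enumerate(tokens):
            if len(dropped) >= budget:
                break  # budget exhausted, stop scanning this type
            if kind == token_type:
                dropped.add(index)
    # Keep the surviving tokens in their original order.
    return [text for index, (text, _) in enumerate(tokens) if index not in dropped]


# Invented example statement: `int x = 1 ;`
example = [("int", "Keyword"), ("x", "Identifier"), ("=", "Symbol"),
           ("1", "Literal"), (";", "Symbol")]
compressed = compress(example, 0.4)  # drops the two Symbol tokens
```

With a ratio of 0.4 on five tokens, the budget is two tokens, so only the two `Symbol` tokens are removed and the rest keep their order.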
Qingfa Xiao, Jiachuan Wang, Haoyang Li, Cheng Deng, Jiaqi Tang, Shuangyin Li, Yongqi Zhang, Jun Wang et al.
Recent advances in large language models (LLMs) have showcased exceptional performance in long-context tasks, while facing significant inference efficiency challenges with limited GPU memory. Existing solutions first proposed a sliding-window approach that accumulates a set of historical key-value (KV) pairs for reuse; later improvements selectively retain subsets of them at each step. However, due to the sparse attention distribution across a long context, it is hard to identify and recall relevant KV pairs, as the attention is distracted by massive candidate pairs. Additionally, we found it promising to select representative tokens as probe-Query in each sliding window to effectively represent the entire context, which is an approach overlooked by existing methods. Thus, we propose ActQKV, a training-free, Activation-aware approach that dynamically determines probe-Query and leverages it to retrieve the relevant KV pairs for inference. Specifically, ActQKV monitors a token-level indicator, Activation Bias, within each context window, enabling the proper construction of probe-Query for retrieval at the pre-filling stage. To accurately recall the relevant KV pairs and minimize the irrelevant ones, we design a dynamic KV cut-off mechanism guided by information density across layers at the decoding stage. Experiments on the Long-Bench and $\infty$ Benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
Peter Carragher, Abhinand Jha, R Raghav, Kathleen M. Carley
Large Language Models (LLMs) demonstrate remarkable capabilities in question answering (QA), but metrics for assessing their reliance on memorization versus retrieval remain underdeveloped. Moreover, while finetuned models are state-of-the-art on closed-domain tasks, general-purpose models like GPT-4o exhibit strong zero-shot performance. This raises questions about the trade-offs between memorization, generalization, and retrieval. In this work, we analyze the extent to which multimodal retrieval-augmented VLMs memorize training data compared to baseline VLMs. Using the WebQA benchmark, we contrast finetuned models with baseline VLMs on multihop retrieval and question answering, examining the impact of finetuning on data memorization. To quantify memorization in end-to-end retrieval and QA systems, we propose several proxy metrics by investigating instances where QA succeeds despite retrieval failing. In line with existing work, we find that finetuned models rely more heavily on memorization than retrieval-augmented VLMs, and achieve higher accuracy as a result (72% vs 52% on WebQA test set). Finally, we present the first empirical comparison of the parametric effect between text and visual modalities. Here, we find that image-based questions have parametric response rates that are consistently 15-25% higher than for text-based questions in the WebQA dataset. As such, our measures pose a challenge for future work, both to account for differences in model memorization across different modalities and more generally to reconcile memorization and generalization in joint Retrieval-QA tasks.
DongGeon Lee, Hwanjo Yu
Hallucinations in large language model (LLM) outputs severely limit their
reliability in knowledge-intensive tasks such as question answering. To address
this challenge, we introduce REFIND (Retrieval-augmented Factuality
hallucINation Detection), a novel framework that detects hallucinated spans
within LLM outputs by directly leveraging retrieved documents. As part of
REFIND, we propose the Context Sensitivity Ratio (CSR), a novel metric that
quantifies the sensitivity of LLM outputs to retrieved evidence. This
innovative approach enables REFIND to efficiently and accurately detect
hallucinations, setting it apart from existing methods. In the evaluation,
REFIND demonstrated robustness across nine languages, including low-resource
settings, and significantly outperformed baseline models, achieving superior
IoU scores in identifying hallucinated spans. This work highlights the
effectiveness of quantifying context sensitivity for hallucination detection,
thereby paving the way for more reliable and trustworthy LLM applications
across diverse languages. Our code is available at
https://github.com/oneonlee/REFIND.
Authors' comments: Accepted to SemEval@ACL 2025
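The REFIND abstract above says only that CSR quantifies how sensitive LLM outputs are to retrieved evidence. One plausible reading, sketched below purely as an assumption, is a per-token ratio between the probability the model assigns to a token with the retrieved documents in the prompt and the probability without them. The function name, the ratio form, and the numbers are illustrative, and how the ratio is thresholded to mark hallucinated spans is not stated in the abstract and is left out.

```python
from typing import List


def context_sensitivity(p_with: List[float], p_without: List[float]) -> List[float]:
    """Per-token ratio: probability with evidence divided by probability without it.

    Assumed definition for illustration; the abstract does not give the formula.
    """
    if len(p_with) != len(p_without):
        raise ValueError("probability lists must align token by token")
    if any(p <= 0.0 for p in p_without):
        raise ValueError("probabilities without evidence must be positive")
    return [w / wo for w, wo in zip(p_with, p_without)]


# Invented token probabilities for a three-token answer.
ratios = context_sensitivity([0.5, 0.25, 0.75], [0.5, 0.5, 0.25])
```

A ratio near 1 means the evidence barely changed the model's belief in that token, while values far from 1 indicate tokens whose likelihood depends strongly on the retrieved context.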
Ruiming Tang, Chenxu Zhu, Bo Chen, Weipeng Zhang, Menghui Zhu, Xinyi Dai, Huifeng Guo
Tagging systems play an essential role in various information retrieval applications such as search engines and recommender systems. Recently, Large Language Models (LLMs) have been applied in tagging systems due to their extensive world knowledge, semantic understanding, and reasoning capabilities. Despite achieving remarkable performance, existing methods still have limitations, including difficulties in retrieving relevant candidate tags comprehensively, challenges in adapting to emerging domain-specific knowledge, and the lack of reliable tag confidence quantification. To address these three limitations, we propose an automatic tagging system LLM4Tag. First, a graph-based tag recall module is designed to effectively and comprehensively construct a small-scale highly relevant candidate tag set. Subsequently, a knowledge-enhanced tag generation module is employed to generate accurate tags with long-term and short-term knowledge injection. Finally, a tag confidence calibration module is introduced to generate reliable tag confidence scores. Extensive experiments over three large-scale industrial datasets show that LLM4Tag significantly outperforms the state-of-the-art baselines. LLM4Tag has been deployed online for content tagging to serve hundreds of millions of users.
Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li
Explainable recommendation has demonstrated significant advantages in
informing users about the logic behind recommendations, thereby increasing
system transparency, effectiveness, and trustworthiness. To provide
personalized and interpretable explanations, existing works often combine the
generation capabilities of large language models (LLMs) with collaborative
filtering (CF) information. CF information extracted from the user-item
interaction graph captures the user behaviors and preferences, which is crucial
for providing informative explanations. However, due to the complexity of graph
structure, effectively extracting the CF information from graphs still remains
a challenge. Moreover, existing methods often struggle with the integration of
extracted CF information with LLMs due to its implicit representation and the
modality gap between graph structures and natural language explanations. To
address these challenges, we propose G-Refer, a framework using graph
retrieval-augmented large language models (LLMs) for explainable
recommendation. Specifically, we first employ a hybrid graph retrieval
mechanism to retrieve explicit CF signals from both structural and semantic
perspectives. The retrieved CF information is explicitly formulated as
human-understandable text by the proposed graph translation and accounts for
the explanations generated by LLMs. To bridge the modality gap, we introduce
knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of
LLMs to process and utilize the retrieved CF information to generate
explanations. Extensive experiments show that G-Refer achieves superior
performance compared with existing methods in both explainability and
stability. Codes and data are available at https://github.com/Yuhan1i/G-Refer.
Authors' comments: Accepted by WWW 2025, research track
Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou
Medical question answering requires extensive access to specialized conceptual knowledge. The current paradigm, Retrieval-Augmented Generation (RAG), acquires expert medical knowledge through large-scale corpus retrieval and uses this knowledge to guide a general-purpose large language model (LLM) for generating answers. However, existing retrieval approaches often overlook the importance of factual knowledge, which limits the relevance of retrieved conceptual knowledge and restricts its applicability in real-world scenarios, such as clinical decision-making based on Electronic Health Records (EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval framework that retrieves both relevant factual and conceptual knowledge from dual sources (i.e., EHRs and the corpus), allowing them to interact and refine one another. Through extensive evaluation across three factual-aware medical question answering benchmarks, RGAR establishes a new state-of-the-art performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model with RGAR surpasses the considerably larger, RAG-enhanced GPT-3.5. Our findings demonstrate the benefit of extracting factual knowledge for retrieval, which consistently yields improved generation quality.
Yifan Ji, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shi Yu, Yishan Li, Zhiyuan Liu, Yu Gu et al.
Recent dense retrievers usually thrive on the emergent capabilities of Large Language Models (LLMs), using them to encode queries and documents into an embedding space for retrieval. These LLM-based dense retrievers have shown promising performance across various retrieval scenarios. However, relying on a single embedding to represent documents proves less effective in capturing different perspectives of documents for matching. In this paper, we propose Deliberate Thinking based Dense Retriever (DEBATER), which enhances these LLM-based retrievers by enabling them to learn more effective document representations through a step-by-step thinking process. DEBATER introduces the Chain-of-Deliberation mechanism to iteratively optimize document representations using a continuous chain of thought. To consolidate information from various thinking steps, DEBATER also incorporates the Self Distillation mechanism, which identifies the most informative thinking steps and integrates them into a unified text embedding. Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks, demonstrating superior accuracy and robustness. All codes are available at https://github.com/OpenBMB/DEBATER.
Aditya Sharma, Luis Lara, Amal Zouaq, Christopher J. Pal
The ability to generate SPARQL queries from natural language questions is crucial for ensuring efficient and accurate retrieval of structured data from knowledge graphs (KG). While large language models (LLMs) have been widely adopted for SPARQL query generation, they are often susceptible to hallucinations and out-of-distribution errors when producing KG elements like Uniform Resource Identifiers (URIs) based on internal parametric knowledge. This often results in content that appears plausible but is factually incorrect, posing significant challenges for their use in real-world information retrieval (IR) applications. This has led to increased research aimed at detecting and mitigating such errors. In this paper, we introduce PGMR (Post-Generation Memory Retrieval), a modular framework that incorporates a non-parametric memory module to retrieve KG elements and enhance LLM-based SPARQL query generation. Our experimental results indicate that PGMR consistently delivers strong performance across diverse datasets, data distributions, and LLMs. Notably, PGMR significantly mitigates URI hallucinations, nearly eliminating the problem in several scenarios.
Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
Hateful memes have become a significant concern on the Internet,
necessitating robust automated detection systems. While LMMs have shown promise
in hateful meme detection, they face notable challenges like sub-optimal
performance and limited out-of-domain generalization capabilities. Recent
studies further reveal the limitations of both SFT and in-context learning when
applied to LMMs in this setting. To address these issues, we propose a robust
adaptation framework for hateful meme detection that enhances in-domain
accuracy and cross-domain generalization while preserving the general
vision-language capabilities of LMMs. Experiments on six meme classification
datasets show that our approach achieves state-of-the-art performance,
outperforming larger agentic systems. Moreover, our method generates
higher-quality rationales for explaining hateful content compared to standard
SFT, enhancing model interpretability.
Authors' comments: Preprint. Under Review
Sha Li, Naren Ramakrishnan
Retrieval-Augmented Generation (RAG) aims to augment the capabilities of
Large Language Models (LLMs) by retrieving and incorporating external documents
or chunks prior to generation. However, even improved retriever relevance can
bring erroneous or contextually distracting information, undermining the
effectiveness of RAG in downstream tasks. We introduce a compact, efficient,
and pluggable module designed to refine retrieved chunks before using them for
generation. The module aims to extract and reorganize the most relevant and
supportive information into a concise, query-specific format. Through a
three-stage training paradigm - comprising supervised fine-tuning,
contrastive multi-task learning, and reinforcement learning-based alignment -
it prioritizes critical knowledge and aligns it with the generator's
preferences. This approach enables LLMs to produce outputs that are more
accurate, reliable, and contextually appropriate.
Authors' comments: 16 pages
Jinyan Su, Jennifer Healey, Preslav Nakov, Claire Cardie
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to mitigate large language model (LLM) hallucinations by incorporating external knowledge retrieval. However, existing RAG frameworks often apply retrieval indiscriminately, leading to inefficiencies: over-retrieving when unnecessary or failing to retrieve iteratively when required for complex reasoning. Recent adaptive retrieval methods, although they navigate between these retrieval strategies, predict based only on query complexity and lack user-driven flexibility, making them infeasible for diverse user application needs. In this paper, we introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off. Our approach leverages two classifiers: one trained to prioritize accuracy and another to prioritize retrieval efficiency. Via an interpretable control parameter $\alpha$, users can seamlessly navigate between minimal-cost retrieval and high-accuracy retrieval based on their specific requirements. We empirically demonstrate that our approach effectively balances accuracy, retrieval cost, and user controllability, making it a practical and adaptable solution for real-world applications.
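The control parameter $\alpha$ in the abstract above can be pictured with a small sketch. It assumes that each classifier outputs a probability distribution over retrieval strategies and that $\alpha$ linearly mixes the two distributions; the abstract does not say how the classifiers are combined, so the mixing rule, the strategy names (`none`, `single`, `multi`), and the function name are all hypothetical.

```python
from typing import Dict


def choose_strategy(p_cost: Dict[str, float], p_acc: Dict[str, float],
                    alpha: float) -> str:
    """Pick a retrieval strategy from an alpha-weighted mix of two classifiers.

    alpha = 0 follows the cost-oriented classifier, alpha = 1 the accuracy-oriented
    one. Linear mixing is an illustrative assumption, not the paper's stated rule.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = {name: (1.0 - alpha) * p_cost[name] + alpha * p_acc[name]
             for name in p_cost}
    return max(mixed, key=mixed.get)


# Invented classifier outputs for one query.
cost_view = {"none": 0.7, "single": 0.2, "multi": 0.1}
accuracy_view = {"none": 0.1, "single": 0.3, "multi": 0.6}

cheap = choose_strategy(cost_view, accuracy_view, alpha=0.0)
accurate = choose_strategy(cost_view, accuracy_view, alpha=1.0)
```

Sweeping `alpha` between 0 and 1 moves the decision from the cheapest strategy toward the most retrieval-heavy one for the same query, which is the kind of user-facing dial the abstract describes.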
Sayantan Adak, Pauras Mangesh Meher, Paramita Das, Animesh Mukherjee
Wikipedia is an invaluable resource for factual information about a wide
range of entities. However, the quality of articles on less-known entities
often lags behind that of the well-known ones. This study proposes a novel
approach to enhancing Wikipedia's B and C category biography articles by
leveraging personal narratives such as autobiographies and biographies. By
utilizing a multi-staged retrieval-augmented generation technique -- REVerSum
-- we aim to enrich the informational content of these lesser-known articles.
Our study reveals that personal narratives can significantly improve the
quality of Wikipedia articles, providing a rich source of reliable information
that has been underutilized in previous studies. Based on crowd-based
evaluation, REVerSum generated content outperforms the best performing baseline
by 17% in terms of integrability to the original Wikipedia article and 28.5%
in terms of informativeness. Code and Data are available at:
https://github.com/sayantan11995/wikipedia_enrichment
Authors' comments: Accepted at COLING2025 Industry Track
Joon Park, Kyohei Atarashi, Koh Takeuchi, Hisashi Kashima
This paper addresses the challenge of comprehending very long contexts in
Large Language Models (LLMs) by proposing a method that emulates Retrieval
Augmented Generation (RAG) through specialized prompt engineering and
chain-of-thought (CoT) reasoning. While recent LLMs support over 100,000 tokens
in a single prompt, simply enlarging context windows has not guaranteed robust
multi-hop reasoning when key details are scattered across massive input. Our
approach treats the model as both the retriever and the reasoner: it first tags
relevant segments within a long passage, then employs a stepwise CoT workflow
to integrate these pieces of evidence. This single-pass method thereby reduces
reliance on an external retriever, yet maintains focus on crucial segments. We
evaluate our approach on selected tasks from BABILong, which interleaves
standard bAbI QA problems with large amounts of distractor text. Compared to
baseline (no retrieval) and naive RAG pipelines, our approach more accurately
handles multi-fact questions such as object location tracking, counting, and
indefinite knowledge. Furthermore, we analyze how prompt structure, including
the order of question, relevant-text tags, and overall instructions,
significantly affects performance. These findings underscore that optimized
prompt engineering, combined with guided reasoning, can enhance LLMs'
long-context comprehension and serve as a lightweight alternative to
traditional retrieval pipelines.
Authors' comments: 11 pages, 2 figures
Kwangwook Seo, Donguk Kwon, Dongha Lee
Recent advancements in table-based reasoning have expanded beyond
factoid-level QA to address insight-level tasks, where systems should
synthesize implicit knowledge in the table to provide explainable analyses.
Although effective, existing studies remain confined to scenarios where a
single gold table is given alongside the user query, failing to address cases
where users seek comprehensive insights from multiple unknown tables. To bridge
these gaps, we propose MT-RAIG Bench, designed to evaluate systems on
Retrieval-Augmented Insight Generation over Multiple-Tables. Additionally, to
tackle the suboptimality of existing automatic evaluation methods in the table
domain, we further introduce a fine-grained evaluation framework MT-RAIG Eval,
which achieves better alignment with human quality judgments on the generated
insights. We conduct extensive experiments and reveal that even frontier LLMs
still struggle with complex multi-table reasoning, establishing our MT-RAIG
Bench as a challenging testbed for future research.
Authors' comments: Work in progress
Zhuoning Guo, Guangxing Chen, Qian Gao, Xiaochao Liao, Jianjia Zheng, Lu Shen, Hao Liu
Web recommendations provide personalized items from massive catalogs for users, which rely heavily on retrieval stages to trade off the effectiveness and efficiency of selecting a small relevant set from billion-scale candidates in online digital platforms. As one of the largest Chinese search engine and news feed providers, Baidu resorts to Deep Neural Network (DNN) and graph-based Approximate Nearest Neighbor Search (ANNS) algorithms for accurate relevance estimation and efficient search for relevant items. However, current retrieval at Baidu fails in comprehensive user-item relational understanding due to dissected interaction modeling, and performs inefficiently in large-scale graph-based ANNS because of suboptimal traversal navigation and the GPU computational bottleneck under high concurrency. To this end, we propose a GPU-accelerated Multi-relational Parallel Graph Retrieval (GMP-GR) framework to achieve effective yet efficient retrieval in web-scale recommendations. First, we propose a multi-relational user-item relevance metric learning method that unifies diverse user behaviors through multi-objective optimization and employs a self-covariant loss to enhance pathfinding performance. Second, we develop a hierarchical parallel graph-based ANNS to boost graph retrieval throughput, which conducts breadth-depth-balanced searches on a large-scale item graph and cost-effectively handles irregular neural computation via adaptive aggregation on GPUs. In addition, we integrate system optimization strategies in the deployment of GMP-GR in Baidu. Extensive experiments demonstrate the superiority of GMP-GR in retrieval accuracy and efficiency. Deployed across more than twenty applications at Baidu, GMP-GR serves hundreds of millions of users with a throughput exceeding one hundred million requests per second.
Hongyan Wu, Peijian Zeng, Weixiong Zheng, Lianxi Wang, Nankai Lin, Shengyi Jiang, Aimin Yang
The cross-modal text-molecule retrieval task bridges molecule structures and
natural language descriptions. Existing methods predominantly focus on aligning
text modality and molecule modality, yet they overlook adaptively adjusting the
learning states at different training stages and enhancing training efficiency.
To tackle these challenges, this paper proposes a Curriculum Learning-bAsed
croSS-modal text-molecule training framework (CLASS), which can be integrated
with any backbone to yield promising performance improvement. Specifically, we
quantify the sample difficulty considering both text modality and molecule
modality, and design a sample scheduler to introduce training samples via an
easy-to-difficult paradigm as the training advances, remarkably reducing the
scale of training samples at the early stage of training and improving training
efficiency. Moreover, we introduce adaptive intensity learning to increase the
training intensity as the training progresses, which adaptively controls the
learning intensity across all curriculum stages. Experimental results on the
ChEBI-20 dataset demonstrate that our proposed method gains superior
performance, simultaneously achieving prominent time savings.
Authors' comments: 12 pages
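The easy-to-difficult sample scheduler described in the CLASS abstract above can be sketched as follows. The sketch assumes a single scalar difficulty per sample and a linear pacing schedule; the abstract specifies neither, and the function name and the `start_fraction` parameter are hypothetical.

```python
from typing import List, Sequence


def scheduled_subset(difficulties: Sequence[float], epoch: int,
                     total_epochs: int, start_fraction: float = 0.25) -> List[int]:
    """Return indices of the samples available at `epoch` (0-based), easiest first.

    The available fraction grows linearly from `start_fraction` to 1.0.
    Linear pacing is an illustrative assumption.
    """
    if total_epochs > 1:
        progress = min(1.0, epoch / (total_epochs - 1))
    else:
        progress = 1.0
    fraction = start_fraction + (1.0 - start_fraction) * progress
    count = max(1, round(fraction * len(difficulties)))
    # Sort sample indices from easiest to hardest and keep the first `count`.
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    return order[:count]


# Invented difficulty scores for four training samples.
scores = [0.9, 0.1, 0.5, 0.3]
first_epoch = scheduled_subset(scores, epoch=0, total_epochs=4)
last_epoch = scheduled_subset(scores, epoch=3, total_epochs=4)
```

Early epochs train on only the easiest samples, which is where the reduced training scale mentioned in the abstract would come from; by the last epoch every sample is included.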