Joydeep Chandra, Aleksandr Algazinov, Satyam Kumar Navneet, Rim El Filali, Matt Laing, Andrew Hanna
As access to information becomes more open and widespread, people are increasingly using AI tools for assistance. However, many of these tools struggle to estimate the trustworthiness of the information. Although today's search engines include AI features, they often fail to offer clear indicators of data reliability. To address this gap, we introduce WebTrust, a system designed to simplify the process of finding and judging credible information online. Built on a fine-tuned version of IBM's Granite-1B model and trained on a custom dataset, WebTrust works by assigning a reliability score (from 0.1 to 1) to each statement it processes. In addition, it offers a clear justification for why a piece of information received that score. Evaluated using prompt engineering, WebTrust consistently achieves superior performance compared to other small-scale LLMs and rule-based approaches, outperforming them across all experiments on MAE, RMSE, and R2. User testing showed that when reliability scores are displayed alongside search results, people feel more confident and satisfied with the information they find. With its accuracy, transparency, and ease of use, WebTrust offers a practical solution to help combat misinformation and make trustworthy information more accessible to everyone.
Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, Jinsong Su
Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure between specific concepts, enabling more coherent and effective knowledge retrieval for accurate reasoning.Despite its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models onboth hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, coveringfact retrieval, complex reasoning, contextual summarization, and creative generation, and a systematic evaluation across the entire pipeline, from graph constructionand knowledge retrieval to final generation. Leveraging this novel benchmark, we systematically investigate the conditions when GraphRAG surpasses traditional RAG and the underlying reasons for its success, offering guidelines for its practical application. All related resources and analyses are collected for the community at https://github.com/GraphRAG-Bench/GraphRAG-Benchmark.
Jubin Abhishek Soni, Amit Anand, Rajesh Kumar Pandey, Aniket Abhishek Soni
Retrieval-Augmented Generation (RAG) has significantly advanced large
language models (LLMs) by grounding their outputs in external tools and
knowledge sources. However, existing RAG systems are typically constrained to
static, single-turn interactions with fixed toolsets, making them ill-suited
for dynamic domains such as healthcare and smart homes, where user intent,
available tools, and contextual factors evolve over time. We present Dynamic
Context Tuning (DCT), a lightweight framework that extends RAG to support
multi-turn dialogue and evolving tool environments without requiring
retraining. DCT integrates an attention-based context cache to track relevant
past information, LoRA-based retrieval to dynamically select domain-specific
tools, and efficient context compression to maintain inputs within LLM context
limits. Experiments on both synthetic and real-world benchmarks show that DCT
improves plan accuracy by 14% and reduces hallucinations by 37%, while matching
GPT-4 performance at significantly lower cost. Furthermore, DCT generalizes to
previously unseen tools, enabling scalable and adaptable AI assistants across a
wide range of dynamic environments.
Authors' comments: 6 pages, 5 figures, 3 tables. This manuscript has been submitted to
IEEE conference. Researchers are welcome to read and build upon this work;
please cite it appropriately. For questions or clarifications, feel free to
contact me
Lev Morozov, Aleksandr Mogilevskii, Alexander Shirnin
This paper describes EmoRAG, a system designed to detect perceived emotions
in text for SemEval-2025 Task 11, Subtask A: Multi-label Emotion Detection. We
focus on predicting the perceived emotions of the speaker from a given text
snippet, labeling it with emotions such as joy, sadness, fear, anger, surprise,
and disgust. Our approach does not require additional model training and only
uses an ensemble of models to predict emotions. EmoRAG achieves results
comparable to the best performing systems, while being more efficient,
scalable, and easier to implement.
Authors' comments: Accepted to SemEval-2025, an ACL 2025 workshop
Jan Strich, Enes Kutay Isgorur, Maximilian Trescher, Chris Biemann, Martin Semmann
While most financial documents contain a combination of textual and tabular information, robust Retrieval-Augmented Generation (RAG) systems are essential for effectively accessing and reasoning over such content to perform complex numerical tasks. This paper introduces T$^2$-RAGBench, a benchmark comprising 32,908 question-context-answer triples, designed to evaluate RAG methods on real-world financial data. Unlike typical QA datasets that operate under Oracle-context settings, where the relevant context is explicitly provided, T$^2$-RAGBench challenges models to first retrieve the correct context before conducting numerical reasoning. Existing QA datasets involving text and tables typically contain context-dependent questions, which may yield multiple correct answers depending on the provided context. To address this, we transform these datasets into a context-independent format, enabling reliable RAG evaluation. We conduct a comprehensive evaluation of popular RAG methods. Our analysis identifies Hybrid BM25, a technique that combines dense and sparse vectors, as the most effective approach for text-and-table data. However, results demonstrate that T$^2$-RAGBench remains challenging even for SOTA LLMs and RAG methods. Further ablation studies examine the impact of embedding models and corpus size on retrieval performance. T$^2$-RAGBench provides a realistic and rigorous benchmark for existing RAG methods on text-and-table data. Code and dataset are available online.
Alex Laitenberger, Christopher D. Manning, Nelson F. Liu
With the rise of long-context language models (LMs) capable of processing
tens of thousands of tokens in a single pass, do multi-stage
retrieval-augmented generation (RAG) pipelines still offer measurable benefits
over simpler, single-stage approaches? To assess this question, we conduct a
controlled evaluation for QA tasks under systematically scaled token budgets,
comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three
baselines, including DOS RAG (Document's Original Structure RAG), a simple
retrieve-then-read method that preserves original passage order. Despite its
straightforward design, DOS RAG consistently matches or outperforms more
intricate methods on multiple long-context QA benchmarks. We recommend
establishing DOS RAG as a simple yet strong baseline for future RAG
evaluations, pairing it with emerging embedding and language models to assess
trade-offs between complexity and effectiveness as model capabilities evolve.
Authors' comments: 10 pages, 5 figures, for associated source code, see
https://github.com/alex-laitenberger/stronger-baselines-rag
Yuxin Zhang, Yan Wang, Yongrui Chen, Shenyu Zhang, Xinbang Dai, Sheng Bi, Guilin Qi
Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external retrieved information, mitigating issues such as hallucination and outdated knowledge. However, RAG systems are highly sensitive to retrieval noise prevalent in real-world scenarios. Existing benchmarks fail to emulate the complex and heterogeneous noise distributions encountered in real-world retrieval environments, undermining reliable robustness assessment. In this paper, we define four categories of retrieval noise based on linguistic properties and noise characteristics, aiming to reflect the heterogeneity of noise in real-world scenarios. Building on this, we introduce Magic Mushroom, a benchmark for replicating "magic mushroom" noise: contexts that appear relevant on the surface but covertly mislead RAG systems. Magic Mushroom comprises 7,468 single-hop and 3,925 multi-hop question-answer pairs. More importantly, Magic Mushroom enables researchers to flexibly configure combinations of retrieval noise according to specific research objectives or application scenarios, allowing for highly controlled evaluation setups. We evaluate LLM generators of varying parameter scales and classic RAG denoising strategies under diverse noise distributions to investigate their performance dynamics during progressive noise encroachment. Our analysis reveals that both generators and denoising strategies have significant room for improvement and exhibit extreme sensitivity to noise distributions. Magic Mushroom emerges as a promising tool for evaluating and advancing noise-robust RAG systems, accelerating their widespread deployment in real-world applications. The Magic Mushroom benchmark is available at the https://drive.google.com/file/d/1aP5kyPuk4L-L_uoI6T9UhxuTyt8oMqjT/view?usp=sharing.
Mahesh S. Nagargoje, Virginia Fregona, Giulia Luraghi, Francesco Migliavacca, Demitria A Poulos, Bryan C Good, Jose Felix Rodriguez Matas
Background: Mechanical Thrombectomy (MT) is a widely accepted first-line treatment for Acute Ischemic Stroke (AIS) and it has been studied using in vitro and in silico models. Thrombectomy outcomes have been performed for patient-specific cases using in silico models. However, until now, in vivo friction coefficients for stent-vessel, stent-clot, and clot-vessel interactions are unknown, but in vitro experiments have been attempted with significant standard deviations. These interactions and friction coefficients have been considered an important aspect of thrombectomy success. Objectives: In the current study, we explored the influence of variation in friction forces for stent-vessel, stent-clot, and clot-vessel interactions using virtual mechanical thrombectomy (VMT). We have performed three simulations for each interaction and varied friction coefficients around the standard deviation observed in the past in vitro studies. Results: (i) clot-vessel friction: higher friction leads to clot fragmentation and VMT failure. (ii) stent-clot friction: it is susceptible to VMT outcomes, with lower values showing the slippage of the clot while higher values lead to fragmentation. (iii) stent-vessel friction: higher friction shows compression of the stent in curved vessels and dislodgment of clot from stent retriever (SR) due to its compression, which leads to VMT failure. (iv) retrieval speed (RS): higher RS (>30 mm/s) leads to significant stent compression and unrealistic behavior of the SR. Conclusions: Analysis of results proposes the necessity for calculating accurate friction factor values and their implementation into in silico models, due to their sensitivity towards thrombectomy outcomes. Such in silico models mimic in vivo thrombectomy more closely and can be used in mechanical thrombectomy planning, management, and decision-making.
Pei-Yun Lin, Yen-lung Tsai
This research introduces ScoreRAG, an approach to enhance the quality of
automated news generation. Despite advancements in Natural Language Processing
and large language models, current news generation methods often struggle with
hallucinations, factual inconsistencies, and lack of domain-specific expertise
when producing news articles. ScoreRAG addresses these challenges through a
multi-stage framework combining retrieval-augmented generation, consistency
relevance evaluation, and structured summarization. The system first retrieves
relevant news documents from a vector database, maps them to complete news
items, and assigns consistency relevance scores based on large language model
evaluations. These documents are then reranked according to relevance, with
low-quality items filtered out. The framework proceeds to generate graded
summaries based on relevance scores, which guide the large language model in
producing complete news articles following professional journalistic standards.
Through this methodical approach, ScoreRAG aims to significantly improve the
accuracy, coherence, informativeness, and professionalism of generated news
articles while maintaining stability and consistency throughout the generation
process. The code and demo are available at:
https://github.com/peiyun2260/ScoreRAG.
Authors' comments: 11 pages, 8 figures. Code and demo available at
https://github.com/peiyun2260/ScoreRAG. Submitted to arXiv for public access;
journal submission planned
Qiming Zhu, Jialun Cao, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Shing-Chi Cheung
Current research on large language models (LLMs) with retrieval-augmented code generation (RACG) mainly focuses on single-language settings, leaving cross-lingual effectiveness and security unexplored. Multi-lingual RACG systems are valuable for migrating code-bases across programming languages (PLs), yet face risks from error (e.g. adversarial data corruption) propagation in cross-lingual transfer. We construct a dataset spanning 13 PLs with nearly 14k instances to explore utility and robustness of multi-lingual RACG systems. Our investigation reveals four key insights: (1) Effectiveness: multi-lingual RACG significantly enhances multi-lingual code LLMs generation; (2) Inequality: Java demonstrate superior cross-lingual utility over Python in RACG; (3) Robustness: Adversarial attacks degrade performance significantly in mono-lingual RACG but show mitigated impacts in cross-lingual scenarios; Counterintuitively, perturbed code may improve RACG in cross-lingual scenarios; (4) Specialization: Domain-specific code retrievers outperform significantly general text retrievers. These findings establish foundation for developing effective and secure multi-lingual code assistants.
Xinru Ying, Jiaqi Mo, Jingyang Lin, Canghong Jin, Fangfang Wang, Lina Wei
Partially Relevant Video Retrieval (PRVR) is a challenging task in the domain of multimedia retrieval. It is designed to identify and retrieve untrimmed videos that are partially relevant to the provided query. In this work, we investigate long-sequence video content understanding to address information redundancy issues. Leveraging the outstanding long-term state space modeling capability and linear scalability of the Mamba module, we introduce a multi-Mamba module with temporal fusion framework (MamFusion) tailored for PRVR task. This framework effectively captures the state-relatedness in long-term video content and seamlessly integrates it into text-video relevance understanding, thereby enhancing the retrieval process. Specifically, we introduce Temporal T-to-V Fusion and Temporal V-to-T Fusion to explicitly model temporal relationships between text queries and video moments, improving contextual awareness and retrieval accuracy. Extensive experiments conducted on large-scale datasets demonstrate that MamFusion achieves state-of-the-art performance in retrieval effectiveness. Code is available at the link: https://github.com/Vision-Multimodal-Lab-HZCU/MamFusion.
Huy Le, Nhat Chung, Tung Kieu, Anh Nguyen, Ngan Le
Text-video retrieval (TVR) systems often suffer from visual-linguistic biases
present in datasets, which cause pre-trained vision-language models to overlook
key details. To address this, we propose BiMa, a novel framework designed to
mitigate biases in both visual and textual representations. Our approach begins
by generating scene elements that characterize each video by identifying
relevant entities/objects and activities. For visual debiasing, we integrate
these scene elements into the video embeddings, enhancing them to emphasize
fine-grained and salient details. For textual debiasing, we introduce a
mechanism to disentangle text features into content and bias components,
enabling the model to focus on meaningful content while separately handling
biased information. Extensive experiments and ablation studies across five
major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo)
demonstrate the competitive performance of BiMa. Additionally, the model's bias
mitigation capability is consistently validated by its strong results on
out-of-distribution retrieval tasks.
Authors' comments: 22 pages, 14 figures
Pierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset
Understanding user queries is fundamental in many applications, such as home
assistants, booking systems, or recommendations. Accordingly, it is crucial to
develop accurate Spoken Language Understanding (SLU) approaches to ensure the
reliability of the considered system. Current State-of-the-Art SLU techniques
rely on large amounts of training data; however, only limited annotated
examples are available for specific tasks or languages.
In the meantime, instruction-tuned large language models (LLMs) have shown
exceptional performance on unseen tasks in a few-shot setting when provided
with adequate prompts. In this work, we propose to explore example selection by
leveraging Information retrieval (IR) approaches to build an enhanced prompt
that is applied to an SLU task. We evaluate the effectiveness of the proposed
method on several SLU benchmarks. Experimental results show that lexical IR
methods significantly enhance performance without increasing prompt length.
Authors' comments: Conference paper accepted to INTERSPEECH 2025
Qihang Yan, Xinyu Zhang, Luming Guo, Qi Zhang, Feifan Liu
Large Language Models (LLMs) struggle with accuracy, domain-specific reasoning, and interpretability in vertical domains. Traditional preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) often overlook the underlying knowledge sources and reasoning logic. This paper introduces RACE-Align (Retrieval-Augmented and Chain-of-Thought Enhanced Alignment), a novel framework designed to address these limitations. RACE-Align systematically constructs a binary preference dataset incorporating external knowledge support and explicit Chain-of-Thought (CoT) reasoning, then aligns LLMs using the DPO algorithm. The core innovation lies in its preference data construction strategy: it integrates AI-driven retrieval for factual grounding, enhancing knowledgeability and accuracy, and emphasizes the optimization of domain-specific CoT, treating the reasoning process itself as a key preference dimension. A multi-stage, AI-driven refinement pipeline cost-effectively generates these preference pairs. Experimental validation in Traditional Chinese Medicine (TCM) using Qwen3-1.7B as the base model demonstrates that RACE-Align significantly outperforms the original base model and a model fine-tuned only with Supervised Fine-Tuning (SFT). Improvements were observed across multiple dimensions, including answer accuracy, information richness, application of TCM thinking patterns, logicality and depth of reasoning, and interpretability. These findings suggest RACE-Align offers an effective pathway to enhance LLMs' knowledge application, reasoning reliability, and process transparency in complex vertical domains.
Mingyang Huang, Peng Zhang, Bang Zhang
Generating long-term, coherent, and realistic music-conditioned dance
sequences remains a challenging task in human motion synthesis. Existing
approaches exhibit critical limitations: motion graph methods rely on fixed
template libraries, restricting creative generation; diffusion models, while
capable of producing novel motions, often lack temporal coherence and musical
alignment. To address these challenges, we propose $\textbf{MotionRAG-Diff}$, a
hybrid framework that integrates Retrieval-Augmented Generation (RAG) with
diffusion-based refinement to enable high-quality, musically coherent dance
generation for arbitrary long-term music inputs. Our method introduces three
core innovations: (1) A cross-modal contrastive learning architecture that
aligns heterogeneous music and dance representations in a shared latent space,
establishing unsupervised semantic correspondence without paired data; (2) An
optimized motion graph system for efficient retrieval and seamless
concatenation of motion segments, ensuring realism and temporal coherence
across long sequences; (3) A multi-condition diffusion model that jointly
conditions on raw music signals and contrastive features to enhance motion
quality and global synchronization. Extensive experiments demonstrate that
MotionRAG-Diff achieves state-of-the-art performance in motion quality,
diversity, and music-motion synchronization accuracy. This work establishes a
new paradigm for music-driven dance generation by synergizing retrieval-based
template fidelity with diffusion-based creative enhancement.
Authors' comments: 12 pages, 5 figures
Yilin Xiao, Junnan Dong, Chuang Zhou, Su Dong, Qianwen Zhang, Di Yin, Xing Sun, Xiao Huang
Graph Retrieval Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However, current evaluations of GraphRAG models predominantly rely on traditional question-answering datasets. Their limited scope in questions and evaluation metrics fails to comprehensively assess the reasoning capacity improvements enabled by GraphRAG models. To address this gap, we introduce GraphRAG-Bench, a large-scale, domain-specific benchmark designed to rigorously evaluate GraphRAG models. Our benchmark offers three key superiorities: \((i)\) Challenging question design. Featuring college-level, domain-specific questions that demand multi-hop reasoning, the benchmark ensures that simple content retrieval is insufficient for problem-solving. For example, some questions require mathematical reasoning or programming. \((ii)\) Diverse task coverage. The dataset includes a broad spectrum of reasoning tasks, multiple-choice, true/false, multi-select, open-ended, and fill-in-the-blank. It spans 16 disciplines in twenty core textbooks. \((iii)\) Holistic evaluation framework. GraphRAG-Bench provides comprehensive assessment across the entire GraphRAG pipeline, including graph construction, knowledge retrieval, and answer generation. Beyond final-answer correctness, it evaluates the logical coherence of the reasoning process. By applying nine contemporary GraphRAG methods to GraphRAG-Bench, we demonstrate its utility in quantifying how graph-based structuring improves model reasoning capabilities. Our analysis reveals critical insights about graph architectures, retrieval efficacy, and reasoning capabilities, offering actionable guidance for the research community.
Yang Guo, Yutian Tao, Yifei Ming, Robert D. Nowak, Yingyu Liang
Retrieval-augmented generation (RAG) has seen many empirical successes in
recent years by aiding the LLM with external knowledge. However, its
theoretical aspect has remained mostly unexplored. In this paper, we propose
the first finite-sample generalization bound for RAG in in-context linear
regression and derive an exact bias-variance tradeoff. Our framework views the
retrieved texts as query-dependent noisy in-context examples and recovers the
classical in-context learning (ICL) and standard RAG as the limit cases. Our
analysis suggests that an intrinsic ceiling on generalization error exists on
RAG as opposed to the ICL. Furthermore, our framework is able to model
retrieval both from the training data and from external corpora by introducing
uniform and non-uniform RAG noise. In line with our theory, we show the sample
efficiency of ICL and RAG empirically with experiments on common QA benchmarks,
such as Natural Questions and TriviaQA.
Authors' comments: Under Review
Dayoon Ko, Jinyoung Kim, Sohyeon Kim, Jinhyuk Kim, Jaehoon Lee, Seonghak Song, Minyoung Lee, Gunhee Kim
Dense retrievers encode texts into embeddings to efficiently retrieve
relevant documents from large databases in response to user queries. However,
real-world corpora continually evolve, leading to a shift from the original
training distribution of the retriever. Without timely updates or retraining,
indexing newly emerging documents can degrade retrieval performance for future
queries. Thus, identifying when a dense retriever requires an update is
critical for maintaining robust retrieval systems. In this paper, we propose a
novel task of predicting whether a corpus is out-of-distribution (OOD) relative
to a dense retriever before indexing. Addressing this task allows us to
proactively manage retriever updates, preventing potential retrieval failures.
We introduce GradNormIR, an unsupervised approach that leverages gradient norms
to detect OOD corpora effectively. Experiments on the BEIR benchmark
demonstrate that GradNormIR enables timely updates of dense retrievers in
evolving document collections, significantly enhancing retrieval robustness and
efficiency.
Authors' comments: ACL 2025 Findings
Yang Zhao, Chengxiao Dai, Dusit Niyato, Chuan Fu Tan, Keyi Xiang, Yueyang Wang, Zhiquan Yeo, Daren Tan Zong Loong et al.
Large language models (LLMs) hold promise for sustainable manufacturing, but often hallucinate industrial codes and emission factors, undermining regulatory and investment decisions. We introduce CircuGraphRAG, a retrieval-augmented generation (RAG) framework that grounds LLMs outputs in a domain-specific knowledge graph for the circular economy. This graph connects 117,380 industrial and waste entities with classification codes and GWP100 emission data, enabling structured multi-hop reasoning. Natural language queries are translated into SPARQL and verified subgraphs are retrieved to ensure accuracy and traceability. Compared with Standalone LLMs and Naive RAG, CircuGraphRAG achieves superior performance in single-hop and multi-hop question answering, with ROUGE-L F1 scores up to 1.0, while baseline scores below 0.08. It also improves efficiency, halving the response time and reducing token usage by 16% in representative tasks. CircuGraphRAG provides fact-checked, regulatory-ready support for circular economy planning, advancing reliable, low-carbon resource decision making.
Yongkang Xiao, Sinian Zhang, Yi Dai, Huixue Zhou, Jue Hou, Jie Ding, Rui Zhang
Knowledge graph completion (KGC) aims to predict missing triples in knowledge graphs (KGs) by leveraging existing triples and textual information. Recently, generative large language models (LLMs) have been increasingly employed for graph tasks. However, current approaches typically encode graph context in textual form, which fails to fully exploit the potential of LLMs for perceiving and reasoning about graph structures. To address this limitation, we propose DrKGC (Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion). DrKGC employs a flexible lightweight model training strategy to learn structural embeddings and logical rules within the KG. It then leverages a novel bottom-up graph retrieval method to extract a subgraph for each query guided by the learned rules. Finally, a graph convolutional network (GCN) adapter uses the retrieved subgraph to enhance the structural embeddings, which are then integrated into the prompt for effective LLM fine-tuning. Experimental results on two general domain benchmark datasets and two biomedical datasets demonstrate the superior performance of DrKGC. Furthermore, a realistic case study in the biomedical domain highlights its interpretability and practical utility.