Mingyan Wu, Zhenghao Liu, Yukun Yan, Xinze Li, Shi Yu, Zheni Zeng, Yu Gu, Ge Yu
Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by incorporating external knowledge. However, LLMs still encounter challenges in effectively utilizing the knowledge from retrieved documents, often being misled by irrelevant or noisy information. To address this issue, we introduce RankCoT, a knowledge refinement method that incorporates reranking signals into Chain-of-Thought (CoT)-based summarization, refining knowledge based on the given query and all retrieved documents. During training, RankCoT prompts the LLM to generate CoT candidates based on the query and individual documents. It then fine-tunes the LLM to directly reproduce the best CoT from these candidate outputs based on all retrieved documents, which requires the LLM to filter out irrelevant documents while generating the CoT-style summarization. Additionally, RankCoT incorporates a self-reflection mechanism that further refines the CoT outputs, resulting in higher-quality training data. Our experiments demonstrate the effectiveness of RankCoT, showing its superior performance over other knowledge refinement models. Further analysis reveals that RankCoT can provide shorter but effective refinement results, enabling the generator to produce more accurate answers. All code and data are available at https://github.com/NEUIR/RankCoT.
Runzhong Wang, Rui-Xi Wang, Mrunali Manjrekar, Connor W. Coley
Molecular machine learning has gained popularity with the advancements of geometric deep learning. In parallel, retrieval-augmented generation has become a principled approach commonly used with language models. However, the optimal integration of retrieval augmentation into molecular machine learning remains unclear. Graph neural networks stand to benefit from clever matching to understand the structural alignment of retrieved molecules to a query molecule. Neural graph matching offers a compelling solution by explicitly modeling node and edge affinities between two structural graphs while employing a noise-robust, end-to-end neural network to learn affinity metrics. We apply this approach to mass spectrum simulation and introduce MARASON, a novel model that incorporates neural graph matching to enhance a fragmentation-based neural network. Experimental results highlight the effectiveness of our design, with MARASON achieving 28% top-1 accuracy, a substantial improvement over the non-retrieval state-of-the-art accuracy of 19%. Moreover, MARASON outperforms both naive retrieval-augmented generation methods and traditional graph matching approaches.
Ruichen Zhang, Shunpu Tang, Yinqiu Liu, Dusit Niyato, Zehui Xiong, Sumei Sun, Shiwen Mao, Zhu Han
The increasing complexity and scale of modern telecommunications networks
demand intelligent automation to enhance efficiency, adaptability, and
resilience. Agentic AI has emerged as a key paradigm for intelligent
communications and networking, enabling AI-driven agents to perceive, reason,
decide, and act within dynamic networking environments. However, effective
decision-making in telecom applications, such as network planning, management,
and resource allocation, requires integrating retrieval mechanisms that support
multi-hop reasoning, historical cross-referencing, and compliance with evolving
3GPP standards. This article presents a forward-looking perspective on
generative information retrieval-inspired intelligent communications and
networking, emphasizing the role of knowledge acquisition, processing, and
retrieval in agentic AI for telecom systems. We first provide a comprehensive
review of generative information retrieval strategies, including traditional
retrieval, hybrid retrieval, semantic retrieval, knowledge-based retrieval, and
agentic contextual retrieval. We then analyze their advantages, limitations,
and suitability for various networking scenarios. Next, we present a survey
about their applications in communications and networking. Additionally, we
introduce an agentic contextual retrieval framework to enhance telecom-specific
planning by integrating multi-source retrieval, structured reasoning, and
self-reflective validation. Experimental results demonstrate that our framework
significantly improves answer accuracy, explanation consistency, and retrieval
efficiency compared to traditional and semantic retrieval methods. Finally, we
outline future research directions.
Authors' comments: 7 pages, 4 figures
Jhon Rayo, Raul de la Rosa, Mario Garrido
Regulatory texts are inherently long and complex, presenting significant
challenges for information retrieval systems in supporting regulatory officers
with compliance tasks. This paper introduces a hybrid information retrieval
system that combines lexical and semantic search techniques to extract relevant
information from large regulatory corpora. The system integrates a fine-tuned
sentence transformer model with the traditional BM25 algorithm to achieve both
semantic precision and lexical coverage. To generate accurate and comprehensive
responses, retrieved passages are synthesized using Large Language Models
(LLMs) within a Retrieval Augmented Generation (RAG) framework. Experimental
results demonstrate that the hybrid system significantly outperforms standalone
lexical and semantic approaches, with notable improvements in Recall@10 and
MAP@10. By openly sharing our fine-tuned model and methodology, we aim to
advance the development of robust natural language processing tools for
compliance-driven applications in regulatory domains.
Authors' comments: 5 pages; Workshop paper; Proceedings of the 1st Regulatory NLP
Workshop (RegNLP 2025)
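The combination of lexical and semantic scores described in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the two signals are fused, so the min-max normalization, the weighted sum, and the `alpha` parameter below are assumptions for illustration, not the authors' method. The function names (`bm25_scores`, `min_max`, `hybrid_rank`) are hypothetical, and the semantic similarities are placeholder numbers standing in for the output of a sentence-transformer model.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def min_max(values):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_rank(lexical, semantic, alpha=0.5):
    """Rank document indices by a weighted sum of normalized scores (assumed fusion rule)."""
    lex, sem = min_max(lexical), min_max(semantic)
    fused = [alpha * s + (1 - alpha) * l for l, s in zip(lex, sem)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)

docs = [
    "banks must report suspicious transactions".split(),
    "the regulator may impose fines on banks".split(),
    "annual leave policy for employees".split(),
]
query = "fines imposed on banks".split()

lexical = bm25_scores(query, docs)
semantic = [0.41, 0.78, 0.05]  # placeholder cosine similarities, not real model output
ranking = hybrid_rank(lexical, semantic, alpha=0.5)
print(ranking)
```

In this toy setup, `alpha` controls how much weight the semantic signal receives relative to BM25; other fusion rules, such as reciprocal rank fusion, would be equally plausible choices.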
María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico
Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator agreement. We then analyse the performance of the answer-generating LLMs across languages as judged by the human evaluators. Finally, we apply the dataset to our main use case, which is to benchmark multilingual automatic evaluators (LLM-as-a-judge). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs. We will release our benchmark to support the community developing accurate evaluation methods for multilingual RAG systems.
Jeremi I. Kaczmarek, Jakub Pokrywka, Krzysztof Biedalak, Grzegorz Kurzyp, Łukasz Grzybowski
Advances in Large Language Models have revolutionized medical education by enabling scalable and efficient learning solutions. This paper presents a pipeline that employs a Retrieval-Augmented Generation (RAG) system to generate comments for Poland's State Specialization Examination (PES) based on verified resources. The system integrates these generated comments and source documents with a spaced repetition learning algorithm to enhance knowledge retention while minimizing cognitive overload. By employing a refined retrieval system, a query rephraser, and an advanced reranker, our modified RAG solution prioritizes accuracy over efficiency. Rigorous evaluation by medical annotators demonstrates improvements in key metrics such as document relevance, credibility, and logical coherence of generated content, as shown by a series of experiments presented in the paper. This study highlights the potential of RAG systems to provide scalable, high-quality, and individualized educational resources for non-English-speaking users.
Xinwei Long, Zhiyuan Ma, Ermo Hua, Kaiyan Zhang, Biqing Qi, Bowen Zhou
Retrieval-augmented generation (RAG) has emerged to address the
knowledge-intensive visual question answering (VQA) task. Current methods
mainly employ separate retrieval and generation modules to acquire external
knowledge and generate answers, respectively. We propose ReAuSE, an alternative
to previous RAG models for the knowledge-based VQA task, which seamlessly
integrates a knowledge retriever into the generative multi-modal large language
model, serving as a built-in search engine. Specifically, our model functions
both as a generative retriever and an accurate answer generator. It not only
helps retrieve documents from the knowledge base by producing identifiers for
each document, but it also answers visual questions based on the retrieved
documents. Furthermore, we propose a reinforced retrieval calibration module
from relevance feedback to improve retrieval performance and align with the
preferences for accurate answer generation. Extensive experiments on two
representative datasets, OKVQA and A-OKVQA, demonstrate significant improvements
ranging from 2.9% to 9.6% across all evaluation metrics when compared to
strong baselines.
Authors' comments: AAAI-25
Yin Wu, Quanyu Long, Jing Li, Jianfei Yu, Wenya Wang
Retrieval-Augmented Generation (RAG) is a popular approach for enhancing
Large Language Models (LLMs) by addressing their limitations in verifying facts
and answering knowledge-intensive questions. As LLM research extends model
capabilities to input modalities other than text, e.g., images, several
multimodal RAG benchmarks have been proposed. Nonetheless, they mainly use
textual knowledge bases as the primary source of evidence for augmentation.
Benchmarks designed to evaluate images as augmentation in RAG systems, and how
these systems leverage visual knowledge, are still lacking. We propose Visual-RAG, a novel
Question Answering benchmark that emphasizes visual knowledge intensive
questions. Unlike prior works relying on text-based evidence, Visual-RAG
necessitates text-to-image retrieval and integration of relevant clue images to
extract visual knowledge as evidence. With Visual-RAG, we evaluate 5
open-sourced and 3 proprietary Multimodal LLMs (MLLMs), revealing that images
can serve as good evidence in RAG; however, even the SoTA models struggle with
effectively extracting and utilizing visual knowledge.
Authors' comments: 23 pages, 6 figures
Dnyanesh Panchal, Aaryan Gole, Vaibhav Narute, Raunak Joshi
Access to legal knowledge in India is often hindered by a lack of awareness, misinformation and limited accessibility to judicial resources. Many individuals struggle to navigate complex legal frameworks, leading to the frequent misuse of laws and inadequate legal protection. To address these issues, we propose a Retrieval-Augmented Generation (RAG)-based legal chatbot powered by a FAISS vector store for efficient and accurate legal information retrieval. Unlike traditional chatbots, our model is trained using an extensive dataset comprising legal books, official documentation and the Indian Constitution, ensuring accurate responses to even the most complex or misleading legal queries. The chatbot leverages FAISS for rapid vector-based search, significantly improving retrieval speed and accuracy. It is also prompt-engineered to handle convoluted or ambiguous legal questions, reducing the chances of incorrect interpretations. Apart from its core functionality of answering legal queries, the platform includes additional features such as real-time legal news updates, legal blogs, and access to law-related books, making it a comprehensive resource for users. By integrating advanced AI techniques with an optimized retrieval system, our chatbot aims to democratize legal knowledge, enhance legal literacy, and prevent the spread of misinformation. The study demonstrates that our approach effectively improves legal accessibility while maintaining high accuracy and efficiency, thereby contributing to a more informed and empowered society.
Deokhyung Kang, Jeonghun Cho, Yejin Jeon, Sunbin Jang, Minsub Lee, Jawoon Cho, Gary Geunbae Lee
Visual programming languages (VPLs) allow users to create programs through
graphical interfaces, which makes them more accessible and has led to their
widespread use in various domains. To further enhance this accessibility,
recent research has focused on generating VPL code from user instructions using
large language models (LLMs). Specifically, by employing prompting-based
methods, these studies have shown promising results. Nevertheless, such
approaches can be less effective for industrial VPLs such as Ladder Diagram
(LD). LD is a pivotal language used in industrial automation processes and
involves extensive domain-specific configurations, which are difficult to
capture in a single prompt. In this work, we demonstrate that training-based
methods outperform prompting-based methods for LD generation accuracy, even
with smaller backbone models. Building on these findings, we propose a
two-stage training strategy to further enhance VPL generation. First, we employ
retrieval-augmented fine-tuning to leverage the repetitive use of subroutines
commonly seen in industrial VPLs. Second, we apply direct preference
optimization (DPO) to further guide the model toward accurate outputs, using
systematically generated preference pairs through graph editing operations.
Extensive experiments on real-world LD data demonstrate that our approach
improves program-level accuracy by over 10% compared to supervised fine-tuning,
which highlights its potential to advance industrial automation.
Authors' comments: Accepted at ACL 2025 (Main, long paper)
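The second training stage above relies on direct preference optimization (DPO). As a minimal sketch, the standard DPO objective for a single preference pair can be written as follows. The log-probability values in the usage lines are made-up numbers, the function name `dpo_loss` and the default `beta` are illustrative assumptions, and the construction of preference pairs through graph editing operations is specific to the paper and is not reproduced here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference model: the loss equals log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy assigns more probability to the chosen output and less to the rejected one.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline, improved)
```

The loss falls below log(2) once the policy separates the chosen and rejected outputs more strongly than the reference model does, which is the behavior the preference pairs are meant to encourage.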
Haibo Xing, Kanefumi Matsuyama, Hao Deng, Jinxin Hu, Yu Zhang, Xiaoyi Zeng
Industrial recommendation systems typically involve a two-stage process:
retrieval and ranking, which aims to match users with millions of items. In the
retrieval stage, classic embedding-based retrieval (EBR) methods depend on
effective negative sampling techniques to enhance both performance and
efficiency. However, existing techniques often suffer from false negatives,
high costs of ensuring sampling quality, and a lack of semantic information. To
address these limitations, we propose Effective and Semantic-Aware Negative
Sampling (ESANS), which integrates two key components: Effective Dense
Interpolation Strategy (EDIS) and Multimodal Semantic-Aware Clustering (MSAC).
EDIS generates virtual samples within the low-dimensional embedding space to
improve the diversity and density of the sampling distribution while minimizing
computational costs. MSAC refines the negative sampling distribution by
hierarchically clustering item representations based on multimodal information
(visual, textual, behavioral), ensuring semantic consistency and reducing false
negatives. Extensive offline and online experiments demonstrate the superior
efficiency and performance of ESANS.
Authors' comments: 10 pages, 6 figures, Proceedings of the ACM Web Conference 2025
Quanjun Zhang, Chunrong Fang, Yi Zheng, Yaxin Zhang, Yuan Zhao, Rubing Huang, Jianyi Zhou, Yun Yang et al.
Unit testing validates the correctness of the units of the software system
under test and serves as the cornerstone in improving software quality and
reliability. To reduce manual efforts in writing unit tests, some techniques
have been proposed to automatically generate test assertions, with recent
integration-based approaches considered state-of-the-art. Despite being
promising, such integration-based approaches face several limitations,
including reliance on lexical matching for assertion retrieval and a limited
training corpus for assertion generation.
This paper proposes a novel retrieval-augmented deep assertion generation
approach, namely RetriGen, based on a hybrid retriever and a pre-trained
language model (PLM)-based generator. Given a focal-test, RetriGen first builds
a hybrid assertion retriever to search for the most relevant Test-Assert Pair
from external codebases. The retrieval process considers lexical similarity and
semantic similarity via a token-based and an embedding-based retriever,
respectively. RetriGen then treats assertion generation as a
sequence-to-sequence task and designs a PLM-based assertion generator to
predict a correct assertion. We conduct extensive experiments to evaluate
RetriGen against six state-of-the-art approaches across two large-scale
datasets and two metrics. The results demonstrate that RetriGen achieves 57.66%
accuracy and 73.24% CodeBLEU, outperforming all baselines with average
improvements of 50.66% and 14.14%, respectively.
Authors' comments: Accepted to ACM Transactions on Software Engineering and Methodology
(TOSEM 2025)
Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, Bang Liu
Memory plays a key role in enhancing LLMs' performance when deployed to
real-world applications. Existing solutions face trade-offs: explicit memory
designs based on external storage require complex management and incur storage
overhead, while implicit memory designs that store information via parameters
struggle with reliable retrieval. In this paper, we propose R$^3$Mem, a memory
network that optimizes both information Retention and Retrieval through
Reversible context compression. Specifically, R$^3$Mem employs virtual memory
tokens to compress and encode infinitely long histories, further enhanced by a
hierarchical compression strategy that refines information from document- to
entity-level for improved assimilation across granularities. For retrieval,
R$^3$Mem employs a reversible architecture, reconstructing raw data by invoking
the model backward with compressed information. Implemented via
parameter-efficient fine-tuning, it can integrate seamlessly with any
Transformer-based model. Experiments demonstrate that our memory design
achieves state-of-the-art performance in long-context language modeling and
retrieval-augmented generation tasks. It also significantly outperforms
conventional memory modules in long-horizon interaction tasks like
conversational agents, showcasing its potential for next-generation retrieval
systems.
Authors' comments: Work in progress
Zaifu Zhan, Jun Wang, Shuang Zhou, Jiawen Deng, Rui Zhang
Objective: To optimize in-context learning in biomedical natural language
processing by improving example selection. Methods: We introduce a novel
multi-mode retrieval-augmented generation (MMRAG) framework, which integrates
four retrieval strategies: (1) Random Mode, selecting examples arbitrarily; (2)
Top Mode, retrieving the most relevant examples based on similarity; (3)
Diversity Mode, ensuring variation in selected examples; and (4) Class Mode,
selecting category-representative examples. This study evaluates MMRAG on three
core biomedical NLP tasks: Named Entity Recognition (NER), Relation Extraction
(RE), and Text Classification (TC). The datasets used include BC2GM for gene
and protein mention recognition (NER), DDI for drug-drug interaction extraction
(RE), GIT for general biomedical information extraction (RE), and HealthAdvice
for health-related text classification (TC). The framework is tested with two
large language models (Llama2-7B, Llama3-8B) and three retrievers (Contriever,
MedCPT, BGE-Large) to assess performance across different retrieval strategies.
Results: The results from the Random mode indicate that providing more examples
in the prompt improves the model's generation performance. Meanwhile, Top mode
and Diversity mode significantly outperform Random mode on the RE (DDI) task,
achieving an F1 score of 0.9669, a 26.4% improvement. Among the three
retrievers tested, Contriever outperformed the other two in a greater number of
experiments. Additionally, Llama 2 and Llama 3 demonstrated varying
capabilities across different tasks, with Llama 3 showing a clear advantage in
handling NER tasks. Conclusion: MMRAG effectively enhances biomedical
in-context learning by refining example selection, mitigating data scarcity
issues, and demonstrating superior adaptability for NLP-driven healthcare
applications.
Authors' comments: Submitted to JAMIA
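The four example-selection modes named in the abstract above can be sketched on toy embeddings. The abstract gives only one-line descriptions, so the concrete rules below are assumptions: Diversity Mode is implemented as a greedy pick of the example least similar to those already chosen, and Class Mode as the most query-similar example per label. All function names, the example pool, and the two-dimensional vectors are hypothetical.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def random_mode(pool, k, seed=0):
    """Random Mode: pick k example indices arbitrarily (seeded for repeatability)."""
    return random.Random(seed).sample(range(len(pool)), k)

def top_mode(query_vec, pool, k):
    """Top Mode: the k examples most similar to the query."""
    order = sorted(range(len(pool)),
                   key=lambda i: cosine(query_vec, pool[i]["vec"]), reverse=True)
    return order[:k]

def diversity_mode(query_vec, pool, k):
    """Diversity Mode (assumed rule): start from the best match, then greedily
    add the example least similar to anything already chosen."""
    chosen = top_mode(query_vec, pool, 1)
    while len(chosen) < k:
        rest = [i for i in range(len(pool)) if i not in chosen]
        chosen.append(min(rest, key=lambda i: max(
            cosine(pool[i]["vec"], pool[j]["vec"]) for j in chosen)))
    return chosen

def class_mode(query_vec, pool):
    """Class Mode (assumed rule): the most query-similar example of each label."""
    best = {}
    for i in top_mode(query_vec, pool, len(pool)):
        best.setdefault(pool[i]["label"], i)
    return sorted(best.values())

pool = [
    {"vec": [1.0, 0.0], "label": "advice"},
    {"vec": [0.9, 0.1], "label": "advice"},
    {"vec": [0.0, 1.0], "label": "no_advice"},
    {"vec": [0.6, 0.8], "label": "no_advice"},
]
query = [1.0, 0.2]

print(top_mode(query, pool, 2))        # two nearest examples
print(diversity_mode(query, pool, 2))  # nearest example plus a dissimilar one
print(class_mode(query, pool))         # one example per label
```

The selected indices would then be used to fetch the corresponding labeled examples and place them in the in-context prompt.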
Aryan Jadon, Avinash Patil, Shashank Kumar
Retrieval-Augmented Generation (RAG) systems face significant performance
gaps when applied to technical domains requiring precise information extraction
from complex documents. Current evaluation methodologies relying on
document-level metrics inadequately capture token-resolution retrieval accuracy
that is critical for domain-related documents. We propose a framework combining
granular evaluation metrics with synthetic data generation to optimize
domain-specific RAG performance. First, we introduce token-aware metrics
Precision $\Omega$ and Intersection-over-Union (IoU) that quantify context
preservation versus information density trade-offs inherent in technical texts.
Second, we develop a reasoning model-driven pipeline using instruction-tuned
LLMs (DeepSeek-R1, DeepSeek-R1 distilled variants, and Phi-4) to generate
context-anchored QA pairs with discontinuous reference spans across three
specialized corpora: SEC 10-K filings (finance), biomedical abstracts (PubMed),
and APT threat reports (cybersecurity).
Our empirical analysis reveals critical insights: smaller chunks (less than
10 tokens) improve precision by 31-42% (IoU = 0.071 vs. baseline 0.053) at
recall costs (-18%), while domain-specific embedding strategies yield 22%
variance in optimal chunk sizing (5-20 tokens). The
DeepSeek-R1-Distill-Qwen-32B model demonstrates superior concept alignment
(+14% mean IoU over alternatives), though no configuration universally
dominates. Financial texts favor larger chunks for risk factor coverage (Recall
= 0.81 at size = 20), whereas cybersecurity content benefits from atomic
segmentation (Precision $\Omega$ = 0.28 at size = 5).
Our code is available at
https://github.com/aryan-jadon/Synthetic-Data-Generation-and-Evaluation-using-Reasoning-Model
Authors' comments: 8 Pages
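The trade-off between chunk size, precision, and recall reported above can be illustrated with a minimal token-level sketch. It computes precision, recall, and Intersection-over-Union (IoU) over sets of token indices. The abstract does not define Precision $\Omega$, so that metric is not reproduced here. The span representation, the function names, and the toy spans are assumptions for illustration.

```python
def token_set(spans):
    """Expand (start, end) half-open token spans into a set of token indices."""
    return {i for start, end in spans for i in range(start, end)}

def token_metrics(retrieved_spans, reference_spans):
    """Token-level precision, recall, and IoU between retrieved and reference spans."""
    retrieved = token_set(retrieved_spans)
    reference = token_set(reference_spans)
    overlap = len(retrieved & reference)
    union = len(retrieved | reference)
    return {
        "precision": overlap / len(retrieved) if retrieved else 0.0,
        "recall": overlap / len(reference) if reference else 0.0,
        "iou": overlap / union if union else 0.0,
    }

reference = [(10, 15), (40, 45)]  # discontinuous reference spans, 10 tokens in total
small_chunk = [(10, 15)]          # one retrieved 5-token chunk
large_chunk = [(0, 50)]           # one retrieved 50-token chunk

small = token_metrics(small_chunk, reference)
large = token_metrics(large_chunk, reference)
print(small)
print(large)
```

In this toy case the small chunk is fully relevant but covers only half of the reference tokens, while the large chunk covers all of them at the cost of many irrelevant tokens, which mirrors the precision-versus-recall pattern described in the abstract.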
Akos Nagy, Yannis Spyridis, Vasileios Argyriou
This paper presents a detailed evaluation of a Retrieval-Augmented Generation (RAG) system that integrates large language models (LLMs) to enhance information retrieval and instruction generation for maintenance personnel across diverse data formats. We assessed the performance of eight LLMs, emphasizing key metrics such as response speed and accuracy, which were quantified using BLEU and METEOR scores. Our findings reveal that advanced models like GPT-4 and GPT-4o-mini significantly outperform their counterparts, particularly when addressing complex queries requiring multi-format data integration. The results validate the system's ability to deliver timely and accurate responses, highlighting the potential of RAG frameworks to optimize maintenance operations. Future research will focus on refining retrieval techniques for these models and enhancing response generation, particularly for intricate scenarios, ultimately improving the system's practical applicability in dynamic real-world environments.
Juraj Vladika, Florian Matthes
Retrieval-augmented generation (RAG) has emerged as an approach to augment
large language models (LLMs) by reducing their reliance on static knowledge and
improving answer factuality. RAG retrieves relevant context snippets and
generates an answer based on them. Despite its increasing industrial adoption,
systematic exploration of RAG components is lacking, particularly regarding the
ideal size of provided context, and the choice of base LLM and retrieval
method. To help guide development of robust RAG systems, we evaluate various
context sizes, BM25 and semantic search as retrievers, and eight base LLMs.
Moving away from the usual RAG evaluation with short answers, we explore the
more challenging long-form question answering in two domains, where a good
answer has to utilize the entire context. Our findings indicate that final QA
performance improves steadily with up to 15 snippets but stagnates or declines
beyond that. Finally, we show that different general-purpose LLMs excel in the
biomedical domain than in the encyclopedic one, and that open-domain evidence
retrieval in large corpora is challenging.
Authors' comments: Accepted to Findings of NAACL 2025
Yufan Ye, Pu Pang, Ting Zhang, Hua Huang
Code retrieval is a crucial component in modern software development, particularly in large-scale projects. However, existing approaches relying on sequence-based models often fail to fully exploit the structural dependencies inherent in code, leading to suboptimal retrieval performance, particularly with structurally complex code fragments. In this paper, we introduce GNN-Coder, a novel framework based on Graph Neural Networks (GNNs) that utilizes the Abstract Syntax Tree (AST). We make the first attempt to study how a GNN-integrated Transformer can promote the development of semantic retrieval tasks by capturing the structural and semantic features of code. We further propose an innovative graph pooling method tailored for ASTs, utilizing the number of child nodes as a key feature to highlight the intrinsic topological relationships within the AST. This design effectively integrates both sequential and hierarchical representations, enhancing the model's ability to capture code structure and semantics. Additionally, we introduce the Mean Angular Margin (MAM), a novel metric for quantifying the uniformity of code embedding distributions, providing a standardized measure of feature separability. The proposed method achieves a lower MAM, indicating a more discriminative feature representation. This underscores GNN-Coder's superior ability to distinguish between code snippets, thereby enhancing retrieval accuracy. Experimental results show that GNN-Coder significantly boosts retrieval performance, with a 1%-10% improvement in MRR on the CSN dataset, and a notable 20% gain in zero-shot performance on the CosQA dataset.
Yun-Wei Chu, Kai Zhang, Christopher Malon, Martin Renqiang Min
Multimodal Large Language Models (MLLMs) have shown impressive performance in
vision and text tasks. However, hallucination remains a major challenge,
especially in fields like healthcare where details are critical. In this work,
we show how MLLMs may be enhanced to support Visual RAG (V-RAG), a
retrieval-augmented generation framework that incorporates both text and visual
data from retrieved images. On the MIMIC-CXR chest X-ray report generation and
Multicare medical image caption generation datasets, we show that Visual RAG
improves the accuracy of entity probing, which asks whether a medical entity
is grounded by an image. We show that the improvements extend both to frequent
and rare entities, the latter of which may have less positive training data.
Downstream, we apply V-RAG with entity probing to correct hallucinations and
generate more clinically accurate X-ray reports, obtaining a higher RadGraph-F1
score.
Authors' comments: GenAI4Health - AAAI '25
Han Zhang, Langshi Zhou, Hanfang Yang
Extensive research has investigated the integration of large language models (LLMs) with knowledge graphs to enhance the reasoning process. However, understanding how models perform reasoning utilizing structured graph knowledge remains underexplored. Most existing approaches rely on LLMs or retrievers to make binary judgments regarding the utilization of knowledge, which is too coarse. Meanwhile, there is still a lack of feedback mechanisms for reflection and correction throughout the entire reasoning path. This paper proposes an Active self-Reflection framework for knowledge Graph reasoning (ARG), introducing for the first time an end-to-end training approach to achieve iterative reasoning grounded on structured graphs. Within the framework, the model leverages special tokens to actively determine whether knowledge retrieval is necessary, performs reflective critique based on the retrieved knowledge, and iteratively reasons over the knowledge graph. The reasoning paths generated by the model exhibit high interpretability, enabling deeper exploration of the model's understanding of structured knowledge. Ultimately, the proposed model achieves outstanding results compared to existing baselines in knowledge graph reasoning tasks.