Jaeyoung Choe, Jihoon Kim, Woohwan Jung
Retrieval-augmented generation (RAG) based large language models (LLMs) are
widely used in finance for their excellent performance on knowledge-intensive
tasks. However, standardized documents (e.g., SEC filings) share similar formats,
such as repetitive boilerplate text and similar table structures. This
similarity forces traditional RAG methods to misidentify near-duplicate text,
leading to duplicate retrieval that undermines accuracy and completeness. To
address these issues, we propose the Hierarchical Retrieval with Evidence
Curation (HiREC) framework. Our approach first performs hierarchical retrieval
to reduce confusion among similar texts: it retrieves related documents
and then selects the most relevant passages from those documents. The evidence
curation process removes irrelevant passages. When necessary, it automatically
generates complementary queries to collect missing information. To evaluate our
approach, we construct and release a Large-scale Open-domain Financial (LOFin)
question answering benchmark that includes 145,897 SEC documents and 1,595
question-answer pairs. Our code and data are available at
https://github.com/deep-over/LOFin-bench-HiREC.
Authors' comments: ACL 2025 (Findings)
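The two-stage retrieval described in the HiREC abstract above (documents first, then passages within the selected documents) can be illustrated with a minimal sketch. This is not the authors' implementation: the token-overlap scorer, the corpus layout, the function names, and the toy filings are all assumptions made for illustration, and the evidence curation step is omitted.

```python
from collections import Counter


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score (an assumption): number of shared lowercase tokens."""
    q = Counter(query.lower().split())
    t = Counter(text.lower().split())
    return sum((q & t).values())


def hierarchical_retrieve(query, corpus, top_docs=1, top_passages=1):
    """corpus maps a document title to its list of passages (hypothetical layout)."""
    # Stage 1: rank whole documents, so near-duplicate passages from
    # unrelated filings are excluded before passage ranking starts.
    ranked_docs = sorted(
        corpus,
        key=lambda title: overlap_score(query, title + " " + " ".join(corpus[title])),
        reverse=True,
    )[:top_docs]
    # Stage 2: rank passages only inside the selected documents.
    candidates = [(title, p) for title in ranked_docs for p in corpus[title]]
    candidates.sort(key=lambda pair: overlap_score(query, pair[1]), reverse=True)
    return candidates[:top_passages]


# Made-up toy corpus: both filings share an identical boilerplate passage.
corpus = {
    "Apple 10-K 2023": [
        "Total net sales were 383 billion dollars",
        "The company faces risks related to competition",
    ],
    "Microsoft 10-K 2023": [
        "Total revenue was 212 billion dollars",
        "The company faces risks related to competition",
    ],
}
result = hierarchical_retrieve("Apple 2023 total net sales", corpus)
```

Because passages are ranked only after the document set is narrowed, the boilerplate passage shared by both toy filings can only be returned from the selected filing.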
Haoran Xin, Ying Sun, Chao Wang, Weijia Zhang, Hui Xiong
Incorporating collaborative information (CI) effectively is crucial for
leveraging LLMs in recommendation tasks. Existing approaches often encode CI
using soft tokens or abstract identifiers, which introduces a semantic
misalignment with the LLM's natural language pretraining and hampers knowledge
integration. To address this, we propose expressing CI directly in natural
language to better align with LLMs' semantic space. We achieve this by
retrieving a curated set of the most relevant user behaviors in natural
language form. However, identifying informative CI is challenging due to the
complexity of similarity and utility assessment. To tackle this, we introduce a
Self-assessing COllaborative REtrieval framework (SCORE) following the
retrieve-rerank paradigm. First, a Collaborative Retriever (CAR) is developed
to consider both collaborative patterns and semantic similarity. Then, a
Self-assessing Reranker (SARE) leverages LLMs' own reasoning to assess and
prioritize retrieved behaviors. Finally, the selected behaviors are prepended
to the LLM prompt as natural-language CI to guide recommendation. Extensive
experiments on two public datasets validate the effectiveness of SCORE in
improving LLM-based recommendation.
Authors' comments: 13 pages, 6 figures
Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang, Jiabei He, Shiwan Zhao, Xiangyu Kong, Desheng Hu et al.
The Contrastive Language-Audio Pretraining (CLAP) model has demonstrated excellent performance in general audio description-related tasks, such as audio retrieval. However, in the emerging field of emotional speaking style description (ESSD), cross-modal contrastive pretraining remains largely unexplored. In this paper, we propose a novel speech retrieval task called emotional speaking style retrieval (ESSR), and ESS-CLAP, an emotional speaking style CLAP model tailored for learning the relationship between speech and natural language descriptions. In addition, we propose relation-augmented CLAP (RA-CLAP) to address the limitation of traditional methods that assume a strict binary relationship between caption and audio. The model leverages self-distillation to learn the potential local matching relationships between speech and descriptions, thereby enhancing generalization ability. The experimental results validate the effectiveness of RA-CLAP, providing a valuable reference for ESSD.
Yuan Li, Qi Luo, Xiaonan Li, Bufan Li, Qinyuan Cheng, Bo Wang, Yining Zheng, Yuxin Wang et al.
Retrieval-Augmented Generation (RAG) integrates external knowledge with Large Language Models (LLMs) to enhance factual correctness and mitigate hallucination. However, dense retrievers often become the bottleneck of RAG systems due to their limited parameters compared to LLMs and their inability to perform step-by-step reasoning. While prompt-based iterative RAG attempts to address these limitations, it is constrained by human-designed workflows. To address these limitations, we propose $\textbf{R3-RAG}$, which uses $\textbf{R}$einforcement learning to make the LLM learn how to $\textbf{R}$eason and $\textbf{R}$etrieve step by step, thus retrieving comprehensive external knowledge and leading to correct answers. R3-RAG is divided into two stages. We first use cold start to make the model learn the manner of iteratively interleaving reasoning and retrieval. Then we use reinforcement learning to further harness its ability to better explore the external retrieval environment. Specifically, we propose two rewards for R3-RAG: 1) answer correctness for outcome reward, which judges whether the trajectory leads to a correct answer; 2) relevance-based document verification for process reward, encouraging the model to retrieve documents that are relevant to the user question, through which we can let the model learn how to iteratively reason and retrieve relevant documents to get the correct answer. Experimental results show that R3-RAG significantly outperforms baselines and can transfer well to different retrievers. We release R3-RAG at https://github.com/Yuan-Li-FNLP/R3-RAG.
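The two reward signals named in the R3-RAG abstract above (an outcome reward for answer correctness and a process reward for document relevance) could be combined along the following lines. This is a sketch under stated assumptions, not the released implementation: exact-match correctness, averaging of per-step relevance scores, the additive combination, and the `process_weight` parameter are all hypothetical choices.

```python
def trajectory_reward(predicted, gold, step_relevance, process_weight=0.5):
    """Combine an outcome reward and a process reward for one trajectory.

    predicted / gold: final answer strings.
    step_relevance: one relevance score in [0, 1] per retrieval step
        (assumed to come from some external relevance judge).
    process_weight: hypothetical mixing coefficient, not from the paper.
    """
    # Outcome reward: 1 if the trajectory ends in the correct answer (exact match assumed).
    outcome = 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0
    # Process reward: mean relevance of the documents retrieved along the way.
    process = sum(step_relevance) / len(step_relevance) if step_relevance else 0.0
    return outcome + process_weight * process


correct_run = trajectory_reward("Paris", "paris", [1.0, 0.0])
wrong_run = trajectory_reward("Lyon", "Paris", [])
```

Under these assumptions, a trajectory with relevant retrievals still earns partial credit when its final answer is wrong, which gives the policy a denser signal than the outcome reward alone.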
Xun Gong, Anqi Lv, Zhiming Wang, Huijia Zhu, Yanmin Qian
While speech large language models (SpeechLLMs) have advanced standard
automatic speech recognition (ASR), contextual biasing for named entities and
rare words remains challenging, especially at scale. To address this, we
propose BR-ASR: a Bias Retrieval framework for large-scale contextual biasing
(up to 200k entries) via two innovations: (1) speech-and-bias contrastive
learning to retrieve semantically relevant candidates; (2) dynamic curriculum
learning that mitigates homophone confusion, which negatively impacts the final
performance. This is a general framework that allows seamless integration of the
retrieved candidates into diverse ASR systems without fine-tuning. Experiments
on LibriSpeech test-clean/-other achieve state-of-the-art (SOTA) biased word
error rates (B-WER) of 2.8%/7.1% with 2000 bias words, delivering 45% relative
improvement over prior methods. BR-ASR also demonstrates high scalability: when
expanding the bias list to 200k where traditional methods generally fail, it
induces only 0.3 / 2.9% absolute WER / B-WER degradation with a 99.99% pruning
rate and only 20ms latency per query on test-other.
Authors' comments: Accepted by InterSpeech 2025
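As a rough illustration of the retrieval step in the BR-ASR abstract above, the sketch below ranks bias entries by cosine similarity to a speech embedding and keeps the top k. The embeddings, entry names, and function names are made up, and the reading of "pruning rate" as the fraction of bias entries discarded before decoding is an assumption; under that reading, a 99.99% pruning rate on a 200k-entry list leaves 20 candidates.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_bias_candidates(speech_emb, bias_embs, k):
    """Return the k bias entries whose embeddings are closest to the speech embedding."""
    ranked = sorted(bias_embs, key=lambda w: cosine(speech_emb, bias_embs[w]), reverse=True)
    return ranked[:k]


def pruning_rate(list_size, kept):
    """Fraction of the bias list discarded (assumed definition)."""
    return 1 - kept / list_size


# Hypothetical 2-d embeddings, for illustration only.
bias_embs = {"Giles": [0.9, 0.1], "Jiles": [0.1, 0.9], "Mali": [0.7, 0.7]}
top2 = retrieve_bias_candidates([1.0, 0.0], bias_embs, k=2)
rate = pruning_rate(200_000, 20)
```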
Debdeep Sanyal, Agniva Maiti, Umakanta Maharana, Dhruv Kumar, Ankur Mali, C. Lee Giles, Murari Mandal
Effective teaching requires adapting instructional strategies to accommodate
the diverse cognitive and behavioral profiles of students, a persistent
challenge in education and teacher training. While Large Language Models (LLMs)
offer promise as tools to simulate such complex pedagogical environments,
current simulation frameworks are limited in two key respects: (1) they often
reduce students to static knowledge profiles, and (2) they lack adaptive
mechanisms for modeling teachers who evolve their strategies in response to
student feedback. To address these gaps, \textbf{we introduce a novel
simulation framework that integrates LLM-based heterogeneous student agents
with a self-optimizing teacher agent}. The teacher agent's pedagogical policy
is dynamically evolved using a genetic algorithm, allowing it to discover and
refine effective teaching strategies based on the aggregate performance of
diverse learners. In addition, \textbf{we propose Persona-RAG}, a Retrieval
Augmented Generation module that enables student agents to retrieve knowledge
tailored to their individual learning styles. Persona-RAG preserves the
retrieval accuracy of standard RAG baselines while enhancing personalization,
an essential factor in modeling realistic educational scenarios. Through
extensive experiments, we demonstrate how our framework supports the emergence
of distinct and interpretable teaching patterns when interacting with varied
student populations. Our results highlight the potential of LLM-driven
simulations to inform adaptive teaching practices and provide a testbed for
training human educators in controlled, data-driven environments.
Authors' comments: 38 Pages
Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma
We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. These semantically and spatially aligned global representations guide the training of a lightweight student model using a hybrid loss that combines contrastive learning and cosine similarity objectives. Despite being trained on only ~67,500 samples curated from MSCOCO, Flickr30k, and Conceptual Captions (just a fraction of CLIP's original dataset), DCLIP significantly improves image-text retrieval metrics (Recall@K, MAP), while retaining approximately 94% of CLIP's zero-shot classification performance. These results demonstrate that DCLIP effectively mitigates the trade-off between task specialization and generalization, offering a resource-efficient, domain-adaptive, and detail-sensitive solution for advanced vision-language tasks. Code available at https://anonymous.4open.science/r/DCLIP-B772/README.md.
Weihan Xu, Yimeng Ma, Jingyue Huang, Yang Li, Wenye Ma, Taylor Berg-Kirkpatrick, Julian McAuley, Paul Pu Liang et al.
Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot `quote' from the input videos, i.e., insert short video clips into their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.
Yi Jiang, Sendong Zhao, Jianbo Li, Haochun Wang, Bing Qin
The Retrieval-Augmented Generation (RAG) framework introduces a retrieval
module to dynamically inject retrieved information into the input context of
large language models (LLMs), and has demonstrated significant success in
various NLP tasks. However, the current study points out that there is a
preference gap between retrievers and LLMs in the RAG framework, which limits
the further improvement of system performance. Some highly relevant passages
may interfere with LLM reasoning because they contain complex or contradictory
information, while some indirectly related or even inaccurate content may help
the LLM generate more accurate answers by providing suggestive information or
logical clues. To solve this, we propose GainRAG, a novel approach that aligns
the retriever's and LLM's preferences by defining a new metric, "gain", which
measures how well an input passage contributes to correct outputs. Specifically,
we propose a method to estimate these gain signals and train a middleware that
aligns the preferences of the retriever and the LLM using only limited data. In
addition, we introduce a pseudo-passage strategy to mitigate degradation. The
experimental results on 6 datasets verify the effectiveness of GainRAG.
Authors' comments: Accepted by ACL 2025
Jonathan Leung, Yongjie Wang, Zhiqi Shen
Large Language Models (LLMs) demonstrate impressive general capabilities but often struggle with step-by-step reasoning, especially in complex applications such as games. While retrieval-augmented methods like GraphRAG attempt to bridge this gap through cross-document extraction and indexing, their fragmented entity-relation graphs and overly dense local connectivity hinder the construction of coherent reasoning. In this paper, we propose a novel framework based on Goal-Oriented Graphs (GoGs), where each node represents a goal and its associated attributes, and edges encode logical dependencies between goals. This structure enables explicit retrieval of reasoning paths by first identifying high-level goals and recursively retrieving their subgoals, forming coherent reasoning chains to guide LLM prompting. Our method significantly enhances the reasoning ability of LLMs in game-playing tasks, as demonstrated by extensive experiments on the Minecraft testbed, outperforming GraphRAG and other baselines.
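The recursive subgoal retrieval described in the Goal-Oriented Graph abstract above can be sketched as a depth-first traversal that lists prerequisites before the goals that depend on them. The graph contents, the dictionary representation, and the function name are illustrative assumptions; the paper's actual graph construction and retrieval procedure are not specified in the abstract.

```python
def reasoning_chain(goal, subgoals, seen=None):
    """Return goals in an order where every prerequisite precedes its dependents.

    subgoals maps a goal to the list of goals it logically depends on
    (a hypothetical stand-in for the edges of a Goal-Oriented Graph).
    """
    if seen is None:
        seen = set()
    if goal in seen:  # already expanded; also guards against cycles
        return []
    seen.add(goal)
    chain = []
    for sub in subgoals.get(goal, []):
        chain.extend(reasoning_chain(sub, subgoals, seen))
    chain.append(goal)  # the goal itself comes after all of its subgoals
    return chain


# Toy dependency graph loosely modeled on Minecraft crafting (illustrative only).
graph = {
    "iron_pickaxe": ["iron_ingot", "stick"],
    "iron_ingot": ["iron_ore", "furnace"],
    "furnace": ["cobblestone"],
    "stick": ["planks"],
    "planks": ["log"],
}
chain = reasoning_chain("iron_pickaxe", graph)
```

The resulting ordered chain is the kind of structure that could be serialized into a prompt to guide the LLM step by step.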
GuangHao Meng, Sunan He, Jinpeng Wang, Tao Dai, Letian Zhang, Jieming Zhu, Qing Li, Gang Wang et al.
Vision-language retrieval (VLR), which uses text (or images) as queries to
retrieve corresponding images (or text), has attracted significant attention in
both academia and industry. However, existing methods often
neglect the rich visual semantic knowledge of entities, thus leading to
incorrect retrieval results. To address this problem, we propose the Entity
Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual
knowledge of entities to enrich queries. Specifically, since humans recognize
entities through visual cues, we employ a large language model (LLM) to
generate Entity Visual Descriptions (EVDs) as alignment cues to complement
textual data. These EVDs are then integrated into raw queries to create
visually-rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced
queries may introduce noise or low-quality expansions, we develop a novel,
trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW
utilizes EVD knowledge and the generative capabilities of the language model to
effectively rewrite queries. With our specialized training strategy, EaRW can
generate high-quality and low-noise EVD-enhanced queries. Extensive
quantitative and qualitative experiments on image-text retrieval benchmarks
validate the superiority of EvdCLIP on vision-language retrieval tasks.
Authors' comments: 9 pages, 6 figures
Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Jianming Lv, Maarten de Rijke, Xueqi Cheng
We explore adversarial attacks against retrieval-augmented generation (RAG)
systems to identify their vulnerabilities. We focus on generating
human-imperceptible adversarial examples and introduce a novel imperceptible
retrieve-to-generate attack against RAG. This task aims to find imperceptible
perturbations that retrieve a target document, originally excluded from the
initial top-$k$ candidate set, in order to influence the final answer
generation. To address this task, we propose ReGENT, a reinforcement
learning-based framework that tracks interactions between the attacker and the
target RAG and continuously refines attack strategies based on
relevance-generation-naturalness rewards. Experiments on newly constructed
factual and non-factual question-answering benchmarks demonstrate that ReGENT
significantly outperforms existing attack methods in misleading RAG systems
with small imperceptible text perturbations.
Authors' comments: 18 pages, accepted by ACL 2025 Findings
Hongjia Wu, Hongxin Zhang, Wei Chen, Jiazhi Xia
Various industries have produced a large number of documents such as industrial plans, technical guidelines, and regulations that are structurally complex and content-wise fragmented. This poses significant challenges for experts and decision-makers in terms of retrieval and understanding. Although existing LLM-based Retrieval-Augmented Generation methods can provide context-related suggestions, they lack quantitative weighting and traceable reasoning paths, making it difficult to offer multi-level and transparent decision support. To address this issue, this paper proposes the RAD method, which integrates Multi-Criteria Decision Making with the semantic understanding capabilities of LLMs. The method automatically extracts key criteria from industry documents, builds a weighted hierarchical decision model, and generates structured reports under model guidance. The RAD framework introduces explicit weight assignment and reasoning chains in decision generation to ensure accuracy, completeness, and traceability. Experiments show that in various decision-making tasks, the decision reports generated by RAD significantly outperform existing methods in terms of detail, rationality, and structure, demonstrating its application value and potential in complex decision support scenarios.
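To make the idea of a weighted hierarchical decision model in the RAD abstract above concrete, the sketch below scores an option with weighted-sum aggregation over a criteria tree, a common Multi-Criteria Decision Making scheme. The abstract does not say which MCDM method RAD uses, so the aggregation rule, the criteria names, the weights, and the scores are all assumptions.

```python
def aggregate(node, leaf_scores):
    """Score a criteria tree bottom-up using normalized weighted sums.

    node: {"name": str, "weight": float, "children": [...]} (hypothetical schema).
    leaf_scores: maps leaf criterion names to scores in [0, 1].
    """
    children = node.get("children")
    if not children:
        return leaf_scores[node["name"]]
    total_weight = sum(child["weight"] for child in children)
    # Normalizing by the sibling weight total keeps every level on the same scale.
    return sum(
        child["weight"] / total_weight * aggregate(child, leaf_scores)
        for child in children
    )


# Made-up criteria hierarchy for illustration.
criteria = {
    "name": "decision",
    "weight": 1.0,
    "children": [
        {
            "name": "safety",
            "weight": 0.6,
            "children": [
                {"name": "compliance", "weight": 0.5},
                {"name": "risk", "weight": 0.5},
            ],
        },
        {"name": "cost", "weight": 0.4},
    ],
}
score = aggregate(criteria, {"compliance": 0.8, "risk": 0.6, "cost": 0.5})
```

Keeping the weights explicit at each level is what makes the final score traceable: each branch's contribution can be reported alongside the overall result.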
Khandakar Ashrafi Akbar, Md Nahiyan Uddin, Latifur Khan, Trayce Hockstad, Mizanur Rahman, Mashrur Chowdhury, Bhavani Thuraisingham
As connected and automated transportation systems evolve, there is a growing
need for federal and state authorities to revise existing laws and develop new
statutes to address emerging cybersecurity and data privacy challenges. This
study introduces a Retrieval-Augmented Generation (RAG) based Large Language
Model (LLM) framework designed to support policymakers by extracting relevant
legal content and generating accurate, inquiry-specific responses. The
framework focuses on reducing hallucinations in LLMs by using a curated set of
domain-specific questions to guide response generation. By incorporating
retrieval mechanisms, the system enhances the factual grounding and specificity
of its outputs. Our analysis shows that the proposed RAG-based LLM outperforms
leading commercial LLMs across four evaluation metrics: AlignScore, ParaScore,
BERTScore, and ROUGE, demonstrating its effectiveness in producing reliable and
context-aware legal insights. This approach offers a scalable, AI-driven method
for legislative analysis, supporting efforts to update legal frameworks in line
with advancements in transportation technologies.
Authors' comments: Presented at the Transportation Research Board (TRB) Annual Meeting
2025, and subsequently submitted for publication consideration in the
Transportation Research Record (TRR)
Jianghao Wu, Feilong Tang, Yulong Li, Ming Hu, Haochen Xue, Shoaib Jameel, Yutong Xie, Imran Razzak
Recent advances such as Chain-of-Thought prompting have significantly
improved large language models (LLMs) in zero-shot medical reasoning. However,
prompting-based methods often remain shallow and unstable, while fine-tuned
medical LLMs suffer from poor generalization under distribution shifts and
limited adaptability to unseen clinical scenarios. To address these
limitations, we present TAGS, a test-time framework that combines a broadly
capable generalist with a domain-specific specialist to offer complementary
perspectives without any model fine-tuning or parameter updates. To support
this generalist-specialist reasoning process, we introduce two auxiliary
modules: a hierarchical retrieval mechanism that provides multi-scale exemplars
by selecting examples based on both semantic and rationale-level similarity,
and a reliability scorer that evaluates reasoning consistency to guide final
answer aggregation. TAGS achieves strong performance across nine MedQA
benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and
improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several
fine-tuned medical LLMs, without any parameter updates. The code will be
available at https://github.com/JianghaoWu/TAGS.
Authors' comments: 16 pages including references, 2 figures
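One simple way a reliability scorer could guide final answer aggregation, as mentioned in the TAGS abstract above, is a reliability-weighted vote over candidate answers. This is only a sketch of that general idea: the function name, the summation rule, and the example scores are assumptions and may differ from what TAGS actually does.

```python
from collections import defaultdict


def aggregate_answers(candidates):
    """Pick the answer with the highest total reliability.

    candidates: list of (answer, reliability) pairs, e.g. one pair per
    generalist or specialist reasoning path (hypothetical format).
    """
    totals = defaultdict(float)
    for answer, reliability in candidates:
        totals[answer] += reliability
    return max(totals, key=totals.get)


# Made-up scores: "C" wins on combined support even though "B" has the
# single most reliable reasoning path.
final = aggregate_answers([("B", 0.9), ("C", 0.6), ("C", 0.5), ("A", 0.2)])
```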
Minsoo Khang, Sangjun Park, Teakgyu Hong, Dawoon Jung
Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer appropriately, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some of the prior works address these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents. It also introduces a tailored evaluation methodology to comprehensively assess model performance in these critical areas. Our evaluation shows that even advanced LLMs struggle to perform consistently across these dimensions, underscoring key areas for improvement. We release CReSt to support further research and the development of more robust RAG systems. The dataset and code are available at: https://github.com/UpstageAI/CReSt.
Salahuddin Alawadhi, Noorhan Abbas
Integrating Retrieval Augmented Generation (RAG) with Large Language Models
(LLMs) has shown the potential to provide precise, contextually relevant
responses in knowledge-intensive domains. This study investigates the
application of RAG for ABB circuit breakers, focusing on accuracy,
reliability, and contextual relevance in high-stakes engineering environments.
By leveraging tailored datasets, advanced embedding models, and optimized
chunking strategies, the research addresses challenges in data retrieval and
contextual alignment unique to engineering documentation. Key contributions
include the development of a domain-specific dataset for ABB circuit breakers
and the evaluation of three RAG pipelines: OpenAI GPT4o, Cohere, and Anthropic
Claude. Advanced chunking methods, such as paragraph-based and title-aware
segmentation, are assessed for their impact on retrieval accuracy and response
generation. Results demonstrate that while certain configurations achieve high
precision and relevancy, limitations persist in ensuring factual faithfulness
and completeness, critical in engineering contexts. This work underscores the
need for iterative improvements in RAG systems to meet the stringent demands of
electrical engineering tasks, including design, troubleshooting, and
operational decision-making. The findings in this paper help advance research
on AI in highly technical domains such as electrical engineering.
Authors' comments: 17 pages, 4 figures, published in CSIT Vol. 15, 2025. DOI:
10.5121/csit.2025.150905
Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Lin Song et al.
Composed Image Retrieval (CIR) aims to retrieve target images from a gallery
based on a reference image and modification text as a combined query. Recent
approaches focus on balancing global information from two modalities and encode
the query into a unified feature for retrieval. However, due to insufficient
attention to fine-grained details, these coarse fusion methods often struggle
with handling subtle visual alterations or intricate textual instructions. In
this work, we propose DetailFusion, a novel dual-branch framework that
effectively coordinates information across global and detailed granularities,
thereby enabling detail-enhanced CIR. Our approach leverages atomic detail
variation priors derived from an image editing dataset, supplemented by a
detail-oriented optimization strategy to develop a Detail-oriented Inference
Branch. Furthermore, we design an Adaptive Feature Compositor that dynamically
fuses global and detailed features based on fine-grained information of each
unique multimodal query. Extensive experiments and ablation analyses not only
demonstrate that our method achieves state-of-the-art performance on both CIRR
and FashionIQ datasets but also validate the effectiveness and cross-domain
adaptability of detail enhancement for CIR.
Authors' comments: 20 pages, 6 figures
Ziyu Ge, Yuhao Wu, Daniel Wai Kit Chin, Roy Ka-Wei Lee, Rui Cao
Large Language Models (LLMs) augmented with retrieval mechanisms have
demonstrated significant potential in fact-checking tasks by integrating
external knowledge. However, their reliability decreases when confronted with
conflicting evidence from sources of varying credibility. This paper presents
the first systematic evaluation of Retrieval-Augmented Generation (RAG) models
for fact-checking in the presence of conflicting evidence. To support this
study, we introduce \textbf{CONFACT} (\textbf{Con}flicting Evidence for
\textbf{Fact}-Checking) (Dataset available at
https://github.com/zoeyyes/CONFACT), a novel dataset comprising questions
paired with conflicting information from various sources. Extensive experiments
reveal critical vulnerabilities in state-of-the-art RAG methods, particularly
in resolving conflicts stemming from differences in media source credibility.
To address these challenges, we investigate strategies to integrate media
background information into both the retrieval and generation stages. Our
results show that effectively incorporating source credibility significantly
enhances the ability of RAG models to resolve conflicting evidence and improve
fact-checking performance.
Authors' comments: Camera-ready for IJCAI 2025, AI and Social Good
Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang
Large language models (LLMs) excel at complex reasoning tasks but remain
computationally expensive, limiting their practical deployment. To address
this, recent works have focused on distilling reasoning capabilities into
smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher
LLMs. However, this approach struggles in scenarios requiring rare factual
knowledge or precise computation, where sLMs often hallucinate due to limited
capability. In this work, we propose Agent Distillation, a framework for
transferring not only reasoning capability but full task-solving behavior from
LLM-based agents into sLMs with retrieval and code tools. We improve agent
distillation along two complementary axes: (1) we introduce a prompting method
called first-thought prefix to enhance the quality of teacher-generated
trajectories; and (2) we propose self-consistent action generation to
improve the test-time robustness of small agents. We evaluate our method on eight
reasoning tasks across factual and mathematical domains, covering both
in-domain and out-of-domain generalization. Our results show that sLMs as small
as 0.5B, 1.5B, and 3B parameters can achieve performance competitive with next-tier
larger 1.5B, 3B, and 7B models fine-tuned using CoT distillation, demonstrating the
potential of agent distillation for building practical, tool-using small
agents. Our code is available at https://github.com/Nardien/agent-distillation.
Authors' comments: preprint, v1
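The Agent Distillation abstract above does not spell out how self-consistent action generation works. One plausible reading, sketched below, is to sample several candidate actions, execute each, discard those that fail, and keep an action whose result agrees with the majority. The function name, the callable-based action format, and the majority rule are all assumptions for illustration.

```python
from collections import Counter


def self_consistent_action(candidates, execute):
    """Return a candidate action whose execution result matches the majority result.

    candidates: sampled actions (here, zero-argument callables standing in
        for generated code snippets).
    execute: function that runs an action and returns a hashable result.
    """
    outcomes = []
    for action in candidates:
        try:
            outcomes.append((action, execute(action)))
        except Exception:
            continue  # actions that crash are treated as invalid samples
    if not outcomes:
        return None
    majority, _ = Counter(result for _, result in outcomes).most_common(1)[0]
    return next(action for action, result in outcomes if result == majority)


# Four hypothetical samples: two agree, one crashes, one disagrees.
samples = [lambda: 6 * 7, lambda: 42, lambda: 1 / 0, lambda: 41]
chosen = self_consistent_action(samples, execute=lambda action: action())
```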