Hwanjun Song, Jeonghwan Choi, Minseok Kim
Retrieval-augmented generation (RAG) enhances LLMs by integrating external knowledge, but generation remains fragile due to the uncertain placement of relevant chunks and retrieval-induced information overload, leading to hallucinations. We propose Ext2Gen, a novel extract-then-generate model that enhances RAG robustness by first extracting query-relevant sentences before generating answers. To optimize this model, we employ preference alignment through pairwise feedback learning, enabling the model to generate robust answers regardless of variations in retrieval results. Extensive experiments demonstrate that Ext2Gen effectively identifies query-relevant sentences with high precision and recall, leading to highly reliable answers. Furthermore, deploying our model in a RAG environment reveals that it not only boosts the performance of the base LLM but also synergizes with advanced retrieval strategies like query expansion. The dataset and model will be released soon.
Gerion Spielberger, Florian M. Artinger, Jochen Reb, Rudolf Kerschreiter
Analyzing textual data is the cornerstone of qualitative research. While
traditional methods such as grounded theory and content analysis are widely
used, they are labor-intensive and time-consuming. Topic modeling offers an
automated complement. Yet, existing approaches, including LLM-based topic
modeling, still struggle with issues such as high data preprocessing
requirements, interpretability, and reliability. This paper introduces Agentic
Retrieval-Augmented Generation (Agentic RAG) as a method for topic modeling
with LLMs. It integrates three key components: (1) retrieval, enabling
automated access to external data beyond an LLM's pre-trained knowledge; (2)
generation, leveraging LLM capabilities for text synthesis; and (3)
agent-driven learning, iteratively refining retrieval and query formulation
processes. To empirically validate Agentic RAG for topic modeling, we reanalyze
a Twitter/X dataset previously examined by Mu et al. (2024a). Our findings
demonstrate that the approach is more efficient and interpretable, and that it
achieves higher reliability and validity than both the standard machine
learning approach and LLM prompting for topic
modeling. These results highlight Agentic RAG's ability to generate
semantically relevant and reproducible topics, positioning it as a robust,
scalable, and transparent alternative for AI-driven qualitative research in
leadership, managerial, and organizational research.
Authors' comments: 30 pages, 4 figures
Yifei Duan, Raphael Shang, Deng Liang, Yongqiang Cai
Language models can be viewed as functions that embed text into Euclidean space, where the quality of the embedding vectors directly determines model performance. Training such neural networks involves various uncertainties. This paper focuses on improving the performance of pre-trained language models in zero-shot settings through a simple and easily implementable method. We propose a novel backward attention mechanism to enhance contextual information encoding. Evaluated on the Chinese Massive Text Embedding Benchmark (C-MTEB), our approach achieves significant improvements across multiple tasks, providing valuable insights for advancing zero-shot learning capabilities.
Pengcheng Jiang, Jiacheng Lin, Lang Cao, Runchu Tian, SeongKu Kang, Zifeng Wang, Jimeng Sun, Jiawei Han
Information retrieval systems are crucial for enabling effective access to large document collections. Recent approaches have leveraged Large Language Models (LLMs) to enhance retrieval performance through query augmentation, but often rely on expensive supervised learning or distillation techniques that require significant computational resources and hand-labeled data. We introduce DeepRetrieval, a reinforcement learning (RL) approach that trains LLMs for query generation through trial and error without supervised data (reference query). Using retrieval metrics as rewards, our system generates queries that maximize retrieval performance. DeepRetrieval outperforms leading methods on literature search with 65.07% (vs. previous SOTA 24.68%) recall for publication search and 63.18% (vs. previous SOTA 32.11%) recall for trial search using real-world search engines. DeepRetrieval also dominates in evidence-seeking retrieval, classic information retrieval and SQL database search. With only 3B parameters, it outperforms industry-leading models like GPT-4o and Claude-3.5-Sonnet on 11/13 datasets. These results demonstrate that our RL approach offers a more efficient and effective paradigm for information retrieval. Our data and code are available at: https://github.com/pat-jj/DeepRetrieval.
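The idea of using a retrieval metric as the reward can be illustrated with a minimal sketch. It assumes the reward is plain recall@k over document identifiers. The function names, the default k, and the absence of any reward shaping are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant document ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def retrieval_reward(retrieved: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Hypothetical scalar reward for one generated query.

    Assumption: the reward is recall@k with no extra shaping; an RL trainer
    would call this after running the generated query on the search engine.
    """
    return recall_at_k(retrieved, relevant, k)
```

In such a setup, a query rewrite that surfaces more of the known relevant documents receives a higher reward, so no reference query is needed as supervision.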
Chanwoo Choi, Jinsoo Kim, Sukmin Cho, Soyeong Jeong, Buru Chang
With the growing adoption of retrieval-augmented generation (RAG) systems, various attack methods have been proposed to degrade their performance. However, most existing approaches rely on unrealistic assumptions in which external attackers have access to internal components such as the retriever. To address this issue, we introduce a realistic black-box attack based on the RAG paradox, a structural vulnerability arising from the system's effort to enhance trust by revealing both the retrieved documents and their sources to users. This transparency enables attackers to observe which sources are used and how information is phrased, allowing them to craft poisoned documents that are more likely to be retrieved and upload them to the identified sources. Moreover, as RAG systems directly provide retrieved content to users, these documents must not only be retrievable but also appear natural and credible to maintain user confidence in the search results. Unlike prior work that focuses solely on improving document retrievability, our attack method explicitly considers both retrievability and user trust in the retrieved content. Both offline and online experiments demonstrate that our method significantly degrades system performance without internal access, while generating natural-looking poisoned documents.
Abdelrahman Abdallah, Bhawna Piryani, Jonas Wallat, Avishek Anand, Adam Jatowt
Temporal awareness is crucial in many information retrieval tasks, particularly in scenarios where the relevance of documents depends on their alignment with the query's temporal context. Traditional approaches such as BM25 and Dense Passage Retrieval (DPR) focus on lexical or semantic similarity but tend to neglect the temporal alignment between queries and documents, which is essential for time-sensitive tasks like temporal question answering (TQA). We propose TempRetriever, a novel extension of DPR that explicitly incorporates temporal information by embedding both the query date and document timestamp into the retrieval process. This allows retrieving passages that are not only contextually relevant but also aligned with the temporal intent of queries. We evaluate TempRetriever on two large-scale datasets, ArchivalQA and ChroniclingAmericaQA, demonstrating its superiority over baseline retrieval models across multiple metrics. TempRetriever achieves a 6.63% improvement in Top-1 retrieval accuracy and a 3.79% improvement in NDCG@10 compared to the standard DPR on ArchivalQA. Similarly, for ChroniclingAmericaQA, TempRetriever exhibits a 9.56% improvement in Top-1 retrieval accuracy and a 4.68% improvement in NDCG@10. We also propose a novel, time-based negative sampling strategy which further enhances retrieval performance by addressing temporal misalignment during training. Our results underline the importance of temporal aspects in dense retrieval systems and establish a new benchmark for time-aware passage retrieval.
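For readers unfamiliar with the two metrics quoted above, the following sketch shows one common way to compute Top-1 accuracy and NDCG@k. It assumes linear gain and that the ideal ranking is obtained by re-sorting the same result list; the paper's exact evaluation script may differ, and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k for one query, given relevance labels in ranked order.

    Assumption: the ideal ordering is the same list sorted by relevance,
    i.e. every relevant passage is already present in the result list.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def top1_accuracy(first_hit_is_relevant: Sequence[bool]) -> float:
    """Share of queries whose top-ranked passage is relevant."""
    if not first_hit_is_relevant:
        return 0.0
    return sum(first_hit_is_relevant) / len(first_hit_is_relevant)
```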
Chahine-Nicolas Zede, Laurent Carrafa, Valérie Gouet-Brunet
Retrieval in 3D point clouds is a challenging task that consists in
retrieving the most similar point clouds to a given query within a reference of
3D points. Current methods focus on comparing descriptors of point clouds in
order to identify similar ones. Due to the complexity of this latter step, here
we focus on the acceleration of the retrieval by adapting the Differentiable
Search Index (DSI), a transformer-based approach initially designed for text
information retrieval, for 3D point cloud retrieval. Our approach generates 1D
identifiers based on the point descriptors, enabling direct retrieval in
constant time. To adapt DSI to 3D data, we integrate Vision Transformers to map
descriptors to these identifiers while incorporating positional and semantic
encoding. The approach is evaluated for place recognition on a public benchmark,
comparing its retrieval capabilities against state-of-the-art methods in terms
of quality and speed of returned point clouds.
Authors' comments: 8 pages, 1 figure
Ke Niu, Haiyang Yu, Mengyang Zhao, Teng Fu, Siyang Yi, Wei Lu, Bin Li, Xuelin Qian et al.
Person re-identification (Re-ID) is a critical task in human-centric intelligent systems, enabling consistent identification of individuals across different camera views using multi-modal query information. Recent studies have successfully integrated LVLMs with person Re-ID, yielding promising results. However, existing LVLM-based methods face several limitations. They rely on extracting textual embeddings from fixed templates, which are used either as intermediate features for image representation or for prompt tuning in domain-specific tasks. Furthermore, they are unable to adopt the VQA inference format, significantly restricting their broader applicability. In this paper, we propose a novel, versatile, one-for-all person Re-ID framework, ChatReID. Our approach introduces a Hierarchical Progressive Tuning (HPT) strategy, which ensures fine-grained identity-level retrieval by progressively refining the model's ability to distinguish pedestrian identities. Extensive experiments demonstrate that our approach outperforms SOTA methods across ten benchmarks in four different Re-ID settings, offering enhanced flexibility and user-friendliness. ChatReID provides a scalable, practical solution for real-world person Re-ID applications, enabling effective multi-modal interaction and fine-grained identity discrimination.
Lang Huang, Qiyu Wu, Zhongtao Miao, Toshihiko Yamasaki
Information retrieval is indispensable for today's Internet applications, yet traditional semantic matching techniques often fall short in capturing the fine-grained cross-modal interactions required for complex queries. Although late-fusion two-tower architectures attempt to bridge this gap by independently encoding visual and textual data before merging them at a high level, they frequently overlook the subtle interplay essential for comprehensive understanding. In this work, we rigorously assess these limitations and introduce a unified retrieval framework that fuses visual and textual cues from the ground up, enabling early cross-modal interactions for enhancing context interpretation. Through a two-stage training process--comprising post-training adaptation followed by instruction tuning--we adapt MLLMs as retrievers using a simple one-tower architecture. Our approach outperforms conventional methods across diverse retrieval scenarios, particularly when processing complex multi-modal inputs. Notably, the joint fusion encoder yields greater improvements on tasks that require modality fusion compared to those that do not, underscoring the transformative potential of early integration strategies and pointing toward a promising direction for contextually aware and effective information retrieval.
Haitao Li, Yifan Chen, Yiran Hu, Qingyao Ai, Junjie Chen, Xiaoyu Yang, Jianhui Yang, Yueyue Wu et al.
Retrieval-augmented generation (RAG) has proven highly effective in improving
large language models (LLMs) across various domains. However, there is no
benchmark specifically designed to assess the effectiveness of RAG in the legal
domain, which restricts progress in this area. To fill this gap, we propose
LexRAG, the first benchmark to evaluate RAG systems for multi-turn legal
consultations. LexRAG consists of 1,013 multi-turn dialogue samples and 17,228
candidate legal articles. Each sample is annotated by legal experts and
consists of five rounds of progressive questioning. LexRAG includes two key
tasks: (1) Conversational knowledge retrieval, requiring accurate retrieval of
relevant legal articles based on multi-turn context. (2) Response generation,
focusing on producing legally sound answers. To ensure reliable
reproducibility, we develop LexiT, a legal RAG toolkit that provides a
comprehensive implementation of RAG system components tailored for the legal
domain. Additionally, we introduce an LLM-as-a-judge evaluation pipeline to
enable detailed and effective assessment. Through experimental analysis of
various LLMs and retrieval methods, we reveal the key limitations of existing
RAG systems in handling legal consultation conversations. LexRAG establishes a
new benchmark for the practical application of RAG systems in the legal domain,
with its code and data available at https://github.com/CSHaitao/LexRAG.
Authors' comments: 10 pages
Yongjia Lei, Haoyu Han, Ryan A. Rossi, Franck Dernoncourt, Nedim Lipka, Mahantesh M Halappanavar, Jiliang Tang, Yu Wang
Text-rich Graph Knowledge Bases (TG-KBs) have become increasingly crucial for answering queries by providing textual and structural knowledge. However, current retrieval methods often retrieve these two types of knowledge in isolation without considering their mutual reinforcement and some hybrid methods even bypass structural retrieval entirely after neighboring aggregation. To fill in this gap, we propose a Mixture of Structural-and-Textual Retrieval (MoR) to retrieve these two types of knowledge via a Planning-Reasoning-Organizing framework. In the Planning stage, MoR generates textual planning graphs delineating the logic for answering queries. Following planning graphs, in the Reasoning stage, MoR interweaves structural traversal and textual matching to obtain candidates from TG-KBs. In the Organizing stage, MoR further reranks fetched candidates based on their structural trajectory. Extensive experiments demonstrate the superiority of MoR in harmonizing structural and textual retrieval with insights, including uneven retrieving performance across different query logics and the benefits of integrating structural trajectories for candidate reranking. Our code is available at https://github.com/Yoega/MoR.
Achuth Chandrasekhar, Omid Barati Farimani, Olabode T. Ajenifujah, Janghoon Ock, Amir Barati Farimani
This paper presents the development and application of a Large Language Model
Retrieval-Augmented Generation (LLM-RAG) system tailored for nanotechnology
research. The system leverages the capabilities of a sophisticated language
model to serve as an intelligent research assistant, enhancing the efficiency
and comprehensiveness of literature reviews in the nanotechnology domain.
Central to this LLM-RAG system is its advanced query backend retrieval
mechanism, which integrates data from multiple reputable sources. The system
retrieves relevant literature by utilizing Google Scholar's advanced search,
and scraping open-access papers from Elsevier, Springer Nature, and ACS
Publications. This multifaceted approach ensures a broad and diverse collection
of up-to-date scholarly articles and papers. The proposed system demonstrates
significant potential in aiding researchers by providing a streamlined,
accurate, and exhaustive literature retrieval process, thereby accelerating
research advancements in nanotechnology. The effectiveness of the LLM-RAG
system is validated through rigorous testing, illustrating its capability to
significantly reduce the time and effort required for comprehensive literature
reviews, while maintaining high accuracy and query relevance and outperforming
standard, publicly available LLMs.
Authors' comments: 61 pages, 3 figures
Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, Lei Liang
Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at https://github.com/OpenSPG/KAG.
Junlong Ren, Hao Wu, Hui Xiong, Hao Wang
The cross-modal 3D retrieval task aims to achieve mutual matching between
text descriptions and 3D shapes. This has the potential to enhance the
interaction between natural language and the 3D environment, especially within
the realms of robotics and embodied artificial intelligence (AI) applications.
However, the scarcity and expensiveness of 3D data constrain the performance of
existing cross-modal 3D retrieval methods. These methods heavily rely on
features derived from the limited number of 3D shapes, resulting in poor
generalization ability across diverse scenarios. To address this challenge, we
introduce SCA3D, a novel 3D shape and caption online data augmentation method
for cross-modal 3D retrieval. Our approach uses the LLaVA model to create a
component library, captioning each segmented part of every 3D shape within the
dataset. Notably, it facilitates the generation of extensive new 3D-text pairs
containing new semantic features. We employ both inter and intra distances to
align various components into a new 3D shape, ensuring that the components do
not overlap and are closely fitted. Further, text templates are utilized to
process the captions of each component and generate new text descriptions.
Besides, we use unimodal encoders to extract embeddings for 3D shapes and texts
based on the enriched dataset. We then calculate fine-grained cross-modal
similarity using Earth Mover's Distance (EMD) and enhance cross-modal matching
with contrastive learning, enabling bidirectional retrieval between texts and
3D shapes. Extensive experiments show our SCA3D outperforms previous works on
the Text2Shape dataset, raising the Shape-to-Text RR@1 score from 20.03 to
27.22 and the Text-to-Shape RR@1 score from 13.12 to 16.67. Code can be found
at https://github.com/3DAgentWorld/SCA3D.
Authors' comments: ICRA 2025
Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, Guorui Zhou
Recently, generative retrieval-based recommendation systems have emerged as a promising paradigm. However, most modern recommender systems adopt a retrieve-and-rank strategy, where the generative model functions only as a selector during the retrieval stage. In this paper, we propose OneRec, which replaces the cascaded learning framework with a unified generative model. To the best of our knowledge, this is the first end-to-end generative model that significantly surpasses current complex and well-designed recommender systems in real-world scenarios. Specifically, OneRec includes: 1) an encoder-decoder structure, which encodes the user's historical behavior sequences and gradually decodes the videos that the user may be interested in. We adopt sparse Mixture-of-Experts (MoE) to scale model capacity without proportionally increasing computational FLOPs. 2) a session-wise generation approach. In contrast to traditional next-item prediction, we propose a session-wise generation, which is more elegant and contextually coherent than point-by-point generation that relies on hand-crafted rules to properly combine the generated results. 3) an Iterative Preference Alignment module combined with Direct Preference Optimization (DPO) to enhance the quality of the generated results. Unlike DPO in NLP, a recommendation system typically has only one opportunity to display results for each user's browsing request, making it impossible to obtain positive and negative samples simultaneously. To address this limitation, we design a reward model to simulate user generation and customize the sampling strategy. Extensive experiments have demonstrated that a limited number of DPO samples can align user interest preferences and significantly improve the quality of generated results. We deployed OneRec in the main scene of Kuaishou, achieving a 1.6% increase in watch-time, which is a substantial improvement.
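The preference-alignment step can be sketched with the standard DPO objective for a single preference pair. This is the generic DPO formulation, not OneRec's Iterative Preference Alignment module itself. The helper `pick_preference_pair`, which takes the highest- and lowest-scored candidates from a reward function as the chosen and rejected samples, is a hypothetical stand-in for the customized sampling strategy the abstract mentions.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pick_preference_pair(
    candidates: Sequence[T], reward_fn: Callable[[T], float]
) -> Tuple[T, T]:
    """Hypothetical sampling rule: best-scored candidate is 'chosen', worst is 'rejected'."""
    ranked = sorted(candidates, key=reward_fn)
    return ranked[-1], ranked[0]
```

Because a recommender shows only one result list per request, the pair has to come from scoring several generated sessions with a reward model rather than from observed user feedback on both.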
Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Jimmy Lin
While the current state-of-the-art dense retrieval models exhibit strong out-of-domain generalization, they might fail to capture nuanced domain-specific knowledge. In principle, fine-tuning these models for specialized retrieval tasks should yield higher effectiveness than relying on a one-size-fits-all model, but in practice, results can disappoint. We show that standard fine-tuning methods using an InfoNCE loss can unexpectedly degrade effectiveness rather than improve it, even for domain-specific scenarios. This holds true even when applying widely adopted techniques such as hard-negative mining and negative de-noising. To address this, we explore a training strategy that uses listwise distillation from a teacher cross-encoder, leveraging rich relevance signals to fine-tune the retriever. We further explore synthetic query generation using large language models. Through listwise distillation and training with a diverse set of queries ranging from natural user searches and factual claims to keyword-based queries, we achieve consistent effectiveness gains across multiple datasets. Our results also reveal that synthetic queries can rival human-written queries in training utility. However, we also identify limitations, particularly in the effectiveness of cross-encoder teachers as a bottleneck. We release our code and scripts to encourage further research.
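The contrast between the two training signals can be sketched as follows. The InfoNCE loss treats one candidate as the positive and the rest as negatives, while the listwise distillation loss matches the student's score distribution over the whole candidate list to the teacher's. The KL form and the temperature handling below are one common choice and an assumption here; the abstract does not specify the exact loss.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_z for s in scaled]


def info_nce_loss(
    scores: Sequence[float], positive_index: int = 0, temperature: float = 1.0
) -> float:
    """InfoNCE: negative log-probability of the single labeled positive."""
    return -log_softmax(scores, temperature)[positive_index]


def listwise_distillation_loss(
    student_scores: Sequence[float],
    teacher_scores: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) over the candidate list (assumed loss form)."""
    log_teacher = log_softmax(teacher_scores, temperature)
    log_student = log_softmax(student_scores, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_teacher, log_student))
```

The distillation loss uses graded teacher scores for every candidate, so a mislabeled or borderline negative contributes a soft target instead of a hard one.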
Xueguang Ma, Xi Victoria Lin, Barlas Oguz, Jimmy Lin, Wen-tau Yih, Xilun Chen
Large language models (LLMs) have demonstrated strong effectiveness and robustness when fine-tuned as dense retrievers. However, their large parameter size brings significant inference-time computational challenges, including high encoding costs for large-scale corpora and increased query latency, limiting their practical deployment. While smaller retrievers offer better efficiency, they often fail to generalize effectively with limited supervised fine-tuning data. In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller generalizable dense retrievers. In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup. Experiments show that DRAMA offers better multilingual and long-context capabilities than traditional encoder-based retrievers, and achieves strong performance across multiple tasks and languages. These results highlight the potential of connecting the training of smaller retrievers with the growing advancements in LLMs, bridging the gap between efficiency and generalization.
Nuo Xu, Pinghui Wang, Zi Liang, Junzhou Zhao, Xiaohong Guan
Legal case retrieval (LCR) aims to automatically scour for comparable legal cases based on a given query, which is crucial for offering relevant precedents to support the judgment in intelligent legal systems. Due to similar goals, it is often associated with a similar case matching (LCM) task. To address them, a daunting challenge is assessing the uniquely defined legal-rational similarity within the judicial domain, which distinctly deviates from the semantic similarities in general text retrieval. Past works either tagged domain-specific factors or incorporated reference laws to capture legal-rational information. However, their heavy reliance on expert or unrealistic assumptions restricts their practical applicability in real-world scenarios. In this paper, we propose an end-to-end model named LCM-LAI to solve the above challenges. Through meticulous theoretical analysis, LCM-LAI employs a dependent multi-task learning framework to capture legal-rational information within legal cases by a law article prediction (LAP) sub-task, without any additional assumptions in inference. Besides, LCM-LAI proposes an article-aware attention mechanism to evaluate the legal-rational similarity between across-case sentences based on law distribution, which is more effective than conventional semantic similarity. We perform a series of exhaustive experiments including two different tasks involving four real-world datasets. Results demonstrate that LCM-LAI achieves state-of-the-art performance.
Zhuocheng Zhang, Yang Feng, Min Zhang
Retrieval-Augmented Generation (RAG) is a crucial method for mitigating
hallucinations in Large Language Models (LLMs) and integrating external
knowledge into their responses. Existing RAG methods typically employ query
rewriting to clarify the user intent and manage multi-hop logic, while using
hybrid retrieval to expand search scope. However, the tight coupling of query
rewriting to the dense retriever limits its compatibility with hybrid
retrieval, impeding further RAG performance improvements. To address this
challenge, we introduce a high-level searcher that decomposes complex queries
into atomic queries, independent of any retriever-specific optimizations.
Additionally, to harness the strengths of sparse retrievers for precise keyword
retrieval, we have developed a new sparse searcher that employs Lucene syntax
to enhance retrieval accuracy. Alongside web and dense searchers, these
components seamlessly collaborate within our proposed method,
LevelRAG. In LevelRAG, the high-level searcher orchestrates the
retrieval logic, while the low-level searchers (sparse, web, and dense) refine
the queries for optimal retrieval. This approach enhances both the completeness
and accuracy of the retrieval process, overcoming challenges associated with
current query rewriting techniques in hybrid retrieval scenarios. Empirical
experiments conducted on five datasets, encompassing both single-hop and
multi-hop question answering tasks, demonstrate the superior performance of
LevelRAG compared to existing RAG methods. Notably, LevelRAG outperforms the
state-of-the-art proprietary model, GPT4o, underscoring its effectiveness and
potential impact on the RAG field.
Authors' comments: First submit
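As an illustration of what a sparse searcher that emits Lucene syntax might produce, the sketch below builds a classic Lucene query string from required terms, required phrases, and excluded terms. The abstract states only that the sparse searcher employs Lucene syntax; the function names, the choice of `+`/`-` operators, and the optional field prefix are assumptions, and terms are assumed to be single tokens.

```python
import re
from typing import Iterable, Optional

# Characters and operators that the classic Lucene query parser treats as special.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_lucene(text: str) -> str:
    """Backslash-escape Lucene query syntax characters in user text."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)


def _with_field(clause: str, field: Optional[str]) -> str:
    return f"{field}:{clause}" if field else clause


def build_lucene_query(
    must_terms: Iterable[str],
    phrases: Iterable[str] = (),
    exclude_terms: Iterable[str] = (),
    field: Optional[str] = None,
) -> str:
    """Hypothetical helper: '+' marks required clauses, '-' marks prohibited ones."""
    clauses = []
    for term in must_terms:
        clauses.append("+" + _with_field(escape_lucene(term), field))
    for phrase in phrases:
        clauses.append("+" + _with_field('"' + escape_lucene(phrase) + '"', field))
    for term in exclude_terms:
        clauses.append("-" + _with_field(escape_lucene(term), field))
    return " ".join(clauses)
```

For example, `build_lucene_query(["einstein"], phrases=["nobel prize"], exclude_terms=["film"])` yields `+einstein +"nobel prize" -film`, a query that a keyword index can match precisely where a dense retriever might drift.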
Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.