Shantanu Agarwal, Joel Barry, Steven Fincke, Scott Miller
Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that finetunes LLMs for cross-genre AA. Unlike the field of information retrieval (IR), where retrieve-and-rerank is a de facto strategy, cross-genre AA systems must avoid relying on topical cues and instead learn to identify author-specific linguistic patterns that are independent of the text's subject matter (genre/domain/topic). Consequently, for the reranker, we demonstrate that training strategies commonly used in IR are fundamentally misaligned with cross-genre AA, leading to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to effectively learn author-discriminative signals. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state-of-the-art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.
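As a reading aid for the reported metric, here is a minimal sketch of how Success@k could be computed. It assumes Success@k is the fraction of queries whose true author appears among the top-k ranked candidates; the function name and the toy data are illustrative and not taken from the paper.

```python
def success_at_k(ranked_authors, true_authors, k=8):
    """Fraction of queries whose true author appears in the top-k ranking.

    ranked_authors: one ranked candidate list per query (best first).
    true_authors:   the correct author for each query, in the same order.
    """
    hits = sum(
        1
        for ranking, truth in zip(ranked_authors, true_authors)
        if truth in ranking[:k]
    )
    return hits / len(true_authors)


# Toy data (hypothetical): two queries, both written by author "a1".
rankings = [["a3", "a1", "a2"], ["a2", "a3", "a1"]]
truths = ["a1", "a1"]
print(success_at_k(rankings, truths, k=2))  # 0.5
```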
Wonduk Seo, Juhyeon Lee, Junseo Koh, Hyunjin An, Jian Park, Seunghyun Lee, Haihua Chen, Yi Bu
Prompt optimization has emerged as an effective alternative to retraining for
improving the performance of Large Language Models (LLMs). However, most
existing approaches treat evaluation as a black box, relying solely on
numerical scores while offering limited insight into why a prompt succeeds or
fails. They also depend heavily on trial-and-error refinements, which are
difficult to interpret and control. In this paper, we introduce MA-SAPO, a
Multi-Agent framework for Score-Aware Prompt Optimization. Compared to prior
methods, MA-SAPO explicitly couples evaluation outcomes with structured
reasoning to guide systematic edits. The framework specifically consists of two
stages: during the Reasoning Phase, agents collaboratively explain metric
scores, diagnose weaknesses, and synthesize targeted refinements that are
stored as reusable reasoning assets; during the Test Phase, agents retrieve
these assets to analyze optimized prompts and apply only evidence-grounded
edits. By turning evaluation signals into interpretable reasoning chains,
MA-SAPO produces prompt refinements that are more transparent, auditable, and
controllable. Experiments on the HelpSteer1/2 benchmarks demonstrate consistent
improvements over single-pass prompting, retrieval-augmented baselines, and
prior multi-agent strategies, validating the effectiveness of our approach.
Authors' comments: Preprint
Wenbiao Tao, Yunshi Lan, Weining Qian
Retrieval-Augmented Generation enhances language models by retrieving external knowledge to support informed and grounded responses. However, traditional RAG methods rely on fragment-level retrieval, limiting their ability to address query-focused summarization queries. GraphRAG introduces a graph-based paradigm for global knowledge reasoning, yet suffers from inefficiencies in information extraction, costly resource consumption, and poor adaptability to incremental updates. To overcome these limitations, we propose TagRAG, a tag-guided hierarchical knowledge graph RAG framework designed for efficient global reasoning and scalable graph maintenance. TagRAG introduces two key components: (1) Tag Knowledge Graph Construction, which extracts object tags and their relationships from documents and organizes them into hierarchical domain tag chains for structured knowledge representation, and (2) Tag-Guided Retrieval-Augmented Generation, which retrieves domain-centric tag chains to localize and synthesize relevant knowledge during inference. This design adapts well to smaller language models, improves retrieval granularity, and supports efficient incremental knowledge updates. Extensive experiments on UltraDomain datasets spanning Agriculture, Computer Science, Law, and cross-domain settings demonstrate that TagRAG achieves an average win rate of 95.41% against baselines while maintaining about 14.6x construction and 1.9x retrieval efficiency compared with GraphRAG.
Effrosyni Sokli, Pranav Kasela, Georgios Peikos, Gabriella Pasi
Dense Retrieval Models (DRMs) are a prominent development in Information
Retrieval (IR). A key challenge with these neural Transformer-based models is
that they often struggle to generalize beyond the specific tasks and domains
they were trained on. To address this challenge, prior research in IR
incorporated the Mixture-of-Experts (MoE) framework within each Transformer
layer of a DRM, which, though effective, substantially increased the number of
additional parameters. In this paper, we propose a more efficient design, which
introduces a single MoE block (SB-MoE) after the final Transformer layer. To
assess the retrieval effectiveness of SB-MoE, we perform an empirical
evaluation across three IR tasks. Our experiments involve two evaluation
setups, aiming to assess both in-domain effectiveness and the model's zero-shot
generalizability. In the first setup, we fine-tune SB-MoE with four different
underlying DRMs on seven IR benchmarks and evaluate them on their respective
test sets. In the second setup, we fine-tune SB-MoE on MSMARCO and perform
zero-shot evaluation on thirteen BEIR datasets. Additionally, we perform
further experiments to analyze the model's dependency on its hyperparameters
(i.e., the number of employed and activated experts) and investigate how this
variation affects SB-MoE's performance. The obtained results show that SB-MoE
is particularly effective for DRMs with lightweight base models, such as
TinyBERT and BERT-Small, consistently exceeding standard model fine-tuning
across benchmarks. For DRMs with more parameters, such as BERT-Base and
Contriever, our model requires a larger number of training samples to achieve
improved retrieval performance. Our code is available online at:
https://github.com/FaySokli/SB-MoE.
Authors' comments: 8 pages, 4 figures, 3 tables, reproducible code available at
https://github.com/FaySokli/SB-MoE , Accepted for publication in Proceedings
of the 2025 IEEE/WIC International Conference on Web Intelligence and
Intelligent Agent Technology (WI-IAT 2025)
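The idea of a single MoE block applied after the final Transformer layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the experts are plain linear maps, the gate is a softmax over linear scores, and the names `moe_block`, `gate_weights`, and `expert_weights` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def moe_block(embedding, gate_weights, expert_weights, top_k=1):
    """Route one final-layer embedding through its top_k experts.

    Returns the gate-weighted sum of the active experts' outputs
    (gate weights renormalized over the active experts) and the
    indices of the experts that were activated.
    """
    gate = softmax(matvec(gate_weights, embedding))
    active = sorted(range(len(gate)), key=lambda i: gate[i], reverse=True)[:top_k]
    norm = sum(gate[i] for i in active)
    output = [0.0] * len(embedding)
    for i in active:
        expert_out = matvec(expert_weights[i], embedding)
        for d, value in enumerate(expert_out):
            output[d] += (gate[i] / norm) * value
    return output, active


# Toy setup (hypothetical): two experts that scale the embedding by 2 and 3.
embedding = [1.0, 2.0]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
expert_weights = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
print(moe_block(embedding, gate_weights, expert_weights, top_k=1))
```

The two hyperparameters mentioned in the abstract correspond here to `len(expert_weights)` (employed experts) and `top_k` (activated experts).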
Ines Besrour, Jingbo He, Tobias Schreieder, Michael Färber
We present SQuAI (https://squai.scads.ai/), a scalable and trustworthy
multi-agent retrieval-augmented generation (RAG) framework for scientific
question answering (QA) with large language models (LLMs). SQuAI addresses key
limitations of existing RAG systems in the scholarly domain, where complex,
open-domain questions demand accurate answers, explicit claims with citations,
and retrieval across millions of scientific documents. Built on over 2.3
million full-text papers from arXiv.org, SQuAI employs four collaborative
agents to decompose complex questions into sub-questions, retrieve targeted
evidence via hybrid sparse-dense retrieval, and adaptively filter documents to
improve contextual relevance. To ensure faithfulness and traceability, SQuAI
integrates in-line citations for each generated claim and provides supporting
sentences from the source documents. Our system improves faithfulness, answer
relevance, and contextual relevance by up to +0.088 (12%) over a strong RAG
baseline. We further release a benchmark of 1,000 scientific
question-answer-evidence triplets to support reproducibility. With transparent
reasoning, verifiable citations, and domain-wide scalability, SQuAI
demonstrates how multi-agent RAG enables more trustworthy scientific QA with
LLMs.
Authors' comments: Accepted at CIKM 2025
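The abstract does not say how SQuAI fuses its sparse and dense scores, so the following is only one common way to do hybrid retrieval: min-max normalize each score set and take a weighted sum. The function names, the `alpha` weight, and the toy scores are all assumptions for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def hybrid_rank(sparse_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * sparse + (1 - alpha) * dense.

    Documents missing from one retriever contribute 0 for that retriever.
    Ties are broken by document id so the ordering is deterministic.
    """
    sparse_n, dense_n = min_max(sparse_scores), min_max(dense_scores)
    docs = set(sparse_n) | set(dense_n)
    fused = {
        doc: alpha * sparse_n.get(doc, 0.0) + (1 - alpha) * dense_n.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Toy scores (hypothetical): BM25-style sparse scores and cosine-style dense scores.
sparse = {"p1": 12.0, "p2": 7.0, "p3": 2.0}
dense = {"p1": 0.2, "p2": 0.9, "p3": 0.6}
print(hybrid_rank(sparse, dense, alpha=0.5))  # ['p2', 'p1', 'p3']
```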
Qiyu Wu, Shuyang Cui, Satoshi Hayakawa, Wei-Yao Wang, Hiromi Wakaki, Yuki Mitsufuji
Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to content production. Despite the success of separate-encoder approaches like CLIP, which align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While such encoders are flexible and advanced, we find that unified encoders trained with conventional contrastive learning are prone to learning modality shortcuts, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from their unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as an effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.
Jaewan Park, Solbee Cho, Jay-Yoon Lee
Iterative retrieval-augmented generation (RAG) enables large language models
to answer complex multi-hop questions, but each additional loop increases
latency, costs, and the risk of introducing distracting evidence, motivating
the need for an efficient stopping strategy. Existing methods either use a
predetermined number of iterations or rely on confidence proxies that poorly
reflect whether more retrieval will actually help. We cast iterative RAG as a
finite-horizon Markov decision process and introduce Stop-RAG, a value-based
controller that adaptively decides when to stop retrieving. Trained with
full-width forward-view Q($\lambda$) targets from complete trajectories,
Stop-RAG learns effective stopping policies while remaining compatible with
black-box APIs and existing pipelines. On multi-hop question-answering
benchmarks, Stop-RAG consistently outperforms both fixed-iteration baselines
and prompting-based stopping with LLMs. These results highlight adaptive
stopping as a key missing component in current agentic systems, and demonstrate
that value-based control can improve the accuracy of RAG systems.
Authors' comments: NeurIPS 2025 MTI-LLM Workshop
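To make the stopping decision concrete, here is a minimal sketch of a value-based controller in a finite-horizon loop. It only illustrates the decision rule (stop once stopping is valued at least as highly as continuing); it does not reproduce how Stop-RAG is trained, and the Q-value estimates below are made-up numbers, not outputs of any model.

```python
def rounds_before_stopping(q_estimates, horizon):
    """Return how many retrieval rounds run before the controller stops.

    q_estimates[t] is a hypothetical (q_stop, q_continue) pair for round t + 1.
    The controller stops at the first round where stopping is valued at least
    as highly as continuing, and always stops once the horizon is reached.
    """
    for round_index, (q_stop, q_continue) in enumerate(q_estimates[:horizon], start=1):
        if q_stop >= q_continue:
            return round_index
    return min(horizon, len(q_estimates))


# Made-up estimates: continuing looks better for two rounds, then stopping wins.
estimates = [(0.30, 0.55), (0.48, 0.52), (0.61, 0.40), (0.70, 0.10)]
print(rounds_before_stopping(estimates, horizon=4))  # 3
print(rounds_before_stopping(estimates, horizon=2))  # 2 (forced stop at the horizon)
```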
Jianting Tang, Dongshuai Li, Tao Wen, Fuyu Lv, Dan Ou, Linli Xu
In modern e-commerce search systems, dense retrieval has become an indispensable component. By computing similarities between query and item (product) embeddings, it efficiently selects candidate products from large-scale repositories. With the breakthroughs in large language models (LLMs), mainstream embedding models have gradually shifted from BERT to LLMs for more accurate text modeling. However, these models still adopt direct-embedding methods, and the semantic accuracy of embeddings remains inadequate. Therefore, contrastive learning is heavily employed to achieve tight semantic alignment between positive pairs. Consequently, such models tend to capture statistical co-occurrence patterns in the training data, biasing them toward shallow lexical and semantic matches. For difficult queries exhibiting notable lexical disparity from target items, the performance degrades significantly. In this work, we propose the Large Reasoning Embedding Model (LREM), which novelly integrates reasoning processes into representation learning. For difficult queries, LREM first conducts reasoning to achieve a deep understanding of the original query, and then produces a reasoning-augmented query embedding for retrieval. This reasoning process effectively bridges the semantic gap between original queries and target items, significantly improving retrieval accuracy. Specifically, we adopt a two-stage training process: the first stage optimizes the LLM on carefully curated Query-CoT-Item triplets with SFT and InfoNCE losses to establish preliminary reasoning and embedding capabilities, and the second stage further refines the reasoning trajectories via reinforcement learning (RL). Extensive offline and online experiments validate the effectiveness of LREM, leading to its deployment on China's largest e-commerce platform since August 2025.
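For readers unfamiliar with the InfoNCE loss mentioned above, a minimal single-query sketch follows. It assumes dot-product similarity and a temperature of 0.1; both choices, along with the function names and toy vectors, are illustrative and not taken from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query: -log softmax score of the positive item.

    The positive's logit is placed first; the log-sum-exp is computed
    with the usual max-subtraction trick for numerical stability.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, negative) / temperature for negative in negatives]
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy unit vectors (hypothetical): the positive is aligned with the query.
query = [1.0, 0.0]
aligned_item = [1.0, 0.0]
unrelated_item = [0.0, 1.0]
print(info_nce(query, aligned_item, [unrelated_item]))   # small loss
print(info_nce(query, unrelated_item, [aligned_item]))   # large loss
```

The loss shrinks as the positive item scores higher than the negatives, which is the "tight semantic alignment between positive pairs" that contrastive training aims for.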
Md Mahadi Hasan Nahid, Davood Rafiei
Retrieval plays a central role in multi-hop question answering (QA), where
answering complex questions requires gathering multiple pieces of evidence. We
introduce an Agentic Retrieval System that leverages large language models
(LLMs) in a structured loop to retrieve relevant evidence with high precision
and recall. Our framework consists of three specialized agents: a Question
Analyzer that decomposes a multi-hop question into sub-questions, a Selector
that identifies the most relevant context for each sub-question (focusing on
precision), and an Adder that brings in any missing evidence (focusing on
recall). The iterative interaction between Selector and Adder yields a compact
yet comprehensive set of supporting passages. In particular, it achieves higher
retrieval accuracy while filtering out distracting content, enabling downstream
QA models to surpass full-context answer accuracy while relying on
significantly less irrelevant information. Experiments on four multi-hop QA
benchmarks -- HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG --
demonstrate that our approach consistently outperforms strong baselines.
Authors' comments: 18 pages
Yilun Zheng, Dan Yang, Jie Li, Lin Shang, Lihui Chen, Jiahao Xu, Sitao Luan
Retrieval-Augmented Generation (RAG) systems give large language models (LLMs) instant access to relevant information for the generative process, demonstrating superior performance in addressing common LLM challenges such as hallucination, factual inaccuracy, and the knowledge cutoff. Graph-based RAG further extends this paradigm by incorporating knowledge graphs (KGs) to leverage rich, structured connections for more precise and inferential responses. A critical challenge, however, is that most Graph-based RAG systems rely on LLMs for automated KG construction, often yielding noisy KGs with redundant entities and unreliable relationships. This noise degrades retrieval and generation performance while also increasing computational cost. Crucially, current research does not comprehensively address the denoising problem for LLM-generated KGs. In this paper, we introduce DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), a framework that addresses these challenges through: (1) entity resolution, which eliminates redundant entities, and (2) triple reflection, which removes erroneous relations. Together, these techniques yield more compact, higher-quality KGs that significantly outperform their unprocessed counterparts. Beyond the methods, we conduct a systematic evaluation of entity resolution for LLM-generated KGs, examining different blocking strategies, embedding choices, similarity metrics, and entity merging techniques. To the best of our knowledge, this is the first comprehensive exploration of entity resolution in LLM-generated KGs. Our experiments demonstrate that this straightforward approach not only drastically reduces graph size but also consistently improves question answering performance across diverse popular Graph-based RAG variants.
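The entity-resolution step (blocking, similarity scoring, merging) can be illustrated with a small sketch. This is not DEG-RAG's method: it assumes a first-character blocking key, string similarity from Python's `difflib.SequenceMatcher` in place of embeddings, and a greedy merge into the first matching representative. The threshold and entity names are made up.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def resolve_entities(names, threshold=0.85):
    """Map each entity name to a canonical representative.

    Blocking: names are only compared within the same first-character block.
    Similarity: case-insensitive SequenceMatcher ratio.
    Merging: a name joins the first representative it matches, otherwise
    it becomes a new representative.
    """
    blocks = defaultdict(list)
    for name in names:
        blocks[name.strip().lower()[:1]].append(name)

    canonical = {}
    for block in blocks.values():
        representatives = []
        for name in block:
            for rep in representatives:
                ratio = SequenceMatcher(None, name.lower(), rep.lower()).ratio()
                if ratio >= threshold:
                    canonical[name] = rep
                    break
            else:
                representatives.append(name)
                canonical[name] = name
    return canonical


# Toy entities (hypothetical), as an LLM might extract them with duplicates.
entities = ["Barack Obama", "barack obama", "Barak Obama", "Berlin"]
print(resolve_entities(entities))
```

Blocking keeps the number of pairwise comparisons manageable; the choice of blocking key, similarity metric, and merge rule are exactly the design dimensions the abstract says the paper evaluates.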
Jiho Shin, Nima Shiri Harzevili, Reem Aleithan, Hadi Hemmati, Song Wang
Retrieval Augmented Generation (RAG) has advanced software engineering tasks but remains underexplored in unit test generation. To bridge this gap, we investigate the efficacy of RAG-based unit test generation for machine learning (ML/DL) APIs and analyze the impact of different knowledge sources on their effectiveness. We examine three domain-specific sources for RAG: (1) API documentation (official guidelines), (2) GitHub issues (developer-reported resolutions), and (3) StackOverflow Q&As (community-driven solutions). Our study focuses on five widely used Python-based ML/DL libraries, TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost, targeting the most-used APIs. We evaluate four state-of-the-art LLMs -- GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B -- across three strategies: basic instruction prompting, Basic RAG, and API-level RAG. Quantitatively, we assess syntactical and dynamic correctness and line coverage. While RAG does not enhance correctness, RAG improves line coverage by 6.5% on average. We found that GitHub issues result in the best improvement in line coverage by providing edge cases from various issues. We also found that these generated unit tests can help detect new bugs. Specifically, 28 bugs were detected, 24 unique bugs were reported to developers, ten were confirmed, four were rejected, and ten are awaiting developers' confirmation. Our findings highlight RAG's potential in unit test generation for improving test coverage with well-targeted knowledge sources. Future work should focus on retrieval techniques that identify documents with unique program states to optimize RAG-based unit test generation further.
Authors' comments: 11 pages + reference. Accepted as Research Track Paper at ICSE'26
Zhichao Xu, Zongyu Wu, Yun Zhou, Aosong Feng, Kang Zhou, Sangmin Woo, Kiran Ramnath, Yijun Tian et al.
Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent works have begun exploring how to train LLMs to use search engines more effectively as tools for retrieval-augmented generation. Although these methods achieve performance improvement across QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive evaluation framework for evaluating RL-based search agents, covering three distinct faithfulness metrics: information-think faithfulness, think-answer faithfulness, and think-search faithfulness. Our evaluations reveal that a prototypical RL-based search agent, Search-R1, has significant room for improvement in this regard. To foster faithful reasoning, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve comparable task performance across seven QA benchmarks.
Yee Man Choi, Xuehang Guo, Yi R. Fung, Qingyun Wang
Large Language Models (LLMs) have emerged as promising assistants for scientific writing. However, there have been concerns regarding the quality and reliability of the generated text, one of which is citation accuracy and faithfulness. While most recent work relies on methods such as LLM-as-a-Judge, the reliability of LLM-as-a-Judge alone is also in doubt. In this work, we reframe citation evaluation as a problem of citation attribution alignment, that is, assessing whether LLM-generated citations match those a human author would include for the same text. We propose CiteGuard, a retrieval-aware agent framework designed to provide more faithful grounding for citation validation. CiteGuard improves over the prior baseline by 12.3%, and achieves up to 65.4% accuracy on the CiteME benchmark, on par with human-level performance (69.7%). It also enables the identification of alternative but valid citations.
Congying Liu, Xingyuan Wei, Peipei Liu, Yiqing Shen, Yanxu Mao, Tiehan Cui
Biomedical queries often rely on a deep understanding of specialized knowledge such as gene regulatory mechanisms and pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well in general reasoning tasks, their generated biomedical content often lacks scientific rigor because the models cannot access authoritative biomedical databases, and they frequently fabricate protein functions, interactions, and structural details that deviate from authentic information. Therefore, we present BioMedSearch, a multi-source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database access, and web search to support accurate and efficient handling of complex biomedical queries. Through sub-query decomposition, keyword extraction, task graph construction, and multi-source information filtering, BioMedSearch generates high-quality question-answering results. To evaluate the accuracy of question answering, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non-adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels. Specifically, at Level 1, the average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, the average accuracy improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: https://github.com/CyL-ucas/BioMed_Search
Kenan Alkiek, David Jurgens, Vinod Vydiswaran
Can we bring large-scale reasoning to local-scale compute? Small language models (SLMs) are increasingly attractive because they run efficiently on local hardware, offering strong privacy, low cost, and reduced environmental impact. Yet they often struggle with tasks that require multi-step reasoning or domain-specific knowledge. We address this limitation through instruction intervention at inference time, where an SLM retrieves structured reasoning procedures rather than generating them from scratch. Our method builds an Instruction Corpus by grouping similar training questions and creating instructions via GPT-5. During inference, the SLM retrieves the most relevant instructions and follows their steps. Unlike retrieval-augmented generation, which retrieves text passages, instruction retrieval gives the model structured guidance for reasoning. We evaluate this framework on MedQA (medical board exams), MMLU Professional Law, and MathQA using models from 3B to 14B parameters without any additional fine-tuning. Instruction retrieval yields consistent gains: 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA. Concise instructions outperform longer ones, and the magnitude of improvement depends strongly on model family and intrinsic reasoning ability.
Santanu Acharjee, Ripunjoy Choudhury
Due to the exponential growth of big data in this digital era, an advanced method for effective information retrieval becomes essential. The basic objective of this paper is to propose a topology-based method for cognitive information retrieval (CIR) in big data environments. By using concepts such as cognitive similarity distances, metric spaces, retrieval topologies, etc., this paper aims to propose the semantic alignment between user queries and document repositories. The paper also extends this approach to incorporate logical connectives in cognitive information retrieval.
Biao Zhang, Lixin Chen, Tong Liu, Bo Zheng
Large language models (LLMs) generate high-dimensional embeddings that
capture rich semantic and syntactic information. However, high-dimensional
embeddings exacerbate computational complexity and storage requirements,
thereby hindering practical deployment. To address these challenges, we propose
a novel training framework named Sequential Matryoshka Embedding Compression
(SMEC). This framework introduces the Sequential Matryoshka Representation
Learning (SMRL) method to mitigate gradient variance during training, the
Adaptive Dimension Selection (ADS) module to reduce information degradation
during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module
to enhance unsupervised learning between high- and low-dimensional embeddings.
Experiments on image, text, and multimodal datasets demonstrate that SMEC
achieves significant dimensionality reduction while maintaining performance.
For instance, on the BEIR dataset, our approach improves the performance of
compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points
compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
Authors' comments: Accepted by EMNLP 2025
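As background for the dimensionality reduction discussed above, the sketch below shows plain Matryoshka-style compression: keep a prefix of the embedding and renormalize it before computing cosine similarity. This is the simple baseline idea only. SMEC's ADS module selects dimensions adaptively, which this sketch does not attempt, and the vectors are made up.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)
    return [x / norm for x in prefix]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Made-up 6-dimensional embeddings, compressed to 3 dimensions.
query = [0.9, 0.1, 0.3, 0.05, 0.2, 0.1]
document = [0.8, 0.2, 0.4, 0.30, 0.1, 0.0]
query_small = truncate_and_normalize(query, 3)
document_small = truncate_and_normalize(document, 3)
print(len(query_small), cosine(query_small, document_small))
```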
Linfeng Gao, Baolong Bi, Zheng Yuan, Le Wang, Zerui Chen, Zhimin Wei, Shenghua Liu, Qinggang Zhang et al.
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance the factuality of Large Language Models (LLMs). However, existing RAG systems often suffer from an unfaithfulness issue, where the model's response contradicts evidence from the retrieved context. Existing approaches to improving contextual faithfulness largely rely on external interventions, such as prompt engineering, decoding constraints, or reward-based fine-tuning. These works treat the LLM as a black box and overlook a crucial question: how does the LLM internally integrate retrieved evidence with its parametric memory, particularly under knowledge conflicts? To address this gap, we conduct a probing-based analysis of hidden-state representations in LLMs and observe three findings: knowledge integration occurs hierarchically, conflicts manifest as latent signals at the sentence level, and irrelevant context is often amplified when aligned with parametric knowledge. Building on these findings, we propose CLEAR (Conflict-Localized and Enhanced Attention for RAG), a framework that (i) decomposes context into fine-grained sentence-level knowledge, (ii) employs hidden-state probing to localize conflicting knowledge, and (iii) introduces conflict-aware fine-tuning to guide the model to accurately integrate retrieved evidence. Extensive experiments across three benchmarks demonstrate that CLEAR substantially improves both accuracy and contextual faithfulness, consistently outperforming strong baselines under diverse conflict conditions. The related resources are available at https://github.com/LinfengGao/CLEAR.
Ziyuan Luo, Yangyi Zhao, Ka Chun Cheung, Simon See, Renjie Wan
The widespread adoption of Retrieval-Augmented Image Generation (RAIG) has
raised significant concerns about the unauthorized use of private image
datasets. While these systems have shown remarkable capabilities in enhancing
generation quality through reference images, protecting visual datasets from
unauthorized use in such systems remains a challenging problem. Traditional
digital watermarking approaches face limitations in RAIG systems, as the
complex feature extraction and recombination processes fail to preserve
watermark signals during generation. To address these challenges, we propose
ImageSentinel, a novel framework for protecting visual datasets in RAIG. Our
framework synthesizes sentinel images that maintain visual consistency with the
original dataset. These sentinels enable protection verification through
randomly generated character sequences that serve as retrieval keys. To ensure
seamless integration, we leverage vision-language models to generate the
sentinel images. Experimental results demonstrate that ImageSentinel
effectively detects unauthorized dataset usage while preserving generation
quality for authorized applications. Code is available at
https://github.com/luo-ziyuan/ImageSentinel.
Authors' comments: Accepted at NeurIPS 2025
Eric He, Akash Gupta, Adian Liusie, Vatsal Raina, Piotr Molenda, Shirom Chabra, Vyas Raina
Text-image retrieval is necessary for applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are primarily trained on literal caption-like text-image pairs and often fail to capture abstract or persona-driven attributes common in product recommendation applications (e.g., "a gift for a mother who loves gardening"). In contrast, state-of-the-art vision-language models (vLLMs) can align text with images in a flexible manner, but their limited context window prevents them from directly handling retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based system, transferring its nuanced alignment abilities while maintaining the inference-time scalability of an embedding-based approach. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text-image retrieval.