Amirabbas Afzali, Amirreza Velae, Iman Ahmadi, Mohammad Aliannejadi
The presence of social biases in large language models (LLMs) has become a significant concern in AI research. These biases, often embedded in training data, can perpetuate harmful stereotypes and distort decision-making processes. When LLMs are integrated into ranking systems, they can propagate these biases, leading to unfair outcomes in critical applications such as search engines and recommendation systems. Backpack Language Models, unlike traditional transformer-based models that treat text sequences as monolithic structures, generate outputs as weighted combinations of non-contextual, learned word aspects, also known as senses. Leveraging this architecture, we propose a framework for debiasing ranking tasks. Our experimental results show that this framework effectively mitigates gender bias in text retrieval and ranking with minimal degradation in performance.
Yubo Wang, Haoyang Li, Fei Teng, Lei Chen
Graph-based retrieval-augmented generation (Graph-based RAG) has demonstrated significant potential in enhancing Large Language Models (LLMs) with structured knowledge. However, existing methods face three critical challenges: Inaccurate Graph Construction, caused by LLM hallucination; Poor Reasoning Ability, caused by failing to generate explicit reasons telling LLM why certain chunks were selected; and Inadequate Answering, which only partially answers the query due to the inadequate LLM reasoning, making their performance lag behind NaiveRAG on certain tasks. To address these issues, we propose AGRAG, an advanced graph-based retrieval-augmented generation framework. When constructing the graph, AGRAG substitutes the widely used LLM entity extraction method with a statistics-based method, avoiding hallucination and error propagation. When retrieval, AGRAG formulates the graph reasoning procedure as the Minimum Cost Maximum Influence (MCMI) subgraph generation problem, where we try to include more nodes with high influence score, but with less involving edge cost, to make the generated reasoning paths more comprehensive. We prove this problem to be NP-hard, and propose a greedy algorithm to solve it. The MCMI subgraph generated can serve as explicit reasoning paths to tell LLM why certain chunks were retrieved, thereby making the LLM better focus on the query-related part contents of the chunks, reducing the impact of noise, and improving AGRAG's reasoning ability. Furthermore, compared with the simple tree-structured reasoning paths, our MCMI subgraph can allow more complex graph structures, such as cycles, and improve the comprehensiveness of the generated reasoning paths.
Tasmia Zerin, Moumita Asad, B. M. Mainul Hossain, Kazi Sakib
Responsive websites frequently experience distorted layouts at specific
screen sizes, called Responsive Layout Failures (RLFs). Manually repairing
these RLFs involves tedious trial-and-error adjustments of HTML elements and
CSS properties. In this study, an automated repair approach, leveraging LLM
combined with domain-specific knowledge is proposed. The approach is named
ReDeFix, a Retrieval-Augmented Generation (RAG)-based solution that utilizes
Stack Overflow (SO) discussions to guide LLM on CSS repairs. By augmenting
relevant SO knowledge with RLF-specific contexts, ReDeFix creates a prompt that
is sent to the LLM to generate CSS patches. Evaluation demonstrates that our
approach achieves an 88\% accuracy in repairing RLFs. Furthermore, a study from
software engineers reveals that generated repairs produce visually correct
layouts while maintaining aesthetics.
Authors' comments: Accepted at the 41st IEEE International Conference on Software
Maintenance and Evolution 2025 (ICSME'25)
Qi Luo, Xiaonan Li, Junqi Dai, Shuang Cheng, Xipeng Qiu
Retrieval-Augmented Generation has shown remarkable results to address Large Language Models' hallucinations, which usually uses a large external corpus to supplement knowledge to LLMs. However, with the development of LLMs, the internal knowledge of LLMs has expanded significantly, thus causing significant knowledge redundancy between the external corpus and LLMs. On the one hand, the indexing cost of dense retrieval is highly related to the corpus size and thus significant redundant knowledge intensifies the dense retrieval's workload. On the other hand, the redundant knowledge in the external corpus is not helpful to LLMs and our exploratory analysis shows that it instead hurts the RAG performance on those questions which the LLM can answer by itself. To address these issues, we propose Zero-RAG to tackle these challenges. Specifically, we first propose the Mastery-Score metric to identify redundant knowledge in the RAG corpus to prune it. After pruning, answers to "mastered" questions rely primarily on internal knowledge of the LLM. To better harness the internal capacity, we propose Query Router and Noise-Tolerant Tuning to avoid the irrelevant documents' distraction and thus further improve the LLM's utilization of internal knowledge with pruned corpus. Experimental results show that Zero-RAG prunes the Wikipedia corpus by 30\% and accelerates the retrieval stage by 22\%, without compromising RAG's performance.
Shounak Paul, Dhananjay Ghumare, Pawan Goyal, Saptarshi Ghosh, Ashutosh Modi
Identifying/retrieving relevant statutes and prior cases/precedents for a
given legal situation are common tasks exercised by law practitioners.
Researchers to date have addressed the two tasks independently, thus developing
completely different datasets and models for each task; however, both retrieval
tasks are inherently related, e.g., similar cases tend to cite similar statutes
(due to similar factual situation). In this paper, we address this gap. We
propose IL-PCR (Indian Legal corpus for Prior Case and Statute Retrieval),
which is a unique corpus that provides a common testbed for developing models
for both the tasks (Statute Retrieval and Precedent Retrieval) that can exploit
the dependence between the two. We experiment extensively with several baseline
models on the tasks, including lexical models, semantic models and ensemble
based on GNNs. Further, to exploit the dependence between the two tasks, we
develop an LLM-based re-ranking approach that gives the best performance.
Authors' comments: Accepted at EMNLP 2025 (Main)
Xiang Li, Till Jahnke, Rebecca Boll, Jiaqi Han, Minkai Xu, Michael Meyer, Maria Novella Piancastelli, Daniel Rolles et al.
Capturing the structural changes that molecules undergo during chemical reactions in real space and time is a long-standing dream and an essential prerequisite for understanding and ultimately controlling femtochemistry. A key approach to tackle this challenging task is Coulomb explosion imaging, which benefited decisively from recently emerging high-repetition-rate X-ray free-electron laser sources. With this technique, information on the molecular structure is inferred from the momentum distributions of the ions produced by the rapid Coulomb explosion of molecules. Retrieving molecular structures from these distributions poses a highly non-linear inverse problem that remains unsolved for molecules consisting of more than a few atoms. Here, we address this challenge using a diffusion-based Transformer neural network. We show that the network reconstructs unknown molecular geometries from ion-momentum distributions with a mean absolute error below one Bohr radius, which is half the length of a typical chemical bond.
WonJun Moon, MinSeok Jung, Gilhan Park, Tae-Young Kim, Cheol-Ho Cho, Woojin Jun, Jae-Pil Heo
Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the
content matches a text query. Existing methods treat every annotated text-video
pair as a positive and all others as negatives, ignoring the rich semantic
variation both within a single video and across different videos. Consequently,
embeddings of both queries and their corresponding video-clip segments for
distinct events within the same video collapse together, while embeddings of
semantically similar queries and segments from different videos are driven
apart. This limits retrieval performance when videos contain multiple, diverse
events. This paper addresses the aforementioned problems, termed as semantic
collapse, in both the text and video embedding spaces. We first introduce Text
Correlation Preservation Learning, which preserves the semantic relationships
encoded by the foundation model across text queries. To address collapse in
video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive
alignment method that disentangles hierarchical video representations across
temporal scales. Subsequently, we introduce order-preserving token merging and
adaptive CBVA to enhance alignment by producing video segments that are
internally coherent yet mutually distinctive. Extensive experiments on PRVR
benchmarks demonstrate that our framework effectively prevents semantic
collapse and substantially improves retrieval accuracy.
Authors' comments: Accpeted to NeurIPS 2025. Code is available at
https://github.com/admins97/MSC_PRVR
Philipp Davydov, Ameya Prabhu, Matthias Bethge, Elisa Nguyen, Seong Joon Oh
Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
Yinglu Li, Zhiying Lu, Zhihang Liu, Chuanbin Liu, Hongtao Xie
Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model's attention and further degrade the performance. To address this challenge, we propose \modelname, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, \modelname enables the generator to focus solely on concise visual content relevant to queries, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance. Improves retrieval accuracy by 10.02\% in R@1 on average and increases question answering accuracy by 3.56\% while using only 71.42\% visual tokens compared to prior methods. The code will be available at https://github.com/Aeryn666/RegionRAG.
Jamie Mahowald, Benjamin Charles Germain Lee
Multimodal approaches have shown great promise for searching and navigating
digital collections held by libraries, archives, and museums. In this paper, we
introduce map-RAS: a retrieval-augmented search system for historic maps. In
addition to introducing our framework, we detail our publicly-hosted demo for
searching 101,233 map images held by the Library of Congress. With our system,
users can multimodally query the map collection via ColPali, summarize search
results using Llama 3.2, and upload their own collections to perform
inter-collection search. We articulate potential use cases for archivists,
curators, and end-users, as well as future work with our system in both machine
learning and the digital humanities. Our demo can be viewed at:
http://www.mapras.com.
Authors' comments: 5 pages, 5 figures
Thomas Cook, Richard Osuagwu, Liman Tsatiashvili, Vrynsia Vrynsia, Koustav Ghosal, Maraim Masoud, Riccardo Mattivi
Retrieval-Augmented Generation (RAG) systems often face limitations in
specialized domains such as fintech, where domain-specific ontologies, dense
terminology, and acronyms complicate effective retrieval and synthesis. This
paper introduces an agentic RAG architecture designed to address these
challenges through a modular pipeline of specialized agents. The proposed
system supports intelligent query reformulation, iterative sub-query
decomposition guided by keyphrase extraction, contextual acronym resolution,
and cross-encoder-based context re-ranking. We evaluate our approach against a
standard RAG baseline using a curated dataset of 85 question--answer--reference
triples derived from an enterprise fintech knowledge base. Experimental results
demonstrate that the agentic RAG system outperforms the baseline in retrieval
precision and relevance, albeit with increased latency. These findings suggest
that structured, multi-agent methodologies offer a promising direction for
enhancing retrieval robustness in complex, domain-specific settings.
Authors' comments: Keywords: RAG Agentic AI Fintech NLP KB Domain-Specific Ontology
Query Understanding
Prathamesh Tamhane
Modern astronomical surveys such as the Sloan Digital Sky Survey (SDSS) provide extensive astronomical databases enabling researchers to access vast amount of diverse data. However, retrieving data from archives requires knowledge of query languages and familiarity with their schema, which presents a barrier for non-experts. This work investigates the use of Microsoft Phi-2, a compact yet powerful transformer-based language model, fine-tuned on natural language--SQL pairs constructed from SDSS query examples. We develop an interface that translates user queries in natural language into SQL commands compatible with SDSS SkyServer. Preliminary evaluation shows that the fine-tuned model produces syntactically valid and largely semantically correct queries across a variety of astronomy-related requests. Our results show that even small-scale models, when carefully fine-tuned, can provide effective domain-specific natural language interfaces for large scientific databases.
Authors' comments: Submitted to the Proceedings of the International Astronomical Union (IAU Symposium 397, ''UniversAI: Exploring the Universe with Artificial Intelligence,'' 2025). 4 pages
Jiawei Zhou, Lei Chen
As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trialand-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever's self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
Michail Dadopoulos, Anestis Ladas, Stratos Moschidis, Ioannis Negkakis
Retrieval-Augmented Generation (RAG) struggles on long, structured financial
filings where relevant evidence is sparse and cross-referenced. This paper
presents a systematic investigation of advanced metadata-driven
Retrieval-Augmented Generation (RAG) techniques, proposing and evaluating a
novel, multi-stage RAG architecture that leverages LLM-generated metadata. We
introduce a sophisticated indexing pipeline to create contextually rich
document chunks and benchmark a spectrum of enhancements, including
pre-retrieval filtering, post-retrieval reranking, and enriched embeddings,
benchmarked on the FinanceBench dataset. Our results reveal that while a
powerful reranker is essential for precision, the most significant performance
gains come from embedding chunk metadata directly with text ("contextual
chunks"). Our proposed optimal architecture combines LLM-driven pre-retrieval
optimizations with these contextual embeddings to achieve superior performance.
Additionally, we present a custom metadata reranker that offers a compelling,
cost-effective alternative to commercial solutions, highlighting a practical
trade-off between peak performance and operational efficiency. This study
provides a blueprint for building robust, metadata-aware RAG systems for
financial document analysis.
Authors' comments: Preprint version submitted to the International Journal of Accounting
Information Systems; currently under major revision. 20 pages, 1 figure, 1
table
Deniz Gorur, Antoni Rago, Francesca Toni
Judgmental forecasting is the task of making predictions about future events based on human judgment. This task can be seen as a form of claim verification, where the claim corresponds to a future event and the task is to assess the plausibility of that event. In this paper, we propose a novel multi-agent framework for claim verification, whereby different agents may disagree on claim veracity and bring specific evidence for and against the claims, represented as quantitative bipolar argumentation frameworks (QBAFs). We then instantiate the framework for supporting claim verification, with a variety of agents realised with Large Language Models (LLMs): (1) ArgLLM agents, an existing approach for claim verification that generates and evaluates QBAFs; (2) RbAM agents, whereby LLM-empowered Relation-based Argument Mining (RbAM) from external sources is used to generate QBAFs; (3) RAG-ArgLLM agents, extending ArgLLM agents with a form of Retrieval-Augmented Generation (RAG) of arguments from external sources. Finally, we conduct experiments with two standard judgmental forecasting datasets, with instances of our framework with two or three agents, empowered by six different base LLMs. We observe that combining evidence from agents can improve forecasting accuracy, especially in the case of three agents, while providing an explainable combination of evidence for claim verification.
Ziyu Liu, Yijing Liu, Jianfei Yuan, Minzhi Yan, Le Yue, Honghui Xiong, Yi Yang
Graph-based RAG constructs a knowledge graph (KG) from text chunks to enhance retrieval in Large Language Model (LLM)-based question answering. It is especially beneficial in domains such as biomedicine, law, and political science, where effective retrieval often involves multi-hop reasoning over proprietary documents. However, these methods demand numerous LLM calls to extract entities and relations from text chunks, incurring prohibitive costs at scale. Through a carefully designed ablation study, we observe that certain words (termed concepts) and their associated documents are more important. Based on this insight, we propose Graph-Guided Concept Selection (G2ConS). Its core comprises a chunk selection method and an LLM-independent concept graph. The former selects salient document chunks to reduce KG construction costs; the latter closes knowledge gaps introduced by chunk selection at zero cost. Evaluations on multiple real-world datasets show that G2ConS outperforms all baselines in construction cost, retrieval effectiveness, and answering quality.
Hao Jia, Penghao Zhao, Hao Wu, Yuan Gao, Yangyu Tao, Bin Cui
Accurate and long-term spatiotemporal prediction for complex physical systems remains a fundamental challenge in scientific computing. While deep learning models, as powerful parametric approximators, have shown remarkable success, they suffer from a critical limitation: the accumulation of errors during long-term autoregressive rollouts often leads to physically implausible artifacts. This deficiency arises from their purely parametric nature, which struggles to capture the full constraints of a system's intrinsic dynamics. To address this, we introduce a novel \textbf{Retrieval-Augmented Prediction (RAP)} framework, a hybrid paradigm that synergizes the predictive power of deep networks with the grounded truth of historical data. The core philosophy of RAP is to leverage historical evolutionary exemplars as a non-parametric estimate of the system's local dynamics. For any given state, RAP efficiently retrieves the most similar historical analog from a large-scale database. The true future evolution of this analog then serves as a \textbf{reference target}. Critically, this target is not a hard constraint in the loss function but rather a powerful conditional input to a specialized dual-stream architecture. It provides strong \textbf{dynamic guidance}, steering the model's predictions towards physically viable trajectories. In extensive benchmarks across meteorology, turbulence, and fire simulation, RAP not only surpasses state-of-the-art methods but also significantly outperforms a strong \textbf{analog-only forecasting baseline}. More importantly, RAP generates predictions that are more physically realistic by effectively suppressing error divergence in long-term rollouts.
Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, Benjamin Van Durme
We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
Authors' comments: https://github.com/alexmartin1722/mirage
Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra
Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
Binxiao Xu, Junyu Feng, Ruichuan An, Yulin Luo, Shilin Yan, Hao Liang, Ming Lu, Wentao Zhang
The rapid development of Vision-language models (VLMs) enables open-ended
perception and reasoning. Recent works have started to investigate how to adapt
general-purpose VLMs into personalized assistants. Even commercial models such
as ChatGPT now support model personalization by incorporating user-specific
information. However, existing methods either learn a set of concept tokens or
train a VLM to utilize user-specific information. However, both pipelines
struggle to generate accurate answers as personalized assistants. We introduce
Jarvis, an innovative framework for a personalized AI assistant through
personal KV-Cache retrieval, which stores user-specific information in the
KV-Caches of both textual and visual tokens. The textual tokens are created by
summarizing user information into metadata, while the visual tokens are
produced by extracting distinct image patches from the user's images. When
answering a question, Jarvis first retrieves related KV-Caches from personal
storage and uses them to ensure accuracy in responses. We also introduce a
fine-grained benchmark built with the same distinct image patch mining
pipeline, emphasizing accurate question answering based on fine-grained
user-specific information. Jarvis is capable of providing more accurate
responses, particularly when they depend on specific local details. Jarvis
achieves state-of-the-art results in both visual question answering and
text-only tasks across multiple datasets, indicating a practical path toward
personalized AI assistants. The code and dataset will be released.
Authors' comments: 19 pages, 7 figures