Grigory Kovalev, Natalia Loukachevitch, Mikhail Tikhomirov, Olga Babina, Pavel Mamaev
In this paper, we present a novel series of Russian information retrieval datasets constructed from the "Did you know..." section of Russian Wikipedia. Our datasets support a range of retrieval tasks, including fact-checking, retrieval-augmented generation, and full-document retrieval, by leveraging interesting facts and their referenced Wikipedia articles annotated at the sentence level with graded relevance. We describe the methodology for dataset creation that enables the expansion of existing Russian Information Retrieval (IR) resources. Through extensive experiments, we extend the RusBEIR research by comparing lexical retrieval models, such as BM25, with state-of-the-art neural architectures fine-tuned for Russian, as well as multilingual models. Results of our experiments show that lexical methods tend to outperform neural models on full-document retrieval, while neural approaches better capture lexical semantics in shorter texts, such as in fact-checking or fine-grained retrieval. Using our newly created datasets, we also analyze the impact of document length on retrieval performance and demonstrate that combining retrieval with neural reranking consistently improves results. Our contribution expands the resources available for Russian information retrieval research and highlights the importance of accurate evaluation of retrieval models to achieve optimal performance. All datasets are publicly available at HuggingFace. To facilitate reproducibility and future research, we also release the full implementation on GitHub.
Ha Young Kim, Jun Li, Ana Beatriz Solana, Carolin M. Pirkl, Benedikt Wiestler, Julia A. Schnabel, Cosmin I. Bercea
Rare diseases represent the long tail of medical imaging, where AI models
often fail due to the scarcity of representative training data. In clinical
workflows, radiologists frequently consult case reports and literature when
confronted with unfamiliar findings. Following this line of reasoning, we
introduce RADAR, Retrieval Augmented Diagnostic Reasoning Agents, an agentic
system for rare disease detection in brain MRI. Our approach uses AI agents
with access to external medical knowledge by embedding both case reports and
literature using sentence transformers and indexing them with FAISS to enable
efficient similarity search. The agent retrieves clinically relevant evidence
to guide diagnostic decision making on unseen diseases, without the need for
additional training. Designed as a model-agnostic reasoning module, RADAR can
be seamlessly integrated with diverse large language models, consistently
improving their rare pathology recognition and interpretability. On the NOVA
dataset comprising 280 distinct rare diseases, RADAR achieves up to a 10.2%
performance gain, with the strongest improvements observed for open source
models such as DeepSeek. Beyond accuracy, the retrieved examples provide
interpretable, literature grounded explanations, highlighting
retrieval-augmented reasoning as a powerful paradigm for low-prevalence
conditions in medical imaging.
Authors' comments: Submitted on behalf of the PREDICTOM consortium
Yuanning Cui, Zequn Sun, Wei Hu, Zhangjie Fu
Large language models (LLMs) excel at reasoning but struggle with knowledge-intensive questions due to limited context and parametric knowledge. However, existing methods that rely on finetuned LLMs or GNN retrievers are limited by dataset-specific tuning and poor scalability on large or unseen graphs. We propose the LLM-KGFR collaborative framework, where an LLM works with a structured retriever, the Knowledge Graph Foundation Retriever (KGFR). KGFR encodes relations using LLM-generated descriptions and initializes entities based on their roles in the question, enabling zero-shot generalization to unseen KGs. To handle large graphs efficiently, it employs Asymmetric Progressive Propagation (APP), a stepwise expansion that selectively limits high-degree nodes while retaining informative paths. Through node-, edge-, and path-level interfaces, the LLM iteratively requests candidate answers, supporting facts, and reasoning paths, forming a controllable reasoning loop. Experiments demonstrate that LLM-KGFR achieves strong performance while maintaining scalability and generalization, providing a practical solution for KG-augmented reasoning.
Manh Nguyen, Sunil Gupta, Dai Do, Hung Le
Hallucination mitigation remains a persistent challenge for large language models (LLMs), even as model scales grow. Existing approaches often rely on external knowledge sources, such as structured databases or knowledge graphs, accessed through prompting or retrieval. However, prompt-based grounding is fragile and domain-sensitive, while symbolic knowledge integration incurs heavy retrieval and formatting costs. Motivated by knowledge graphs, we introduce Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds generation in corpus-derived evidence without retraining. GRAD constructs a sparse token transition graph by accumulating next-token logits across a small retrieved corpus in a single forward pass. During decoding, graph-retrieved logits are max-normalized and adaptively fused with model logits to favor high-evidence continuations while preserving fluency. Across three models and a range of question-answering benchmarks spanning intrinsic, extrinsic hallucination, and factuality tasks, GRAD consistently surpasses baselines, achieving up to 9.7$\%$ higher intrinsic accuracy, 8.6$\%$ lower hallucination rates, and 6.9$\%$ greater correctness compared to greedy decoding, while attaining the highest truth--informativeness product score among all methods. GRAD offers a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation, demonstrating that statistical evidence from corpus-level token transitions can effectively steer generation toward more truthful and verifiable outputs.
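The fusion step described in the GRAD abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the `(prev_token, next_token, logit)` triples stand in for logits gathered over the retrieved corpus, and the fixed weight `alpha` is an assumption replacing the adaptive weighting, whose exact form the abstract does not give.

```python
from collections import defaultdict


def build_transition_graph(observations):
    """Accumulate logits per (previous token -> next token) edge.

    `observations` is an iterable of (prev_token, next_token, logit) triples,
    a hypothetical stand-in for logits gathered over the retrieved corpus.
    """
    graph = defaultdict(dict)
    for prev_token, next_token, logit in observations:
        edges = graph[prev_token]
        edges[next_token] = edges.get(next_token, 0.0) + logit
    return graph


def max_normalize(scores):
    """Scale scores so the largest equals 1.0 (all zeros if no score is positive)."""
    peak = max(scores.values(), default=0.0)
    if peak <= 0.0:
        return {token: 0.0 for token in scores}
    return {token: value / peak for token, value in scores.items()}


def fuse_logits(model_logits, graph, prev_token, alpha=1.0):
    """Add max-normalized graph evidence to the model logits.

    `alpha` is a fixed, assumed fusion weight, not the adaptive scheme
    mentioned in the abstract.
    """
    evidence = max_normalize(graph.get(prev_token, {}))
    return {
        token: logit + alpha * evidence.get(token, 0.0)
        for token, logit in model_logits.items()
    }


# Toy example: the corpus evidence favors "Paris" after "capital".
graph = build_transition_graph(
    [("capital", "Paris", 4.0), ("capital", "Paris", 2.0), ("capital", "Lyon", 3.0)]
)
model_logits = {"Paris": 1.0, "Lyon": 1.2, "Rome": 0.9}
fused = fuse_logits(model_logits, graph, "capital")
best = max(fused, key=fused.get)
```

In this toy case the model alone would pick "Lyon", while the fused scores pick "Paris" because the accumulated corpus evidence for that transition is strongest.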
Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context but often suffers from downgraded prefill performance as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, using efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints maintain reasoning fidelity. It integrates seamlessly with existing LLM inference engines and improves their prefill performance by 1.5-3X over state-of-the-art methods, while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads. Our code is released at: https://github.com/Edinburgh-AgenticAI/RAGBoost.
Hung-Ting Chen, Xiang Liu, Shauli Ravfogel, Eunsol Choi
Most text retrievers generate \emph{one} query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, \emph{A}utoregressive \emph{M}ulti-\emph{E}mbedding \emph{R}etriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on synthetic vectorized data, the proposed method can capture multiple target distributions perfectly, showing 4x better performance than a single-embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate in-domain. AMER achieves 4\% and 21\% relative gains over single-embedding baselines on the two datasets we evaluate on. Furthermore, we consistently observe larger gains on the subset of each dataset where the embeddings of the target documents are less similar to each other. We demonstrate the potential of using a multi-query vector retriever and open up a new direction for future work.
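To make the contrast between single-vector and multi-vector retrieval in the AMER abstract above concrete, here is a small sketch. The query vectors are hand-written stand-ins for what a model would generate autoregressively, and the round-robin merge of the per-query result lists is an assumption; the abstract only says that all predicted vectors are used for retrieval.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, docs, k):
    """Return the ids of the k documents with the highest inner product."""
    return sorted(docs, key=lambda doc_id: dot(query, docs[doc_id]), reverse=True)[:k]


def multi_query_retrieve(queries, docs, k):
    """Retrieve with each query vector and merge results round-robin.

    The merge strategy is an illustrative assumption, not taken from the paper.
    """
    per_query = [top_k(query, docs, k) for query in queries]
    merged, seen = [], set()
    for rank in range(k):
        for results in per_query:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


# Two clusters of relevant documents, e.g. two interpretations of one query.
docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
single = top_k([1.0, 0.0], docs, 2)
multi = multi_query_retrieve([[1.0, 0.0], [0.0, 1.0]], docs, 1)
```

With one query vector the top results come from a single cluster (`a`, `b`), while two query vectors cover both clusters (`a`, `c`) with the same number of returned documents.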
Artur Iasenovets, Fei Tang, Huihui Zhu, Ping Wang, Lei Liu
Permissioned blockchains ensure integrity and auditability of shared data but
expose query parameters to peers during read operations, creating privacy risks
for organizations querying sensitive records. This paper proposes a Private
Information Retrieval (PIR) mechanism to enable private reads from Hyperledger
Fabric's world state, allowing endorsing peers to process encrypted queries
without learning which record is accessed. We implement and benchmark a
PIR-enabled chaincode that performs ciphertext-plaintext (ct-pt) homomorphic
multiplication directly within evaluate transactions, preserving Fabric's
endorsement and audit semantics. The prototype achieves an average end-to-end
latency of 113 ms and a peer-side execution time below 42 ms, with
approximately 2 MB of peer network traffic per private read in development
mode--reducible by half under in-process deployment. Storage profiling across
three channel configurations shows near-linear growth: block size increases
from 77 kilobytes to 294 kilobytes and world-state from 112 kilobytes to 332
kilobytes as the ring dimension scales from 8,192 to 32,768 coefficients.
Parameter analysis further indicates that ring size and record length jointly
constrain packing capacity, supporting up to 512 records of 64 bytes each under
the largest configuration. These results confirm the practicality of PIR-based
private reads in Fabric for smaller, sensitive datasets and highlight future
directions to optimize performance and scalability.
Authors' comments: This work has been submitted to IEEE for possible publication
Rajan Das Gupta, Md Kishor Morol, Nafiz Fahad, Md Tanzib Hosain, Sumaya Binte Zilani Choya, Md Jakir Hossen
As the global burden of Alzheimer's disease (AD) continues to grow, early and
accurate detection has become increasingly critical, especially in regions with
limited access to advanced diagnostic tools. We propose BRAINS (Biomedical
Retrieval-Augmented Intelligence for Neurodegeneration Screening) to address
this challenge. This novel system harnesses the powerful reasoning capabilities
of Large Language Models (LLMs) for Alzheimer's detection and monitoring.
BRAINS features a dual-module architecture: a cognitive diagnostic module and a
case-retrieval module. The Diagnostic Module utilizes LLMs fine-tuned on
cognitive and neuroimaging datasets -- including MMSE, CDR scores, and brain
volume metrics -- to perform structured assessments of Alzheimer's risk.
Meanwhile, the Case Retrieval Module encodes patient profiles into latent
representations and retrieves similar cases from a curated knowledge base.
These auxiliary cases are fused with the input profile via a Case Fusion Layer
to enhance contextual understanding. The combined representation is then
processed with clinical prompts for inference. Evaluations on real-world
datasets demonstrate BRAINS' effectiveness in classifying disease severity and
identifying early signs of cognitive decline. This system not only shows strong
potential as an assistive tool for scalable, explainable, and early-stage
Alzheimer's disease detection, but also offers hope for future applications in
the field.
Authors' comments: Accepted for publication in ICMLA 2025
Borong Zhang, Junjing Deng, Yi Jiang, Zichao Wendy Di
We present eMAGPIE (extended Multilevel-Adaptive-Guided Ptychographic Iterative Engine), a stochastic multigrid method for blind ptychographic phase retrieval that jointly recovers the object and the probe. We recast the task as the iterative minimization of a quadratic surrogate that majorizes the exit-wave misfit. From this surrogate, we derive closed-form updates, combined in a geometric-mean, phase-aligned joint step, yielding a simultaneous update of the object and probe with guaranteed descent of the sampled surrogate. This formulation naturally admits a multigrid acceleration that speeds up convergence. In experiments, eMAGPIE attains lower data misfit and phase error at comparable compute budgets and produces smoother, artifact-reduced phase reconstructions.
Sagar Dutta, Vipul Arora
This work presents a supervised deep hashing method for retrieving similar audio events. The proposed method, named AudioNet, is a deep-learning-based system for efficient hashing and retrieval of similar audio events using an audio example as a query. AudioNet achieves high retrieval performance on multiple standard datasets by generating binary hash codes for similar audio events, setting new benchmarks in the field and highlighting its effectiveness compared to other hashing methods. Through comprehensive experiments on standard datasets, our research represents a pioneering effort in evaluating the retrieval performance of similar audio events. A novel loss function is proposed which incorporates weighted contrastive and weighted pairwise loss along with hashcode balancing to improve the efficiency of audio event retrieval. The method adopts discrete gradient propagation, which allows gradients to be propagated through discrete variables during backpropagation. This enables the network to optimize the discrete hash codes using standard gradient-based optimization algorithms, which are typically used for continuous variables. The proposed method showcases promising retrieval performance, as evidenced by the experimental results, even when dealing with imbalanced datasets. The systematic analysis conducted in this study further supports the significant benefits of the proposed method in retrieval performance across multiple datasets. The findings presented in this work establish a baseline for future studies on the efficient retrieval of similar audio events using deep audio embeddings.
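Retrieval with binary hash codes, as in the AudioNet abstract above, typically ranks database items by Hamming distance to the query code. The sketch below shows only that lookup step. Thresholding embeddings at zero to obtain codes is an assumption for illustration, and the embeddings and clip names are made up; the learned network and loss are not reproduced here.

```python
def to_hash_code(embedding):
    """Binarize a real-valued embedding by thresholding at zero (assumed scheme)."""
    return tuple(1 if value > 0 else 0 for value in embedding)


def hamming_distance(code_a, code_b):
    """Number of positions where two equal-length binary codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def retrieve_by_hash(query_embedding, database, k):
    """Return the k database ids whose codes are closest to the query code."""
    query_code = to_hash_code(query_embedding)
    return sorted(
        database,
        key=lambda item_id: hamming_distance(query_code, database[item_id]),
    )[:k]


# Hypothetical embeddings for three audio clips.
database = {
    "dog_bark_1": to_hash_code([0.8, -0.2, 0.5, -0.9]),
    "dog_bark_2": to_hash_code([0.6, -0.4, -0.1, -0.7]),
    "siren": to_hash_code([-0.5, 0.9, -0.3, 0.4]),
}
results = retrieve_by_hash([0.7, -0.1, 0.4, -0.8], database, 2)
```

Because codes are short bit strings, the comparison is cheap relative to distance computations on the original embeddings, which is the efficiency argument for hashing-based retrieval.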
Pavan Kumar Perepu
Mathematical expressions (MEs) have complex two-dimensional structures in which symbols can be present at any nested depth, such as superscripts, subscripts, above, below, etc. As MEs are represented using LaTeX format, several text retrieval methods based on string matching, vector space models, etc., have also been applied to the ME retrieval problem in the literature. As these methods are based on syntactic similarity, deep learning approaches based on embeddings have recently been used for semantic similarity. In our present work, we have focused on the retrieval of mathematical expressions using deep learning approaches. In our approach, semantic features are extracted from the MEs using a deep recurrent neural network (DRNN) and these features have been used for matching and retrieval. We have trained the network for a classification task which determines the complexity of an ME. ME complexity has been quantified in terms of its nested depth. Based on the nested depth, we have considered three complexity classes of MEs: Simple, Medium and Complex. After training the network, outputs just before the final fully connected layer are extracted for all the MEs. These outputs form the semantic features of MEs and are stored in a database. For a given ME query, its semantic features are computed using the trained DRNN and matched against the semantic feature database. Matching is performed based on the standard Euclidean distance and the top 'k' nearest matches are retrieved, where 'k' is a user-defined parameter. Our approach has been illustrated on a database of 829 MEs.
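The matching step at the end of the abstract above (Euclidean distance over stored feature vectors, top 'k' returned) can be sketched as follows. The feature vectors here are invented placeholders for the DRNN outputs, and the function name is hypothetical.

```python
import math


def retrieve_nearest(query_features, feature_db, k):
    """Return the k expressions whose features are closest in Euclidean distance."""
    return sorted(
        feature_db,
        key=lambda expr: math.dist(query_features, feature_db[expr]),
    )[:k]


# Placeholder features; in the described approach these would be the
# activations taken just before the final fully connected layer.
feature_db = {
    r"x^{2}": [0.1, 0.9],
    r"x^{y^{z}}": [0.8, 0.3],
    r"a + b": [0.0, 1.0],
}
matches = retrieve_nearest([0.05, 0.95], feature_db, 2)
```

Here `k` plays the role of the user-defined parameter mentioned in the abstract.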
Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang
Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting model. While model-based methods like EAGLE-2 are accurate but costly, retrieval-enhanced methods like SAM-Decoding rely on heuristic switching strategies that often trigger unnecessary retrievals. To address this, we propose ReSpec (\textbf{Re}trieval-enhanced \textbf{Spe}culative Decoding), a novel framework that transforms heuristic drafter switching into adaptive decision-making. ReSpec features three core innovations: 1) An \textbf{entropy-guided adaptive trigger} quantifies contextual predictability to initiate retrieval only when uncertainty is low, avoiding costly low-quality speculations. 2) A \textbf{feedback-driven candidate selection} leverages historical feedback to organize multiple high-quality candidates for parallel verification, maximizing retrieval utility. 3) A source-aware \textbf{relaxed verification strategy} applies strict checks to model-generated drafts while using a relaxed verification for retrieved drafts, achieving a better balance between accuracy and efficiency. Extensive experiments on Spec-Bench demonstrate that ReSpec achieves state-of-the-art acceleration, outperforming EAGLE-2 and SAM-Decoding by over $33\%$ and $25\%$, respectively, while maintaining output quality.
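The entropy-guided trigger in the ReSpec abstract above can be illustrated with a short sketch: compute the Shannon entropy of the next-token distribution and attempt retrieval only when it is low. The threshold value and function names are assumptions for illustration, not values from the paper.

```python
import math


def shannon_entropy(probabilities):
    """Entropy in nats of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def should_trigger_retrieval(probabilities, threshold=0.5):
    """Trigger retrieval only when the context looks predictable (low entropy).

    `threshold` is a hypothetical setting chosen for this example.
    """
    return shannon_entropy(probabilities) < threshold


confident = [0.97, 0.01, 0.01, 0.01]  # peaked distribution, low uncertainty
uncertain = [0.25, 0.25, 0.25, 0.25]  # uniform distribution, high uncertainty
```

With these inputs, `should_trigger_retrieval(confident)` is true and `should_trigger_retrieval(uncertain)` is false, so retrieval would be skipped in the high-uncertainty case.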
Amirabbas Afzali, Amirreza Velae, Iman Ahmadi, Mohammad Aliannejadi
The presence of social biases in large language models (LLMs) has become a significant concern in AI research. These biases, often embedded in training data, can perpetuate harmful stereotypes and distort decision-making processes. When LLMs are integrated into ranking systems, they can propagate these biases, leading to unfair outcomes in critical applications such as search engines and recommendation systems. Backpack Language Models, unlike traditional transformer-based models that treat text sequences as monolithic structures, generate outputs as weighted combinations of non-contextual, learned word aspects, also known as senses. Leveraging this architecture, we propose a framework for debiasing ranking tasks. Our experimental results show that this framework effectively mitigates gender bias in text retrieval and ranking with minimal degradation in performance.
Yubo Wang, Haoyang Li, Fei Teng, Lei Chen
Graph-based retrieval-augmented generation (Graph-based RAG) has demonstrated significant potential in enhancing Large Language Models (LLMs) with structured knowledge. However, existing methods face three critical challenges: Inaccurate Graph Construction, caused by LLM hallucination; Poor Reasoning Ability, caused by failing to generate explicit reasons telling the LLM why certain chunks were selected; and Inadequate Answering, which only partially answers the query due to inadequate LLM reasoning, making their performance lag behind NaiveRAG on certain tasks. To address these issues, we propose AGRAG, an advanced graph-based retrieval-augmented generation framework. When constructing the graph, AGRAG substitutes the widely used LLM entity extraction method with a statistics-based method, avoiding hallucination and error propagation. During retrieval, AGRAG formulates the graph reasoning procedure as the Minimum Cost Maximum Influence (MCMI) subgraph generation problem, where we try to include more nodes with high influence scores at a lower edge cost, to make the generated reasoning paths more comprehensive. We prove this problem to be NP-hard, and propose a greedy algorithm to solve it. The generated MCMI subgraph can serve as explicit reasoning paths that tell the LLM why certain chunks were retrieved, thereby helping the LLM focus on the query-related parts of the chunks, reducing the impact of noise, and improving AGRAG's reasoning ability. Furthermore, compared with simple tree-structured reasoning paths, our MCMI subgraph allows more complex graph structures, such as cycles, and improves the comprehensiveness of the generated reasoning paths.
Tasmia Zerin, Moumita Asad, B. M. Mainul Hossain, Kazi Sakib
Responsive websites frequently experience distorted layouts at specific
screen sizes, called Responsive Layout Failures (RLFs). Manually repairing
these RLFs involves tedious trial-and-error adjustments of HTML elements and
CSS properties. In this study, an automated repair approach that leverages an LLM
combined with domain-specific knowledge is proposed. The approach is named
ReDeFix, a Retrieval-Augmented Generation (RAG)-based solution that utilizes
Stack Overflow (SO) discussions to guide the LLM on CSS repairs. By augmenting
relevant SO knowledge with RLF-specific contexts, ReDeFix creates a prompt that
is sent to the LLM to generate CSS patches. Evaluation demonstrates that our
approach achieves 88\% accuracy in repairing RLFs. Furthermore, a study with
software engineers reveals that generated repairs produce visually correct
layouts while maintaining aesthetics.
Authors' comments: Accepted at the 41st IEEE International Conference on Software
Maintenance and Evolution 2025 (ICSME'25)
Qi Luo, Xiaonan Li, Junqi Dai, Shuang Cheng, Xipeng Qiu
Retrieval-Augmented Generation, which usually uses a large external corpus to supplement the knowledge of LLMs, has shown remarkable results in addressing Large Language Models' hallucinations. However, with the development of LLMs, the internal knowledge of LLMs has expanded significantly, thus causing significant knowledge redundancy between the external corpus and LLMs. On the one hand, the indexing cost of dense retrieval is highly related to the corpus size, and thus significant redundant knowledge intensifies the dense retrieval's workload. On the other hand, the redundant knowledge in the external corpus is not helpful to LLMs, and our exploratory analysis shows that it instead hurts the RAG performance on those questions which the LLM can answer by itself. To address these issues, we propose Zero-RAG. Specifically, we first propose the Mastery-Score metric to identify redundant knowledge in the RAG corpus and prune it. After pruning, answers to "mastered" questions rely primarily on the internal knowledge of the LLM. To better harness this internal capacity, we propose Query Router and Noise-Tolerant Tuning to avoid distraction from irrelevant documents and thus further improve the LLM's utilization of internal knowledge with the pruned corpus. Experimental results show that Zero-RAG prunes the Wikipedia corpus by 30\% and accelerates the retrieval stage by 22\%, without compromising RAG's performance.
Shounak Paul, Dhananjay Ghumare, Pawan Goyal, Saptarshi Ghosh, Ashutosh Modi
Identifying/retrieving relevant statutes and prior cases/precedents for a
given legal situation are common tasks exercised by law practitioners.
Researchers to date have addressed the two tasks independently, thus developing
completely different datasets and models for each task; however, both retrieval
tasks are inherently related, e.g., similar cases tend to cite similar statutes
(due to similar factual situation). In this paper, we address this gap. We
propose IL-PCR (Indian Legal corpus for Prior Case and Statute Retrieval),
which is a unique corpus that provides a common testbed for developing models
for both the tasks (Statute Retrieval and Precedent Retrieval) that can exploit
the dependence between the two. We experiment extensively with several baseline
models on the tasks, including lexical models, semantic models and ensembles
based on GNNs. Further, to exploit the dependence between the two tasks, we
develop an LLM-based re-ranking approach that gives the best performance.
Authors' comments: Accepted at EMNLP 2025 (Main)
Xiang Li, Till Jahnke, Rebecca Boll, Jiaqi Han, Minkai Xu, Michael Meyer, Maria Novella Piancastelli, Daniel Rolles et al.
Capturing the structural changes that molecules undergo during chemical reactions in real space and time is a long-standing dream and an essential prerequisite for understanding and ultimately controlling femtochemistry. A key approach to tackle this challenging task is Coulomb explosion imaging, which benefited decisively from recently emerging high-repetition-rate X-ray free-electron laser sources. With this technique, information on the molecular structure is inferred from the momentum distributions of the ions produced by the rapid Coulomb explosion of molecules. Retrieving molecular structures from these distributions poses a highly non-linear inverse problem that remains unsolved for molecules consisting of more than a few atoms. Here, we address this challenge using a diffusion-based Transformer neural network. We show that the network reconstructs unknown molecular geometries from ion-momentum distributions with a mean absolute error below one Bohr radius, which is half the length of a typical chemical bond.
WonJun Moon, MinSeok Jung, Gilhan Park, Tae-Young Kim, Cheol-Ho Cho, Woojin Jun, Jae-Pil Heo
Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the
content matches a text query. Existing methods treat every annotated text-video
pair as a positive and all others as negatives, ignoring the rich semantic
variation both within a single video and across different videos. Consequently,
embeddings of both queries and their corresponding video-clip segments for
distinct events within the same video collapse together, while embeddings of
semantically similar queries and segments from different videos are driven
apart. This limits retrieval performance when videos contain multiple, diverse
events. This paper addresses the aforementioned problems, termed as semantic
collapse, in both the text and video embedding spaces. We first introduce Text
Correlation Preservation Learning, which preserves the semantic relationships
encoded by the foundation model across text queries. To address collapse in
video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive
alignment method that disentangles hierarchical video representations across
temporal scales. Subsequently, we introduce order-preserving token merging and
adaptive CBVA to enhance alignment by producing video segments that are
internally coherent yet mutually distinctive. Extensive experiments on PRVR
benchmarks demonstrate that our framework effectively prevents semantic
collapse and substantially improves retrieval accuracy.
Authors' comments: Accepted to NeurIPS 2025. Code is available at
https://github.com/admins97/MSC_PRVR
Philipp Davydov, Ameya Prabhu, Matthias Bethge, Elisa Nguyen, Seong Joon Oh
Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
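The two-stage pipeline and novelty decision described in the abstract above can be sketched in a few lines. The two scoring functions are deliberately simple stand-ins (token overlap for the lightweight first stage, Jaccard similarity for the reranker) and are not GIST embeddings or ColBERTv2; the reference score and all names are hypothetical.

```python
def cheap_score(query, document):
    """First-stage stand-in: number of shared tokens."""
    return len(set(query.split()) & set(document.split()))


def fine_score(query, document):
    """Second-stage stand-in: Jaccard similarity of the token sets."""
    query_tokens, doc_tokens = set(query.split()), set(document.split())
    union = query_tokens | doc_tokens
    return len(query_tokens & doc_tokens) / len(union) if union else 0.0


def max_attribution(output, corpus, n=2):
    """Retrieve the top-n candidates cheaply, then rerank and keep the best score."""
    candidates = sorted(corpus, key=lambda doc: cheap_score(output, doc), reverse=True)[:n]
    return max((fine_score(output, doc) for doc in candidates), default=0.0)


def is_novel(output, corpus, reference_score, n=2):
    """An output counts as novel if its best corpus match scores below the reference.

    `reference_score` stands in for the attribution score of a
    human-generated reference text.
    """
    return max_attribution(output, corpus, n) < reference_score


corpus = ["the cat sat on the mat", "dogs chase cats", "rain falls in spring"]
```

For example, `is_novel("the cat sat on the mat", corpus, 0.5)` is false because the output appears verbatim in the toy corpus, while `is_novel("quantum turtles sing", corpus, 0.5)` is true because no corpus item shares any token with it.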