Franco Maria Nardini, Thong Nguyen, Cosimo Rulli, Rossano Venturini, Andrew Yates
Learned Sparse Retrieval (LSR) is an effective IR approach that exploits pre-trained language models for encoding text into a learned bag of words. Several efforts in the literature have shown that sparsity is key to enabling a good trade-off between the efficiency and effectiveness of the query processor. To induce the right degree of sparsity, researchers typically use regularization techniques when training LSR models. Recently, new efficient -- inverted index-based -- retrieval engines have been proposed, leading to a natural question: has the role of regularization changed in training LSR models? In this paper, we conduct an extended evaluation of regularization approaches for LSR where we discuss their effectiveness, efficiency, and out-of-domain generalization capabilities. We first show that regularization can be relaxed to produce more effective LSR encoders. We also show that query encoding is now the bottleneck limiting the overall query processor performance. To remove this bottleneck, we advance the state-of-the-art of inference-free LSR by proposing Learned Inference-free Retrieval (Li-LSR). At training time, Li-LSR learns a score for each token, casting the query encoding step into a seamless table lookup. Our approach yields state-of-the-art effectiveness for both in-domain and out-of-domain evaluation, surpassing Splade-v3-Doc by 1 point of MRR@10 on MS MARCO and 1.8 points of nDCG@10 on BEIR.
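The table-lookup view of query encoding described above can be illustrated with a minimal sketch. This is not the Li-LSR implementation: the token scores, the whitespace tokenization, and the function names `encode_query` and `score` are hypothetical and chosen only for illustration.

```python
# Hypothetical per-token scores, standing in for a table learned at training time.
TOKEN_SCORES = {"sparse": 1.8, "retrieval": 1.2, "learned": 0.9, "the": 0.0}

def encode_query(tokens, token_scores):
    """Build a sparse query vector by table lookup, with no model forward pass."""
    query_vec = {}
    for token in tokens:
        weight = token_scores.get(token, 0.0)
        if weight > 0.0:  # zero-weight tokens are dropped from the sparse vector
            query_vec[token] = weight
    return query_vec

def score(query_vec, doc_vec):
    """Dot product between a sparse query vector and a sparse document vector."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

query_vec = encode_query("the learned sparse retrieval".split(), TOKEN_SCORES)
doc_vec = {"sparse": 0.5, "retrieval": 2.0, "index": 1.0}  # toy document weights
doc_score = score(query_vec, doc_vec)
```

Because the query side is a dictionary lookup, the per-query cost in this sketch is linear in the number of query tokens, while the document vector is assumed to have been produced by an encoder at indexing time.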
Mingjun Xu, Zehui Wang, Hengxing Cai, Renxin Zhong
Retrieval-augmented generation (RAG) systems have predominantly focused on text-based retrieval, limiting their effectiveness in handling visually-rich documents that encompass text, images, tables, and charts. To bridge this gap, we propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering to effectively capture and utilize the complex interdependencies between textual and visual modalities. By leveraging off-the-shelf vision-language models and implementing a training-free hybrid retrieval strategy, our framework demonstrates robust performance without the need for task-specific fine-tuning. Experimental evaluations reveal that incorporating layout-aware search and VLM-based candidate verification significantly enhances retrieval accuracy, achieving a top performance score of 65.56. This work underscores the potential of scalable and reproducible solutions in advancing multimodal document retrieval systems.
Zihan Niu, Zheyong Xie, Shaosheng Cao, Chonggang Lu, Zheyu Ye, Tong Xu, Zuozhu Liu, Yan Gao et al.
Social chatbots have become essential intelligent companions in daily scenarios ranging from emotional support to personal interaction. However, conventional chatbots with passive response mechanisms usually rely on users to initiate or sustain dialogues by bringing up new topics, resulting in diminished engagement and shortened dialogue duration. In this paper, we present PaRT, a novel framework enabling context-aware proactive dialogues for social chatbots through personalized real-time retrieval and generation. Specifically, PaRT first integrates user profiles and dialogue context into a large language model (LLM), which is initially prompted to refine user queries and recognize their underlying intents for the upcoming conversation. Guided by refined intents, the LLM generates personalized dialogue topics, which then serve as targeted queries to retrieve relevant passages from RedNote. Finally, we prompt LLMs with summarized passages to generate knowledge-grounded and engagement-optimized responses. Our approach has been running stably in a real-world production environment for more than 30 days, achieving a 21.77% improvement in the average duration of dialogues.
Manish Bhattarai, Miguel Cordova, Javier Santos, Dan O'Malley
In supercomputing, efficient and optimized code generation is essential to leverage high-performance systems effectively. We propose Agentic Retrieval-Augmented Code Synthesis (ARCS), an advanced framework for accurate, robust, and efficient code generation, completion, and translation. ARCS integrates Retrieval-Augmented Generation (RAG) with Chain-of-Thought (CoT) reasoning to systematically break down and iteratively refine complex programming tasks. An agent-based RAG mechanism retrieves relevant code snippets, while real-time execution feedback drives the synthesis of candidate solutions. This process is formalized as a state-action search tree optimization, balancing code correctness with editing efficiency. Evaluations on the Geeks4Geeks and HumanEval benchmarks demonstrate that ARCS significantly outperforms traditional prompting methods in translation and generation quality. By enabling scalable and precise code synthesis, ARCS offers transformative potential for automating and optimizing code development in supercomputing applications, enhancing computational resource utilization.
Carlo Merola, Jaspinder Singh
Retrieval-augmented generation (RAG) has become a transformative approach for
enhancing large language models (LLMs) by grounding their outputs in external
knowledge sources. Yet, a critical question persists: how can vast volumes of
external knowledge be managed effectively within the input constraints of LLMs?
Traditional methods address this by chunking external documents into smaller,
fixed-size segments. While this approach alleviates input limitations, it often
fragments context, resulting in incomplete retrieval and diminished coherence
in generation. To overcome these shortcomings, two advanced techniques, late
chunking and contextual retrieval, have been introduced, both aiming to
preserve global context. Despite their potential, their comparative strengths
and limitations remain unclear. This study presents a rigorous analysis of late
chunking and contextual retrieval, evaluating their effectiveness and
efficiency in optimizing RAG systems. Our results indicate that contextual
retrieval preserves semantic coherence more effectively but requires greater
computational resources. In contrast, late chunking offers higher efficiency
but tends to sacrifice relevance and completeness.
Authors' comments: 13 pages, 2 figures, Second Workshop on Knowledge-Enhanced
Information Retrieval, ECIR 2025
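For readers unfamiliar with late chunking, the idea is to encode the whole document once at token level and only then pool the token embeddings into one vector per chunk, so each chunk vector reflects document-wide context. A minimal sketch follows; the toy embeddings stand in for the output of a long-context encoder, and the names `mean_pool` and `late_chunk` are hypothetical rather than taken from the paper.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_embeddings, spans):
    """Pool document-level token embeddings into one vector per chunk.

    spans holds (start, end) token offsets with an exclusive end.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

# Stand-in for token embeddings produced by one pass over the full document.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
spans = [(0, 2), (2, 4)]  # two chunks of two tokens each
chunk_vectors = late_chunk(token_embeddings, spans)
```

Contextual retrieval, by contrast, is generally described as prepending generated context to each chunk before embedding it, which is where its additional computational cost comes from.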
Qianren Mao, Qili Zhang, Hanwen Hao, Zhentao Han, Runhua Xu, Weifeng Jiang, Qi Hu, Zhijun Chen et al.
Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution for enhancing the accuracy and credibility of Large Language Models (LLMs), particularly in Question & Answer tasks. This is achieved by incorporating proprietary and private data from integrated databases. However, private RAG systems face significant challenges due to the scarcity of private domain data and critical data privacy issues. These obstacles impede the deployment of private RAG systems, as developing privacy-preserving RAG systems requires a delicate balance between data security and data availability. To address these challenges, we regard federated learning (FL) as a highly promising technology for privacy-preserving RAG services. We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG). This framework facilitates collaborative training of client-side RAG retrieval models. The parameters of these models are aggregated and distributed on a central server, ensuring data privacy without direct sharing of raw data. In FedE4RAG, knowledge distillation is employed for communication between the server and client models. This technique improves the generalization of local RAG retrievers during the federated learning process. Additionally, we apply homomorphic encryption within federated learning to safeguard model parameters and mitigate concerns related to data leakage. Extensive experiments conducted on a real-world dataset have validated the effectiveness of FedE4RAG. The results demonstrate that our proposed framework can markedly enhance the performance of private RAG systems while maintaining robust data privacy protection.
Junhong Liang, Yu Zhou
Chinese Spelling Correction (CSC) aims to detect and correct erroneous tokens
in sentences. While Large Language Models (LLMs) have shown remarkable success
in identifying and rectifying potential errors, they often struggle with
maintaining consistent output lengths and adapting to domain-specific
corrections. Furthermore, existing CSC tasks impose rigid constraints requiring
input and output lengths to be identical, limiting their applicability. In this
work, we extend traditional CSC to variable-length correction scenarios,
including Chinese Splitting Error Correction (CSEC) and ASR N-best Error
Correction. To address domain adaptation and length consistency, we propose the
MTCSC (Multi-Turn CSC) framework, based on RAG enhanced with a length reflection
mechanism. Our approach constructs a retrieval database from domain-specific
training data and dictionaries, fine-tuning retrievers to optimize performance
for error-containing inputs. Additionally, we introduce a multi-source
combination strategy with iterative length reflection to ensure output length
fidelity. Experiments across diverse domain datasets demonstrate that our
method significantly outperforms current approaches in correction quality,
particularly in handling domain-specific and variable-length error correction
tasks.
Authors' comments: 12 pages, 2 figures
Erfan Loweimi, Mengjie Qian, Kate Knill, Mark Gales
There is a growing abundance of publicly available or company-owned
audio/video archives, highlighting the increasing importance of efficient
access to desired content and information retrieval from these archives. This
paper investigates the challenges, solutions, effectiveness, and robustness of
speaker retrieval systems developed "in the wild", which involves addressing two
primary challenges: extraction of task-relevant labels from limited metadata
for system development and evaluation, as well as the unconstrained acoustic
conditions encountered in the archive, ranging from quiet studios to adverse
noisy environments. While we focus on the publicly-available BBC Rewind archive
(spanning 1948 to 1979), our framework addresses the broader issue of speaker
retrieval on extensive and possibly aged archives with no control over the
content and acoustic conditions. Typically, these archives offer a brief and
general file description, mostly inadequate for specific applications like
speaker retrieval, and manual annotation of such large-scale archives is
unfeasible. We explore various aspects of system development (e.g., speaker
diarisation, embedding extraction, query selection) and analyse the challenges,
possible solutions, and their functionality. To evaluate the performance, we
conduct systematic experiments in both clean setup and against various
distortions simulating real-world applications. Our findings demonstrate the
effectiveness and robustness of the developed speaker retrieval systems,
establishing the versatility and scalability of the proposed framework for a
wide range of applications beyond the BBC Rewind corpus.
Authors' comments: 13 pages, 10 figures, 10 tables, 76 references
Jingjin Wang
Retrieval Augmented Generation (RAG) has become the standard non-parametric
approach for equipping Large Language Models (LLMs) with up-to-date knowledge
and mitigating catastrophic forgetting common in continual learning. However,
standard RAG, relying on independent passage retrieval, fails to capture the
interconnected nature of human memory crucial for complex reasoning
(associativity) and contextual understanding (sense-making). While structured
RAG methods like HippoRAG utilize knowledge graphs (KGs) built from triples,
the inherent context loss limits fidelity. We introduce PropRAG, a framework
leveraging contextually rich propositions and a novel beam search algorithm
over proposition paths to explicitly discover multi-step reasoning chains.
Crucially, PropRAG's online retrieval process operates entirely without
invoking generative LLMs, relying instead on efficient graph traversal and
pre-computed embeddings. This avoids online LLM inference costs and potential
inconsistencies during evidence gathering. LLMs are used effectively offline
for high-quality proposition extraction and post-retrieval for answer
generation. PropRAG achieves state-of-the-art zero-shot Recall@5 results on
PopQA (55.3%), 2Wiki (93.7%), HotpotQA (97.0%), and MuSiQue (77.3%), alongside
top F1 scores (e.g., 52.4% on MuSiQue). By improving evidence retrieval through
richer representation and explicit, LLM-free online path finding, PropRAG
advances non-parametric continual learning.
Authors' comments: Code and data to be released at:
https://github.com/ReLink-Inc/PropRAG
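To make the idea of beam search over proposition paths concrete, here is a generic sketch of beam search on a small graph. It is not PropRAG's algorithm: the graph, the per-node relevance scores, the additive path score, and the function name `beam_search_paths` are all assumptions made for illustration.

```python
def beam_search_paths(graph, scores, start_nodes, beam_width=2, max_hops=2):
    """Return the best-scoring paths found by beam search.

    graph maps a node to its neighbours; scores maps a node to a relevance
    score. A path's score is the sum of its node scores (an assumption).
    """
    beam = sorted(((scores[n], [n]) for n in start_nodes),
                  key=lambda c: c[0], reverse=True)[:beam_width]
    for _ in range(max_hops):
        candidates = []
        for total, path in beam:
            extensions = [n for n in graph.get(path[-1], []) if n not in path]
            if not extensions:
                candidates.append((total, path))  # dead end: carry the path over
            for n in extensions:
                candidates.append((total + scores[n], path + [n]))
        # Keep only the beam_width highest-scoring candidate paths.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beam

# Toy proposition graph: p1 links to p2 and p3, and p2 links to p4.
graph = {"p1": ["p2", "p3"], "p2": ["p4"], "p3": [], "p4": []}
scores = {"p1": 0.9, "p2": 0.5, "p3": 0.2, "p4": 0.8}
paths = beam_search_paths(graph, scores, ["p1"])
best_score, best_path = paths[0]
```

In this toy graph the two-hop chain through p2 to p4 outranks the one-hop path to p3. Note that the search uses only graph traversal and precomputed scores, with no generative model call, which mirrors the LLM-free online retrieval the abstract describes.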
Zhen Zhang, Xinyu Ma, Weiwei Sun, Pengjie Ren, Zhumin Chen, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke et al.
Generative retrieval (GR) has emerged as a promising paradigm in information
retrieval (IR). However, most existing GR models are developed and evaluated
using a static document collection, and their performance in dynamic corpora
where document collections evolve continuously is rarely studied. In this
paper, we first reproduce and systematically evaluate various representative GR
approaches over dynamic corpora. Through extensive experiments, we reveal that
existing GR models with text-based docids show superior generalization
to unseen documents. We observe that the more fine-grained the docid design in
the GR model, the better its performance over dynamic corpora, surpassing BM25
and even being comparable to dense retrieval methods. While GR models with
numeric-based docids show high efficiency, their performance drops
significantly over dynamic corpora. Furthermore, our experiments find that the
underperformance of numeric-based docids is partly due to their excessive
tendency toward the initial document set, which likely results from overfitting
on the training set. We then conduct an in-depth analysis of the
best-performing GR methods. We identify three critical advantages of text-based
docids in dynamic corpora: (i) Semantic alignment with language models'
pretrained knowledge, (ii) Fine-grained docid design, and (iii) High lexical
diversity. Building on these insights, we finally propose a novel multi-docid
design that leverages both the efficiency of numeric-based docids and the
effectiveness of text-based docids, achieving improved performance in dynamic
corpora without requiring additional retraining. Our work offers empirical
evidence for advancing GR methods over dynamic corpora and paves the way for
developing more generalized yet efficient GR models in real-world search
engines.
Authors' comments: Accepted at SIGIR 2025 (Proceedings of the 48th International ACM
SIGIR Conference on Research and Development in Information Retrieval)
Yongkang Li, Panagiotis Eustratiadis, Simon Lupart, Evangelos Kanoulas
This paper concerns corpus poisoning attacks in dense information retrieval,
where an adversary attempts to compromise the ranking performance of a search
algorithm by injecting a small number of maliciously generated documents into
the corpus. Our work addresses two limitations in the current literature.
First, attacks that perform adversarial gradient-based word substitution search
do so in the discrete lexical space, while retrieval itself happens in the
continuous embedding space. We thus propose an optimization method that
operates in the embedding space directly. Specifically, we train a perturbation
model with the objective of maintaining the geometric distance between the
original and adversarial document embeddings, while also maximizing the
token-level dissimilarity between the original and adversarial documents.
Second, related work commonly makes the strong assumption that the
adversary has prior knowledge about the queries. In this paper, we focus on a
more challenging variant of the problem where the adversary assumes no prior
knowledge about the query distribution (hence, unsupervised). Our core
contribution is an adversarial corpus attack that is fast and effective. We
present comprehensive experimental results on both in- and out-of-domain
datasets, focusing on two related tasks: a top-1 attack and a corpus poisoning
attack. We consider attacks under both a white-box and a black-box setting.
Notably, our method can generate successful adversarial examples in under two
minutes per target document, four times faster than the fastest
gradient-based word substitution methods in the literature on the same
hardware. Furthermore, our adversarial generation method generates text that is
more likely to occur under the distribution of natural text (low perplexity),
and is therefore more difficult to detect.
Authors' comments: This paper has been accepted as a full paper at SIGIR 2025 and will
be presented orally
Chanhee Park, Hyeonseok Moon, Chanjun Park, Heuiseok Lim
Retrieval-Augmented Generation (RAG) has gained prominence as an effective
method for enhancing the generative capabilities of Large Language Models
(LLMs) through the incorporation of external knowledge. However, the evaluation
of RAG systems remains a challenge, due to the intricate interplay between
retrieval and generation components. This limitation has resulted in a scarcity
of benchmarks that facilitate a detailed, component-specific assessment. In
this work, we present MIRAGE, a Question Answering dataset specifically
designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped
to a retrieval pool of 37,800 entries, enabling an efficient and precise
evaluation of both retrieval and generation tasks. We also introduce novel
evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions
such as noise vulnerability, context acceptability, context insensitivity, and
context misinterpretation. Through comprehensive experiments across various
retriever-LLM configurations, we provide new insights into the optimal
alignment of model pairs and the nuanced dynamics within RAG systems. The
dataset and evaluation code are publicly available, allowing for seamless
integration and customization in diverse research settings. The MIRAGE
code and data are available at https://github.com/nlpai-lab/MIRAGE.
Authors' comments: Accepted to NAACL2025 Findings
Parker Carlson, Wentai Xie, Shanxiu He, Tao Yang
This paper proposes superblock pruning (SP) during top-k online document
retrieval for learned sparse representations. SP structures the sparse index as
a set of superblocks on a sequence of document blocks and conducts a
superblock-level selection to decide if some superblocks can be pruned before
visiting their child blocks. SP generalizes the previous flat block or
cluster-based pruning, allowing the early detection of groups of documents that
cannot or are less likely to appear in the final top-k list. SP can accelerate
sparse retrieval in a rank-safe or approximate manner under a high-relevance
competitiveness constraint. Our experiments show that the proposed scheme
significantly outperforms state-of-the-art baselines on MS MARCO passages on a
single-threaded CPU.
Authors' comments: 6 pages, 3 figures, SIGIR 25
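The two-level pruning idea can be sketched as follows, in the rank-safe setting only. This is a toy illustration rather than the paper's index layout: the dictionary structure, the `bound` fields (assumed to be precomputed upper bounds on any document score inside a block or superblock), and the function name `top_k_with_superblocks` are hypothetical.

```python
import heapq

def top_k_with_superblocks(superblocks, k):
    """Rank-safe top-k retrieval that skips superblocks and blocks by bound.

    Each superblock is {"bound": float, "blocks": [...]}, and each block is
    {"bound": float, "docs": [(doc_id, score), ...]}. A unit is skipped when
    its bound cannot beat the current k-th best score.
    """
    heap = []      # min-heap of (score, doc_id) holding the current top-k
    visited = 0    # number of documents actually scored
    for superblock in superblocks:
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        if superblock["bound"] <= threshold:
            continue  # prune the superblock without visiting its child blocks
        for block in superblock["blocks"]:
            threshold = heap[0][0] if len(heap) == k else float("-inf")
            if block["bound"] <= threshold:
                continue  # prune a single block
            for doc_id, doc_score in block["docs"]:
                visited += 1
                if len(heap) < k:
                    heapq.heappush(heap, (doc_score, doc_id))
                elif doc_score > heap[0][0]:
                    heapq.heapreplace(heap, (doc_score, doc_id))
    return sorted(heap, reverse=True), visited

superblocks = [
    {"bound": 9.0, "blocks": [{"bound": 9.0, "docs": [(1, 9.0), (2, 4.0)]},
                              {"bound": 7.0, "docs": [(3, 7.0), (4, 1.0)]}]},
    {"bound": 5.0, "blocks": [{"bound": 5.0, "docs": [(5, 5.0)]},
                              {"bound": 3.0, "docs": [(6, 3.0)]}]},
    {"bound": 8.0, "blocks": [{"bound": 8.0, "docs": [(7, 8.0)]},
                              {"bound": 2.0, "docs": [(8, 2.0)]}]},
]
results, visited = top_k_with_superblocks(superblocks, k=2)
top_ids = [doc_id for _, doc_id in results]
```

Here the second superblock is skipped as a whole because its bound falls below the running threshold, so only five of the eight documents are scored. An approximate variant would relax the comparison against the threshold, which this sketch does not attempt.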
Xin Jiang, Hao Tang, Yonghua Pan, Zechao Li
Large-scale fine-grained image retrieval (FGIR) aims to retrieve images
belonging to the same subcategory as a given query by capturing subtle
differences in a large-scale setting. Recently, Vision Transformers (ViT) have
been employed in FGIR due to their powerful self-attention mechanism for
modeling long-range dependencies. However, most Transformer-based methods focus
primarily on leveraging self-attention to distinguish fine-grained details,
while overlooking the high computational complexity and redundant dependencies
inherent to these models, limiting their scalability and effectiveness in
large-scale FGIR. In this paper, we propose an Efficient and Effective
ViT-based framework, termed EET, which integrates a token pruning module
with a discriminative transfer strategy to address these limitations.
Specifically, we introduce a content-based token pruning scheme to enhance the
efficiency of the vanilla ViT, progressively removing background or
low-discriminative tokens at different stages by exploiting feature responses
and the self-attention mechanism. To ensure the resulting efficient ViT retains
strong discriminative power, we further present a discriminative transfer
strategy comprising both discriminative knowledge transfer and
discriminative region guidance. Using a distillation paradigm, these
components transfer knowledge from a larger "teacher" ViT to a more efficient
"student" model, guiding the latter to focus on subtle yet crucial regions in
a cost-free manner. Extensive experiments on two widely-used fine-grained
datasets and four large-scale fine-grained datasets demonstrate the
effectiveness of our method. Specifically, EET reduces the inference latency of
ViT-Small by 42.7% and boosts the retrieval performance of 16-bit hash codes
by 5.15% on the challenging NABirds dataset.
Authors' comments: Accepted by IEEE TMM
Francisco Valentini, Diego Kozlowski, Vincent Larivière
Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.
Assaf Gerner, Netta Madvil, Nadav Barak, Alex Zaikman, Jonatan Liberman, Liron Hamra, Rotem Brazilay, Shay Tsadok et al.
Despite advancements in grounded content generation, production applications based on Large Language Models (LLMs) still suffer from hallucinated answers. We present "Grounded in Context" - a member of Deepchecks' ORION (Output Reasoning-based InspectiON) family of lightweight evaluation models. It is our framework for hallucination detection, designed for production-scale long-context data and tailored to diverse use cases, including summarization, data extraction, and RAG. Inspired by RAG architecture, our method integrates retrieval and Natural Language Inference (NLI) models to predict factual consistency between premises and hypotheses using an encoder-based model with only a 512-token context window. Our framework identifies unsupported claims with an F1 score of 0.83 in RAGTruth's response-level classification task, matching methods that were trained on the dataset, and outperforming all comparable frameworks using similar-sized models.
Chunjing Gan, Dan Yang, Binbin Hu, Ziqi Liu, Yue Shen, Zhiqiang Zhang, Jian Wang, Jun Zhou
Large language models (LLMs) have become a disruptive force in the industry, introducing unprecedented capabilities in natural language processing, logical reasoning and so on. However, the challenges of knowledge updates and hallucination issues have limited the application of LLMs in medical scenarios, where retrieval-augmented generation (RAG) can offer significant assistance. Nevertheless, existing retrieve-then-read approaches generally digest the retrieved documents without considering the timeliness, authoritativeness and commonality of retrieval. We argue that these approaches can be suboptimal, especially in real-world applications where information from different sources might conflict and even information from the same source might differ across time, and that relying entirely on the retrieved documents would degrade the performance of RAG approaches. We propose PolyRAG, which carefully incorporates judges from different perspectives and finally integrates the polyviews for retrieval-augmented generation in medical applications. To address the scarcity of real-world benchmarks for evaluation, we propose PolyEVAL, a benchmark consisting of queries and documents collected from real-world medical scenarios (including medical policy, hospital & doctor inquiry and healthcare) with multiple tags (e.g., timeliness, authoritativeness). Extensive experiments and analysis on PolyEVAL have demonstrated the superiority of PolyRAG.
Xinjie Shen, Zhichao Geng, Yang Yang
With increasing demands for efficiency, information retrieval has developed a
branch of sparse retrieval, further advancing towards inference-free retrieval
where the documents are encoded during indexing time and there is no
model inference for queries. Existing sparse retrieval models rely on FLOPS
regularization for sparsification. However, this mechanism was originally designed
for Siamese encoders and is considered suboptimal in inference-free
scenarios, which are asymmetric. Previous attempts to adapt FLOPS for
inference-free scenarios have been limited to rule-based methods, leaving the
potential of sparsification approaches for inference-free retrieval models
largely unexplored. In this paper, we explore an $\ell_0$-inspired sparsification
approach for inference-free retrievers. Through comprehensive out-of-domain
evaluation on the BEIR benchmark, our method achieves state-of-the-art
performance among inference-free sparse retrieval models and is comparable to
leading Siamese sparse retrieval models. Furthermore, we provide insights into
the trade-off between retrieval effectiveness and computational efficiency,
demonstrating practical value for real-world applications.
Authors' comments: Accepted by SIGIR 2025
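As background for the FLOPS regularizer mentioned above, the sketch below computes it in its commonly used form: the sum over vocabulary terms of the squared mean activation across a batch. It also computes the plain $\ell_0$ norm of a single vector for comparison. This does not reproduce the paper's $\ell_0$-inspired method; the dense-list representation and the function names `flops_regularizer` and `l0_norm` are illustrative assumptions.

```python
def flops_regularizer(batch):
    """FLOPS regularizer: sum over terms of the squared mean weight in the batch.

    batch is a list of equal-length vectors of non-negative term weights.
    """
    n = len(batch)
    vocab_size = len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(vocab_size))

def l0_norm(vec):
    """Number of non-zero entries, i.e. how many terms a vector activates."""
    return sum(1 for w in vec if w != 0.0)

# Two toy document vectors over a three-term vocabulary.
batch = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
flops = flops_regularizer(batch)         # mean weights are 2.0, 0.0 and 1.0
active_terms = [l0_norm(vec) for vec in batch]
```

The FLOPS term penalizes vocabulary entries that are active on average across the batch, whereas the $\ell_0$ norm counts active terms per vector regardless of their magnitude.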
Patrícia Pereira, Anders Lansner, Pawel Herman
The human lifespan retrieval curve describes the proportion of recalled
memories from each year of life. It exhibits a reminiscence bump - a tendency
for aged people to better recall memories formed during their young adulthood
than from other periods of life. We have modelled this using an attractor
Bayesian Confidence Propagation Neural Network (BCPNN) with incremental
learning. We systematically studied the synaptic mechanisms underlying the
reminiscence bump in this network model after introduction of an exponential
decay of the synaptic learning rate and examined its sensitivity to network
size and other relevant modelling mechanisms. The most influential parameters
turned out to be the synaptic learning rate at birth and the time constant of
its exponential decay with age, which set the bump position in the lifespan
retrieval curve. The other parameters mainly influenced the general magnitude
of this curve. Furthermore, we introduced the parametrization of the recency
phenomenon - the tendency to better remember the most recent memories -
reflected in the curve's upwards tail in the later years of the lifespan. Such
recency was achieved by adding a constant baseline component to the
exponentially decaying synaptic learning rate.
Authors' comments: IJCNN 2022
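The learning-rate schedule described above, an exponential decay with age plus an optional constant baseline, can be written down directly. This is a sketch of the functional form only: the parameter names and the numeric values are hypothetical, and the BCPNN itself is not modelled.

```python
import math

def learning_rate(age, rate_at_birth, time_constant, baseline=0.0):
    """Exponentially decaying learning rate with an optional constant baseline."""
    return rate_at_birth * math.exp(-age / time_constant) + baseline

# Hypothetical values: the rate starts at 1.0 and decays with a 10-year constant.
ages = [0, 10, 40, 80]
without_baseline = [learning_rate(a, 1.0, 10.0) for a in ages]
with_baseline = [learning_rate(a, 1.0, 10.0, baseline=0.1) for a in ages]
```

Without the baseline the rate tends to zero at old age, while with it the rate levels off at the baseline value, the component the abstract associates with recency.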
Muhammad Rafsan Kabir, Rafeed Mohammad Sultan, Fuad Rahman, Mohammad Ruhul Amin, Sifat Momen, Nabeel Mohammed, Shafin Rahman
Natural Language Processing (NLP) and computational linguistic techniques are
increasingly being applied across various domains, yet their use in legal and
regulatory tasks remains limited. To address this gap, we develop an efficient
bilingual question-answering framework for regulatory documents, specifically
the Bangladesh Police Gazettes, which contain both English and Bangla text. Our
approach employs modern Retrieval Augmented Generation (RAG) pipelines to
enhance information retrieval and response generation. In addition to
conventional RAG pipelines, we propose an advanced RAG-based approach that
improves retrieval performance, leading to more precise answers. This system
enables efficient searching for specific government legal notices, making legal
information more accessible. We evaluate both our proposed and conventional RAG
systems on a diverse test set of Bangladesh Police Gazettes, demonstrating that
our approach consistently outperforms existing methods across all evaluation
metrics.
Authors' comments: Accepted at IJCNN 2025