Mohsen Dehghankar, Raghav Mittal, Suraj Shetiya, Abolfazl Asudeh, Gautam Das
We study the problem of Direct-Access Ranked Retrieval (DAR) for interactive data tooling, where evolving data exploration practices, combined with large-scale and high-dimensional datasets, create new challenges. DAR concerns the problem of enabling efficient access to arbitrary rank positions according to a ranking function, without enumerating all preceding tuples. To address this need, we formalize the DAR problem and propose a theoretically efficient algorithm based on geometric arrangements, achieving logarithmic query time. However, this method suffers from exponential space complexity in high dimensions. Therefore, we develop a second class of algorithms based on $\varepsilon$-sampling, which consume linear space. Since exactly locating the tuple at a specific rank is challenging due to its connection to the range counting problem, we introduce a relaxed variant called Conformal Set Ranked Retrieval (CSR), which returns a small subset guaranteed to contain the target tuple. To solve the CSR problem efficiently, we define an intermediate problem, Stripe Range Retrieval (SRR), and design a hierarchical sampling data structure tailored for narrow-range queries. Our method achieves practical scalability in both data size and dimensionality. We prove near-optimal bounds on the efficiency of our algorithms and validate their performance through extensive experiments on real and synthetic datasets, demonstrating scalability to millions of tuples and hundreds of dimensions.
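To make the DAR problem statement concrete, the following is a brute-force reference sketch, not the arrangement-based or sampling-based algorithms the abstract describes. It assumes a linear ranking function (a weight vector dotted with each tuple) and 1-indexed ranks with the highest score first; both are illustrative assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def score(weights: Sequence[float], tup: Sequence[float]) -> float:
    """Linear ranking function (an assumption made for this sketch)."""
    return sum(w * x for w, x in zip(weights, tup))


def direct_access_bruteforce(
    data: List[Tuple[float, ...]], weights: Sequence[float], k: int
) -> Tuple[float, ...]:
    """Return the tuple at rank k (1-indexed, highest score first).

    This sorts the whole dataset, i.e. it enumerates every preceding
    tuple, which is exactly the cost a DAR index is meant to avoid.
    """
    if not 1 <= k <= len(data):
        raise IndexError("rank k is out of range")
    ranked = sorted(data, key=lambda t: score(weights, t), reverse=True)
    return ranked[k - 1]
```

A baseline of this kind costs O(n log n) per query, which is what motivates an index that answers rank queries without sorting.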
Pengcheng Zhou, Yinglun Feng, Zhongliang Yang
Although Retrieval-Augmented Generation (RAG) systems have been widely applied, the privacy and security risks they face, such as data leakage and data poisoning, have not yet been systematically addressed. Existing defense strategies primarily rely on heuristic filtering or enhancing retriever robustness, which suffer from limited interpretability, lack of formal security guarantees, and vulnerability to adaptive attacks. To address these challenges, this paper proposes the first provably secure framework for RAG systems (SAG). Our framework employs a pre-storage full-encryption scheme to ensure dual protection of both retrieved content and vector embeddings, guaranteeing that only authorized entities can access the data. Through formal security proofs, we rigorously verify the scheme's confidentiality and integrity under a computational security model. Extensive experiments across multiple benchmark datasets demonstrate that our framework effectively resists a range of state-of-the-art attacks. This work establishes a theoretical foundation and practical paradigm for verifiably secure RAG systems, advancing AI-powered services toward formally guaranteed security.
Thi Thu Uyen Hoang, Viet Anh Nguyen
This paper presents an advancement in Question-Answering (QA) systems using a Retrieval Augmented Generation (RAG) framework to enhance information extraction from PDF files. The richness and diversity of data within PDFs--including text, images, vector diagrams, graphs, and tables--pose unique challenges for existing QA systems, which are primarily designed for textual content. We seek to develop a comprehensive RAG-based QA system that will effectively address complex multimodal questions, where several data types are combined in the query. This is mainly achieved by refining approaches to processing and integrating non-textual elements in PDFs into the RAG framework to derive precise and relevant answers, as well as fine-tuning large language models to better adapt to our system. We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs. This work not only pushes the boundaries of retrieval-augmented QA systems but also lays a foundation for further research in multimodal data integration and processing.
Yunhao Shui, Xuekuan Wang, Feng Qiu, Yuqiu Huang, Jinzhu Li, Haoyu Zheng, Jinru Han, Zhuo Zeng et al.
We present RaCig, a novel system for generating comic-style image sequences with consistent characters and expressive gestures. RaCig addresses two key challenges: (1) maintaining character identity and costume consistency across frames, and (2) producing diverse and vivid character gestures. Our approach integrates a retrieval-based character assignment module, which aligns characters in textual prompts with reference images, and a regional character injection mechanism that embeds character features into specified image regions. Experimental results demonstrate that RaCig effectively generates engaging comic narratives with coherent characters and dynamic interactions. The source code will be publicly available to support further research in this area.
Emma J. Gerritse, Faegheh Hasibi, Arjen P. de Vries
In this research, we investigate methods for entity retrieval using graph
embeddings. While various methods have been proposed over the years, most
utilize a single graph embedding and entity linking approach. This hinders our
understanding of how different graph embedding and entity linking methods
impact entity retrieval. To address this gap, we investigate the effects of
three different categories of graph embedding techniques and five different
entity linking methods. We perform a reranking of entities using the distance
between the embeddings of annotated entities and the entities we wish to
rerank. We conclude that the selection of both graph embeddings and entity
linkers significantly impacts the effectiveness of entity retrieval. For graph
embeddings, methods that incorporate both graph structure and textual
descriptions of entities are the most effective. For entity linking, both
precision and recall concerning concepts are important for optimal retrieval
performance. Additionally, it is essential for the graph to encompass as many
entities as possible.
Authors' comments: arXiv admin note: text overlap with arXiv:2005.02843
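The reranking step described in the abstract above (ordering candidate entities by their embedding distance to the entities annotated in the query) can be sketched as follows. Cosine distance and averaging over the linked entities are assumptions made for this sketch; the abstract does not specify the distance or the aggregation, and all names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rerank(
    candidates: List[str],
    linked_entities: List[str],
    embeddings: Dict[str, Sequence[float]],
) -> List[str]:
    """Order candidates by mean distance to the query's linked entities."""
    if not linked_entities:
        return list(candidates)  # nothing to compare against

    def mean_distance(candidate: str) -> float:
        dists = [
            cosine_distance(embeddings[candidate], embeddings[e])
            for e in linked_entities
        ]
        return sum(dists) / len(dists)

    return sorted(candidates, key=mean_distance)
```

In this form the quality of the result depends directly on the entity linker (which entities end up in `linked_entities`) and on graph coverage (whether every candidate has an embedding), which matches the two factors the abstract highlights.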
Eric Xing, Abby Stylianou, Robert Pless, Nathan Jacobs
Massive-scale pretraining has made vision-language models increasingly
popular for image-to-image and text-to-image retrieval across a broad
collection of domains. However, these models do not perform well when used for
challenging retrieval tasks, such as instance retrieval in very large-scale
image collections. Recent work has shown that linear transformations of VLM
features trained for instance retrieval can improve performance by emphasizing
subspaces that relate to the domain of interest. In this paper, we explore a
more extreme version of this specialization by learning to map a given query to
a query-specific feature space transformation. Because this transformation is
linear, it can be applied with minimal computational cost to millions of image
embeddings, making it effective for large-scale retrieval or re-ranking.
Results show that this method consistently outperforms state-of-the-art
alternatives, including those that require many orders of magnitude more
computation at query time.
Authors' comments: 13 pages, 4 figures, 4 tables
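The idea of a query-specific linear transformation applied to all gallery embeddings can be illustrated with a toy sketch. The mapping from query to matrix below (a diagonal matrix built from the query's absolute values) is a hypothetical stand-in for the learned mapping the abstract refers to, and the function names are illustrative only.

```python
from typing import List, Sequence


def transform_for_query(query: Sequence[float]) -> List[List[float]]:
    """Hypothetical stand-in for a learned query-to-matrix mapping.

    Builds a diagonal matrix that emphasizes the dimensions in which
    the query has large magnitude.
    """
    d = len(query)
    return [[abs(query[i]) if i == j else 0.0 for j in range(d)] for i in range(d)]


def apply(matrix: List[List[float]], vec: Sequence[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query: Sequence[float], gallery: List[Sequence[float]], top_k: int) -> List[int]:
    """Score every gallery embedding in the query-specific feature space."""
    w = transform_for_query(query)
    transformed_query = apply(w, query)
    scored = [(dot(transformed_query, apply(w, g)), idx) for idx, g in enumerate(gallery)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, stable on ties
    return [idx for _, idx in scored[:top_k]]
```

Because the transformation is linear, it costs one matrix-vector product per gallery item, which is why it remains cheap at the scale of millions of embeddings.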
Haocheng Ju, Bin Dong
Mathematical Information Retrieval (MIR) is the task of retrieving
information from mathematical documents and plays a key role in various
applications, including theorem search in mathematical libraries, answer
retrieval on math forums, and premise selection in automated theorem proving.
However, a unified benchmark for evaluating these diverse retrieval tasks has
been lacking. In this paper, we introduce MIRB (Mathematical Information
Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB
includes four tasks: semantic statement retrieval, question-answer retrieval,
premise retrieval, and formula retrieval, spanning a total of 12 datasets. We
evaluate 13 retrieval models on this benchmark and analyze the challenges
inherent to MIR. We hope that MIRB provides a comprehensive framework for
evaluating MIR systems and helps advance the development of more effective
retrieval models tailored to the mathematical domain.
Authors' comments: Our code and data are available at https://github.com/j991222/mirb
and https://huggingface.co/collections/hcju/mirb-6827001711765454f58c5a76
Sungwon Han, Seungeon Lee, Meeyoung Cha, Sercan O Arik, Jinsung Yoon
Time series forecasting uses historical data to predict future trends, leveraging the relationships between past observations and available features. In this paper, we propose RAFT, a retrieval-augmented time series forecasting method to provide sufficient inductive biases and complement the model's learning capacity. When forecasting the subsequent time frames, we directly retrieve historical data candidates from the training dataset with patterns most similar to the input, and utilize the future values of these candidates alongside the inputs to obtain predictions. This simple approach augments the model's capacity by externally providing information about past patterns via retrieval modules. Our empirical evaluations on ten benchmark datasets show that RAFT consistently outperforms contemporary baselines with an average win ratio of 86%.
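The retrieval step described above (find historical windows similar to the input and reuse the values that followed them) can be sketched as below. This is a simplified illustration, not the RAFT model: it assumes squared Euclidean distance as the similarity measure and simply averages the retrieved futures, whereas the abstract says the retrieved values are used alongside the inputs to obtain predictions.

```python
from typing import List, Sequence


def retrieve_futures(
    history: Sequence[float], query: Sequence[float], horizon: int, top_k: int
) -> List[List[float]]:
    """Return the futures following the top_k windows most similar to query."""
    length = len(query)
    candidates = []
    for start in range(len(history) - length - horizon + 1):
        window = history[start:start + length]
        future = list(history[start + length:start + length + horizon])
        # Squared Euclidean distance is an assumption made for this sketch.
        dist = sum((a - b) ** 2 for a, b in zip(window, query))
        candidates.append((dist, start, future))
    candidates.sort(key=lambda c: (c[0], c[1]))
    return [future for _, _, future in candidates[:top_k]]


def forecast(
    history: Sequence[float], query: Sequence[float], horizon: int, top_k: int = 2
) -> List[float]:
    """Naive forecast: average the retrieved futures step by step."""
    futures = retrieve_futures(history, query, horizon, top_k)
    if not futures:
        raise ValueError("history is too short for this window and horizon")
    return [sum(f[h] for f in futures) / len(futures) for h in range(horizon)]
```

On a perfectly periodic history such as `[1, 2, 3, 4]` repeated three times, the query `[1, 2]` with `horizon=2` retrieves futures `[3, 4]` and the averaged forecast is `[3.0, 4.0]`.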
Hyunseo Shin, Wonseok Hwang
Recent advancements in large language models (LLMs) have significantly enhanced the performance of conversational AI systems. To extend their capabilities to knowledge-intensive domains such as biomedical and legal fields, where accuracy is critical, LLMs are often combined with information retrieval (IR) systems to generate responses based on retrieved documents. However, for IR systems to effectively support such applications, they must go beyond simple semantic matching and accurately capture diverse query intents, including causal relationships. Existing IR models primarily focus on retrieving documents based on surface-level semantic similarity, overlooking deeper relational structures such as causality. To address this, we propose CAWAI, a retrieval model that is trained with dual objectives: semantic and causal relations. Our extensive experiments demonstrate that CAWAI outperforms various models on diverse causal retrieval tasks, especially under large-scale retrieval settings. We also show that CAWAI exhibits strong zero-shot generalization across scientific domain QA tasks.
Aashiq Muhamed, Mona Diab, Virginia Smith
Retrieval-Augmented Generation (RAG) models excel in knowledge-intensive
tasks, especially under few-shot learning constraints. We introduce CoRAG, a
framework extending RAG to collaborative settings, where clients jointly train
a shared model using a collaborative passage store. To evaluate CoRAG, we
introduce CRAB, a benchmark for collaborative homogeneous open-domain question
answering. Our experiments demonstrate that CoRAG consistently outperforms both
parametric collaborative learning methods and locally trained RAG models in
low-resource scenarios. Further analysis reveals the critical importance of
relevant passages within the shared store, the surprising benefits of
incorporating irrelevant passages, and the potential for hard negatives to
negatively impact performance. This introduces a novel consideration in
collaborative RAG: the trade-off between leveraging a collectively enriched
knowledge base and the potential risk of incorporating detrimental passages
from other clients. Our findings underscore the viability of CoRAG, while also
highlighting key design challenges and promising avenues for future research.
Authors' comments: NAACL 2024
Maarten de Rijke, Bart van den Hurk, Flora Salim, Alaa Al Khourdajie, Nan Bai, Renato Calzone, Declan Curran, Getnet Demil et al.
The purpose of the MANILA24 Workshop on information retrieval for climate
impact was to bring together researchers from academia, industry, governments,
and NGOs to identify and discuss core research problems in information
retrieval to assess climate change impacts. The workshop aimed to foster
collaboration by bringing communities together that have so far not been very
well connected -- information retrieval, natural language processing,
systematic reviews, impact assessments, and climate science. The workshop
brought together a diverse set of researchers and practitioners interested in
contributing to the development of a technical research agenda for information
retrieval to assess climate change impacts.
Authors' comments: Report on the MANILA24 Workshop
Zhengren Wang, Jiayang Yu, Dongsheng Ma, Zhe Chen, Yu Wang, Zhiyu Li, Feiyu Xiong, Yanfeng Wang et al.
Domain-specific intelligence demands specialized knowledge and sophisticated
reasoning for problem-solving, posing significant challenges for large language
models (LLMs) that struggle with knowledge hallucination and inadequate
reasoning capabilities under constrained parameter budgets. Inspired by Bloom's
Taxonomy in educational theory, we propose Retrieval-Augmented Reasoning
Modeling (RARE), a novel paradigm that decouples knowledge storage from
reasoning optimization. RARE externalizes domain knowledge to retrievable
sources and internalizes domain-specific reasoning patterns during training.
Specifically, by injecting retrieved knowledge into training prompts, RARE
transforms learning objectives from rote memorization to contextualized
reasoning application. It enables models to bypass parameter-intensive
memorization and prioritize the development of higher-order cognitive
processes. Our experiments demonstrate that lightweight RARE-trained models
(e.g., Llama-3.1-8B) could achieve state-of-the-art performance, surpassing
retrieval-augmented GPT-4 and Deepseek-R1 distilled counterparts. RARE
establishes a paradigm shift where maintainable external knowledge bases
synergize with compact, reasoning-optimized models, collectively driving more
scalable domain-specific intelligence. Repo:
https://github.com/Open-DataFlow/RARE
Authors' comments: Work in progress
Guofeng Quan, Wenfeng Feng, Chuzhan Hao, Guochao Jiang, Yuewei Zhang, Hao Wang
Speculative decoding accelerates inference in large language models (LLMs) by generating draft tokens for target model verification. Current approaches for obtaining draft tokens rely on lightweight draft models or additional model structures to generate draft tokens and retrieve context from databases. Due to the draft model's small size and limited training data, model-based speculative decoding frequently becomes less effective in out-of-domain scenarios. Additionally, the time cost of the drafting phase results in a low upper limit on acceptance length during the verification step, limiting overall efficiency. This paper proposes RASD (Retrieval-Augmented Speculative Decoding), which adopts retrieval methods to enhance model-based speculative decoding. We introduce tree pruning and tree fusion to achieve this. First, we develop a pruning method based on the draft model's probability distribution to construct the optimal retrieval tree. Second, we employ the longest prefix matching algorithm to merge the tree generated by the draft model with the retrieval tree, resulting in a unified tree for verification. Experimental results demonstrate that RASD achieves state-of-the-art inference acceleration across tasks such as DocQA, Summary, Code, and In-Domain QA. Moreover, RASD exhibits strong scalability, seamlessly integrating with various speculative decoding approaches, including both generation-based and retrieval-based methods.
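The tree fusion step can be illustrated with token tries. The sketch below merges a draft tree and a retrieval tree so that any shared prefix is stored once, which is one simple way to realize prefix-based merging; it is not the RASD implementation, it omits the probability-based pruning step, and the nested-dict representation and function names are assumptions.

```python
from typing import Dict, List

Trie = Dict[str, "Trie"]


def build_trie(sequences: List[List[str]]) -> Trie:
    """Build a token trie in which each path from the root is a candidate continuation."""
    root: Trie = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


def fuse(draft_tree: Trie, retrieval_tree: Trie) -> Trie:
    """Return a new trie containing every path of both inputs.

    Paths that share a prefix follow the same nodes for the length of
    that prefix, so the verifier sees each shared token only once.
    The inputs are not modified.
    """
    merged: Trie = {tok: fuse(child, {}) for tok, child in draft_tree.items()}
    for tok, child in retrieval_tree.items():
        merged[tok] = fuse(merged.get(tok, {}), child)
    return merged


def count_nodes(tree: Trie) -> int:
    """Number of tokens in the tree, i.e. positions the target model must verify."""
    return sum(1 + count_nodes(child) for child in tree.values())
```

For example, fusing a draft tree built from `the cat sat` and `the dog` with a retrieval tree built from `the cat ran` and `a bird` yields 7 nodes rather than the 10 tokens across the four sequences, because `the` and `the cat` are shared.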
Ashish Singh, Priti Mohapatra
Retrieving the right level of context for a given query is a perennial
challenge in information retrieval: too large a chunk dilutes semantic
specificity, while chunks that are too small lack broader context. This paper
introduces the Hierarchical Re-ranker Retriever (HRR), a framework designed to
achieve both fine-grained and high-level context retrieval for large language
model (LLM) applications. In HRR, documents are split into sentence-level and
intermediate-level (512 tokens) chunks to maximize vector-search quality for
both short and broad queries. We then employ a reranker that operates on these
512-token chunks, striking a balance that is neither too coarse nor too fine
for robust relevance scoring. Finally, top-ranked intermediate chunks are
mapped to parent chunks (2048 tokens) to provide an LLM with sufficiently large
context.
Authors' comments: 14 pages
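The chunk hierarchy in the abstract above (intermediate chunks that are reranked, then mapped to larger parent chunks) can be sketched as follows. This is a simplified illustration: it omits the sentence-level index, and a token-overlap count stands in for the reranker, which in the described system would be a learned model. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def chunk(tokens: Sequence[str], size: int) -> List[List[str]]:
    """Split tokens into consecutive chunks of at most `size` tokens."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def build_hierarchy(
    tokens: Sequence[str], mid_size: int = 512, parent_size: int = 2048
) -> Tuple[List[List[str]], List[Tuple[List[str], int]]]:
    """Return parent chunks and (intermediate chunk, parent index) pairs."""
    parents = chunk(tokens, parent_size)
    mids = []
    for parent_idx, parent in enumerate(parents):
        for mid in chunk(parent, mid_size):
            mids.append((mid, parent_idx))
    return parents, mids


def overlap_score(query_tokens: Sequence[str], chunk_tokens: Sequence[str]) -> int:
    """Toy relevance score (a stand-in for a learned reranker)."""
    return len(set(query_tokens) & set(chunk_tokens))


def retrieve_parents(
    query_tokens: Sequence[str],
    parents: List[List[str]],
    mids: List[Tuple[List[str], int]],
    top_k: int = 1,
) -> List[List[str]]:
    """Rank intermediate chunks, then return their distinct parents in rank order."""
    ranked = sorted(
        range(len(mids)),
        key=lambda i: (-overlap_score(query_tokens, mids[i][0]), i),
    )
    chosen: List[int] = []
    for i in ranked:
        parent_idx = mids[i][1]
        if parent_idx not in chosen:
            chosen.append(parent_idx)
        if len(chosen) == top_k:
            break
    return [parents[p] for p in chosen]
```

Scoring at the intermediate level and returning the parent separates the unit that is matched from the unit that is handed to the LLM, which is the design point the abstract makes.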
Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu et al.
Current multimodal information retrieval studies mainly focus on single-image
inputs, which limits real-world applications involving multiple images and
text-image interleaved content. In this work, we introduce the text-image
interleaved retrieval (TIIR) task, where the query and document are interleaved
text-image sequences, and the model is required to understand the semantics
from the interleaved context for effective retrieval. We construct a TIIR
benchmark based on naturally interleaved wikiHow tutorials, where a specific
pipeline is designed to generate interleaved queries. To explore the task, we
adapt several off-the-shelf retrievers and build a dense baseline using an
interleaved multimodal large language model (MLLM). We then propose a novel
Matryoshka Multimodal Embedder (MME), which compresses the number of visual
tokens at different granularity, to address the challenge of excessive visual
tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation
of existing models does not consistently yield effective results. Our MME
achieves significant improvements over the baseline with substantially fewer
visual tokens. We provide extensive analysis and will release the dataset and
code to facilitate future research.
Authors' comments: 16 pages, 14 figures
Julian Killingback, Hansi Zeng, Hamed Zamani
Existing information retrieval systems are largely constrained by their reliance on vector inner products to assess query-document relevance, which naturally limits the expressiveness of the relevance score they can produce. We propose a new paradigm: instead of representing a query as a vector, we use a small neural network that acts as a learned query-specific relevance function. This small neural network takes a document representation as input (in this work we use a single vector) and produces a scalar relevance score. To produce the small neural network we use a hypernetwork, a network that produces the weights of other networks, as our query encoder. We name this category of encoder models Hypencoders. Experiments on in-domain search tasks show that Hypencoders significantly outperform strong dense retrieval models and even surpass reranking models and retrieval models with an order of magnitude more parameters. To assess the extent of Hypencoders' capabilities, we evaluate on a set of hard retrieval tasks including tip-of-the-tongue and instruction-following retrieval tasks. On harder tasks, we find that the performance gap widens substantially compared to standard retrieval tasks. Furthermore, to demonstrate the practicality of our method, we implement an approximate search algorithm and show that our model is able to retrieve from a corpus of 8.8M documents in under 60 milliseconds.
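The core idea above, a query encoder that outputs the weights of a small scoring network instead of a query vector, can be shown with a toy sketch. The `hypernetwork` function here is a hand-written, hypothetical stand-in for a trained hypernetwork, and the one-hidden-layer ReLU network is an assumed architecture chosen only to keep the example short.

```python
from typing import List, Sequence, Tuple

Params = Tuple[List[List[float]], List[float], List[float]]


def hypernetwork(query_vec: Sequence[float], hidden: int = 2) -> Params:
    """Toy stand-in for a trained hypernetwork.

    Produces the weights of a one-hidden-layer network from the query:
    hidden unit h uses the query vector scaled by (h + 1).
    """
    w1 = [[(h + 1) * q for q in query_vec] for h in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [1.0] * hidden
    return w1, b1, w2


def q_net(params: Params, doc_vec: Sequence[float]) -> float:
    """Query-specific relevance function: document vector in, scalar score out."""
    w1, b1, w2 = params
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, doc_vec)) + b)  # ReLU
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden))


def search(query_vec: Sequence[float], docs: List[Sequence[float]], top_k: int) -> List[int]:
    """Generate the scoring network once per query, then score every document."""
    params = hypernetwork(query_vec)
    scored = [(q_net(params, d), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:top_k]]
```

The nonlinearity is what distinguishes this from inner-product scoring: with the ReLU removed and a single hidden unit, `q_net` would collapse back to a dot product between a query-derived vector and the document vector.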
Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei
This paper introduces an approach for training o1-like RAG models that
retrieve and reason over relevant information step by step before generating
the final answer. Conventional RAG methods usually perform a single retrieval
step before the generation process, which limits their effectiveness in
addressing complex queries due to imperfect retrieval results. In contrast, our
proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the
model to dynamically reformulate the query based on the evolving state. To
train CoRAG effectively, we utilize rejection sampling to automatically
generate intermediate retrieval chains, thereby augmenting existing RAG
datasets that only provide the correct final answer. At test time, we propose
various decoding strategies to scale the model's test-time compute by
controlling the length and number of sampled retrieval chains. Experimental
results across multiple benchmarks validate the efficacy of CoRAG, particularly
in multi-hop question answering tasks, where we observe more than 10 points
improvement in EM score compared to strong baselines. On the KILT benchmark,
CoRAG establishes a new state-of-the-art performance across a diverse range of
knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to
understand the scaling behavior of CoRAG, laying the groundwork for future
research aimed at developing factual and grounded foundation models.
Authors' comments: 18 pages
Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Shihao Liu, Shuaiqing Wang, Dawei Yin, Xueqi Cheng
In book search, relevant book information should be returned in response to a
query. Books contain complex, multi-faceted information such as metadata,
outlines, and main text, where the outline provides hierarchical information
between chapters and sections. Generative retrieval (GR) is a new retrieval
paradigm that consolidates corpus information into a single model to generate
identifiers of documents that are relevant to a given query. How can GR be
applied to book search? Directly applying GR to book search is a challenge due
to the unique characteristics of book search: The model needs to retain the
complex, multi-faceted information of the book, which increases the demand for
labeled data. Splitting book information and treating it as a collection of
separate segments for learning might result in a loss of hierarchical
information. We propose an effective Generative retrieval framework for Book
Search (GBS) that features two main components: data augmentation and
outline-oriented book encoding. For data augmentation, GBS constructs multiple
query-book pairs for training; it constructs multiple book identifiers based on
the outline and various forms of book content, and simulates real book retrieval
scenarios with varied pseudo-queries. This includes coverage-promoting book
identifier augmentation, allowing the model to learn to index effectively, and
diversity-enhanced query augmentation, allowing the model to learn to retrieve
effectively. Outline-oriented book encoding improves length extrapolation
through bi-level positional encoding and retentive attention mechanisms to
maintain context over long sequences. Experiments on a proprietary Baidu
dataset demonstrate that GBS outperforms strong baselines, achieving a 9.8%
improvement in terms of MRR@20, over the state-of-the-art RIPOR method...
Authors' comments: Accepted at KDD ADS 2025
Xinping Zhao, Baotian Hu, Yan Zhong, Shouzheng Huang, Zihao Zheng, Meng Wang, Haofen Wang, Min Zhang
Although prevailing supervised and self-supervised learning (SSL)-augmented
sequential recommendation (SeRec) models have achieved improved performance
with powerful neural network architectures, we argue that they still suffer
from two limitations: (1) Preference Drift, where models trained on past data
can hardly accommodate evolving user preferences; and (2) Implicit Memory, where
head patterns dominate parametric learning, making it harder to recall long
tails. In this work, we explore retrieval augmentation in SeRec to address
these limitations. To this end, we propose a Retrieval-Augmented Sequential
Recommendation framework, named RaSeRec, the main idea of which is to maintain
a dynamic memory bank to accommodate preference drifts and retrieve relevant
memories to augment user modeling explicitly. It consists of two stages: (i)
collaborative-based pre-training, which learns to recommend and retrieve; (ii)
retrieval-augmented fine-tuning, which learns to leverage retrieved memories.
Extensive experiments on three datasets fully demonstrate the superiority and
effectiveness of RaSeRec.
Authors' comments: 20 pages, 8 figures, 8 tables
Kutay Tire, Ege Onur Taga, Muhammed Emrullah Ildiz, Samet Oymak
Retrieval-augmented generation (RAG) is a central component of modern LLM systems, particularly in scenarios where up-to-date information is crucial for accurately responding to user queries or when queries exceed the scope of the training data. The advent of time-series foundation models (TSFM), such as Chronos, and the need for effective zero-shot forecasting performance across various time-series domains motivate the question: Do the benefits of RAG similarly carry over to time series forecasting? In this paper, we advocate that the dynamic and event-driven nature of time-series data makes RAG a crucial component of TSFMs and introduce a principled RAG framework for time-series forecasting, called Retrieval Augmented Forecasting (RAF). Within RAF, we develop efficient strategies for retrieving related time-series examples and incorporating them into the forecast. Through experiments and mechanistic studies, we demonstrate that RAF indeed improves the forecasting accuracy across diverse time series domains and the improvement is more significant for larger TSFM sizes.