Minghan Li, Eric Gaussier, Guodong Zhou
In recent years, large language models (LLMs) have demonstrated exceptional power in various domains, including information retrieval. Most previous practices leverage these models to create a single embedding for each query, passage, or document, a strategy exemplified by the Retrieval-Augmented Generation (RAG) framework. While this method has proven effective, we argue that it falls short of fully capturing the nuances of document-level texts because it relies on a relatively coarse-grained representation. To address this limitation, we introduce a novel, fine-grained approach aimed at enhancing the accuracy of relevance scoring for long documents. Our methodology first segments a long document into blocks, each of which is embedded using an LLM, for matching with the query representation. When calculating the relevance score, we aggregate the query-block relevance scores through a weighted sum, yielding a comprehensive score for the query with the entire document. Despite its apparent simplicity, our experimental findings reveal that this approach outperforms standard representation methods and achieves a significant reduction in embedding generation latency. Moreover, by carefully optimizing pairwise loss functions, we achieve superior performance.
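The block-level scoring described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of cosine similarity as the query-block score, the uniform default weights, and the names `cosine` and `document_score` are all assumptions, and the vectors stand in for LLM embeddings that are not computed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def document_score(query_vec, block_vecs, weights=None):
    """Aggregate query-block scores into one document score by a weighted sum.

    `weights` is hypothetical: the abstract does not say how weights are chosen,
    so uniform weights are used when none are given.
    """
    scores = [cosine(query_vec, block) for block in block_vecs]
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))


# Toy embeddings: the first block matches the query, the second is orthogonal.
query = [1.0, 0.0]
blocks = [[1.0, 0.0], [0.0, 1.0]]
score = document_score(query, blocks, weights=[0.75, 0.25])
```

With these toy vectors the per-block scores are 1 and 0, so the weighted sum is simply the weight placed on the matching block.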
Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang et al.
The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attack tasks by manipulating knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct a RAG security evaluation dataset (i.e., the SafeRAG dataset), primarily manually, for each task. We then utilize the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks, and even the most apparent attack task can easily bypass existing retrievers, filters, or advanced LLMs, resulting in the degradation of RAG service quality. Code is available at: https://github.com/IAAR-Shanghai/SafeRAG.
Yushi Guan, Daniel Kwan, Ruofan Liang, Selvakumar Panneer, Nilesh Jain, Nilesh Ahuja, Nandita Vijaykumar
Implicit neural representations (INRs) have become an important method for
encoding various data types, such as 3D objects or scenes, images, and videos.
They have proven to be particularly effective at representing 3D content, e.g.,
3D scene reconstruction from 2D images, novel 3D content creation, as well as
the representation, interpolation, and completion of 3D shapes. With the
widespread generation of 3D data in an INR format, there is a need to support
effective organization and retrieval of INRs saved in a data store. A key
aspect of retrieval and clustering of INRs in a data store is the formulation
of similarity between INRs that would, for example, enable retrieval of similar
INRs using a query INR. In this work, we propose INRet, a method for
determining similarity between INRs that represent shapes, thus enabling
accurate retrieval of similar shape INRs from an INR data store. INRet flexibly
supports different INR architectures such as INRs with octree grids, triplanes,
and hash grids, as well as different implicit functions including
signed/unsigned distance function and occupancy field. We demonstrate that our
method is more general and accurate than the existing INR retrieval method,
which only supports simple MLP INRs and requires the same architecture between
the query and stored INRs. Furthermore, compared to converting INRs to other
representations (e.g., point clouds or multi-view images) for 3D shape
retrieval, INRet achieves higher accuracy while avoiding the conversion
overhead.
Authors' comments: 3DV 2025
Chunyu Sun, Bingyu Liu, Zhichao Cui, Anbin Qi, Tian-hao Zhang, Dinghao Zhou, Lewei Lu
Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language model (LLM) applications. However, when it comes to speech large language models (SLLMs), these methods are limited to a two-stage process, where automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architecture suffers from high latency and error propagation. To address these limitations, we propose a unified embedding framework that eliminates the need for intermediate text representations. Specifically, the framework includes separate speech and text encoders, followed by a shared scaling layer that maps both modalities into a common embedding space. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods. We also provide a theoretical analysis of the challenges inherent in end-to-end speech retrieval and introduce architectural principles for effective speech-to-document matching. Extensive experiments demonstrate the robustness of our approach across diverse acoustic conditions and speaker variations, paving the way for a new paradigm in multimodal SLLM retrieval systems.
Zijun Long, Kangheng Liang, Gerardo Aragon-Camarasa, Richard Mccreadie, Paul Henderson
Interactive Text-to-Image Retrieval (I-TIR) has emerged as a transformative user-interactive tool for applications in domains such as e-commerce and education. Yet, current methodologies predominantly depend on finetuned Multimodal Large Language Models (MLLMs), which face two critical limitations: (1) Finetuning imposes prohibitive computational overhead and long-term maintenance costs. (2) Finetuning narrows the pretrained knowledge distribution of MLLMs, reducing their adaptability to novel scenarios. These issues are exacerbated by the inherently dynamic nature of real-world I-TIR systems, where queries and image databases evolve in complexity and diversity, often deviating from static training distributions. To overcome these constraints, we propose Diffusion Augmented Retrieval (DAR), a paradigm-shifting framework that bypasses MLLM finetuning entirely. DAR synergizes Large Language Model (LLM)-guided query refinement with Diffusion Model (DM)-based visual synthesis to create contextually enriched intermediate representations. This dual-modality approach deciphers nuanced user intent more holistically, enabling precise alignment between textual queries and visually relevant images. Rigorous evaluations across four benchmarks reveal DAR's dual strengths: (1) Matches state-of-the-art finetuned I-TIR models on straightforward queries without task-specific training. (2) Scalable Generalization: Surpasses finetuned baselines by 7.61% in Hits@10 (top-10 accuracy) under multi-turn conversational complexity, demonstrating robustness to intricate, distributionally shifted interactions. By eliminating finetuning dependencies and leveraging generative-augmented representations, DAR establishes a new trajectory for efficient, adaptive, and scalable cross-modal retrieval systems.
Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
Retrieval-Augmented Generation (RAG) models have drawn considerable attention
in modern open-domain question answering. The effectiveness of RAG depends on
the quality of the top retrieved documents. However, conventional retrieval
methods sometimes fail to rank the most relevant documents at the top. In this
paper, we introduce ASRank, a new re-ranking method based on scoring retrieved
documents using zero-shot answer scent which relies on a pre-trained large
language model to compute the likelihood of the document-derived answers
aligning with the answer scent. Our approach demonstrates marked improvements
across several datasets, including NQ, TriviaQA, WebQA, ArchivalQA, HotpotQA,
and Entity Questions. Notably, ASRank increases Top-1 retrieval accuracy on NQ
from 19.2% to 46.5% for MSS and from 22.1% to 47.3% for BM25. It also
shows strong retrieval performance on several datasets compared to
state-of-the-art methods (47.3 Top-1 for ASRank vs. 35.4 for UPR, with BM25).
Authors' comments: Accepted At NAACL 2025
Zihang Li, Yangdong Ruan, Wenjun Liu, Zhengyang Wang, Tong Yang
Although retrieval-augmented generation (RAG) significantly improves generation quality by retrieving external knowledge bases and integrating generated content, it faces computational efficiency bottlenecks, particularly in knowledge retrieval tasks involving hierarchical structures for Tree-RAG. This paper proposes a Tree-RAG acceleration method based on an improved Cuckoo Filter, which optimizes entity localization during the retrieval process to achieve significant performance improvements. Tree-RAG effectively organizes entities through the introduction of a hierarchical tree structure, while the Cuckoo Filter serves as an efficient data structure that supports rapid membership queries and dynamic updates. The experimental results demonstrate that our method is much faster than naive Tree-RAG while maintaining high levels of generative quality. When the number of trees is large, our method is hundreds of times faster than naive Tree-RAG. Our work is available at https://github.com/TUPYP7180/CFT-RAG-2025.
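For readers unfamiliar with the data structure named above, the following is a minimal sketch of a textbook cuckoo filter supporting insertion, membership queries, and deletion. It is not the improved variant proposed in the paper, and it does not show how the filter is attached to Tree-RAG; the class name, bucket sizes, fingerprint width, and hashing scheme are all illustrative assumptions.

```python
import hashlib
import random


class CuckooFilter:
    """Textbook cuckoo filter (illustrative; not the paper's improved variant)."""

    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=200, seed=0):
        # A power-of-two bucket count keeps the XOR-based alternate index in range
        # and makes the alternate-index mapping its own inverse.
        assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        # 16-bit fingerprint in the range 1..65535.
        return self._hash(b"fp:" + item.encode()) % 65535 + 1

    def _index(self, item: str) -> int:
        return self._hash(item.encode()) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Filter is considered full; the last evicted fingerprint is dropped.
        return False

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


# Usage with a hypothetical entity name.
entities = CuckooFilter()
entities.insert("aspirin")
found = entities.contains("aspirin")
```

Membership queries can return false positives when two items share a fingerprint and a bucket, but never false negatives for items that were inserted and not deleted. Storing fingerprints rather than full keys is what allows deletion, which a plain Bloom filter does not support.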
Yuntong Hu, Zhihan Lei, Zhongjie Dai, Allen Zhang, Abhinav Angirekula, Zheng Zhang, Liang Zhao
Research question answering requires accurate retrieval and contextual
understanding of scientific literature. However, current Retrieval-Augmented
Generation (RAG) methods often struggle to balance complex document
relationships with precise information retrieval. In this paper, we introduce
Contextualized Graph Retrieval-Augmented Generation (CG-RAG), a novel framework
that integrates sparse and dense retrieval signals within graph structures to
enhance retrieval efficiency and subsequently improve generation quality for
research question answering. First, we propose a contextual graph
representation for citation graphs, effectively capturing both explicit and
implicit connections within and across documents. Next, we introduce
Lexical-Semantic Graph Retrieval (LeSeGR), which seamlessly integrates sparse
and dense retrieval signals with graph encoding. It bridges the gap between
lexical precision and semantic understanding in citation graph retrieval,
demonstrating generalizability to existing graph retrieval and hybrid retrieval
methods. Finally, we present a context-aware generation strategy that utilizes
the retrieved graph-structured information to generate precise and contextually
enriched responses using large language models (LLMs). Extensive experiments on
research question answering benchmarks across multiple domains demonstrate that
our CG-RAG framework significantly outperforms RAG methods combined with
various state-of-the-art retrieval approaches, delivering superior retrieval
accuracy and generation quality.
Authors' comments: 10 pages, 2 figures
Bingjun Luo, Jinpeng Wang, Wang Zewen, Junjie Zhu, Xibin Zhao
Video surveillance systems are crucial components for ensuring public safety
and management in smart cities. As a fundamental task in video surveillance,
text-to-image person retrieval aims to retrieve the target person from an image
gallery that best matches the given text description. Most existing
text-to-image person retrieval methods are trained in a supervised manner that
requires sufficient labeled data in the target domain. However, it is common in
practice that only unlabeled data is available in the target domain due to the
difficulty and cost of data annotation, which limits the generalization of
existing methods in practical application scenarios. To address this issue, we
propose a novel unsupervised domain adaptation method, termed Graph-Based
Cross-Domain Knowledge Distillation (GCKD), to learn the cross-modal feature
representation for text-to-image person retrieval in a cross-dataset scenario.
The proposed GCKD method consists of two main components. Firstly, a
graph-based multi-modal propagation module is designed to bridge the
cross-domain correlation among the visual and textual samples. Secondly, a
contrastive momentum knowledge distillation module is proposed to learn the
cross-modal feature representation using the online knowledge distillation
strategy. By jointly optimizing the two modules, the proposed method is able to
achieve efficient performance for cross-dataset text-to-image person retrieval.
Extensive experiments on three publicly available text-to-image person
retrieval datasets demonstrate the effectiveness of the proposed GCKD method,
which consistently outperforms the state-of-the-art baselines.
Authors' comments: Accepted by AAAI 2025
Ziwen Li, Xiang 'Anthony' Chen, Youngseung Jeon
Drug discovery (DD) has tremendously contributed to maintaining and improving
public health. Hypothesizing that inhibiting protein misfolding can slow
disease progression, researchers focus on target identification (Target ID) to
find protein structures for drug binding. While Large Language Models (LLMs)
and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug
discovery, integrating models into cohesive workflows remains challenging. We
conducted a user study with drug discovery researchers to identify the
applicability of LLMs and RAGs in Target ID. We identified two main findings:
1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on
an initial protein and protein candidates that have a therapeutic impact; 2)
the model must provide the PPI and relevant explanations for better
understanding. Based on these observations, we identified three limitations in
previous approaches for Target ID: 1) semantic ambiguity, 2) lack of
explainability, and 3) short retrieval units. To address these issues, we
propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve
agent pipeline RAG framework to support large-scale PPI signaling pathway
exploration in understanding therapeutic impacts by decomposing the analysis of
entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
Authors' comments: 14 pages; 5 figures. Published as a finding at NAACL 2025
Zinan Zhou, Keiichiro Toda, Rikimaru Kurata, Kohki Horie, Ryoichi Horisaki, Takuro Ideguchi
Zernike's phase contrast microscopy (PCM) is among the most widely used techniques for observing phase objects, but it is not quantitative, as it cannot directly provide phase information. Current methods for computationally extracting phase distributions from PCM images, however, rely heavily on empirical tuning of regularization parameters. In this paper, we extend an existing approach by employing an untrained neural network as an image prior, removing the need for manual regularization. We quantitatively demonstrate improved accuracy and robustness in phase retrieval compared to existing methods, using numerical and experimental PCM images. Our results confirm the feasibility of applying deep priors for phase retrieval in incoherent illumination setups.
T. Y. S. S. Santosh, Isaac Misael Olguín Nolasco, Matthias Grabmair
Prior case retrieval (PCR) is crucial for legal practitioners to find
relevant precedent cases given the facts of a query case. Existing approaches
often overlook the underlying semantic intent in determining relevance with
respect to the query case. In this work, we propose LeCoPCR, a novel approach
that explicitly generates intents in the form of legal concepts from the facts
of a given query case and then augments the query with these concepts to enhance
the model's understanding of the semantic intent that dictates relevance. To overcome
the unavailability of annotated legal concepts, we employ a weak supervision
approach to extract key legal concepts from the reasoning section using
Determinantal Point Process (DPP) to balance quality and diversity.
Experimental results on the ECtHR-PCR dataset demonstrate the effectiveness of
leveraging legal concepts and DPP-based key concept extraction.
Authors' comments: Accepted to NAACL 2025
Samuel Pinilla, Kumar Vijay Mishra, Brian M. Sadler
The ambiguity function (AF) is a critical tool in radar waveform design,
representing the two-dimensional correlation between a transmitted signal and
its time-delayed, frequency-shifted version. Obtaining a radar signal to match
a specified AF magnitude is a bi-variate variant of the well-known phase
retrieval problem. Prior approaches to this problem were either limited to a
few classes of waveforms or lacked a computable procedure to estimate the
signal. Our recent work provided a framework for solving this problem for both
band- and time-limited signals using non-convex optimization. In this paper, we
introduce a novel approach WaveMax that formulates waveform recovery as a
convex optimization problem by relying on the fractional Fourier transform
(FrFT)-based AF. We exploit the fact that AF of the FrFT of the original signal
is equivalent to a rotation of the original AF. In particular, we reconstruct
the radar signal by solving a low-rank minimization problem, which approximates
the waveform using the leading eigenvector of a matrix derived from the AF. Our
theoretical analysis shows that unique waveform reconstruction is achievable
with a sample size no more than three times the signal frequencies or time
samples. Numerical experiments validate the efficacy of WaveMax in recovering
signals from noiseless and noisy AF, including scenarios with randomly and
uniformly sampled sparse data.
Authors' comments: 13 pages, 6 figures
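The "leading eigenvector" step mentioned in the abstract above can be illustrated with generic power iteration on a Hermitian matrix. This sketch does not implement WaveMax: how the matrix is built from the ambiguity function is not shown, and the function name, the all-ones starting vector, and the toy rank-one matrix are assumptions made for the example.

```python
import math


def leading_eigenvector(matrix, iters=200):
    """Power iteration for a Hermitian positive semidefinite matrix.

    Returns a unit-norm estimate of the eigenvector with the largest eigenvalue.
    The all-ones start is an arbitrary choice; it fails if it happens to be
    orthogonal to the leading eigenvector.
    """
    n = len(matrix)
    vec = [complex(1.0, 0.0)] * n
    for _ in range(iters):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec


# Toy check: a rank-one matrix x x^H built from a known complex signal x.
signal = [1 + 0j, 1j, -1 + 0j]
lifted = [[a * b.conjugate() for b in signal] for a in signal]
estimate = leading_eigenvector(lifted)

# The estimate matches the signal only up to scale and a global phase factor.
inner = sum(a.conjugate() * b for a, b in zip(signal, estimate))
```

The global phase ambiguity visible here is inherent to phase retrieval problems: magnitude-only measurements cannot distinguish a signal from the same signal multiplied by a unit-modulus constant.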
Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yangfan Ye, Weihong Zhong, Yuxuan Gu et al.
Ensuring contextual faithfulness in retrieval-augmented large language models
(LLMs) is crucial for building trustworthy information-seeking systems,
particularly in long-form question-answering (LFQA) scenarios. In this work, we
identify a salient correlation between LFQA faithfulness and retrieval heads, a
set of attention heads responsible for retrieving contextual information.
Leveraging this insight, we propose RHIO, a framework designed to teach LLMs to
explicitly discriminate between faithful and unfaithful generations. RHIO first
augments unfaithful samples that simulate realistic model-intrinsic errors by
selectively masking retrieval heads. Then, these samples are incorporated into
joint training, enabling the model to distinguish unfaithful outputs from
faithful ones conditioned on control tokens. Furthermore, these control tokens
are leveraged to self-induce contrastive outputs, amplifying their difference
through contrastive decoding. Additionally, to facilitate the evaluation of
contextual faithfulness, we also introduce GroundBench, a comprehensive
benchmark compiled from five existing LFQA datasets. Extensive experimental
results on GroundBench demonstrate that RHIO significantly improves
faithfulness, even outperforming GPT-4o.
Authors' comments: Submitted to ARR October 2024
Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia
Code generation aims to automatically generate code snippets in a specific
programming language according to natural language descriptions. The continuous
advancements in deep learning, particularly pre-trained models, have empowered
the code generation task to achieve remarkable performance. One main challenge
of pre-trained models for code generation is the semantic gap between natural
language requirements and source code. To address the issue, prior studies
typically adopt a retrieval-augmented framework for the task, where the similar
code snippets collected by a retrieval process can be leveraged to help
understand the requirements and provide guidance for the generation process.
However, there is a lack of systematic study on the application of this
framework for code generation, including the impact of the final generated
results and the specific usage of the framework. In this paper, we choose three
popular pre-trained code models, namely CodeGen, UniXcoder, and CodeT5, to
assess the impact of the quality and utilization of retrieved code on the
retrieval-augmented framework. Our analysis shows that the retrieval-augmented
framework is beneficial for improving the performance of the existing
pre-trained models. We also provide suggestions on the utilization of the
retrieval-augmented code generation framework: BM25 and Sequential Integration
Fusion are recommended due to their convenience and superior performance.
Sketch Filling Fusion, which extracts a sketch of relevant code, could help the
model improve its performance further. Additionally, we conduct experiments to
investigate the influence of the retrieval-augmented framework on large
language models for code generation, showing the effectiveness of the
framework, and we discuss the trade-off between performance improvement and
computational costs in each phase within the framework.
Authors' comments: This paper is accepted by TOSEM
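As an illustration of the BM25 retrieval step recommended in the abstract above, the following sketch ranks a small pool of code snippets against a natural language requirement. It is a generic Okapi BM25 scorer, not the study's code: the tokenizer, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy corpus are assumptions, and the fusion of retrieved code into the model input is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, corpus, k1=1.5, b=0.75):
    """Return corpus indices sorted from most to least relevant under BM25."""
    docs = [tokenize(doc) for doc in corpus]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Hypothetical retrieval pool of code snippets.
snippets = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
    "def sort_list(xs): return sorted(xs)",
]
ranking = bm25_rank("read file from path", snippets)
best_snippet = snippets[ranking[0]]
```

Because BM25 needs no training and works directly on tokens, it is a convenient baseline retriever; splitting identifiers such as `read_file` into sub-tokens is what lets a natural language query match code here.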
Jeonghun Cho, Gary Geunbae Lee
Retrieval-augmented question answering (QA) integrates external information
and thereby increases the QA accuracy of reader models that lack domain
knowledge. However, documents retrieved for closed domains require high
expertise, so the reader model may have difficulty fully comprehending the
text. Moreover, the retrieved documents contain thousands of tokens, some
unrelated to the question. As a result, the documents include some inaccurate
information, which could lead the reader model to mistrust the passages and
could result in hallucinations. To solve these problems, we propose K-comp
(Knowledge-injected compressor) which provides the knowledge required to answer
correctly. The compressor automatically generates the prior knowledge necessary
to facilitate the answer process prior to compression of the retrieved
passages. Subsequently, the passages are compressed autoregressively, with the
generated knowledge being integrated into the compression process. This process
ensures alignment between the question intent and the compressed context. By
augmenting this prior knowledge and concise context, the reader models are
guided toward relevant answers and trust the context.
Authors' comments: Accepted at NAACL 2025 (Main, long paper)
Pierre Onghena, Santiago Velasco-Forero, Beatriz Marcotegui
Point clouds are sets of data points in space that represent the 3D geometry of objects. A fundamental processing step is to identify a subset of points that represents the shape. While traditional sampling methods often fail to incorporate geometrical information, recent learning-based sampling models have achieved significant levels of performance. Integrating geometrical priors can enhance the ability to learn and preserve the underlying structure when sampling. To shed light on the shape, a qualitative skeleton serves as an effective descriptor to guide sampling for both local and global geometries. In this paper, we introduce MorphoSkel3D, a new technique based on morphology that facilitates efficient skeletonization of shapes. MorphoSkel3D is a unique, rule-based algorithm with low computational cost, and we benchmark its quality and performance on two large datasets, ModelNet and ShapeNet, under different sampling ratios. The results show that training with MorphoSkel3D leads to more informed and accurate sampling in the practical applications of object classification and point cloud retrieval.
Mira Gonen, Michael Langberg, Alex Sprintson
This paper focuses on the design and analysis of privacy-preserving techniques for group testing and infection status retrieval. Our work is motivated by the need to provide accurate information on the status of disease spread among a group of individuals while protecting the privacy of the infection status of any single individual involved. The paper is motivated by practical scenarios, such as controlling the spread of infectious diseases, where individuals might be reluctant to participate in testing if their outcomes are not kept confidential. The paper makes the following contributions. First, we present a differential privacy framework for the subset retrieval problem, which focuses on sharing the infection status of individuals with administrators and decision-makers. We characterize the trade-off between the accuracy of subset retrieval and the degree of privacy guaranteed to the individuals. In particular, we establish tight lower and upper bounds on the achievable level of accuracy subject to the differential privacy constraints. We then formulate the differential privacy framework for the noisy group testing problem in which noise is added either before or after the pooling process. We establish a reduction between the private subset retrieval and noisy group testing problems and show that the converse and achievability schemes for subset retrieval carry over to differentially private group testing.
Yang Bai, Christan Earl Grant, Daisy Zhe Wang
Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text
and images, has gained significant attention in information retrieval (IR) and
natural language processing (NLP). Traditional ranking methods rely on small
encoder-based language models, which are incompatible with modern decoder-based
generative large language models (LLMs) that have advanced various NLP tasks.
To bridge this gap, we propose RAMQA, a unified framework combining
learning-to-rank methods with generative permutation-enhanced ranking
techniques. We first train a pointwise multi-modal ranker using LLaVA as the
backbone. Then, we apply instruction tuning to train a LLaMA model for
re-ranking the top-k documents using an innovative autoregressive multi-task
learning approach. Our generative ranking model generates re-ranked document
IDs and specific answers from document candidates in various permutations.
Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant
improvements over strong baselines, highlighting the effectiveness of our
approach. Code and data are available at: https://github.com/TonyBY/RAMQA
Authors' comments: Accepted by NAACL 2025 Findings
Kenta Uesugi, Naoki Saito, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Composed Image Retrieval (CIR) provides an effective way to manage and access
large-scale visual data. Construction of the CIR model utilizes triplets that
consist of a reference image, modification text describing desired changes, and
a target image that reflects these changes. Training CIR models effectively
requires extensive manual annotation to construct high-quality training
datasets, which can be time-consuming and labor-intensive. To deal
with this problem, this paper proposes a novel triplet synthesis method by
leveraging counterfactual image generation. By controlling visual feature
modifications via counterfactual image generation, our approach automatically
generates diverse training triplets without any manual intervention. This
approach facilitates the creation of larger and more expressive datasets,
leading to improved CIR model performance.
Authors' comments: 4 pages, 4 figures