Quanjun Zhang, Chunrong Fang, Yi Zheng, Ruixiang Qian, Shengcheng Yu, Yuan Zhao, Jianyi Zhou, Yun Yang et al.
Unit testing attempts to validate the correctness of basic units of the
software system under test and has a crucial role in software development and
testing. Very recent work proposes a retrieve-and-edit approach to generate
unit test oracles, i.e., assertions. Despite being promising, it is still far
from perfect due to some limitations, such as splitting assertion retrieval and
generation into two separate components without benefiting each other. In this
paper, we propose AG-RAG, a retrieval-augmented automated assertion generation
approach that leverages external codebases and joint training to address
various technical limitations of prior work. Inspired by the plastic surgery
hypothesis, AG-RAG attempts to combine relevant unit tests and advanced
pre-trained language models (PLMs) with retrieval-augmented fine-tuning. AG-RAG
builds a dense retriever to search for relevant test-assert pairs (TAPs) with
semantic matching and a retrieval-augmented generator to synthesize accurate
assertions with the focal-test and retrieved TAPs as input. Besides, AG-RAG
leverages a code-aware language model CodeT5 as the cornerstone to facilitate
both assertion retrieval and generation tasks. Furthermore, the retriever is
optimized in conjunction with the generator as a whole pipeline with a joint
training strategy. This unified design fully adapts both components
specifically for retrieving more useful TAPs, thereby generating accurate
assertions. We extensively evaluate AG-RAG against six state-of-the-art AG
approaches on two benchmarks and three metrics. Experimental results show that
AG-RAG significantly outperforms previous AG approaches on all benchmarks and
metrics, e.g., improving the most recent baseline EditAS by 20.82% and 26.98%
in terms of accuracy. AG-RAG also correctly generates 1739 and 2866 unique
assertions that all baselines fail to generate, 3.45X and 9.20X more than
EditAS.
Authors' comments: Accepted to IEEE Transactions on Software Engineering (TSE 2025)
Yau-Shian Wang, Wei-Cheng Chang, Jyun-Yu Jiang, Jiong Zhang, Hsiang-Fu Yu, S. V. N. Vishwanathan
Extreme multi-label classification (XMC) seeks to find relevant labels from an extremely large label collection for a given text input. To tackle such a vast label space, current state-of-the-art methods fall into two categories. The one-versus-all (OVA) method uses a learnable label embedding for each label, excelling at memorization (i.e., capturing detailed training signals for accurate head label prediction). In contrast, the dual-encoder (DE) model maps input and label text into a shared embedding space for better generalization (i.e., the capability of predicting tail labels with limited training data), but may fall short at memorization. To achieve both generalization and memorization, existing XMC methods often combine DE and OVA models, which involves complex training pipelines. Inspired by the success of retrieval-augmented language models, we propose the Retrieval-augmented Encoders for XMC (RAEXMC), a novel framework that equips a DE model with retrieval-augmented capability for efficient memorization without additional trainable parameters. During training, RAEXMC is optimized with a contrastive loss over a knowledge memory that consists of both input instances and labels. During inference, given a test input, RAEXMC retrieves the top-$K$ keys from the knowledge memory and aggregates the corresponding values as the prediction scores. We showcase the effectiveness and efficiency of RAEXMC on four public LF-XMC benchmarks. RAEXMC not only advances the state-of-the-art (SOTA) DE method DEXML, but also achieves more than 10x speedup on the largest LF-AmazonTitles-1.3M dataset under the same 8 A100 GPU training environment.
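The inference step described in the RAEXMC abstract above (retrieve the top-K keys from a knowledge memory, then aggregate their values into prediction scores) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `predict_scores`, the softmax weighting of similarities, the `temperature` parameter, and the toy memory layout are all assumptions made for the example.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def predict_scores(query, memory, k=2, temperature=1.0):
    """Score labels for `query` from a key-value memory (hypothetical layout).

    memory: list of (key_vector, {label: weight}) pairs.
    Returns {label: score}, built from the k most similar keys only.
    """
    # Rank memory entries by similarity to the query and keep the top k.
    ranked = sorted(
        ((dot(query, key), value) for key, value in memory),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Softmax over the retained similarities (assumed aggregation rule).
    top = max(sim for sim, _ in ranked)
    weights = [math.exp((sim - top) / temperature) for sim, _ in ranked]
    total = sum(weights)
    scores = {}
    for weight, (_, value) in zip(weights, ranked):
        for label, label_weight in value.items():
            scores[label] = scores.get(label, 0.0) + (weight / total) * label_weight
    return scores

# Toy memory: two entries close to the query, one far away.
memory = [
    ([1.0, 0.0], {"cat": 1.0}),
    ([0.9, 0.1], {"cat": 1.0, "pet": 1.0}),
    ([0.0, 1.0], {"car": 1.0}),
]
scores = predict_scores([1.0, 0.0], memory, k=2)
```

With `k=2`, the distant entry is never retrieved, so its label receives no score at all; labels shared by several retrieved entries accumulate weight.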
Yepeng Liu, Xuandong Zhao, Dawn Song, Yuheng Bu
Retrieval-Augmented Generation (RAG) has become an effective method for enhancing large language models (LLMs) with up-to-date knowledge. However, it poses a significant risk of IP infringement, as IP datasets may be incorporated into the knowledge database by malicious Retrieval-Augmented LLMs (RA-LLMs) without authorization. To protect the rights of the dataset owner, an effective dataset membership inference algorithm for RA-LLMs is needed. In this work, we introduce a novel approach to safeguard the ownership of text datasets and effectively detect unauthorized use by the RA-LLMs. Our approach preserves the original data completely unchanged while protecting it by inserting specifically designed canary documents into the IP dataset. These canary documents are created with synthetic content and embedded watermarks to ensure uniqueness, stealthiness, and statistical provability. During the detection process, unauthorized usage is identified by querying the canary documents and analyzing the responses of RA-LLMs for statistical evidence of the embedded watermark. Our experimental results demonstrate high query efficiency, detectability, and stealthiness, along with minimal perturbation to the original dataset, all without compromising the performance of the RAG system.
Zhao-Yi Zhou, Da-Jian Zhang
Using quantum measurements to extract information from states is a matter of
routine in quantum science and technologies. A recent work [Phys. Rev. Lett.
133, 040202 (2024)] reported the finding that the symmetric structures of a
state can be harnessed to dramatically reduce the sample complexity in
extracting information from the state. However, due to the presence of noise,
the actual state at hand is often corrupted, making its symmetric structures
distorted before the execution of quantum measurements. Here, using the
methodology of quantum metrology, we identify the optimal measurement that can
retrieve maximum information of a symmetric state from its corrupted copies. We
show that this measurement can be found by solving a semidefinite program in
generic cases and can be explicitly determined for a large class of noise
models covariant under the symmetry group in question. The results of this
study nicely complement the recent work by providing a method to optimally
utilize the distorted symmetric structures of corrupted states for information
retrieval.
Authors' comments: Close to the published version
Rotem Shalev-Arkushin, Rinon Gal, Amit H. Bermano, Ohad Fried
Diffusion models enable high-quality and diverse visual content synthesis. However, they struggle to generate rare or unseen concepts. To address this challenge, we explore the usage of Retrieval-Augmented Generation (RAG) with image generation models. We propose ImageRAG, a method that dynamically retrieves relevant images based on a given text prompt and uses them as context to guide the generation process. Prior approaches that used retrieved images to improve generation trained models specifically for retrieval-based generation. In contrast, ImageRAG leverages the capabilities of existing image conditioning models and does not require RAG-specific training. Our approach is highly adaptable and can be applied across different model types, showing significant improvement in generating rare and fine-grained concepts using different base models. Our project page is available at: https://rotem-shalev.github.io/ImageRAG
Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, Yuchi Ma
Retrieval-Augmented Generation (RAG) has proven effective in integrating external knowledge into large language models (LLMs) for question-answer (QA) tasks. State-of-the-art RAG approaches often use graph data as the external data, since it captures rich semantic information and link relationships between entities. However, existing graph-based RAG approaches cannot accurately identify the relevant information in the graph, and they consume large numbers of tokens in the online retrieval process. To address these issues, we introduce a novel graph-based RAG approach, called Attributed Community-based Hierarchical RAG (ArchRAG), which augments the question using attributed communities and introduces a novel LLM-based hierarchical clustering method. To retrieve the most relevant information from the graph for the question, we build a novel hierarchical index structure for the attributed communities and develop an effective online retrieval method. Experimental results demonstrate that ArchRAG outperforms existing methods in terms of both accuracy and token cost.
Shubham Gupta, Zichao Li, Tianyi Chen, Cem Subakan, Siva Reddy, Perouz Taslakian, Valentina Zantedeschi
Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale corpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.
Xingyi Zhang, Kun Xie, Ningqiao Huang, Wei Liu, Peilin Zhao, Sibo Wang, Kangfei Zhao, Biaobin Jiang
Recent advancements in protein design have leveraged diffusion models to generate structural scaffolds, followed by a process known as protein inverse folding, which involves sequence inference on these scaffolds. However, these methodologies face significant challenges when applied to hyper-variable structures such as antibody Complementarity-Determining Regions (CDRs), where sequence inference frequently results in non-functional sequences due to hallucinations. Distinguished from prevailing protein inverse folding approaches, this paper introduces Igseek, a novel structure-retrieval framework that infers CDR sequences by retrieving similar structures from a natural antibody database. Specifically, Igseek employs a simple yet effective multi-channel equivariant graph neural network to generate high-quality geometric representations of CDR backbone structures. Subsequently, it aligns sequences of structurally similar CDRs and utilizes structurally conserved sequence motifs to enhance inference accuracy. Our experiments demonstrate that Igseek not only proves to be highly efficient in structural retrieval but also outperforms state-of-the-art approaches in sequence recovery for both antibodies and T-Cell Receptors, offering a new retrieval-based perspective for therapeutic protein design.
Penghao Lu, Xin Dong, Yuansheng Zhou, Lei Cheng, Chuan Yuan, Linjian Mo
Generative retrieval constitutes an innovative approach in information retrieval, leveraging generative language models (LM) to generate a ranked list of document identifiers (docid) for a given query. It simplifies the retrieval pipeline by replacing the large external index with model parameters. However, existing works merely learned the relationship between queries and document identifiers, which is unable to directly represent the relevance between queries and documents. To address the above problem, we propose a novel and general generative retrieval framework, namely Leveraging Document-Oriented Contrastive Learning in Generative Retrieval (DOGR), which leverages contrastive learning to improve generative retrieval tasks. It adopts a two-stage learning strategy that captures the relationship between queries and documents comprehensively through direct interactions. Furthermore, negative sampling methods and corresponding contrastive learning objectives are implemented to enhance the learning of semantic representations, thereby promoting a thorough comprehension of the relationship between queries and documents. Experimental results demonstrate that DOGR achieves state-of-the-art performance compared to existing generative retrieval methods on two public benchmark datasets. Further experiments have shown that our framework is generally effective for common identifier construction techniques.
Osman Tursun, Sinan Kalkan, Simon Denman, Clinton Fookes
Zero-shot composed image retrieval (ZS-CIR) enables image search using a reference image and text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches face three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the Prompt Directional Vector (PDV), a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be publicly available.
Siddharth Gandhi, Luyu Gao, Jamie Callan
This paper presents a multi-stage reranking system for repository-level code
search, which leverages the widely available commit histories of large
open-source repositories to aid in bug fixing. We define the task of
repository-level code search as retrieving the set of files from the current
state of a code repository that are most relevant to addressing a user's
question or bug. The proposed approach combines BM25-based retrieval over
commit messages with neural reranking using CodeBERT to identify the most
pertinent files. By learning patterns from diverse repositories and their
commit histories, the system can surface relevant files for the task at hand.
The system leverages both commit messages and source code for relevance
matching, and is evaluated in both normal and oracle settings. Experiments on a
new dataset created from 7 popular open-source repositories demonstrate
substantial improvements of up to 80% in MAP, MRR and P@1 over the BM25
baseline, across a diverse set of queries, demonstrating the effectiveness of this
approach. We hope this work aids LLM agents as a tool for better code search
and understanding. Our code and results obtained are publicly available.
Authors' comments: 16 pages
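The first stage named in the code-search abstract above, BM25 retrieval over commit messages, can be sketched in plain Python. This is a minimal illustration under stated assumptions and not the authors' system: the function name `bm25_rank`, whitespace tokenization, the parameter defaults `k1=1.5` and `b=0.75`, and the toy commit messages are all hypothetical, and the mapping from commits to files and the CodeBERT reranking stage are omitted.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by Okapi BM25 score, best first."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length-normalized term-frequency saturation.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Hypothetical commit messages standing in for a repository history.
commits = [
    "fix null pointer in parser",
    "update readme badges",
    "handle empty input crash in parser",
]
ranking = bm25_rank("parser crash", commits)
```

In a pipeline like the one described, the top-ranked commits would then point to candidate files for a neural reranker to reorder.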
Stephan Goerttler, Yucheng Wang, Emadeldeen Eldele, Fei He, Min Wu
Despite significant advances in deep learning-based sleep stage
classification, the clinical adoption of automatic classification models
remains slow. One key challenge is the lack of explainability, as many models
function as black boxes with millions of parameters. In response, recent work
has increasingly focussed on enhancing model explainability. This study
contributes to these efforts by globally explaining spectral processing of
individual EEG channels. Specifically, we introduce a method to retrieve the
filter spectrum of low-level convolutional feature extraction and compare it
with the classification-relevant spectral information in the data. We evaluate
our approach on the MSA-CNN model using the ISRUC-S3 and Sleep-EDF-20 datasets.
Our findings show that spectral processing plays a significant role in the
lower frequency bands. In addition, comparing the correlation between filter
spectrum and data-based spectral information with univariate performance
indicates that the model naturally prioritises the most informative channels in
a multimodal setting. We specify how these insights can be leveraged to enhance
model performance. The code for the filter spectrum retrieval and its analysis
is available at https://github.com/sgoerttler/MSA-CNN.
Authors' comments: 5 pages, 3 figures, conference paper
Yan Weng, Fengbin Zhu, Tong Ye, Haoyan Liu, Fuli Feng, Tat-Seng Chua
Retrieval-Augmented Generation (RAG), which integrates external knowledge
into Large Language Models (LLMs), has proven effective in enabling LLMs to
produce more accurate and reliable responses. However, effectively integrating
external retrieved knowledge with internal parametric knowledge in LLMs
remains a significant challenge. In this work, we propose a novel
Self-Selection RAG framework, where the LLM selects between a pair of
responses, one generated with internal parametric knowledge alone and one
generated with external retrieved knowledge as well, to achieve enhanced accuracy. To this end, we
devise a Self-Selection-RGP method to enhance the capabilities of the LLM in
both generating and selecting the correct answer, by training the LLM with
Direct Preference Optimization (DPO) over a curated Retrieval Generation
Preference (RGP) dataset. Experimental results with two open-source LLMs (i.e.,
Llama2-13B-Chat and Mistral-7B) demonstrate the superiority of our
approach over other baseline methods on the Natural Questions (NQ) and TriviaQA
datasets.
Authors' comments: 12 pages, 6 figures
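The Self-Selection RAG abstract above trains the LLM with Direct Preference Optimization (DPO). As a reminder of what that objective computes, here is a minimal sketch of the standard per-pair DPO loss. It is not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions, and the inputs are taken to be summed log-probabilities of whole responses under the trained policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) response pair."""
    # How much more the policy prefers each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# No preference shift relative to the reference model.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy has moved toward the chosen response and away from the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls below its neutral value only when the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.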
Kevin Nanekhan, Venktesh V, Erik Martin, Henrik Vatndal, Vinay Setty, Avishek Anand
The advances in digital tools have led to the rampant spread of
misinformation. While fact-checking aims to combat this, manual fact-checking
is cumbersome and not scalable. Automated fact-checking must be efficient to
help combat misinformation in real time and at the
source. Fact-checking pipelines primarily comprise a knowledge retrieval
component, which extracts relevant knowledge to fact-check a claim from large
knowledge sources like Wikipedia, and a verification component. Existing
work primarily focuses on the fact-verification part rather than evidence
retrieval from large data collections, which often faces scalability issues in
practical applications such as live fact-checking. In this study, we address
this gap by exploring various methods for indexing a succinct set of factual
statements from large collections like Wikipedia to enhance the retrieval phase
of the fact-checking pipeline. We also explore the impact of vector
quantization to further improve the efficiency of pipelines that employ dense
retrieval approaches for first-stage retrieval. We study the efficiency and
effectiveness of the approaches on fact-checking datasets such as HoVer and
WiCE, leveraging Wikipedia as the knowledge source. We also evaluate the
real-world utility of the efficient retrieval approaches by fact-checking the 2024
presidential debate, and we open-source the collection of claims with
corresponding labels identified in the debate. Through a combination of indexed
facts together with dense retrieval and index compression, we achieve up to a
10.0x speedup on CPUs and more than a 20.0x speedup on GPUs compared to the
classical fact-checking pipelines over large collections.
Authors' comments: Accepted to ECIR 2025, 15 pages
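The fact-checking abstract above mentions compressing dense-retrieval indexes through quantization. The sketch below shows uniform scalar quantization, a deliberately simpler relative of the vector quantization the abstract refers to, to illustrate the storage-versus-precision trade-off. It is an illustration only, not the method evaluated in the paper, and the function names `quantize` and `dequantize` are hypothetical.

```python
def quantize(vec, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] over the vector's own range."""
    low, high = min(vec), max(vec)
    levels = (1 << bits) - 1
    # Guard against a constant vector, where the range would be zero.
    scale = (high - low) / levels if high > low else 1.0
    codes = [round((x - low) / scale) for x in vec]
    return codes, low, scale

def dequantize(codes, low, scale):
    """Reconstruct approximate floats from integer codes."""
    return [low + code * scale for code in codes]

embedding = [-1.0, 0.0, 0.5, 1.0]
codes, low, scale = quantize(embedding)
restored = dequantize(codes, low, scale)
max_error = max(abs(a - b) for a, b in zip(embedding, restored))
```

Each component is stored in one byte instead of a four-byte float, and the reconstruction error per component is bounded by half a quantization step.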
Mengxi Xiao, Zihao Jiang, Lingfei Qian, Zhengyu Chen, Yueru He, Yijing Xu, Yuecheng Jiang, Dong Li et al.
Accurately forecasting stock price movements is critical for informed
financial decision-making, supporting applications ranging from algorithmic
trading to risk management. However, this task remains challenging due to the
difficulty of retrieving subtle yet high-impact patterns from noisy financial
time-series data, where conventional retrieval methods, whether based on
generic language models or simplistic numeric similarity, often fail to capture
the intricate temporal dependencies and context-specific signals essential for
precise market prediction. To bridge this gap, we introduce FinSrag, the first
retrieval-augmented generation (RAG) framework with a novel domain-specific
retriever FinSeer for financial time-series forecasting. FinSeer leverages a
candidate selection mechanism refined by LLM feedback and a similarity-driven
training objective to align queries with historically influential sequences
while filtering out financial noise. Such training enables FinSeer to identify
the most relevant time-series data segments for downstream forecasting tasks,
unlike embedding or distance-based retrieval methods used in existing RAG
frameworks. The retrieved patterns are then fed into StockLLM, a 1B-parameter
LLM fine-tuned for stock movement prediction, which serves as the generative
backbone. Beyond the retrieval method, we enrich the retrieval corpus by
curating new datasets that integrate a broader set of financial indicators,
capturing previously overlooked market dynamics. Experiments demonstrate that
FinSeer outperforms existing textual retrievers and traditional distance-based
retrieval approaches in enhancing the prediction accuracy of StockLLM,
underscoring the importance of domain-specific retrieval frameworks in handling
the complexity of financial time-series data.
Authors' comments: 11 pages, 4 figures
Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin et al.
To deliver coherent and personalized experiences in long-term conversations,
existing approaches typically perform retrieval augmented response generation
by constructing memory banks from conversation history at the
turn level, at the session level, or through summarization techniques. In this paper,
we present two key findings: (1) The granularity of memory unit matters:
Turn-level, session-level, and summarization-based methods each exhibit
limitations in both memory retrieval accuracy and the semantic quality of the
retrieved content. (2) Prompt compression methods, such as
LLMLingua-2, can effectively serve as a denoising mechanism, enhancing
memory retrieval accuracy across different granularities. Building on these
insights, we propose SeCom, a method that constructs a memory bank with topical
segments by introducing a conversation Segmentation model, while performing
memory retrieval based on Compressed memory units. Experimental results show
that SeCom outperforms turn-level, session-level, and several
summarization-based methods on long-term conversation benchmarks such as LOCOMO
and Long-MT-Bench+. Additionally, the proposed conversation segmentation method
demonstrates superior performance on dialogue segmentation datasets such as
DialSeg711, TIAGE, and SuperDialSeg.
Authors' comments: 10 pages, 5 figures, conference
Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu et al.
Retrieval-Augmented Generation (RAG) is often used with Large Language Models
(LLMs) to infuse domain knowledge or user-specific information. In RAG, given a
user query, a retriever extracts chunks of relevant text from a knowledge base.
These chunks are sent to an LLM as part of the input prompt. Typically, any
given chunk is repeatedly retrieved across user questions. Currently, however,
the attention layers in LLMs fully recompute the keys and values (KVs)
of the input chunks for every question, because state-of-the-art methods cannot reuse
KV-caches when chunks appear at arbitrary locations with arbitrary contexts,
and naive reuse degrades output quality. This recomputation causes potentially
redundant computations on expensive GPUs and increases latency. In this work,
we propose Cache-Craft, a system for managing and reusing precomputed KVs
corresponding to the text chunks (which we call chunk-caches) in RAG-based systems.
We present how to identify chunk-caches that are reusable, how to efficiently
perform a small fraction of recomputation to fix the cache to maintain output
quality, and how to efficiently store and evict chunk-caches in the hardware
for maximizing reuse while masking any overheads. With real production
workloads as well as synthetic datasets, we show that Cache-Craft reduces
redundant computation by 51% over SOTA prefix-caching and 75% over full
recomputation. Additionally, with continuous batching on a real production
workload, we get a 1.6X speed up in throughput and a 2X reduction in end-to-end
response latency over prefix-caching while maintaining quality, for both the
LLaMA-3-8B and LLaMA-3-70B models.
Authors' comments: Accepted at SIGMOD 2025
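The storing and evicting of chunk-caches described in the Cache-Craft abstract above can be illustrated with a small least-recently-used cache keyed by a hash of the chunk text. This is a simplified sketch and not the Cache-Craft system: the class name `ChunkCache`, the LRU eviction policy, and the `compute_kv` callback are assumptions, and the paper's reusability checks and partial recomputation are not modeled.

```python
import hashlib
from collections import OrderedDict

class ChunkCache:
    """LRU cache mapping chunk text to a precomputed value (stand-in for KVs)."""

    def __init__(self, capacity, compute_kv):
        self.capacity = capacity
        self.compute_kv = compute_kv  # hypothetical expensive computation
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        if key in self.store:
            # Mark as most recently used and reuse the stored value.
            self.store.move_to_end(key)
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = self.compute_kv(chunk_text)
        self.store[key] = value
        if len(self.store) > self.capacity:
            # Evict the least recently used entry.
            self.store.popitem(last=False)
        return value

computed = []

def fake_compute_kv(text):
    """Record each computation so reuse can be observed."""
    computed.append(text)
    return len(text)

cache = ChunkCache(capacity=2, compute_kv=fake_compute_kv)
for chunk in ["a", "b", "a", "c", "b"]:
    cache.get(chunk)
```

In the usage above, the second request for "a" is served from the cache, while "b" is evicted when "c" arrives and must be recomputed on its next request.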
Ming Gao, Ruichen Qiu, Zeng Hui Chang, Kanjian Zhang, Haikun Wei, Hong Cai Chen
In the domain of analog circuit design, the retrieval of circuit diagrams has
drawn great interest, primarily due to its vital role in the consultation of
legacy designs and the detection of design plagiarism. Existing image retrieval
techniques are adept at handling natural images: they convert images into
feature vectors and retrieve similar images according to the closeness of
these vectors. Nonetheless, these approaches exhibit limitations when applied
to the more specialized and intricate domain of circuit diagrams. This paper
presents a novel approach to circuit diagram retrieval by employing a graph
representation of circuit diagrams, effectively reformulating the retrieval
task as a graph retrieval problem. The proposed methodology consists of two
principal components: a circuit diagram recognition algorithm designed to
extract the circuit components and topological structure of the circuit using
the proposed GAM-YOLO model and a 2-step connected domain filtering algorithm, and
a hierarchical retrieval strategy based on graph similarity and different graph
representation methods for analog circuits. Our methodology pioneers the
utilization of graph representation in the retrieval of circuit diagrams,
incorporating topological features that are commonly overlooked by standard
image retrieval methods. The results of our experiments substantiate the
efficacy of our approach in retrieving circuit diagrams of different
types.
Authors' comments: 11 pages, 10 figures, 7 tables, under review paper
Baoyao Yang, Junxiang Chen, Wanyun Li, Wenbin Yao, Yang Zhou
Video-text retrieval has been hampered by the information mismatch caused by personalized and inadequate textual descriptions of videos. The substantial information gap between the two modalities hinders an effective cross-modal representation alignment, resulting in ambiguous retrieval results. Although text rewriting methods have been proposed to broaden text expressions, the modality gap remains significant, as the text representation space is hardly expanded with insufficient semantic enrichment. Instead, this paper turns to enhancing visual presentation, bridging video expression closer to textual representation via caption generation and thereby facilitating video-text matching. While multimodal large language models (mLLMs) have shown a powerful capability to convert video content into text, carefully crafted prompts are essential to ensure the reasonableness and completeness of the generated captions. Therefore, this paper proposes an automatic caption enhancement method that improves expression quality and mitigates empiricism in augmented captions through self-learning. Additionally, an expertized caption selection mechanism is designed and introduced to customize augmented captions for each video, further exploring the utilization potential of caption augmentation. Our method is entirely data-driven, which not only dispenses with heavy data collection and computation workload but also improves self-adaptability by circumventing lexicon dependence and introducing personalized matching. The superiority of our method is validated by state-of-the-art results on various benchmarks, specifically achieving Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo. Our code is publicly available at https://github.com/CaryXiang/ECA4VTR.
Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in enhancing Large Language Models (LLMs) through external knowledge integration, yet its application has primarily focused on textual content, leaving the rich domain of multi-modal video knowledge predominantly unexplored. This paper introduces VideoRAG, the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Our core innovation lies in its dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features. This novel design empowers VideoRAG to process unlimited-length videos by constructing precise knowledge graphs that span multiple videos while maintaining semantic dependencies through specialized multi-modal retrieval paradigms. Through comprehensive empirical evaluation on our proposed LongerVideos benchmark, which comprises over 160 videos totaling 134+ hours across lecture, documentary, and entertainment categories, VideoRAG demonstrates substantial performance improvements compared to existing RAG alternatives and long video understanding methods. The source code of VideoRAG implementation and the benchmark dataset are openly available at: https://github.com/HKUDS/VideoRAG.