Wenchao Gu, Zongyi Lyu, Yanlin Wang, Hongyu Zhang, Cuiyun Gao, Michael R. Lyu
Code retrieval aims to provide users with desired code snippets based on users' natural language queries. With the development of deep learning technologies, adopting pre-trained models for this task has become mainstream. Considering the retrieval efficiency, most of the previous approaches adopt a dual-encoder for this task, which encodes the description and code snippet into representation vectors, respectively. However, the model structure of the dual-encoder tends to limit the model's performance, since it lacks the interaction between the code snippet and description at the bottom layer of the model during training. To improve the model's effectiveness while preserving its efficiency, we propose a framework, which adopts Self-AdaPtive Model Distillation for Efficient CodE Retrieval, named SPENCER. SPENCER first adopts the dual-encoder to narrow the search space and then adopts the cross-encoder to improve accuracy. To improve the efficiency of SPENCER, we propose a novel model distillation technique, which can greatly reduce the inference time of the dual-encoder while maintaining the overall performance. We also propose a teaching assistant selection strategy for our model distillation, which can adaptively select the suitable teaching assistant models for different pre-trained models during the model distillation to ensure the model performance. Extensive experiments demonstrate that the combination of dual-encoder and cross-encoder improves overall performance compared to solely dual-encoder-based models for code retrieval. Besides, our model distillation technique retains over 98% of the overall performance while reducing the inference time of the dual-encoder by 70%.
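The two-stage idea in the abstract above (a dual-encoder narrows the search space, then a cross-encoder rescores the shortlist) can be sketched as follows. This is a minimal illustration, not SPENCER's implementation: `embed`, `cross_score`, and `retrieve_then_rerank` are hypothetical names, and both scoring functions are toy stand-ins for neural encoders.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for one dual-encoder tower: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cross_score(query, code):
    """Toy stand-in for a cross-encoder: looks at query and code together.

    Returns the fraction of query tokens that occur as substrings of the code.
    """
    tokens = re.findall(r"[a-z]+", query.lower())
    return sum(1 for t in tokens if t in code.lower()) / len(tokens)


def retrieve_then_rerank(query, corpus, code_vectors, k=2):
    # Stage 1: cheap recall using vectors that were encoded independently.
    q_vec = embed(query)
    order = sorted(range(len(corpus)),
                   key=lambda i: cosine(q_vec, code_vectors[i]),
                   reverse=True)
    shortlist = order[:k]
    # Stage 2: costlier joint scoring, applied to the shortlist only.
    return sorted(shortlist,
                  key=lambda i: cross_score(query, corpus[i]),
                  reverse=True)


corpus = [
    "def read_file(path): return open(path).read()",
    "def sort_list(items): return sorted(items)",
    "def reverse_list(items): return items[::-1]",
]
code_vectors = [embed(c) for c in corpus]  # computed once, offline
ranking = retrieve_then_rerank("sort a list", corpus, code_vectors, k=2)
print(ranking)
```

Because the code vectors are precomputed, only the query is encoded at search time, and the pairwise scorer runs on `k` candidates rather than on the whole corpus.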
Moumita Asad, Rafed Muhammad Yasir, Armin Geramirad, Sam Malek
Information Retrieval-based Bug Localization aims to identify buggy source files for a given bug report. While existing approaches -- ranging from vector space models to deep learning models -- have shown potential in this domain, their effectiveness is often limited by the vocabulary mismatch between bug reports and source code. To address this issue, we propose a novel Large Language Model (LLM) based bug localization approach, called GenLoc. Given a bug report, GenLoc leverages an LLM equipped with code-exploration functions to iteratively analyze the code base and identify potential buggy files. To gather better context, GenLoc may optionally retrieve semantically relevant files using vector embeddings. GenLoc has been evaluated on over 9,000 real-world bug reports from six large-scale Java projects. Experimental results show that GenLoc outperforms five state-of-the-art bug localization techniques across multiple metrics, achieving an average improvement of more than 60% in Accuracy@1.
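The abstract above reports Accuracy@1 without defining it. A minimal sketch, assuming the definition commonly used in bug localization (a bug report counts as a hit when at least one truly buggy file appears among the top k ranked files); the function name and the sample data are illustrative and not taken from the paper.

```python
def accuracy_at_k(ranked_files, buggy_files, k=1):
    """Fraction of bug reports with at least one buggy file in the top k."""
    hits = sum(
        1
        for ranking, truth in zip(ranked_files, buggy_files)
        if any(f in truth for f in ranking[:k])
    )
    return hits / len(ranked_files)


# One ranked list and one ground-truth set per bug report (made-up data).
ranked = [
    ["A.java", "B.java", "C.java"],
    ["D.java", "E.java", "F.java"],
    ["G.java", "H.java", "I.java"],
    ["J.java", "K.java", "L.java"],
]
truth = [{"A.java"}, {"E.java"}, {"I.java", "G.java"}, {"Z.java"}]

acc1 = accuracy_at_k(ranked, truth, k=1)
acc2 = accuracy_at_k(ranked, truth, k=2)
print(acc1, acc2)
```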
Salah Eddine Bekhouche, Azeddine Benlamoudi, Yazid Bounab, Fadi Dornaika, Abdenour Hadid
Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is made publicly available at https://github.com/Bekhouche/APR.
Daeyong Kwon, SeungHeon Doh, Juhan Nam
Recent advancements in Large language models (LLMs) have demonstrated
remarkable capabilities across diverse domains. While they exhibit strong
zero-shot performance on various tasks, LLMs' effectiveness in music-related
applications remains limited due to the relatively small proportion of
music-specific knowledge in their training data. To address this limitation, we
propose MusT-RAG, a comprehensive framework based on Retrieval Augmented
Generation (RAG) to adapt general-purpose LLMs for text-only music question
answering (MQA) tasks. RAG is a technique that provides external knowledge to
LLMs by retrieving relevant context information when generating answers to
questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a
music-specialized vector database for the retrieval stage, and (2) utilize
context information during both inference and fine-tuning processes to
effectively transform general-purpose LLMs into music-specific models. Our
experiment demonstrates that MusT-RAG significantly outperforms traditional
fine-tuning approaches in enhancing LLMs' music domain adaptation capabilities,
showing consistent improvements across both in-domain and out-of-domain MQA
benchmarks. Additionally, our MusWikiDB proves substantially more effective
than general Wikipedia corpora, delivering superior performance and
computational efficiency.
Authors' comments: 8 pages, 2 figures
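The RAG loop described in the MusT-RAG abstract (retrieve relevant context, then hand it to the LLM with the question) can be sketched as follows. This is a minimal illustration under stated assumptions: token overlap stands in for the vector-database lookup, the passages are made-up examples rather than MusWikiDB entries, and `retrieve` and `build_prompt` are hypothetical helper names. No LLM is called; the sketch stops at the prompt.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap(query, passage):
    """Toy relevance score: number of shared tokens (stand-in for vector similarity)."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum(min(q[t], p[t]) for t in q)


def retrieve(query, passages, k=1):
    """Return the k passages that score highest against the query."""
    return sorted(passages, key=lambda p: overlap(query, p), reverse=True)[:k]


def build_prompt(question, contexts):
    """Prepend the retrieved context to the question."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"


passages = [
    "A fugue is a contrapuntal composition built on a subject that recurs in several voices.",
    "The sitar is a plucked string instrument used in Hindustani classical music.",
    "A sonata usually consists of several movements with contrasting tempos.",
]
question = "Which instrument is used in Hindustani classical music?"
prompt = build_prompt(question, retrieve(question, passages, k=1))
print(prompt)
```

The same prompt format could be used at inference time or to build fine-tuning examples, which is the dual use of context the abstract mentions.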
Roxana Petcu, Samarth Bhargav, Maarten de Rijke, Evangelos Kanoulas
Understanding and solving complex reasoning tasks is vital for addressing the information needs of a user. Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. To understand this phenomenon, we study negation in both traditional neural information retrieval and LLM-based models. We (1) introduce a taxonomy of negation that derives from philosophical, linguistic, and logical definitions; (2) generate two benchmark datasets that can be used to evaluate the performance of neural information retrieval models and to fine-tune models for a more robust performance on negation; and (3) propose a logic-based classification mechanism that can be used to analyze the performance of retrieval models on existing datasets. Our taxonomy produces a balanced data distribution over negation types, providing a better training setup that leads to faster convergence on the NevIR dataset. Moreover, we propose a classification schema that reveals the coverage of negation types in existing datasets, offering insights into the factors that might affect the generalization of fine-tuned models on negation.
Aryan Raj, Astitva Veer Garg, Anitha D
Retrieval-Augmented Language Models (RALMs) face significant challenges in
reducing factual errors, particularly in document relevance evaluation and
knowledge integration. We introduce a framework for structured relevance
assessment that enhances RALM robustness through improved document evaluation,
balanced intrinsic and external knowledge integration, and effective handling
of unanswerable queries. Our approach employs a multi-dimensional scoring
system that considers both semantic matching and source reliability, utilizing
embedding-based relevance scoring and synthetic training data with
mixed-quality documents. We implement specialized benchmarking on niche topics,
a knowledge integration mechanism, and an "unknown" response protocol for
queries with insufficient knowledge coverage. Preliminary evaluations
demonstrate significant reductions in hallucination rates and improved
transparency in reasoning processes. Our framework advances the development of
more reliable question-answering systems capable of operating effectively in
dynamic environments with variable data quality. While challenges persist in
accurately distinguishing credible information and balancing system latency
with thoroughness, this work represents a meaningful step toward enhancing RALM
reliability.
Authors' comments: International Conference on ICT for Sustainable Development (ICT4SD)
George Ibrahim, Rita Ramos, Yova Kementchedjhieva
Multilingual vision-language models have made significant strides in image
captioning, yet they still lag behind their English counterparts due to limited
multilingual training data and costly large-scale model parameterization.
Retrieval-augmented generation (RAG) offers a promising alternative by
conditioning caption generation on retrieved examples in the target language,
reducing the need for extensive multilingual training. However, multilingual
RAG captioning models often depend on retrieved captions translated from
English, which can introduce mismatches and linguistic biases relative to the
source language. We introduce CONCAP, a multilingual image captioning model
that integrates retrieved captions with image-specific concepts, enhancing the
contextualization of the input image and grounding the captioning process
across different languages. Experiments on the XM3600 dataset indicate that
CONCAP enables strong performance on low- and mid-resource languages, with
highly reduced data requirements. Our findings highlight the effectiveness of
concept-aware retrieval augmentation in bridging multilingual performance gaps.
Authors' comments: Published as a conference paper at COLM 2025
Shreya Meel, Sennur Ulukus
We introduce the problem of symmetric private information retrieval (SPIR) on replicated databases modeled by a simple graph. In this model, each vertex corresponds to a server, and a message is replicated on two servers if and only if there is an edge between them. We consider the setting where the server-side common randomness necessary to accomplish SPIR is also replicated at the servers according to the graph, and we call this message-specific common randomness. In this setting, we establish a lower bound on the SPIR capacity, i.e., the maximum download rate, for general graphs, by proposing an achievable SPIR scheme. Next, we prove that, for any SPIR scheme to be feasible, the minimum size of message-specific randomness should be equal to the size of a message. Finally, by providing matching upper bounds, we derive the exact SPIR capacity for the class of path and regular graphs.
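The graph-based storage model in the abstract above (servers are vertices, each message corresponds to an edge and is stored on its two endpoints) can be sketched as a small data structure. This only illustrates the replication pattern, not the retrieval scheme; `storage_from_graph` and the integer message identifiers are assumptions made for the example.

```python
from collections import defaultdict


def storage_from_graph(edges):
    """Map each server (vertex) to the set of messages (edge indices) it stores."""
    storage = defaultdict(set)
    for message_id, (u, v) in enumerate(edges):
        storage[u].add(message_id)  # a message lives on both endpoints
        storage[v].add(message_id)
    return dict(storage)


# Path graph on 4 servers: 0 - 1 - 2 - 3, hence 3 messages.
path_graph = [(0, 1), (1, 2), (2, 3)]
storage = storage_from_graph(path_graph)
print(storage)

# Every message is replicated on exactly two servers.
replication = {
    m: sum(1 for held in storage.values() if m in held)
    for m in range(len(path_graph))
}
```

Under message-specific common randomness as described in the abstract, the randomness tied to a message follows the same placement, so the same map describes where it is stored.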
Bo Xiong, Linghao Zhang, Chong Wang, Peng Liang
A commit message describes the main code changes in a commit and plays a
crucial role in software maintenance. Existing commit message generation (CMG)
approaches typically frame it as a direct mapping which inputs a code diff and
produces a brief descriptive sentence as output. However, we argue that relying
solely on the code diff is insufficient, as raw code diff fails to capture the
full context needed for generating high-quality and informative commit
messages. In this paper, we propose a contextual code retrieval-based method
called C3Gen to enhance CMG by retrieving commit-relevant code snippets from
the repository and incorporating them into the model input to provide richer
contextual information at the repository scope. In the experiments, we
evaluated the effectiveness of C3Gen across various models using four objective
and three subjective metrics. Meanwhile, we design and conduct a human
evaluation to investigate how C3Gen-generated commit messages are perceived by
human developers. The results show that by incorporating contextual code into
the input, C3Gen enables models to effectively leverage additional information
to generate more comprehensive and informative commit messages with greater
practical value in real-world development scenarios. Further analysis
underscores concerns about the reliability of similarity-based metrics and
provides empirical insights for CMG.
Authors' comments: The 19th ACM/IEEE International Symposium on Empirical Software
Engineering and Measurement (ESEM)
Li Jun, Wang Jinpeng, Tan Chaolei, Lian Niu, Chen Long, Zhang Min, Wang Yaowei, Xia Shu-Tao et al.
Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of
matching untrimmed videos with text queries describing only partial content.
Existing methods suffer from geometric distortion in Euclidean space that
sometimes misrepresents the intrinsic hierarchical structure of videos and
overlooks certain hierarchical semantics, ultimately leading to suboptimal
temporal modeling. To address this issue, we propose the first hyperbolic
modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space
learning to compensate for the suboptimal hierarchical modeling capabilities of
Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block
and Euclidean Attention Block to encode video embeddings in hybrid spaces,
using the Mean-Guided Adaptive Interaction Module to dynamically fuse features.
Additionally, we introduce a Partial Order Preservation Loss to enforce "text <
video" hierarchy through Lorentzian cone constraints. This approach further
enhances cross-modal matching by reinforcing partial relevance between video
content and text queries. Extensive experiments show that HLFormer outperforms
state-of-the-art methods. Code is released at
https://github.com/lijun2005/ICCV25-HLFormer.
Authors' comments: Accepted by ICCV'25. 13 pages, 6 figures, 4 tables
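For readers unfamiliar with the Lorentz (hyperboloid) model that the HLFormer abstract refers to, the following minimal sketch computes the standard Lorentzian inner product and geodesic distance. It assumes unit negative curvature and is not taken from the HLFormer code; `lift`, `lorentz_inner`, and `lorentz_distance` are illustrative names.

```python
import math


def lift(spatial):
    """Lift a Euclidean vector onto the hyperboloid <x, x>_L = -1 with x0 > 0."""
    x0 = math.sqrt(1.0 + sum(v * v for v in spatial))
    return [x0, *spatial]


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    # Clamp to 1.0 because rounding can push the argument slightly below it.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift([0.0, 0.0])
point = lift([math.sinh(1.0), 0.0])  # lies at hyperbolic distance 1 from the origin
d = lorentz_distance(origin, point)
print(d)
```

Distances in this space grow much faster than Euclidean ones as points move away from the origin, which is why hyperbolic embeddings are often used for tree-like, hierarchical data.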
Mohamed Nomeir, Alptug Aytekin, Sennur Ulukus
We study the problem of semantic private information retrieval (Sem-PIR) with $T$ colluding servers (Sem-TPIR), i.e., servers that collectively share user queries. In Sem-TPIR, the message sizes are different, and message retrieval probabilities by any user are not uniform. This is a generalization of the classical PIR problem where the message sizes are equal and message retrieval probabilities are identical. The earlier work on Sem-PIR considered the case of no collusions, i.e., the collusion parameter of $T=1$. In this paper, we consider the general problem for arbitrary $T < N$. We find an upper bound on the retrieval rate and design a scheme that achieves this rate, i.e., we derive the exact capacity of Sem-TPIR.
Haomin Qi, Yuyang Du, Lihao Zhang, Soung Chang Liew, Kexin Chen, Yining Du
Large language models (LLMs) have demonstrated immense potential in
computer-aided design (CAD), particularly for automated debugging and
verification within electronic design automation (EDA) tools. However, Design
for Testability (DFT) remains a relatively underexplored area. This paper
presents VeriRAG, the first LLM-assisted DFT-EDA framework. VeriRAG leverages a
Retrieval-Augmented Generation (RAG) approach to enable an LLM to revise code to
ensure DFT compliance. VeriRAG integrates (1) an autoencoder-based similarity
measurement model for precise retrieval of reference RTL designs for the LLM,
and (2) an iterative code revision pipeline that allows the LLM to ensure DFT
compliance while maintaining synthesizability. To support VeriRAG, we introduce
VeriDFT, a Verilog-based DFT dataset curated for DFT-aware RTL repairs. VeriRAG
retrieves structurally similar RTL designs from VeriDFT, each paired with a
rigorously validated correction, as references for code repair. With VeriRAG
and VeriDFT, we achieve fully automated DFT correction -- resulting in a
7.72-fold improvement in successful repair rate compared to the zero-shot
baseline (Fig. 5 in Section V). Ablation studies further confirm the
contribution of each component of the VeriRAG framework. We open-source our
data, models, and scripts at https://github.com/yuyangdu01/LLM4DFT.
Authors' comments: 8 pages, 5 figures
Deyu Zhang, Tingting Long, Jinrui Zhang, Ligeng Chen, Ju Ren, Yaoxue Zhang
Enabling efficient text-video retrieval on edge-end devices is critical for real-world applications. Yet, existing methods face a critical challenge in balancing accuracy and computational efficiency: uniform frame sampling methods ensure content coverage but incur prohibitive computational costs, while salient-frame sampling methods reduce overhead but suffer from query-agnostic frame selection that biases retrieval results. To address this, we propose ProCLIP, a user-centric framework that achieves state-of-the-art accuracy with significantly improved efficiency. We design a prompt-aware frame sampling strategy that dynamically guides lightweight feature extractors using textual prompts to select semantically relevant frames, overcoming the limitations of existing salient-frame sampling methods which rely on static, query-agnostic selection criteria. Moreover, we adopt a two-stage candidate pruning strategy that combines rapid coarse filtering via a lightweight module with CLIP-powered fine-grained re-ranking, enhancing retrieval efficiency while preserving accuracy. Experiments across benchmarks show ProCLIP achieves 75.3% latency reduction versus baselines while maintaining competitive accuracy, i.e., R@1=49.0 on the MSR-VTT dataset. Code is available at https://github.com/tiffylong/ProCLIP.
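The contrast the abstract draws between query-agnostic and prompt-aware frame selection can be sketched as follows. This is a toy illustration, not ProCLIP's method: the 2-D frame features, the cosine scoring rule, and both function names are assumptions made for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def uniform_sample(num_frames, k):
    """Query-agnostic baseline: k evenly spaced frame indices."""
    step = num_frames / k
    return [int(i * step) for i in range(k)]


def prompt_aware_sample(frame_feats, text_feat, k):
    """Keep the k frames most similar to the text query, in temporal order."""
    by_score = sorted(range(len(frame_feats)),
                      key=lambda i: cosine(frame_feats[i], text_feat),
                      reverse=True)
    return sorted(by_score[:k])


# Made-up 2-D features for six frames; only frames 2 and 3 match the query.
frame_feats = [[0.0, 1.0], [0.1, 1.0], [1.0, 0.1], [1.0, 0.0], [0.2, 1.0], [0.0, 1.0]]
text_feat = [1.0, 0.0]

uniform = uniform_sample(len(frame_feats), k=2)
selected = prompt_aware_sample(frame_feats, text_feat, k=2)
print(uniform, selected)
```

In this toy case the uniform sampler keeps one irrelevant frame, while the query-conditioned selection keeps both relevant ones for the same frame budget.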
Lu Guo, Yixiang Shan, Zhengbang Zhu, Qifan Liang, Lichang Song, Ting Long, Weinan Zhang, Yi Chang
Offline reinforcement learning (RL) enables agents to learn policies from fixed datasets, avoiding costly or unsafe environment interactions. However, its effectiveness is often limited by dataset sparsity and the lack of transition overlap between suboptimal and expert trajectories, which makes long-horizon planning particularly challenging. Prior solutions based on synthetic data augmentation or trajectory stitching often fail to generalize to novel states and rely on heuristic stitching points. To address these challenges, we propose Retrieval High-quAlity Demonstrations (RAD) for decision-making, which combines non-parametric retrieval with diffusion-based generative modeling. RAD dynamically retrieves high-return states from the offline dataset as target states based on state similarity and return estimation, and plans toward them using a condition-guided diffusion model. Such retrieval-guided generation enables flexible trajectory stitching and improves generalization when encountering underrepresented or out-of-distribution states. Extensive experiments confirm that RAD achieves competitive or superior performance compared to baselines across diverse benchmarks, validating its effectiveness.
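The retrieval step described in the abstract above (choose a target state using both state similarity and return estimation) can be sketched as follows. The abstract does not say how the two criteria are combined, so the rule used here (restrict to the nearest neighbours, then take the highest estimated return) is an assumption, as are the function name and the data. The diffusion planner is out of scope.

```python
import math


def retrieve_target_state(current, dataset, num_neighbors=3):
    """Pick a target state from (state, estimated_return) pairs.

    Assumed rule: keep the num_neighbors states closest to the current
    state, then return the one with the highest estimated return.
    """
    nearest = sorted(dataset, key=lambda item: math.dist(current, item[0]))
    nearest = nearest[:num_neighbors]
    return max(nearest, key=lambda item: item[1])[0]


# Made-up offline dataset: 2-D states paired with return estimates.
dataset = [
    ((0.0, 0.0), 1.0),
    ((0.1, 0.0), 5.0),
    ((0.0, 0.2), 3.0),
    ((5.0, 5.0), 100.0),  # high return, but far from the current state
]
target = retrieve_target_state((0.0, 0.1), dataset)
print(target)
```

Limiting the search to nearby states keeps the target reachable from the current state, while the return criterion steers planning toward high-quality demonstrations.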
Hongye Hou, Liu Zhan, Yang Yang
Completing the whole 3D structure based on an incomplete point cloud is a challenging task, particularly when the residual point cloud lacks typical structural characteristics. Recent methods based on cross-modal learning attempt to introduce instance images to aid the structure feature learning. However, they still focus on each particular input class, limiting their generation abilities. In this work, we propose a novel retrieval-augmented point cloud completion framework. The core idea is to incorporate cross-modal retrieval into the completion task to learn structural prior information from similar reference samples. Specifically, we design a Structural Shared Feature Encoder (SSFE) to jointly extract cross-modal features and reconstruct reference features as priors. Benefiting from a dual-channel control gate in the encoder, relevant structural features in the reference sample are enhanced and irrelevant information interference is suppressed. In addition, we propose a Progressive Retrieval-Augmented Generator (PRAG) that employs a hierarchical feature fusion mechanism to integrate reference prior information with input features from global to local. Through extensive evaluations on multiple datasets and real-world scenes, our method shows its effectiveness in generating fine-grained point clouds, as well as its generalization capability in handling sparse data and unseen categories.
Huayuan Ye, Juntong Chen, Shenzhuo Zhang, Yipeng Zhang, Changbo Wang, Chenhui Li
The dissemination of visualizations is primarily in the form of raster
images, which often results in the loss of critical information such as source
code, interactive features, and metadata. While previous methods have proposed
embedding metadata into images to facilitate Visualization Image Data Retrieval
(VIDR), most existing methods lack practicability since they are fragile to
common image tampering during online distribution such as cropping and editing.
To address this issue, we propose VisGuard, a tamper-resistant VIDR framework
that reliably embeds a metadata link into visualization images. The embedded data
link remains recoverable even after substantial tampering with the images. We
propose several techniques to enhance robustness, including repetitive data
tiling, invertible information broadcasting, and an anchor-based scheme for
crop localization. VisGuard enables various applications, including interactive
chart reconstruction, tampering detection, and copyright protection. We conduct
comprehensive experiments on VisGuard's superior performance in data retrieval
accuracy, embedding capacity, and security against tampering and steganalysis,
demonstrating VisGuard's competence in facilitating and safeguarding
visualization dissemination and information conveyance.
Authors' comments: 9 pages, IEEE VIS 2025
Karan Mirhosseini, Arya Aftab, Alireza Sheikh
In an era of radical technology transformations, technology maps play a crucial role in enhancing decision making. These maps heavily rely on automated methods of technology extraction. This paper introduces Retrieval Augmented Technology Extraction (RATE), a Large Language Model (LLM) based pipeline for automated technology extraction from scientific literature. RATE combines Retrieval Augmented Generation (RAG) with multi-definition LLM-based validation. This hybrid method yields high recall in candidate generation alongside high precision in candidate filtering. While the pipeline is designed to be general and widely applicable, we demonstrate its use on 678 research articles focused on Brain-Computer Interfaces (BCIs) and Extended Reality (XR) as a case study. The technology terms validated by RATE were mapped into a co-occurrence network, revealing thematic clusters and structural features of the research landscape. For evaluation, a gold standard dataset of technologies in 70 randomly selected articles was curated by experts. In addition, a technology extraction model based on Bidirectional Encoder Representations from Transformers (BERT) was used as a comparative method. RATE achieved an F1-score of 91.27%, significantly outperforming BERT with an F1-score of 53.73%. Our findings highlight the promise of definition-driven LLM methods for technology extraction and mapping. They also offer new insights into emerging trends within the BCI-XR field. The source code is available at https://github.com/AryaAftab/RATE
Authors' comments: 9 pages, 4 figures, 1 table
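The mapping of validated technology terms into a co-occurrence network, as mentioned in the RATE abstract, can be sketched as follows. This is a minimal illustration that assumes article-level co-occurrence (two terms are linked when they appear in the same article); the function name and the sample terms are made up and not taken from the paper's data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(terms_per_article):
    """Count, for each unordered term pair, the articles that mention both."""
    edges = Counter()
    for terms in terms_per_article:
        # Deduplicate within an article and sort so each pair has one key.
        for a, b in combinations(sorted(set(terms)), 2):
            edges[(a, b)] += 1
    return edges


# Illustrative extracted terms, one list per article.
articles = [
    ["EEG", "virtual reality", "eye tracking"],
    ["EEG", "virtual reality"],
    ["EEG", "fNIRS"],
]
edges = cooccurrence_edges(articles)
for pair, weight in edges.most_common():
    print(pair, weight)
```

The resulting weighted edge list can be loaded into any graph tool to look for thematic clusters.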
Lam Nguyen, Erika Barcelos, Roger French, Yinghui Wu
Ontology Matching (OM) is a cornerstone task of semantic interoperability,
yet existing systems often rely on handcrafted rules or specialized models with
limited adaptability. We present KROMA, a novel OM framework that harnesses
Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG)
pipeline to dynamically enrich the semantic context of OM tasks with
structural, lexical, and definitional knowledge. To optimize both performance
and efficiency, KROMA integrates a bisimilarity-based concept matching and a
lightweight ontology refinement step, which prune candidate concepts and
substantially reduce the communication overhead from invoking LLMs. Through
experiments on multiple benchmark datasets, we show that integrating knowledge
retrieval with context-augmented LLMs significantly enhances ontology matching,
outperforming both classic OM systems and cutting-edge LLM-based approaches
while keeping communication overhead comparable. Our study highlights the
feasibility and benefit of the proposed optimization techniques (targeted
knowledge retrieval, prompt enrichment, and ontology refinement) for ontology
matching at scale.
Authors' comments: Accepted to the 24th International Semantic Web Conference Research
Track (ISWC 2025)
Shad Nygren, Pinar Avci, Andre Daniels, Reza Rassol, Afshin Beheshti, Diego Galeano
Drug side effects are a major global health concern, necessitating advanced methods for their accurate detection and analysis. While Large Language Models (LLMs) offer promising conversational interfaces, their inherent limitations, including reliance on black-box training data, susceptibility to hallucinations, and lack of domain-specific knowledge, hinder their reliability in specialized fields like pharmacovigilance. To address this gap, we propose two architectures: Retrieval-Augmented Generation (RAG) and GraphRAG, which integrate comprehensive drug side effect knowledge into a Llama 3 8B language model. Through extensive evaluations on 19,520 drug side effect associations (covering 976 drugs and 3,851 side effect terms), our results demonstrate that GraphRAG achieves near-perfect accuracy in drug side effect retrieval. This framework offers a highly accurate and scalable solution, marking a significant advance in leveraging LLMs for critical pharmacovigilance applications.
Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos
Current automated fact-checking (AFC) approaches typically evaluate evidence either implicitly via the predicted verdicts or through exact matches with predefined closed knowledge sources, such as Wikipedia. However, these methods are limited due to their reliance on evaluation metrics originally designed for other purposes and constraints from closed knowledge sources. In this work, we introduce Ev²R, which combines the strengths of reference-based evaluation and verdict-level proxy scoring. Ev²R jointly assesses how well the evidence aligns with the gold references and how reliably it supports the verdict, addressing the shortcomings of prior methods. We evaluate Ev²R against three types of evidence evaluation approaches: reference-based, proxy-reference, and reference-less baselines. Assessments against human ratings and adversarial tests demonstrate that Ev²R consistently outperforms existing scoring approaches in accuracy and robustness. It achieves stronger correlation with human judgments and greater robustness to adversarial perturbations, establishing it as a reliable metric for evidence evaluation in AFC. Code is available at https://github.com/mubasharaak/fc-evidence-evaluation.
Authors' comments: Accepted at TACL