Bo Xiong, Linghao Zhang, Chong Wang, Peng Liang
A commit message describes the main code changes in a commit and plays a
crucial role in software maintenance. Existing commit message generation (CMG)
approaches typically frame it as a direct mapping which inputs a code diff and
produces a brief descriptive sentence as output. However, we argue that relying
solely on the code diff is insufficient, as raw code diff fails to capture the
full context needed for generating high-quality and informative commit
messages. In this paper, we propose a contextual code retrieval-based method
called C3Gen to enhance CMG by retrieving commit-relevant code snippets from
the repository and incorporating them into the model input to provide richer
contextual information at the repository scope. In the experiments, we
evaluated the effectiveness of C3Gen across various models using four objective
and three subjective metrics. Meanwhile, we design and conduct a human
evaluation to investigate how C3Gen-generated commit messages are perceived by
human developers. The results show that by incorporating contextual code into
the input, C3Gen enables models to effectively leverage additional information
to generate more comprehensive and informative commit messages with greater
practical value in real-world development scenarios. Further analysis
underscores concerns about the reliability of similaritybased metrics and
provides empirical insights for CMG.
Authors' comments: The 19th ACM/IEEE International Symposium on Empirical Software
Engineering and Measurement (ESEM)
Li Jun, Wang Jinpeng, Tan Chaolei, Lian Niu, Chen Long, Zhang Min, Wang Yaowei, Xia Shu-Tao et al.
Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of
matching untrimmed videos with text queries describing only partial content.
Existing methods suffer from geometric distortion in Euclidean space that
sometimes misrepresents the intrinsic hierarchical structure of videos and
overlooks certain hierarchical semantics, ultimately leading to suboptimal
temporal modeling. To address this issue, we propose the first hyperbolic
modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space
learning to compensate for the suboptimal hierarchical modeling capabilities of
Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block
and Euclidean Attention Block to encode video embeddings in hybrid spaces,
using the Mean-Guided Adaptive Interaction Module to dynamically fuse features.
Additionally, we introduce a Partial Order Preservation Loss to enforce "text <
video" hierarchy through Lorentzian cone constraints. This approach further
enhances cross-modal matching by reinforcing partial relevance between video
content and text queries. Extensive experiments show that HLFormer outperforms
state-of-the-art methods. Code is released at
https://github.com/lijun2005/ICCV25-HLFormer.
Authors' comments: Accepted by ICCV'25. 13 pages, 6 figures, 4 tables
Mohamed Nomeir, Alptug Aytekin, Sennur Ulukus
We study the problem of semantic private information retrieval (Sem-PIR) with $T$ colluding servers (Sem-TPIR), i.e., servers that collectively share user queries. In Sem-TPIR, the message sizes are different, and message retrieval probabilities by any user are not uniform. This is a generalization of the classical PIR problem where the message sizes are equal and message retrieval probabilities are identical. The earlier work on Sem-PIR considered the case of no collusions, i.e., the collusion parameter of $T=1$. In this paper, we consider the general problem for arbitrary $T < N$. We find an upper bound on the retrieval rate and design a scheme that achieves this rate, i.e., we derive the exact capacity of Sem-TPIR.
Haomin Qi, Yuyang Du, Lihao Zhang, Soung Chang Liew, Kexin Chen, Yining Du
Large language models (LLMs) have demonstrated immense potential in
computer-aided design (CAD), particularly for automated debugging and
verification within electronic design automation (EDA) tools. However, Design
for Testability (DFT) remains a relatively underexplored area. This paper
presents VeriRAG, the first LLM-assisted DFT-EDA framework. VeriRAG leverages a
Retrieval-Augmented Generation (RAG) approach to enable LLM to revise code to
ensure DFT compliance. VeriRAG integrates (1) an autoencoder-based similarity
measurement model for precise retrieval of reference RTL designs for the LLM,
and (2) an iterative code revision pipeline that allows the LLM to ensure DFT
compliance while maintaining synthesizability. To support VeriRAG, we introduce
VeriDFT, a Verilog-based DFT dataset curated for DFT-aware RTL repairs. VeriRAG
retrieves structurally similar RTL designs from VeriDFT, each paired with a
rigorously validated correction, as references for code repair. With VeriRAG
and VeriDFT, we achieve fully automated DFT correction -- resulting in a
7.72-fold improvement in successful repair rate compared to the zero-shot
baseline (Fig. 5 in Section V). Ablation studies further confirm the
contribution of each component of the VeriRAG framework. We open-source our
data, models, and scripts at https://github.com/yuyangdu01/LLM4DFT.
Authors' comments: 8 pages, 5 figures
Deyu Zhang, Tingting Long, Jinrui Zhang, Ligeng Chen, Ju Ren, Yaoxue Zhang
Enabling efficient text-video retrieval on edge-end devices is critical for real-world applications. Yet, existing methods face a critical challenge in balancing accuracy and computational efficiency: uniform frame sampling methods ensure content coverage but incur prohibitive computational costs, while salient-frame sampling methods reduce overhead but suffer from query-agnostic frame selection that biases retrieval results. To address this, we propose ProCLIP, a user-centric framework that achieves state-of-the-art accuracy with significantly improved efficiency. We design a prompt-aware frame sampling strategy that dynamically guides lightweight feature extractors using textual prompts to select semantically relevant frames, overcoming the limitations of existing salient-frame sampling methods which rely on static, query-agnostic selection criteria. Moreover, we adopt a two-stage candidate pruning strategy that combines rapid coarse filtering via a lightweight module with CLIP-powered fine-grained re-ranking, enhancing retrieval efficiency while preserving accuracy. Experiments across benchmarks show ProCLIP achieves 75.3% latency reduction versus baselines while maintaining competitive accuracy, i.e., R@1=49.0 in MSR-VTT dataset. Code is available at https://github.com/tiffylong/ProCLIP.
Lu Guo, Yixiang Shan, Zhengbang Zhu, Qifan Liang, Lichang Song, Ting Long, Weinan Zhang, Yi Chang
Offline reinforcement learning (RL) enables agents to learn policies from fixed datasets, avoiding costly or unsafe environment interactions. However, its effectiveness is often limited by dataset sparsity and the lack of transition overlap between suboptimal and expert trajectories, which makes long-horizon planning particularly challenging. Prior solutions based on synthetic data augmentation or trajectory stitching often fail to generalize to novel states and rely on heuristic stitching points. To address these challenges, we propose Retrieval High-quAlity Demonstrations (RAD) for decision-making, which combines non-parametric retrieval with diffusion-based generative modeling. RAD dynamically retrieves high-return states from the offline dataset as target states based on state similarity and return estimation, and plans toward them using a condition-guided diffusion model. Such retrieval-guided generation enables flexible trajectory stitching and improves generalization when encountered with underrepresented or out-of-distribution states. Extensive experiments confirm that RAD achieves competitive or superior performance compared to baselines across diverse benchmarks, validating its effectiveness.
Hongye Hou, Liu Zhan, Yang Yang
Completing the whole 3D structure based on an incomplete point cloud is a challenging task, particularly when the residual point cloud lacks typical structural characteristics. Recent methods based on cross-modal learning attempt to introduce instance images to aid the structure feature learning. However, they still focus on each particular input class, limiting their generation abilities. In this work, we propose a novel retrieval-augmented point cloud completion framework. The core idea is to incorporate cross-modal retrieval into completion task to learn structural prior information from similar reference samples. Specifically, we design a Structural Shared Feature Encoder (SSFE) to jointly extract cross-modal features and reconstruct reference features as priors. Benefiting from a dual-channel control gate in the encoder, relevant structural features in the reference sample are enhanced and irrelevant information interference is suppressed. In addition, we propose a Progressive Retrieval-Augmented Generator (PRAG) that employs a hierarchical feature fusion mechanism to integrate reference prior information with input features from global to local. Through extensive evaluations on multiple datasets and real-world scenes, our method shows its effectiveness in generating fine-grained point clouds, as well as its generalization capability in handling sparse data and unseen categories.
Huayuan Ye, Juntong Chen, Shenzhuo Zhang, Yipeng Zhang, Changbo Wang, Chenhui Li
The dissemination of visualizations is primarily in the form of raster
images, which often results in the loss of critical information such as source
code, interactive features, and metadata. While previous methods have proposed
embedding metadata into images to facilitate Visualization Image Data Retrieval
(VIDR), most existing methods lack practicability since they are fragile to
common image tampering during online distribution such as cropping and editing.
To address this issue, we propose VisGuard, a tamper-resistant VIDR framework
that reliably embeds metadata link into visualization images. The embedded data
link remains recoverable even after substantial tampering upon images. We
propose several techniques to enhance robustness, including repetitive data
tiling, invertible information broadcasting, and an anchor-based scheme for
crop localization. VisGuard enables various applications, including interactive
chart reconstruction, tampering detection, and copyright protection. We conduct
comprehensive experiments on VisGuard's superior performance in data retrieval
accuracy, embedding capacity, and security against tampering and steganalysis,
demonstrating VisGuard's competence in facilitating and safeguarding
visualization dissemination and information conveyance.
Authors' comments: 9 pages, IEEE VIS 2025
Karan Mirhosseini, Arya Aftab, Alireza Sheikh
In an era of radical technology transformations, technology maps play a crucial role in enhancing decision making. These maps heavily rely on automated methods of technology extraction. This paper introduces Retrieval Augmented Technology Extraction (RATE), a Large Language Model (LLM) based pipeline for automated technology extraction from scientific literature. RATE combines Retrieval Augmented Generation (RAG) with multi-definition LLM-based validation. This hybrid method results in high recall in candidate generation alongside with high precision in candidate filtering. While the pipeline is designed to be general and widely applicable, we demonstrate its use on 678 research articles focused on Brain-Computer Interfaces (BCIs) and Extended Reality (XR) as a case study. Consequently, The validated technology terms by RATE were mapped into a co-occurrence network, revealing thematic clusters and structural features of the research landscape. For the purpose of evaluation, a gold standard dataset of technologies in 70 selected random articles had been curated by the experts. In addition, a technology extraction model based on Bidirectional Encoder Representations of Transformers (BERT) was used as a comparative method. RATE achieved F1-score of 91.27%, Significantly outperforming BERT with F1-score of 53.73%. Our findings highlight the promise of definition-driven LLM methods for technology extraction and mapping. They also offer new insights into emerging trends within the BCI-XR field. The source code is available https://github.com/AryaAftab/RATE
Authors' comments: 9 pages, 4 figures, 1 table
Lam Nguyen, Erika Barcelos, Roger French, Yinghui Wu
Ontology Matching (OM) is a cornerstone task of semantic interoperability,
yet existing systems often rely on handcrafted rules or specialized models with
limited adaptability. We present KROMA, a novel OM framework that harnesses
Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG)
pipeline to dynamically enrich the semantic context of OM tasks with
structural, lexical, and definitional knowledge. To optimize both performance
and efficiency, KROMA integrates a bisimilarity-based concept matching and a
lightweight ontology refinement step, which prune candidate concepts and
substantially reduce the communication overhead from invoking LLMs. Through
experiments on multiple benchmark datasets, we show that integrating knowledge
retrieval with context-augmented LLMs significantly enhances ontology matching,
outperforming both classic OM systems and cutting-edge LLM-based approaches
while keeping communication overhead comparable. Our study highlights the
feasibility and benefit of the proposed optimization techniques (targeted
knowledge retrieval, prompt enrichment, and ontology refinement) for ontology
matching at scale.
Authors' comments: Accepted to the 24th International Semantic Web Conference Research
Track (ISWC 2025)
Shad Nygren, Pinar Avci, Andre Daniels, Reza Rassol, Afshin Beheshti, Diego Galeano
Drug side effects are a major global health concern, necessitating advanced methods for their accurate detection and analysis. While Large Language Models (LLMs) offer promising conversational interfaces, their inherent limitations, including reliance on black-box training data, susceptibility to hallucinations, and lack of domain-specific knowledge, hinder their reliability in specialized fields like pharmacovigilance. To address this gap, we propose two architectures: Retrieval-Augmented Generation (RAG) and GraphRAG, which integrate comprehensive drug side effect knowledge into a Llama 3 8B language model. Through extensive evaluations on 19,520 drug side effect associations (covering 976 drugs and 3,851 side effect terms), our results demonstrate that GraphRAG achieves near-perfect accuracy in drug side effect retrieval. This framework offers a highly accurate and scalable solution, signifying a significant advancement in leveraging LLMs for critical pharmacovigilance applications.
Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos
Current automated fact-checking (AFC) approaches typically evaluate evidence either implicitly via the predicted verdicts or through exact matches with predefined closed knowledge sources, such as Wikipedia. However, these methods are limited due to their reliance on evaluation metrics originally designed for other purposes and constraints from closed knowledge sources. In this work, we introduce \textbf{\textcolor{skyblue}{Ev\textsuperscript{2}}\textcolor{orangebrown}{R}} which combines the strengths of reference-based evaluation and verdict-level proxy scoring. Ev\textsuperscript{2}R jointly assesses how well the evidence aligns with the gold references and how reliably it supports the verdict, addressing the shortcomings of prior methods. We evaluate Ev\textsuperscript{2}R against three types of evidence evaluation approaches: reference-based, proxy-reference, and reference-less baselines. Assessments against human ratings and adversarial tests demonstrate that Ev\textsuperscript{2}R consistently outperforms existing scoring approaches in accuracy and robustness. It achieves stronger correlation with human judgments and greater robustness to adversarial perturbations, establishing it as a reliable metric for evidence evaluation in AFC.\footnote{Code is available at \href{https://github.com/mubasharaak/fc-evidence-evaluation}{https://github.com/mubasharaak/fc-evidence-evaluation}.}
Authors' comments: Accepted at TACL
Chandana Cheerla
Organizations increasingly rely on proprietary enterprise data, including HR records, structured reports, and tabular documents, for critical decision-making. While Large Language Models (LLMs) have strong generative capabilities, they are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats. Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data. This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking. The framework applies semantic chunking to maintain textual coherence and retains tabular data structures to preserve row-column integrity. Quantized indexing optimizes retrieval efficiency, while human-in-the-loop feedback and conversation memory improve adaptability. Experiments on enterprise datasets show notable improvements: Precision@5 increased by 15 percent (90 versus 75), Recall@5 by 13 percent (87 versus 74), and Mean Reciprocal Rank by 16 percent (0.85 versus 0.69). Qualitative evaluations show higher scores in Faithfulness (4.6 versus 3.0), Completeness (4.2 versus 2.5), and Relevance (4.5 versus 3.2) on a 5-point Likert scale. These results demonstrate the framework's effectiveness in delivering accurate, comprehensive, and contextually relevant responses for enterprise tasks. Future work includes extending to multimodal data and integrating agent-based retrieval. The source code will be released at https://github.com/CheerlaChandana/Enterprise-Chatbot
Sara Ghasvarianjahromi, Yauhen Yakimenka, Jörg Kliewer
This paper introduces and analyzes a search and retrieval model that adopts key semantic communication principles from retrieval-augmented generation. We specifically present an information-theoretic analysis of a remote document retrieval system operating over a symbol erasure channel. The proposed model encodes the feature vector of a query, derived from term-frequency weights of a language corpus by using a repetition code with an adaptive rate dependent on the contextual importance of the terms. At the decoder, we select between two documents based on the contextual closeness of the recovered query. By leveraging a jointly Gaussian approximation for both the true and reconstructed similarity scores, we derive an explicit expression for the retrieval error probability, i.e., the probability under which the less similar document is selected. Numerical simulations on synthetic and real-world data (Google NQ) confirm the validity of the analysis. They further demonstrate that assigning greater redundancy to critical features effectively reduces the error rate, highlighting the effectiveness of semantic-aware feature encoding in error-prone communication settings.
Satoshi Yoshida, Jisho Miyazaki, Mio Murao
Storage and retrieval refer to the task of encoding a quantum channel
$\Lambda$ into a quantum state, known as the program state, such that the
channel can later be retrieved. This task is closely related to quantum channel
estimation, where multiple queries to $\Lambda$ are used to prepare a quantum
state $\phi_\Lambda$ that encodes information about the channel. The channel
can then be retrieved by measuring $\phi_\Lambda$, following a
measure-and-prepare strategy. In this work, we analyze the asymptotic
performance of the estimation-based strategy for storage and retrieval of
isometry channels. We show that the optimal fidelity for isometry estimation is
given by $F = 1-{d(D-d)\over n} + O(n^{-2})$, where $d$ and $D$ denote the
input and output dimensions of the isometry, and $n$ is the number of queries.
This result indicates that, unlike in the case of unitary channels, the
estimation-based strategy is suboptimal for the storage and retrieval of
isometry channels, which requires $n = \Theta(\epsilon^{-1})$ to achieve the
diamond-norm error $\epsilon$. To address this limitation, we propose a more
efficient protocol based on port-based teleportation, which stores the isometry
channel in a program state using only $n = \Theta(1/\sqrt{\epsilon})$ queries.
As an application, we extend our approach to general quantum channels,
achieving improved program cost compared to prior results by Gschwedtner,
Bluhm, and Winter [Quantum $\textbf{5}$, 488 (2021)].
Authors' comments: 5+28 pages, 2 figures
Yue Meng, Cheng Guo, Xiaohui Hu, Honghu Deng, Yi Cao, Tong Liu, Bo Zheng
User behavior sequence modeling, which captures user interest from rich historical interactions, is pivotal for industrial recommendation systems. Despite breakthroughs in ranking-stage models capable of leveraging ultra-long behavior sequences with length scaling up to thousands, existing retrieval models remain constrained to sequences of hundreds of behaviors due to two main challenges. One is strict latency budget imposed by real-time service over large-scale candidate pool. The other is the absence of target-aware mechanisms and cross-interaction architectures, which prevent utilizing ranking-like techniques to simplify long sequence modeling. To address these limitations, we propose a new framework named User Long-term Multi-Interest Retrieval Model(ULIM), which enables thousand-scale behavior modeling in retrieval stages. ULIM includes two novel components: 1)Category-Aware Hierarchical Dual-Interest Learning partitions long behavior sequences into multiple category-aware subsequences representing multi-interest and jointly optimizes long-term and short-term interests within specific interest cluster. 2)Pointer-Enhanced Cascaded Category-to-Item Retrieval introduces Pointer-Generator Interest Network(PGIN) for next-category prediction, followed by next-item retrieval upon the top-K predicted categories. Comprehensive experiments on Taobao dataset show that ULIM achieves substantial improvement over state-of-the-art methods, and brings 5.54% clicks, 11.01% orders and 4.03% GMV lift for Taobaomiaosha, a notable mini-app of Taobao.
Hai Toan Nguyen, Tien Dat Nguyen, Viet Ha Nguyen
Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and domain-specific. However, traditional methods often fail to create chunks that capture sufficient semantic meaning, as they do not account for the underlying textual structure. This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering to generate more meaningful and semantically coherent chunks. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations, thereby increasing the likelihood of retrieving more precise and contextually relevant information. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
Mehmet Onurcan Kaya, Figen S. Oktem
We propose a novel framework for phase retrieval that leverages Langevin dynamics to enable efficient posterior sampling, yielding reconstructions that explicitly balance distortion and perceptual quality. Unlike conventional approaches that prioritize pixel-wise accuracy, our method navigates the perception-distortion tradeoff through a principled combination of stochastic sampling, learned denoising, and model-based updates. The framework comprises three variants of increasing complexity, integrating theoretically grounded Langevin inference, adaptive noise schedule learning, parallel reconstruction sampling, and warm-start initialization from classical solvers. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple benchmarks, both in terms of fidelity and perceptual quality.
Isaac Shi, Zeyuan Li, Fan Liu, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi
We present eSapiens, an AI-as-a-Service (AIaaS) platform engineered around a business-oriented trifecta: proprietary data, operational workflows, and any major agnostic Large Language Model (LLM). eSapiens gives businesses full control over their AI assets, keeping everything in-house for AI knowledge retention and data security. eSapiens AI Agents (Sapiens) empower your team by providing valuable insights and automating repetitive tasks, enabling them to focus on high-impact work and drive better business outcomes. The system integrates structured document ingestion, hybrid vector retrieval, and no-code orchestration via LangChain, and supports top LLMs including OpenAI, Claude, Gemini, and DeepSeek. A key component is the THOR Agent, which handles structured SQL-style queries and generates actionable insights over enterprise databases. To evaluate the system, we conduct two experiments. First, a retrieval benchmark on legal corpora reveals that a chunk size of 512 tokens yields the highest retrieval precision (Top-3 accuracy: 91.3%). Second, a generation quality test using TRACe metrics across five LLMs shows that eSapiens delivers more context-consistent outputs with up to a 23% improvement in factual alignment. These results demonstrate the effectiveness of eSapiens in enabling trustworthy, auditable AI workflows for high-stakes domains like legal and finance.
Bangcheng Sun, Yazhe Chen, Jilin Yang, Xiaodong Li, Hui Li
Explainable Recommender System (ExRec) provides transparency to the recommendation process, increasing users' trust and boosting the operation of online services. With the rise of large language models (LLMs), whose extensive world knowledge and nuanced language understanding enable the generation of human-like, contextually grounded explanations, LLM-powered ExRec has gained great momentum. However, existing LLM-based ExRec models suffer from profile deviation and high retrieval overhead, hindering their deployment. To address these issues, we propose Retrieval-Augmented Recommendation Explanation Generation with Hierarchical Aggregation (REXHA). Specifically, we design a hierarchical aggregation based profiling module that comprehensively considers user and item review information, hierarchically summarizing and constructing holistic profiles. Furthermore, we introduce an efficient retrieval module using two types of pseudo-document queries to retrieve relevant reviews to enhance the generation of recommendation explanations, effectively reducing retrieval latency and improving the recall of relevant reviews. Extensive experiments demonstrate that our method outperforms existing approaches by up to 12.6% w.r.t. the explanation quality while achieving high retrieval efficiency.