Iman Barati, Mostafa Amiri, Heshaam Faili
Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high-quality instruction datasets for SFT. Our approach begins with a limited set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Subsequently, domain-relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: [https://github.com/mostafaamiri/SearchInstruct](https://github.com/mostafaamiri/SearchInstruct)
Guohang Yan, Yue Zhang, Pinlong Cai, Ding Wang, Song Mao, Hongwei Zhang, Yaoze Zhang, Hairong Zhang et al.
Retrieval-augmented generation (RAG) has become a dominant paradigm for mitigating knowledge hallucination and staleness in large language models (LLMs) while preserving data security. By retrieving relevant evidence from private, domain-specific corpora and injecting it into carefully engineered prompts, RAG delivers trustworthy responses without the prohibitive cost of fine-tuning. Traditional RAG systems are text-only and often rely on a single storage backend, most commonly a vector database. In practice, this monolithic design suffers from unavoidable trade-offs: vector search captures semantic similarity yet loses global context; knowledge graphs excel at relational precision but struggle with recall; full-text indexes are fast and exact yet semantically blind; and relational engines such as MySQL provide strong transactional guarantees but no semantic understanding. We argue that these heterogeneous retrieval paradigms are complementary, and propose a principled fusion scheme to orchestrate them synergistically, mitigating the weaknesses of any single modality. In this work we introduce HetaRAG, a hybrid, deep-retrieval augmented generation framework that orchestrates cross-modal evidence from heterogeneous data stores. We plan to design a system that unifies vector indices, knowledge graphs, full-text engines, and structured databases into a single retrieval plane, dynamically routing and fusing evidence to maximize recall, precision, and contextual fidelity. To achieve this design goal, we carried out preliminary explorations and constructed an initial RAG pipeline; this technical report provides a brief overview. Partial code is available at https://github.com/KnowledgeXLab/HetaRAG.
Authors' comments: 15 pages, 4 figures
Zakaria El Kassimi, Fares Fourati, Mohamed-Slim Alouini
We study question answering in the domain of radio regulations, a legally sensitive and high-stakes area. We propose a telecom-specific Retrieval-Augmented Generation (RAG) pipeline and introduce, to our knowledge, the first multiple-choice evaluation set for this domain, constructed from authoritative sources using automated filtering and human validation. To assess retrieval quality, we define a domain-specific retrieval metric, under which our retriever achieves approximately 97% accuracy. Beyond retrieval, our approach consistently improves generation accuracy across all tested models. In particular, while naively inserting documents without structured retrieval yields only marginal gains for GPT-4o (less than 1%), applying our pipeline results in nearly a 12% relative improvement. These findings demonstrate that carefully targeted grounding provides a simple yet strong baseline and an effective domain-specific solution for regulatory question answering. All code and evaluation scripts, along with our derived question-answer dataset, are available at https://github.com/Zakaria010/Radio-RAG.
Hassan Gharoun, Mohammad Sadegh Khorshidi, Kasra Ranjbarigderi, Fang Chen, Amir H. Gandomi
This work proposes an evidence-retrieval mechanism for uncertainty-aware
decision-making that replaces a single global cutoff with an
evidence-conditioned, instance-adaptive criterion. For each test instance,
proximal exemplars are retrieved in an embedding space; their predictive
distributions are fused via Dempster-Shafer theory. The resulting fused belief
acts as a per-instance thresholding mechanism. Because the supporting evidence
is explicit, decisions are transparent and auditable. Experiments on
CIFAR-10/100 with BiT and ViT backbones show higher or comparable
uncertainty-aware performance with materially fewer confidently incorrect
outcomes and a sustainable review load compared with applying a threshold to
prediction entropy. Notably, only a few pieces of evidence are sufficient to realize
these gains; increasing the evidence set yields only modest changes. These
results indicate that evidence-conditioned tagging provides a more reliable and
interpretable alternative to fixed prediction entropy thresholds for
operational uncertainty-aware decision-making.
Authors' comments: 15 pages, 4 figures, 3 tables
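The fusion step described in the abstract above can be illustrated with Dempster's rule of combination. The following is a minimal sketch, not the authors' implementation: it assumes each retrieved exemplar's predictive distribution is turned into a mass function by assigning `alpha * p_k` to each class and the remaining `1 - alpha` to the whole frame (ignorance). The discount factor `alpha` and all function names are hypothetical.

```python
from typing import List, Tuple

# A mass function restricted to singleton classes plus the whole frame:
# (mass per class, mass on "any class" i.e. ignorance).
Mass = Tuple[List[float], float]


def discount(probs: List[float], alpha: float) -> Mass:
    """Turn a probability vector into a mass function (assumed discounting)."""
    return [alpha * p for p in probs], 1.0 - alpha


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule for two mass functions of the form above."""
    s1, t1 = m1
    s2, t2 = m2
    # Class k is supported when both sources agree on k, or when one
    # source says k and the other is ignorant.
    singles = [a * b + a * t2 + t1 * b for a, b in zip(s1, s2)]
    theta = t1 * t2
    # Everything left over is conflict, so this total equals 1 - conflict.
    total = sum(singles) + theta
    if total == 0.0:
        raise ValueError("sources are in total conflict")
    return [s / total for s in singles], theta / total


def fuse(dists: List[List[float]], alpha: float = 0.8) -> Mass:
    """Fuse the predictive distributions of several retrieved exemplars."""
    masses = [discount(p, alpha) for p in dists]
    fused = masses[0]
    for m in masses[1:]:
        fused = combine(fused, m)
    return fused
```

A caller could then compare the fused belief of the predicted class, or the residual ignorance mass, against a decision rule of its choosing; how the paper derives the per-instance criterion from the fused belief is not specified in the abstract.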
Haichao Zhang, Chong Zhang, Peiyu Hu, Shi Qiu, Jia Wang
Modern recommender systems face a critical challenge in complying with
privacy regulations like the 'right to be forgotten': removing a user's data
without disrupting recommendations for others. Traditional unlearning methods
address this by partial model updates, but introduce propagation bias--where
unlearning one user's data distorts recommendations for behaviorally similar
users, degrading system accuracy. While retraining eliminates bias, it is
computationally prohibitive for large-scale systems. To address this challenge,
we propose CRAGRU, a novel framework leveraging Retrieval-Augmented Generation
(RAG) for efficient, user-specific unlearning that mitigates bias while
preserving recommendation quality. CRAGRU decouples unlearning into distinct
retrieval and generation stages. In retrieval, we employ three tailored
strategies designed to precisely isolate the target user's data influence,
minimizing collateral impact on unrelated users and enhancing unlearning
efficiency. Subsequently, the generation stage utilizes an LLM, augmented with
user profiles integrated into prompts, to reconstruct accurate and personalized
recommendations without needing to retrain the entire base model. Experiments
on three public datasets demonstrate that CRAGRU effectively unlearns targeted
user data, significantly mitigating unlearning bias by preventing adverse
impacts on non-target users, while maintaining recommendation performance
comparable to fully trained original models. Our work highlights the promise of
RAG-based architectures for building robust and privacy-preserving recommender
systems. The source code is available at:
https://github.com/zhanghaichao520/LLM_rec_unlearning.
Authors' comments: 10 pages, 4 figures. Accepted ICDM 2025 (IEEE International
Conference on Data Mining)
Minjong Yoo, Jinwoo Jang, Wei-jin Park, Honguk Woo
This study presents an Exploratory Retrieval-Augmented Planning (ExRAP)
framework, designed to tackle continual instruction following tasks of embodied
agents in dynamic, non-stationary environments. The framework enhances Large
Language Models' (LLMs) embodied reasoning capabilities by efficiently
exploring the physical environment and establishing the environmental context
memory, thereby effectively grounding the task planning process in time-varying
environment contexts. In ExRAP, given multiple continual instruction following
tasks, each instruction is decomposed into queries on the environmental context
memory and task executions conditioned on the query results. To efficiently
handle these multiple tasks that are performed continuously and simultaneously,
we implement an exploration-integrated task planning scheme by incorporating
information-based exploration into the LLM-based planning process.
Combined with memory-augmented query evaluation, this integrated scheme not
only allows for a better balance between the validity of the environmental
context memory and the load of environment exploration, but also improves
overall task performance. Furthermore, we devise a temporal consistency
refinement scheme for query evaluation to address the inherent decay of
knowledge in the memory. Through experiments with VirtualHome, ALFRED, and
CARLA, our approach demonstrates robustness against a variety of embodied
instruction following scenarios involving different instruction scales and
types, and non-stationarity degrees, and it consistently outperforms other
state-of-the-art LLM-based task planning approaches in terms of both goal
success rate and execution efficiency.
Authors' comments: 21 pages. NeurIPS 2024
Kai Ye, Liangcai Su, Chenxiong Qian
Code generation has emerged as a pivotal capability of Large Language
Models (LLMs), revolutionizing development efficiency for programmers of all
skill levels. However, the complexity of data structures and algorithmic logic
often results in functional deficiencies and security vulnerabilities in
generated code, reducing it to a prototype requiring extensive manual
debugging. While Retrieval-Augmented Generation (RAG) can enhance correctness
and security by leveraging external code manuals, it simultaneously introduces
new attack surfaces.
In this paper, we pioneer the exploration of attack surfaces in
Retrieval-Augmented Code Generation (RACG), focusing on malicious dependency
hijacking. We demonstrate how poisoned documentation containing hidden
malicious dependencies (e.g., matplotlib_safe) can subvert RACG, exploiting
dual trust chains: LLM reliance on RAG and developers' blind trust in LLM
suggestions. To construct poisoned documents, we propose ImportSnare, a novel
attack framework employing two synergistic strategies: 1) Position-aware beam
search optimizes hidden ranking sequences to elevate poisoned documents in
retrieval results, and 2) Multilingual inductive suggestions generate
jailbreaking sequences to manipulate LLMs into recommending malicious
dependencies. Through extensive experiments across Python, Rust, and
JavaScript, ImportSnare achieves significant attack success rates (over 50% for
popular libraries such as matplotlib and seaborn) in general, and is also able
to succeed even when the poisoning ratio is as low as 0.01%, targeting both
custom and real-world malicious packages. Our findings reveal critical supply
chain risks in LLM-powered development, highlighting inadequate security
alignment for code generation tasks. To support future research, we will
release the multilingual benchmark suite and datasets. The project homepage is
https://importsnare.github.io.
Authors' comments: This paper has been accepted by the ACM Conference on Computer and
Communications Security (CCS) 2025
Przemysław Stokłosa, Janusz A. Starzyk, Paweł Raif, Adrian Horzyk, Marcin Kowalik
The paper addresses challenges in storing and retrieving sequences in contexts like anomaly detection, behavior prediction, and genetic information analysis. Associative Knowledge Graphs (AKGs) offer a promising approach by leveraging sparse graph structures to encode sequences. The objective was to develop a method for sequence storage and retrieval using AKGs that maintain high memory capacity and context-based retrieval accuracy while introducing algorithms for efficient element ordering. The study utilized Sequential Structural Associative Knowledge Graphs (SSAKGs). These graphs encode sequences as transitive tournaments with nodes representing objects and edges defining the order. Four ordering algorithms were developed and tested: Simple Sort, Node Ordering, Enhanced Node Ordering, and Weighted Edges Node Ordering. The evaluation was conducted on synthetic datasets consisting of random sequences of varying lengths and distributions, and real-world datasets, including sentence-based sequences from the NLTK library and miRNA sequences mapped symbolically with a window-based approach. Metrics such as precision, sensitivity, and specificity were employed to assess performance. SSAKGs exhibited quadratic growth in memory capacity relative to graph size. This study introduces a novel structural approach for sequence storage and retrieval. Key advantages include no training requirements, flexible context-based reconstruction, and high efficiency in sparse memory graphs. With broad applications in computational neuroscience and bioinformatics, the approach offers scalable solutions for sequence-based memory tasks.
Authors' comments: 13 pages, 6 figures
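To make the "sequence as a transitive tournament" encoding from the abstract above concrete, here is a minimal sketch for a single stored sequence with unique elements. It is not one of the four ordering algorithms named in the abstract; the out-degree ordering and the function names are assumptions chosen only to show why a transitive tournament determines an order.

```python
from itertools import combinations
from typing import Hashable, Iterable, List, Sequence, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def encode_sequence(seq: Sequence[Hashable]) -> Set[Edge]:
    """Store a sequence as a transitive tournament.

    Every element gets a directed edge to every element that follows it.
    """
    if len(set(seq)) != len(seq):
        raise ValueError("this sketch assumes unique elements")
    # combinations() yields pairs (seq[i], seq[j]) with i < j.
    return set(combinations(seq, 2))


def recover_order(context: Iterable[Hashable], edges: Set[Edge]) -> List[Hashable]:
    """Order a set of context elements using only the stored edges.

    In a transitive tournament the out-degrees within any subset are all
    distinct, so sorting by out-degree recovers the stored order.
    """
    elems = set(context)
    out_degree = {e: 0 for e in elems}
    for a, b in edges:
        if a in elems and b in elems:
            out_degree[a] += 1
    return sorted(elems, key=lambda e: out_degree[e], reverse=True)
```

For example, encoding `["g", "a", "t", "c"]` stores six edges, and `recover_order` restores the original order from the unordered set of its elements, or from any subset used as context. Handling many overlapping sequences in one sparse graph, which is the setting the paper studies, needs the more elaborate algorithms it describes.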
Wooseong Yang, Weizhi Zhang, Yuqing Liu, Yuwei Han, Yu Wang, Junhyun Lee, Philip S. Yu
Cold-start items remain a persistent challenge in recommender systems due to their lack of historical user interactions, which collaborative models rely on. While recent zero-shot methods leverage large language models (LLMs) to address this, they often struggle with sparse metadata and hallucinated or incomplete knowledge. We propose ColdRAG, a retrieval-augmented generation approach that builds a domain-specific knowledge graph dynamically to enhance LLM-based recommendation in cold-start scenarios, without requiring task-specific fine-tuning. ColdRAG begins by converting structured item attributes into rich natural-language profiles, from which it extracts entities and relationships to construct a unified knowledge graph capturing item semantics. Given a user's interaction history, it scores edges in the graph using an LLM, retrieves candidate items with supporting evidence, and prompts the LLM to rank them. By enabling multi-hop reasoning over this graph, ColdRAG grounds recommendations in verifiable evidence, reducing hallucinations and strengthening semantic connections. Experiments on three public benchmarks demonstrate that ColdRAG surpasses existing zero-shot baselines in both Recall and NDCG. This framework offers a practical solution to cold-start recommendation by combining knowledge-graph reasoning with retrieval-augmented LLM generation.
Authors' comments: 10 pages
Özgür Uğur, Musa Yılmaz, Esra Şavirdi, Özay Ezerceli, Mahmut El Huseyni, Selva Taş, Reyhan Bayraktar
The integration of Large Language Models (LLMs) into various applications has driven the need for structured and reliable responses. A key challenge in Retrieval-Augmented Generation (RAG) systems is ensuring that outputs align with expected formats while minimizing hallucinations. This study examines the role of guided decoding in RAG systems, comparing three methods, Outlines, XGrammar, and LM Format Enforcer, across different multi-turn prompting setups (0-turn, 1-turn, and 2-turn). By evaluating success rates, hallucination rates, and output quality, we provide insights into their performance and applicability. Our findings reveal how multi-turn interactions influence guided decoding, uncovering unexpected performance variations that can inform method selection for specific use cases. This work advances the understanding of structured output generation in RAG systems, offering both theoretical insights and practical guidance for LLM deployment.
Haike Xu, Tong Chen
The widely used retrieve-and-rerank pipeline faces two critical limitations: it is constrained by the initial retrieval quality of the top-k documents, and the growing computational demands of LLM-based rerankers restrict the number of documents that can be effectively processed. We introduce Reranker-Guided-Search (RGS), a novel approach that bypasses these limitations by directly retrieving documents according to reranker preferences rather than following the traditional sequential reranking method. Our method uses a greedy search on proximity graphs generated by approximate nearest neighbor algorithms, strategically prioritizing promising documents for reranking based on document similarity. Experimental results demonstrate substantial performance improvements across multiple benchmarks: 3.5 points on BRIGHT, 2.9 on FollowIR, and 5.1 on M-BEIR, all within a constrained reranker budget of 100 documents. Our analysis suggests that, given a fixed pair of embedding and reranker models, strategically selecting documents to rerank can significantly improve retrieval accuracy under a limited reranker budget.
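The greedy search over a proximity graph described above can be sketched as a best-first traversal that spends a fixed budget of reranker calls. This is an illustrative reading of the abstract rather than the authors' algorithm: the adjacency-dict graph, the `rerank_score` callable, and the seeding from an initial retrieval are all assumptions.

```python
import heapq
import itertools
from typing import Callable, Dict, Hashable, Iterable, List, Sequence

Doc = Hashable


def reranker_guided_search(
    graph: Dict[Doc, Sequence[Doc]],
    seeds: Iterable[Doc],
    rerank_score: Callable[[Doc], float],
    budget: int,
    top_k: int,
) -> List[Doc]:
    """Greedy best-first search that scores at most `budget` documents.

    `graph` maps each document to its nearest neighbors (for instance, the
    proximity graph built by an approximate nearest neighbor index).
    """
    scores: Dict[Doc, float] = {}
    frontier: list = []  # max-heap emulated with negated scores
    tie_breaker = itertools.count()  # avoids comparing documents directly

    def visit(doc: Doc) -> None:
        # Each document is scored at most once, and never beyond the budget.
        if doc in scores or len(scores) >= budget:
            return
        scores[doc] = rerank_score(doc)
        heapq.heappush(frontier, (-scores[doc], next(tie_breaker), doc))

    for doc in seeds:
        visit(doc)

    while frontier and len(scores) < budget:
        # Expand the neighbors of the best-scored document not yet expanded.
        _, _, best = heapq.heappop(frontier)
        for neighbor in graph.get(best, ()):
            visit(neighbor)

    ranked = sorted(scores, key=lambda d: scores[d], reverse=True)
    return ranked[:top_k]
```

The point of the design is that the reranker, not the embedding similarity to the query, decides which region of the graph is explored next, so relevant documents outside the initial top-k can still be reached.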
Xixi Wu, Yanchao Tan, Nan Hou, Ruiyang Zhang, Hong Cheng
Document Understanding is a foundational AI capability with broad
applications, and Document Question Answering (DocQA) is a key evaluation task.
Traditional methods convert the document into text for processing by Large
Language Models (LLMs), but this process strips away critical multi-modal
information like figures. While Large Vision-Language Models (LVLMs) address
this limitation, their constrained input size makes multi-page document
comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate
this by selecting relevant pages, but they rely solely on semantic relevance,
ignoring logical connections between pages and the query, which is essential
for reasoning.
To this end, we propose MoLoRAG, a logic-aware retrieval framework for
multi-modal, multi-page document understanding. By constructing a page graph
that captures contextual relationships between pages, a lightweight VLM
performs graph traversal to retrieve relevant pages, including those with
logical connections often overlooked. This approach combines semantic and
logical relevance to deliver more accurate retrieval. After retrieval, the
top-$K$ pages are fed into arbitrary LVLMs for question answering. To enhance
flexibility, MoLoRAG offers two variants: a training-free solution for easy
deployment and a fine-tuned version to improve logical relevance checking.
Experiments on four DocQA datasets demonstrate average improvements of 9.68% in
accuracy over LVLM direct inference and 7.44% in retrieval precision over
baselines. Codes and datasets are released at
https://github.com/WxxShirley/MoLoRAG.
Authors' comments: EMNLP Main 2025
Yushi Sun, Kai Sun, Yifan Ethan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen
Retrieval-Augmented Generation (RAG) mitigates hallucination in Large Language Models (LLMs) by incorporating external data, with Knowledge Graphs (KGs) offering crucial information for question answering. Traditional Knowledge Graph Question Answering (KGQA) methods rely on semantic parsing, which typically retrieves only the knowledge strictly necessary for answer generation, and thus often suffer from low coverage due to rigid schema requirements and semantic ambiguity. We present KERAG, a novel KG-based RAG pipeline that enhances QA coverage by retrieving a broader subgraph likely to contain relevant information. Our retrieval-filtering-summarization approach, combined with fine-tuned LLMs for Chain-of-Thought reasoning on knowledge sub-graphs, reduces noise and improves QA for both simple and complex questions. Experiments demonstrate that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.
Authors' comments: Accepted by EMNLP Findings 2025
Ruohong Yang, Peng Hu, Yunfan Li, Xi Peng
Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images of the same category across diverse domains without relying on annotations. Existing UCIR methods, which align cross-domain features for the entire image, often struggle with the domain gap, as the object features critical for retrieval are frequently entangled with domain-specific styles. To address this challenge, we propose DUDE, a novel UCIR method building upon feature disentanglement. In brief, DUDE leverages a text-to-image generative model to disentangle object features from domain-specific styles, thus facilitating semantic image retrieval. To further achieve reliable alignment of the disentangled object features, DUDE aligns mutual neighbors from within domains to across domains in a progressive manner. Extensive experiments demonstrate that DUDE achieves state-of-the-art performance across three benchmark datasets over 13 domains. The code will be released.
Qingbo Liu, Zhongyang Xu, Guangkui Tao, Xiuyuan Sun, Min Xue, Weihao Yuan, Shilong Pan
Although speckle is a powerful tool for high-precision metrology, large datasets and cumbersome training are always required to learn from the encoded speckle patterns, which is unfavorable for rapid deployment and multi-dimensional metrology. To enable high accuracy and fast training, physics-informed machine learning enforces physical laws to address high-dimensional problems. Here, we harness the modal fields in a few-mode fiber, which follow the law of beam propagation, to enable high-accuracy and fast-training parameter estimation. Anti-noise fast mode decomposition is implemented to retrieve the modal fields from the speckles. The accuracy is enhanced since the modal fields enable parameter estimation at random points in the continuous space-time domain. Artificial tactile perception and multi-dimensional metrology are achieved with high accuracy because the modal fields respond diversely to different parameters. Meanwhile, the number of specklegrams for training is reduced by around 5 times. The training time of machine learning is significantly reduced by 800 times, from 9 hours and 45 minutes to 40 seconds. Therefore, harnessing the modal fields paves a new way for the speckle-based metrology to develop efficient, low-cost, multi-dimensional sensors, making it suitable for intelligent wearable devices, industrial robots and healthcare applications.
Gowen Loo, Chang Liu, Qinghong Yin, Xiang Chen, Jiawei Chen, Jingyuan Zhang, Yu Tian
Smartphones have become indispensable in people's daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) heavily rely on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agent framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving a 10.3% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: https://github.com/liuxiaojieOutOfWorld/MobileRAG_arxiv
Marco Vetrano, Tiziano Zingales, G. Massimo Palma, Salvatore Lorenzo
The study of exoplanetary atmospheres traditionally relies on forward models to analytically compute the spectrum of an exoplanet by fine-tuning numerous chemical and physical parameters. However, the high dimensionality of the parameter space often results in significant computational overhead. In this work, we introduce a novel approach to atmospheric retrieval leveraging quantum extreme learning machines (QELMs). QELMs are quantum machine learning techniques that employ quantum systems as a black box for processing input data. We propose a framework for extracting exoplanetary atmospheric features using QELMs, employing an intrinsically fault-tolerant strategy suitable for near-term quantum devices, and we demonstrate such fault tolerance with a direct implementation on IBM Fez. The QELM architecture we present shows the potential of quantum computing in the analysis of astrophysical datasets and may, in the near-term future, unlock new computational tools to implement fast, efficient, and more accurate models in the study of exoplanetary atmospheres.
Rauf Aliev
Traditional e-commerce search systems often struggle with the semantic gap between user queries and product catalogs. In this paper, we propose a Category-Aligned Retrieval System (CARS) that improves search relevance by first predicting the product category from a user's query and then boosting products within that category. We introduce a novel method for creating "Trainable Category Prototypes" from query embeddings. We evaluate this method with two models: a lightweight all-MiniLM-L6-v2 and OpenAI's text-embedding-ada-002. Our offline evaluation shows this method is highly effective, with the OpenAI model increasing Top-3 category prediction accuracy from a zero-shot baseline of 43.8% to 83.2% after training. The end-to-end simulation, however, highlights the limitations of blindly applying category boosts in a complex retrieval pipeline: while accuracy is high, naive integration can negatively affect search relevance metrics such as nDCG@10. We argue that this is partly due to dataset-specific ambiguities (e.g., polysemous queries in the Amazon ESCI corpus) and partly due to the sensitivity of retrieval systems to over-constraining filters. Crucially, these results do not diminish the value of the approach; rather, they emphasize the need for confidence-aware and adaptive integration strategies.
Wang Chen, Guanqiang Qi, Weikang Li, Yang Li
Retrieval-Augmented Generation (RAG) has become a core paradigm in document question answering tasks. However, existing methods have limitations when dealing with multimodal documents: one category of methods relies on layout analysis and text extraction, which can only utilize explicit text information and struggle to capture images or unstructured content; the other category treats document segmentation as visual input and directly passes it to visual language models (VLMs) for processing, yet it ignores the semantic advantages of text, leading to suboptimal generation results. This paper proposes co-modality-based RAG (CMRAG), which can simultaneously leverage text and images for efficient retrieval and generation. Specifically, we first perform structured parsing on documents to obtain co-modality representations of text segments and image regions. Subsequently, in response to user queries, we retrieve candidate evidence from text and image channels, respectively, and aggregate the results at the cross-modal retrieval level. Finally, we prompt the VLM to generate the final response based on the co-modality retrieval results. Experiments demonstrate that our method significantly outperforms pure-vision-based RAG in visual document question answering tasks. The findings of this paper show that integrating co-modality information into the RAG framework in a unified manner is an effective approach to improving the performance of complex document visual question-answering (VQA) systems.
Amber Xie, Rahul Chand, Dorsa Sadigh, Joey Hejna
While large-scale robot datasets have propelled recent progress in imitation
learning, learning from smaller task specific datasets remains critical for
deployment in new environments and unseen tasks. One such approach to few-shot
imitation learning is retrieval-based imitation learning, which extracts
relevant samples from large, widely available prior datasets to augment a
limited demonstration dataset. To determine the relevant data from prior
datasets, retrieval-based approaches most commonly calculate a prior data
point's minimum distance to a point in the target dataset in latent space.
While retrieval-based methods have shown success using this metric for data
selection, we demonstrate its equivalence to the limit of a Gaussian kernel
density estimate (KDE) of the target data distribution. This reveals two
shortcomings of the retrieval rule used in prior work. First, it relies on
high-variance nearest neighbor estimates that are susceptible to noise. Second,
it does not account for the distribution of prior data when retrieving data. To
address these issues, we introduce Importance Weighted Retrieval (IWR), which
estimates importance weights, or the ratio between the target and prior data
distributions for retrieval, using Gaussian KDEs. By considering the
probability ratio, IWR seeks to mitigate the bias of previous selection rules,
and by using reasonable modeling parameters, IWR effectively smooths estimates
using all data points. Across both simulation environments and real-world
evaluations on the Bridge dataset we find that our method, IWR, consistently
improves performance of existing retrieval-based methods, despite only
requiring minor modifications.
Authors' comments: Conference on Robot Learning 2025
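The retrieval rule described in the abstract above, ranking prior data by the ratio of target to prior density under Gaussian KDEs, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: it uses an isotropic kernel, one shared `bandwidth` for both estimates, and plain Python lists as latent vectors; all names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def kde_log_density(x: Vector, data: Sequence[Vector], bandwidth: float) -> float:
    """Log of an isotropic Gaussian KDE at x, computed with log-sum-exp."""
    dim = len(x)
    log_norm = -0.5 * dim * math.log(2.0 * math.pi * bandwidth ** 2) - math.log(len(data))
    exponents = [
        -sum((a - b) ** 2 for a, b in zip(x, point)) / (2.0 * bandwidth ** 2)
        for point in data
    ]
    peak = max(exponents)  # subtract the max for numerical stability
    return log_norm + peak + math.log(sum(math.exp(e - peak) for e in exponents))


def importance_weighted_retrieval(
    prior: Sequence[Vector],
    target: Sequence[Vector],
    bandwidth: float,
    n_retrieve: int,
) -> List[int]:
    """Return indices of the prior points with the largest importance weights.

    The weight of a prior point z is p_target(z) / p_prior(z), handled in
    log space as a difference of the two KDE log densities.
    """
    log_weights = [
        kde_log_density(z, target, bandwidth) - kde_log_density(z, prior, bandwidth)
        for z in prior
    ]
    order = sorted(range(len(prior)), key=lambda i: log_weights[i], reverse=True)
    return order[:n_retrieve]
```

Compared with a nearest-neighbor rule, every target point contributes to each estimate, and prior points that sit in regions already dense with prior data are down-weighted by the denominator.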