Wenyuan Zhao, Yu-Shin Huang, Chao Tian, Alex Sprintson
We study the problem of leaky private information retrieval (L-PIR), where
the amount of privacy leakage is measured by the pure differential privacy
parameter, referred to as the leakage ratio exponent. Unlike the previous L-PIR
scheme proposed by Samy et al., which only adjusted the probability allocation
to the clean (low-cost) retrieval pattern, we optimize the probabilities
assigned to all the retrieval patterns jointly. It is demonstrated that the
optimal retrieval pattern probability distribution is quite sophisticated and
has a layered structure: the retrieval patterns associated with the random key
values of lower Hamming weights should be assigned higher probabilities. This
new scheme provides a significant improvement, leading to an $O(\log K)$
leakage ratio exponent with fixed download cost $D$ and number of servers $N$,
in contrast to the previous art that only achieves a $\Theta(K)$ exponent,
where $K$ is the number of messages.
Authors' comments: Long version of the paper submitted to ISIT 2025. 8 pages, 2 figures
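For readers unfamiliar with the term, the leakage ratio exponent can be read through the standard pure differential privacy condition. The statement below is a sketch under assumed notation ($Q_n$ for the query sent to server $n$, $k$ and $k'$ for the indices of two possible desired messages, $\epsilon$ for the exponent) and is not necessarily the exact formulation used in the paper:

```latex
\[
  \frac{\Pr\left(Q_n = q \mid \text{desired message } k\right)}
       {\Pr\left(Q_n = q \mid \text{desired message } k'\right)}
  \;\le\; e^{\epsilon}
  \qquad \text{for all queries } q,\ \text{all } k, k' \in \{1, \dots, K\},\ \text{and all servers } n .
\]
```

Under this reading, $\epsilon = 0$ corresponds to perfect privacy, and a smaller $\epsilon$ at the same download cost means less leakage.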
Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Junnan Dong, Hao Chen et al.
Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks, yet their application to specialized domains remains challenging due to the need for deep expertise. Retrieval-augmented generation (RAG) has emerged as a promising solution to customize LLMs for professional fields by seamlessly integrating external knowledge bases, enabling real-time access to domain-specific expertise during inference. Despite its potential, traditional RAG systems, based on flat text retrieval, face three critical challenges: (i) complex query understanding in professional contexts, (ii) difficulties in knowledge integration across distributed sources, and (iii) system efficiency bottlenecks at scale. This survey presents a systematic analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new paradigm that revolutionizes domain-specific LLM applications. GraphRAG addresses traditional RAG limitations through three key innovations: (i) graph-structured knowledge representation that explicitly captures entity relationships and domain hierarchies, (ii) efficient graph-based retrieval techniques that enable context-preserving knowledge retrieval with multihop reasoning ability, and (iii) structure-aware knowledge integration algorithms that leverage retrieved knowledge for accurate and logically coherent generation by LLMs. In this survey, we systematically analyze the technical foundations of GraphRAG and examine current implementations across various professional domains, identifying key technical challenges and promising research directions. All the related resources of GraphRAG, including research papers, open-source data, and projects, are collected for the community at https://github.com/DEEP-PolyU/Awesome-GraphRAG.
Jiacheng Zuo, Haibo Hu, Zikang Zhou, Yufei Cui, Ziquan Liu, Jianping Wang, Nan Guan, Jin Wang et al.
In the pursuit of robust autonomous driving systems, models trained on real-world datasets often struggle to adapt to new environments, particularly when confronted with corner cases such as extreme weather conditions. Collecting these corner cases in the real world is non-trivial, which necessitates the use of simulators for validation. However, the high computational cost and the domain gap in data distribution have hindered the seamless transition between real and simulated driving scenarios. To tackle this challenge, we propose Retrieval-Augmented Learning for Autonomous Driving (RALAD), a novel framework designed to bridge the real-to-sim gap at a low cost. RALAD features three primary designs, including (1) domain adaptation via an enhanced Optimal Transport (OT) method that accounts for both individual and grouped image distances, (2) a simple and unified framework that can be applied to various models, and (3) efficient fine-tuning techniques that freeze the computationally expensive layers while maintaining robustness. Experimental results demonstrate that RALAD compensates for the performance degradation in simulated environments while maintaining accuracy in real-world scenarios across three different models. Taking Cross View as an example, the mIOU and mAP metrics in real-world scenarios remain stable before and after RALAD fine-tuning, while in simulated environments, the mIOU and mAP metrics are improved by 10.30% and 12.29%, respectively. Moreover, the re-training cost of our approach is reduced by approximately 88.1%. Our code is available at https://github.com/JiachengZuo/RALAD.git.
Fatemeh Nazary, Yashar Deldjoo, Tommaso di Noia
This study presents Poison-RAG, a framework for adversarial data poisoning attacks targeting retrieval-augmented generation (RAG)-based recommender systems. Poison-RAG manipulates item metadata, such as tags and descriptions, to influence recommendation outcomes. Using item metadata generated through a large language model (LLM) and embeddings derived via the OpenAI API, we explore the impact of adversarial poisoning attacks on the provider side, where attacks are designed to promote long-tail items and demote popular ones. Two attack strategies are proposed: local modifications, which personalize tags for each item using BERT embeddings, and global modifications, which apply uniform tags across the dataset. Experiments conducted on the MovieLens dataset in a black-box setting reveal that local strategies improve manipulation effectiveness by up to 50%, while global strategies risk boosting already popular items. Results indicate that popular items are more susceptible to attacks, whereas long-tail items are harder to manipulate. Approximately 70% of items lack tags, presenting a cold-start challenge; data augmentation and synthesis are proposed as potential defense mechanisms to enhance RAG-based systems' resilience. The findings emphasize the need for robust metadata management to safeguard recommendation frameworks. Code and data are available at https://github.com/atenanaz/Poison-RAG.
Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini, Leonardo Venuta
Learned sparse text embeddings have gained popularity due to their effectiveness in top-k retrieval and inherent interpretability. Their distributional idiosyncrasies, however, have long hindered their use in real-world retrieval systems. That changed with the recent development of approximate algorithms that leverage the distributional properties of sparse embeddings to speed up retrieval. Nonetheless, in much of the existing literature, evaluation has been limited to datasets with only a few million documents such as MS MARCO. It remains unclear how these systems behave on much larger datasets and what challenges lurk at larger scales. To bridge that gap, we investigate the behavior of state-of-the-art retrieval algorithms on massive datasets. We compare and contrast the recently proposed Seismic and graph-based solutions adapted from dense retrieval. We extensively evaluate SPLADE embeddings of 138M passages from MS MARCO v2 and report indexing time and other efficiency and effectiveness metrics.
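To make the retrieval task concrete, the following is a minimal sketch of exact top-k retrieval by inner product over sparse embeddings using a plain inverted index. It is a brute-force baseline for illustration only, not Seismic or any graph-based method from the paper; the function names and the dictionary-based vector format are assumptions.

```python
import heapq
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, weight).

    docs is a list of sparse vectors, each a {term: weight} dict
    (hypothetical format chosen for this sketch).
    """
    index = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def top_k(index, query, k):
    """Return the k (doc_id, score) pairs with the largest inner product."""
    scores = defaultdict(float)
    # Only documents sharing at least one term with the query get a score.
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


docs = [{"a": 1.0, "b": 2.0}, {"b": 1.0, "c": 3.0}, {"a": 0.5}]
index = build_inverted_index(docs)
results = top_k(index, {"b": 1.0, "c": 1.0}, k=2)
```

Exact traversal like this touches every posting of every query term, which is the cost that approximate algorithms aim to avoid at the scale of hundreds of millions of passages.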
Zihan Wang, Jinyuan Fang, Giacomo Frisoni, Zhuyun Dai, Zaiqiao Meng, Gianluca Moro, Emine Yilmaz
Pretrained language models (PLMs) like BERT and GPT-4 have become the
foundation for modern information retrieval (IR) systems. However, existing
PLM-based IR models primarily rely on the knowledge learned during training for
prediction, limiting their ability to access and incorporate external,
up-to-date, or domain-specific information. Therefore, current information
retrieval systems struggle with semantic nuances, context relevance, and
domain-specific issues. To address these challenges, we propose the second
Knowledge-Enhanced Information Retrieval workshop (KEIR @ ECIR 2025) as a
platform to discuss innovative approaches that integrate external knowledge,
aiming to enhance the effectiveness of information retrieval in a rapidly
evolving technological landscape. The goal of this workshop is to bring
together researchers from academia and industry to discuss various aspects of
knowledge-enhanced information retrieval.
Authors' comments: KEIR @ ECIR 2025 workshop
Jonathan Lin, Aditya Joshi, Hye-young Paik, Tri Dung Doung, Deepti Gurdasani
Geocoding involves automatic extraction of location coordinates of incidents
reported in news articles, and can be used for epidemic intelligence or
disaster management. This paper introduces Retrieval-Augmented Coordinate
Capture Of Online News articles (RACCOON), an open-source geocoding approach
that extracts geolocations from news articles. RACCOON uses a
retrieval-augmented generation (RAG) approach where candidate locations and
associated information are retrieved in the form of context from a location
database, and a prompt containing the retrieved context, location mentions and
news articles is fed to an LLM to generate the location coordinates. Our
evaluation on three datasets, two underlying LLMs, three baselines and several
ablation tests based on the components of RACCOON demonstrate the utility of
RACCOON. To the best of our knowledge, RACCOON is the first RAG-based approach
for geocoding using pre-trained LLMs.
Authors' comments: Accepted at WWW 2025 as a short paper. 4 pages with references
Long Huang, Ming Zhao, Limin Xiao, Xiujun Zhang, Jungang Hu
The 3rd Generation Partnership Project (3GPP) documents are key standards in global telecommunications, but they pose significant challenges for engineers and researchers in the telecommunications field due to the large volume and complexity of their contents as well as their frequent updates. Large language models (LLMs) have shown promise in natural language processing tasks, but their general-purpose nature limits their effectiveness in specific domains like telecommunications. To address this, we propose Chat3GPP, an open-source retrieval-augmented generation (RAG) framework tailored for 3GPP specifications. By combining chunking strategies, hybrid retrieval and efficient indexing methods, Chat3GPP can efficiently retrieve relevant information and generate accurate responses to user queries without requiring domain-specific fine-tuning. The framework is both flexible and scalable, offering significant potential for adapting to other technical standards beyond 3GPP. We evaluate Chat3GPP on two telecom-specific datasets and demonstrate its superior performance compared to existing methods, showcasing its potential for downstream tasks like protocol generation and code automation.
Zhaoxing Li, Vahid Yazdanpanah, Jindi Wang, Wen Gu, Lei Shi, Alexandra I. Cristea, Sarah Kiden, Sebastian Stein
The integration of AI in education offers significant potential to enhance learning efficiency. Large Language Models (LLMs), such as ChatGPT, Gemini, and Llama, allow students to query a wide range of topics, providing unprecedented flexibility. However, LLMs face challenges, such as handling varying content relevance and lack of personalization. To address these challenges, we propose TutorLLM, a personalized learning recommender LLM system based on Knowledge Tracing (KT) and Retrieval-Augmented Generation (RAG). The novelty of TutorLLM lies in its unique combination of KT and RAG techniques with LLMs, which enables dynamic retrieval of context-specific knowledge and provides personalized learning recommendations based on the student's personal learning state. Specifically, this integration allows TutorLLM to tailor responses based on individual learning states predicted by the Multi-Features with Latent Relations BERT-based KT (MLFBK) model and to enhance response accuracy with a Scraper model. The evaluation includes user assessment questionnaires and performance metrics, demonstrating a 10% improvement in user satisfaction and a 5% increase in quiz scores compared to using general LLMs alone.
Keer Lu, Zheng Liang, Zhuoran Zhang, Da Pan, Shusen Zhang, Xin Wu, Zenan Zhou, Guosheng Dong et al.
Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. Despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce Med-R^2, a novel LLM physician framework that adheres to the Evidence-Based Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that Med-R^2 achieves a 14.74% improvement over vanilla RAG methods and even a 3.32% enhancement compared to fine-tuning strategies, without incurring additional training costs.
Pengcheng Zhao, Zhixian He, Fuwei Zhang, Shujin Lin, Fan Zhou
Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query. Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information. However, existing methods face several issues: (1) Overlapping semantic information between different samples in the dataset hinders the model's multimodal alignment performance; (2) Existing models are not able to efficiently extract local features of the video; (3) The Transformer Decoder used by existing models cannot adequately decode multimodal features. To address the above issues, we propose the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks. Specifically, we first distill the similarity matrix into the identity matrix to mitigate the impact of overlapping semantic information. Then, we design a method that enables convolutional layers to extract multimodal local features more efficiently. Finally, we feed the output of the Transformer Decoder back into itself to adequately decode multimodal information. We evaluate LD-DETR on four public benchmarks and conduct extensive experiments to demonstrate the superiority and effectiveness of our approach. Our model outperforms the state-of-the-art models on the QVHighlights, Charades-STA and TACoS datasets. Our code is available at https://github.com/qingchen239/ld-detr.
Amin Robatian, Mohammad Hajipour, Mohammad Reza Peyghan, Fatemeh Rajabi, Sajjad Amini, Shahrokh Ghaemmaghami, Iman Gholampour
Automatic Speech Recognition (ASR) systems have demonstrated remarkable
performance across various applications. However, limited data and the unique
language features of specific domains, such as low-resource languages,
significantly degrade their performance and lead to higher Word Error Rates
(WER). In this study, we propose Generative Error Correction via
Retrieval-Augmented Generation (GEC-RAG), a novel approach designed to improve
ASR accuracy for low-resource domains, like Persian. Our approach treats the
ASR system as a black-box, a common practice in cloud-based services, and
proposes a Retrieval-Augmented Generation (RAG) approach within the In-Context
Learning (ICL) scheme to enhance the quality of ASR predictions. By
constructing a knowledge base that pairs ASR predictions (1-best and 5-best
hypotheses) with their corresponding ground truths, GEC-RAG retrieves lexically
similar examples to the ASR transcription using the Term Frequency-Inverse
Document Frequency (TF-IDF) measure. This process provides relevant error
patterns of the system alongside the ASR transcription to the Generative Large
Language Model (LLM), enabling targeted corrections. Our results demonstrate
that this strategy significantly reduces WER in Persian and highlights a
potential for domain adaptation and low-resource scenarios. This research
underscores the effectiveness of using RAG in enhancing ASR systems without
requiring direct model modification or fine-tuning, making it adaptable to any
domain by simply updating the transcription knowledge base with domain-specific
data.
Authors' comments: 6 pages
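The retrieval step described in this abstract can be sketched as follows: index the ASR hypotheses in the knowledge base with TF-IDF, then return the (hypothesis, ground truth) pairs most similar to a new transcription for use as in-context examples. This is a standard-library illustration under stated assumptions (whitespace tokenization, smoothed IDF, cosine similarity, hypothetical function names and English toy data), not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return TF-IDF vectors ({term: weight}) and the IDF table for texts."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumption of this sketch).
    idf = {term: math.log((1 + n) / (1 + c)) + 1.0 for term, c in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for toks in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(knowledge_base, hypothesis, k=2):
    """Return the k (asr_hypothesis, ground_truth) pairs closest to hypothesis."""
    vectors, idf = tfidf_vectors([h for h, _ in knowledge_base])
    counts = Counter(hypothesis.lower().split())
    # Terms unseen in the knowledge base get zero weight.
    query = {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}
    ranked = sorted(
        range(len(knowledge_base)),
        key=lambda i: cosine(query, vectors[i]),
        reverse=True,
    )
    return [knowledge_base[i] for i in ranked[:k]]


kb = [
    ("the whether is nice", "the weather is nice"),
    ("i red a book", "i read a book"),
    ("see you tomorow", "see you tomorrow"),
]
examples = retrieve(kb, "whether is bad today", k=1)
```

The retrieved pairs would then be placed in the LLM prompt alongside the new transcription, so the model sees how the black-box ASR system tends to err on lexically similar inputs.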
Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Zien Xie, Youyao Jia, Sidan Du
Given a video and a linguistic query, video moment retrieval and highlight
detection (MR&HD) aim to locate all the relevant spans while simultaneously
predicting saliency scores. Most existing methods utilize RGB images as input,
overlooking the inherent multi-modal visual signals like optical flow and
depth. In this paper, we propose a Multi-modal Fusion and Query Refinement
Network (MRNet) to learn complementary information from multi-modal cues.
Specifically, we design a multi-modal fusion module to dynamically combine RGB,
optical flow, and depth map. Furthermore, to simulate human understanding of
sentences, we introduce a query refinement module that merges text at different
granularities, containing word-, phrase-, and sentence-wise levels.
Comprehensive experiments on QVHighlights and Charades datasets indicate that
MRNet outperforms current state-of-the-art methods, achieving notable
improvements in MR-mAP@Avg (+3.41) and HD-HIT@1 (+3.46) on QVHighlights.
Authors' comments: Accepted by ICME 2024
Weihang Zhang, Jihao Li, Shuoke Li, Ziqing Niu, Jialiang Chen, Wenkai Zhang
Remote sensing text--image retrieval (RSTIR) aims to retrieve the matched remote sensing (RS) images from the database according to the descriptive text. Recently, the rapid development of large visual-language pre-training models provides new insights for RSTIR. Nevertheless, as the complexity of models grows in RSTIR, the previous studies suffer from suboptimal resource efficiency during transfer learning. To address this issue, we propose a computation and memory-efficient retrieval (CMER) framework for RSTIR. To reduce the training memory consumption, we propose the Focus-Adapter module, which adopts a side branch structure. Its focus layer suppresses the interference of background pixels for small targets. Simultaneously, to enhance data efficacy, we regard the RS scene category as the metadata and design a concise augmentation technique. The scene label augmentation leverages the prior knowledge from land cover categories and shrinks the search space. We propose the negative sample recycling strategy to make the negative sample pool decoupled from the mini-batch size. It improves the generalization performance without introducing additional encoders. We have conducted quantitative and qualitative experiments on public datasets and expanded the benchmark with some advanced approaches, which demonstrates the competitiveness of the proposed CMER. Compared with the recent advanced methods, the overall retrieval performance of CMER is 2%--5% higher on RSITMD. Moreover, our proposed method reduces memory consumption by 49% and has a 1.4x data throughput during training. The code of the CMER and the dataset will be released at https://github.com/ZhangWeihang99/CMER.
Shuai Lyu, Zijing Tian, Zhonghong Ou, Yifan Zhu, Xiao Zhang, Qiankun Ha, Haoran Luo, Meina Song
Cross-modal retrieval maps data across different modalities via semantic
relevance. Existing approaches implicitly assume that data pairs are
well-aligned and ignore the widely existing annotation noise, i.e., noisy
correspondence (NC), which inevitably causes performance
degradation. Despite attempts that employ the co-teaching paradigm with
identical architectures to provide distinct data perspectives, the differences
between the resulting models stem primarily from random initialization.
Thus, the models become increasingly homogeneous as training
progresses. Consequently, the additional information brought by this paradigm is
severely limited. In order to resolve this problem, we introduce a Tripartite
learning with Semantic Variation Consistency (TSVC) for robust image-text
retrieval. We design a tripartite cooperative learning mechanism comprising a
Coordinator, a Master, and an Assistant model. The Coordinator distributes
data, and the Assistant model supports the Master model's noisy label
prediction with diverse data. Moreover, we introduce a soft label estimation
method based on mutual information variation, which quantifies the noise in new
samples and assigns corresponding soft labels. We also present a new loss
function to enhance robustness and optimize training effectiveness. Extensive
experiments on three widely used datasets demonstrate that, even at increasing
noise ratios, TSVC exhibits significant advantages in retrieval accuracy and
maintains stable training performance.
Authors' comments: This paper has been accepted to the Main Track of AAAI 2025. It
contains 9 pages, 7 figures, and is relevant to the areas of cross-modal
retrieval and machine learning. The work presents a novel approach in robust
image-text retrieval using a tripartite learning framework
Aarush Sinha, Viraj Virk, Dipshikha Chakraborty, P. S. Sreeja
Language Models (LMs) are now playing an increasingly large role in information generation and synthesis, so the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination, that is, generating apparently plausible but actually false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in all domains that require high levels of factual correctness, such as academia and education. This work presents a pipeline for evaluating the frequency with which language models hallucinate when generating responses about the scientific literature. We propose ArxEval, an evaluation pipeline with two tasks using ArXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation includes fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
Vera Pavlova
This study examines the use of Natural Language Processing (NLP) technology within the Islamic domain, focusing on developing an Islamic neural retrieval model. By leveraging the robust XLM-R model, the research employs a language reduction technique to create a lightweight bilingual large language model (LLM). Our approach for domain adaptation addresses the unique challenges faced in the Islamic domain, where substantial in-domain corpora exist only in Arabic and are limited in other languages, including English. The work utilizes a multi-stage training process for retrieval models, incorporating large retrieval datasets, such as MS MARCO, and smaller, in-domain datasets to improve retrieval performance. Additionally, we have curated an in-domain retrieval dataset in English by employing data augmentation techniques and involving a reliable Islamic source. This approach enhances the domain-specific dataset for retrieval, leading to further performance gains. The findings suggest that combining domain adaptation and a multi-stage training method for the bilingual Islamic neural retrieval model enables it to outperform monolingual models on downstream retrieval tasks.
Zengyi Gao, Yukun Cao, Hairu Wang, Ao Ke, Yuan Feng, Xike Xie, S Kevin Zhou
To mitigate the hallucination and knowledge deficiency in large language models (LLMs), Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) has shown promising potential by utilizing KGs as an external resource to enhance LLM reasoning. However, existing KG-RAG approaches struggle with a trade-off between flexibility and retrieval quality. Modular methods prioritize flexibility by avoiding the use of KG-fine-tuned models during retrieval, leading to fixed retrieval strategies and suboptimal retrieval quality. Conversely, coupled methods embed KG information within models to improve retrieval quality, but at the expense of flexibility. In this paper, we propose a novel flexible modular KG-RAG framework, termed FRAG, which synergizes the advantages of both approaches. FRAG estimates the hop range of reasoning paths based solely on the query and classifies it as either simple or complex. To match the complexity of the query, tailored pipelines are applied to ensure efficient and accurate reasoning path retrieval, thus fostering the final reasoning process. By using the query text instead of the KG to infer the structural information of reasoning paths and employing adaptable retrieval strategies, FRAG improves retrieval quality while maintaining flexibility. Moreover, FRAG does not require extra LLM fine-tuning or calls, significantly boosting efficiency and conserving resources. Extensive experiments show that FRAG achieves state-of-the-art performance with high efficiency and low resource consumption.
Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Jingyi Song, Hao Wang
Leveraging the autonomous decision-making capabilities of large language
models (LLMs) has demonstrated superior performance in reasoning tasks.
However, despite the success of iterative or recursive retrieval-augmented
generation (RAG) techniques, these methods are often constrained to a single
solution space when confronted with complex problems. In this paper, we propose
a novel thinking pattern in RAG that integrates system analysis with efficient
reasoning actions, significantly activating intrinsic reasoning capabilities
and expanding the solution space of specific tasks via Monte Carlo Tree Search
(MCTS), which we refer to as AirRAG. Specifically, our approach designs five
fundamental reasoning actions, which are expanded to a broad tree-based
reasoning space using MCTS. The approach also incorporates self-consistency
verification to explore potential reasoning paths and inference scaling law.
Additionally, computationally optimal strategies are employed to allocate more
inference resources to key actions, thereby enhancing overall performance.
Experimental results demonstrate the effectiveness of AirRAG, showing
significant performance gains on complex question-answering datasets.
Furthermore, AirRAG is flexible and lightweight, making it easy to integrate
with other advanced technologies.
Authors' comments: 17 pages, 14 figures
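As background for the tree search mentioned in this abstract, the sketch below shows the textbook UCT selection rule often used inside MCTS to balance trying under-explored actions against exploiting actions with high average reward. The abstract does not state which selection rule AirRAG uses, so this is an illustrative assumption; the action names and the exploration constant `c` are hypothetical.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=1.4):
    """UCT score of a child: mean reward plus an exploration bonus."""
    if visits == 0:
        # Unvisited children are always tried first.
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_action(stats, c=1.4):
    """Pick the action with the highest UCT score.

    stats maps an action name to (value_sum, visits); the parent visit
    count is taken as the sum of the children's visits.
    """
    parent_visits = sum(visits for _, visits in stats.values())
    return max(
        stats,
        key=lambda a: uct_score(stats[a][0], stats[a][1], parent_visits, c),
    )


# Hypothetical reasoning actions, named only for this example.
stats = {"decompose": (3.0, 4), "retrieve": (0.5, 4)}
chosen = select_action(stats)
```

With equal visit counts the exploration bonus is identical for both actions, so the one with the higher mean reward is selected; adding an unvisited action would make that action the next choice.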
Reham Omar, Omij Mangukiya, Essam Mansour
Dialogue benchmarks are crucial in training and evaluating chatbots engaging
in domain-specific conversations. Knowledge graphs (KGs) represent semantically
rich and well-organized data spanning various domains, such as DBLP, DBpedia,
and YAGO. Traditionally, dialogue benchmarks have been manually created from
documents, neglecting the potential of KGs in automating this process. Some
question-answering benchmarks are automatically generated using extensive
preprocessing from KGs, but they do not support dialogue generation. This paper
introduces Chatty-Gen, a novel multi-stage retrieval-augmented generation
platform for automatically generating high-quality dialogue benchmarks tailored
to a specific domain using a KG. Chatty-Gen decomposes the generation process
into manageable stages and uses assertion rules for automatic validation
between stages. Our approach enables control over intermediate results to
prevent time-consuming restarts due to hallucinations. It also reduces reliance
on costly and more powerful commercial LLMs. Chatty-Gen eliminates upfront
processing of the entire KG using efficient query-based retrieval to find
representative subgraphs based on the dialogue context. Our experiments with
several real and large KGs demonstrate that Chatty-Gen significantly
outperforms state-of-the-art systems and ensures consistent model and system
performance across multiple LLMs of diverse capabilities, such as GPT-4o,
Gemini 1.5, Llama 3, and Mistral.
Authors' comments: The paper is published in SIGMOD 2025