Yushi Sun, Kai Sun, Yifan Ethan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen
Retrieval-Augmented Generation (RAG) mitigates hallucination in Large Language Models (LLMs) by incorporating external data, with Knowledge Graphs (KGs) offering crucial information for question answering. Traditional Knowledge Graph Question Answering (KGQA) methods rely on semantic parsing, which typically retrieves only the knowledge strictly necessary for answer generation, and thus often suffer from low coverage due to rigid schema requirements and semantic ambiguity. We present KERAG, a novel KG-based RAG pipeline that enhances QA coverage by retrieving a broader subgraph likely to contain relevant information. Our retrieval-filtering-summarization approach, combined with fine-tuned LLMs for Chain-of-Thought reasoning on knowledge sub-graphs, reduces noise and improves QA for both simple and complex questions. Experiments demonstrate that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.
Authors' comments: Accepted by EMNLP Findings 2025
Ruohong Yang, Peng Hu, Yunfan Li, Xi Peng
Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images of the same category across diverse domains without relying on annotations. Existing UCIR methods, which align cross-domain features for the entire image, often struggle with the domain gap, as the object features critical for retrieval are frequently entangled with domain-specific styles. To address this challenge, we propose DUDE, a novel UCIR method building upon feature disentanglement. In brief, DUDE leverages a text-to-image generative model to disentangle object features from domain-specific styles, thus facilitating semantic image retrieval. To further achieve reliable alignment of the disentangled object features, DUDE aligns mutual neighbors from within domains to across domains in a progressive manner. Extensive experiments demonstrate that DUDE achieves state-of-the-art performance across three benchmark datasets over 13 domains. The code will be released.
Qingbo Liu, Zhongyang Xu, Guangkui Tao, Xiuyuan Sun, Min Xue, Weihao Yuan, Shilong Pan
Although speckle is a powerful tool for high-precision metrology, large datasets and cumbersome training are always required to learn from the encoded speckle patterns, which is unfavorable for rapid deployment and multi-dimensional metrology. To enable high accuracy and fast training, physics-informed machine learning enforces physical laws to address high-dimensional problems. Here, we harness the modal fields in a few-mode fiber, which follow the law of beam propagation, to enable high-accuracy and fast-training parameter estimation. Anti-noise fast mode decomposition is implemented to retrieve the modal fields from the speckles. The accuracy is enhanced since the modal fields enable parameter estimation at random points in the continuous space-time domain. Artificial tactile perception and multi-dimensional metrology are achieved with high accuracy because the modal fields respond diversely to different parameters. Meanwhile, the number of specklegrams for training is reduced by around 5 times. The training time of machine learning is significantly reduced by 800 times, from 9 hours and 45 minutes to 40 seconds. Therefore, harnessing the modal fields paves a new way for the speckle-based metrology to develop efficient, low-cost, multi-dimensional sensors, making it suitable for intelligent wearable devices, industrial robots and healthcare applications.
Gowen Loo, Chang Liu, Qinghong Yin, Xiang Chen, Jiawei Chen, Jingyuan Zhang, Yu Tian
Smartphones have become indispensable in people's daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) heavily rely on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agent framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving a 10.3% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: https://github.com/liuxiaojieOutOfWorld/MobileRAG_arxiv
Marco Vetrano, Tiziano Zingales, G. Massimo Palma, Salvatore Lorenzo
The study of exoplanetary atmospheres traditionally relies on forward models to analytically compute the spectrum of an exoplanet by fine-tuning numerous chemical and physical parameters. However, the high dimensionality of the parameter space often results in a significant computational overhead. In this work, we introduce a novel approach to atmospheric retrieval leveraging quantum extreme learning machines (QELMs). QELMs are quantum machine learning techniques that employ quantum systems as a black box for processing input data. We propose a framework for extracting exoplanetary atmospheric features using QELMs, employing an intrinsically fault-tolerant strategy suitable for near-term quantum devices, and we demonstrate such fault tolerance with a direct implementation on IBM Fez. The QELM architecture we present shows the potential of quantum computing in the analysis of astrophysical datasets and may, in the near-term future, unlock new computational tools to implement fast, efficient, and more accurate models in the study of exoplanetary atmospheres.
Rauf Aliev
Traditional e-commerce search systems often struggle with the semantic gap between user queries and product catalogs. In this paper, we propose a Category-Aligned Retrieval System (CARS) that improves search relevance by first predicting the product category from a user's query and then boosting products within that category. We introduce a novel method for creating "Trainable Category Prototypes" from query embeddings. We evaluate this method with two models: a lightweight all-MiniLM-L6-v2 and OpenAI's text-embedding-ada-002. Our offline evaluation shows this method is highly effective, with the OpenAI model increasing Top-3 category prediction accuracy from a zero-shot baseline of 43.8% to 83.2% after training. The end-to-end simulation, however, highlights the limitations of blindly applying category boosts in a complex retrieval pipeline: while accuracy is high, naive integration can negatively affect search relevance metrics such as nDCG@10. We argue that this is partly due to dataset-specific ambiguities (e.g., polysemous queries in the Amazon ESCI corpus) and partly due to the sensitivity of retrieval systems to over-constraining filters. Crucially, these results do not diminish the value of the approach; rather, they emphasize the need for confidence-aware and adaptive integration strategies.
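The predict-then-boost idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the additive `boost`, and the `min_conf` threshold are all assumptions introduced here, and the category prototypes are taken as given vectors rather than trained. The threshold is one hypothetical way to make the boost confidence-aware, in the spirit of the abstract's closing remark.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_category(query_vec, prototypes):
    """Return (category, similarity) for the prototype closest to the query."""
    return max(
        ((cat, cosine(query_vec, proto)) for cat, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )

def rerank(query_vec, products, prototypes, boost=0.2, min_conf=0.5):
    """Add `boost` to products in the predicted category, but only when the
    prediction similarity reaches `min_conf` (both values are hypothetical)."""
    category, conf = predict_category(query_vec, prototypes)
    ranked = []
    for product_id, base_score, product_cat in products:
        bonus = boost if (conf >= min_conf and product_cat == category) else 0.0
        ranked.append((product_id, base_score + bonus))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

# Toy data: two category prototypes and two products with base relevance scores.
prototypes = {"shoes": [1.0, 0.0], "books": [0.0, 1.0]}
products = [("p1", 0.50, "books"), ("p2", 0.45, "shoes")]
confident = rerank([0.9, 0.1], products, prototypes)
ambiguous = rerank([1.0, 1.0], products, prototypes, min_conf=0.9)
```

With a clearly "shoes"-like query the boost lifts `p2` above `p1`; with an ambiguous query and a strict threshold, the original ordering is left untouched.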
Wang Chen, Guanqiang Qi, Weikang Li, Yang Li
Retrieval-Augmented Generation (RAG) has become a core paradigm in document question answering tasks. However, existing methods have limitations when dealing with multimodal documents: one category of methods relies on layout analysis and text extraction, which can only utilize explicit text information and struggle to capture images or unstructured content; the other category treats document segmentation as visual input and directly passes it to visual language models (VLMs) for processing, yet it ignores the semantic advantages of text, leading to suboptimal generation results. This paper proposes co-modality-based RAG (CMRAG), which can simultaneously leverage text and images for efficient retrieval and generation. Specifically, we first perform structured parsing on documents to obtain co-modality representations of text segments and image regions. Subsequently, in response to user queries, we retrieve candidate evidence from text and image channels, respectively, and aggregate the results at the cross-modal retrieval level. Finally, we prompt the VLM to generate the final response based on the co-modality retrieval results. Experiments demonstrate that our method significantly outperforms pure-vision-based RAG in visual document question answering tasks. The findings of this paper show that integrating co-modality information into the RAG framework in a unified manner is an effective approach to improving the performance of complex document visual question-answering (VQA) systems.
Amber Xie, Rahul Chand, Dorsa Sadigh, Joey Hejna
While large-scale robot datasets have propelled recent progress in imitation
learning, learning from smaller task-specific datasets remains critical for
deployment in new environments and unseen tasks. One such approach to few-shot
imitation learning is retrieval-based imitation learning, which extracts
relevant samples from large, widely available prior datasets to augment a
limited demonstration dataset. To determine the relevant data from prior
datasets, retrieval-based approaches most commonly calculate a prior data
point's minimum distance to a point in the target dataset in latent space.
While retrieval-based methods have shown success using this metric for data
selection, we demonstrate its equivalence to the limit of a Gaussian kernel
density estimate (KDE) of the target data distribution. This reveals two
shortcomings of the retrieval rule used in prior work. First, it relies on
high-variance nearest neighbor estimates that are susceptible to noise. Second,
it does not account for the distribution of prior data when retrieving data. To
address these issues, we introduce Importance Weighted Retrieval (IWR), which
estimates importance weights, or the ratio between the target and prior data
distributions for retrieval, using Gaussian KDEs. By considering the
probability ratio, IWR seeks to mitigate the bias of previous selection rules,
and by using reasonable modeling parameters, IWR effectively smooths estimates
using all data points. Across both simulation environments and real-world
evaluations on the Bridge dataset we find that our method, IWR, consistently
improves performance of existing retrieval-based methods, despite only
requiring minor modifications.
Authors' comments: Conference on Robot Learning 2025
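The density-ratio idea in the abstract above can be sketched in a few lines. This is an illustrative reading, not the authors' code: it assumes latent embeddings are already available as plain lists of floats, uses one shared isotropic bandwidth for both KDEs, and all function names and default values are hypothetical.

```python
import math

def gaussian_kde(points, bandwidth):
    """Return a function that estimates the density at x with an isotropic
    Gaussian kernel centred on every point in `points`."""
    dim = len(points[0])
    norm = len(points) * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim

    def density(x):
        total = 0.0
        for p in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return total / norm

    return density

def importance_weights(prior, target, bandwidth=0.5, eps=1e-12):
    """Weight each prior point by p_target(x) / p_prior(x)."""
    p_target = gaussian_kde(target, bandwidth)
    p_prior = gaussian_kde(prior, bandwidth)
    return [p_target(x) / (p_prior(x) + eps) for x in prior]

def retrieve(prior, target, k, bandwidth=0.5):
    """Return indices of the k prior points with the largest weights."""
    weights = importance_weights(prior, target, bandwidth)
    order = sorted(range(len(prior)), key=lambda i: weights[i], reverse=True)
    return order[:k]

# Toy 1-D latent space: two prior points lie near the target data, three far away.
prior = [[0.0], [0.1], [5.0], [5.1], [5.2]]
target = [[0.0], [0.2]]
selected = retrieve(prior, target, k=2)
```

Unlike a nearest-neighbor rule, every target point contributes to each weight, and the denominator down-weights regions where prior data is already dense.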
Youchao Zhou, Heyan Huang, Yicheng Liu, Rui Dai, Xinglin Wang, Xingchen Zhang, Shumin Shi, Yang Deng
Existing Large Language Models (LLMs) occasionally generate plausible yet
factually incorrect responses, known as hallucinations. Researchers are
primarily using two approaches to mitigate hallucinations, namely Retrieval
Augmented Language Models (RALMs) and refusal post-training. However, current
research predominantly emphasizes their individual effectiveness while
overlooking the evaluation of the refusal capability of RALMs. In this study,
we ask the fundamental question: Do RALMs know when they don't know?
Specifically, we ask three questions. First, are RALMs well-calibrated
regarding different internal and external knowledge states? We examine the
influence of various factors. Contrary to expectations, we find that LLMs
exhibit significant over-refusal behavior. Then, how does refusal
post-training affect the over-refusal issue? We investigate the Refusal-aware
Instruction Tuning (R-tuning) and In-Context Fine-tuning methods. Our results
show that the over-refusal problem is mitigated by In-Context Fine-tuning but
magnified by R-tuning. However, we also find that the refusal ability may conflict with
the quality of the answer. Finally, we develop a simple yet effective refusal
method for refusal post-trained models to improve their overall answer quality
in terms of refusal and correct answers. Our study provides a more
comprehensive understanding of the influence of important factors on RALM
systems.
Authors' comments: under review
Long Zhang, Peipei Song, Jianfeng Dong, Kun Li, Xun Yang
Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos
partially relevant to a given query. The core challenge lies in learning robust
query-video alignment against spurious semantic correlations arising from
inherent data uncertainty: 1) query ambiguity, where the query incompletely
characterizes the target video and often contains uninformative tokens, and 2)
partial video relevance, where abundant query-irrelevant segments introduce
contextual noise in cross-modal alignment. Existing methods often focus on
enhancing multi-scale clip representations and retrieving the most relevant
clip. However, the inherent data uncertainty in PRVR renders them vulnerable to
distractor videos with spurious similarities, leading to suboptimal
performance. To fill this research gap, we propose the Robust Alignment Learning
(RAL) framework, which explicitly models the uncertainty in data. Key
innovations include: 1) we pioneer probabilistic modeling for PRVR by encoding
videos and queries as multivariate Gaussian distributions. This not only
quantifies data uncertainty but also enables proxy-level matching to capture
the variability in cross-modal correspondences; 2) we consider the
heterogeneous informativeness of query words and introduce learnable confidence
gates to dynamically weight similarity. As a plug-and-play solution, RAL can be
seamlessly integrated into the existing architectures. Extensive experiments
across diverse retrieval backbones demonstrate its effectiveness.
Authors' comments: Accepted at EMNLP 2025
Yutian Xiao, Shukuan Wang, Binhao Wang, Zhao Zhang, Yanze Zhang, Shanqi Liu, Chao Feng, Xiang Li et al.
Click-through rate (CTR) prediction serves as a cornerstone of recommender systems. Despite the strong performance of current CTR models based on user behavior modeling, they are still severely limited by interaction sparsity, especially in low-active user scenarios. To address this issue, data augmentation of user behavior is a promising research direction. However, existing data augmentation methods heavily rely on collaborative signals while overlooking the rich multimodal features of items, leading to insufficient modeling of low-active users. To alleviate this problem, we propose a novel framework MARS (Modality-Aligned Retrieval for Sequence Augmented CTR Prediction). MARS utilizes a Stein kernel-based approach to align text and image features into a unified and unbiased semantic space to construct multimodal user embeddings. Subsequently, each low-active user's behavior sequence is augmented by retrieving, filtering, and concentrating the most similar behavior sequence of high-active users via multimodal user embeddings. Validated by extensive offline experiments and online A/B tests, our framework MARS consistently outperforms state-of-the-art baselines and achieves substantial growth on core business metrics within Kuaishou (https://www.kuaishou.com/). Consequently, MARS has been successfully deployed, serving the main traffic for hundreds of millions of users. To ensure reproducibility, we provide anonymous access to the implementation code (https://github.com/wangshukuan/MARS).
Jinwen Chen, Hainan Zhang, Liang Pang, Yongxin Tong, Haibo Zhou, Yuan Zhan, Wei Lin, Zhiming Zheng
Current RAG systems require uploading plaintext documents to the cloud, risking private data leakage. Parametric RAG (PRAG) addresses this by encoding documents as LoRA within LLMs, enabling reasoning without exposing raw content. However, it still faces two issues: (1) PRAG demands synthesizing QA pairs and fine-tuning the LLM for each individual document to create its corresponding LoRA, leading to unacceptable inference latency. (2) The performance of PRAG relies solely on synthetic QA data, lacking internal alignment with standard RAG, resulting in poor generalization on out-of-distribution (OOD) inputs. Therefore, achieving high-efficiency parameterization while maintaining RAG-level performance remains a critical challenge for privacy-preserving reasoning. In this paper, we propose DistilledPRAG, a generalizable knowledge-distilled parametric RAG model aligned with standard RAG in document structure and parameter activation. We first synthesize QA pairs from single and multi-documents to enhance cross-document reasoning. Then, we mask the plaintext documents with a special token and translate them to LoRA via a parameter generator, maintaining the standard RAG document structure. Finally, guided by synthetic QA data, we train the parameter generator to match standard RAG's hidden states and output logits, enabling RAG-style reasoning without original documents. Experiments on four QA datasets show that DistilledPRAG outperforms baselines in accuracy and generalizes well on OOD data.
Alexis Horde Vo, Matt Duckham, Estrid He, Rafe Benli
Who is the "Batman" behind "Batman Street" in Melbourne? Understanding the historical, cultural, and societal narratives behind place names can reveal the rich context that has shaped a community. Although place names serve as essential spatial references in gazetteers, they often lack information about place name origins. Enriching these place names in today's gazetteers is a time-consuming, manual process that requires extensive exploration of a vast archive of documents and text sources. Recent advances in natural language processing and language models (LMs) hold the promise of significant automation of identifying place name origins due to their powerful capability to exploit the semantics of the stored documents. This chapter presents a retrieval augmented generation pipeline designed to search for place name origins over a broad knowledge base, DBpedia. Given a spatial query, our approach first extracts sub-graphs that may contain knowledge relevant to the query; then ranks the extracted sub-graphs to generate the final answer to the query using fine-tuned LM-based models (i.e., ColBERTv2 and Llama2). Our results highlight the key challenges facing automated retrieval of place name origins, especially the tendency of language models to under-use the spatial information contained in texts as a discriminating factor. Our approach also frames the wider implications for geographic information retrieval using retrieval augmented generation.
Haomiao Tang, Wenjie Li, Yixiang Qiu, Genping Wang, Shu-Tao Xia
Despite the ubiquity of modern face retrieval systems, their retrieval stage is often outsourced to third-party entities, posing significant risks to user portrait privacy. Although homomorphic encryption (HE) offers strong security guarantees by enabling arithmetic computations in the cipher space, its high computational cost makes it unsuitable for real-time, real-world applications. To address this issue, we propose Cancelable Product Quantization, a highly efficient framework for secure face representation retrieval. Our hierarchical two-stage framework comprises: (i) a high-throughput cancelable PQ indexing module for fast candidate filtering, and (ii) a fine-grained cipher-space retrieval module for final precise face ranking. A tailored protection mechanism is designed to secure the indexing module for cancelable biometric authentication while ensuring efficiency. Experiments on benchmark datasets demonstrate that our method achieves a decent balance between effectiveness, efficiency and security.
Authors' comments: 14 pages and 2 figures, accepted by PRCV2025
Sri Ram Macharla, Sridhar Murthy J, Anjaneyulu Pasala
MultiFluxAI is an innovative AI platform developed to address the challenges
of managing and integrating vast, disparate data sources in product engineering
across application domains. It addresses both current and new service-related
queries that enhance user engagement in the digital ecosystem. This platform
leverages advanced AI techniques, such as Generative AI, vectorization, and
agentic orchestration to provide dynamic and context-aware responses to complex
user queries.
Authors' comments: Abstract accepted for presentation at ACM ISEC 2025
Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
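The dimension bound described above is easiest to see in the extreme case of one-dimensional embeddings. The sketch below is only an illustration of that special case under dot-product scoring, not the paper's experiment or the LIMIT dataset; the document values and the query grid are arbitrary choices made here.

```python
from itertools import combinations

def top_k(query, docs, k):
    """Indices of the k documents with the highest dot-product score."""
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return frozenset(order[:k])

def reachable_subsets(queries, docs, k):
    """Distinct top-k sets that at least one query in `queries` returns."""
    return {top_k(q, docs, k) for q in queries}

# Five documents embedded as distinct scalars (dimension 1).
docs = [[-2.0], [-1.0], [0.5], [1.5], [3.0]]
# A dense grid of non-zero scalar queries, positive and negative.
queries = [[x / 10.0] for x in range(-50, 51) if x != 0]

reached = reachable_subsets(queries, docs, k=2)
total = len(list(combinations(range(len(docs)), 2)))
print(len(reached), "of", total, "top-2 subsets are reachable")  # prints "2 of 10 ..."
```

A positive query always returns the two largest scalars and a negative query the two smallest, so 8 of the 10 possible document pairs can never be the top-2 result, no matter which query is issued. Higher dimensions raise this ceiling, but the abstract's point is that a ceiling remains.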
Haoyu Wu, Qingcheng Zeng, Kaize Ding
Dense retrievers and rerankers are central to retrieval-augmented generation
(RAG) pipelines, where accurately retrieving factual information is crucial for
maintaining system trustworthiness and defending against RAG poisoning.
However, little is known about how much factual competence these components
inherit or lose from the large language models (LLMs) they are based on. We
pair 12 publicly released embedding checkpoints with their original base LLMs
and evaluate both sets on a factuality benchmark. Across every model evaluated,
the embedding variants achieve markedly lower accuracy than their bases, with
absolute drops ranging from 12 to 43 percentage points (median 28 pts) and
typical retriever accuracies collapsing into the 25-35% band versus the 60-70%
attained by the generative models. This degradation intensifies under a more
demanding condition: when the candidate pool per question is expanded from four
options to one thousand, the strongest retriever's top-1 accuracy falls from 33%
to 26%, revealing acute sensitivity to distractor volume. Statistical tests
further show that, for every embedding model, cosine-similarity scores between
queries and correct completions are significantly higher than those for
incorrect ones (p < 0.01), indicating decisions driven largely by surface-level
semantic proximity rather than factual reasoning. To probe this weakness, we
employed GPT-4.1 to paraphrase each correct completion, creating a rewritten
test set that preserved factual truth while masking lexical cues, and observed
that over two-thirds of previously correct predictions flipped to wrong,
reducing overall accuracy to roughly one-third of its original level. Taken
together, these findings reveal a systematic trade-off introduced by
contrastive learning for retrievers: gains in semantic retrieval are paid for
with losses in parametric factual knowledge......
Authors' comments: Proceedings of the 34th ACM International Conference on Information
and Knowledge Management
Yijia Sun, Shanshan Huang, Linxiao Che, Haitao Lu, Qiang Luo, Kun Gai, Guorui Zhou
Modern industrial recommendation systems encounter a core challenge of
multi-stage optimization misalignment: a significant semantic gap exists
between the multi-objective optimization paradigm widely used in the ranking
phase and the single-objective modeling in the retrieval phase. Although the
mainstream industry solution achieves multi-objective coverage through parallel
multi-path single-objective retrieval, this approach leads to linear growth of
training and serving resources with the number of objectives and has inherent
limitations in handling loosely coupled objectives. This paper proposes
MPFormer, a dynamic multi-task Transformer framework, which systematically
addresses the aforementioned issues through three innovative mechanisms. First,
an objective-conditioned Transformer jointly encodes user behavior
sequences and multi-task semantics through learnable attention modulation;
second, personalized target weights are introduced to achieve dynamic
adjustment of retrieval results; finally, user personalization information is
incorporated into token representations and the Transformer structure to
further enhance the model's representation ability. This framework has been
successfully integrated into the Kuaishou short-video recommendation system, stably
serving over 400 million daily active users. It significantly improves user
daily engagement and system operational efficiency. Practical deployment
verification shows that, compared with traditional solutions, it effectively
optimizes the iterative paradigm of multi-objective retrieval while maintaining
service response speed, providing a scalable multi-objective solution for
industrial recommendation systems.
Authors' comments: CIKM 2025
Boheng Mao
Legal text classification is a fundamental NLP task in the legal domain. Benchmark datasets in this area often exhibit a long-tail label distribution, where many labels are underrepresented, leading to poor model performance on rare classes. This paper proposes Selective Retrieval-Augmentation (SRA) as a solution to this problem. SRA focuses on augmenting samples belonging to low-frequency labels in the training set, preventing the introduction of noise for well-represented classes, and requires no changes to the model architecture. Retrieval is performed only from the training data to ensure there is no potential information leakage, removing the need for external corpora simultaneously. The proposed SRA method is tested on two legal text classification benchmark datasets with long-tail distributions: LEDGAR (single-label) and UNFAIR-ToS (multi-label). The results indicate that SRA attains higher micro-F1 and macro-F1 scores compared to all current LexGLUE baselines across both datasets, illustrating consistent improvements in long-tail legal text classification. The code repository is available at: https://github.com/Boheng-Mao/sra-legal
Felix Nützel, Mischa Dombrowski, Bernhard Kainz
Retrieval-augmented learning based on radiology reports has emerged as a
promising direction to improve performance on long-tail medical imaging tasks,
such as rare disease detection in chest X-rays. Most existing methods rely on
comparing high-dimensional text embeddings from models like CLIP or CXR-BERT,
which are often difficult to interpret, computationally expensive, and not
well-aligned with the structured nature of medical knowledge. We propose a
novel, ontology-driven alternative for comparing radiology report texts based
on clinically grounded concepts from the Unified Medical Language System
(UMLS). Our method extracts standardised medical entities from free-text
reports using an enhanced pipeline built on RadGraph-XL and SapBERT. These
entities are linked to UMLS concepts (CUIs), enabling a transparent,
interpretable set-based representation of each report. We then define a
task-adaptive similarity measure based on a modified and weighted version of
the Tversky Index that accounts for synonymy, negation, and hierarchical
relationships between medical entities. This allows efficient and semantically
meaningful similarity comparisons between reports. We demonstrate that our
approach outperforms state-of-the-art embedding-based retrieval methods in a
radiograph classification task on MIMIC-CXR, particularly in long-tail
settings. Additionally, we use our pipeline to generate ontology-backed disease
labels for MIMIC-CXR, offering a valuable new resource for downstream learning
tasks. Our work provides more explainable, reliable, and task-specific
retrieval strategies in clinical AI systems, especially when interpretability
and domain knowledge integration are essential. Our code is available at
https://github.com/Felix-012/ontology-concept-distillation
Authors' comments: 10 pages, 3 figures, Preprint (submitted version, de-anonymized).
Accepted at MLMI (MICCAI Workshop) 2025. Version of Record to appear in
Springer LNCS; This preprint has not undergone peer review or any
post-submission improvements or corrections
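For reference, the standard Tversky index on which the abstract's similarity measure builds is

S(A, B) = w(A ∩ B) / (w(A ∩ B) + α · w(A \ B) + β · w(B \ A)),

where w sums per-concept weights over a set. The sketch below implements only this weighted baseline. It does not reproduce the paper's modifications for synonymy, negation, or hierarchy, and the concept identifiers are placeholders, not real UMLS CUIs.

```python
def tversky(a, b, alpha=0.5, beta=0.5, weights=None):
    """Weighted Tversky index between two sets of concept identifiers.

    `weights` optionally maps a concept to its importance (default 1.0).
    alpha penalises concepts found only in `a`, beta those found only in `b`.
    """
    a, b = set(a), set(b)

    def w(concept):
        return weights.get(concept, 1.0) if weights else 1.0

    common = sum(w(c) for c in a & b)
    only_a = sum(w(c) for c in a - b)
    only_b = sum(w(c) for c in b - a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 0.0

# Placeholder concept IDs standing in for linked entities from two reports.
report_1 = {"CONCEPT_X", "CONCEPT_Y"}
report_2 = {"CONCEPT_Y", "CONCEPT_Z"}

dice_like = tversky(report_1, report_2)                       # alpha = beta = 0.5
jaccard_like = tversky(report_1, report_2, alpha=1.0, beta=1.0)
```

Setting α = β = 1 recovers the Jaccard index and α = β = 0.5 the Dice coefficient; asymmetric values let a query report's missing concepts count more or less than a candidate's extra ones.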