Youchao Zhou, Heyan Huang, Yicheng Liu, Rui Dai, Xinglin Wang, Xingchen Zhang, Shumin Shi, Yang Deng
Existing Large Language Models (LLMs) occasionally generate plausible yet
factually incorrect responses, known as hallucinations. Researchers are
primarily using two approaches to mitigate hallucinations, namely Retrieval
Augmented Language Models (RALMs) and refusal post-training. However, current
research predominantly emphasizes their individual effectiveness while
overlooking the evaluation of the refusal capability of RALMs. In this study,
we ask the fundamental question: Do RALMs know when they don't know?
Specifically, we ask three questions. First, are RALMs well-calibrated
regarding different internal and external knowledge states? We examine the
influence of various factors. Contrary to expectations, we find that LLMs
exhibit significant over-refusal behavior. Then, how does refusal
post-training affect the over-refusal issue? We investigate the Refusal-aware
Instruction Tuning and In-Context Fine-tuning methods. Our results show that
the over-refusal problem is mitigated by In-Context Fine-tuning but magnified
by R-tuning. However, we also find that the refusal ability may conflict with
the quality of the answer. Finally, we develop a simple yet effective refusal
method for refusal post-trained models to improve their overall answer quality
in terms of refusal and correct answers. Our study provides a more
comprehensive understanding of the influence of important factors on RALM
systems.
Authors' comments: under review
Long Zhang, Peipei Song, Jianfeng Dong, Kun Li, Xun Yang
Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos
partially relevant to a given query. The core challenge lies in learning robust
query-video alignment against spurious semantic correlations arising from
inherent data uncertainty: 1) query ambiguity, where the query incompletely
characterizes the target video and often contains uninformative tokens, and 2)
partial video relevance, where abundant query-irrelevant segments introduce
contextual noise in cross-modal alignment. Existing methods often focus on
enhancing multi-scale clip representations and retrieving the most relevant
clip. However, the inherent data uncertainty in PRVR renders them vulnerable to
distractor videos with spurious similarities, leading to suboptimal
performance. To fill this research gap, we propose the Robust Alignment Learning
(RAL) framework, which explicitly models the uncertainty in data. Key
innovations include: 1) we pioneer probabilistic modeling for PRVR by encoding
videos and queries as multivariate Gaussian distributions. This not only
quantifies data uncertainty but also enables proxy-level matching to capture
the variability in cross-modal correspondences; 2) we consider the
heterogeneous informativeness of query words and introduce learnable confidence
gates to dynamically weight similarity. As a plug-and-play solution, RAL can be
seamlessly integrated into the existing architectures. Extensive experiments
across diverse retrieval backbones demonstrate its effectiveness.
Authors' comments: Accepted at EMNLP 2025
Yutian Xiao, Shukuan Wang, Binhao Wang, Zhao Zhang, Yanze Zhang, Shanqi Liu, Chao Feng, Xiang Li et al.
Click-through rate (CTR) prediction serves as a cornerstone of recommender systems. Despite the strong performance of current CTR models based on user behavior modeling, they are still severely limited by interaction sparsity, especially in low-active user scenarios. To address this issue, data augmentation of user behavior is a promising research direction. However, existing data augmentation methods heavily rely on collaborative signals while overlooking the rich multimodal features of items, leading to insufficient modeling of low-active users. To alleviate this problem, we propose a novel framework MARS (Modality-Aligned Retrieval for Sequence Augmented CTR Prediction). MARS utilizes a Stein kernel-based approach to align text and image features into a unified and unbiased semantic space to construct multimodal user embeddings. Subsequently, each low-active user's behavior sequence is augmented by retrieving, filtering, and concentrating the most similar behavior sequence of high-active users via multimodal user embeddings. Validated by extensive offline experiments and online A/B tests, our framework MARS consistently outperforms state-of-the-art baselines and achieves substantial growth on core business metrics within Kuaishou (https://www.kuaishou.com/). Consequently, MARS has been successfully deployed, serving the main traffic for hundreds of millions of users. To ensure reproducibility, we provide anonymous access to the implementation code (https://github.com/wangshukuan/MARS).
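The retrieval-based augmentation step described above can be illustrated with a minimal sketch. It is not the MARS implementation: the Stein kernel alignment is omitted, user embeddings are assumed to be given, the combination step is read as simple concatenation, and the names `augment_sequence` and `min_sim` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def augment_sequence(low_user, high_users, min_sim=0.5):
    """Retrieve the most similar high-active user, filter by a similarity
    threshold (hypothetical), and prepend that user's behavior sequence."""
    best = max(high_users, key=lambda u: cosine(low_user["emb"], u["emb"]), default=None)
    if best is None or cosine(low_user["emb"], best["emb"]) < min_sim:
        return list(low_user["seq"])  # nothing similar enough: keep the original
    return list(best["seq"]) + list(low_user["seq"])

# Toy data: embeddings and item ids are made up for illustration.
low = {"emb": [1.0, 0.0], "seq": ["v9"]}
high = [
    {"emb": [0.9, 0.1], "seq": ["v1", "v2"]},
    {"emb": [0.0, 1.0], "seq": ["v3"]},
]
augmented = augment_sequence(low, high)
```

Raising `min_sim` makes the filter stricter, so a low-active user with no close neighbor keeps an unmodified sequence.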
Jinwen Chen, Hainan Zhang, Liang Pang, Yongxin Tong, Haibo Zhou, Yuan Zhan, Wei Lin, Zhiming Zheng
Current RAG systems require uploading plaintext documents to the cloud, risking private data leakage. Parametric RAG (PRAG) addresses this by encoding documents as LoRA within LLMs, enabling reasoning without exposing raw content. However, it still faces two issues: (1) PRAG demands synthesizing QA pairs and fine-tuning the LLM for each individual document to create its corresponding LoRA, leading to unacceptable inference latency; (2) the performance of PRAG relies solely on synthetic QA data and lacks internal alignment with standard RAG, resulting in poor generalization on out-of-distribution (OOD) inputs. Therefore, achieving high-efficiency parameterization while maintaining RAG-level performance remains a critical challenge for privacy-preserving reasoning. In this paper, we propose DistilledPRAG, a generalizable knowledge-distilled parametric RAG model aligned with standard RAG in document structure and parameter activation. We first synthesize QA pairs from single and multi-documents to enhance cross-document reasoning. Then, we mask the plaintext documents with a special token and translate them to LoRA via a parameter generator, maintaining the standard RAG document structure. Finally, guided by synthetic QA data, we train the parameter generator to match standard RAG's hidden states and output logits, enabling RAG-style reasoning without original documents. Experiments on four QA datasets show that DistilledPRAG outperforms baselines in accuracy and generalizes well on OOD data.
Alexis Horde Vo, Matt Duckham, Estrid He, Rafe Benli
Who is the "Batman" behind "Batman Street" in Melbourne? Understanding the historical, cultural, and societal narratives behind place names can reveal the rich context that has shaped a community. Although place names serve as essential spatial references in gazetteers, they often lack information about place name origins. Enriching these place names in today's gazetteers is a time-consuming, manual process that requires extensive exploration of a vast archive of documents and text sources. Recent advances in natural language processing and language models (LMs) hold the promise of significant automation of identifying place name origins due to their powerful capability to exploit the semantics of the stored documents. This chapter presents a retrieval augmented generation pipeline designed to search for place name origins over a broad knowledge base, DBpedia. Given a spatial query, our approach first extracts sub-graphs that may contain knowledge relevant to the query; then ranks the extracted sub-graphs to generate the final answer to the query using fine-tuned LM-based models (i.e., ColBERTv2 and Llama2). Our results highlight the key challenges facing automated retrieval of place name origins, especially the tendency of language models to under-use the spatial information contained in texts as a discriminating factor. Our approach also frames the wider implications for geographic information retrieval using retrieval augmented generation.
Haomiao Tang, Wenjie Li, Yixiang Qiu, Genping Wang, Shu-Tao Xia
Despite the ubiquity of modern face retrieval systems, their retrieval stage is often outsourced to third-party entities, posing significant risks to user portrait privacy. Although homomorphic encryption (HE) offers strong security guarantees by enabling arithmetic computations in the cipher space, its high computational cost makes it unsuitable for real-time, real-world applications. To address this issue, we propose Cancelable Product Quantization, a highly efficient framework for secure face representation retrieval. Our hierarchical two-stage framework comprises: (i) a high-throughput cancelable PQ indexing module for fast candidate filtering, and (ii) a fine-grained cipher-space retrieval module for final precise face ranking. A tailored protection mechanism is designed to secure the indexing module for cancelable biometric authentication while ensuring efficiency. Experiments on benchmark datasets demonstrate that our method achieves a decent balance between effectiveness, efficiency, and security.
Authors' comments: 14 pages and 2 figures, accepted by PRCV2025
Sri Ram Macharla, Sridhar Murthy J, Anjaneyulu Pasala
MultiFluxAI is an innovative AI platform developed to address the challenges
of managing and integrating vast, disparate data sources in product engineering
across application domains. It addresses both current and new service-related
queries that enhance user engagement in the digital ecosystem. This platform
leverages advanced AI techniques, such as Generative AI, vectorization, and
agentic orchestration to provide dynamic and context-aware responses to complex
user queries.
Authors' comments: Abstract accepted for presentation at ACM ISEC 2025
Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
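A toy illustration of the dimension limit (not the paper's construction and not the LIMIT dataset): with one-dimensional embeddings and dot-product scoring, a query can only rank documents in ascending or descending order, so some top-2 sets can never be returned. The document values and the query grid below are arbitrary choices made for this sketch.

```python
from itertools import combinations

def top_k_set(query, docs, k):
    """Indices of the k documents with the highest dot-product score."""
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return frozenset(ranked[:k])

# Three documents embedded in a single dimension (illustrative values).
docs = [(1.0,), (2.0,), (3.0,)]
k = 2

# Sweep a grid of nonzero 1-D queries and record every top-2 set that appears.
queries = [(x / 10.0,) for x in range(-50, 51) if x != 0]
reachable = {top_k_set(q, docs, k) for q in queries}

# Compare against all possible 2-element subsets of the three documents.
all_pairs = {frozenset(c) for c in combinations(range(len(docs)), k)}
unreachable = all_pairs - reachable
```

Here `unreachable` contains the pair of the smallest and largest document: no 1-D query places both above the middle one, regardless of how the query is chosen.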
Haoyu Wu, Qingcheng Zeng, Kaize Ding
Dense retrievers and rerankers are central to retrieval-augmented generation
(RAG) pipelines, where accurately retrieving factual information is crucial for
maintaining system trustworthiness and defending against RAG poisoning.
However, little is known about how much factual competence these components
inherit or lose from the large language models (LLMs) they are based on. We
pair 12 publicly released embedding checkpoints with their original base LLMs
and evaluate both sets on a factuality benchmark. Across every model evaluated,
the embedding variants achieve markedly lower accuracy than their bases, with
absolute drops ranging from 12 to 43 percentage points (median 28 pts) and
typical retriever accuracies collapsing into the 25-35% band versus the 60-70%
attained by the generative models. This degradation intensifies under a more
demanding condition: when the candidate pool per question is expanded from four
options to one thousand, the strongest retriever's top-1 accuracy falls from 33%
to 26%, revealing acute sensitivity to distractor volume. Statistical tests
further show that, for every embedding model, cosine-similarity scores between
queries and correct completions are significantly higher than those for
incorrect ones (p < 0.01), indicating decisions driven largely by surface-level
semantic proximity rather than factual reasoning. To probe this weakness, we
employed GPT-4.1 to paraphrase each correct completion, creating a rewritten
test set that preserved factual truth while masking lexical cues, and observed
that over two-thirds of previously correct predictions flipped to wrong,
reducing overall accuracy to roughly one-third of its original level. Taken
together, these findings reveal a systematic trade-off introduced by
contrastive learning for retrievers: gains in semantic retrieval are paid for
with losses in parametric factual knowledge......
Authors' comments: Proceedings of the 34th ACM International Conference on Information
and Knowledge Management
Yijia Sun, Shanshan Huang, Linxiao Che, Haitao Lu, Qiang Luo, Kun Gai, Guorui Zhou
Modern industrial recommendation systems encounter a core challenge of
multi-stage optimization misalignment: a significant semantic gap exists
between the multi-objective optimization paradigm widely used in the ranking
phase and the single-objective modeling in the retrieval phase. Although the
mainstream industry solution achieves multi-objective coverage through parallel
multi-path single-objective retrieval, this approach leads to linear growth of
training and serving resources with the number of objectives and has inherent
limitations in handling loosely coupled objectives. This paper proposes
MPFormer, a dynamic multi-task Transformer framework, which systematically
addresses the aforementioned issues through three innovative mechanisms. First,
an objective-conditioned Transformer jointly encodes user behavior
sequences and multi-task semantics through learnable attention modulation;
second, personalized target weights are introduced to achieve dynamic
adjustment of retrieval results; finally, user personalization information is
incorporated into token representations and the Transformer structure to
further enhance the model's representation ability. This framework has been
successfully integrated into the Kuaishou short-video recommendation system, stably
serving over 400 million daily active users. It significantly improves user
daily engagement and system operational efficiency. Practical deployment
verification shows that, compared with traditional solutions, it effectively
optimizes the iterative paradigm of multi-objective retrieval while maintaining
service response speed, providing a scalable multi-objective solution for
industrial recommendation systems.
Authors' comments: CIKM 2025
Boheng Mao
Legal text classification is a fundamental NLP task in the legal domain. Benchmark datasets in this area often exhibit a long-tail label distribution, where many labels are underrepresented, leading to poor model performance on rare classes. This paper proposes Selective Retrieval-Augmentation (SRA) as a solution to this problem. SRA focuses on augmenting samples belonging to low-frequency labels in the training set, preventing the introduction of noise for well-represented classes, and requires no changes to the model architecture. Retrieval is performed only from the training data, which avoids potential information leakage and also removes the need for external corpora. The proposed SRA method is tested on two legal text classification benchmark datasets with long-tail distributions: LEDGAR (single-label) and UNFAIR-ToS (multi-label). The results indicate that SRA attains higher micro-F1 and macro-F1 scores compared to all current LexGLUE baselines across both datasets, illustrating consistent improvements in long-tail legal text classification. The code repository is available at: https://github.com/Boheng-Mao/sra-legal
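The selective idea (augment only rare-label samples, retrieving only from the training set) can be sketched as follows. This is an illustration under assumptions, not the SRA code: the Jaccard token-overlap retriever, the `max_count` frequency threshold, the `[SEP]` joiner, and the single-label setting are all choices made for this sketch.

```python
from collections import Counter

def jaccard(a, b):
    """Token-overlap similarity between two texts (stand-in retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def selectively_augment(train, max_count, sep=" [SEP] "):
    """train: list of (text, label). Only samples whose label occurs at most
    max_count times get the nearest other training text appended."""
    freq = Counter(label for _, label in train)
    out = []
    for i, (text, label) in enumerate(train):
        others = [j for j in range(len(train)) if j != i]
        if freq[label] > max_count or not others:
            out.append((text, label))  # well-represented label: leave untouched
            continue
        best = max(others, key=lambda j: jaccard(text, train[j][0]))
        out.append((text + sep + train[best][0], label))
    return out

# Toy training set (made-up clauses and labels).
train = [
    ("the tenant shall pay rent monthly", "payments"),
    ("the tenant shall pay all fees", "payments"),
    ("the landlord shall pay taxes", "payments"),
    ("this agreement is governed by the laws of ohio", "governing_law"),
]
augmented = selectively_augment(train, max_count=1)
```

Only the `governing_law` sample changes; the three `payments` samples pass through unmodified, which mirrors the goal of not adding noise to well-represented classes.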
Felix Nützel, Mischa Dombrowski, Bernhard Kainz
Retrieval-augmented learning based on radiology reports has emerged as a
promising direction to improve performance on long-tail medical imaging tasks,
such as rare disease detection in chest X-rays. Most existing methods rely on
comparing high-dimensional text embeddings from models like CLIP or CXR-BERT,
which are often difficult to interpret, computationally expensive, and not
well-aligned with the structured nature of medical knowledge. We propose a
novel, ontology-driven alternative for comparing radiology report texts based
on clinically grounded concepts from the Unified Medical Language System
(UMLS). Our method extracts standardised medical entities from free-text
reports using an enhanced pipeline built on RadGraph-XL and SapBERT. These
entities are linked to UMLS concepts (CUIs), enabling a transparent,
interpretable set-based representation of each report. We then define a
task-adaptive similarity measure based on a modified and weighted version of
the Tversky Index that accounts for synonymy, negation, and hierarchical
relationships between medical entities. This allows efficient and semantically
meaningful similarity comparisons between reports. We demonstrate that our
approach outperforms state-of-the-art embedding-based retrieval methods in a
radiograph classification task on MIMIC-CXR, particularly in long-tail
settings. Additionally, we use our pipeline to generate ontology-backed disease
labels for MIMIC-CXR, offering a valuable new resource for downstream learning
tasks. Our work provides more explainable, reliable, and task-specific
retrieval strategies in clinical AI systems, especially when interpretability
and domain knowledge integration are essential. Our code is available at
https://github.com/Felix-012/ontology-concept-distillation
Authors' comments: 10 pages, 3 figures, Preprint (submitted version, de-anonymized).
Accepted at MLMI (MICCAI Workshop) 2025. Version of Record to appear in
Springer LNCS; This preprint has not undergone peer review or any
post-submission improvements or corrections
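For reference, the standard Tversky index on two sets A and B is |A ∩ B| / (|A ∩ B| + α|A \ B| + β|B \ A|). The sketch below implements that standard form with an optional per-concept weight. The weighting scheme is an assumption made for illustration; the authors' modified version, including its handling of synonymy, negation, and hierarchy, is not reproduced here, and the CUI strings are placeholders.

```python
def tversky_index(a, b, alpha=0.5, beta=0.5, weights=None):
    """Tversky index between two concept sets.

    weights (optional, hypothetical) maps a concept to its importance;
    concepts without an entry count as 1.0.
    """
    a, b = set(a), set(b)

    def w(concept):
        return weights.get(concept, 1.0) if weights else 1.0

    common = sum(w(c) for c in a & b)
    only_a = sum(w(c) for c in a - b)
    only_b = sum(w(c) for c in b - a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 0.0  # two empty sets: defined as 0.0 here

# Placeholder CUIs standing in for concepts extracted from two reports.
report_a = {"C0000001", "C0000002", "C0000003"}
report_b = {"C0000002", "C0000003", "C0000004"}

dice_like = tversky_index(report_a, report_b)                      # alpha = beta = 0.5
jaccard_like = tversky_index(report_a, report_b, alpha=1, beta=1)  # alpha = beta = 1
weighted = tversky_index(report_a, report_b, weights={"C0000002": 3.0})
```

With α = β = 1 the index reduces to the Jaccard coefficient, and with α = β = 0.5 to the Dice coefficient; unequal values make the comparison asymmetric between query and candidate report.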
Eric López, Artemis Llabrés, Ernest Valveny
Document Visual Question Answering (Document VQA) must cope with documents
that span dozens of pages, yet leading systems still concatenate every page or
rely on very large vision-language models, both of which are memory-hungry.
Retrieval-Augmented Generation (RAG) offers an attractive alternative, first
retrieving a concise set of relevant segments before generating answers from
this selected evidence. In this paper, we systematically evaluate the impact of
incorporating RAG into Document VQA through different retrieval variants -
text-based retrieval using OCR tokens and purely visual retrieval without OCR -
across multiple models and benchmarks. Evaluated on the multi-page datasets
MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the
"concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant
achieves +5.0 ANLS improvement without requiring any text extraction. An
ablation confirms that retrieval and reranking components drive most of the
gain, whereas the layout-guided chunking strategy - proposed in several recent
works to leverage page structure - fails to help on these datasets. Our
experiments demonstrate that careful evidence selection consistently boosts
accuracy across multiple model sizes and multi-page benchmarks, underscoring
its practical value for real-world Document VQA.
Authors' comments: Accepted at Workshop on Machine Learning in Document Analysis and
Recognition (ICDAR WML 2025), Wuhan, China
Yali Dong, Rui Liu, Heying Wang
In this paper, we focus on the problem of phase retrieval from intensity measurements of the Short-Time Linear Canonical Transform (STLCT). Specifically, we show that the STLCT allows for the unique recovery of any square-integrable function through phaseless STLCT sampling on rectangular square-root lattices. For uniform lattices, we establish counterexamples to STLCT phase retrieval in $L^2(\mathbb{R})$. Nevertheless, for functions in band-limited function spaces, phase retrieval results on uniform lattices can still be accomplished.
Runpeng Geng, Yanting Wang, Ying Chen, Jinyuan Jia
Retrieval-augmented generation (RAG) systems are widely deployed in
real-world applications in diverse domains such as finance, healthcare, and
cybersecurity. However, many studies have shown that they are vulnerable to
knowledge corruption attacks, where an attacker can inject adversarial texts
into the knowledge database of a RAG system to induce the LLM to generate
attacker-desired outputs. Existing studies mainly focus on attacking specific
queries or queries with similar topics (or keywords). In this work, we propose
UniC-RAG, a universal knowledge corruption attack against RAG systems. Unlike
prior work, UniC-RAG jointly optimizes a small number of adversarial texts that
can simultaneously attack a large number of user queries with diverse topics
and domains, enabling an attacker to achieve various malicious objectives, such
as directing users to malicious websites, triggering harmful command execution,
or launching denial-of-service attacks. We formulate UniC-RAG as an
optimization problem and further design an effective solution to solve it,
including a balanced similarity-based clustering method to enhance the attack's
effectiveness. Our extensive evaluations demonstrate that UniC-RAG is highly
effective and significantly outperforms baselines. For instance, UniC-RAG could
achieve over 90% attack success rate by injecting 100 adversarial texts into a
knowledge database with millions of texts to simultaneously attack a large set
of user queries (e.g., 2,000). Additionally, we evaluate existing defenses and
show that they are insufficient to defend against UniC-RAG, highlighting the
need for new defense mechanisms in RAG systems.
Authors' comments: 21 pages, 4 figures
Yuqicheng Zhu, Nico Potyka, Daniel Hernández, Yuan He, Zifeng Ding, Bo Xiong, Dongzhuoran Zhou, Evgeny Kharlamov et al.
Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge, yet suffers from critical limitations in high-stakes domains -- namely, sensitivity to noisy or contradictory evidence and opaque, stochastic decision-making. We propose ArgRAG, an explainable and contestable alternative that replaces black-box reasoning with structured inference using a Quantitative Bipolar Argumentation Framework (QBAF). ArgRAG constructs a QBAF from retrieved documents and performs deterministic reasoning under gradual semantics. This allows faithfully explaining and contesting decisions. Evaluated on two fact verification benchmarks, PubHealth and RAGuard, ArgRAG achieves strong accuracy while significantly improving transparency.
Jiyoon Myung, Jihyeon Park, Joohyung Han
User queries in real-world recommendation systems often combine structured
constraints (e.g., category, attributes) with unstructured preferences (e.g.,
product descriptions or reviews). We introduce HyST (Hybrid retrieval over
Semi-structured Tabular data), a hybrid retrieval framework that combines
LLM-powered structured filtering with semantic embedding search to support
complex information needs over semi-structured tabular data. HyST extracts
attribute-level constraints from natural language using large language models
(LLMs) and applies them as metadata filters, while processing the remaining
unstructured query components via embedding-based retrieval. Experiments on a
semi-structured benchmark show that HyST consistently outperforms traditional
baselines, highlighting the importance of structured filtering in improving
retrieval precision, offering a scalable and accurate solution for real-world
user queries.
Authors' comments: Accepted at the 2nd EARL Workshop on Evaluating and Applying
Recommender Systems with Large Language Models (RecSys 2025)
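The filter-then-search pattern described above can be sketched as follows. This is a simplified illustration, not the HyST implementation: the LLM extraction step is replaced by a hand-written `constraints` dictionary, the embeddings are tiny made-up vectors, and `hybrid_search` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_search(items, constraints, query_vec, top_k=3):
    """Apply exact-match metadata filters first, then rank the survivors
    by embedding similarity to the unstructured part of the query."""
    candidates = [
        it for it in items
        if all(it["meta"].get(key) == value for key, value in constraints.items())
    ]
    candidates.sort(key=lambda it: cosine(query_vec, it["vec"]), reverse=True)
    return [it["id"] for it in candidates[:top_k]]

# Toy catalogue; in the paper's setting the constraints would come from an LLM.
items = [
    {"id": "a", "meta": {"category": "shoes", "color": "red"}, "vec": [1.0, 0.0]},
    {"id": "b", "meta": {"category": "shoes", "color": "blue"}, "vec": [0.9, 0.1]},
    {"id": "c", "meta": {"category": "shoes", "color": "red"}, "vec": [0.0, 1.0]},
    {"id": "d", "meta": {"category": "bags", "color": "red"}, "vec": [1.0, 0.0]},
]
constraints = {"category": "shoes", "color": "red"}
results = hybrid_search(items, constraints, query_vec=[0.8, 0.2])
```

Item `d` has an embedding identical to `a` but is excluded by the category filter, which is the precision benefit the structured step is meant to provide.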
Wei Huang, Keping Bi, Yinqiong Cai, Wei Chen, Jiafeng Guo, Xueqi Cheng
As more content generated by large language models (LLMs) floods into the Internet, information retrieval (IR) systems now face the challenge of distinguishing and handling a blend of human-authored and machine-generated texts. Recent studies suggest that neural retrievers may exhibit a preferential inclination toward LLM-generated content, while classic term-based retrievers like BM25 tend to favor human-written documents. This paper investigates the influence of LLM-generated content on term-based retrieval models, which are valued for their efficiency and robust generalization across domains. Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes, higher term specificity, and greater document-level diversity. These traits are aligned with LLMs being trained to optimize reader experience through diverse and precise expressions. Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries, rather than displaying an inherent source bias. This work provides a foundation for understanding and addressing potential biases in term-based IR systems managing mixed-source content.
Jacob Portes, Connor Jennings, Erica Ji Yuen, Sasha Doubov, Michael Carbin
How does retrieval performance scale with pretraining FLOPs? We benchmark
retrieval performance across LLM model sizes from 125 million parameters to 7
billion parameters pretrained on datasets ranging from 1 billion tokens to more
than 2 trillion tokens. We find that retrieval performance on zero-shot BEIR
tasks predictably scales with LLM size, training duration, and estimated FLOPs.
We also show that In-Context Learning scores are strongly correlated with
retrieval scores across retrieval tasks. Finally, we highlight the implications
this has for the development of LLM-based retrievers.
Authors' comments: 15 pages, 4 figures
Xiaqiang Tang, Yi Wang, Keyu Hu, Rui Xu, Chuang Li, Weigao Sun, Jian Li, Sihong Xie
Retrieval-Augmented Generation (RAG) systems require Large Language Models
(LLMs) to generate responses that are faithful to the retrieved context.
However, faithfulness hallucination remains a critical challenge, as existing
methods often require costly supervision and post-training or significant
inference burdens. To overcome these limitations, we introduce Self-Supervised
Faithfulness Optimization (SSFO), the first self-supervised alignment approach
for enhancing RAG faithfulness. SSFO constructs preference data pairs by
contrasting the model's outputs generated with and without the context.
Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness
without incurring labeling costs or additional inference burden. We
theoretically and empirically demonstrate that SSFO leverages a benign form of
likelihood displacement, transferring probability mass from
parametric-based tokens to context-aligned tokens. Based on this insight, we
propose a modified DPO loss function to encourage likelihood displacement.
Comprehensive evaluations show that SSFO significantly outperforms existing
methods, achieving state-of-the-art faithfulness on multiple context-based
question-answering datasets. Notably, SSFO exhibits strong generalization,
improving cross-lingual faithfulness and preserving general
instruction-following capabilities. We release our code and model at the
anonymous link: https://github.com/chkwy/SSFO
Authors' comments: Work in progress
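As background for the SSFO entry above, the standard DPO loss for a single preference pair can be written in a few lines. The sketch below shows only the standard loss, not the modified loss proposed in the paper. Treating the with-context output as "chosen" and the without-context output as "rejected" is this sketch's reading of the abstract, and the log-probability values are made up.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)), where
    margin is the policy-vs-reference log-ratio of the chosen response
    minus that of the rejected response."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Illustrative sequence log-probabilities (not measured values).
# chosen   = response generated with the retrieved context (assumed)
# rejected = response generated without the context (assumed)
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)      # policy equals reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)     # policy prefers the chosen response
```

When the policy matches the reference, the loss equals log 2; it falls as the policy shifts probability toward the chosen response relative to the reference.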