Eric López, Artemis Llabrés, Ernest Valveny
Document Visual Question Answering (Document VQA) must cope with documents
that span dozens of pages, yet leading systems still concatenate every page or
rely on very large vision-language models, both of which are memory-hungry.
Retrieval-Augmented Generation (RAG) offers an attractive alternative, first
retrieving a concise set of relevant segments before generating answers from
this selected evidence. In this paper, we systematically evaluate the impact of
incorporating RAG into Document VQA through different retrieval variants -
text-based retrieval using OCR tokens and purely visual retrieval without OCR -
across multiple models and benchmarks. Evaluated on the multi-page datasets
MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the
"concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant
achieves +5.0 ANLS improvement without requiring any text extraction. An
ablation confirms that retrieval and reranking components drive most of the
gain, whereas the layout-guided chunking strategy - proposed in several recent
works to leverage page structure - fails to help on these datasets. Our
experiments demonstrate that careful evidence selection consistently boosts
accuracy across multiple model sizes and multi-page benchmarks, underscoring
its practical value for real-world Document VQA.
Authors' comments: Accepted at Workshop on Machine Learning in Document Analysis and
Recognition (ICDAR WML 2025), Wuhan, China
Yali Dong, Rui Liu, Heying Wang
In this paper, we focus on the problem of phase retrieval from intensity measurements of the Short-Time Linear Canonical Transform (STLCT). Specifically, we show that the STLCT allows for the unique recovery of any square-integrable function through phaseless STLCT sampling on rectangular square-root lattices. When turning to the uniform lattices, we establish counterexamples about the STLCT phase retrieval problems in L2(R). Nevertheless, for functions in band-limited function spaces, phase retrieval results on uniform lattices can still be accomplished.
Runpeng Geng, Yanting Wang, Ying Chen, Jinyuan Jia
Retrieval-augmented generation (RAG) systems are widely deployed in
real-world applications in diverse domains such as finance, healthcare, and
cybersecurity. However, many studies showed that they are vulnerable to
knowledge corruption attacks, where an attacker can inject adversarial texts
into the knowledge database of a RAG system to induce the LLM to generate
attacker-desired outputs. Existing studies mainly focus on attacking specific
queries or queries with similar topics (or keywords). In this work, we propose
UniC-RAG, a universal knowledge corruption attack against RAG systems. Unlike
prior work, UniC-RAG jointly optimizes a small number of adversarial texts that
can simultaneously attack a large number of user queries with diverse topics
and domains, enabling an attacker to achieve various malicious objectives, such
as directing users to malicious websites, triggering harmful command execution,
or launching denial-of-service attacks. We formulate UniC-RAG as an
optimization problem and further design an effective solution to solve it,
including a balanced similarity-based clustering method to enhance the attack's
effectiveness. Our extensive evaluations demonstrate that UniC-RAG is highly
effective and significantly outperforms baselines. For instance, UniC-RAG could
achieve over 90% attack success rate by injecting 100 adversarial texts into a
knowledge database with millions of texts to simultaneously attack a large set
of user queries (e.g., 2,000). Additionally, we evaluate existing defenses and
show that they are insufficient to defend against UniC-RAG, highlighting the
need for new defense mechanisms in RAG systems.
Authors' comments: 21 pages, 4 figures
Yuqicheng Zhu, Nico Potyka, Daniel Hernández, Yuan He, Zifeng Ding, Bo Xiong, Dongzhuoran Zhou, Evgeny Kharlamov et al.
Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge, yet suffers from critical limitations in high-stakes domains -- namely, sensitivity to noisy or contradictory evidence and opaque, stochastic decision-making. We propose ArgRAG, an explainable, and contestable alternative that replaces black-box reasoning with structured inference using a Quantitative Bipolar Argumentation Framework (QBAF). ArgRAG constructs a QBAF from retrieved documents and performs deterministic reasoning under gradual semantics. This allows faithfully explaining and contesting decisions. Evaluated on two fact verification benchmarks, PubHealth and RAGuard, ArgRAG achieves strong accuracy while significantly improving transparency.
Jiyoon Myung, Jihyeon Park, Joohyung Han
User queries in real-world recommendation systems often combine structured
constraints (e.g., category, attributes) with unstructured preferences (e.g.,
product descriptions or reviews). We introduce HyST (Hybrid retrieval over
Semi-structured Tabular data), a hybrid retrieval framework that combines
LLM-powered structured filtering with semantic embedding search to support
complex information needs over semi-structured tabular data. HyST extracts
attribute-level constraints from natural language using large language models
(LLMs) and applies them as metadata filters, while processing the remaining
unstructured query components via embedding-based retrieval. Experiments on a
semi-structured benchmark show that HyST consistently outperforms tradtional
baselines, highlighting the importance of structured filtering in improving
retrieval precision, offering a scalable and accurate solution for real-world
user queries.
Authors' comments: Accepted at the 2nd EARL Workshop on Evaluating and Applying
Recommender Systems with Large Language Models (RecSys 2025)
Wei Huang, Keping Bi, Yinqiong Cai, Wei Chen, Jiafeng Guo, Xueqi Cheng
As more content generated by large language models (LLMs) floods into the Internet, information retrieval (IR) systems now face the challenge of distinguishing and handling a blend of human-authored and machine-generated texts. Recent studies suggest that neural retrievers may exhibit a preferential inclination toward LLM-generated content, while classic term-based retrievers like BM25 tend to favor human-written documents. This paper investigates the influence of LLM-generated content on term-based retrieval models, which are valued for their efficiency and robust generalization across domains. Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes, higher term specificity, and greater document-level diversity. These traits are aligned with LLMs being trained to optimize reader experience through diverse and precise expressions. Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries, rather than displaying an inherent source bias. This work provides a foundation for understanding and addressing potential biases in term-based IR systems managing mixed-source content.
Jacob Portes, Connor Jennings, Erica Ji Yuen, Sasha Doubov, Michael Carbin
How does retrieval performance scale with pretraining FLOPs? We benchmark
retrieval performance across LLM model sizes from 125 million parameters to 7
billion parameters pretrained on datasets ranging from 1 billion tokens to more
than 2 trillion tokens. We find that retrieval performance on zero-shot BEIR
tasks predictably scales with LLM size, training duration, and estimated FLOPs.
We also show that In-Context Learning scores are strongly correlated with
retrieval scores across retrieval tasks. Finally, we highlight the implications
this has for the development of LLM-based retrievers.
Authors' comments: 15 pages, 4 figures
Xiaqiang Tang, Yi Wang, Keyu Hu, Rui Xu, Chuang Li, Weigao Sun, Jian Li, Sihong Xie
Retrieval-Augmented Generation (RAG) systems require Large Language Models
(LLMs) to generate responses that are faithful to the retrieved context.
However, faithfulness hallucination remains a critical challenge, as existing
methods often require costly supervision and post-training or significant
inference burdens. To overcome these limitations, we introduce Self-Supervised
Faithfulness Optimization (SSFO), the first self-supervised alignment approach
for enhancing RAG faithfulness. SSFO constructs preference data pairs by
contrasting the model's outputs generated with and without the context.
Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness
without incurring labeling costs or additional inference burden. We
theoretically and empirically demonstrate that SSFO leverages a benign form of
\emph{likelihood displacement}, transferring probability mass from
parametric-based tokens to context-aligned tokens. Based on this insight, we
propose a modified DPO loss function to encourage likelihood displacement.
Comprehensive evaluations show that SSFO significantly outperforms existing
methods, achieving state-of-the-art faithfulness on multiple context-based
question-answering datasets. Notably, SSFO exhibits strong generalization,
improving cross-lingual faithfulness and preserving general
instruction-following capabilities. We release our code and model at the
anonymous link: https://github.com/chkwy/SSFO
Authors' comments: Working in progress
Jiale Liu, Jiahao Zhang, Suhang Wang
Retrieval-Augmented Generation (RAG) is a powerful technique for enhancing Large Language Models (LLMs) with external, up-to-date knowledge. Graph RAG has emerged as an advanced paradigm that leverages graph-based knowledge structures to provide more coherent and contextually rich answers. However, the move from plain document retrieval to structured graph traversal introduces new, under-explored privacy risks. This paper investigates the data extraction vulnerabilities of the Graph RAG systems. We design and execute tailored data extraction attacks to probe their susceptibility to leaking both raw text and structured data, such as entities and their relationships. Our findings reveal a critical trade-off: while Graph RAG systems may reduce raw text leakage, they are significantly more vulnerable to the extraction of structured entity and relationship information. We also explore potential defense mechanisms to mitigate these novel attack surfaces. This work provides a foundational analysis of the unique privacy challenges in Graph RAG and offers insights for building more secure systems.
Ziqiang Cui, Yunpeng Weng, Xing Tang, Peiyang Liu, Shiwei Li, Bowei He, Jiamin Chen, Xiuqiang He et al.
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the timeliness of knowledge and the factual accuracy of responses in Large Language Models (LLMs). However, the inclusion of excessive retrieved documents substantially increases the input length, leading to higher computational costs. Previous studies have attempted to compress retrieved documents into shorter texts before in-context integration, but such methods often compromise end-task performance. The lack of well-defined compression targets forces many approaches to rely on fixed heuristics, which cannot guarantee that the compressed content will effectively support the end task. To address these limitations, we propose CORE, a novel method designed to achieve lossless context compression for RAG. CORE employs reinforcement learning to optimize the compression process without relying on predefined compression labels. Specifically, it utilizes end-task performance as a reward signal and applies Generalized Reinforcement Learning Policy Optimization (GRPO) to train the compressor. This end-to-end training framework enables the compressor to generate summaries that maximize the accuracy of answers generated by the LLM. Extensive experiments on four datasets demonstrate the superiority of our approach. With a high compression ratio of 3\%, our method not only avoids performance degradation compared to prepending full documents across all datasets but also improves the average Exact Match (EM) score by 3.3 points. The code will be released soon.
Yejin Choi, Jaewoo Park, Janghan Yoon, Saejin Kim, Jaehyun Jeon, Youngjae Yu
Rapid advances in Multimodal Large Language Models (MLLMs) have expanded information retrieval beyond purely textual inputs, enabling retrieval from complex real world documents that combine text and visuals. However, most documents are private either owned by individuals or confined within corporate silos and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross modal pre questions (preQs) before retrieval. Unlike earlier multimodal retrievers that compare embeddings in a single vector space, PREMIR leverages preQs from multiple complementary modalities to expand the scope of matching to the token level. Experiments show that PREMIR achieves state of the art performance on out of distribution benchmarks, including closed domain and multilingual settings, outperforming strong baselines across all retrieval metrics. We confirm the contribution of each component through in depth ablation studies, and qualitative analyses of the generated preQs further highlight the model's robustness in real world settings.
Hongtao Lin, Haoyu Chen, Jaewon Jang, Jiajing Xu
User-to-item retrieval has been an active research area in recommendation system, and two tower models are widely adopted due to model simplicity and serving efficiency. In this work, we focus on a variant called \textit{conditional retrieval}, where we expect retrieved items to be relevant to a condition (e.g. topic). We propose a method that uses the same training data as standard two tower models but incorporates item-side information as conditions in query. This allows us to bootstrap new conditional retrieval use cases and encourages feature interactions between user and condition. Experiments show that our method can retrieve highly relevant items and outperforms standard two tower models with filters on engagement metrics. The proposed model is deployed to power a topic-based notification feed at Pinterest and led to +0.26\% weekly active users.
Jiawen Lyu, Manu Ramesh, Madison Simonds, Jacquelyn P. Boerman, Amy R. Reibman
Few automated video systems are described in the open literature that enable
hands-free cataloging and identification (ID) of cows in a dairy herd. In this
work, we describe our system, composed of an AutoCattloger, which builds a
Cattlog of dairy cows in a herd with a single input video clip per cow, an
eidetic cow recognizer which uses no deep learning to ID cows, and a CowFinder,
which IDs cows in a continuous stream of video. We demonstrate its value in
finding individuals in unlabeled, unsegmented videos of cows walking
unconstrained through the holding area of a milking parlor.
Authors' comments: Extended abstract. Presented at the 3rd US Conference on Precision
Livestock Farming (USPLF), 2025, Lincoln NE
Ren Qin, Chai Zheng, Xiao Xijun, Zheng Yuchao, Wu Di
Precisely modeling user ultra-long sequences is critical for industrial recommender systems. Current approaches predominantly focus on leveraging ultra-long sequences in the ranking stage, whereas research for the candidate retrieval stage remains under-explored. This paper presents LongRetriever, a practical framework for incorporating ultra-long sequences into the retrieval stage of recommenders. Specifically, we propose in-context training and multi-context retrieval, which enable candidate-specific interaction between user sequence and candidate item, and ensure training-serving consistency under the search-based paradigm. Extensive online A/B testing conducted on a large-scale e-commerce platform demonstrates statistically significant improvements, confirming the framework's effectiveness. Currently, LongRetriever has been fully deployed in the platform, impacting billions of users.
Mandeep Rathee, Venktesh V, Sean MacAvaney, Avishek Anand
Retrieval-Augmented Generation (RAG) has emerged as a standard framework for
knowledge-intensive NLP tasks, combining large language models (LLMs) with
document retrieval from external corpora. Despite its widespread use, most RAG
pipelines continue to treat retrieval and reasoning as isolated components,
retrieving documents once and then generating answers without further
interaction. This static design often limits performance on complex tasks that
require iterative evidence gathering or high-precision retrieval. Recent work
in both the information retrieval (IR) and NLP communities has begun to close
this gap by introducing adaptive retrieval and ranking methods that incorporate
feedback. In this survey, we present a structured overview of advanced
retrieval and ranking mechanisms that integrate such feedback. We categorize
feedback signals based on their source and role in improving the query,
retrieved context, or document pool. By consolidating these developments, we
aim to bridge IR and NLP perspectives and highlight retrieval as a dynamic,
learnable component of end-to-end RAG systems.
Authors' comments: 18 pages, 1 figure
Eunseong Choi, June Park, Hyeri Lee, Jongwuk Lee
Retrieval-augmented generation (RAG) enhances the capabilities of large
language models (LLMs) by incorporating external knowledge into their input
prompts. However, when the retrieved context contradicts the LLM's parametric
knowledge, it often fails to resolve the conflict between incorrect external
context and correct parametric knowledge, known as context-memory conflict. To
tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation
(CARE), consisting of a context assessor and a base LLM. The context assessor
encodes compact memory token embeddings from raw context tokens. Through
grounded/adversarial soft prompting, the context assessor is trained to discern
unreliable context and capture a guidance signal that directs reasoning toward
the more reliable knowledge source. Extensive experiments show that CARE
effectively mitigates context-memory conflicts, leading to an average
performance gain of 5.0\% on QA and fact-checking benchmarks, establishing a
promising direction for trustworthy and adaptive RAG systems.
Authors' comments: Accepted to EMNLP 2025; 14 pages; 5 figures, 11 tables
Shiyi Yang, Xinshu Li, Guanglin Zhou, Chen Wang, Xiwei Xu, Liming Zhu, Lina Yao
Recent studies have shown that recommender systems (RSs) are highly vulnerable to data poisoning attacks, where malicious actors inject fake user profiles, including a group of well-designed fake ratings, to manipulate recommendations. Due to security and privacy constraints in practice, attackers typically possess limited knowledge of the victim system and thus need to craft profiles that have transferability across black-box RSs. To maximize the attack impact, the profiles often remains imperceptible. However, generating such high-quality profiles with the restricted resources is challenging. Some works suggest incorporating fake textual reviews to strengthen the profiles; yet, the poor quality of the reviews largely undermines the attack effectiveness and imperceptibility under the practical setting. To tackle the above challenges, in this paper, we propose to enhance the quality of the review text by harnessing in-context learning (ICL) capabilities of multimodal foundation models. To this end, we introduce a demonstration retrieval algorithm and a text style transfer strategy to augment the navie ICL. Specifically, we propose a novel practical attack framework named RAGAN to generate high-quality fake user profiles, which can gain insights into the robustness of RSs. The profiles are generated by a jailbreaker and collaboratively optimized on an instructional agent and a guardian to improve the attack transferability and imperceptibility. Comprehensive experiments on various real-world datasets demonstrate that RAGAN achieves the state-of-the-art poisoning attack performance.
Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
Composed video retrieval is a challenging task that strives to retrieve a
target video based on a query video and a textual description detailing
specific modifications. Standard retrieval frameworks typically struggle to
handle the complexity of fine-grained compositional queries and variations in
temporal understanding limiting their retrieval ability in the fine-grained
setting. To address this issue, we introduce a novel dataset that captures both
fine-grained and composed actions across diverse video segments, enabling more
detailed compositional changes in retrieved video content. The proposed
dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense
modification text that is around seven times more than its existing
counterpart. We further develop a new model that integrates visual and textual
information through Cross-Attention (CA) fusion using grounded text encoder,
enabling precise alignment between dense query modifications and target videos.
The proposed model achieves state-of-the-art results surpassing existing
methods on all metrics. Notably, it achieves 71.3\% Recall@1 in visual+text
setting and outperforms the state-of-the-art by 3.4\%, highlighting its
efficacy in terms of leveraging detailed video descriptions and dense
modification texts. Our proposed dataset, code, and model are available at
:https://github.com/OmkarThawakar/BSE-CoVR
Authors' comments: Accepted to ICCV-2025
Matey Krastev, Miklos Hamar, Danilo Toapanta, Jesse Brouwers, Yibin Lei
This work revisits and extends synthetic query generation pipelines for Neural Information Retrieval (NIR) by leveraging the InPars Toolkit, a reproducible, end-to-end framework for generating training data using large language models (LLMs). We first assess the reproducibility of the original InPars, InPars-V2, and Promptagator pipelines on the SciFact benchmark and validate their effectiveness using open-source reranker and generator models. Building on this foundation, we introduce two key extensions to the pipeline: (1) fine-tuning a query generator LLM via Contrastive Preference Optimization (CPO) to improve the signal quality in generated queries, and (2) replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts using the DSPy framework. Our results show that both extensions reduce the need for aggressive filtering while improving retrieval performance. All code, models, and synthetic datasets are publicly released to support further research at: \href{https://github.com/danilotpnta/IR2-project}{this https URL}.
Yihao Lu, Hao Tang
Embodied AI (EAI) agents continuously interact with the physical world, generating vast, heterogeneous multimodal data streams that traditional management systems are ill-equipped to handle. In this survey, we first systematically evaluate five storage architectures (Graph Databases, Multi-Model Databases, Data Lakes, Vector Databases, and Time-Series Databases), focusing on their suitability for addressing EAI's core requirements, including physical grounding, low-latency access, and dynamic scalability. We then analyze five retrieval paradigms (Fusion Strategy-Based Retrieval, Representation Alignment-Based Retrieval, Graph-Structure-Based Retrieval, Generation Model-Based Retrieval, and Efficient Retrieval-Based Optimization), revealing a fundamental tension between achieving long-term semantic coherence and maintaining real-time responsiveness. Based on this comprehensive analysis, we identify key bottlenecks, spanning from the foundational Physical Grounding Gap to systemic challenges in cross-modal integration, dynamic adaptation, and open-world generalization. Finally, we outline a forward-looking research agenda encompassing physics-aware data models, adaptive storage-retrieval co-optimization, and standardized benchmarking, to guide future research toward principled data management solutions for EAI. Our survey is based on a comprehensive review of more than 180 related studies, providing a rigorous roadmap for designing the robust, high-performance data management frameworks essential for the next generation of autonomous embodied systems.