Jiale Liu, Jiahao Zhang, Suhang Wang
Retrieval-Augmented Generation (RAG) is a powerful technique for enhancing Large Language Models (LLMs) with external, up-to-date knowledge. Graph RAG has emerged as an advanced paradigm that leverages graph-based knowledge structures to provide more coherent and contextually rich answers. However, the move from plain document retrieval to structured graph traversal introduces new, under-explored privacy risks. This paper investigates the data extraction vulnerabilities of Graph RAG systems. We design and execute tailored data extraction attacks to probe their susceptibility to leaking both raw text and structured data, such as entities and their relationships. Our findings reveal a critical trade-off: while Graph RAG systems may reduce raw text leakage, they are significantly more vulnerable to the extraction of structured entity and relationship information. We also explore potential defense mechanisms to mitigate these novel attack surfaces. This work provides a foundational analysis of the unique privacy challenges in Graph RAG and offers insights for building more secure systems.
Ziqiang Cui, Yunpeng Weng, Xing Tang, Peiyang Liu, Shiwei Li, Bowei He, Jiamin Chen, Xiuqiang He et al.
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the timeliness of knowledge and the factual accuracy of responses in Large Language Models (LLMs). However, the inclusion of excessive retrieved documents substantially increases the input length, leading to higher computational costs. Previous studies have attempted to compress retrieved documents into shorter texts before in-context integration, but such methods often compromise end-task performance. The lack of well-defined compression targets forces many approaches to rely on fixed heuristics, which cannot guarantee that the compressed content will effectively support the end task. To address these limitations, we propose CORE, a novel method designed to achieve lossless context compression for RAG. CORE employs reinforcement learning to optimize the compression process without relying on predefined compression labels. Specifically, it utilizes end-task performance as a reward signal and applies Group Relative Policy Optimization (GRPO) to train the compressor. This end-to-end training framework enables the compressor to generate summaries that maximize the accuracy of answers generated by the LLM. Extensive experiments on four datasets demonstrate the superiority of our approach. With a high compression ratio of 3\%, our method not only avoids performance degradation compared to prepending full documents across all datasets but also improves the average Exact Match (EM) score by 3.3 points. The code will be released soon.
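As a rough illustration of how end-task performance can act as a reward signal, the sketch below scores a group of sampled answers with Exact Match and converts the rewards into GRPO-style group-normalized advantages. It is a minimal sketch under stated assumptions, not the CORE implementation: the function names, the SQuAD-style answer normalization, and the use of the population standard deviation are all assumptions introduced here.

```python
import re
import statistics
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the abstract does not say
    how answers are normalized before Exact Match is computed.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """Return 1.0 if the prediction matches any gold answer, else 0.0."""
    target = normalize(prediction)
    return float(any(target == normalize(gold) for gold in gold_answers))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same input.

    Each reward is centered on the group mean and scaled by the group
    standard deviation (population std is an assumption here).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical usage: four answers produced from four sampled compressions.
answers = ["The Eiffel Tower.", "Louvre", "Notre-Dame", "eiffel tower"]
rewards = [exact_match(a, ["Eiffel Tower"]) for a in answers]
advantages = group_relative_advantages(rewards)
```

Compressions whose downstream answer is correct receive a positive advantage, the others a negative one. When every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.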
Yejin Choi, Jaewoo Park, Janghan Yoon, Saejin Kim, Jaehyun Jeon, Youngjae Yu
Rapid advances in Multimodal Large Language Models (MLLMs) have expanded information retrieval beyond purely textual inputs, enabling retrieval from complex real-world documents that combine text and visuals. However, most documents are private, either owned by individuals or confined within corporate silos, and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Unlike earlier multimodal retrievers that compare embeddings in a single vector space, PREMIR leverages preQs from multiple complementary modalities to expand the scope of matching to the token level. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings, outperforming strong baselines across all retrieval metrics. We confirm the contribution of each component through in-depth ablation studies, and qualitative analyses of the generated preQs further highlight the model's robustness in real-world settings.
Hongtao Lin, Haoyu Chen, Jaewon Jang, Jiajing Xu
User-to-item retrieval has been an active research area in recommendation systems, and two-tower models are widely adopted due to their simplicity and serving efficiency. In this work, we focus on a variant called \textit{conditional retrieval}, where we expect retrieved items to be relevant to a condition (e.g., a topic). We propose a method that uses the same training data as standard two-tower models but incorporates item-side information as conditions in the query. This allows us to bootstrap new conditional retrieval use cases and encourages feature interactions between user and condition. Experiments show that our method can retrieve highly relevant items and outperforms standard two-tower models with filters on engagement metrics. The proposed model has been deployed to power a topic-based notification feed at Pinterest and led to a +0.26\% increase in weekly active users.
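One plausible reading of "incorporates item-side information as conditions in the query" is that each ordinary engagement pair is relabeled so the engaged item's own topic joins the user on the query side. The sketch below illustrates only that data-construction step. It is a hypothetical illustration, not the deployed method: the `Engagement` type, the `item_topics` mapping, and the choice to skip items without a topic are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engagement:
    """One (user, item) interaction, as used to train a standard two-tower model."""
    user_id: str
    item_id: str


def build_conditional_examples(engagements, item_topics):
    """Turn (user, item) pairs into ((user, condition), item) pairs.

    The condition is read from the engaged item itself (here, its topic),
    so no additional labeled data is needed. Items without a known topic
    are skipped, which is an assumption of this sketch.
    """
    examples = []
    for engagement in engagements:
        topic = item_topics.get(engagement.item_id)
        if topic is None:
            continue
        examples.append(((engagement.user_id, topic), engagement.item_id))
    return examples


# Hypothetical usage with made-up identifiers.
log = [Engagement("u1", "i1"), Engagement("u1", "i2"), Engagement("u2", "i1")]
topics = {"i1": "gardening"}
examples = build_conditional_examples(log, topics)
```

At serving time the same query tower would receive the user together with a requested topic, so retrieved items are conditioned on that topic rather than filtered after retrieval.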
Jiawen Lyu, Manu Ramesh, Madison Simonds, Jacquelyn P. Boerman, Amy R. Reibman
Few automated video systems are described in the open literature that enable
hands-free cataloging and identification (ID) of cows in a dairy herd. In this
work, we describe our system, composed of an AutoCattloger, which builds a
Cattlog of dairy cows in a herd with a single input video clip per cow, an
eidetic cow recognizer which uses no deep learning to ID cows, and a CowFinder,
which IDs cows in a continuous stream of video. We demonstrate its value in
finding individuals in unlabeled, unsegmented videos of cows walking
unconstrained through the holding area of a milking parlor.
Authors' comments: Extended abstract. Presented at the 3rd US Conference on Precision
Livestock Farming (USPLF), 2025, Lincoln NE
Ren Qin, Chai Zheng, Xiao Xijun, Zheng Yuchao, Wu Di
Precisely modeling ultra-long user sequences is critical for industrial recommender systems. Current approaches predominantly focus on leveraging ultra-long sequences in the ranking stage, whereas their use in the candidate retrieval stage remains under-explored. This paper presents LongRetriever, a practical framework for incorporating ultra-long sequences into the retrieval stage of recommenders. Specifically, we propose in-context training and multi-context retrieval, which enable candidate-specific interaction between user sequence and candidate item, and ensure training-serving consistency under the search-based paradigm. Extensive online A/B testing conducted on a large-scale e-commerce platform demonstrates statistically significant improvements, confirming the framework's effectiveness. LongRetriever has been fully deployed on the platform, impacting billions of users.
Mandeep Rathee, Venktesh V, Sean MacAvaney, Avishek Anand
Retrieval-Augmented Generation (RAG) has emerged as a standard framework for
knowledge-intensive NLP tasks, combining large language models (LLMs) with
document retrieval from external corpora. Despite its widespread use, most RAG
pipelines continue to treat retrieval and reasoning as isolated components,
retrieving documents once and then generating answers without further
interaction. This static design often limits performance on complex tasks that
require iterative evidence gathering or high-precision retrieval. Recent work
in both the information retrieval (IR) and NLP communities has begun to close
this gap by introducing adaptive retrieval and ranking methods that incorporate
feedback. In this survey, we present a structured overview of advanced
retrieval and ranking mechanisms that integrate such feedback. We categorize
feedback signals based on their source and role in improving the query,
retrieved context, or document pool. By consolidating these developments, we
aim to bridge IR and NLP perspectives and highlight retrieval as a dynamic,
learnable component of end-to-end RAG systems.
Authors' comments: 18 pages, 1 figure
Eunseong Choi, June Park, Hyeri Lee, Jongwuk Lee
Retrieval-augmented generation (RAG) enhances the capabilities of large
language models (LLMs) by incorporating external knowledge into their input
prompts. However, when the retrieved context contradicts the LLM's parametric
knowledge, the model often fails to resolve the conflict between incorrect
external context and correct parametric knowledge, known as context-memory conflict. To
tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation
(CARE), consisting of a context assessor and a base LLM. The context assessor
encodes compact memory token embeddings from raw context tokens. Through
grounded/adversarial soft prompting, the context assessor is trained to discern
unreliable context and capture a guidance signal that directs reasoning toward
the more reliable knowledge source. Extensive experiments show that CARE
effectively mitigates context-memory conflicts, leading to an average
performance gain of 5.0\% on QA and fact-checking benchmarks, establishing a
promising direction for trustworthy and adaptive RAG systems.
Authors' comments: Accepted to EMNLP 2025; 14 pages; 5 figures, 11 tables
Shiyi Yang, Xinshu Li, Guanglin Zhou, Chen Wang, Xiwei Xu, Liming Zhu, Lina Yao
Recent studies have shown that recommender systems (RSs) are highly vulnerable to data poisoning attacks, where malicious actors inject fake user profiles, including a group of well-designed fake ratings, to manipulate recommendations. Due to security and privacy constraints in practice, attackers typically possess limited knowledge of the victim system and thus need to craft profiles that have transferability across black-box RSs. To maximize the attack impact, the profiles often need to remain imperceptible. However, generating such high-quality profiles with restricted resources is challenging. Some works suggest incorporating fake textual reviews to strengthen the profiles; yet, the poor quality of the reviews largely undermines the attack effectiveness and imperceptibility in practical settings. To tackle the above challenges, in this paper, we propose to enhance the quality of the review text by harnessing in-context learning (ICL) capabilities of multimodal foundation models. To this end, we introduce a demonstration retrieval algorithm and a text style transfer strategy to augment naive ICL. Specifically, we propose a novel practical attack framework named RAGAN to generate high-quality fake user profiles, which can provide insights into the robustness of RSs. The profiles are generated by a jailbreaker and collaboratively optimized on an instructional agent and a guardian to improve the attack transferability and imperceptibility. Comprehensive experiments on various real-world datasets demonstrate that RAGAN achieves state-of-the-art poisoning attack performance.
Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
Composed video retrieval is a challenging task that strives to retrieve a
target video based on a query video and a textual description detailing
specific modifications. Standard retrieval frameworks typically struggle to
handle the complexity of fine-grained compositional queries and variations in
temporal understanding, limiting their retrieval ability in the fine-grained
setting. To address this issue, we introduce a novel dataset that captures both
fine-grained and composed actions across diverse video segments, enabling more
detailed compositional changes in retrieved video content. The proposed
dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense
modification text that is around seven times more than its existing
counterpart. We further develop a new model that integrates visual and textual
information through Cross-Attention (CA) fusion using a grounded text encoder,
enabling precise alignment between dense query modifications and target videos.
The proposed model achieves state-of-the-art results, surpassing existing
methods on all metrics. Notably, it achieves 71.3\% Recall@1 in the visual+text
setting and outperforms the previous state of the art by 3.4\%, highlighting its
efficacy in terms of leveraging detailed video descriptions and dense
modification texts. Our proposed dataset, code, and model are available at
https://github.com/OmkarThawakar/BSE-CoVR
Authors' comments: Accepted to ICCV-2025
Matey Krastev, Miklos Hamar, Danilo Toapanta, Jesse Brouwers, Yibin Lei
This work revisits and extends synthetic query generation pipelines for Neural Information Retrieval (NIR) by leveraging the InPars Toolkit, a reproducible, end-to-end framework for generating training data using large language models (LLMs). We first assess the reproducibility of the original InPars, InPars-V2, and Promptagator pipelines on the SciFact benchmark and validate their effectiveness using open-source reranker and generator models. Building on this foundation, we introduce two key extensions to the pipeline: (1) fine-tuning a query generator LLM via Contrastive Preference Optimization (CPO) to improve the signal quality in generated queries, and (2) replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts using the DSPy framework. Our results show that both extensions reduce the need for aggressive filtering while improving retrieval performance. All code, models, and synthetic datasets are publicly released to support further research at: https://github.com/danilotpnta/IR2-project
Yihao Lu, Hao Tang
Embodied AI (EAI) agents continuously interact with the physical world, generating vast, heterogeneous multimodal data streams that traditional management systems are ill-equipped to handle. In this survey, we first systematically evaluate five storage architectures (Graph Databases, Multi-Model Databases, Data Lakes, Vector Databases, and Time-Series Databases), focusing on their suitability for addressing EAI's core requirements, including physical grounding, low-latency access, and dynamic scalability. We then analyze five retrieval paradigms (Fusion Strategy-Based Retrieval, Representation Alignment-Based Retrieval, Graph-Structure-Based Retrieval, Generation Model-Based Retrieval, and Efficient Retrieval-Based Optimization), revealing a fundamental tension between achieving long-term semantic coherence and maintaining real-time responsiveness. Based on this comprehensive analysis, we identify key bottlenecks, spanning from the foundational Physical Grounding Gap to systemic challenges in cross-modal integration, dynamic adaptation, and open-world generalization. Finally, we outline a forward-looking research agenda encompassing physics-aware data models, adaptive storage-retrieval co-optimization, and standardized benchmarking, to guide future research toward principled data management solutions for EAI. Our survey is based on a comprehensive review of more than 180 related studies, providing a rigorous roadmap for designing the robust, high-performance data management frameworks essential for the next generation of autonomous embodied systems.
Ali Abdari, Alex Falcon, Giuseppe Serra
Every day, a large amount of educational content is uploaded online across
different areas, including agriculture and gardening. When these videos or
materials are grouped meaningfully, they can make learning easier and more
effective. One promising way to organize and enrich such content is through the
Metaverse, which allows users to explore educational experiences in an
interactive and immersive environment. However, searching for relevant
Metaverse scenarios and finding those matching users' interests remains a
challenging task. A first step in this direction has been taken recently, but
existing datasets are small and not sufficient for training advanced models. In
this work, we make two main contributions: first, we introduce a new dataset
containing 457 agricultural-themed virtual museums (AgriMuseums), each enriched
with textual descriptions; and second, we propose a hierarchical
vision-language model to represent and retrieve relevant AgriMuseums using
natural language queries. In our experimental setting, the proposed method
achieves up to about 62\% R@1 and 78\% MRR, confirming its effectiveness, and
it also leads to improvements on existing benchmarks by up to 6\% R@1 and 11\%
MRR. Moreover, an extensive evaluation validates our design choices. Code and
dataset are available at
https://github.com/aliabdari/Agricultural_Metaverse_Retrieval .
Authors' comments: Accepted for publication at the 23rd International Conference on
Image Analysis and Processing (ICIAP 2025)
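For readers unfamiliar with the two metrics reported above, the following sketch shows how R@k and MRR are commonly computed for text-to-item retrieval. It is a generic illustration, not the authors' evaluation code: the function names and the assumption of exactly one relevant item per query are introduced here.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant item appears among the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant item (1-based), or 0.0 if it was not retrieved."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(rankings, k=1):
    """Average R@k and MRR over (ranked_ids, relevant_id) pairs.

    Assumption: each query has a single relevant item.
    """
    n = len(rankings)
    r_at_k = sum(recall_at_k(ids, rel, k) for ids, rel in rankings) / n
    mrr = sum(reciprocal_rank(ids, rel) for ids, rel in rankings) / n
    return r_at_k, mrr


# Hypothetical usage: four queries with made-up item identifiers.
rankings = [
    (["a", "b", "c"], "a"),  # relevant item ranked first
    (["a", "b", "c"], "b"),  # relevant item ranked second
    (["a", "b", "c"], "d"),  # relevant item not retrieved
    (["x", "y"], "y"),       # relevant item ranked second
]
r_at_1, mrr = evaluate(rankings, k=1)
```

R@1 counts only exact top-rank hits, while MRR also gives partial credit to near misses, which is why MRR is never lower than R@1 on the same set of rankings.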
Lam Thanh Do, Linh Van Nguyen, David Fu, Kevin Chen-Chuan Chang
The exponential growth of scientific literature has made it increasingly
difficult for researchers to keep up with the literature. In an attempt to
alleviate this problem, we propose CASPER, a sparse retrieval model for
scientific search that utilizes tokens and keyphrases as representation units
(i.e. dimensions in the sparse embedding space), enabling it to represent
queries and documents with research concepts and match them at both granular
and conceptual levels. To overcome the lack of suitable training data, we
propose mining training data by leveraging scholarly references (i.e. signals
that capture how research concepts of papers are expressed in different
settings), including titles, citation contexts, author-assigned keyphrases, and
co-citations. CASPER outperforms strong dense and sparse retrieval baselines on
eight scientific retrieval benchmarks. Moreover, we demonstrate that through
simple post-processing, CASPER can be effectively used for keyphrase
generation tasks, achieving competitive performance with the established
CopyRNN while producing more diverse keyphrases and being nearly four times
faster.
Authors' comments: 11 Pages. Code: https://github.com/louisdo/CASPER
Ziyang Chen, Erxue Min, Xiang Zhao, Yunxin Li, Xin Jia, Jinzhi Liao, Jichao Li, Shuaiqiang Wang et al.
We introduce ChronoQA, a large-scale benchmark dataset for Chinese question
answering, specifically designed to evaluate temporal reasoning in
Retrieval-Augmented Generation (RAG) systems. ChronoQA is constructed from over
300,000 news articles published between 2019 and 2024, and contains 5,176
high-quality questions covering absolute, aggregate, and relative temporal
types with both explicit and implicit time expressions. The dataset supports
both single- and multi-document scenarios, reflecting the real-world
requirements for temporal alignment and logical consistency. ChronoQA features
comprehensive structural annotations and has undergone multi-stage validation,
including rule-based, LLM-based, and human evaluation, to ensure data quality.
By providing a dynamic, reliable, and scalable resource, ChronoQA enables
structured evaluation across a wide range of temporal tasks, and serves as a
robust benchmark for advancing time-sensitive retrieval-augmented question
answering systems.
Authors' comments: 10 pages, 5 figures
Zida Liang, Changfa Wu, Dunxian Huang, Weiqiang Sun, Ziyang Wang, Yuliang Yan, Jian Wu, Yuning Jiang et al.
Recommendation systems are essential tools in modern e-commerce, facilitating
personalized user experiences by suggesting relevant products. Recent
advancements in generative models have demonstrated potential in enhancing
recommendation systems; however, these models often exhibit limitations in
optimizing retrieval tasks, primarily due to their reliance on autoregressive
generation mechanisms. Conventional approaches introduce sequential
dependencies that impede efficient retrieval, as they are inherently unsuitable
for generating multiple items without positional constraints within a single
request session. To address these limitations, we propose TBGRecall, a
framework integrating Next Session Prediction (NSP), designed to enhance
generative retrieval models for e-commerce applications. Our reformulation
partitions input samples into multi-session sequences, where each sequence
comprises a session token followed by a set of item tokens, and further
incorporates multiple optimizations tailored to the generative task in
retrieval scenarios. In terms of training methodology, our pipeline
integrates limited historical data pre-training with stochastic partial
incremental training, significantly improving training efficiency and
emphasizing the superiority of data recency over sheer data volume. Our
extensive experiments, conducted on public benchmarks alongside a large-scale
industrial dataset from TaoBao, show that TBGRecall outperforms state-of-the-art
recommendation methods and exhibits a clear scaling law trend. Ultimately, NSP
represents a significant advancement in the effectiveness of generative
recommendation systems for e-commerce applications.
Authors' comments: Both authors contributed equally to this research. Work done during
internship at Alibaba. Corresponding author: Dunxian Huang
(dunxian.hdx@alibaba-inc.com). Affiliations: (1) Shanghai Jiaotong
University, Shanghai, China; (2) Alibaba Inc
Jun Li, Kai Li, Shaoguo Liu, Tingting Gao
Composed Image Retrieval (CIR) presents a significant challenge as it requires jointly understanding a reference image and a modified textual instruction to find relevant target images. Some existing methods attempt to use a two-stage approach to further refine retrieval results. However, this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application in CIR tasks remains limited -- compressing visual information into text or relying on elaborate prompt designs. Besides, existing works only utilize it for zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we propose a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called Pyramid Patcher, we enhance the Pyramid Matching Model's understanding of visual information at different granularities. Inspired by representation engineering, we extract representations from CoT data and inject them into the LVLMs. This approach allows us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.
Wonjune Kang, Deb Roy
We introduce the task of expressive speech retrieval, where the goal is to
retrieve speech utterances spoken in a given style based on a natural language
description of that style. While prior work has primarily focused on performing
speech retrieval based on what was said in an utterance, we aim to do so based
on how something was said. We train speech and text encoders to embed speech
and text descriptions of speaking styles into a joint latent space, which
enables using free-form text prompts describing emotions or styles as queries
to retrieve matching expressive speech segments. We perform detailed analyses
of various aspects of our proposed framework, including encoder architectures,
training criteria for effective cross-modal alignment, and prompt augmentation
for improved generalization to arbitrary text queries. Experiments on multiple
datasets encompassing 22 speaking styles demonstrate that our approach achieves
strong retrieval performance as measured by Recall@k.
Authors' comments: Accepted to ASRU 2025
Weilin Ruan, Xilin Dang, Ziyu Zhou, Sisuo Lyu, Yuxuan Liang
Traffic prediction is a cornerstone of modern intelligent transportation systems and a critical task in spatio-temporal forecasting. Although advanced Spatio-temporal Graph Neural Networks (STGNNs) and pre-trained models have achieved significant progress in traffic prediction, two key challenges remain: (i) limited contextual capacity when modeling complex spatio-temporal dependencies, and (ii) low predictability at fine-grained spatio-temporal points due to heterogeneous patterns. Inspired by Retrieval-Augmented Generation (RAG), we propose RAST, a universal framework that integrates retrieval-augmented mechanisms with spatio-temporal modeling to address these challenges. Our framework consists of three key designs: 1) Decoupled Encoder and Query Generator to capture decoupled spatial and temporal features and construct a fusion query via residual fusion; 2) Spatio-temporal Retrieval Store and Retrievers to maintain and retrieve vectorized fine-grained patterns; and 3) Universal Backbone Predictor that flexibly accommodates pre-trained STGNNs or simple MLP predictors. Extensive experiments on six real-world traffic networks, including large-scale datasets, demonstrate that RAST achieves superior performance while maintaining computational efficiency.
Shan Zhong, A. J. Sudler, D. Blume, Alberto M. Marino
Highly efficient quantum memories are essential for advancing quantum information processing technologies, including scalable quantum computing and quantum networks. We experimentally demonstrate a light storage and retrieval protocol in a tripod system using an ensemble of laser-cooled $^{87}$Rb atoms. The tripod system, which consists of three ground states and an excited state, offers rich dynamics: its use to coherently store and retrieve a weak probe pulse in the $^{87}$Rb $F=1$ ground state manifold leads to the interference of two spin-wave excitations during storage time, which translates to an interference in the peak intensity of the retrieved probe pulse. Our work shows that these interferences, which manifest when varying the pulse sequence or energy level structure, can be controlled experimentally by varying the storage time, optical phase, and magnetic field strength. Theoretical simulations exhibit excellent agreement with the experimental results. This work demonstrates the rich dynamics and versatile capabilities of atomic tripod systems for light storage and retrieval, with key advantages over conventional $\Lambda$-systems, highlighting the potential of atomic tripod systems for applications in quantum information processing, quantum synchronization, and atomic memory protocols.