Ganlin Xu, Zhoujia Zhang, Wangyi Mei, Jiaqing Liang, Weijia Lu, Xiaodong Zhang, Zhifei Yang, Xiaofeng Ma et al.
Information retrieval plays a crucial role in resource localization. Current dense retrievers retrieve the relevant documents within a corpus via embedding similarities, which compute similarities between dense vectors mainly depending on word co-occurrence between queries and documents, but overlook the real query intents. Thus, they often retrieve numerous irrelevant documents. Particularly in the scenarios of complex queries such as \emph{negative-constraint queries}, their retrieval performance could be catastrophic. To address the issue, we propose a neuro-symbolic information retrieval method, namely \textbf{NS-IR}, that leverages first-order logic (FOL) to optimize the embeddings of naive natural language by considering the \emph{logical consistency} between queries and documents. Specifically, we introduce two novel techniques, \emph{logic alignment} and \emph{connective constraint}, to rerank candidate documents, thereby enhancing retrieval relevance. Furthermore, we construct a new dataset \textbf{NegConstraint} including negative-constraint queries to evaluate our NS-IR's performance on such complex IR scenarios. Our extensive experiments demonstrate that NS-IR not only achieves superior zero-shot retrieval performance on web search and low-resource retrieval tasks, but also performs better on negative-constraint queries. Our scource code and dataset are available at https://github.com/xgl-git/NS-IR-main.
A. Ploshkin, V. Tytskiy, A. Pismenny, V. Baikalov, E. Taychinov, A. Permiakov, D. Burlakov, E. Krofto et al.
We present Yambda-5B, a large-scale open dataset sourced from the Yandex.Music streaming platform. Yambda-5B contains 4.79 billion user-item interactions from 1 million users across 9.39 million tracks. The dataset includes two primary types of interactions: implicit feedback (listening events) and explicit feedback (likes, dislikes, unlikes and undislikes). In addition, we provide audio embeddings for most tracks, generated by a convolutional neural network trained on audio spectrograms. A key distinguishing feature of Yambda-5B is the inclusion of the is_organic flag, which separates organic user actions from recommendation-driven events. This distinction is critical for developing and evaluating machine learning algorithms, as Yandex.Music relies on recommender systems to personalize track selection for users. To support rigorous benchmarking, we introduce an evaluation protocol based on a Global Temporal Split, allowing recommendation algorithms to be assessed in conditions that closely mirror real-world use. We report benchmark results for standard baselines (ItemKNN, iALS) and advanced models (SANSA, SASRec) using a variety of evaluation metrics. By releasing Yambda-5B to the community, we aim to provide a readily accessible, industrial-scale resource to advance research, foster innovation, and promote reproducible results in recommender systems.
Alan Ramponi, Marco Rovera, Robert Moro, Sara Tonelli
Retrieval of previously fact-checked claims is a well-established task, whose automation can assist professional fact-checkers in the initial steps of information verification. Previous works have mostly tackled the task monolingually, i.e., having both the input and the retrieved claims in the same language. However, especially for languages with a limited availability of fact-checks and in case of global narratives, such as pandemics, wars, or international politics, it is crucial to be able to retrieve claims across languages. In this work, we examine strategies to improve the multilingual and crosslingual performance, namely selection of negative examples (in the supervised) and re-ranking (in the unsupervised setting). We evaluate all approaches on a dataset containing posts and claims in 47 languages (283 language combinations). We observe that the best results are obtained by using LLM-based re-ranking, followed by fine-tuning with negative examples sampled using a sentence similarity-based strategy. Most importantly, we show that crosslinguality is a setup with its own unique characteristics compared to the multilingual setup.
Junhuan Liu, San Jiang, Wei Ge, Wei Huang, Bingxuan Guo, Qingquan Li
The primary contribution of this paper is a challenging benchmark dataset, UAVPairs, and a training pipeline designed for match pair retrieval of large-scale UAV images. First, the UAVPairs dataset, comprising 21,622 high-resolution images across 30 diverse scenes, is constructed; the 3D points and tracks generated by SfM-based 3D reconstruction are employed to define the geometric similarity of image pairs, ensuring genuinely matchable image pairs are used for training. Second, to solve the problem of expensive mining cost for global hard negative mining, a batched nontrivial sample mining strategy is proposed, leveraging the geometric similarity and multi-scene structure of the UAVPairs to generate training samples as to accelerate training. Third, recognizing the limitation of pair-based losses, the ranked list loss is designed to improve the discrimination of image retrieval models, which optimizes the global similarity structure constructed from the positive set and negative set. Finally, the effectiveness of the UAVPairs dataset and training pipeline is validated through comprehensive experiments on three distinct large-scale UAV datasets. The experiment results demonstrate that models trained with the UAVPairs dataset and the ranked list loss achieve significantly improved retrieval accuracy compared to models trained on existing datasets or with conventional losses. Furthermore, these improvements translate to enhanced view graph connectivity and higher quality of reconstructed 3D models. The models trained by the proposed approach perform more robustly compared with hand-crafted global features, particularly in challenging repetitively textured scenes and weakly textured scenes. For match pair retrieval of large-scale UAV images, the trained image retrieval models offer an effective solution. The dataset would be made publicly available at https://github.com/json87/UAVPairs.
Yujin Choi, Youngjoo Park, Junyoung Byun, Jaewook Lee, Jinseong Park
Retrieval-augmented generation (RAG) mitigates the hallucination problem in large language models (LLMs) and has proven effective for specific, personalized applications. However, passing private retrieved documents directly to LLMs introduces vulnerability to membership inference attacks (MIAs), which try to determine whether the target datum exists in the private external database or not. Based on the insight that MIA queries typically exhibit high similarity to only one target document, we introduce Mirabel, a similarity-based MIA detection framework designed for the RAG system. With the proposed Mirabel, we show that simple detect-and-hide strategies can successfully obfuscate attackers, maintain data utility, and remain system-agnostic. We experimentally prove its detection and defense against various state-of-the-art MIA methods and its adaptability to existing private RAG systems.
Seongwan Park, Taeklim Kim, Youngjoong Ko
Despite their strong performance, Dense Passage Retrieval (DPR) models suffer from a lack of interpretability. In this work, we propose a novel interpretability framework that leverages Sparse Autoencoders (SAEs) to decompose previously uninterpretable dense embeddings from DPR models into distinct, interpretable latent concepts. We generate natural language descriptions for each latent concept, enabling human interpretations of both the dense embeddings and the query-document similarity scores of DPR models. We further introduce Concept-Level Sparse Retrieval (CL-SR), a retrieval framework that directly utilizes the extracted latent concepts as indexing units. CL-SR effectively combines the semantic expressiveness of dense embeddings with the transparency and efficiency of sparse representations. We show that CL-SR achieves high index-space and computational efficiency while maintaining robust performance across vocabulary and semantic mismatches.
Hoang Pham, Thuy-Duong Nguyen, Khac-Hoai Nam Bui
This paper presents a novel approach for unified retrieval-augmented generation (RAG) systems using the recent emerging large language model (LLM) agent concept. Specifically, Agent LLM, which utilizes LLM as fundamental controllers, has become a promising approach to enable the interpretability of RAG tasks, especially for complex reasoning question-answering systems (e.g., multi-hop queries). Nonetheless, previous works mainly focus on solving RAG systems with either single-hop or multi-hop approaches separately, which limits the application of those approaches to real-world applications. In this study, we propose a trainable agent framework called Agent-UniRAG for unified retrieval-augmented LLM systems, which enhances the effectiveness and interpretability of RAG systems. The main idea is to design an LLM agent framework to solve RAG tasks step-by-step based on the complexity of the inputs, simultaneously including single-hop and multi-hop queries in an end-to-end manner. Furthermore, we introduce SynAgent-RAG, a synthetic dataset to enable the proposed agent framework for small open-source LLMs (e.g., Llama-3-8B). The results show comparable performances with closed-source and larger open-source LLMs across various RAG benchmarks. Our source code and dataset are publicly available for further exploitation.
Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, Nicolas Flores-Herr
Chunking is a crucial preprocessing step in retrieval-augmented generation (RAG) systems, significantly impacting retrieval effectiveness across diverse datasets. In this study, we systematically evaluate fixed-size chunking strategies and their influence on retrieval performance using multiple embedding models. Our experiments, conducted on both short-form and long-form datasets, reveal that chunk size plays a critical role in retrieval effectiveness -- smaller chunks (64-128 tokens) are optimal for datasets with concise, fact-based answers, whereas larger chunks (512-1024 tokens) improve retrieval in datasets requiring broader contextual understanding. We also analyze the impact of chunking on different embedding models, finding that they exhibit distinct chunking sensitivities. While models like Stella benefit from larger chunks, leveraging global context for long-range retrieval, Snowflake performs better with smaller chunks, excelling at fine-grained, entity-based matching. Our results underscore the trade-offs between chunk size, embedding models, and dataset characteristics, emphasizing the need for improved chunk quality measures, and more comprehensive datasets to advance chunk-based retrieval in long-document Information Retrieval (IR).
Dipam Goswami, Liying Wang, Bartłomiej Twardowski, Joost van de Weijer
Text embedding models enable semantic search, powering several NLP
applications like Retrieval Augmented Generation by efficient information
retrieval (IR). However, text embedding models are commonly studied in
scenarios where the training data is static, thus limiting its applications to
dynamic scenarios where new training data emerges over time. IR methods
generally encode a huge corpus of documents to low-dimensional embeddings and
store them in a database index. During retrieval, a semantic search over the
corpus is performed and the document whose embedding is most similar to the
query embedding is returned. When updating an embedding model with new training
data, using the already indexed corpus is suboptimal due to the
non-compatibility issue, since the model which was used to obtain the
embeddings of the corpus has changed. While re-indexing of old corpus documents
using the updated model enables compatibility, it requires much higher
computation and time. Thus, it is critical to study how the already indexed
corpus can still be effectively used without the need of re-indexing. In this
work, we establish a continual learning benchmark with large-scale datasets and
continually train dense retrieval embedding models on query-document pairs from
new datasets in each task and observe forgetting on old tasks due to
significant drift of embeddings. We employ embedding distillation on both query
and document embeddings to maintain stability and propose a novel query drift
compensation method during retrieval to project new model query embeddings to
the old embedding space. This enables compatibility with previously indexed
corpus embeddings extracted using the old model and thus reduces the
forgetting. We show that the proposed method significantly improves performance
without any re-indexing. Code is available at
https://github.com/dipamgoswami/QDC.
Authors' comments: Accepted at CoLLAs 2025
Ekaterina Fadeeva, Aleksandr Rubashevskii, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan et al.
Large Language Models (LLMs) enhanced with external knowledge retrieval, an approach known as Retrieval-Augmented Generation (RAG), have shown strong performance in open-domain question answering. However, RAG systems remain susceptible to hallucinations: factually incorrect outputs that may arise either from inconsistencies in the model's internal knowledge or incorrect use of the retrieved context. Existing approaches often conflate factuality with faithfulness to the retrieved context, misclassifying factually correct statements as hallucinations if they are not directly supported by the retrieval. In this paper, we introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs. FRANQ applies different Uncertainty Quantification (UQ) techniques to estimate factuality based on whether a statement is faithful to the retrieved context or not. To evaluate FRANQ and other UQ techniques for RAG, we present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging examples. Extensive experiments on long- and short-form QA across multiple datasets and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing methods.
Eric Xing, Pranavi Kolouju, Robert Pless, Abby Stylianou, Nathan Jacobs
Composed image retrieval (CIR) is the task of retrieving a target image
specified by a query image and a relative text that describes a semantic
modification to the query image. Existing methods in CIR struggle to accurately
represent the image and the text modification, resulting in subpar performance.
To address this limitation, we introduce a CIR framework, ConText-CIR, trained
with a Text Concept-Consistency loss that encourages the representations of
noun phrases in the text modification to better attend to the relevant parts of
the query image. To support training with this loss function, we also propose a
synthetic data generation pipeline that creates training data from existing CIR
datasets or unlabeled images. We show that these components together enable
stronger performance on CIR tasks, setting a new state-of-the-art in composed
image retrieval in both the supervised and zero-shot settings on multiple
benchmark datasets, including CIRR and CIRCO. Source code, model checkpoints,
and our new datasets are available at https://github.com/mvrl/ConText-CIR.
Authors' comments: 15 pages, 8 figures, 6 tables. CVPR 2025
Yanran Tang, Ruihong Qiu, Zi Huang
Legal case retrieval plays a pivotal role in the legal domain by facilitating the efficient identification of relevant cases, supporting legal professionals and researchers to propose legal arguments and make informed decision-making. To improve retrieval accuracy, the Competition on Legal Information Extraction and Entailment (COLIEE) is held annually, offering updated benchmark datasets for evaluation. This paper presents a detailed description of CaseLink, the method employed by UQLegalAI, the second highest team in Task 1 of COLIEE 2025. The CaseLink model utilises inductive graph learning and Global Case Graphs to capture the intrinsic case connectivity to improve the accuracy of legal case retrieval. Specifically, a large language model specialized in text embedding is employed to transform legal texts into embeddings, which serve as the feature representations of the nodes in the constructed case graph. A new contrastive objective, incorporating a regularization on the degree of case nodes, is proposed to leverage the information within the case reference relationship for model optimization. The main codebase used in our method is based on an open-sourced repo of CaseLink: https://github.com/yanran-tang/CaseLink.
Shahrooz Pouryousef
User-item interactions contain rich collaborative signals that form the backbone of many successful recommender systems. While recent work has explored the use of large language models (LLMs) for recommendation, it remains unclear whether LLMs can effectively reason over this type of collaborative information. In this paper, we conduct a systematic comparison between LLMs and classical matrix factorization (MF) models to assess LLMs' ability to leverage user-item interaction data. We further introduce a simple retrieval-augmented generation (RAG) method that enhances LLMs by grounding their predictions in structured interaction data. Our experiments reveal that current LLMs often fall short in capturing collaborative patterns inherent to MF models, but that our RAG-based approach substantially improves recommendation quality-highlighting a promising direction for future LLM-based recommenders.
Xu Kang, Siqi Jiang, Kangwei Xu, Jiahao Li, Ruibo Wu
Terpenoids are a crucial class of natural products that have been studied for
over 150 years, but their interdisciplinary nature (spanning chemistry,
pharmacology, and biology) complicates knowledge integration. To address this,
the authors developed TeroSeek, a curated knowledge base (KB) built from two
decades of terpenoid literature, coupled with an AI-powered question-answering
chatbot and web service. Leveraging a retrieval-augmented generation (RAG)
framework, TeroSeek provides structured, high-quality information and
outperforms general-purpose large language models (LLMs) in terpenoid-related
queries. It serves as a domain-specific expert tool for multidisciplinary
research and is publicly available at http://teroseek.qmclab.com.
Authors' comments: 18 pages, 4 figures
Funing Li, Yuan Tian, Ruben Noortwyck, Jifeng Zhou, Liming Kuang, Robert Schulz
In modern industrial and logistics environments, the rapid expansion of fast delivery services has heightened the demand for storage systems that combine high efficiency with increased density. Multi-deep autonomous vehicle storage and retrieval systems (AVS/RS) present a viable solution for achieving greater storage density. However, these systems encounter significant challenges during retrieval operations due to lane blockages. A conventional approach to mitigate this issue involves storing items with homogeneous characteristics in a single lane, but this strategy restricts the flexibility and adaptability of multi-deep storage systems. In this study, we propose a deep reinforcement learning-based framework to address the retrieval problem in multi-deep storage systems with heterogeneous item configurations. Each item is associated with a specific due date, and the objective is to minimize total tardiness. To effectively capture the system's topology, we introduce a graph-based state representation that integrates both item attributes and the local topological structure of the multi-deep warehouse. To process this representation, we design a novel neural network architecture that combines a Graph Neural Network (GNN) with a Transformer model. The GNN encodes topological and item-specific information into embeddings for all directly accessible items, while the Transformer maps these embeddings into global priority assignments. The Transformer's strong generalization capability further allows our approach to be applied to storage systems with diverse layouts. Extensive numerical experiments, including comparisons with heuristic methods, demonstrate the superiority of the proposed neural network architecture and the effectiveness of the trained agent in optimizing retrieval tardiness.
Jihoon Lee, Min Song
Despite significant advancements in Large Vision-Language Models, Object
Hallucination (OH) remains a persistent challenge. Building upon prior studies
on contrastive decoding that address this issue without requiring additional
model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an
advanced method to suppress OH. RVCD leverages both negative and positive
images at the logit level, explicitly referencing AI-generated images designed
to represent a single concept. Our approach demonstrates substantial
improvements over existing decoding-based methods.
Authors' comments: ACL Findings camera-ready version. Code is released at
https://github.com/JiHoonLee9898/RVCD
Matthew Hong, Anthony Liang, Kevin Kim, Harshitha Rajaprakash, Jesse Thomason, Erdem Bıyık, Jesse Zhang
We hand the community HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations. Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. Using a visual tracking pipeline, HAND extracts the motion of the human hand from the hand demonstration and retrieves robot sub-trajectories in two stages: first filtering by visual similarity, then retrieving trajectories with similar behaviors to the hand. Fine-tuning a policy on the retrieved data enables real-time learning of tasks in under four minutes, without requiring calibrated cameras or detailed hand pose estimation. Experiments also show that HAND outperforms retrieval baselines by over 2x in average task success rates on real robots. Videos can be found at our project website: https://liralab.usc.edu/handretrieval/.
Qiming Liu, Yilin Chen, Peng Xu, Huizhong Shen, Zelin Mai, Ruixin Zhang, Peng Guo, Zhiyu Zheng et al.
Ammonia (NH3) emissions significantly contribute to atmospheric pollution,
yet discrepancies exist between bottom-up inventories and satellite-constrained
top-down estimates, with the latter typically one-third higher. This study
quantifies how assumptions about NH3 vertical distribution in satellite
retrievals contribute to this gap. By implementing spatially and temporally
resolved vertical profiles from the Community Multiscale Air Quality model to
replace steep gradients in Infrared Atmospheric Sounding Interferometer (IASI)
retrievals, we reduced satellite-model column discrepancies from 71% to 18%. We
subsequently constrained NH3 emissions across China using a hybrid inversion
framework combining iterative mass balance and four-dimensional variational
methods. Our posterior emissions showed agreement with the a priori inventory
(7.9% lower), suggesting that discrepancies between inventory approaches were
amplified by overestimation of near-surface NH3 in baseline satellite
retrievals, potentially causing a 43% overestimation of growing season
emissions. Evaluation against ground-based measurements confirmed improved
model performance, with normalized root-mean-square error reductions of 1-27%
across six months. These findings demonstrate that accurate representation of
vertical profiles in satellite retrievals is critical for robust NH3 emission
estimates and can reconcile the long-standing discrepancy between bottom-up and
top-down approaches. Our hybrid inversion methodology, leveraging
profile-corrected satellite data, reveals that China's NH3 emissions exhibit
greater spatial concentration than previously recognized, reflecting
agricultural intensification. This advancement enables timely and accurate
characterization of rapidly changing agricultural emission patterns, critical
for implementing effective nitrogen pollution control measures.
Authors' comments: 44 pages, 14 figures, 1 table. The main text spans pages 1-28 and
includes 5 figures. The supplementary information (SI) spans pages 29-44,
containing 9 figures and 1 table
Chunyang Li, Junwei Zhang, Anda Cheng, Zhuo Ma, Xinghua Li, Jianfeng Ma
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, but its openness introduces vulnerabilities that can be exploited by poisoning attacks. Existing poisoning methods for RAG systems have limitations, such as poor generalization and lack of fluency in adversarial texts. In this paper, we propose CPA-RAG, a black-box adversarial framework that generates query-relevant texts capable of manipulating the retrieval process to induce target answers. The proposed method integrates prompt-based text generation, cross-guided optimization through multiple LLMs, and retriever-based scoring to construct high-quality adversarial samples. We conduct extensive experiments across multiple datasets and LLMs to evaluate its effectiveness. Results show that the framework achieves over 90\% attack success when the top-k retrieval setting is 5, matching white-box performance, and maintains a consistent advantage of approximately 5 percentage points across different top-k values. It also outperforms existing black-box baselines by 14.5 percentage points under various defense strategies. Furthermore, our method successfully compromises a commercial RAG system deployed on Alibaba's BaiLian platform, demonstrating its practical threat in real-world applications. These findings underscore the need for more robust and secure RAG frameworks to defend against poisoning attacks.
Wenqing Zhou, Yuxuan Yan, Qianqian Yang
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the capabilities of language models by integrating external knowledge. Due to the diversity of data sources and the constraints of memory and computing resources, real-world data is often scattered in multiple devices. Conventional RAGs that store massive amounts of scattered data centrally face increasing privacy concerns and high computational costs. Additionally, RAG in a central node raises latency issues when searching over a large-scale knowledge base. To address these challenges, we propose a distributed Knowledge Graph-based RAG approach, referred to as DGRAG, in an edge-cloud system, where each edge device maintains a local knowledge base without the need to share it with the cloud, instead sharing only summaries of its knowledge. Specifically, DGRAG has two main phases. In the Distributed Knowledge Construction phase, DGRAG organizes local knowledge using knowledge graphs, generating subgraph summaries and storing them in a summary database in the cloud as information sharing. In the Collaborative Retrieval and Generation phase, DGRAG first performs knowledge retrieval and answer generation locally, and a gate mechanism determines whether the query is beyond the scope of local knowledge or processing capabilities. For queries that exceed the local knowledge scope, the cloud retrieves knowledge from the most relevant edges based on the summaries and generates a more precise answer. Experimental results demonstrate the effectiveness of the proposed DGRAG approach in significantly improving the quality of question-answering tasks over baseline approaches.