Ali Abdari, Alex Falcon, Giuseppe Serra
Every day, a large amount of educational content is uploaded online across
different areas, including agriculture and gardening. When these videos or
materials are grouped meaningfully, they can make learning easier and more
effective. One promising way to organize and enrich such content is through the
Metaverse, which allows users to explore educational experiences in an
interactive and immersive environment. However, searching for relevant
Metaverse scenarios and finding those matching users' interests remains a
challenging task. A first step in this direction has been taken recently, but
existing datasets are small and not sufficient for training advanced models. In
this work, we make two main contributions: first, we introduce a new dataset
containing 457 agricultural-themed virtual museums (AgriMuseums), each enriched
with textual descriptions; and second, we propose a hierarchical
vision-language model to represent and retrieve relevant AgriMuseums using
natural language queries. In our experimental setting, the proposed method
achieves up to about 62\% R@1 and 78\% MRR, confirming its effectiveness, and
it also leads to improvements on existing benchmarks by up to 6\% R@1 and 11\%
MRR. Moreover, an extensive evaluation validates our design choices. Code and
dataset are available at
https://github.com/aliabdari/Agricultural_Metaverse_Retrieval .
Authors' comments: Accepted for publication at the 23rd International Conference on
Image Analysis and Processing (ICIAP 2025)
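The abstract above reports R@1 and MRR. As a minimal sketch of how these retrieval metrics are commonly computed from query-to-item similarity scores (the function names and toy scores below are illustrative assumptions, not taken from the authors' repository):

```python
def target_ranks(scores, targets):
    """Return the 1-based rank of each query's ground-truth item.

    scores[i][j] is the similarity between query i and candidate j
    (higher is better); targets[i] is the index of the correct candidate.
    """
    ranks = []
    for row, t in zip(scores, targets):
        # Rank = 1 + number of candidates scored strictly higher than the target.
        ranks.append(1 + sum(s > row[t] for s in row))
    return ranks


def recall_at_k(ranks, k):
    """Fraction of queries whose correct item appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example with made-up scores: 3 queries, 4 candidates each.
scores = [
    [0.9, 0.1, 0.3, 0.2],  # target 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.1],  # target 1 is ranked 2nd
    [0.5, 0.6, 0.1, 0.7],  # target 0 is ranked 3rd
]
targets = [0, 1, 0]
ranks = target_ranks(scores, targets)
r_at_1 = recall_at_k(ranks, 1)
mrr = mean_reciprocal_rank(ranks)
```

In this toy case one of three queries has its target at rank 1, so R@1 is 1/3, while MRR also credits the partially correct rankings.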
Lam Thanh Do, Linh Van Nguyen, David Fu, Kevin Chen-Chuan Chang
The exponential growth of scientific literature has made it increasingly
difficult for researchers to keep up with the literature. In an attempt to
alleviate this problem, we propose CASPER, a sparse retrieval model for
scientific search that utilizes tokens and keyphrases as representation units
(i.e. dimensions in the sparse embedding space), enabling it to represent
queries and documents with research concepts and match them at both granular
and conceptual levels. To overcome the lack of suitable training data, we
propose mining training data by leveraging scholarly references (i.e. signals
that capture how research concepts of papers are expressed in different
settings), including titles, citation contexts, author-assigned keyphrases, and
co-citations. CASPER outperforms strong dense and sparse retrieval baselines on
eight scientific retrieval benchmarks. Moreover, we demonstrate that through
simple post-processing, CASPER can be effectively used for keyphrase
generation tasks, achieving competitive performance with the established
CopyRNN while producing more diverse keyphrases and being nearly four times
faster.
Authors' comments: 11 Pages. Code: https://github.com/louisdo/CASPER
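To illustrate what it means for tokens and keyphrases to share one sparse embedding space, here is a minimal sketch of sparse matching over mixed units. The weights are made up for illustration; in CASPER they would come from the trained model, whose interface is not described in the abstract.

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {unit: weight} dicts.

    A "unit" is either a single token ("retrieval") or a keyphrase
    ("sparse retrieval"); both kinds live in the same dimension space.
    """
    # Iterate over the smaller dict for efficiency.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[u] for u, w in query_vec.items() if u in doc_vec)


# Hypothetical sparse representations (weights are illustrative only).
query = {"retrieval": 0.4, "sparse": 0.5, "sparse retrieval": 1.2}
docs = {
    "doc_a": {"sparse": 0.3, "retrieval": 0.6, "sparse retrieval": 1.0, "keyphrase": 0.2},
    "doc_b": {"dense": 0.7, "retrieval": 0.5, "dense retrieval": 1.1},
}

# Rank documents by their sparse similarity to the query.
ranking = sorted(docs, key=lambda d: sparse_dot(query, docs[d]), reverse=True)
```

Both documents match the query on the token "retrieval", but only doc_a also matches on the keyphrase "sparse retrieval", which is the kind of concept-level match the abstract describes.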
Ziyang Chen, Erxue Min, Xiang Zhao, Yunxin Li, Xin Jia, Jinzhi Liao, Jichao Li, Shuaiqiang Wang et al.
We introduce ChronoQA, a large-scale benchmark dataset for Chinese question
answering, specifically designed to evaluate temporal reasoning in
Retrieval-Augmented Generation (RAG) systems. ChronoQA is constructed from over
300,000 news articles published between 2019 and 2024, and contains 5,176
high-quality questions covering absolute, aggregate, and relative temporal
types with both explicit and implicit time expressions. The dataset supports
both single- and multi-document scenarios, reflecting the real-world
requirements for temporal alignment and logical consistency. ChronoQA features
comprehensive structural annotations and has undergone multi-stage validation,
including rule-based, LLM-based, and human evaluation, to ensure data quality.
By providing a dynamic, reliable, and scalable resource, ChronoQA enables
structured evaluation across a wide range of temporal tasks, and serves as a
robust benchmark for advancing time-sensitive retrieval-augmented question
answering systems.
Authors' comments: 10 pages, 5 figures
Zida Liang, Changfa Wu, Dunxian Huang, Weiqiang Sun, Ziyang Wang, Yuliang Yan, Jian Wu, Yuning Jiang et al.
Recommendation systems are essential tools in modern e-commerce, facilitating
personalized user experiences by suggesting relevant products. Recent
advancements in generative models have demonstrated potential in enhancing
recommendation systems; however, these models often exhibit limitations in
optimizing retrieval tasks, primarily due to their reliance on autoregressive
generation mechanisms. Conventional approaches introduce sequential
dependencies that impede efficient retrieval, as they are inherently unsuitable
for generating multiple items without positional constraints within a single
request session. To address these limitations, we propose TBGRecall, a
framework integrating Next Session Prediction (NSP), designed to enhance
generative retrieval models for e-commerce applications. Our reformulation
partitions input samples into multi-session sequences, where each sequence
comprises a session token followed by a set of item tokens, and further
incorporates multiple optimizations tailored to the generative task in
retrieval scenarios. In terms of training methodology, our pipeline
integrates limited historical data pre-training with stochastic partial
incremental training, significantly improving training efficiency and
emphasizing the superiority of data recency over sheer data volume. Our
extensive experiments, conducted on public benchmarks alongside a large-scale
industrial dataset from TaoBao, show TBGRecall outperforms the state-of-the-art
recommendation methods, and exhibits a clear scaling law trend. Ultimately, NSP
represents a significant advancement in the effectiveness of generative
recommendation systems for e-commerce applications.
Authors' comments: Both authors contributed equally to this research. Work done during
internship at Alibaba. Corresponding author: Dunxian Huang
(dunxian.hdx@alibaba-inc.com). Affiliations: (1) Shanghai Jiaotong
University, Shanghai, China; (2) Alibaba Inc
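The multi-session input format described in the abstract above can be sketched roughly as follows. The `[SESSION]` marker, the event layout, and the function name are hypothetical choices made for illustration; the abstract does not specify the actual tokenization.

```python
from itertools import groupby

SESSION_TOKEN = "[SESSION]"  # hypothetical marker, not from the paper


def to_multi_session_sequence(events):
    """Flatten a chronological interaction log into a multi-session sequence.

    events: list of (session_id, item_id) pairs, ordered by time.
    Each session becomes one session token followed by that session's items.
    """
    sequence = []
    # groupby merges consecutive events that share a session id.
    for _, group in groupby(events, key=lambda e: e[0]):
        sequence.append(SESSION_TOKEN)
        sequence.extend(item for _, item in group)
    return sequence


events = [
    ("s1", "item_12"), ("s1", "item_7"),
    ("s2", "item_3"),
    ("s3", "item_7"), ("s3", "item_44"),
]
sequence = to_multi_session_sequence(events)
```

Under this layout, a model asked to predict the next session would be conditioned on the trailing session token and could emit several items for it, rather than one item per autoregressive position.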
Jun Li, Kai Li, Shaoguo Liu, Tingting Gao
Composed Image Retrieval (CIR) presents a significant challenge as it requires jointly understanding a reference image and a textual modification instruction to find relevant target images. Some existing methods use a two-stage approach to further refine retrieval results. However, this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application in CIR tasks remains limited: existing approaches either compress visual information into text or rely on elaborate prompt designs. Moreover, existing works use CoT only for zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we propose a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called Pyramid Patcher, we enhance the Pyramid Matching Model's understanding of visual information at different granularities. Inspired by representation engineering, we extract representations from CoT data and inject them into large vision-language models (LVLMs). This approach allows us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.
Wonjune Kang, Deb Roy
We introduce the task of expressive speech retrieval, where the goal is to
retrieve speech utterances spoken in a given style based on a natural language
description of that style. While prior work has primarily focused on performing
speech retrieval based on what was said in an utterance, we aim to do so based
on how something was said. We train speech and text encoders to embed speech
and text descriptions of speaking styles into a joint latent space, which
enables using free-form text prompts describing emotions or styles as queries
to retrieve matching expressive speech segments. We perform detailed analyses
of various aspects of our proposed framework, including encoder architectures,
training criteria for effective cross-modal alignment, and prompt augmentation
for improved generalization to arbitrary text queries. Experiments on multiple
datasets encompassing 22 speaking styles demonstrate that our approach achieves
strong retrieval performance as measured by Recall@k.
Authors' comments: Accepted to ASRU 2025
Weilin Ruan, Xilin Dang, Ziyu Zhou, Sisuo Lyu, Yuxuan Liang
Traffic prediction is a cornerstone of modern intelligent transportation systems and a critical task in spatio-temporal forecasting. Although advanced Spatio-temporal Graph Neural Networks (STGNNs) and pre-trained models have achieved significant progress in traffic prediction, two key challenges remain: (i) limited contextual capacity when modeling complex spatio-temporal dependencies, and (ii) low predictability at fine-grained spatio-temporal points due to heterogeneous patterns. Inspired by Retrieval-Augmented Generation (RAG), we propose RAST, a universal framework that integrates retrieval-augmented mechanisms with spatio-temporal modeling to address these challenges. Our framework consists of three key designs: 1) Decoupled Encoder and Query Generator to capture decoupled spatial and temporal features and construct a fusion query via residual fusion; 2) Spatio-temporal Retrieval Store and Retrievers to maintain and retrieve vectorized fine-grained patterns; and 3) Universal Backbone Predictor that flexibly accommodates pre-trained STGNNs or simple MLP predictors. Extensive experiments on six real-world traffic networks, including large-scale datasets, demonstrate that RAST achieves superior performance while maintaining computational efficiency.
Shan Zhong, A. J. Sudler, D. Blume, Alberto M. Marino
Highly efficient quantum memories are essential for advancing quantum information processing technologies, including scalable quantum computing and quantum networks. We experimentally demonstrate a light storage and retrieval protocol in a tripod system using an ensemble of laser-cooled $^{87}$Rb atoms. The tripod system, which consists of three ground states and an excited state, offers rich dynamics: its use to coherently store and retrieve a weak probe pulse in the $^{87}$Rb $F=1$ ground state manifold leads to the interference of two spin-wave excitations during the storage time, which translates to an interference in the peak intensity of the retrieved probe pulse. Our work shows that these interferences, which manifest when varying the pulse sequence or energy level structure, can be controlled experimentally by varying the storage time, optical phase, and magnetic field strength. Theoretical simulations exhibit excellent agreement with the experimental results. This work demonstrates the rich dynamics and versatile capabilities of atomic tripod systems for light storage and retrieval, with key advantages over conventional $\Lambda$-systems, highlighting the potential of atomic tripod systems for applications in quantum information processing, quantum synchronization, and atomic memory protocols.
Caleb G. Abbott, Justin R. Crepp, Brian Sands
The family of multi-plane phase retrieval sensors, such as the curvature and
nonlinear curvature wavefront sensors (WFS), contains tip/tilt information
embedded in their signals. We have built a nonlinear curvature WFS to study
different wavefront reconstruction methods and test the ability to extract
tip/tilt information. Using reliable and fast centroiding algorithms, combined
with knowledge of the measured z-distance to each measurement plane, we
demonstrate that image jitter may be sensed and compensated for using a fast
steering mirror and the WFS alone, i.e. without the need for peripheral
components such as quad-cells or access to a separate scientific imaging
channel. This approach, which is both precise and accurate, corroborates
previous numerical simulations and is expected to improve the overall
reconstruction accuracy of multi-plane phase retrieval sensors including higher
order spatial modes.
Authors' comments: 10 pages, 8 figures, SPIE conference paper
Yao Ding, Yuqing Wu, Ziyang Ding
With the acceleration of technological innovation, efficient retrieval and classification of patent literature have become essential for intellectual property management and enterprise R&D. Traditional keyword and rule-based retrieval methods often fail to address complex query intents or capture semantic associations across technical domains, resulting in incomplete and low-relevance results. This study presents an automated patent retrieval framework integrating Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) technology. The system comprises three components: 1) a preprocessing module for patent data standardization, 2) a high-efficiency vector retrieval engine leveraging LLM-generated embeddings, and 3) a RAG-enhanced query module that combines external document retrieval with context-aware response generation. Evaluations were conducted on the Google Patents dataset (2006-2024), containing millions of global patent records with metadata such as filing date, domain, and status. The proposed gpt-3.5-turbo-0125 + RAG configuration achieved 80.5% semantic matching accuracy and 92.1% recall, surpassing baseline LLM methods by 28 percentage points. The framework also demonstrated strong generalization in cross-domain classification and semantic clustering tasks. These results validate the effectiveness of LLM-RAG integration for intelligent patent retrieval, providing a foundation for next-generation AI-driven intellectual property analysis platforms.
Bongsu Kim
In dense retrieval, effective training hinges on selecting high quality hard
negatives while avoiding false negatives. Recent methods apply heuristics based
on positive document scores to identify hard negatives, improving both
performance and interpretability. However, these global, example agnostic
strategies often miss instance specific false negatives. To address this, we
propose a learnable adapter module that monitors Bi-Encoder representations to
estimate the likelihood that a hard negative is actually a false negative. This
probability is modeled dynamically and contextually, enabling fine-grained,
query specific judgments. The predicted scores are used in two downstream
components: (1) resampling, where negatives are reweighted during training, and
(2) reranking, where top-k retrieved documents are reordered at inference.
Empirical results on standard benchmarks show that our adapter-enhanced
framework consistently outperforms strong Bi-Encoder baselines, underscoring
the benefit of explicit false negative modeling in dense retrieval.
Authors' comments: 8 pages, 4 figures, submitted to AAAI 2026
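One way to picture how a predicted false-negative probability could reweight negatives during training is the sketch below. The abstract does not give the loss, so the specific form here (scaling each negative's term in an InfoNCE-style denominator by one minus its false-negative probability) is an assumption made for illustration, not the author's formulation.

```python
import math


def reweighted_infonce(pos_score, neg_scores, fn_probs, temperature=1.0):
    """InfoNCE-style loss with per-negative weights (illustrative assumption).

    fn_probs[i] is the estimated probability that negative i is actually
    a false negative; its contribution is scaled by (1 - fn_probs[i]).
    """
    pos = math.exp(pos_score / temperature)
    neg = sum(
        (1.0 - p) * math.exp(s / temperature)
        for s, p in zip(neg_scores, fn_probs)
    )
    return -math.log(pos / (pos + neg))


# Made-up scores: the first negative scores almost as high as the positive.
pos_score = 2.0
neg_scores = [1.9, 0.5]

# Standard contrastive loss: every negative counts fully.
standard = reweighted_infonce(pos_score, neg_scores, [0.0, 0.0])
# Adapter flags the first negative as a likely false negative.
adjusted = reweighted_infonce(pos_score, neg_scores, [0.9, 0.05])
```

With the suspicious negative down-weighted, the loss no longer pushes the query as strongly away from a document that may in fact be relevant.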
Antoine Chaffin, Raphaël Sourty
Neural ranking has become a cornerstone of modern information retrieval.
While single vector search remains the dominant paradigm, it suffers from the
shortcoming of compressing all the information into a single vector. This
compression leads to notable performance degradation in out-of-domain,
long-context, and reasoning-intensive retrieval tasks. Multi-vector approaches
pioneered by ColBERT aim to address these limitations by preserving individual
token embeddings and computing similarity via the MaxSim operator. This
architecture has demonstrated superior empirical advantages, including enhanced
out-of-domain generalization, long-context handling, and performance in complex
retrieval scenarios. Despite these compelling empirical results and clear
theoretical advantages, the practical adoption and public availability of late
interaction models remain low compared to their single-vector counterparts,
primarily due to a lack of accessible and modular tools for training and
experimenting with such models. To bridge this gap, we introduce PyLate, a
streamlined library built on top of Sentence Transformers to support
multi-vector architectures natively, inheriting its efficient training,
advanced logging, and automated model card generation while requiring minimal
code changes to code templates users are already familiar with. By offering
multi-vector-specific features such as efficient indexes, PyLate aims to
accelerate research and real-world application of late interaction models,
thereby unlocking their full potential in modern IR systems. Finally, PyLate
has already enabled the development of state-of-the-art models, including
GTE-ModernColBERT and Reason-ModernColBERT, demonstrating its practical utility
for both research and production environments.
Authors' comments: 5 pages
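The MaxSim operator mentioned in the abstract above can be sketched in a few lines of plain Python. This illustrates the operator only and is not PyLate's API; the two-dimensional toy embeddings are made up.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score between token-level embeddings.

    For each query token embedding, take its best similarity over all
    document token embeddings, then sum these maxima over query tokens.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (2-D, illustrative only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[1.0, 0.0], [0.6, -0.8]]

score_a = maxsim(query, doc_a)  # 0.8 + 1.0
score_b = maxsim(query, doc_b)  # 1.0 + 0.0
```

Because every query token is matched independently, doc_a scores higher: it has a good match for both query tokens, whereas doc_b matches only the first.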
Pengcheng Wang, Sheng Li, Takahiro Shinozaki
In this paper, we propose RAG-Boost (ST-ShinozakiLab Task I system), which enhances the baseline LLM-based ASR system of the MLC-SLM Challenge (task I) with a retrieval-augmented generation (RAG) module on the fly. Each partial ASR hypothesis queries a vector store of audio-text pairs and domain terms, and the retrieved results are fused with the live ASR hypotheses to fix recognition errors. The fused hypotheses are passed to the LLM, yielding improved responses.
Authors' comments: accepted at Interspeech2025 MLC-SLM Challenge workshop (task I system description)
Jiawei Li, Chengye Yang, Yaochen Zhang, Weilin Sun, Lei Meng, Xiangxu Meng
The goal of construction site risk and hazard identification is to enhance
safety management through automation. Existing research based on large language
models falls into two categories: image-text matching for collaborative
reasoning, which struggles with complex hazard features, and instruction
fine-tuning or dialogue guidance using professional datasets, which suffers
from high training costs and poor generalization. To address this, we propose a
hazard identification method using similar case retrieval enhancement. By
integrating external knowledge and retrieved case contexts via prompt
fine-tuning, we mitigate misjudgments caused by limited domain knowledge and
weak feature associations. Our method includes three modules: retrieval
library, image similarity retrieval, and large model retrieval enhancement,
enabling efficient recognition without training. Experiments on real
construction data show significant improvements. For instance, GLM-4V's
recognition accuracy increased to 50\%, a 35.49\% boost. The method enhances
accuracy, context understanding, and stability, offering new theoretical and
technical support for hazard detection.
Authors' comments: in Chinese language
Kennedy Edemacu, Vinay M. Shashidhar, Micheal Tuape, Dan Abudu, Beakcheol Jang, Jong Wook Kim
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, where attackers can compromise the knowledge source to mislead the generation model. One such attack is PoisonedRAG, in which the injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we identify a new property that differentiates adversarial texts from clean texts in the knowledge data source. Next, we employ this property in the design of our proposed approaches to filter adversarial texts out from clean ones. Evaluation of these methods on benchmark datasets demonstrates their effectiveness, with performance close to that of the original RAG systems.
Authors' comments: Preprint for Submission
Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, Zhouhan Lin
While modern decoder-only LLMs achieve superior performance across various domains, hallucinations have risen to be a common problem in their generated text, hindering their application in knowledge-intensive tasks. Retrieval-augmented generation (RAG) offers a solution, but the non-parametric nature of the retriever hinders its deep interaction with the LLM. In this work, we propose to decouple memorization from the LLM decoder using a pretrained, differentiable external memory. The external memory is an MLP pretrained by imitating the behavior of a retriever on the entire pretraining dataset. Our resulting architecture, which comprises a transformer decoder and an external MLP memory pretrained on language modeling and retriever imitation respectively, demonstrates strong perplexity and performance on downstream tasks. Experiments show our architecture exhibits steeper power-law scaling with model size, achieving 17.5% and 24.1% improvement on WikiText-103 and Web datasets compared to decoder-only models while benefiting from added training without overfitting. We demonstrate superior performance on three hallucination benchmarks and nine memory-intensive tasks. Additionally, our approach delivers $80\times$ speedup over $k$NN-LM (500M tokens) and $1.3\times$ faster inference than decoder-only models. Unlike $k$NN-LM, which impairs reasoning, our MLP memory improves StrategyQA performance. We will open-source our code and models in the future.
Yi Jiang, Sendong Zhao, Jianbo Li, Haochun Wang, Lizhe Zhang, Yan Liu, Bin Qin
Retrieval-Augmented Generation (RAG) has emerged as a promising framework for
enhancing the capabilities of Large Language Models (LLMs), especially in
knowledge-intensive tasks. Despite its advantages, current RAG methods often
struggle to *fully exploit knowledge during generation*. In particular, the
synergy between the model's internal parametric knowledge and external
retrieved knowledge remains limited. Retrieved contents may sometimes mislead
generation, while certain generated content can guide the model toward more
accurate outputs. In this work, we propose Collaborative Chain-of-Agents, a
framework designed to explicitly enhance the synergy between parametric and
retrieved knowledge. Specifically, we first introduce CoCoA-zero, a multi-agent
RAG framework that first performs conditional knowledge induction and then
reasons answers. Building on this, we develop CoCoA, a long-chain training
strategy that synthesizes extended multi-agent reasoning trajectories from
CoCoA-zero to fine-tune the LLM. This strategy enhances the model's capability
to explicitly integrate and jointly leverage parametric and retrieved
knowledge. Experimental results show that CoCoA-zero and CoCoA achieve superior
performance on open-domain and multi-hop QA tasks.
Authors' comments: code available at https://github.com/liunian-Jay/CoCoA
Sateesh Kumar, Shivin Dass, Georgios Pavlakos, Roberto Martín-Martín
In this work, we study the problem of data retrieval for few-shot imitation
learning: selecting data from a large dataset to train a performant policy for
a specific task, given only a few target demonstrations. Prior methods retrieve
data using a single-feature distance heuristic, assuming that the best
demonstrations are those that most closely resemble the target examples in
visual, semantic, or motion space. However, this approach captures only a
subset of the relevant information and can introduce detrimental
demonstrations, e.g., retrieving data from unrelated tasks due to similar scene
layouts, or selecting similar motions from tasks with divergent goals. We
present COLLAGE, a method for COLLective data AGgrEgation in few-shot imitation
learning that uses an adaptive late fusion mechanism to guide the selection of
relevant demonstrations based on a task-specific combination of multiple cues.
COLLAGE follows a simple, flexible, and efficient recipe: it assigns weights to
subsets of the dataset that are pre-selected using a single feature (e.g.,
appearance, shape, or language similarity), based on how well a policy trained
on each subset predicts actions in the target demonstrations. These weights are
then used to perform importance sampling during policy training, sampling data
more densely or sparsely according to estimated relevance. COLLAGE is general
and feature-agnostic, allowing it to combine any number of subsets selected by
any retrieval heuristic, and to identify which subsets provide the greatest
benefit for the target task. In extensive experiments, COLLAGE outperforms
state-of-the-art retrieval and multi-task learning approaches by 5.1% in
simulation across 10 tasks, and by 16.6% in the real world across 6 tasks,
where we perform retrieval from the large-scale DROID dataset. More information
at https://robin-lab.cs.utexas.edu/COLLAGE .
Authors' comments: Accepted at the Conference on Robot Learning (CoRL), 2025. Project
page: https://robin-lab.cs.utexas.edu/COLLAGE
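The weighting-and-sampling recipe described in the abstract above can be sketched as follows. The softmax normalization, the score values, and all names are assumptions for illustration; the abstract states only that subsets are weighted by how well a policy trained on each predicts the target actions, and that the weights drive importance sampling.

```python
import math
import random


def subset_weights(scores, temperature=1.0):
    """Turn per-subset scores into sampling weights via a softmax (assumed).

    scores[name] is, e.g., the mean log-likelihood of the target
    demonstrations' actions under a policy trained on that subset.
    """
    m = max(scores.values())  # subtract the max for numerical stability
    exp = {k: math.exp((v - m) / temperature) for k, v in scores.items()}
    total = sum(exp.values())
    return {k: e / total for k, e in exp.items()}


def sample_batch(subsets, weights, batch_size, rng):
    """Importance-sample a training batch: pick a subset, then a demo in it."""
    names = list(subsets)
    chosen = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [rng.choice(subsets[n]) for n in chosen]


# Hypothetical scores for subsets retrieved with three different features.
scores = {"appearance": -1.2, "shape": -0.4, "language": -2.5}
weights = subset_weights(scores)

subsets = {
    "appearance": ["a0", "a1"],
    "shape": ["s0", "s1", "s2"],
    "language": ["l0"],
}
batch = sample_batch(subsets, weights, batch_size=8, rng=random.Random(0))
```

Subsets whose policies explain the target demonstrations better receive larger weights and are therefore sampled more densely during training.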
Yiqun Chen, Erhan Zhang, Lingyong Yan, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Jiaxin Mao
In question-answering (QA) systems, Retrieval-Augmented Generation (RAG) has become pivotal in enhancing response accuracy and reducing hallucination issues. The architecture of RAG systems varies significantly, encompassing single-round RAG, iterative RAG, and reasoning RAG, each tailored to address different types of queries. Due to the varying complexity of real-world queries, a fixed RAG pipeline often struggles to balance performance and cost efficiency across different queries. To address this challenge, we propose an adaptive RAG framework called MAO-ARAG, which leverages multi-agent orchestration. Our adaptive RAG is conceived as a multi-turn framework. Specifically, we define multiple executor agents, representing typical RAG modules such as query reformulation agents, document selection agents, and generation agents. A planner agent intelligently selects and integrates the appropriate agents from these executors into a suitable workflow tailored for each query, striving for high-quality answers while maintaining reasonable costs. During each turn, the planner agent is trained using reinforcement learning, guided by an outcome-based reward (F1 score) and a cost-based penalty, continuously improving answer quality while keeping costs within a reasonable range. Experiments conducted on multiple QA datasets demonstrate that our approach, which dynamically plans workflows for each query, not only achieves high answer quality but also maintains both cost and latency within acceptable limits. The code of MAO-ARAG is available at https://github.com/chenyiqun/Agentic-RAG.
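The planner's training signal combines an F1-based outcome reward with a cost penalty. A minimal sketch of such a reward is given below; the linear form, the `cost_weight` coefficient, and the way cost is counted are hypothetical, since the abstract does not give the exact formula.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def planner_reward(prediction, reference, cost, cost_weight=0.1):
    """Outcome reward minus a cost penalty (linear form is an assumption).

    cost could be, for example, the number of executor agents invoked.
    """
    return token_f1(prediction, reference) - cost_weight * cost


answer = "the eiffel tower in paris"
gold = "eiffel tower"
f1 = token_f1(answer, gold)                    # precision 2/5, recall 2/2
cheap = planner_reward(answer, gold, cost=1)   # short workflow
costly = planner_reward(answer, gold, cost=3)  # longer workflow, same answer
```

When two workflows yield the same answer, the cheaper one receives the higher reward, which is the trade-off the planner is trained to exploit.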
Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu et al.
Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose an agentic RAG framework enabling LLMs to autonomously decompose radiology questions, iteratively retrieve targeted clinical evidence from Radiopaedia, and dynamically synthesize evidence-based responses. We evaluated 24 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting (73% vs. 64%; P<0.001) and conventional online RAG (73% vs. 68%; P<0.001). The greatest gains occurred in mid-sized models (e.g., Mistral Large improved from 72% to 81%) and small-scale models (e.g., Qwen 2.5-7B improved from 55% to 71%), while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, agentic retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models exhibited meaningful improvements (e.g., MedGemma-27B improved from 71% to 81%), indicating complementary roles of retrieval and fine-tuning. These results highlight the potential of agentic frameworks to enhance factuality and diagnostic accuracy in radiology QA, particularly among mid-sized LLMs, warranting future studies to validate their clinical utility.