Quang Hoang Trung, Le Trung Hoang, Nguyen Van Hoang Phuc
Efficient text retrieval is critical for applications such as legal document analysis, particularly in specialized contexts like Japanese legal systems. Existing retrieval methods often underperform in such domain-specific scenarios, necessitating tailored approaches. In this paper, we introduce a novel two-phase text retrieval pipeline optimized for Japanese legal datasets. Our method leverages advanced language models to achieve state-of-the-art performance, significantly improving retrieval efficiency and accuracy. To further enhance robustness and adaptability, we incorporate an ensemble model that integrates multiple retrieval strategies, resulting in superior outcomes across diverse tasks. Extensive experiments validate the effectiveness of our approach, demonstrating strong performance on both Japanese legal datasets and widely recognized benchmarks like MS-MARCO. Our work establishes new standards for text retrieval in domain-specific and general contexts, providing a comprehensive solution for addressing complex queries in legal and multilingual environments.
Xinkai Du, Quanjie Han, Chao Lv, Yan Liu, Yalin Sun, Hao Shu, Hongbo Shan, Maosong Sun
Open-domain Question Answering (QA) has garnered substantial interest by
combining the advantages of faithfully retrieved passages and relevant passages
generated through Large Language Models (LLMs). However, there is a lack of
definitive labels available to pair these sources of knowledge. In order to
address this issue, we propose an unsupervised and simple framework called
Bi-Reranking for Merging Generated and Retrieved Knowledge (BRMGR), which
utilizes re-ranking methods for both retrieved passages and LLM-generated
passages. We pair the two types of passages using two separate re-ranking
methods and then combine them through greedy matching. We demonstrate that
BRMGR is equivalent to employing a bipartite matching loss when assigning each
retrieved passage a corresponding LLM-generated passage. Experiments on three
datasets show that BRMGR improves performance by +1.7 and +1.6 on the NQ and
WebQ datasets, respectively, and obtains comparable results on the TriviaQA
dataset when compared to competitive baselines.
Authors' comments: Accepted by ICASSP 2025
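The greedy matching step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a precomputed score matrix between retrieved passages (rows) and LLM-generated passages (columns); how BRMGR actually computes these scores is not stated in the abstract, so the matrix and the function name greedy_match are hypothetical.

```python
from typing import List, Tuple


def greedy_match(scores: List[List[float]]) -> List[Tuple[int, int]]:
    """Greedily pair rows (retrieved passages) with columns (generated passages).

    scores[i][j] is an assumed compatibility score between retrieved passage i
    and generated passage j. Pairs are taken in descending score order, and
    each row and each column is used at most once.
    """
    # Flatten the matrix into (score, row, col) triples, best score first.
    candidates = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_rows, used_cols = set(), set()
    pairs: List[Tuple[int, int]] = []
    for _, i, j in candidates:
        if i in used_rows or j in used_cols:
            continue  # one side of this pair is already matched
        pairs.append((i, j))
        used_rows.add(i)
        used_cols.add(j)
    return pairs


example_scores = [[0.9, 0.1], [0.8, 0.7]]
matched = greedy_match(example_scores)
```

Greedy matching is an approximation to optimal bipartite assignment: it is simple and fast, but it can miss the pairing with the highest total score.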
Hila Levi, Guy Heller, Dan Levi
As working with large datasets becomes standard, the task of accurately
retrieving images containing objects of interest by an open set textual query
gains practical importance. The current leading approach utilizes a pre-trained
CLIP model without any adaptation to the target domain, balancing accuracy and
efficiency through additional post-processing. In this work, we propose FOR:
Finetuning for Object-centric Open-vocabulary Image Retrieval, which allows
finetuning on a target dataset using closed-set labels while keeping the
visual-language association crucial for open vocabulary retrieval. FOR is based
on two design elements: a specialized decoder variant of the CLIP head
customized for the intended task, and its coupling within a multi-objective
training framework. Together, these design choices result in a significant
increase in accuracy, showcasing improvements of up to 8 mAP@50 points over
SoTA across three datasets. Additionally, we demonstrate that FOR is also
effective in a semi-supervised setting, achieving impressive results even when
only a small portion of the dataset is labeled.
Authors' comments: WACV 2025
Zhili Shen, Chenxin Diao, Pavlos Vougiouklis, Pascual Merita, Shriram Piramanayagam, Damien Graux, Dandan Tu, Zeren Jiang et al.
Retrieval-augmented generation systems rely on effective document retrieval capabilities. By design, conventional sparse or dense retrievers face challenges in multi-hop retrieval scenarios. In this paper, we present GeAR, which advances RAG performance through two key innovations: (i) graph expansion, which enhances any conventional base retriever, such as BM25, and (ii) an agent framework that incorporates graph expansion. Our evaluation demonstrates GeAR's superior retrieval performance on three multi-hop question answering datasets. Additionally, our system achieves state-of-the-art results with improvements exceeding 10% on the challenging MuSiQue dataset, while requiring fewer tokens and iterations compared to other multi-step retrieval systems.
Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Changjiang Zhou, Maarten de Rijke, Xueqi Cheng
Generative information retrieval methods retrieve documents by directly
generating their identifiers. Much effort has been devoted to developing
effective generative IR models. Less attention has been paid to the robustness
of these models. It is critical to assess the out-of-distribution (OOD)
generalization of generative IR models, i.e., how would such models generalize
to new distributions? To answer this question, we focus on OOD scenarios from
four perspectives in retrieval problems: (i) query variations; (ii) unseen query
types; (iii) unseen tasks; and (iv) corpus expansion. Based on this taxonomy, we
conduct empirical studies to analyze the OOD robustness of representative
generative IR models against dense retrieval models. Our empirical results
indicate that the OOD robustness of generative IR models is in need of
improvement. By inspecting the OOD robustness of generative IR models we aim to
contribute to the development of more reliable IR models. The code is available
at https://github.com/Davion-Liu/GR_OOD.
Authors' comments: Accepted by ECIR 2025. arXiv admin note: substantial text overlap
with arXiv:2306.12756
Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang
Long Context Language Models (LCLMs) have emerged as a new paradigm to
perform Information Retrieval (IR), which enables the direct ingestion and
retrieval of information by processing an entire corpus in their single
context, showcasing the potential to surpass traditional sparse and dense
retrieval methods. However, processing a large number of passages in context
for retrieval is computationally expensive, and handling their
representations during inference further exacerbates the processing time; thus,
we aim to make LCLM retrieval more efficient and potentially more effective
with passage compression. Specifically, we propose a new compression approach
tailored for LCLM retrieval, which is trained to maximize the retrieval
performance while minimizing the length of the compressed passages. To
accomplish this, we generate synthetic data, where compressed passages are
automatically created and labeled as chosen or rejected according to their
retrieval success for a given query, and we train the proposed Compression
model for Long context Retrieval (CoLoR) with this data via preference
optimization, adding a length regularization loss on top of it to
enforce brevity. Through extensive experiments on 9 datasets, we show that
CoLoR improves the retrieval performance by 6% while compressing the in-context
size by a factor of 1.91. Our code is available at:
https://github.com/going-doer/CoLoR.
Authors' comments: ACL 2025
Chengbing Wang, Yang Zhang, Fengbin Zhu, Jizhi Zhang, Tianhao Shi, Fuli Feng
Leveraging Large Language Models (LLMs) to harness user-item interaction histories for item generation has emerged as a promising paradigm in generative recommendation. However, the limited context window of LLMs often restricts them to focusing on recent user interactions only, leading to the neglect of long-term interests involved in the longer histories. To address this challenge, we propose a novel Automatic Memory-Retrieval framework (AutoMR), which is capable of storing long-term interests in the memory and extracting relevant information from it for next-item generation within LLMs. Extensive experimental results on two real-world datasets demonstrate the effectiveness of our proposed AutoMR framework in utilizing long-term interests for generative recommendation.
Xiaopeng Li, Xiangyang Li, Hao Zhang, Zhaocheng Du, Pengyue Jia, Yichao Wang, Xiangyu Zhao, Huifeng Guo et al.
The performance of dense retrieval (DR) is significantly influenced by the quality of negative sampling. Traditional DR methods primarily depend on naive negative sampling techniques or on mining hard negatives through an external retriever and meticulously crafted strategies. However, naive negative sampling often fails to adequately capture the accurate boundaries between positive and negative samples, whereas existing hard negative sampling methods are prone to false negatives, resulting in performance degradation and training instability. Recent advancements in large language models (LLMs) offer an innovative solution to these challenges by generating contextually rich and diverse negative samples. In this work, we present a framework that harnesses LLMs to synthesize high-quality hard negative samples. We first devise a multi-attribute self-reflection prompting strategy to direct LLMs in hard negative sample generation. Then, we implement a hybrid sampling strategy that integrates these synthetic negatives with traditionally retrieved negatives, thereby stabilizing the training process and improving retrieval performance. Extensive experiments on five benchmark datasets demonstrate the efficacy of our approach, and the code is publicly available.
Arnav M. Das, Gantavya Bhatt, Lilly Kumari, Sahil Verma, Jeff Bilmes
Retrieval augmentation, the practice of retrieving additional data from large
auxiliary pools, has emerged as an effective technique for enhancing model
performance in the low-data regime. Prior approaches have employed only
nearest-neighbor based strategies for data selection, which retrieve auxiliary
samples with high similarity to instances in the target task. However, these
approaches are prone to selecting highly redundant samples, since they fail to
incorporate any notion of diversity. In our work, we first demonstrate that
data selection strategies used in prior retrieval-augmented few-shot adaptation
settings can be generalized using a class of functions known as Combinatorial
Mutual Information (CMI) measures. We then propose COBRA (COmBinatorial
Retrieval Augmentation), which employs an alternative CMI measure that
considers both diversity and similarity to a target dataset. COBRA consistently
outperforms previous retrieval approaches across image classification tasks and
few-shot learning techniques when used to retrieve samples from LAION-2B. COBRA
introduces negligible computational overhead to the cost of retrieval while
providing significant gains in downstream model performance.
Authors' comments: Accepted at CVPR 2025
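To make the idea of balancing similarity and diversity concrete, here is a small sketch of a greedy selection rule in the style of maximal marginal relevance. This is a swapped-in illustration, not the CMI measure that COBRA uses; the function name select_diverse, the trade_off parameter, and the toy similarity values are all assumptions.

```python
from typing import List


def select_diverse(
    sim_to_target: List[float],
    pairwise_sim: List[List[float]],
    k: int,
    trade_off: float = 0.5,
) -> List[int]:
    """Greedily pick k candidates, rewarding similarity to the target and
    penalizing redundancy with already selected candidates.

    trade_off = 1.0 reduces to plain nearest-neighbor ranking.
    """
    selected: List[int] = []
    remaining = list(range(len(sim_to_target)))
    while remaining and len(selected) < k:

        def gain(i: int) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return trade_off * sim_to_target[i] - (1.0 - trade_off) * redundancy

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less similar to the
# target but adds new information.
target_sim = [0.9, 0.85, 0.5]
pair_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
balanced = select_diverse(target_sim, pair_sim, k=2, trade_off=0.5)
nearest_only = select_diverse(target_sim, pair_sim, k=2, trade_off=1.0)
```

With the toy values above, the pure nearest-neighbor setting picks the two near-duplicates, while the balanced setting swaps one of them for the less redundant candidate.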
Lovisa Hagström, Sara Vera Marjanović, Haeun Yu, Arnav Arora, Christina Lioma, Maria Maistro, Pepa Atanasova, Isabelle Augenstein
Retrieval-augmented generation (RAG) helps address the limitations of the
parametric knowledge embedded within a language model (LM). However,
investigations of how LMs utilise retrieved information of varying complexity
in real-world scenarios have been limited to synthetic contexts. We introduce
DRUID (Dataset of Retrieved Unreliable, Insufficient and
Difficult-to-understand contexts) with real-world queries and contexts manually
annotated for stance. The dataset is based on the prototypical task of
automated claim verification, for which automated retrieval of real-world
evidence is crucial. We compare DRUID to synthetic datasets (CounterFact,
ConflictQA) and find that artificial datasets often fail to represent the
complex and diverse real-world context settings. We show that synthetic
datasets exaggerate context characteristics rare in real retrieved data, which
leads to inflated context utilisation results, as measured by our novel ACU
score. Moreover, while previous work has mainly focused on singleton context
characteristics to explain context utilisation, correlations between singleton
context properties and ACU on DRUID are surprisingly small compared to other
properties related to context source. Overall, our work underscores the need
for real-world aligned context utilisation studies to represent and improve
performance in real-world RAG settings.
Authors' comments: 43 pages, 18 figures
Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang et al.
Universal Multimodal Retrieval (UMR) aims to enable search across various
modalities using a unified model, where queries and candidates can consist of
pure text, images, or a combination of both. Previous work has attempted to
adopt multimodal large language models (MLLMs) to realize UMR using only text
data. However, our preliminary experiments demonstrate that more diverse
multimodal training data can further unlock the potential of MLLMs. Despite its
effectiveness, the existing multimodal training data is highly imbalanced in
terms of modality, which motivates us to develop a training data synthesis
pipeline and construct a large-scale, high-quality fused-modal training
dataset. Based on the synthetic training data, we develop the General
Multimodal Embedder (GME), an MLLM-based dense retriever designed for UMR.
Furthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the
effectiveness of our approach. Experimental results show that our method
achieves state-of-the-art performance among existing UMR methods. Lastly, we
provide in-depth analyses of model scaling and training strategies, and perform
ablation studies on both the model and synthetic data.
Authors' comments: 32 pages, models at
https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct
Majd Zayyad, Yossi Adi
The integration of retrieval-augmented techniques with LLMs has shown promise in improving performance across various domains. However, their utility in tasks requiring advanced reasoning, such as generating and evaluating mathematical statements and proofs, remains underexplored. This study explores the use of Lean, a programming language for writing mathematical proofs, to populate the knowledge corpus used by RAG systems. We hope this lays the foundation for exploring different methods of using RAG to improve the performance of LLMs in advanced logical reasoning tasks.
Silin Yang, Dong Wang, Haoqi Zheng, Ruochun Jin
Although the rise of large language models (LLMs) has introduced new opportunities for time series forecasting, existing LLM-based solutions require excessive training and exhibit limited transferability. In view of these challenges, we propose TimeRAG, a framework that incorporates Retrieval-Augmented Generation (RAG) into time series forecasting LLMs. TimeRAG constructs a time series knowledge base from historical sequences, retrieves reference sequences from the knowledge base that exhibit patterns similar to the query sequence as measured by Dynamic Time Warping (DTW), and combines these reference sequences and the prediction query into a textual prompt for the time series forecasting LLM. Experiments on datasets from various domains show that the integration of RAG improves the prediction accuracy of the original model by 2.97% on average.
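The DTW-based retrieval step described above can be sketched as follows. This is a minimal illustration using the classic dynamic-programming form of DTW with an absolute-difference local cost; the function names, the toy knowledge base, and the choice of k are assumptions, and the abstract does not say how TimeRAG builds or indexes its knowledge base.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def retrieve_similar(
    query: Sequence[float], knowledge_base: List[Sequence[float]], k: int = 2
) -> List[int]:
    """Return indices of the k knowledge-base sequences closest to the query."""
    ranked = sorted(
        range(len(knowledge_base)),
        key=lambda idx: dtw_distance(query, knowledge_base[idx]),
    )
    return ranked[:k]


# Hypothetical knowledge base of historical sequences.
kb = [[10, 11, 12], [1, 2, 2, 3], [2, 3, 4]]
top = retrieve_similar([1, 2, 3], kb, k=2)
```

The retrieved indices would then be used to look up reference sequences and serialize them into the prompt alongside the query sequence. DTW tolerates local stretching, so the second sequence in the toy knowledge base matches the query exactly even though it is longer.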
Do June Min, Karel Mundnich, Andy Lapastora, Erfan Soltanmohammadi, Srikanth Ronanki, Kyu Han
One common approach for question answering over speech data is to first
transcribe speech using automatic speech recognition (ASR) and then employ
text-based retrieval-augmented generation (RAG) on the transcriptions. While
this cascaded pipeline has proven effective in many practical settings, ASR
errors can propagate to the retrieval and generation steps. To overcome this
limitation, we introduce SpeechRAG, a novel framework designed for
open-question answering over spoken data. Our proposed approach fine-tunes a
pre-trained speech encoder into a speech adapter fed into a frozen large
language model (LLM)-based retrieval model. By aligning the embedding spaces
of text and speech, our speech retriever directly retrieves audio passages from
text-based queries, leveraging the retrieval capacity of the frozen text
retriever. Our retrieval experiments on spoken question answering datasets show
that direct speech retrieval does not degrade over the text-based baseline, and
outperforms the cascaded systems using ASR. For generation, we use a speech
language model (SLM) as a generator, conditioned on audio passages rather than
transcripts. Without fine-tuning of the SLM, this approach outperforms cascaded
text-based models when there is high WER in the transcripts.
Authors' comments: ICASSP 2025
Zhuoyi Shang, Yanwei Liu, Jinxia Liu, Xiaoyan Gu, Ying Ding, Xiangyang Ji
For general users, training a neural network from scratch is usually
challenging and labor-intensive. Fortunately, neural network zoos enable them
to find a well-performing model to use directly or fine-tune in their
local environments. Although current model retrieval solutions attempt to
convert neural network models into vectors to avoid complex multiple inference
processes required for model selection, it is still difficult to choose a
suitable model due to inaccurate vectorization and biased correlation alignment
between the query dataset and models. From the perspective of knowledge
consistency, i.e., whether the knowledge possessed by the model can meet the
needs of query tasks, we propose a model retrieval scheme, named Know2Vec, that
acts as a black-box retrieval proxy for a model zoo. Know2Vec first accesses
models via a black-box interface in advance, capturing vital decision knowledge
from models while ensuring their privacy. Next, it employs an effective
encoding technique to transform the knowledge into precise model vectors.
Then, it maps the user's query task to a knowledge vector by probing the
semantic relationships within query samples. Furthermore, the proxy ensures
knowledge consistency between the query vector and model vectors within their
alignment space, which is optimized through supervised learning with
diverse loss functions, and finally it can identify the most suitable model for
a given task during the inference stage. Extensive experiments show that our
Know2Vec achieves superior retrieval accuracy against the state-of-the-art
methods in diverse neural network retrieval tasks.
Authors' comments: AAAI2025 accepted
Marius Memmel, Jacob Berg, Bingqing Chen, Abhishek Gupta, Jonathan Francis
Robot learning is witnessing a significant increase in the size, diversity,
and complexity of pre-collected datasets, mirroring trends in domains such as
natural language processing and computer vision. Many robot learning methods
treat such datasets as multi-task expert data and learn a multi-task,
generalist policy by training broadly across them. Notably, while these
generalist policies can improve the average performance across many tasks, the
performance of generalist policies on any one task is often suboptimal due to
negative transfer between partitions of the data, compared to task-specific
specialist policies. In this work, we argue for the paradigm of training
policies during deployment given the scenarios they encounter: rather than
deploying pre-trained policies to unseen problems in a zero-shot manner, we
non-parametrically retrieve and train models directly on relevant data at test
time. Furthermore, we show that many robotics tasks share considerable amounts
of low-level behaviors and that retrieval at the "sub"-trajectory granularity
enables significantly improved data utilization, generalization, and robustness
in adapting policies to novel problems. In contrast, existing full-trajectory
retrieval methods tend to underutilize the data and miss out on shared
cross-task content. This work proposes STRAP, a technique for leveraging
pre-trained vision foundation models and dynamic time warping to retrieve
sub-sequences of trajectories from large training corpora in a robust fashion.
STRAP outperforms both prior retrieval algorithms and multi-task learning
methods in simulated and real experiments, showing the ability to scale to much
larger offline datasets in the real world as well as the ability to learn
robust control policies with just a handful of real-world demonstrations.
Authors' comments: Project website at https://weirdlabuw.github.io/strap/
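One common way to retrieve a sub-sequence with dynamic time warping is subsequence DTW, where the query may start and end anywhere inside a longer sequence. The sketch below illustrates that variant on scalar features. Whether STRAP uses exactly this formulation is an assumption, and in practice each time step would be an embedding from a vision foundation model, not a single number.

```python
from typing import Sequence, Tuple


def subsequence_dtw(
    query: Sequence[float], corpus: Sequence[float]
) -> Tuple[float, int, int]:
    """Find the segment of corpus that best matches query under DTW.

    Returns (distance, start, end) so that corpus[start:end] is the match.
    """
    n, m = len(query), len(corpus)
    inf = float("inf")
    # cost[i][j]: best cost of aligning query[:i] to a segment ending at corpus[j-1].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: corpus index where that best alignment begins.
    start = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0  # the match may begin at any corpus position for free
        start[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - corpus[j - 1])
            # Tuples compare by cost first, then by the earlier start index.
            prev_cost, prev_start = min(
                (cost[i - 1][j - 1], start[i - 1][j - 1]),
                (cost[i - 1][j], start[i - 1][j]),
                (cost[i][j - 1], start[i][j - 1]),
            )
            cost[i][j] = d + prev_cost
            start[i][j] = prev_start
    # The match may also end at any corpus position.
    end = min(range(1, m + 1), key=lambda j: cost[n][j])
    return cost[n][end], start[n][end], end


dist, seg_start, seg_end = subsequence_dtw([1, 2, 3], [9, 9, 1, 2, 3, 9])
```

In the toy call, the query is found exactly inside the longer sequence, so the returned slice bounds pick out the matching segment and the distance is zero.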
Zexiong Ma, Shengnan An, Zeqi Lin, Yanzhen Zou, Jian-Guang Lou, Bing Xie
Large language models (LLMs) are susceptible to generating hallucinated information, despite the integration of retrieval-augmented generation (RAG). Parallel context extension (PCE) is a line of research that attempts to effectively integrate parallel (unordered) contexts, but it still suffers from hallucinations when adapted to RAG scenarios. In this paper, we propose DePaC (Dehallucinating Parallel Context Extension), which alleviates the hallucination problem with context-aware negative training and information-calibrated aggregation. DePaC is designed to alleviate two types of in-context hallucination: fact fabrication (i.e., LLMs present claims that are not supported by the contexts) and fact omission (i.e., LLMs fail to present claims that can be supported by the contexts). Specifically, (1) for fact fabrication, we apply context-aware negative training, which fine-tunes the LLMs with negative supervision, thus explicitly guiding the LLMs to refuse to answer when contexts are not related to questions; (2) for fact omission, we propose information-calibrated aggregation, which prioritizes context windows with higher information increment from their contexts. The experimental results on nine RAG tasks demonstrate that DePaC significantly alleviates the two types of hallucination and consistently achieves better performance on these tasks.
Peize Li, Qingyi Si, Peng Fu, Zheng Lin, Yan Wang
The retrieval-based multi-image question answering (QA) task involves retrieving
multiple question-related images and synthesizing these images to generate an
answer. Conventional "retrieve-then-answer" pipelines often suffer from
cascading errors because the training objective of QA fails to optimize the
retrieval stage. To address this issue, we propose a novel method to
effectively introduce and reference retrieved information in QA. Given
the image set to be retrieved, we employ a multimodal large language model
(visual perspective) and a large language model (textual perspective) to obtain
a multimodal hypothetical summary (MHyS) in question form and description form.
By combining visual and textual perspectives, MHyS captures image content more
specifically and replaces real images in retrieval, which eliminates the
modality gap by transforming the task into text-to-text retrieval and helps
improve retrieval. To better integrate retrieval with QA, we employ
contrastive learning to align queries (questions) with MHyS. Moreover, we
propose a coarse-to-fine strategy for calculating both sentence-level and
word-level similarity scores, to further enhance retrieval and filter out
irrelevant details. Our approach achieves a 3.7% absolute improvement over
state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP.
Comprehensive experiments and detailed ablation studies demonstrate the
superiority of our method.
Authors' comments: AAAI 2025
Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian et al.
Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70× more data from existing datasets. Moreover, since MegaPairs relies solely on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. At this stage, we have produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our dataset, trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.
Xueguang Ma, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin
Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. Leveraging large vision-language models (VLMs), VISA identifies the evidence and highlights the exact regions that support the generated answers with bounding boxes in the retrieved document screenshots. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain. Experimental results demonstrate the effectiveness of VISA for visual source attribution on documents in their original appearance and also highlight the challenges that remain. Code, data, and model checkpoints will be released.