Hossein A. Rahmani, Xi Wang, Emine Yilmaz, Nick Craswell, Bhaskar Mitra, Paul Thomas
Large-scale test collections play a crucial role in Information Retrieval
(IR) research. However, according to the Cranfield paradigm and research into
publicly available datasets, existing information retrieval studies are
commonly developed on small-scale datasets that rely on human assessors for
relevance judgments, a time-intensive and expensive process.
Recent studies have shown the strong capability of Large Language Models (LLMs)
in producing reliable relevance judgments with human-level accuracy but at a
greatly reduced cost. In this paper, to address the lack of a large-scale
ad-hoc document retrieval dataset, we extend the TREC Deep Learning Track (DL)
test collection with additional synthetic labels generated by a language model,
enabling researchers to test and evaluate their search systems at a large
scale. Specifically, the resulting test
collection includes more than 1,900 test queries from the previous years of
tracks. We compare system evaluation against human labels from past years and
find that our synthetically created large-scale test collection can lead to
highly correlated system rankings.
Authors' comments: 9 pages, resource paper, WWW 2025
Rohan Jha, Bo Wang, Michael Günther, Georgios Mastrapas, Saba Sturua, Isabelle Mohr, Andreas Koukounas, Mohammad Kalim Akram et al.
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context windows and multilingual retrieval. Leveraging Matryoshka Representation Loss, we further demonstrate that reducing the embedding dimensionality from 128 to 64 has an insignificant impact on the model's retrieval performance and cuts storage requirements by up to 50%. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks,
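The late interaction scoring described above can be illustrated with a minimal sketch. This is not the Jina-ColBERT-v2 implementation: the function names, the toy vectors, and the re-normalization after truncation are assumptions made for illustration. The sketch shows ColBERT-style MaxSim scoring, where each query token vector is matched to its most similar document token vector, and how keeping only the leading dimensions of a Matryoshka-trained embedding (for example 64 of 128) halves the per-token storage.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def truncate(vec: Vector, dim: int) -> List[float]:
    """Keep the first `dim` components and re-normalize to unit length.

    Re-normalizing is an assumption of this sketch; the abstract does not
    say how truncated embeddings are post-processed.
    """
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def maxsim_score(
    query_vecs: Sequence[Vector],
    doc_vecs: Sequence[Vector],
    dim: Optional[int] = None,
) -> float:
    """Late interaction score: sum over query tokens of the best match
    among the document tokens. If `dim` is given, all vectors are first
    truncated to their leading `dim` dimensions."""
    if not doc_vecs:
        raise ValueError("document must contain at least one token vector")
    if dim is not None:
        query_vecs = [truncate(q, dim) for q in query_vecs]
        doc_vecs = [truncate(d, dim) for d in doc_vecs]
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 4-dimensional token embeddings (hypothetical values).
query = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
doc_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
doc_b = [[0.0, 0.0, 0.0, 1.0]]

full = maxsim_score(query, doc_a)           # all 4 dimensions
halved = maxsim_score(query, doc_a, dim=2)  # leading 2 dimensions only
```

Because the score is a sum of per-token maxima, document token vectors can be indexed independently of the query, which is what keeps inference cost closer to a bi-encoder than to a cross-encoder.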
Weijie Liu, Zecheng Tang, Juntao Li, Kehai Chen, Min Zhang
Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention mechanisms and the growing memory consumption of the key-value cache during generation. This work introduces MemLong: Memory-Augmented Retrieval for Long Text Generation, a method designed to enhance the capabilities of long-context language modeling by utilizing an external retriever for historical information retrieval. MemLong combines a non-differentiable "ret-mem" module with a partially trainable decoder-only language model and introduces a fine-grained, controllable retrieval attention mechanism that leverages semantic-level relevant chunks. Comprehensive evaluations on multiple long-context language modeling benchmarks demonstrate that MemLong consistently outperforms other state-of-the-art LLMs. More importantly, MemLong can extend the context length on a single 3090 GPU from 4k up to 80k. Our code is available at https://github.com/Bui1dMySea/MemLong
Li-Heng Lin, Yuchen Cui, Amber Xie, Tianyu Hua, Dorsa Sadigh
Few-shot imitation learning relies on only a small amount of task-specific demonstrations to efficiently adapt a policy for a given downstream task. Retrieval-based methods come with a promise of retrieving relevant past experiences to augment this target data when learning policies. However, existing data retrieval methods fall under two extremes: they either rely on the existence of exact behaviors with visually similar scenes in the prior data, which is impractical to assume; or they retrieve based on semantic similarity of high-level language descriptions of the task, which might not be that informative about the shared low-level behaviors or motions across tasks that are often a more important factor for retrieving relevant data for policy learning. In this work, we investigate how we can leverage motion similarity in the vast amount of cross-task data to improve few-shot imitation learning of the target task. Our key insight is that motion-similar data carries rich information about the effects of actions and object interactions that can be leveraged during few-shot adaptation. We propose FlowRetrieval, an approach that leverages optical flow representations for both extracting similar motions to target tasks from prior data, and for guiding learning of a policy that can maximally benefit from such data. Our results show FlowRetrieval significantly outperforms prior methods across simulated and real-world domains, achieving on average 27% higher success rate than the best retrieval-based prior method. In the Pen-in-Cup task with a real Franka Emika robot, FlowRetrieval achieves 3.7x the performance of the baseline imitation learning technique that learns from all prior and target data. Website: https://flow-retrieval.github.io
Yuying Zhang, Wenyan Yang, Guhan Sivasubramanian, Joni Pajarinen
Imitation learning (IL) algorithms typically distil experience into parametric behavior policies to mimic expert demonstrations. With limited experience, previous methods often struggle and cannot accurately align the current state with expert demonstrations, particularly in tasks that are characterised by partial observations or dynamic object deformations. We consider imitation learning in deformable mobile manipulation with an ego-centric limited field of view and introduce a novel IL approach called DeMoBot that directly retrieves observations from demonstrations. DeMoBot utilizes vision foundation models to identify relevant expert data based on visual similarity and matches the current trajectory with demonstrated trajectories using trajectory similarity and forward reachability constraints to select suitable sub-goals. A goal-conditioned motion generation policy then guides the robot to the sub-goal until the task is completed. We evaluate DeMoBot using a Spot robot in several simulated and real-world settings, demonstrating its effectiveness and generalizability. DeMoBot outperforms baselines with only 20 demonstrations, attaining high success rates in gap covering (85% simulation, 80% real-world) and table uncovering (87.5% simulation, 70% real-world), while showing promise in complex tasks like curtain opening (47.5% simulation, 35% real-world). Additional details are available at: https://sites.google.com/view/demobot-fewshot/home
Soumya Basu, Ankit Singh Rawat, Manzil Zaheer
Modern ML systems increasingly augment input instances with additional relevant information to enhance final prediction. Despite growing interest in such retrieval-augmented models, their fundamental properties and training are not well understood. We propose a statistical framework to study such models with two components: 1) a retriever to identify the relevant information out of a large corpus via a data-dependent metric; and 2) a predictor that consumes the input instances along with the retrieved information to make the final predictions. We present a principled method for end-to-end training of both components and draw connections with various training approaches in the literature. Furthermore, we establish excess risk bounds for retrieval-augmented models while delineating the contributions of both retriever and predictor towards the model performance. We validate the utility of our proposed training methods along with the key takeaways from our statistical analysis on an open-domain question answering task, where retrieval augmentation is important.
Hrishikesh Kulkarni, Nazli Goharian, Ophir Frieder, Sean MacAvaney
Sparse retrieval methods like BM25 are based on lexical overlap, focusing on
the surface form of the terms that appear in the query and the document. The
use of inverted indices in these methods leads to high retrieval efficiency. On
the other hand, dense retrieval methods are based on learned dense vectors and,
consequently, are effective but comparatively slow. Since sparse and dense
methods approach problems differently and use complementary relevance signals,
approximation methods were proposed to balance effectiveness and efficiency.
For efficiency, approximation methods like HNSW are frequently used to
approximate exhaustive dense retrieval. However, approximation techniques still
exhibit considerably higher latency than sparse approaches. We propose LexBoost,
which first builds a network of dense neighbors (a corpus graph) using a dense
retrieval approach while indexing. Then, during retrieval, we consider both a
document's lexical relevance scores and its neighbors' scores to rank the
documents. In LexBoost, this remarkably simple application of the Cluster
Hypothesis yields stronger ranking effectiveness while adding
little computational overhead (since the corpus graph is constructed offline).
The method is robust across the number of neighbors considered, various fusion
parameters for determining the scores, and different dataset construction
methods. We also show that re-ranking on top of LexBoost outperforms
traditional dense re-ranking and leads to results comparable with
higher-latency exhaustive dense retrieval.
Authors' comments: ACM DocEng 2024
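The idea of ranking by a document's lexical score together with its neighbors' scores can be sketched as follows. The abstract does not give the exact fusion function, so the linear interpolation with weight `lam`, the use of the mean neighbor score, the zero score for neighbors the lexical retriever did not return, and all names here are assumptions made for illustration only.

```python
from typing import Dict, List


def boost_scores(
    lexical: Dict[str, float],
    corpus_graph: Dict[str, List[str]],
    lam: float = 0.7,
) -> Dict[str, float]:
    """Fuse each document's lexical score with its neighbors' lexical scores.

    lexical:      doc id -> lexical (e.g. BM25) score for the current query
    corpus_graph: doc id -> ids of its dense nearest neighbors, built offline
    lam:          weight on the document's own score (hypothetical parameter)
    """
    boosted = {}
    for doc_id, own_score in lexical.items():
        neighbors = corpus_graph.get(doc_id, [])
        if neighbors:
            # Neighbors missing from the lexical results count as 0.0 here.
            neighbor_score = sum(lexical.get(n, 0.0) for n in neighbors) / len(neighbors)
        else:
            # No neighbors known: fall back to the document's own score.
            neighbor_score = own_score
        boosted[doc_id] = lam * own_score + (1.0 - lam) * neighbor_score
    return boosted


def rank(scores: Dict[str, float]) -> List[str]:
    """Document ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy example with hypothetical scores and a hypothetical corpus graph.
lexical_scores = {"d1": 2.0, "d2": 1.0, "d3": 4.0}
graph = {"d1": ["d3"], "d2": ["d1"], "d3": []}
ranking = rank(boost_scores(lexical_scores, graph, lam=0.5))
```

Since the graph lookup is a dictionary access and the neighbor scores are already available from the lexical pass, the extra work at query time is small, which matches the abstract's point that the dense computation happens offline.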
Ying Zhu, Shengchang Li, Ziqian Kong, Qiang Yang, Peilan Xu
Trustworthiness reasoning aims to enable agents in multiplayer games with incomplete information to identify potential allies and adversaries, thereby enhancing decision-making. In this paper, we introduce the graph retrieval-augmented trustworthiness reasoning (GRATR) framework, which retrieves observable evidence from the game environment to inform decision-making by large language models (LLMs) without requiring additional training, making it a zero-shot approach. Within the GRATR framework, agents first observe the actions of other players and evaluate the resulting shifts in inter-player trust, constructing a corresponding trustworthiness graph. During decision-making, the agent performs multi-hop retrieval to evaluate trustworthiness toward a specific target, where evidence chains are retrieved from multiple trusted sources to form a comprehensive assessment. Experiments in the multiplayer game Werewolf demonstrate that GRATR outperforms the alternatives, improving reasoning accuracy by 50.5% and reducing hallucination by 30.6% compared to the baseline method. Additionally, when tested on a dataset of Twitter tweets during the U.S. election period, GRATR surpasses the baseline method by 10.4% in accuracy, highlighting its potential in real-world applications such as intent analysis.
Paul Primus, Florian Schmid, Gerhard Widmer
Dual-encoder-based audio retrieval systems are commonly optimized with
contrastive learning on a set of matching and mismatching audio-caption pairs.
This leads to a shared embedding space in which corresponding items from the
two modalities end up close together. Since audio-caption datasets typically
only contain matching pairs of recordings and descriptions, it has become
common practice to create mismatching pairs by pairing the audio with a caption
randomly drawn from the dataset. This is not ideal because the randomly sampled
caption could, just by chance, partly or entirely describe the audio recording.
However, correspondence information for all possible pairs is costly to
annotate and thus typically unavailable; we, therefore, suggest substituting it
with estimated correspondences. To this end, we propose a two-stage training
procedure in which multiple retrieval models are first trained as usual, i.e.,
without estimated correspondences. In the second stage, the audio-caption
correspondences predicted by these models then serve as prediction targets. We
evaluate our method on the ClothoV2 and AudioCaps benchmarks and show that
it improves retrieval performance, even in a restrictive self-distillation
setting where a single model generates and then learns from the estimated
correspondences. We further show that our method outperforms the current state
of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.
Authors' comments: In Proceedings of the 9th Workshop on Detection and Classification of
Acoustic Scenes and Events, DCASE, Tokyo, Japan, 2024. Implementation
available on GitHub: https://github.com/OptimusPrimus/salsa
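The two-stage idea above, where correspondences predicted by first-stage models become prediction targets, can be sketched roughly as below. The details are assumptions, not the authors' implementation: first-stage similarity matrices are turned into row-wise softmax distributions and averaged over models, and the second-stage model is trained with a cross-entropy against those soft targets. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows: audio clips, columns: captions


def log_softmax(row: List[float]) -> List[float]:
    """Numerically stable log-softmax over one row of similarities."""
    peak = max(row)
    log_total = math.log(sum(math.exp(v - peak) for v in row))
    return [v - peak - log_total for v in row]


def estimate_targets(teacher_sims: List[Matrix]) -> Matrix:
    """Average the softmax-normalized similarities of the first-stage models.

    Each row of the result is a distribution over captions for one audio
    clip, so captions that partly describe the clip get non-zero mass
    instead of being treated as pure mismatches.
    """
    n_models = len(teacher_sims)
    targets = []
    for i in range(len(teacher_sims[0])):
        per_model = [[math.exp(lp) for lp in log_softmax(sims[i])] for sims in teacher_sims]
        targets.append([sum(col) / n_models for col in zip(*per_model)])
    return targets


def soft_target_loss(student_sims: Matrix, targets: Matrix) -> float:
    """Mean cross-entropy between soft targets and student predictions."""
    total = 0.0
    for sims_row, target_row in zip(student_sims, targets):
        log_probs = log_softmax(sims_row)
        total -= sum(t * lp for t, lp in zip(target_row, log_probs))
    return total / len(student_sims)
```

In the self-distillation setting mentioned in the abstract, `teacher_sims` would hold the similarity matrix of a single first-stage model rather than several.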
Yili Li, Jing Yu, Keke Gai, Bang Liu, Gang Xiong, Qi Wu
Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This method considers the matching between each candidate video and the query, but it incurs a significant time cost that grows notably with the number of candidates. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers and retrieving candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines with only 30%-50% of the original retrieval time to achieve better retrieval performance on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at https://github.com/Lilidamowang/T2VIndexer-generativeSearch.
Ze Liu, Jin Zhang, Chao Feng, Defu Lian, Jie Wang, Enhong Chen
With the development of deep learning techniques, deep recommendation models have achieved remarkable improvements in recommendation accuracy. However, due to the large number of candidate items in practice and the high cost of preference computation, these methods also suffer from low recommendation efficiency. The recently proposed tree-based deep recommendation models alleviate the problem by directly learning tree structure and representations under the guidance of recommendation objectives. However, such models have shortcomings. The max-heap assumption in the hierarchical tree, in which the preference for a parent node should be the maximum of the preferences for its children, is difficult to satisfy in their binary classification objectives. To this end, we propose Tree-based Deep Retrieval (TDR for short) for efficient recommendation. In TDR, all the trees generated during the training process are retained to form a forest. When learning the node representation of each tree, we have to satisfy the max-heap assumption as much as possible and mimic beam search behavior over the tree in the training stage. TDR achieves this by treating the training task as multi-class classification over tree nodes at the same level. However, the number of tree nodes grows exponentially with levels, so we train the preference model with the sampled-softmax technique. The experiments are conducted on real-world datasets, validating the effectiveness of the proposed preference model learning method and tree learning method.
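Why the max-heap assumption matters for beam search over the tree can be seen in a small sketch. This is a generic level-wise beam search, not the TDR implementation: the tree, the preference scores, and the function name are hypothetical, and the sketch assumes all leaves (items) sit at the same depth.

```python
from typing import Callable, Dict, List


def beam_search(
    children: Dict[str, List[str]],
    preference: Callable[[str], float],
    root: str,
    beam_size: int,
) -> List[str]:
    """Level-wise beam search: at each level keep the `beam_size` nodes
    with the highest preference, then expand only their children."""
    beam = [root]
    while True:
        candidates = [child for node in beam for child in children.get(node, [])]
        if not candidates:
            return beam  # reached the leaf level
        beam = sorted(candidates, key=preference, reverse=True)[:beam_size]


# Hypothetical two-level tree over four items.
tree = {"root": ["a", "b"], "a": ["i1", "i2"], "b": ["i3", "i4"]}

# Scores that violate the max-heap assumption: node b scores 0.2 although
# its child i3 has the highest preference of all items (0.95).
violating = {"a": 0.9, "b": 0.2, "i1": 0.1, "i2": 0.8, "i3": 0.95, "i4": 0.3}

# Scores that satisfy it: each parent equals the maximum over its children.
consistent = dict(violating, a=0.8, b=0.95)

missed = beam_search(tree, violating.__getitem__, "root", beam_size=1)
found = beam_search(tree, consistent.__getitem__, "root", beam_size=1)
```

With the violating scores, a beam of size 1 prunes `b` at the first level and can never reach `i3`; with max-heap-consistent scores the same beam finds it. Training node preferences as a classification over nodes at the same level, as the abstract describes, targets exactly this per-level comparison.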
Ameya Godbole, Nicholas Monath, Seungyeon Kim, Ankit Singh Rawat, Andrew McCallum, Manzil Zaheer
In text generation, hallucinations refer to the generation of seemingly coherent text that contradicts established knowledge. One compelling hypothesis is that hallucinations occur when a language model is given a generation task outside its parametric knowledge (due to rarity, recency, domain, etc.). A common strategy to address this limitation is to infuse the language models with retrieval mechanisms, providing the model with relevant knowledge for the task. In this paper, we leverage the planning capabilities of instruction-tuned LLMs and analyze how planning can be used to guide retrieval to further reduce the frequency of hallucinations. We empirically evaluate several variations of our proposed approach on long-form text generation tasks. By improving the coverage of relevant facts, plan-guided retrieval and generation can produce more informative responses while providing a higher rate of attribution to source documents.
Meet Doshi, Vishwajeet Kumar, Rudra Murthy, Vignesh P, Jaydeep Sen
Learned Sparse Retrievers (LSR) have evolved into an effective retrieval strategy that can bridge the gap between traditional keyword-based sparse retrievers and embedding-based dense retrievers. At their core, learned sparse retrievers try to learn the most important semantic keyword expansions from a query and/or document, which can facilitate better retrieval with overlapping keyword expansions. LSR models like SPLADE have typically used encoder-only models with an MLM (masked language modeling) style objective in conjunction with known ways of improving retrieval performance, such as hard negative mining, distillation, etc. In this work, we propose to use a decoder-only model for learning semantic keyword expansion. We posit that decoder-only models, which have been trained on far larger amounts of data, are better equipped to learn the keyword expansions needed for improved retrieval. We use Mistral as the backbone to develop our Learned Sparse Retriever similar to SPLADE and train it on a subset of sentence-transformer data, which is often used for training text embedding models. Our experiments support the hypothesis that a sparse retrieval model based on a decoder-only large language model (LLM) surpasses the performance of existing LSR systems, including SPLADE and all its variants. The LLM-based model (Echo-Mistral-SPLADE) now stands as a state-of-the-art learned sparse retrieval model on the BEIR text retrieval benchmark.
Zhenyu Lu, Lakshay Sethi
Previous methods for audio-image matching generally fall into one of two categories: pipeline models or End-to-End models. Pipeline models first transcribe speech and then encode the resulting text; End-to-End models encode speech directly. Generally, pipeline models outperform end-to-end models, but the intermediate transcription necessarily discards some potentially useful non-textual information. In addition to textual information, speech can convey details such as accent, mood, and emphasis, which should be effectively captured in the encoded representation. In this paper, we investigate whether non-textual information, which is overlooked by pipeline-based models, can be leveraged to improve speech-image matching performance. We thoroughly analyze and compare End-to-End models, pipeline models, and our proposed dual-channel model for robust audio-image retrieval on a variety of datasets. Our approach achieves a substantial performance gain over the previous state-of-the-art by leveraging strong pretrained models, a prompting mechanism and a bifurcated design.
Matteo Attimonelli, Claudio Pomo, Dietmar Jannach, Tommaso Di Noia
The increasing demand for online fashion retail has boosted research in fashion compatibility modeling and item retrieval, focusing on matching user queries (textual descriptions or reference images) with compatible fashion items. A key challenge is top-bottom retrieval, where precise compatibility modeling is essential. Traditional methods, often based on Bayesian Personalized Ranking (BPR), have shown limited performance. Recent efforts have explored using generative models in compatibility modeling and item retrieval, where generated images serve as additional inputs. However, these approaches often overlook the quality of generated images, which could be crucial for model performance. Additionally, generative models typically require large datasets, posing challenges when such data is scarce. To address these issues, we introduce the Generative Compatibility Model (GeCo), a two-stage approach that improves fashion image retrieval through paired image-to-image translation. First, the Complementary Item Generation Model (CIGM), built on Conditional Generative Adversarial Networks (GANs), generates target item images (e.g., bottoms) from seed items (e.g., tops), offering conditioning signals for retrieval. These generated samples are then integrated into GeCo, enhancing compatibility modeling and retrieval accuracy. Evaluations on three datasets show that GeCo outperforms state-of-the-art baselines. Key contributions include: (i) the GeCo model utilizing paired image-to-image translation within the Composed Image Retrieval framework, (ii) comprehensive evaluations on benchmark datasets, and (iii) the release of a new Fashion Taobao dataset designed for top-bottom retrieval, promoting further research.
Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar, Jaydeep Sen
Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, there is a lack of a comprehensive benchmark for evaluating retrieval models in Hindi. To address this gap, we introduce the Hindi version of the BEIR benchmark, which includes a subset of English BEIR datasets translated to Hindi, existing Hindi retrieval datasets, and synthetically created datasets for retrieval. The benchmark comprises 15 datasets spanning 8 distinct tasks. We evaluate state-of-the-art multilingual retrieval models on this benchmark to identify task and domain-specific challenges and their impact on retrieval performance. By releasing this benchmark and a set of relevant baselines, we enable researchers to understand the limitations and capabilities of current Hindi retrieval models, promoting advancements in this critical area. The datasets from Hindi-BEIR are publicly available.
Chidaksh Ravuru, Sagar Srinivas Sakhinana, Venkataramana Runkana
Time series modeling is crucial for many applications; however, it faces
challenges such as complex spatio-temporal dependencies and distribution shifts
in learning from historical context to predict task-specific outcomes. To
address these challenges, we propose a novel approach using an agentic
Retrieval-Augmented Generation (RAG) framework for time series analysis. The
framework leverages a hierarchical, multi-agent architecture where the master
agent orchestrates specialized sub-agents and delegates the end-user request to
the relevant sub-agent. The sub-agents utilize smaller, pre-trained language
models (SLMs) customized for specific time series tasks through fine-tuning
using instruction tuning and direct preference optimization, and retrieve
relevant prompts from a shared repository of prompt pools containing distilled
knowledge about historical patterns and trends to improve predictions on new
data. Our proposed modular, multi-agent RAG approach offers flexibility and
achieves state-of-the-art performance across major time series tasks by
tackling complex challenges more effectively than task-specific customized
methods across benchmark datasets.
Authors' comments: Paper was accepted for Undergraduate Consortium at ACM KDD, 2024.
Please find the link: https://kdd2024.kdd.org/undergraduate-consortium/
Laurent Mombaerts, Terry Ding, Adi Banerjee, Florian Felice, Jonathan Taws, Tarik Borogovac
Retrieval Augmented Generation (RAG) is a technique used to augment Large
Language Models (LLMs) with contextually relevant, time-critical, or
domain-specific information without altering the underlying model parameters.
However, constructing RAG systems that can effectively synthesize information
from a large and diverse set of documents remains a significant challenge. We
introduce a novel data-centric RAG workflow for LLMs, transforming the
traditional retrieve-then-read system into a more advanced
prepare-then-rewrite-then-retrieve-then-read framework, to achieve higher
domain expert-level understanding of the knowledge base. Our methodology relies
on generating metadata and synthetic Questions and Answers (QA) for each
document, as well as introducing the new concept of Meta Knowledge Summary (MK
Summary) for metadata-based clusters of documents. The proposed innovations
enable personalized user-query augmentation and in-depth information retrieval
across the knowledge base. Our research makes two significant contributions:
using LLMs as evaluators and employing new comparative performance metrics, we
demonstrate that (1) using augmented queries with synthetic question matching
significantly outperforms traditional RAG pipelines that rely on document
chunking (p < 0.01), and (2) meta knowledge-augmented queries additionally
significantly improve retrieval precision and recall, as well as the final
answers' breadth, depth, relevancy, and specificity. Our methodology is
cost-effective, costing less than $20 per 2000 research papers using Claude 3
Haiku, and can be adapted with any fine-tuning of either the language or
embedding models to further enhance the performance of end-to-end RAG
pipelines.
Authors' comments: Accepted in Workshop on Generative AI for Recommender Systems and
Personalization, KDD 2024
Lin Zhao, Xiao Chen, Eric Z. Chen, Yikang Liu, Terrence Chen, Shanhui Sun
Medical image segmentation is crucial for clinical decision-making, but the scarcity of annotated data presents significant challenges. Few-shot segmentation (FSS) methods show promise but often require retraining on the target domain and struggle to generalize across different modalities. Similarly, adapting foundation models like the Segment Anything Model (SAM) for medical imaging has limitations, including the need for finetuning and domain-specific adaptation. To address these issues, we propose a novel method that adapts DINOv2 and Segment Anything Model 2 (SAM 2) for retrieval-augmented few-shot medical image segmentation. Our approach uses DINOv2's features as queries to retrieve similar samples from limited annotated data, which are then encoded as memories and stored in a memory bank. With the memory attention mechanism of SAM 2, the model leverages these memories as conditions to generate accurate segmentation of the target image. We evaluated our framework on three medical image segmentation tasks, demonstrating superior performance and generalizability across various modalities without the need for any retraining or finetuning. Overall, this method offers a practical and effective solution for few-shot medical image segmentation and holds significant potential as a valuable annotation tool in clinical applications.
Tianyu Ding, Adi Banerjee, Laurent Mombaerts, Yunhong Li, Tarik Borogovac, Juan Pablo De la Cruz Weinstein
The increasing use of Retrieval-Augmented Generation (RAG) systems in various
applications necessitates stringent protocols to ensure their accuracy,
safety, and alignment with user intentions. In this paper, we introduce VERA
(Validation and Evaluation of Retrieval-Augmented Systems), a framework
designed to enhance the transparency and reliability of outputs from large
language models (LLMs) that utilize retrieved information. VERA improves the
way we evaluate RAG systems in two important ways: (1) it introduces a
cross-encoder based mechanism that consolidates a set of multidimensional
metrics into a single comprehensive ranking score, addressing the challenge of
prioritizing individual metrics, and (2) it employs Bootstrap statistics on
LLM-based metrics across the document repository to establish confidence
bounds, ensuring the repository's topical coverage and improving the overall
reliability of retrieval systems. Through several use cases, we demonstrate how
VERA can strengthen decision-making processes and trust in AI applications. Our
findings not only contribute to the theoretical understanding of LLM-based RAG
evaluation metrics but also promote the practical implementation of responsible
AI systems, marking a significant advancement in the development of reliable
and transparent generative AI technologies.
Authors' comments: Accepted in Workshop on Evaluation and Trustworthiness of Generative
AI Models, KDD 2024
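The use of bootstrap statistics to put confidence bounds on an LLM-based metric can be sketched as follows. The abstract does not describe VERA's exact procedure, so this is only a generic percentile bootstrap of the mean score; the function name, the default parameters, and the example scores are assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    scores: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence bounds for the mean of `scores`.

    scores:      one metric value per evaluated document or query
    n_resamples: number of resamples drawn with replacement
    alpha:       1 - confidence level (0.05 gives 95% bounds)
    seed:        fixed seed so the bounds are reproducible
    """
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)
    n = len(scores)
    # Mean of each resample, sorted so percentiles can be read off by index.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-document metric values produced by an LLM judge.
example_scores = [0.2, 0.4, 0.6, 0.8, 1.0]
lower, upper = bootstrap_mean_ci(example_scores)
```

A wide interval for a given topic suggests the repository holds too few or too inconsistent documents on it for the metric to be trusted, which is the kind of coverage signal the abstract refers to.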