Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu
In this paper, we propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks that search for short audio clips from an audio database, AMR aims to predict relevant moments in untrimmed long audio based on a text query. Given the lack of prior work in AMR, we first build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks. This model captures temporal dependencies within audio features, inspired by similar video moment retrieval tasks, thus surpassing conventional clip-level audio retrieval methods. Additionally, we provide manually annotated datasets to properly measure the effectiveness and robustness of our methods on real data. Experimental results show that AM-DETR, trained with Clotho-Moment, outperforms a baseline model that applies a clip-level audio retrieval method with a sliding window on all metrics, particularly improving Recall1@0.7 by 9.00 points. Our datasets and code are publicly available at https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval.
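The Recall1@0.7 figure above is commonly read as the fraction of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal intersection-over-union (IoU) of at least 0.7. The sketch below illustrates that reading; the function names and the (start, end) tuple layout are assumptions for illustration, not taken from the authors' code.

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def recall1_at_iou(top1_preds, golds, threshold=0.7):
    """Share of queries whose top-1 moment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, golds))
    return hits / len(golds)
```

For example, a prediction of (0, 10) against a ground truth of (8, 20) shares 2 seconds out of a 20-second union, so its IoU is 0.1 and it does not count as a hit at the 0.7 threshold.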
Jean-Jacques Godeme, Jalal Fadili
In this paper, we study the phase retrieval problem in the situation where the vector to be recovered has an a priori structure that can be encoded into a regularization term. This regularizer is intended to promote solutions conforming to some notion of simplicity or low complexity. We investigate both noiseless recovery and stability to noise and provide a very general and unified analysis framework that goes far beyond the sparse phase retrieval mostly considered in the literature. In the noiseless case we provide sufficient conditions under which exact recovery, up to global sign change, is possible. For Gaussian measurement maps, we also provide a sample complexity bound for exact recovery. This bound depends on the Gaussian width of the descent cone at the sought-after vector, which is a geometric measure of the complexity of the latter. In the noisy case, we consider both the constrained (Morozov) and penalized (Tikhonov) formulations. We provide sufficient conditions for stable recovery and prove linear convergence for sufficiently small noise. For Gaussian measurements, we again give a sample complexity bound for linear convergence to hold with high probability. This bound scales linearly in the intrinsic dimension of the sought-after vector but only logarithmically in the ambient dimension.
Krish Goel, Mahek Chandak
Large Language Models (LLMs) excel in natural language tasks but face limitations due to static training datasets, resulting in outdated or contextually shallow responses. Retrieval-Augmented Generation (RAG) addresses this by integrating real-time external knowledge, enhancing model accuracy and credibility, especially for knowledge-intensive tasks. However, RAG-enhanced LLMs struggle with long contexts, causing them to "choke" on information overload, compromising response quality. Recent RAG applications use hierarchical data structures for storing documents, organized at various levels of summarization and information density. In this context, we introduce HIRO (Hierarchical Information Retrieval Optimization), a novel querying approach for RAG applications using hierarchical structures for storing documents. HIRO employs DFS-based recursive similarity score calculation and branch pruning to minimize the context returned to the LLM without informational loss. HIRO outperforms existing querying mechanisms on the NarrativeQA dataset by an absolute performance gain of 10.85%.
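The DFS-based similarity scoring and branch pruning described above can be pictured with a small sketch. It is a simplified illustration, not the authors' implementation: the `Node` layout, the cosine scoring, and the two thresholds (`select` for pruning a branch, `delta` for deciding whether a child adds enough over its parent summary) are assumptions made for this example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str                      # summary or chunk stored at this tree node
    embedding: list                # vector representation of `text`
    children: list = field(default_factory=list)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(node, query, select=0.5, delta=0.05):
    """Depth-first search that prunes weak branches and stops descending
    when no child is clearly more similar to the query than its parent."""
    score = cosine(node.embedding, query)
    if score < select:
        return []                  # prune the whole branch
    better = [c for c in node.children
              if cosine(c.embedding, query) >= score + delta]
    if not better:
        return [node.text]         # the summary at this level is returned
    results = []
    for child in better:
        results.extend(retrieve(child, query, select, delta))
    return results
```

The intent is that a broad query is answered with a few high-level summaries, while a specific query descends only into the branches that improve on their parent, which keeps the context passed to the LLM small.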
Simon Akesson, Frances A. Santos
Providing external knowledge to Large Language Models (LLMs) is key to using these models in real-world applications for several reasons, such as incorporating up-to-date content in real time, providing access to domain-specific knowledge, and helping to prevent hallucinations. The vector database-based Retrieval Augmented Generation (RAG) approach has been widely adopted to this end. Thus, any part of external knowledge can be retrieved and provided to an LLM as the input context. Despite the RAG approach's success, it can still be infeasible for some applications, because the retrieved context can demand a longer context window than the LLM supports. Even when the retrieved context fits into the context window, the number of tokens might be substantial and, consequently, impact costs and processing time, becoming impractical for most applications. To address these issues, we propose CRAG, a novel approach able to effectively reduce the number of prompting tokens without degrading the quality of the response generated compared to a solution using RAG. Through our experiments, we show that CRAG can reduce the number of tokens by at least 46%, achieving more than 90% in some cases, compared to RAG. Moreover, the number of tokens with CRAG does not increase considerably when the number of reviews analyzed is higher, unlike RAG, where the number of tokens is almost 9x higher when there are 75 reviews compared to 4 reviews.
Dian Jiao, Li Cai, Jingsheng Huang, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
Retrieval-Augmented Generation (RAG) methods augment the input of Large
Language Models (LLMs) with relevant retrieved passages, reducing factual
errors in knowledge-intensive tasks. However, contemporary RAG approaches
suffer from irrelevant knowledge retrieval issues in complex domain questions
(e.g., HotPot QA) due to the lack of corresponding domain knowledge, leading to
low-quality generations. To address this issue, we propose a novel
Collaborative Retrieval-Augmented Generation framework, DuetRAG. Our
bootstrapping philosophy is to simultaneously integrate the domain fine-tuning
and RAG models to improve the knowledge retrieval quality, thereby enhancing
generation quality. Finally, we demonstrate that DuetRAG matches expert
human researchers on HotPot QA.
Authors' comments: 5 pages
Eugene Yang, Dawn Lawrie, James Mayfield
Recent work in cross-language information retrieval (CLIR), where queries and
documents are in different languages, has shown the benefit of the
Translate-Distill framework that trains a cross-language neural dual-encoder
model using translation and distillation. However, Translate-Distill only
supports a single document language. Multilingual information retrieval (MLIR),
which ranks a multilingual document collection, is harder to train than CLIR
because the model must assign comparable relevance scores to documents in
different languages. This work extends Translate-Distill and proposes
Multilingual Translate-Distill (MTD) for MLIR. We show that ColBERT-X models
trained with MTD outperform their counterparts trained with Multilingual
Translate-Train, which is the previous state-of-the-art training approach, by
5% to 25% in nDCG@20 and 15% to 45% in MAP. We also show that the model is
robust to the way languages are mixed in training batches. Our implementation
is available on GitHub.
Authors' comments: 6 pages, 1 figure, accepted at SIGIR 2024 as short paper
Wenhao Zhang, Mengqi Zhang, Shiguang Wu, Jiahuan Pei, Zhaochun Ren, Maarten de Rijke, Zhumin Chen, Pengjie Ren
Exclusion is an important and universal linguistic skill that humans use to express what they do not want. However, in the information retrieval community, there is little research on exclusionary retrieval, where users express what they do not want in their queries. In this work, we investigate the scenario of exclusionary retrieval in document retrieval for the first time. We present ExcluIR, a set of resources for exclusionary retrieval, consisting of an evaluation benchmark and a training set for helping retrieval models to comprehend exclusionary queries. The evaluation benchmark includes 3,452 high-quality exclusionary queries, each of which has been manually annotated. The training set contains 70,293 exclusionary queries, each paired with a positive document and a negative document. We conduct detailed experiments and analyses, obtaining three main observations: (1) Existing retrieval models with different architectures struggle to effectively comprehend exclusionary queries; (2) Although integrating our training data can improve the performance of retrieval models on exclusionary retrieval, there still exists a gap compared to human performance; (3) Generative retrieval models have a natural advantage in handling exclusionary queries. To facilitate future research on exclusionary retrieval, we share the benchmark and evaluation scripts at https://github.com/zwh-sdu/ExcluIR.
Thibault Formal, Stéphane Clinchant, Hervé Déjean, Carlos Lassance
The late interaction paradigm introduced with ColBERT stands out in the
neural Information Retrieval space, offering a compelling
effectiveness-efficiency trade-off across many benchmarks. Efficient late
interaction retrieval is based on an optimized multi-step strategy, where an
approximate search first identifies a set of candidate documents to re-rank
exactly. In this work, we introduce SPLATE, a simple and lightweight adaptation
of the ColBERTv2 model which learns an "MLM adapter", mapping its frozen
token embeddings to a sparse vocabulary space with a partially learned SPLADE
module. This allows us to perform the candidate generation step in late
interaction pipelines with traditional sparse retrieval techniques, making it
particularly appealing for running ColBERT in CPU environments. Our SPLATE
ColBERTv2 pipeline achieves the same effectiveness as the PLAID ColBERTv2
engine by re-ranking 50 documents that can be retrieved under 10ms.
Authors' comments: To appear at SIGIR'24 (short paper track)
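The two-step strategy described above (candidate generation followed by exact re-ranking) relies on ColBERT's late-interaction score, often called MaxSim. The sketch below shows only the re-ranking half in plain Python: candidates are taken as given, standing in for whatever first-stage retriever produced them. The function names, the `(doc_id, token_vectors)` layout, and the default of 50 candidates are illustrative assumptions.

```python
def maxsim(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token take its best
    matching document token (dot product), then sum over query tokens."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def rerank(query_vecs, candidates, k=50):
    """Re-rank the first k (doc_id, doc_vecs) pairs from a first-stage retriever."""
    scored = [(doc_id, maxsim(query_vecs, vecs)) for doc_id, vecs in candidates[:k]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because MaxSim is only computed for the short candidate list, the cost of exact scoring is bounded by `k` rather than by the collection size.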
Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang
With recent advances in speech synthesis including text-to-speech (TTS) and
voice conversion (VC) systems enabling the generation of ultra-realistic audio
deepfakes, there is growing concern about their potential misuse. However, most
deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a
single model, resulting in performance bottlenecks and transparency issues.
Inspired by retrieval-augmented generation (RAG), we propose a
retrieval-augmented detection (RAD) framework that augments test samples with
similar retrieved samples for enhanced detection. We also extend the
multi-fusion attentive classifier to integrate it with our proposed RAD
framework. Extensive experiments show the superior performance of the proposed
RAD framework over baseline methods, achieving state-of-the-art results on the
ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets.
Further sample analysis indicates that the retriever consistently retrieves
samples mostly from the same speaker with acoustic characteristics highly
consistent with the query audio, thereby improving detection performance.
Authors' comments: Accepted by the 2024 International Conference on Multimedia Retrieval
(ICMR 2024)
Dahlia Shehata
Despite the advantages of their low-resource settings, traditional sparse retrievers depend on exact matching approaches between high-dimensional bag-of-words (BoW) representations of both the queries and the collection. As a result, retrieval performance is restricted by semantic discrepancies and vocabulary gaps. On the other hand, transformer-based dense retrievers introduce significant improvements in information retrieval tasks by exploiting low-dimensional contextualized representations of the corpus. While dense retrievers are known for their relative effectiveness, they suffer from lower efficiency and weaker generalization than sparse retrievers. For a lightweight retrieval task, high computational resources and time consumption are major barriers that encourage abandoning dense models despite potential gains. In this work, I propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities in two formats for the entity names: 1) explicit and 2) hashed. A zero-shot end-to-end dense entity linking system is employed for entity recognition and disambiguation to augment the corpus. By leveraging advanced entity linking methods, I believe that the effectiveness gap between sparse and dense retrievers can be narrowed. Experiments are conducted on the MS MARCO passage dataset using the original qrel set, the re-ranked qrels favoured by MonoT5 and the latter set further re-ranked by DuoT5. Since I am concerned with the early stage retrieval in cascaded ranking architectures of large information retrieval systems, the results are evaluated using recall@1000. The suggested approach is also capable of retrieving documents for query subsets judged to be particularly difficult in prior work.
Yan Fang, Jingtao Zhan, Qingyao Ai, Jiaxin Mao, Weihang Su, Jia Chen, Yiqun Liu
Scaling up neural models has yielded significant advancements in a wide array
of tasks, particularly in language generation. Previous studies have found that
the performance of neural models frequently adheres to predictable scaling
laws, correlated with factors such as training set size and model size. This
insight is invaluable, especially as large-scale experiments grow increasingly
resource-intensive. Yet, such scaling laws have not been fully explored in dense
retrieval due to the discrete nature of retrieval metrics and complex
relationships between training data and model sizes in retrieval tasks. In this
study, we investigate whether the performance of dense retrieval models follows
scaling laws as other neural models do. We propose to use contrastive
log-likelihood as the evaluation metric and conduct extensive experiments with
dense retrieval models implemented with different numbers of parameters and
trained with different amounts of annotated data. Results indicate that, under
our settings, the performance of dense retrieval models follows a precise
power-law scaling related to the model size and the number of annotations.
Additionally, we examine scaling with prevalent data augmentation methods to
assess the impact of annotation quality, and apply the scaling law to find the
best resource allocation strategy under a budget constraint. We believe that
these insights will significantly contribute to understanding the scaling
effect of dense retrieval models and offer meaningful guidance for future
research endeavors.
Authors' comments: Accepted at SIGIR 2024. V2 fixes a bug in the experiments
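A power-law relationship of the kind described above is typically checked by fitting a straight line in log-log space. The sketch below assumes the simplest form, loss = a * size^(-b), with no irreducible-loss term; the paper's exact functional form is not given in the abstract and may differ. The function name and the synthetic numbers in the usage note are illustrative only.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope               # (coefficient a, exponent b)
```

Feeding it synthetic points generated from a known law, for instance sizes of 1e6, 1e7 and 1e8 with losses computed as 5 * size ** -0.2, should recover a close to 5 and b close to 0.2. Once fitted, the same expression can be evaluated for unseen model or data sizes, which is the basis for the budget-allocation use mentioned in the abstract.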
Huayang Li, Deng Cai, Zhi Qu, Qu Cui, Hidetaka Kamigaito, Lemao Liu, Taro Watanabe
Phrase-level dense retrieval has shown many appealing characteristics in
downstream NLP tasks by leveraging the fine-grained information that phrases
offer. In our work, we propose a new task formulation of dense retrieval,
cross-lingual contextualized phrase retrieval, which aims to augment
cross-lingual applications by addressing polysemy using context information.
However, the lack of specific training data and models is the primary
challenge in achieving our goal. To address this, we extract pairs of cross-lingual
phrases using word alignment information automatically induced from parallel
sentences. Subsequently, we train our Cross-lingual Contextualized Phrase
Retriever (CCPR) using contrastive learning, which encourages the hidden
representations of phrases with similar contexts and semantics to align
closely. Comprehensive experiments on both the cross-lingual phrase retrieval
task and a downstream task, i.e., machine translation, demonstrate the
effectiveness of CCPR. On the phrase retrieval task, CCPR surpasses baselines
by a significant margin, achieving a top-1 accuracy that is at least 13 points
higher. When utilizing CCPR to augment the large-language-model-based
translator, it achieves average gains of 0.7 and 1.5 in BERTScore for
translations from X=>En and vice versa, respectively, on the WMT16 dataset. Our
code and data are available at https://github.com/ghrua/ccpr_release.
Authors' comments: Accepted to Findings of EMNLP 2024
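Contrastive training of the kind mentioned above is commonly implemented with an InfoNCE-style loss over in-batch negatives. The sketch below is a generic version under stated assumptions, not the CCPR training code: `sim_matrix[i][j]` is assumed to hold the similarity between source phrase i and target phrase j, aligned pairs sit on the diagonal, and the temperature value is a placeholder.

```python
import math


def info_nce(sim_matrix, temperature=0.05):
    """Mean InfoNCE loss where sim_matrix[i][j] is the similarity between
    source phrase i and target phrase j, and (i, i) are the aligned pairs."""
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        peak = max(logits)                      # stabilise the log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]           # negative log-softmax of the positive
    return total / len(sim_matrix)
```

The loss falls as each aligned pair becomes more similar than the other pairs in the batch, which is the sense in which representations of phrases with similar contexts and semantics are pulled together.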
Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen
Though pre-training vision-language models have demonstrated significant benefits in boosting video-text retrieval performance from large-scale web videos, fine-tuning still plays a critical role with manually annotated clips with start and end times, which requires considerable human effort. To address this issue, we explore an alternative cheaper source of annotations, single timestamps, for video-text retrieval. We initialise clips from timestamps in a heuristic way to warm up a retrieval model. Then a video clip editing method is proposed to refine the initial rough boundaries to improve retrieval performance. A student-teacher network is introduced for video clip editing. The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips. The teacher weights are updated from the student's after the student's performance increases. Our method is model agnostic and applicable to any retrieval model. We conduct experiments based on three state-of-the-art retrieval models, COOT, VideoCLIP and CLIP4Clip. Experiments conducted on three video retrieval datasets, YouCook2, DiDeMo and ActivityNet-Captions, show that our edited clips consistently improve retrieval performance over initial clips across all three retrieval models.
Fengran Mo, Chen Qu, Kelong Mao, Tianyu Zhu, Zhan Su, Kaiyu Huang, Jian-Yun Nie
Conversational search facilitates complex information retrieval by enabling multi-turn interactions between users and the system. Supporting such interactions requires a comprehensive understanding of the conversational inputs to formulate a good search query based on historical information. In particular, the search query should include the relevant information from the previous conversation turns. However, current approaches for conversational dense retrieval primarily rely on fine-tuning a pre-trained ad-hoc retriever using the whole conversational search session, which can be lengthy and noisy. Moreover, existing approaches are limited by the amount of manual supervision signals in the existing datasets. To address the aforementioned issues, we propose a History-Aware Conversational Dense Retrieval (HAConvDR) system, which incorporates two ideas: context-denoised query reformulation and automatic mining of supervision signals based on the actual impact of historical turns. Experiments on two public conversational search datasets demonstrate the improved history modeling capability of HAConvDR, in particular for long conversations with topic shifts.
Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references.
Weixuan Wang, Barry Haddow, Alexandra Birch
Knowledge represented in Large Language Models (LLMs) is quite often incorrect and can also become obsolete over time. Updating knowledge via fine-tuning is computationally resource-hungry and not reliable, and so knowledge editing (KE) has developed as an effective and economical alternative to inject new knowledge or to fix factual errors in LLMs. Although there has been considerable interest in this area, current KE research exclusively focuses on the monolingual setting, typically in English. However, what happens if the new knowledge is supplied in one language, but we would like to query the LLM in a different language? To address the problem of multilingual knowledge editing, we propose Retrieval-augmented Multilingual Knowledge Editor (ReMaKE) to update new knowledge in LLMs. ReMaKE can perform model-agnostic knowledge editing in multilingual settings. ReMaKE concatenates the new knowledge retrieved from a multilingual knowledge base with prompts. Our experimental results show that ReMaKE outperforms baseline knowledge editing methods by a significant margin and is the first KE method to work in a multilingual setting. We provide our multilingual knowledge editing dataset (MzsRE) in 12 languages, which, along with code and additional project information, is available at https://github.com/Vicky-Wil/ReMaKE.
Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm
designed to speed up language model generation. The key insight driving the
development of REST is the observation that the process of text generation
often includes certain common phases and patterns. Unlike previous methods that
rely on a draft language model for speculative decoding, REST harnesses the
power of retrieval to generate draft tokens. This method draws from the
reservoir of existing knowledge, retrieving and employing relevant tokens based
on the current context. Its plug-and-play nature allows for seamless
integration and acceleration of any language model, all without necessitating
additional training. When benchmarked on 7B and 13B language models in a
single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on
code or text generation. The code of REST is available at
https://github.com/FasterDecoding/REST.
Authors' comments: NAACL 2024, camera ready
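The idea of drafting tokens by retrieval rather than with a smaller draft model can be sketched in a few lines. This is a toy illustration under stated assumptions, not the REST implementation: the datastore is a flat token list searched by longest-suffix match, `next_token` is a hypothetical stand-in for a greedy language-model call, and verification is done one token at a time for clarity, whereas a real system would verify the draft in a single forward pass.

```python
def retrieve_draft(context, datastore, max_suffix=4, draft_len=3):
    """Find the longest suffix of `context` that occurs in `datastore`
    and return the tokens that followed it there as a draft."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-n:]
        for i in range(len(datastore) - n):
            if datastore[i:i + n] == suffix:
                return datastore[i + n:i + n + draft_len]
    return []


def speculative_step(context, datastore, next_token):
    """Accept the draft prefix the model agrees with, then add one model token.

    `next_token(tokens)` stands in for a greedy language-model call."""
    accepted = []
    for token in retrieve_draft(context, datastore):
        if next_token(context + accepted) != token:
            break                              # first disagreement ends acceptance
        accepted.append(token)
    return accepted + [next_token(context + accepted)]
```

Every step emits at least one token chosen by the model itself, so the output matches plain greedy decoding; the gain comes from the steps where several retrieved tokens are accepted at once.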
Anirudh Khatry, Yasharth Bajpai, Priyanshu Gupta, Sumit Gulwani, Ashish Tiwari
Information retrieval involves selecting artifacts from a corpus that are
most relevant to a given search query. The flavor of retrieval typically used
in classical applications can be termed as homogeneous and relaxed, where
queries and corpus elements are both natural language (NL) utterances
(homogeneous) and the goal is to pick the most relevant elements from the corpus in
the Top-K, where K is large, such as 10, 25, 50 or even 100 (relaxed).
Recently, retrieval is being used extensively in preparing prompts for large
language models (LLMs) to enable LLMs to perform targeted tasks. These new
applications of retrieval are often heterogeneous and strict -- the queries and
the corpus contain different kinds of entities, such as NL and code, and there
is a need for improving retrieval at Top-K for small values of K, such as K=1
or 3 or 5. Current dense retrieval techniques based on pretrained embeddings
provide a general-purpose and powerful approach for retrieval, but they are
oblivious to task-specific notions of similarity of heterogeneous artifacts. We
introduce Adapted Dense Retrieval, a mechanism to transform embeddings to
enable improved task-specific, heterogeneous and strict retrieval. Adapted
Dense Retrieval works by learning a low-rank residual adaptation of the
pretrained black-box embedding. We empirically validate our approach by showing
improvements over the state-of-the-art general-purpose embeddings-based
baseline.
Authors' comments: 14 pages
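A low-rank residual adaptation of a fixed embedding can be written as the original vector plus a correction that passes through a narrow bottleneck. The sketch below shows one plain linear form of this idea; the abstract does not specify the paper's parametrisation, so the matrices `down` and `up`, their shapes, and the function names are assumptions for illustration.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def adapt(embedding, down, up):
    """Return embedding + up @ (down @ embedding).

    `down` is r x d and `up` is d x r with r << d, so the correction is low rank."""
    residual = matvec(up, matvec(down, embedding))
    return [e + r for e, r in zip(embedding, residual)]
```

Only `down` and `up` would be trained on task-specific pairs; the pretrained embedding is treated as a black box and left untouched, and setting `up` to zeros recovers the original vector exactly.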
Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha
We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and
effective audio captioning system that generates captions conditioned on an
input audio and other captions similar to the audio retrieved from a datastore.
Additionally, our proposed method can transfer to any domain without the need
for any additional fine-tuning. To generate a caption for an audio sample, we
leverage an audio-text model CLAP to retrieve captions similar to it from a
replaceable datastore, which are then used to construct a prompt. Next, we feed
this prompt to a GPT-2 decoder and introduce cross-attention layers between the
CLAP encoder and GPT-2 to condition the audio for caption generation.
Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP
achieves competitive performance in in-domain settings and significant
improvements in out-of-domain settings. Additionally, due to its capability to
exploit a large text-captions-only datastore in a training-free
fashion, RECAP shows unique capabilities of captioning novel audio events never
seen during training and compositional audios with multiple events. To promote
research in this space, we also release 150,000+ new weakly labeled captions
for AudioSet, AudioCaps, and Clotho.
Authors' comments: Code and data soon here: https://github.com/Sreyan88/RECAP
Haokun Wen, Xian Zhang, Xuemeng Song, Yinwei Wei, Liqiang Nie
Composed image retrieval (CIR) is a new and flexible image retrieval paradigm, which can retrieve the target image for a multimodal query, including a reference image and its corresponding modification text. Although existing efforts have achieved compelling success, they overlook the conflict relationship modeling between the reference image and the modification text for improving the multimodal query composition and the adaptive matching degree modeling for promoting the ranking of the candidate images that could present different levels of matching degrees with the given query. To address these two limitations, in this work, we propose a Target-Guided Composed Image Retrieval network (TG-CIR). In particular, TG-CIR first extracts the unified global and local attribute features for the reference/target image and the modification text with the contrastive language-image pre-training model (CLIP) as the backbone, where an orthogonal regularization is introduced to promote the independence among the attribute features. Then TG-CIR designs a target-query relationship-guided multimodal query composition module, comprising a target-free student composition branch and a target-based teacher composition branch, where the target-query relationship is injected into the teacher branch for guiding the conflict relationship modeling of the student branch. Last, apart from the conventional batch-based classification loss, TG-CIR additionally introduces a batch-based target similarity-guided matching degree regularization to promote the metric learning process. Extensive experiments on three benchmark datasets demonstrate the superiority of our proposed method.