Enzo Shiraishi, Raphael Y. de Camargo, Henrique L. P. Silva, Ronaldo C. Prati
When combined with In-Context Learning, a technique that enables models to
adapt to new tasks by incorporating task-specific examples or demonstrations
directly within the input prompt, autoregressive language models have achieved
good performance in a wide range of tasks and applications. However, this
combination has not been properly explored in the context of named entity
recognition, where the structure of this task poses unique challenges. We
propose RENER (Retrieval-Enhanced Named Entity Recognition), a technique for
named entity recognition using autoregressive language models based on
In-Context Learning and information retrieval techniques. When presented with
an input text, RENER fetches similar examples from a dataset of training
examples that are used to enhance a language model to recognize named entities
from this input text. RENER is modular and independent of the underlying
language model and information retrieval algorithms. Experimental results show
that in the CrossNER collection we achieve state-of-the-art performance with
the proposed technique and that information retrieval can increase the F-score
by up to 11 percentage points.
Authors' comments: 13 pages, 6 figures, 3 tables
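The retrieve-then-prompt flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words cosine retriever, the prompt wording, and the names `cosine_bow`, `retrieve_examples`, and `build_prompt` are all assumptions. The abstract notes that RENER is independent of the retrieval algorithm, so any retriever could stand in here.

```python
import math
from collections import Counter


def cosine_bow(a, b):
    """Cosine similarity between bag-of-words vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_examples(query, train_set, k=2):
    """Return the k training examples most similar to the query text."""
    ranked = sorted(train_set, key=lambda ex: cosine_bow(query, ex["text"]), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble an in-context prompt: instruction, demonstrations, then the query."""
    parts = ["Extract the named entities from the text."]
    for ex in examples:
        parts.append(f"Text: {ex['text']}\nEntities: {ex['entities']}")
    parts.append(f"Text: {query}\nEntities:")
    return "\n\n".join(parts)


# Toy training set (hypothetical data, for illustration only).
train_set = [
    {"text": "Marie Curie worked in Paris", "entities": "Marie Curie (person); Paris (location)"},
    {"text": "The stock market fell sharply", "entities": "(none)"},
    {"text": "Alan Turing worked in Manchester", "entities": "Alan Turing (person); Manchester (location)"},
]
query = "Ada Lovelace worked in London"
examples = retrieve_examples(query, train_set, k=2)
prompt = build_prompt(query, examples)
```

The resulting `prompt` string would then be passed to whichever autoregressive language model is in use.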
Yan Cheng, Kui Ren, Nathan Soedjak
This work studies phase retrieval for wave fields, aiming to recover the phase of an incoming wave from multi-plane intensity measurements behind different types of linear and nonlinear media. We show that unique phase retrieval can be achieved by utilizing intensity data produced by multiple media. This uniqueness does not require prescribed boundary conditions for the phase in the incidence plane, in contrast to existing phase retrieval methods based on the transport of intensity equation. Moreover, the uniqueness proofs lead to explicit phase reconstruction algorithms. Numerical simulations are presented to validate the theory.
Loris Sauter, Ralph Gasser, Heiko Schuldt, Abraham Bernstein, Luca Rossetto
Performance evaluation in multimedia retrieval, as in the information retrieval domain at large, relies heavily on retrieval experiments, employing a broad range of techniques and metrics. These can involve human-in-the-loop and machine-only settings for the retrieval process itself and the subsequent verification of results. Such experiments can be elaborate and use-case-specific, which can make them difficult to compare or replicate. In this paper, we present a formal model to express all relevant aspects of such retrieval experiments, as well as a flexible open-source evaluation infrastructure that implements the model. These contributions intend to make a step towards lowering the hurdles for conducting retrieval experiments and improving their reproducibility.
Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu
In this paper, we propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks that search for short audio clips from an audio database, AMR aims to predict relevant moments in untrimmed long audio based on a text query. Given the lack of prior work in AMR, we first build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks. This model captures temporal dependencies within audio features, inspired by similar video moment retrieval tasks, thus surpassing conventional clip-level audio retrieval methods. Additionally, we provide manually annotated datasets to properly measure the effectiveness and robustness of our methods on real data. Experimental results show that AM-DETR, trained with Clotho-Moment, outperforms a baseline model that applies a clip-level audio retrieval method with a sliding window on all metrics, particularly improving Recall1@0.7 by 9.00 points. Our datasets and code are publicly available at https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval.
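For readers unfamiliar with the Recall1@0.7 metric mentioned above, the sketch below assumes it follows the convention common in video moment retrieval: a query counts as a hit when the top-ranked predicted moment has a temporal intersection-over-union (IoU) of at least 0.7 with the ground-truth moment. The abstract does not define the metric, so this reading and the function names are assumptions.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall1_at_iou(top1_preds, ground_truths, threshold=0.7):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, ground_truths)
    )
    return hits / len(ground_truths)


# Hypothetical predictions for two queries.
preds = [(2.0, 10.0), (0.0, 5.0)]
gts = [(3.0, 10.0), (4.0, 10.0)]
score = recall1_at_iou(preds, gts, threshold=0.7)
```

In this toy case the first prediction overlaps its target with IoU 7/8 and the second with IoU 1/10, so only the first counts as a hit.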
Jean-Jacques Godeme, Jalal Fadili
In this paper, we study the phase retrieval problem in the situation where the vector to be recovered has an a priori structure that can be encoded into a regularization term. This regularizer is intended to promote solutions conforming to some notion of simplicity or low complexity. We investigate both noiseless recovery and stability to noise and provide a very general and unified analysis framework that goes far beyond the sparse phase retrieval mostly considered in the literature. In the noiseless case we provide sufficient conditions under which exact recovery, up to global sign change, is possible. For Gaussian measurement maps, we also provide a sample complexity bound for exact recovery. This bound depends on the Gaussian width of the descent cone at the sought-after vector, which is a geometric measure of the complexity of the latter. In the noisy case, we consider both the constrained (Morozov) and penalized (Tikhonov) formulations. We provide sufficient conditions for stable recovery and prove linear convergence for sufficiently small noise. For Gaussian measurements, we again give a sample complexity bound for linear convergence to hold with high probability. This bound scales linearly in the intrinsic dimension of the sought-after vector but only logarithmically in the ambient dimension.
Krish Goel, Mahek Chandak
Large Language Models (LLMs) excel in natural language tasks but face limitations due to static training datasets, resulting in outdated or contextually shallow responses. Retrieval-Augmented Generation (RAG) addresses this by integrating real-time external knowledge, enhancing model accuracy and credibility, especially for knowledge-intensive tasks. However, RAG-enhanced LLMs struggle with long contexts, causing them to "choke" on information overload, compromising response quality. Recent RAG applications use hierarchical data structures for storing documents, organized at various levels of summarization and information density. In this context, we introduce HIRO (Hierarchical Information Retrieval Optimization), a novel querying approach for RAG applications using hierarchical structures for storing documents. HIRO employs DFS-based recursive similarity score calculation and branch pruning to minimize the context returned to the LLM without informational loss. HIRO outperforms existing querying mechanisms on the NarrativeQA dataset by an absolute performance gain of 10.85%.
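The idea of a depth-first traversal with branch pruning over a hierarchy of summaries can be sketched roughly as below. This is only an illustration under stated assumptions: the single `min_score` threshold, the rule of falling back to a parent's summary when no child qualifies, and the `Node` type are all hypothetical and are not taken from the HIRO paper.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A document-tree node: parents hold summaries, leaves hold raw chunks."""
    text: str
    vec: list
    children: list = field(default_factory=list)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def dfs_retrieve(node, query_vec, min_score=0.5):
    """Descend only into children similar enough to the query (assumed rule).

    Branches scoring below min_score are pruned. If no child qualifies,
    the node's own text is returned as the context for that branch.
    """
    selected = []
    for child in node.children:
        if cosine(child.vec, query_vec) >= min_score:
            selected.extend(dfs_retrieve(child, query_vec, min_score))
    return selected or [node.text]


# Toy hierarchy with hand-made 2-d "embeddings" (hypothetical data).
leaf_a1 = Node("chunk about topic A, part 1", [1.0, 0.1])
leaf_a2 = Node("chunk about topic A, part 2", [0.2, 1.0])
branch_a = Node("summary of topic A", [1.0, 0.0], [leaf_a1, leaf_a2])
branch_b = Node("summary of topic B", [0.0, 1.0])
root = Node("summary of everything", [1.0, 1.0], [branch_a, branch_b])

context = dfs_retrieve(root, query_vec=[1.0, 0.0])
```

Here the traversal prunes `branch_b` and `leaf_a2`, so only one chunk is returned to the language model instead of the whole tree.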
Simon Akesson, Frances A. Santos
Providing external knowledge to Large Language Models (LLMs) is a key point for using these models in real-world applications for several reasons, such as incorporating up-to-date content in a real-time manner, providing access to domain-specific knowledge, and contributing to hallucination prevention. The vector database-based Retrieval Augmented Generation (RAG) approach has been widely adopted to this end. Thus, any part of external knowledge can be retrieved and provided to some LLM as the input context. Despite the RAG approach's success, it still might be unfeasible for some applications, because the context retrieved can demand a longer context window than the size supported by the LLM. Even when the context retrieved fits into the context window size, the number of tokens might be substantial and, consequently, impact costs and processing time, becoming impractical for most applications. To address these issues, we propose CRAG, a novel approach able to effectively reduce the number of prompting tokens without degrading the quality of the response generated compared to a solution using RAG. Through our experiments, we show that CRAG can reduce the number of tokens by at least 46%, achieving more than 90% in some cases, compared to RAG. Moreover, the number of tokens with CRAG does not increase considerably when the number of reviews analyzed is higher, unlike RAG, where the number of tokens is almost 9x higher when there are 75 reviews compared to 4 reviews.
Dian Jiao, Li Cai, Jingsheng Huang, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
Retrieval-Augmented Generation (RAG) methods augment the input of Large
Language Models (LLMs) with relevant retrieved passages, reducing factual
errors in knowledge-intensive tasks. However, contemporary RAG approaches
suffer from irrelevant knowledge retrieval issues in complex domain questions
(e.g., HotPot QA) due to the lack of corresponding domain knowledge, leading to
low-quality generations. To address this issue, we propose a novel
Collaborative Retrieval-Augmented Generation framework, DuetRAG. Our
bootstrapping philosophy is to simultaneously integrate the domain fine-tuning
and RAG models to improve the knowledge retrieval quality, thereby enhancing
generation quality. Finally, we demonstrate that DuetRAG matches expert
human researchers on HotPot QA.
Authors' comments: 5 pages
Eugene Yang, Dawn Lawrie, James Mayfield
Recent work in cross-language information retrieval (CLIR), where queries and
documents are in different languages, has shown the benefit of the
Translate-Distill framework that trains a cross-language neural dual-encoder
model using translation and distillation. However, Translate-Distill only
supports a single document language. Multilingual information retrieval (MLIR),
which ranks a multilingual document collection, is harder to train than CLIR
because the model must assign comparable relevance scores to documents in
different languages. This work extends Translate-Distill and proposes
Multilingual Translate-Distill (MTD) for MLIR. We show that ColBERT-X models
trained with MTD outperform their counterparts trained with Multilingual
Translate-Train, which is the previous state-of-the-art training approach, by
5% to 25% in nDCG@20 and 15% to 45% in MAP. We also show that the model is
robust to the way languages are mixed in training batches. Our implementation
is available on GitHub.
Authors' comments: 6 pages, 1 figure, accepted at SIGIR 2024 as short paper
Wenhao Zhang, Mengqi Zhang, Shiguang Wu, Jiahuan Pei, Zhaochun Ren, Maarten de Rijke, Zhumin Chen, Pengjie Ren
Exclusion is an important and universal linguistic skill that humans use to express what they do not want. However, in the information retrieval community, there is little research on exclusionary retrieval, where users express what they do not want in their queries. In this work, we investigate the scenario of exclusionary retrieval in document retrieval for the first time. We present ExcluIR, a set of resources for exclusionary retrieval, consisting of an evaluation benchmark and a training set for helping retrieval models to comprehend exclusionary queries. The evaluation benchmark includes 3,452 high-quality exclusionary queries, each of which has been manually annotated. The training set contains 70,293 exclusionary queries, each paired with a positive document and a negative document. We conduct detailed experiments and analyses, obtaining three main observations: (1) Existing retrieval models with different architectures struggle to effectively comprehend exclusionary queries; (2) Although integrating our training data can improve the performance of retrieval models on exclusionary retrieval, there still exists a gap compared to human performance; (3) Generative retrieval models have a natural advantage in handling exclusionary queries. To facilitate future research on exclusionary retrieval, we share the benchmark and evaluation scripts at https://github.com/zwh-sdu/ExcluIR.
Thibault Formal, Stéphane Clinchant, Hervé Déjean, Carlos Lassance
The late interaction paradigm introduced with ColBERT stands out in the
neural Information Retrieval space, offering a compelling
effectiveness-efficiency trade-off across many benchmarks. Efficient late
interaction retrieval is based on an optimized multi-step strategy, where an
approximate search first identifies a set of candidate documents to re-rank
exactly. In this work, we introduce SPLATE, a simple and lightweight adaptation
of the ColBERTv2 model which learns an "MLM adapter", mapping its frozen
token embeddings to a sparse vocabulary space with a partially learned SPLADE
module. This allows us to perform the candidate generation step in late
interaction pipelines with traditional sparse retrieval techniques, making it
particularly appealing for running ColBERT in CPU environments. Our SPLATE
ColBERTv2 pipeline achieves the same effectiveness as the PLAID ColBERTv2
engine by re-ranking 50 documents that can be retrieved in under 10 ms.
Authors' comments: To appear at SIGIR'24 (short paper track)
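The two-step pipeline above (cheap candidate generation, then exact re-ranking) relies on ColBERT's late interaction score, often called MaxSim: each query token embedding is matched with its best document token embedding and the maxima are summed. The sketch below shows only the re-ranking step on toy vectors. The function names and the dictionary layout of candidates are assumptions, and the sparse first stage is assumed to have already produced the candidate set.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late interaction score: sum over query tokens of the best doc-token match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank(query_vecs, candidates, k=50):
    """Exactly re-rank first-stage candidates, given as {doc_id: token vectors}."""
    scored = sorted(
        candidates.items(), key=lambda kv: maxsim(query_vecs, kv[1]), reverse=True
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Toy 2-d token embeddings (hypothetical data).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "doc_b": [[1.0, 0.0], [1.0, 0.0]],  # covers only the first query token
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # covers both query tokens
}
ranking = rerank(query_vecs, candidates, k=50)
```

Because only the small candidate set is scored exactly, the cost of this step depends on `k` rather than on the collection size.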
Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang
With recent advances in speech synthesis including text-to-speech (TTS) and
voice conversion (VC) systems enabling the generation of ultra-realistic audio
deepfakes, there is growing concern about their potential misuse. However, most
deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a
single model, resulting in performance bottlenecks and transparency issues.
Inspired by retrieval-augmented generation (RAG), we propose a
retrieval-augmented detection (RAD) framework that augments test samples with
similar retrieved samples for enhanced detection. We also extend the
multi-fusion attentive classifier to integrate it with our proposed RAD
framework. Extensive experiments show the superior performance of the proposed
RAD framework over baseline methods, achieving state-of-the-art results on the
ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets.
Further sample analysis indicates that the retriever consistently retrieves
samples mostly from the same speaker with acoustic characteristics highly
consistent with the query audio, thereby improving detection performance.
Authors' comments: Accepted by the 2024 International Conference on Multimedia Retrieval
(ICMR 2024)
Dahlia Shehata
Despite the advantages of their low-resource settings, traditional sparse retrievers depend on exact matching approaches between high-dimensional bag-of-words (BoW) representations of both the queries and the collection. As a result, retrieval performance is restricted by semantic discrepancies and vocabulary gaps. On the other hand, transformer-based dense retrievers introduce significant improvements in information retrieval tasks by exploiting low-dimensional contextualized representations of the corpus. While dense retrievers are known for their relative effectiveness, they suffer from lower efficiency and a lack of generalization when compared to sparse retrievers. For a lightweight retrieval task, high computational resources and time consumption are major barriers encouraging the renunciation of dense models despite potential gains. In this work, I propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities in two formats for the entity names: 1) explicit and 2) hashed. A zero-shot end-to-end dense entity linking system is employed for entity recognition and disambiguation to augment the corpus. By leveraging advanced entity linking methods, I believe that the effectiveness gap between sparse and dense retrievers can be narrowed. Experiments are conducted on the MS MARCO passage dataset using the original qrel set, the re-ranked qrels favoured by MonoT5 and the latter set further re-ranked by DuoT5. Since I am concerned with early-stage retrieval in cascaded ranking architectures of large information retrieval systems, the results are evaluated using recall@1000. The suggested approach is also capable of retrieving documents for query subsets judged to be particularly difficult in prior work.
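The expansion and evaluation steps described above might look roughly like the sketch below. Two points are assumptions rather than details from the abstract: "hashed" is read here as replacing each entity name with a single digest token (MD5 is an arbitrary choice), and the entity list is supplied by hand instead of by an entity linking system. The function names are hypothetical.

```python
import hashlib


def expand(text, entities, mode="explicit"):
    """Append linked entity names to a query or passage before sparse indexing."""
    if mode == "explicit":
        extra = list(entities)
    elif mode == "hashed":
        # Assumed reading of "hashed": one opaque token per entity name.
        extra = [hashlib.md5(e.encode("utf-8")).hexdigest() for e in entities]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return " ".join([text] + extra)


def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of the relevant documents found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


explicit_query = expand("who wrote hamlet", ["William Shakespeare"], mode="explicit")
hashed_query = expand("who wrote hamlet", ["William Shakespeare"], mode="hashed")
recall = recall_at_k(["d1", "d2", "d3"], {"d2", "d9"}, k=2)
```

A hashed token keeps a multi-word entity name as one exact-match term, whereas the explicit form lets its individual words match other text.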
Yan Fang, Jingtao Zhan, Qingyao Ai, Jiaxin Mao, Weihang Su, Jia Chen, Yiqun Liu
Scaling up neural models has yielded significant advancements in a wide array
of tasks, particularly in language generation. Previous studies have found that
the performance of neural models frequently adheres to predictable scaling
laws, correlated with factors such as training set size and model size. This
insight is invaluable, especially as large-scale experiments grow increasingly
resource-intensive. Yet, such scaling laws have not been fully explored in dense
retrieval due to the discrete nature of retrieval metrics and complex
relationships between training data and model sizes in retrieval tasks. In this
study, we investigate whether the performance of dense retrieval models follows
scaling laws as other neural models do. We propose to use contrastive
log-likelihood as the evaluation metric and conduct extensive experiments with
dense retrieval models implemented with different numbers of parameters and
trained with different amounts of annotated data. Results indicate that, under
our settings, the performance of dense retrieval models follows a precise
power-law scaling related to the model size and the number of annotations.
Additionally, we examine scaling with prevalent data augmentation methods to
assess the impact of annotation quality, and apply the scaling law to find the
best resource allocation strategy under a budget constraint. We believe that
these insights will significantly contribute to understanding the scaling
effect of dense retrieval models and offer meaningful guidance for future
research endeavors.
Authors' comments: Accepted at SIGIR 2024. V2 fixes a bug in the experiments
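To make the notion of a power-law scaling relation concrete, the sketch below fits a curve of the form L(N) = A * N^(-alpha) by ordinary least squares in log-log space. The functional form (with no irreducible-loss term), the synthetic data points, and the function name are assumptions for illustration. They are not the paper's fitted law or its measurements.

```python
import math


def fit_power_law(sizes, losses):
    """Fit losses ~ A * sizes**(-alpha) via linear regression on the logs."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (A, alpha)


# Synthetic points generated from a known law, so the fit can be checked.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * s ** -0.1 for s in sizes]
A, alpha = fit_power_law(sizes, losses)

# Extrapolate to a larger, untested size using the fitted parameters.
predicted = A * (1e10) ** -alpha
```

Once such a relation is fitted on small runs, it can be used to estimate the loss of a larger configuration before paying for it, which is the use case for budget allocation mentioned above.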
Huayang Li, Deng Cai, Zhi Qu, Qu Cui, Hidetaka Kamigaito, Lemao Liu, Taro Watanabe
Phrase-level dense retrieval has shown many appealing characteristics in
downstream NLP tasks by leveraging the fine-grained information that phrases
offer. In our work, we propose a new task formulation of dense retrieval,
cross-lingual contextualized phrase retrieval, which aims to augment
cross-lingual applications by addressing polysemy using context information.
However, the lack of specific training data and models is the primary
challenge to achieving our goal. To address this, we extract pairs of cross-lingual
phrases using word alignment information automatically induced from parallel
sentences. Subsequently, we train our Cross-lingual Contextualized Phrase
Retriever (CCPR) using contrastive learning, which encourages the hidden
representations of phrases with similar contexts and semantics to align
closely. Comprehensive experiments on both the cross-lingual phrase retrieval
task and a downstream task, i.e., machine translation, demonstrate the
effectiveness of CCPR. On the phrase retrieval task, CCPR surpasses baselines
by a significant margin, achieving a top-1 accuracy that is at least 13 points
higher. When utilizing CCPR to augment the large-language-model-based
translator, it achieves average gains of 0.7 and 1.5 in BERTScore for
translations from X=>En and vice versa, respectively, on the WMT16 dataset. Our
code and data are available at https://github.com/ghrua/ccpr_release.
Authors' comments: Accepted to Findings of EMNLP 2024
Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen
Though pre-training vision-language models have demonstrated significant benefits in boosting video-text retrieval performance from large-scale web videos, fine-tuning still plays a critical role with manually annotated clips with start and end times, which requires considerable human effort. To address this issue, we explore an alternative cheaper source of annotations, single timestamps, for video-text retrieval. We initialise clips from timestamps in a heuristic way to warm up a retrieval model. Then a video clip editing method is proposed to refine the initial rough boundaries to improve retrieval performance. A student-teacher network is introduced for video clip editing. The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips. The teacher weights are updated from the student's after the student's performance increases. Our method is model agnostic and applicable to any retrieval model. We conduct experiments based on three state-of-the-art retrieval models, COOT, VideoCLIP and CLIP4Clip. Experiments conducted on three video retrieval datasets, YouCook2, DiDeMo and ActivityNet-Captions, show that our edited clips consistently improve retrieval performance over initial clips across all three retrieval models.
Fengran Mo, Chen Qu, Kelong Mao, Tianyu Zhu, Zhan Su, Kaiyu Huang, Jian-Yun Nie
Conversational search facilitates complex information retrieval by enabling multi-turn interactions between users and the system. Supporting such interactions requires a comprehensive understanding of the conversational inputs to formulate a good search query based on historical information. In particular, the search query should include the relevant information from the previous conversation turns. However, current approaches for conversational dense retrieval primarily rely on fine-tuning a pre-trained ad-hoc retriever using the whole conversational search session, which can be lengthy and noisy. Moreover, existing approaches are limited by the amount of manual supervision signals in the existing datasets. To address the aforementioned issues, we propose a History-Aware Conversational Dense Retrieval (HAConvDR) system, which incorporates two ideas: context-denoised query reformulation and automatic mining of supervision signals based on the actual impact of historical turns. Experiments on two public conversational search datasets demonstrate the improved history modeling capability of HAConvDR, in particular for long conversations with topic shifts.
Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references.
Weixuan Wang, Barry Haddow, Alexandra Birch
Knowledge represented in Large Language Models (LLMs) is quite often incorrect and can also become obsolete over time. Updating knowledge via fine-tuning is computationally resource-hungry and not reliable, and so knowledge editing (KE) has developed as an effective and economical alternative to inject new knowledge or to fix factual errors in LLMs. Although there has been considerable interest in this area, current KE research exclusively focuses on the monolingual setting, typically in English. However, what happens if the new knowledge is supplied in one language, but we would like to query the LLM in a different language? To address the problem of multilingual knowledge editing, we propose Retrieval-augmented Multilingual Knowledge Editor (ReMaKE) to update new knowledge in LLMs. ReMaKE can perform model-agnostic knowledge editing in multilingual settings. ReMaKE concatenates the new knowledge retrieved from a multilingual knowledge base with prompts. Our experimental results show that ReMaKE outperforms baseline knowledge editing methods by a significant margin and is the first KE method to work in a multilingual setting. We provide our multilingual knowledge editing dataset (MzsRE) in 12 languages, which, along with code and additional project information, is available at https://github.com/Vicky-Wil/ReMaKE.
Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm
designed to speed up language model generation. The key insight driving the
development of REST is the observation that the process of text generation
often includes certain common phases and patterns. Unlike previous methods that
rely on a draft language model for speculative decoding, REST harnesses the
power of retrieval to generate draft tokens. This method draws from the
reservoir of existing knowledge, retrieving and employing relevant tokens based
on the current context. Its plug-and-play nature allows for seamless
integration and acceleration of any language model, all without necessitating
additional training. When benchmarked on 7B and 13B language models in a
single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on
code or text generation. The code of REST is available at
https://github.com/FasterDecoding/REST.
Authors' comments: NAACL 2024, camera ready
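The general idea of drafting tokens by retrieval and then checking them against the model can be illustrated with the toy sketch below. It is a simplified stand-in under stated assumptions: the linear scan over a flat token list, the longest-suffix matching rule, and the one-token-at-a-time `verify` loop are illustrative choices, and `next_token` is a hypothetical placeholder for a real language model. The data structures and verification scheme REST itself uses are not described in the abstract and are not reproduced here.

```python
def retrieve_draft(context, datastore, max_suffix=4, draft_len=3):
    """Find the longest context suffix in the datastore; return what followed it."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-n:]
        for i in range(len(datastore) - n + 1):
            if datastore[i:i + n] == suffix:
                draft = datastore[i + n:i + n + draft_len]
                if draft:
                    return draft
    return []


def verify(context, draft, next_token):
    """Keep the longest draft prefix that the model itself would have produced."""
    accepted = []
    for tok in draft:
        if next_token(context + accepted) != tok:
            break
        accepted.append(tok)
    return accepted


# Toy datastore and a toy "model" that predicts from the last token only.
datastore = "the cat sat on the mat and the dog sat on the rug".split()
toy_model = {"on": "the", "the": "rug"}


def next_token(tokens):
    return toy_model.get(tokens[-1], "<unk>")


context = "a bird sat on".split()
draft = retrieve_draft(context, datastore)
accepted = verify(context, draft, next_token)
```

In an actual speculative decoding setup the draft tokens would be checked in a single batched forward pass, which is where the speedup comes from. The sequential loop here only shows the acceptance rule.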