Zhiqi Huang, Hansi Zeng, Hamed Zamani, James Allan
In this work, we explore a Multilingual Information Retrieval (MLIR) task, where the collection includes documents in multiple languages. We demonstrate that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance. This is due to the heterogeneous and imbalanced nature of multilingual collections -- some languages are better represented in the collection and some benefit from large-scale training data. To address this issue, we present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space. To address the challenges of data scarcity and imbalance, we introduce a knowledge distillation strategy. The teacher model is trained on rich English retrieval data, and by leveraging bi-text data, our distillation framework transfers its retrieval knowledge to the multilingual document encoder. Therefore, our approach does not require any multilingual retrieval training data. Extensive experiments on three MLIR datasets with a total of 15 languages demonstrate that KD-SPD significantly outperforms competitive baselines in all cases. We conduct extensive analyses to show that our method has less language bias and better zero-shot transfer ability towards new languages.
Matúš Pikuliak, Ivan Srba, Robert Moro, Timo Hromadka, Timotej Smolen, Martin Melisek, Ivan Vykopal, Jakub Simko et al.
Fact-checkers are often hampered by the sheer amount of online content that
needs to be fact-checked. NLP can help them by retrieving already existing
fact-checks relevant to the content being investigated. This paper introduces a
new multilingual dataset -- MultiClaim -- for previously fact-checked claim
retrieval. We collected 28k posts in 27 languages from social media, 206k
fact-checks in 39 languages written by professional fact-checkers, as well as
31k connections between these two groups. This is the most extensive and the
most linguistically diverse dataset of this kind to date. We evaluated how
different unsupervised methods fare on this dataset and its various dimensions.
We show that evaluating on such a diverse dataset has its complexities and proper
care needs to be taken before interpreting the results. We also evaluated a
supervised fine-tuning approach, improving upon the unsupervised method
significantly.
Authors' comments: Accepted at EMNLP 2023
Orion Weller, Dawn Lawrie, Benjamin Van Durme
Negation is a common everyday phenomenon and has been a consistent area of
weakness for language models (LMs). Although the Information Retrieval (IR)
community has adopted LMs as the backbone of modern IR architectures, there has
been little to no research in understanding how negation impacts neural IR. We
therefore construct a straightforward benchmark on this theme: asking IR models
to rank two documents that differ only by negation. We show that the results
vary widely according to the type of IR architecture: cross-encoders perform
best, followed by late-interaction models, and in last place are bi-encoder and
sparse neural architectures. We find that most information retrieval models
(including SOTA ones) do not consider negation, performing the same or worse
than a random ranking. We show that although the obvious approach of continued
fine-tuning on a dataset of contrastive documents containing negations
increases performance (as does model size), there is still a large gap between
machine and human performance.
Authors' comments: Accepted to EACL 2024
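The benchmark idea above (ranking two documents that differ only by negation) can be illustrated with a small sketch. The abstract does not spell out the exact metric, so the pairwise accuracy below, the toy lexical scorer `overlap_score`, and the example triple are all assumptions made for illustration only.

```python
def pairwise_accuracy(score, examples):
    """Fraction of examples where the relevant document outscores its negated twin.

    `examples` holds (query, relevant_doc, negated_doc) triples.
    Ties count as errors, since a tie means the scorer ignored the negation.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    wins = sum(1 for q, rel, neg in examples if score(q, rel) > score(q, neg))
    return wins / len(examples)


def overlap_score(query, doc):
    """Hypothetical lexical scorer: count of query tokens that occur in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


# Hypothetical example: the two documents differ only by the word "not".
examples = [
    ("which metal does rust", "iron does rust in water", "iron does not rust in water"),
]
print(pairwise_accuracy(overlap_score, examples))
```

Because the lexical scorer assigns both documents the same score, it cannot prefer the correct one. This is the kind of failure the benchmark is designed to expose.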
Ehsan Kamalloo, Xinyu Zhang, Odunayo Ogundepo, Nandan Thakur, David Alfonso-Hermelo, Mehdi Rezagholizadeh, Jimmy Lin
The ever-increasing size of language models curtails their widespread
availability to the community, thereby galvanizing many companies into offering
access to large language models through APIs. One particular type, suitable for
dense retrieval, is a semantic embedding service that builds vector
representations of input text. With a growing number of publicly available
APIs, our goal in this paper is to analyze existing offerings in realistic
retrieval scenarios, to assist practitioners and researchers in finding
suitable services according to their needs. Specifically, we investigate the
capabilities of existing semantic embedding APIs on domain generalization and
multilingual retrieval. For this purpose, we evaluate these services on two
standard benchmarks, BEIR and MIRACL. We find that re-ranking BM25 results
using the APIs is a budget-friendly approach and is most effective in English,
in contrast to the standard practice of employing them as first-stage
retrievers. For non-English retrieval, re-ranking still improves the results,
but a hybrid model with BM25 works best, albeit at a higher cost. We hope our
work lays the groundwork for evaluating semantic embedding APIs that are
critical in search and more broadly, for information access.
Authors' comments: ACL 2023 Industry Track
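A minimal sketch of re-ranking BM25 candidates with embedding similarity, with an optional interpolation with the BM25 score. This is a simplification under stated assumptions: the abstract does not describe how the hybrid model fuses scores, so the min-max normalization, the weight `alpha`, and the function names are hypothetical, and the embeddings would in practice come from an embedding API rather than hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] so both score types are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def rerank(query_vec, bm25_scores, doc_vecs, alpha=1.0):
    """Re-order BM25 candidates.

    alpha=1.0 uses embedding similarity only (pure re-ranking);
    0 < alpha < 1 interpolates with the normalized BM25 score (assumed fusion rule).
    """
    dense = {d: cosine(query_vec, doc_vecs[d]) for d in bm25_scores}
    dense_n, bm25_n = min_max(dense), min_max(bm25_scores)
    fused = {d: alpha * dense_n[d] + (1 - alpha) * bm25_n[d] for d in bm25_scores}
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical first-stage BM25 scores and document embeddings.
bm25_scores = {"d1": 12.0, "d2": 9.0, "d3": 5.0}
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.6, 0.8]}
print(rerank([0.0, 1.0], bm25_scores, doc_vecs, alpha=1.0))
print(rerank([0.0, 1.0], bm25_scores, doc_vecs, alpha=0.5))
```

Only the BM25 candidates are embedded here, which is why re-ranking is cheaper than embedding the whole collection for first-stage retrieval.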
Yiqing Xie, Xiao Liu, Chenyan Xiong
In this work, we present an unsupervised retrieval method with contrastive
learning on web anchors. The anchor text describes the content that is
referenced from the linked page. This shows similarities to search queries that
aim to retrieve pertinent information from relevant documents. Based on their
commonalities, we train an unsupervised dense retriever, Anchor-DR, with a
contrastive learning task that matches the anchor text and the linked document.
To filter out uninformative anchors (such as "homepage" or other functional
anchors), we present a novel filtering technique to only select anchors that
contain similar types of information as search queries. Experiments show that
Anchor-DR outperforms state-of-the-art methods on unsupervised dense retrieval
by a large margin (e.g., by 5.3% NDCG@10 on MSMARCO). The gain of our method is
especially significant for search and question answering tasks. Our analysis
further reveals that the pattern of anchor-document pairs is similar to that of
search query-document pairs. Code available at
https://github.com/Veronicium/AnchorDR.
Authors' comments: SIGIR'23 Short
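The contrastive task of matching anchor text to its linked document can be sketched as follows. The abstract does not name the loss, so the InfoNCE objective with in-batch negatives, the dot-product similarity, and the `temperature` parameter are assumptions. The vectors stand in for encoder outputs.

```python
import math


def info_nce(anchor_vecs, doc_vecs, temperature=1.0):
    """Mean InfoNCE loss where anchor i should match document i.

    All other documents in the batch act as negatives for anchor i.
    """
    losses = []
    for i, anchor in enumerate(anchor_vecs):
        logits = [
            sum(a * d for a, d in zip(anchor, doc)) / temperature for doc in doc_vecs
        ]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Hypothetical embeddings: aligned pairs should give a lower loss than swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```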
Marco Peer, Florian Kleber, Robert Sablatnig
This paper presents an unsupervised approach for writer retrieval based on clustering SIFT descriptors detected at keypoint locations, which yields pseudo-cluster labels. With those cluster labels, a residual network followed by our proposed NetRVLAD, an encoding layer with reduced complexity compared to NetVLAD, is trained on 32x32 patches at keypoint locations. Additionally, we suggest a graph-based reranking algorithm called SGR to exploit similarities of the page embeddings to boost the retrieval performance. Our approach is evaluated on two historical datasets (Historical-WI and HisIR19). We include an evaluation of different backbones and NetRVLAD. Our approach competes with related work on historical datasets without using explicit encodings. We set a new state of the art on both datasets by applying our reranking scheme and show that our approach achieves comparable performance on a modern dataset as well.
Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu, Yuan Ni, Guotong Xie, Xiaoling Wang et al.
In-context learning is a new learning paradigm where a language model
conditions on a few input-output pairs (demonstrations) and a test input, and
directly outputs the prediction. It has been shown to be highly dependent on the
provided demonstrations and thus promotes the research of demonstration
retrieval: given a test input, relevant examples are retrieved from the
training set to serve as informative demonstrations for in-context learning.
While previous works focus on training task-specific retrievers for several
tasks separately, these methods are often hard to transfer and scale on various
tasks, and separately trained retrievers incur a lot of parameter storage and
deployment cost. In this paper, we propose Unified Demonstration Retriever
(UDR), a single model to retrieve demonstrations for a wide range of
tasks. To train UDR, we cast various tasks' training signals into a unified
list-wise ranking formulation using a language model's feedback. Then we propose a
multi-task list-wise ranking training framework, with an iterative mining
strategy to find high-quality candidates, which can help UDR fully incorporate
various tasks' signals. Experiments on 30+ tasks across 13 task families and
multiple data domains show that UDR significantly outperforms baselines.
Further analyses show the effectiveness of each proposed component and UDR's
strong ability in various scenarios including different LMs (1.3B - 175B),
unseen datasets, varying demonstration quantities, etc.
Authors' comments: ACL 2023 camera ready version
Nishant Balepur, Jie Huang, Kevin Chen-Chuan Chang
Expository documents are vital resources for conveying complex information to
readers. Despite their usefulness, writing expository text by hand is a
challenging process that requires careful content planning, obtaining facts
from multiple sources, and the ability to clearly synthesize these facts. To
ease these burdens, we propose the task of expository text generation, which
seeks to automatically generate an accurate and stylistically consistent
expository text for a topic by intelligently searching a knowledge source. We
solve our task by developing IRP, a framework that overcomes the limitations of
retrieval-augmented models and iteratively performs content planning, fact
retrieval, and rephrasing. Through experiments on three diverse,
newly-collected datasets, we show that IRP produces factual and organized
expository texts that accurately inform readers.
Authors' comments: Accepted to EMNLP 2023 Main Conference
Xiaoyang Chen, Yanjiang Liu, Ben He, Le Sun, Yingfei Sun
The Differentiable Search Index (DSI) is a novel information retrieval (IR)
framework that utilizes a differentiable function to generate a sorted list of
document identifiers in response to a given query. However, due to the
black-box nature of the end-to-end neural architecture, it remains to be
understood to what extent DSI possesses the basic indexing and retrieval
abilities. To mitigate this gap, in this study, we define and examine three
important abilities that a functioning IR framework should possess, namely,
exclusivity, completeness, and relevance ordering. Our analytical
experimentation shows that while DSI demonstrates proficiency in memorizing the
unidirectional mapping from pseudo queries to document identifiers, it falls
short in distinguishing relevant documents from random ones, thereby negatively
impacting its retrieval effectiveness. To address this issue, we propose a
multi-task distillation approach to enhance the retrieval quality without
altering the structure of the model and successfully endow it with improved
indexing abilities. Through experiments conducted on various datasets, we
demonstrate that our proposed method outperforms previous DSI baselines.
Authors' comments: Accepted to Findings of ACL 2023
James Mayfield, Eugene Yang, Dawn Lawrie, Samuel Barham, Orion Weller, Marc Mason, Suraj Nair, Scott Miller
A key stumbling block for neural cross-language information retrieval (CLIR)
systems has been the paucity of training data. The appearance of the MS MARCO
monolingual training set led to significant advances in the state of the art in
neural monolingual retrieval. By translating the MS MARCO documents into other
languages using machine translation, this resource has been made useful to the
CLIR community. Yet such translation suffers from a number of problems. While
MS MARCO is a large resource, it is of fixed size; its genre and domain of
discourse are fixed; and the translated documents are not written in the
language of a native speaker of the language, but rather in translationese. To
address these problems, we introduce the JH-POLO CLIR training set creation
methodology. The approach begins by selecting a pair of non-English passages. A
generative large language model is then used to produce an English query for
which the first passage is relevant and the second passage is not relevant. By
repeating this process, collections of arbitrary size can be created in the
style of MS MARCO but using naturally-occurring documents in any desired genre
and domain of discourse. This paper describes the methodology in detail, shows
its use in creating new CLIR training sets, and describes experiments using the
newly created training data.
Authors' comments: 11 pages, 4 figures
Hamed Zamani, Michael Bendersky
Dense retrieval models use bi-encoder network architectures for learning
query and document representations. These representations are often in the form
of a vector representation and their similarities are often computed using the
dot product function. In this paper, we propose a new representation learning
framework for dense retrieval. Instead of learning a vector for each query and
document, our framework learns a multivariate distribution and uses negative
multivariate KL divergence to compute the similarity between distributions. For
simplicity and efficiency reasons, we assume that the distributions are
multivariate normals and then train large language models to produce mean and
variance vectors for these distributions. We provide a theoretical foundation
for the proposed framework and show that it can be seamlessly integrated into
the existing approximate nearest neighbor algorithms to perform retrieval
efficiently. We conduct an extensive suite of experiments on a wide range of
datasets, and demonstrate significant improvements compared to competitive
dense retrieval models.
Authors' comments: Accepted for publication at SIGIR 2023
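For multivariate normals with diagonal covariance, the negative KL divergence used as the similarity has a closed form. The sketch below assumes diagonal covariances and the direction KL(query || document). The abstract specifies neither detail, and the function name is hypothetical.

```python
import math


def neg_kl_diag_gaussians(mu_q, var_q, mu_d, var_d):
    """Similarity = -KL(N(mu_q, diag(var_q)) || N(mu_d, diag(var_d))).

    Per dimension: 0.5 * (log(var_d / var_q) + (var_q + (mu_q - mu_d)^2) / var_d - 1).
    """
    kl = 0.0
    for mq, vq, md, vd in zip(mu_q, var_q, mu_d, var_d):
        kl += math.log(vd / vq) + (vq + (mq - md) ** 2) / vd - 1.0
    return -0.5 * kl


# Hypothetical mean and variance vectors, as if produced by the encoders.
query = ([0.0, 0.0], [1.0, 1.0])
near_doc = ([0.1, 0.0], [1.0, 1.0])
far_doc = ([2.0, 2.0], [1.0, 1.0])
print(neg_kl_diag_gaussians(*query, *near_doc) > neg_kl_diag_gaussians(*query, *far_doc))
```

The similarity is at most zero and reaches zero only when the two distributions coincide, so ranking by it places the closest document distributions first.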
Aleksei Shabanov, Aleksei Tarasov, Sergey Nikolenko
Current metric learning approaches for image retrieval are usually based on
learning a space of informative latent representations where simple approaches
such as the cosine distance will work well. Recent state of the art methods
such as HypViT move to more complex embedding spaces that may yield better
results but are harder to scale to production environments. In this work, we
first construct a simpler model based on triplet loss with hard negatives
mining that performs at the state of the art level but does not have these
drawbacks. Second, we introduce a novel approach for image retrieval
postprocessing called Siamese Transformer for Image Retrieval (STIR) that
reranks several top outputs in a single forward pass. Unlike previously
proposed Reranking Transformers, STIR does not rely on global/local feature
extraction and directly compares a query image and a retrieved candidate at
the pixel level using an attention mechanism. The resulting approach
defines a new state of the art on standard image retrieval datasets: Stanford
Online Products and DeepFashion In-shop. We also release the source code at
https://github.com/OML-Team/open-metric-learning/tree/main/pipelines/postprocessing/
and an interactive demo of our approach at
https://dapladoc-oml-postprocessing-demo-srcappmain-pfh2g0.streamlit.app/
Authors' comments: 14 pages, 3 figures
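A triplet loss with hard negative mining can be sketched as below. The abstract does not give the mining rule, so the batch-hard variant (hardest positive and hardest negative per anchor), the Euclidean distance, and the `margin` value are assumptions. The released code may differ.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.5):
    """Mean over anchors of max(0, d(hardest positive) - d(hardest negative) + margin)."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        pos = [
            euclidean(anchor, e)
            for j, (e, l) in enumerate(zip(embeddings, labels))
            if l == label and j != i
        ]
        neg = [euclidean(anchor, e) for e, l in zip(embeddings, labels) if l != label]
        if not pos or not neg:
            continue  # an anchor needs at least one positive and one negative
        # Hardest positive is the farthest one; hardest negative is the closest one.
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical batch: the class-1 point sits between the two class-0 points.
print(batch_hard_triplet_loss([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]], [0, 0, 1]))
```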
Timo Breuer, Christin Katharina Kreutz, Philipp Schaer, Dirk Tunger
Digital libraries in the scientific domain provide users access to a wide
range of information to satisfy their diverse information needs. Here, ranking
results play a crucial role in users' satisfaction. Exploiting bibliometric
metadata, e.g., publications' citation counts or bibliometric indicators in
general, for automatically identifying the most relevant results can boost
retrieval performance. This work proposes bibliometric data fusion, which
enriches existing systems' results by incorporating bibliometric metadata such
as citations or altmetrics. Our results on three biomedical retrieval
benchmarks from TREC Precision Medicine (TREC-PM) show that bibliometric data
fusion is a promising approach to improve retrieval performance in terms of
normalized Discounted Cumulated Gain (nDCG) and Average Precision (AP), at the
cost of the Precision at 10 (P@10) rate. Patient users especially profit from
this lightweight, data-sparse technique that applies to any digital library.
Authors' comments: 10 pages + references, conference paper accepted at JCDL'23
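The trade-off reported above involves metrics that reward different things: nDCG discounts gains by rank over the whole list, while P@10 only counts relevant items in the top positions. A minimal sketch follows, assuming linear gains and an ideal ranking built from the same judged list. Evaluation tools may use other gain functions or the full set of judged documents.

```python
import math


def dcg(rels):
    """Discounted cumulated gain with a log2 rank discount and linear gain."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))


def ndcg(ranked_rels, k=None):
    """nDCG of a ranked list of graded relevance labels, optionally cut off at k."""
    rels = ranked_rels if k is None else ranked_rels[:k]
    ideal = sorted(ranked_rels, reverse=True)[: len(rels)]
    ideal_dcg = dcg(ideal)
    return dcg(rels) / ideal_dcg if ideal_dcg > 0 else 0.0


def precision_at_k(ranked_rels, k):
    """Share of the top-k results with a relevance label above zero."""
    return sum(1 for rel in ranked_rels[:k] if rel > 0) / k


# Hypothetical relevance labels of a ranked result list.
ranking = [0, 2, 1, 0]
print(ndcg(ranking), precision_at_k(ranking, 2))
```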
Lucas Georges Gabriel Charpentier, Sondre Wold, David Samuel, Egil Rønningstad
Retrieval-based language models are increasingly employed in
question-answering tasks. These models search in a corpus of documents for
relevant information instead of having all factual knowledge stored in their
parameters, thereby enhancing efficiency, transparency, and adaptability. We
develop the first Norwegian retrieval-based model by adapting the REALM
framework and evaluating it on various tasks. After training, we also separate
the language model, which we call the reader, from the retriever components,
and show that this can be fine-tuned on a range of downstream tasks. Results
show that retrieval augmented language modeling improves the reader's
performance on extractive question-answering, suggesting that this type of
training improves language models' general ability to use context and that this
does not happen at the expense of other abilities such as part-of-speech
tagging, dependency parsing, named entity recognition, and lemmatization. Code,
trained models, and data are made publicly available.
Authors' comments: Accepted for NoDaLiDa 2023, main conference
Yen-Chieh Lien, Hamed Zamani, W. Bruce Croft
Neural ranking models (NRMs) have demonstrated effective performance in several information retrieval (IR) tasks. However, training NRMs often requires large-scale training data, which is difficult and expensive to obtain. To address this issue, one can train NRMs via weak supervision, where a large dataset is automatically generated using an existing ranking model (called the weak labeler) for training NRMs. Weakly supervised NRMs can generalize from the observed data and significantly outperform the weak labeler. This paper generalizes this idea through an iterative re-labeling process, demonstrating that weakly supervised models can iteratively play the role of weak labeler and significantly improve ranking performance without using manually labeled data. The proposed Generalized Weak Supervision (GWS) solution is generic and orthogonal to the ranking model architecture. This paper offers four implementations of GWS: self-labeling, cross-labeling, joint cross- and self-labeling, and greedy multi-labeling. GWS also benefits from a query importance weighting mechanism based on query performance prediction methods to reduce noise in the generated training data. We further draw a theoretical connection between self-labeling and Expectation-Maximization. Our experiments on two passage retrieval benchmarks suggest that all implementations of GWS lead to substantial improvements compared to weak supervision in all cases.
Kévin Deturck, Parantapa Goswami, Damien Nouvel, Frédérique Segond
In this paper, we present our participation in the 2018 edition of CLEF MC2, Task 2: Mining opinion argumentation. The task consists of detecting the most argumentative and diverse Tweets about festivals in English and French from a massive multilingual collection. We measure the argumentativity of a Tweet by counting the argumentation compounds it contains. We consider an argumentation compound to be a combination of an opinion expression and its support with facts and a particular structure. Regarding diversity, we consider the number of festival aspects covered by Tweets. An initial step filters the original dataset to fit the language and topic requirements of the task. Then, we compute and integrate linguistic descriptors to detect claims and their respective justifications in Tweets. The final step extracts the most diverse arguments by clustering Tweets according to their textual content and selecting the most argumentative ones from each cluster. We conclude the paper by describing the different ways we combined the descriptors across the runs we submitted and by discussing their results.
Si Sun, Yida Lu, Shi Yu, Xiangyang Li, Zhonghua Li, Zhao Cao, Zhiyuan Liu, Deiming Ye et al.
Few-shot dense retrieval (DR) aims to effectively generalize to novel search
scenarios by learning from a few samples. Despite its importance, there is little
study on specialized datasets and standardized evaluation protocols. As a
result, current methods often resort to random sampling from supervised
datasets to create "few-data" setups and employ inconsistent training
strategies during evaluations, which poses a challenge in accurately comparing
recent progress. In this paper, we propose a customized FewDR dataset and a
unified evaluation benchmark. Specifically, FewDR employs class-wise sampling
to establish a standardized "few-shot" setting with finely-defined classes,
reducing variability in multiple sampling rounds. Moreover, the dataset is
split into disjoint base and novel classes, allowing DR models to be continuously
trained on ample data from base classes and a few samples in novel classes.
This benchmark eliminates the risk of novel class leakage, providing a reliable
estimation of the DR model's few-shot ability. Our extensive empirical results
reveal that current state-of-the-art DR models still face challenges in the
standard few-shot setting. Our code and data will be open-sourced at
https://github.com/OpenMatch/ANCE-Tele.
Authors' comments: Work in progress
Weiwei Sun, Lingyong Yan, Zheng Chen, Shuaiqiang Wang, Haichao Zhu, Pengjie Ren, Zhumin Chen, Dawei Yin et al.
Conventional document retrieval techniques are mainly based on the index-retrieve paradigm. It is challenging to optimize pipelines based on this paradigm in an end-to-end manner. As an alternative, generative retrieval represents documents as identifiers (docids) and retrieves documents by generating docids, enabling end-to-end modeling of document retrieval tasks. However, it is an open question how one should define the document identifiers. Current approaches to the task of defining document identifiers rely on fixed rule-based docids, such as the title of a document or the result of clustering BERT embeddings, which often fail to capture the complete semantic information of a document. We propose GenRet, a document tokenization learning method to address the challenge of defining document identifiers for generative retrieval. GenRet learns to tokenize documents into short discrete representations (i.e., docids) via a discrete auto-encoding approach. Three components are included in GenRet: (i) a tokenization model that produces docids for documents; (ii) a reconstruction model that learns to reconstruct a document based on a docid; and (iii) a sequence-to-sequence retrieval model that generates relevant document identifiers directly for a designated query. By using an auto-encoding framework, GenRet learns semantic docids in a fully end-to-end manner. We also develop a progressive training scheme to capture the autoregressive nature of docids and to stabilize training. We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets to assess the effectiveness of GenRet. GenRet establishes a new state of the art on the NQ320K dataset. In particular, compared to generative retrieval baselines, GenRet achieves significant improvements on unseen documents. GenRet also outperforms comparable baselines on MS MARCO and BEIR, demonstrating the method's generalizability.
Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, Ziwei Liu
3D human motion generation is crucial for the creative industry. Recent advances rely on generative models with domain knowledge for text-driven motion generation, leading to substantial progress in capturing common motions. However, the performance on more diverse motions remains unsatisfactory. In this work, we propose ReMoDiffuse, a diffusion-model-based motion generation framework that integrates a retrieval mechanism to refine the denoising process. ReMoDiffuse enhances the generalizability and diversity of text-driven motion generation with three key designs: 1) Hybrid Retrieval finds appropriate references from the database in terms of both semantic and kinematic similarities. 2) Semantic-Modulated Transformer selectively absorbs retrieval knowledge, adapting to the difference between retrieved samples and the target motion sequence. 3) Condition Mixture better utilizes the retrieval database during inference, overcoming the scale sensitivity in classifier-free guidance. Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art methods by balancing both text-motion consistency and motion quality, especially for more diverse motion generation.
Esteban Marquer, Miguel Couceiro
Analogical inference is a remarkable capability of human reasoning, and has been used to solve hard reasoning tasks. Analogy based reasoning (AR) has gained increasing interest from the artificial intelligence community and has shown its potential in multiple machine learning tasks such as classification, decision making and recommendation with competitive results. We propose a deep learning (DL) framework to tackle two key tasks in AR: analogy detection and solving. The framework is thoroughly tested on the Siganalogies dataset of morphological analogical proportions (APs) between words, and shown to outperform symbolic approaches in many languages. Previous work has explored the behavior of the Analogy Neural Network for classification (ANNc) on analogy detection and of the Analogy Neural Network for retrieval (ANNr) on analogy solving by retrieval, as well as the potential of an autoencoder (AE) for analogy solving by generating the solution word. In this article we summarize these findings and we extend them by combining ANNr and the AE embedding model, and checking the performance of ANNc as a retrieval method. The combination of ANNr and AE outperforms the other approaches in almost all cases, and ANNc as a retrieval method achieves competitive or better performance than 3CosMul. We conclude with general guidelines on using our framework to tackle APs with DL.
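For reference, the 3CosMul baseline mentioned above solves an analogy a : b :: c : ? by retrieving the vocabulary item d that maximizes cos(d, b) * cos(d, c) / (cos(d, a) + eps), with cosines shifted to [0, 1]. The sketch below uses a hypothetical toy vocabulary with hand-written vectors. It is not the embedding model from the paper.

```python
import math


def cos01(u, v):
    """Cosine similarity shifted from [-1, 1] to [0, 1], as 3CosMul requires."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return (dot / norm + 1.0) / 2.0


def three_cos_mul(a, b, c, vocab, eps=1e-3):
    """Return the word d that best completes a : b :: c : d, excluding a, b and c."""
    best_word, best_score = None, float("-inf")
    for word, vec in vocab.items():
        if word in (a, b, c):
            continue
        score = cos01(vec, vocab[b]) * cos01(vec, vocab[c]) / (cos01(vec, vocab[a]) + eps)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical embeddings with dimensions (royal, male, female).
vocab = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [0.5, 1.0, 0.0],
}
print(three_cos_mul("man", "king", "woman", vocab))
```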