Keming Lu, I-Hung Hsu, Wenxuan Zhou, Mingyu Derek Ma, Muhao Chen
Relation Extraction (RE) has been extended to cross-document scenarios
because many relations are not simply described in a single document. This
inevitably brings the challenge of efficient open-space evidence retrieval to
support the inference of cross-document relations, along with the challenge of
multi-hop reasoning on top of entities and evidence scattered in an open set of
documents. To combat these challenges, we propose MR.COD (Multi-hop evidence
retrieval for Cross-document relation extraction), which is a multi-hop
evidence retrieval method based on evidence path mining and ranking. We explore
multiple variants of retrievers to show evidence retrieval is essential in
cross-document RE. We also propose a contextual dense retriever for this
setting. Experiments on CodRED show that evidence retrieval with MR.COD
effectively acquires cross-document evidence and boosts end-to-end RE
performance in both closed and open settings.
Authors' comments: ACL 2023 (Findings)
Chang Liu, Chongyang Tao, Xiubo Geng, Tao Shen, Dongyan Zhao, Can Xu, Binxing Jiao, Daxin Jiang
To improve the performance of the dual-encoder retriever, one effective
approach is knowledge distillation from the cross-encoder ranker. Existing
works construct the candidate passages following the supervised learning
setting where a query is paired with a positive passage and a batch of
negatives. However, through empirical observation, we find that even the hard
negatives from advanced methods are still too trivial for the teacher to
distinguish, preventing the teacher from transferring abundant dark knowledge
to the student through its soft label. To alleviate this issue, we propose
ADAM, a knowledge distillation framework that can better transfer the dark
knowledge held in the teacher with Adaptive Dark exAMples. Different from
previous works that only rely on one positive and hard negatives as candidate
passages, we create dark examples that all have moderate relevance to the query
through mixing-up and masking in discrete space. Furthermore, as the quality of
knowledge held in different training instances varies as measured by the
teacher's confidence score, we propose a self-paced distillation strategy that
adaptively concentrates on a subset of high-quality instances to conduct our
dark-example-based knowledge distillation to help the student learn better. We
conduct experiments on two widely-used benchmarks and verify the effectiveness
of our method.
Authors' comments: 9 pages, 2 figures
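The soft-label argument in the abstract above can be illustrated with a small sketch. It is not the ADAM implementation: the function names, the temperature parameter, and the example scores are all assumptions made for illustration. It shows a KL-divergence distillation loss over one query's candidate passages, and how teacher scores on trivial negatives collapse to a nearly one-hot distribution that carries little extra signal.

```python
import math

def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over one query's candidate scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def entropy(scores, temperature=1.0):
    """Entropy of the soft label; low entropy means a nearly one-hot label."""
    log_p = log_softmax(scores, temperature)
    return -sum(math.exp(lp) * lp for lp in log_p)

def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) between the two candidate distributions."""
    log_p = log_softmax(teacher_scores, temperature)
    log_q = log_softmax(student_scores, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Hypothetical teacher scores: one positive followed by two negatives.
teacher_trivial = [10.0, -10.0, -10.0]  # negatives are easy to reject
teacher_dark = [2.0, 1.5, 1.0]          # candidates of moderate relevance

# Hypothetical student scores for the same three candidates.
student = [0.5, 1.0, 0.0]
loss = distillation_loss(teacher_dark, student)
```

Comparing `entropy(teacher_trivial)` with `entropy(teacher_dark)` shows that the second soft label spreads probability over several candidates, which is the kind of signal the abstract calls dark knowledge.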
Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
While dense retrieval has been shown to be effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages (e.g. sw, ko, ja).
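The two-step procedure described in this abstract (generate a hypothetical document, then retrieve real documents near its embedding) can be sketched as follows. This is a toy illustration under loud assumptions: `fake_generator` is a canned stand-in for an instruction-following language model, `embed` is a bag-of-words stand-in for a dense encoder such as Contriever, and the corpus is invented. None of these names come from the paper.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a dense encoder: a sparse bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hyde_retrieve(query, generate, corpus, k=1):
    """Embed a generated hypothetical document instead of the query itself."""
    hypothetical = generate(query)   # step 1: may contain false details
    vector = embed(hypothetical)     # step 2: encode the generated text
    ranked = sorted(corpus, key=lambda d: cosine(vector, embed(d)), reverse=True)
    return ranked[:k]                # real documents nearest to that vector

def fake_generator(query):
    # Hypothetical canned output; note the deliberately wrong city name.
    return "the capital of france is lyon a large city on the seine"

corpus = [
    "paris is the capital city of france",
    "bananas are rich in potassium",
    "the seine flows through paris",
]

top = hyde_retrieve("what is the capital of france?", fake_generator, corpus, k=1)
```

In this toy run the incorrect detail in the generated text matches nothing in the corpus, so only real documents are returned and the ranking is driven by the terms the hypothetical document shares with them.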
John Wieting, Jonathan H. Clark, William W. Cohen, Graham Neubig, Taylor Berg-Kirkpatrick
Contrastive learning has been successfully used for retrieval of semantically
aligned sentences, but it often requires large batch sizes or careful
engineering to work well. In this paper, we instead propose a generative model
for learning multilingual text embeddings which can be used to retrieve or
score sentence pairs. Our model operates on parallel data in $N$ languages and,
through an approximation we introduce, efficiently encourages source separation
in this multilingual setting, separating semantic information that is shared
between translations from stylistic or language-specific variation. We show
careful large-scale comparisons between contrastive and generation-based
approaches for learning multilingual text embeddings, a comparison that has not
been done to the best of our knowledge despite the popularity of these
approaches. We evaluate this method on a suite of tasks including semantic
similarity, bitext mining, and cross-lingual question retrieval -- the last of
which we introduce in this paper. Overall, our Variational Multilingual
Source-Separation Transformer (VMSST) model outperforms both a strong
contrastive and generative baseline on these tasks.
Authors' comments: Published as a long paper at ACL 2023
Xing Wu, Guangyuan Ma, Wanhui Qian, Zijia Lin, Songlin Hu
Recently, methods have been developed to improve the performance of dense
passage retrieval by using context-supervised pre-training. These methods
simply consider two passages from the same document to be relevant, without
taking into account the possibility of weakly correlated pairs. Thus, this
paper proposes query-as-context pre-training, a simple yet effective
pre-training technique to alleviate the issue. Query-as-context pre-training
assumes that the query derived from a passage is more likely to be relevant to
that passage and forms a passage-query pair. These passage-query pairs are then
used in contrastive or generative context-supervised pre-training. The
pre-trained models are evaluated on large-scale passage retrieval benchmarks
and out-of-domain zero-shot benchmarks. Experimental results show that
query-as-context pre-training brings considerable gains and meanwhile speeds up
training, demonstrating its effectiveness and efficiency. Our code will be
available at https://github.com/caskcsg/ir/tree/main/cotmae-qc .
Authors' comments: EMNLP 2023 Main Conference
Ercong Nie, Sheng Liang, Helmut Schmid, Hinrich Schütze
Multilingual Pretrained Language Models (MPLMs) have shown their strong
multilinguality in recent empirical cross-lingual transfer studies. In this
paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC)
pipeline to improve the zero-shot performance on low-resource languages (LRLs)
by augmenting the context with semantically similar sentences retrieved from a
high-resource language (HRL) as prompts. PARC improves the zero-shot
performance on three downstream tasks (binary sentiment classification, topic
categorization and natural language inference) with multilingual parallel test
sets across 10 LRLs covering 6 language families in both unlabeled settings
(+5.1%) and labeled settings (+16.3%). PARC-labeled also outperforms the
finetuning baseline by 3.7%. We find a significant positive correlation between
cross-lingual transfer performance on one side, and the similarity between the
high- and low-resource languages as well as the amount of low-resource
pretraining data on the other side. A robustness analysis suggests that PARC
has the potential to achieve even stronger performance with more powerful
MPLMs.
Authors' comments: Accepted to Findings of ACL 2023
Vardaan Pahuja, Boshi Wang, Hugo Latapie, Jayanth Srinivasa, Yu Su
Knowledge graph (KG) link prediction aims to infer new facts based on
existing facts in the KG. Recent studies have shown that using the graph
neighborhood of a node via graph neural networks (GNNs) provides more useful
information compared to just using the query information. Conventional GNNs for
KG link prediction follow the standard message-passing paradigm on the entire
KG, which leads to superfluous computation, over-smoothing of node
representations, and also limits their expressive power. On a large scale, it
becomes computationally expensive to aggregate useful information from the
entire KG for inference. To address the limitations of existing KG link
prediction frameworks, we propose a novel retrieve-and-read framework, which
first retrieves a relevant subgraph context for the query and then jointly
reasons over the context and the query with a high-capacity reader. As part of
our exemplar instantiation for the new framework, we propose a novel
Transformer-based GNN as the reader, which incorporates graph-based attention
structure and cross-attention between query and context for deep fusion. This
simple yet effective design enables the model to focus on salient context
information relevant to the query. Empirical results on two standard KG link
prediction datasets demonstrate the competitive performance of the proposed
method. Furthermore, our analysis yields valuable insights for designing
improved retrievers within the framework.
Authors' comments: Accepted to CIKM'23; Published version DOI:
https://doi.org/10.1145/3583780.3614769; 12 pages, 4 figures
Xingwei He, Yeyun Gong, A-Long Jin, Hang Zhang, Anlei Dong, Jian Jiao, Siu Ming Yiu, Nan Duan
The dual-encoder has become the de facto architecture for dense retrieval.
Typically, it computes the latent representations of the query and document
independently, thus failing to fully capture the interactions between the query
and document. To alleviate this, recent research has focused on obtaining
query-informed document representations. During training, it expands the
document with a real query, but during inference, it replaces the real query
with a generated one. This inconsistency between training and inference causes
the dense retrieval model to prioritize query information while disregarding
the document when computing the document representation. Consequently, it
performs even worse than the vanilla dense retrieval model because its
performance heavily relies on the relevance between the generated queries and
the real query. In this paper, we propose a curriculum sampling strategy that
utilizes pseudo queries during training and progressively enhances the
relevance between the generated query and the real query. By doing so, the
retrieval model learns to extend its attention from the document alone to both
the document and query, resulting in high-quality query-informed document
representations. Experimental results on both in-domain and out-of-domain
datasets demonstrate that our approach outperforms previous dense retrieval
models.
Authors' comments: Accepted to EMNLP 2023
Jie Guo, Meiting Wang, Yan Zhou, Bin Song, Yuhao Chi, Wei Fan, Jianglong Chang
Image-text retrieval (ITR) is a challenging task in the field of multimodal information processing due to the semantic gap between different modalities. In recent years, researchers have made great progress in exploring the accurate alignment between image and text. However, existing works mainly focus on the fine-grained alignment between image regions and sentence fragments, which ignores the guiding significance of context background information. Actually, integrating the local fine-grained information and global context background information can provide more semantic clues for retrieval. In this paper, we propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval. First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively. Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module, which enhances the semantic corresponding relations between the local and global information, and obtains more accurate feature representations for the image and text modalities. Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment. To justify the proposed model, we perform extensive experiments on MS-COCO and Flickr30K datasets. Experimental results show that the proposed HGAN outperforms the state-of-the-art methods on both datasets, which demonstrates the effectiveness and superiority of our model.
Vincent Christlein, Isabelle Marthot-Santaniello, Martin Mayr, Anguelos Nicolaou, Mathias Seuret
The analysis of digitized historical manuscripts is typically addressed by paleographic experts. Writer identification refers to the classification of known writers while writer retrieval seeks to find the writer by means of image similarity in a dataset of images. While automatic writer identification/retrieval methods already provide promising results for many historical document types, papyri data is very challenging due to the fiber structures and severe artifacts. Thus, an important step for an improved writer identification is the preprocessing and feature sampling process. We investigate several methods and show that a good binarization is key to an improved writer identification in papyri writings. We focus mainly on writer retrieval using unsupervised feature methods based on traditional or self-supervised-based methods. It is, however, also comparable to the state of the art supervised deep learning-based method in the case of writer classification/re-identification.
Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Lei Chen
Disentangled representation learning remains challenging as the underlying factors of variation in the data do not naturally exist. The inherent complexity of real-world data makes it unfeasible to exhaustively enumerate and encapsulate all its variations within a finite set of factors. However, it is worth noting that most real-world data have linguistic equivalents, typically in the form of textual descriptions. These linguistic counterparts can represent the data and be effortlessly decomposed into distinct tokens. In light of this, we present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning. Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish dimensions that capture intrinsic characteristics within data through its natural language counterpart, thus facilitating disentanglement. We extensively assess the performance of VDR across 15 retrieval benchmark datasets, covering text-to-text and cross-modal retrieval scenarios, as well as human evaluation. Our experimental results compellingly demonstrate the superiority of VDR over previous bi-encoder retrievers with comparable model size and training costs, achieving an impressive 8.7% improvement in NDCG@10 on the BEIR benchmark, a 5.3% increase on MS COCO, and a 6.0% increase on Flickr30k in terms of mean recall in the zero-shot setting. Moreover, the results from human evaluation indicate that interpretability of our method is on par with SOTA captioning models.
Sourav Saha, Debapriyo Majumdar, Mandar Mitra
Deep Learning and Machine Learning based models have become extremely popular in text processing and information retrieval. However, the non-linear structures present inside the networks make these models largely inscrutable. A significant body of research has focused on increasing the transparency of these models. This article provides a broad overview of research on the explainability and interpretability of natural language processing and information retrieval methods. More specifically, we survey approaches that have been applied to explain word embeddings, sequence modeling, attention modules, transformers, BERT, and document ranking. The concluding section suggests some possible directions for future research on this topic.
Guilherme Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira
Bi-encoders and cross-encoders are widely used in many state-of-the-art
retrieval pipelines. In this work we study the generalization ability of these
two types of architectures across a wide range of parameter counts in both in-domain
and out-of-domain scenarios. We find that the number of parameters and early
query-document interactions of cross-encoders play a significant role in the
generalization ability of retrieval models. Our experiments show that
increasing model size results in marginal gains on in-domain test sets, but
much larger gains in new domains never seen during fine-tuning. Furthermore, we
show that cross-encoders largely outperform bi-encoders of similar size in
several tasks. In the BEIR benchmark, our largest cross-encoder surpasses a
state-of-the-art bi-encoder by more than 4 average points. Finally, we show
that using bi-encoders as first-stage retrievers provides no gains in
comparison to a simpler retriever such as BM25 on out-of-domain tasks. The code
is available at
https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git
Authors' comments: arXiv admin note: substantial text overlap with arXiv:2206.02873
Ha-Thanh Nguyen, Manh-Kien Phi, Xuan-Bach Ngo, Vu Tran, Le-Minh Nguyen, Minh-Phuong Tu
Legal text retrieval serves as a key component in a wide range of legal text
processing tasks such as legal question answering, legal case entailment, and
statute law retrieval. The performance of legal text retrieval depends, to a
large extent, on the representation of text, both query and legal documents.
Based on good representations, a legal text retrieval model can effectively
match the query to its relevant documents. Because legal documents often
contain long articles and only some parts are relevant to queries, it is quite
a challenge for existing models to represent such documents. In this paper, we
study the use of attentive neural network-based text representation for statute
law document retrieval. We propose a general approach using deep neural
networks with attention mechanisms. Based on it, we develop two hierarchical
architectures with sparse attention to represent long sentences and articles,
and we name them Attentive CNN and Paraformer. The methods are evaluated on
datasets of different sizes and characteristics in English, Japanese, and
Vietnamese. Experimental results show that: i) Attentive neural methods
substantially outperform non-neural methods in terms of retrieval performance
across datasets and languages; ii) Pretrained transformer-based models achieve
better accuracy on small datasets at the cost of high computational complexity
while lighter weight Attentive CNN achieves better accuracy on large datasets;
and iii) Our proposed Paraformer outperforms state-of-the-art methods on COLIEE
dataset, achieving the highest recall and F2 scores in the top-N retrieval
task.
Authors' comments: Preprint version. The official version will be published in
Artificial Intelligence and Law journal
Hao Sun, Xiao Liu, Yeyun Gong, Anlei Dong, Jingwen Lu, Yan Zhang, Linjun Yang, Rangan Majumder et al.
Knowledge distillation is often used to transfer knowledge from a strong
teacher model to a relatively weak student model. Traditional methods include
response-based methods and feature-based methods. Response-based methods are
widely used but suffer from lower upper limits of performance due to their
ignorance of intermediate signals, while feature-based methods have constraints
on vocabularies, tokenizers and model architectures. In this paper, we propose
a liberal feature-based distillation method (LEAD). LEAD aligns the
distribution between the intermediate layers of teacher model and student
model, which is effective, extendable, portable and has no requirements on
vocabularies, tokenizers, or model architectures. Extensive experiments show
the effectiveness of LEAD on widely-used benchmarks, including MS MARCO Passage
Ranking, TREC 2019 DL Track, MS MARCO Document Ranking and TREC 2020 DL Track.
Our code is available at https://github.com/microsoft/SimXNS/tree/main/LEAD.
Authors' comments: Accepted by WSDM 2024
Mustafa Shukor, Nicolas Thome, Matthieu Cord
Vision-Language Pretraining (VLP) and Foundation models have been the go-to
recipe for achieving SoTA performance on general benchmarks. However,
leveraging these powerful techniques for more complex vision-language tasks,
such as cooking applications, with more structured input data, is still little
investigated. In this work, we propose to leverage these techniques for
structured-text based computational cuisine tasks. Our strategy, dubbed
VLPCook, first transforms existing image-text pairs to image and
structured-text pairs. This allows us to pretrain our VLPCook model using VLP
objectives adapted to the structured data of the resulting datasets, and then
finetune it on downstream computational cooking tasks. During finetuning, we
also enrich the visual encoder, leveraging pretrained foundation models (e.g.
CLIP) to provide local and global textual context. VLPCook outperforms current
SoTA by a significant margin (+3.3 Recall@1 absolute improvement) on the task
of Cross-Modal Food Retrieval on the large Recipe1M dataset. We conduct further
experiments on VLP to validate their importance, especially on the Recipe1M+
dataset. Finally, we validate the generalization of the approach to other tasks
(i.e., Food Recognition) and domains with structured text such as the Medical
domain on the ROCO dataset. The code is available here:
https://github.com/mshukor/VLPCook
Authors' comments: Code: https://github.com/mshukor/VLPCook
Roope Uola, Erkka Haapasalo, Juha-Pekka Pellonpää, Tom Kuusela
We propose a generalisation of the Leggett-Garg conditions for macrorealistic
behaviour. Our proposal relies on relaxing the postulate of non-invasive
measurability to that of retrievability of information. This leads to a
strictly broader class of hidden variable theories than those having a
macrorealistic description. Crucially, whereas quantum mechanical tests of
macrorealism require one to optimise over all possible state updates, for
retrievability of information it suffices to use the basic Lüders state
update, which is present in every quantum measurement. We show that in qubit
systems the optimal retrieving protocols further relate to the fundamental
precision limit of quantum theory given by Busch-Lahti-Werner error-disturbance
uncertainty relations. We implement an optimal protocol using a photonic
setting, and report an experimental violation of the proposed generalisation of
macrorealism.
Authors' comments: 14 pages, 5 figures
Nadav Torem, Roi Ronen, Yoav Y. Schechner, Michael Elad
In diverse microscopy modalities, sensors measure only real-valued
intensities. Additionally, the sensor readouts are affected by
Poissonian-distributed photon noise. Traditional restoration algorithms
typically aim to minimize the mean squared error (MSE) between the original and
recovered images. This often leads to blurry outcomes with poor perceptual
quality. Recently, deep diffusion models (DDMs) have proven to be highly
capable of sampling images from the a-posteriori probability of the sought
variables, resulting in visually pleasing high-quality images. These models
have mostly been suggested for real-valued images suffering from Gaussian
noise. In this study, we generalize annealed Langevin Dynamics, a type of DDM,
to tackle the fundamental challenges in optical imaging of complex-valued
objects (and real images) affected by Poisson noise. We apply our algorithm to
various optical scenarios, such as Fourier Ptychography, Phase Retrieval, and
Poisson denoising. Our algorithm is evaluated on simulations and biological
empirical data.
Authors' comments: 11 pages, 7 figures
Praveen Venkateswaran, Evelyn Duesterwald, Vatche Isahagian
Dialogue State Tracking (DST), a key component of task-oriented conversation systems, represents user intentions by determining the values of pre-defined slots in an ongoing dialogue. Existing approaches use hand-crafted templates and additional slot information to fine-tune and prompt large pre-trained language models and elicit slot values from the dialogue context. Significant manual effort and domain knowledge are required to design effective prompts, limiting the generalizability of these approaches to new domains and tasks. In this work, we propose DiSTRICT, a generalizable in-context tuning approach for DST that retrieves highly relevant training examples for a given dialogue to fine-tune the model without any hand-crafted templates. Experiments with the MultiWOZ benchmark datasets show that DiSTRICT outperforms existing approaches in various zero-shot and few-shot settings using a much smaller model, thereby providing an important advantage for real-world deployments that often have limited resource availability.
Xinyu Wang, Jiong Cai, Yong Jiang, Pengjun Xie, Kewei Tu, Wei Lu
Multi-modal named entity recognition (NER) and relation extraction (RE) aim
to leverage relevant image information to improve the performance of NER and
RE. Most existing efforts have largely focused on directly extracting potentially
useful information from images (such as pixel-level features, identified
objects, and associated captions). However, such extraction processes may not
be knowledge aware, resulting in information that may not be highly relevant.
In this paper, we propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module,
which retrieve related knowledge of the input text and image in the knowledge
corpus respectively. Next, the retrieval results are sent to the textual and
visual models respectively for predictions. Finally, a Mixture of Experts (MoE)
module combines the predictions from the two models to make the final decision.
Our experiments show that both our textual model and visual model can achieve
state-of-the-art performance on four multi-modal NER datasets and one
multi-modal RE dataset. With MoE, the model performance can be further improved
and our analysis demonstrates the benefits of integrating both textual and
visual cues for such tasks.
Authors' comments: Findings of EMNLP 2022. Code is publicly available at
http://github.com/modelscope/adaseq/examples/MoRe
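The final combination step in the MoRe abstract, where a Mixture of Experts module merges the predictions of the textual and visual models, can be sketched as a weighted mixture of label distributions. This is only an illustration under stated assumptions: the abstract does not describe how the gate is computed, so the fixed `gate_weights`, the function name, and the example probabilities are all hypothetical.

```python
def combine_experts(expert_probs, gate_weights):
    """Weighted mixture of per-expert label distributions.

    expert_probs: list of dicts mapping label -> probability, one per expert.
    gate_weights: non-negative weights, one per expert (assumed given here;
                  a learned gate would normally produce them).
    """
    total = sum(gate_weights)
    weights = [w / total for w in gate_weights]
    labels = expert_probs[0].keys()
    return {
        label: sum(w * probs[label] for w, probs in zip(weights, expert_probs))
        for label in labels
    }

# Hypothetical predictions for one token from the two retrieval-augmented models.
text_probs = {"PER": 0.6, "ORG": 0.3, "O": 0.1}
image_probs = {"PER": 0.2, "ORG": 0.7, "O": 0.1}

mixed = combine_experts([text_probs, image_probs], gate_weights=[0.5, 0.5])
prediction = max(mixed, key=mixed.get)
```

With equal gate weights the mixture averages the two distributions, so a label that one expert favors strongly can win even when the other expert prefers a different label.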