Keshav Santhanam, Jon Saad-Falcon, Martin Franz, Omar Khattab, Avirup Sil, Radu Florian, Md Arafat Sultan, Salim Roukos et al.
Neural information retrieval (IR) systems have progressed rapidly in recent years, in large part due to the release of publicly available benchmarking tasks. Unfortunately, some dimensions of this progress are illusory: the majority of the popular IR benchmarks today focus exclusively on downstream task accuracy and thus conceal the costs incurred by systems that trade away efficiency for quality. Latency, hardware cost, and other efficiency considerations are paramount to the deployment of IR systems in user-facing settings. We propose that IR benchmarks structure their evaluation methodology to include not only metrics of accuracy, but also efficiency considerations such as query latency and the corresponding cost budget for a reproducible hardware setting. For the popular IR benchmarks MS MARCO and XOR-TyDi, we show how the best choice of IR system varies according to how these efficiency considerations are chosen and weighed. We hope that future benchmarks will adopt these guidelines toward more holistic IR evaluation.
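The idea that the best system depends on the chosen efficiency budget can be illustrated with a minimal sketch. The system names, field names, and all numbers below are hypothetical placeholders, not measurements from the paper, and the selection rule (most accurate system that fits both budgets) is only one possible way to weigh accuracy against efficiency.

```python
from typing import Dict, List, Optional


def best_under_budget(results: List[Dict],
                      max_latency_ms: float,
                      max_cost_per_hour: float) -> Optional[Dict]:
    """Return the most accurate system that satisfies both budgets, or None."""
    feasible = [r for r in results
                if r["latency_ms"] <= max_latency_ms
                and r["cost_per_hour"] <= max_cost_per_hour]
    if not feasible:
        return None
    return max(feasible, key=lambda r: r["accuracy"])


# Placeholder numbers for illustration only; they are not measurements.
results = [
    {"name": "sparse_baseline", "accuracy": 0.20, "latency_ms": 10.0, "cost_per_hour": 0.5},
    {"name": "dense_small", "accuracy": 0.30, "latency_ms": 40.0, "cost_per_hour": 1.0},
    {"name": "dense_large_gpu", "accuracy": 0.40, "latency_ms": 200.0, "cost_per_hour": 4.0},
]

# A tight budget and a loose budget select different "best" systems.
tight = best_under_budget(results, max_latency_ms=50.0, max_cost_per_hour=1.0)
loose = best_under_budget(results, max_latency_ms=500.0, max_cost_per_hour=5.0)
print(tight["name"], loose["name"])
```

Reporting accuracy together with the budget under which it was obtained makes this dependence visible instead of hiding it behind a single leaderboard number.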
Fangxun Shu, Biaolong Chen, Yue Liao, Shuwen Xiao, Wenyu Sun, Xiaobo Li, Yousong Zhu, Jinqiao Wang et al.
We present a simple yet effective end-to-end Video-language Pre-training
(VidLP) framework, Masked Contrastive Video-language Pretraining (MAC), for
video-text retrieval tasks. Our MAC aims to reduce video representation's
spatial and temporal redundancy in the VidLP model by a mask sampling mechanism
to improve pre-training efficiency. In contrast to conventional temporal sparse
sampling, we propose to randomly mask a high ratio of spatial regions and feed
only the visible regions into the encoder as sparse spatial sampling. Similarly, we
adopt the mask sampling technique for text inputs for consistency. Instead of
blindly applying the mask-then-prediction paradigm from MAE, we propose a
masked-then-alignment paradigm for efficient video-text alignment. The
motivation is that video-text retrieval tasks rely on high-level alignment
rather than low-level reconstruction, and multimodal alignment with masked
modeling encourages the model to learn a robust and general multimodal
representation from incomplete and unstable inputs. Coupling these designs
enables efficient end-to-end pre-training: it reduces FLOPs by 60%, accelerates
pre-training by 3x, and improves performance. Our MAC achieves
state-of-the-art results on various video-text retrieval datasets, including
MSR-VTT, DiDeMo, and ActivityNet. Our approach is omnivorous to input
modalities. With minimal modifications, we achieve competitive results on
image-text retrieval tasks.
Authors' comments: Technical Report
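The sparse spatial sampling step described in the MAC abstract can be sketched as follows. This covers only the random selection of visible patches, not the encoder or the alignment objective. The function name, the patch count of 196, and the mask ratio of 0.75 are assumptions chosen for illustration; the abstract does not state the ratio used.

```python
import random
from typing import List


def sample_visible_patches(num_patches: int, mask_ratio: float,
                           rng: random.Random) -> List[int]:
    """Return the sorted indices of the patches that stay visible."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    # Keep at least one patch so the encoder always receives some input.
    num_visible = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_visible))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
visible = sample_visible_patches(num_patches=196, mask_ratio=0.75, rng=rng)
print(len(visible))  # only these patches would be fed to the encoder
```

Because the encoder sees only the visible subset, its cost scales with the number of kept patches rather than with the full frame.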
Lukas Herron, Pablo Sartori, BingKan Xue
Many biological systems dynamically rearrange their components through a sequence of configurations in order to perform their functions. Such dynamic processes have been studied using network models that sequentially retrieve a set of stored patterns. Previous models of sequential retrieval belong to a general class in which the components of the system are controlled by a feedback ("input modulation"). In contrast, we introduce a new class of models in which the feedback modifies the interactions among the components ("interaction modulation"). We show that interaction modulation models are not only capable of retrieving dynamic sequences, but they do so more robustly than input modulation models. In particular, we find that modulation of symmetric interactions allows retrieval of patterns with different activity levels and has a much larger dynamic capacity. Our results suggest that interaction modulation may be a common principle underlying biological systems that show complex collective dynamics.
Dongwon Kim, Namyup Kim, Suha Kwak
Cross-modal retrieval across image and text modalities is a challenging task
due to its inherent ambiguity: An image often exhibits various situations, and
a caption can be coupled with diverse images. Set-based embedding has been
studied as a solution to this problem. It seeks to encode a sample into a set
of different embedding vectors that capture different semantics of the sample.
In this paper, we present a novel set-based embedding method, which is distinct
from previous work in two aspects. First, we present a new similarity function
called smooth-Chamfer similarity, which is designed to alleviate the side
effects of existing similarity functions for set-based embedding. Second, we
propose a novel set prediction module to produce a set of embedding vectors
that effectively captures diverse semantics of input by the slot attention
mechanism. Our method is evaluated on the COCO and Flickr30K datasets across
different visual backbones, where it outperforms existing methods including
ones that demand substantially larger computation at inference.
Authors' comments: Accepted to CVPR 2023 (Highlight)
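To make the idea of a smoothed Chamfer similarity between two embedding sets concrete, here is a minimal sketch. It assumes that the smoothing replaces each hard maximum of the Chamfer similarity with a scaled log-sum-exp and that cosine similarity is the base similarity; the exact definition and the scaling parameter (called `alpha` here) should be checked against the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(x: Vector, y: Vector) -> float:
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def logsumexp(values: List[float]) -> float:
    # Subtract the maximum first for numerical stability.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def chamfer_similarity(s1: List[Vector], s2: List[Vector]) -> float:
    """Hard Chamfer similarity: mean best match, averaged over both directions."""
    a = sum(max(cosine(x, y) for y in s2) for x in s1) / len(s1)
    b = sum(max(cosine(x, y) for x in s1) for y in s2) / len(s2)
    return 0.5 * (a + b)


def smooth_chamfer_similarity(s1: List[Vector], s2: List[Vector],
                              alpha: float = 16.0) -> float:
    """Same structure, with each max replaced by log-sum-exp divided by alpha."""
    a = sum(logsumexp([alpha * cosine(x, y) for y in s2]) for x in s1) / (alpha * len(s1))
    b = sum(logsumexp([alpha * cosine(x, y) for x in s1]) for y in s2) / (alpha * len(s2))
    return 0.5 * (a + b)


# Toy embedding sets (hand-written, not model outputs).
set_a = [[1.0, 0.0], [0.0, 1.0]]
set_b = [[1.0, 1.0], [1.0, 0.0]]
hard = chamfer_similarity(set_a, set_b)
smooth = smooth_chamfer_similarity(set_a, set_b, alpha=16.0)
print(hard, smooth)
```

Under this assumed form, every pair of elements contributes to the score (and therefore to the gradient), while a large `alpha` recovers the hard Chamfer similarity.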
Poojitha Nandigam, Nikhil Rayaprolu, Manish Shrivastava
Questions posed to open-domain question answering systems are often
ambiguous. Traditional QA systems that provide a single answer cannot handle
ambiguous questions, since a question may be interpreted in several
ways and may have multiple distinct answers. In this paper, we address
multi-answer retrieval, which entails retrieving passages that capture
the majority of the diverse answers to the question. We propose a re-ranking
approach based on determinantal point processes (DPPs) that use BERT as kernels. Our
method jointly considers query-passage relevance and passage-passage
correlation to retrieve passages that are both query-relevant and diverse.
Results demonstrate that our re-ranking technique outperforms the state-of-the-art
method on the AmbigQA dataset.
Authors' comments: Published as a conference paper at COLING 2022
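The interplay of relevance and diversity in a DPP can be sketched with greedy selection over a kernel of the form L[i][j] = q_i * S[i][j] * q_j, where q_i is a query-passage relevance score and S is a passage-passage similarity matrix. This is a common textbook construction, not necessarily the paper's exact procedure: the greedy inference, the function names, and the hand-written scores (standing in for BERT-derived values) are all assumptions.

```python
from typing import List


def determinant(matrix: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def greedy_dpp_rerank(relevance: List[float],
                      similarity: List[List[float]], k: int) -> List[int]:
    """Greedily add the passage that maximizes det of the selected sub-kernel."""
    n = len(relevance)
    kernel = [[relevance[i] * similarity[i][j] * relevance[j] for j in range(n)]
              for i in range(n)]
    selected: List[int] = []
    candidates = set(range(n))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            idx = selected + [i]
            return determinant([[kernel[r][c] for c in idx] for r in idx])
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Placeholder scores: passages 0 and 1 are near-duplicates, passage 2 is different.
relevance = [0.9, 0.85, 0.6]
similarity = [[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
order = greedy_dpp_rerank(relevance, similarity, k=3)
print(order)
```

In this toy case the less relevant but dissimilar passage 2 is ranked ahead of passage 1, because adding a near-duplicate of an already selected passage barely increases the determinant.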
Wayne Xin Zhao, Jing Liu, Ruiyang Ren, Ji-Rong Wen
Text retrieval is a long-standing research topic in information seeking, where a system is required to return relevant information resources in response to users' natural language queries. From classic retrieval methods to learning-based ranking functions, the underlying retrieval models have continually evolved alongside ongoing technical innovation. To design effective retrieval models, a key point lies in how to learn the text representation and model the relevance matching. The recent success of pretrained language models (PLMs) sheds light on developing more capable text retrieval approaches by leveraging the excellent modeling capacity of PLMs. With powerful PLMs, we can effectively learn the representations of queries and texts in the latent representation space, and further construct the semantic matching function between the dense vectors for relevance modeling. Such a retrieval approach is referred to as dense retrieval, since it employs dense vectors (a.k.a. embeddings) to represent the texts. Considering the rapid progress on dense retrieval, in this survey, we systematically review the recent advances on PLM-based dense retrieval. Unlike previous surveys on dense retrieval, we take a new perspective and organize the related work by four major aspects (architecture, training, indexing, and integration), summarizing the mainstream techniques for each aspect. We thoroughly survey the literature and include 300+ related reference papers on dense retrieval. To support our survey, we create a website providing useful resources, and release a code repository and toolkit for implementing dense retrieval models. This survey aims to provide a comprehensive, practical reference focused on the major progress in dense text retrieval.
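The matching step of dense retrieval described above can be sketched in a few lines. The vectors below are hand-written stand-ins for PLM-produced embeddings, the dot product is assumed as the matching function, and the exhaustive scan stands in for the approximate nearest-neighbor index a real system would use.

```python
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def dense_search(query_vec: Sequence[float],
                 doc_vecs: Dict[str, Sequence[float]],
                 k: int) -> List[Tuple[str, float]]:
    """Score every document by dot product and return the top-k (id, score) pairs."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings for illustration; a real system would obtain them from an encoder.
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.8, 0.1],
    "d3": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.0]
top = dense_search(query_vec, doc_vecs, k=2)
print([doc_id for doc_id, _ in top])
```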
Marco Leonetti, Luca Leuzzi, Giancarlo Ruocco
Measuring the optical Transmission Matrix (TM) gives access to
"open channels": input patterns, specific to each scattering structure, capable
of delivering very high transmission. Various approaches, based either on multiple
interferometric measurements or on systematic random testing of incident
wavefronts, make it possible to estimate the inputs required to excite these open
channels. Here, we provide for the first time an approach enabling the complete
and reference-less retrieval of the open channels. It is based on the full
mapping of all the pairwise interference terms resulting from all pairs of
input modes. We show that these interference terms are organized into a bi-dyadic
coupling matrix whose eigenvalues give access to the open channel. A
disordered optical system thus behaves exactly like a Hopfield neural
network, where a specific input vector (an eigenvalue of the neurons' coupling
matrix) makes it possible to retrieve a specific memory pattern. The proposed
Hopfield-like open-channel-retrieval approach reaches almost 100$\%$ of the
theoretically expected value of the intensity. Moreover, employing a digital
micromirror device to modulate light, we demonstrate high-speed laser scanning
at the back of a disordered medium.
Authors' comments: 6 pages, 5 figures
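For readers unfamiliar with the analogy, the following is a minimal sketch of a classical Hopfield network: a pattern is stored in a coupling matrix and later retrieved from a corrupted input. It illustrates only the textbook neural-network side of the comparison, not the optical method; the Hebbian storage rule, synchronous updates, and the example pattern are assumptions chosen for simplicity.

```python
from typing import List


def train_hopfield(patterns: List[List[int]]) -> List[List[float]]:
    """Hebbian coupling matrix for +1/-1 patterns, with zero self-coupling."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights


def recall(weights: List[List[float]], state: List[int],
           max_steps: int = 10) -> List[int]:
    """Apply synchronous sign updates until the state stops changing."""
    current = list(state)
    for _ in range(max_steps):
        updated = []
        for i in range(len(current)):
            field = sum(weights[i][j] * current[j] for j in range(len(current)))
            updated.append(1 if field >= 0 else -1)
        if updated == current:
            break
        current = updated
    return current


stored = [1, -1, 1, -1, 1, 1, -1, -1]
weights = train_hopfield([stored])

corrupted = list(stored)
corrupted[2] = -corrupted[2]  # flip one element of the input
retrieved = recall(weights, corrupted)
print(retrieved == stored)
```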
Yunyan Xing, Benjamin J. Meyer, Mehrtash Harandi, Tom Drummond, Zongyuan Ge
Content-based medical image retrieval is an important diagnostic tool that improves the explainability of computer-aided diagnosis systems and provides decision-making support to healthcare professionals. Medical imaging data, such as radiology images, often exhibit multimorbidity; a single sample may have more than one pathology present. As such, image retrieval systems for the medical domain must be designed for the multi-label scenario. In this paper, we propose a novel multi-label metric learning method that can be used for both classification and content-based image retrieval. In this way, our model is able to support diagnosis by predicting the presence of diseases and provide evidence for these predictions by returning samples with similar pathological content to the user. In practice, the retrieved images may also be accompanied by pathology reports, further assisting in the diagnostic process. Our method leverages proxy feature vectors, enabling the efficient learning of a robust feature space in which the distance between feature vectors can be used as a measure of the similarity of those samples. Unlike in existing proxy-based methods, training samples can be assigned to multiple proxies that span multiple class labels. This multi-label proxy assignment results in a feature space that encodes the complex relationships between diseases present in medical imaging data. Our method outperforms state-of-the-art image retrieval systems and a set of baseline approaches. We demonstrate the efficacy of our approach for both classification and content-based image retrieval on two multimorbidity radiology datasets.
Ron Ziv, Anatoly Patsyk, Yaakov Lumer, Yoav Sagi, Yonina C. Eldar, Mordechai Segev
We propose and demonstrate numerically a measurement scheme for complete reconstruction of the quantum wavefunctions of Bose-Einstein condensates, amplitude and phase, from a time of flight measurement. We identify a fundamental ambiguity present in the measurement of vortices and show how to overcome it by augmenting the measurement to allow reconstruction of matter-wave vortices and arrays of vortices.
Jizhe Cui, Haozhi Sha, Wenfeng Yang, Rong Yu
Atomic-scale characterization of spin textures in solids is essential for
understanding and tuning properties of magnetic materials and devices. While
high-energy electrons are employed for atomic-scale imaging of materials, they
are insensitive to the spin textures. In general, the magnetic contribution to
the phase of high-energy electron wave is 1000 times weaker than the
electrostatic potential. Via accurate phase retrieval through electron
ptychography, here we show that the magnetic phase can be separated from the
electrostatic one, opening the door to atomic-resolution characterization of
spin textures in magnetic materials and spintronic devices.
Authors' comments: 20 pages, 9 figures
Michelle K Croughan, Ying Ying How, Allan Pennings, Kaye S Morgan
Directional dark-field imaging is an emerging x-ray modality that is sensitive to unresolved anisotropic scattering from sub-pixel sample microstructures. A single-grid imaging set-up can be used to capture dark-field images by looking at changes in a grid pattern projected upon the sample. By creating analytical models for the experiment, we have developed a single-grid directional dark field retrieval algorithm that can extract dark-field parameters such as the dominant scattering direction, and the semi-major and -minor scattering angles. We show that this method is effective even in the presence of high image noise, allowing for low dose and time sequence imaging.
Junying Chen, Qingcai Chen, Dongfang Li, Yutao Huang
Recently, Dense Retrieval (DR) has become a promising solution to document retrieval, where document representations are used to perform effective and efficient semantic search. However, DR remains challenging on long documents, due to the quadratic complexity of its Transformer-based encoder and the finite capacity of a low-dimensional embedding. Current DR models apply suboptimal strategies such as truncation or splitting-and-pooling to long documents, leading to poor utilization of whole-document information. In this work, to tackle this problem, we propose Segment representation learning for long documents Dense Retrieval (SeDR). In SeDR, a Segment-Interaction Transformer is proposed to encode long documents into document-aware and segment-sensitive representations, while retaining the complexity of splitting-and-pooling and outperforming other segment-interaction patterns on DR. Since the GPU memory requirements of long document encoding cause insufficient negatives for DR training, Late-Cache Negative is further proposed to provide additional cache negatives for optimizing representation learning. Experiments on MS MARCO and TREC-DL datasets show that SeDR achieves superior performance among DR models, and confirm the effectiveness of SeDR on long document retrieval.
James Thorne
Document retrieval is a core component of many knowledge-intensive natural
language processing task formulations such as fact verification and question
answering. Sources of textual knowledge, such as Wikipedia articles, condition
the generation of answers from the models. Recent advances in retrieval use
sequence-to-sequence models to incrementally predict the title of the
appropriate Wikipedia page given a query. However, this method requires
supervision in the form of human annotation to label which Wikipedia pages
contain appropriate context. This paper introduces a distant-supervision method
that does not require any annotation to train autoregressive retrievers that
attain competitive R-Precision and Recall in a zero-shot setting. Furthermore,
we show that with task-specific supervised fine-tuning, autoregressive
retrieval performance for two Wikipedia-based fact verification tasks can
approach or even exceed full supervision using less than $1/4$ of the annotated
data, indicating possible directions for data-efficient autoregressive
retrieval.
Authors' comments: To appear at SustaiNLP@EMNLP 2022. Code is available:
https://github.com/j6mes/sustainlp2022-deardr
Xin Gu, Yinghua Shen, Chaohui Lv
We propose a method to recommend background music for videos. Current work
rarely considers the emotional information of music, which is essential for
video music retrieval. To address this, we design two paths to process content
information and emotional information across modalities. Based on characteristics
of video and music, we design various feature extraction schemes and common
representation spaces. More importantly, we propose a way to combine content
information with emotional information. Additionally, we make improvements to
the classical metric loss to be more suited to this task. Experiments show that
this dual-path video music retrieval network can effectively merge information.
Compared with existing methods, it improves the retrieval evaluation metrics, increasing
Recall@1 by 3.94 and Recall@25 by 16.36.
Authors' comments: 5 pages, 3 figures
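As a reminder of how a Recall@k figure of this kind is typically computed, here is a minimal sketch. It assumes one ground-truth music track per video and reports the percentage of videos whose ground-truth track appears in the top k results; the paper's exact protocol may differ, and all identifiers below are hypothetical.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]],
                ground_truth: Dict[str, str], k: int) -> float:
    """Percentage of queries whose ground-truth item is ranked in the top k."""
    if not rankings:
        raise ValueError("rankings must not be empty")
    hits = sum(1 for query_id, ranked in rankings.items()
               if ground_truth[query_id] in ranked[:k])
    return 100.0 * hits / len(rankings)


# Hypothetical ranked music lists for two query videos.
rankings = {
    "video_1": ["music_3", "music_1", "music_2"],
    "video_2": ["music_2", "music_5", "music_4"],
}
ground_truth = {"video_1": "music_1", "video_2": "music_2"}

r1 = recall_at_k(rankings, ground_truth, k=1)
r2 = recall_at_k(rankings, ground_truth, k=2)
print(r1, r2)
```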
Kyungsu Kim, Minju Park, Haesun Joung, Yunkee Chae, Yeongbeom Hong, Seonghyeon Go, Kyogu Lee
As digital music production has become mainstream, the selection of
appropriate virtual instruments plays a crucial role in determining the quality
of music. To find the musical instrument samples or virtual instruments that
produce a desired sound, music producers listen to and compare
each instrument sample in their collection, which is time-consuming and
inefficient. In this paper, we call this task Musical Instrument Retrieval
and propose a method for retrieving desired musical instruments using a reference
music mixture as a query. The proposed model consists of the Single-Instrument
Encoder and the Multi-Instrument Encoder, both based on convolutional neural
networks. The Single-Instrument Encoder is trained to classify the instruments
used in single-track audio, and we take its penultimate layer's activation as
the instrument embedding. The Multi-Instrument Encoder is trained to estimate
multiple instrument embeddings using the instrument embeddings computed by the
Single-Instrument Encoder as a set of target embeddings. For more generalized
training and realistic evaluation, we also propose a new dataset called Nlakh.
Experimental results showed that the Single-Instrument Encoder was able to
learn the mapping from the audio signal of unseen instruments to the instrument
embedding space and the Multi-Instrument Encoder was able to extract multiple
embeddings from the mixture of music and retrieve the desired instruments
successfully. The code used for the experiment and audio samples are available
at: https://github.com/minju0821/musical_instrument_retrieval
Authors' comments: 5 pages, 4 figures, submitted to ICASSP 2023
Deunsol Jung, Dahyun Kang, Suha Kwak, Minsu Cho
Metric learning aims to build a distance metric typically by learning an
effective embedding function that maps similar objects into nearby points in
its embedding space. Despite recent advances in deep metric learning, it
remains challenging for the learned metric to generalize to unseen classes with
a substantial domain gap. To tackle the issue, we explore a new problem of
few-shot metric learning that aims to adapt the embedding function to the
target domain with only a few annotated examples. We introduce three few-shot
metric learning baselines and propose the Channel-Rectifier Meta-Learning
(CRML), which effectively adapts the metric space online by adjusting channels
of intermediate layers. Experimental analyses on miniImageNet, CUB-200-2011,
MPII, as well as a new dataset, miniDeepFashion, demonstrate that our method
consistently improves the learned metric by adapting it to target classes and
achieves a greater gain in image retrieval when the domain gap from the source
classes is larger.
Authors' comments: Accepted at ACCV 2022
You Zuo, Yixuan Li, Alma Parias García, Kim Gerdes
This paper presents an automatic approach to creating taxonomies of technical
terms based on the Cooperative Patent Classification (CPC). The resulting
taxonomy contains about 170k nodes in 9 separate technological branches and is
freely available. We also show that a Text-to-Text Transfer Transformer (T5)
model can be fine-tuned to generate hypernyms and hyponyms with relatively high
precision, confirming the manually assessed quality of the resource. The T5
model opens the taxonomy to any new technological terms for which a hypernym
can be generated, thus making the resource updateable with new terms, an
essential feature for the constantly evolving field of technological
terminology.
Authors' comments: ToTh 2022 - Terminology & Ontology: Theories and applications, Jun 2022, Chambéry, France
Xinya Du, Heng Ji
Event argument extraction has long been studied as a sequential prediction
problem with extractive-based methods, tackling each argument in isolation.
Although recent work proposes generation-based methods to capture
cross-argument dependency, they require generating and post-processing a
complicated target sequence (template). Motivated by these observations and
by the ability of recent pretrained language models to learn from
demonstrations, we propose a retrieval-augmented generative QA model (R-GQA)
for event argument extraction. It retrieves the most similar QA pair and
adds it as a prompt to the current example's context, then decodes the
arguments as answers. Our approach substantially outperforms prior methods
across various settings (i.e., fully supervised, domain transfer, and few-shot
learning). Finally, we propose a clustering-based sampling strategy (JointEnc)
and conduct a thorough analysis of how different strategies influence the
few-shot learning performance. The implementations are available at
https://github.com/xinyadu/RGQA
Authors' comments: Accepted by EMNLP 2022 (18 pages)
Huibin Chang, Li Yang, Stefano Marchesini
In nanoscale imaging techniques and ultrafast lasers, the reconstruction procedure is normally formulated as a blind phase retrieval (BPR) problem, where one has to recover both the sample and the probe (pupil) jointly from phaseless data. This survey first presents the mathematical formulation of BPR and the related nonlinear optimization problems, and then gives a brief review of recent iterative algorithms. The review covers three types of algorithms: operator-splitting-based first-order optimization methods, second-order algorithms that use the Hessian, and subspace methods. Future research directions for experimental issues and theoretical analysis are also discussed.
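As an illustration of what a BPR formulation can look like, consider the ptychographic setting, which is one common instance. The notation below is an assumption and is not taken from the survey: $\omega$ is the probe, $u$ the sample, $\mathcal{S}_j$ the operator that extracts the $j$-th scanned window of the sample, $\mathcal{F}$ the discrete Fourier transform, $\circ$ the pointwise product, and $f_j$ the $j$-th measured intensity.

```latex
\text{find } (\omega, u) \quad \text{such that} \quad
\bigl| \mathcal{F}\bigl( \omega \circ \mathcal{S}_j u \bigr) \bigr|^2 = f_j,
\qquad j = 0, 1, \dots, J-1 .

% One possible least-squares relaxation of the same problem:
\min_{\omega,\, u} \; \frac{1}{2} \sum_{j=0}^{J-1}
\Bigl\| \, \bigl| \mathcal{F}\bigl( \omega \circ \mathcal{S}_j u \bigr) \bigr| - \sqrt{f_j} \, \Bigr\|_2^2 .
```

The problem is "blind" because $\omega$ and $u$ are both unknown, and it is nonconvex because the modulus discards the phase of the Fourier data.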
Jurek Leonhardt, Marcel Jahnke, Avishek Anand
Dual-encoder-based neural retrieval models achieve appreciable performance and complement traditional lexical retrievers well due to their semantic matching capabilities, which makes them a common choice for hybrid IR systems. However, these models exhibit a performance bottleneck in the online query encoding step, as the corresponding query encoders are usually large and complex Transformer models. In this paper we investigate heterogeneous dual-encoder models, where the two encoders are separate models that do not share parameters or initializations. We empirically show that heterogeneous dual-encoders are susceptible to collapsing representations, causing them to output constant trivial representations when they are fine-tuned using a standard contrastive loss due to a distribution mismatch. We propose DAFT, a simple two-stage fine-tuning approach that aligns the two encoders in order to prevent them from collapsing. We further demonstrate how DAFT can be used to train efficient heterogeneous dual-encoder models using lightweight query encoders.