Soham Deshmukh, Benjamin Elizalde, Huaming Wang
Audio-Text retrieval takes a natural language query to retrieve relevant audio files in a database. Conversely, Text-Audio retrieval takes an audio file as a query to retrieve relevant natural language descriptions. Most of the literature trains retrieval systems with a single audio captioning dataset, and the benefit of training with multiple datasets remains underexplored. Moreover, retrieval systems have to learn the alignment between elaborate sentences and the audio content they describe, which varies in length from a few seconds to several minutes. In this work, we propose a new collection of web audio-text pairs and a new framework for retrieval. First, we provide a new collection of about five thousand web audio-text pairs that we refer to as WavText5K. When used to train our retrieval system, WavText5K improved performance more than other audio captioning datasets. Second, our framework learns to connect language and audio content by using a text encoder, two audio encoders, and a contrastive learning objective. Combining both audio encoders helps to process variable-length audio. Together, the two contributions beat state-of-the-art performance for AudioCaps and Clotho on Text-Audio retrieval by a relative 2% and 16%, and on Audio-Text retrieval by 6% and 23%.
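The abstract above names a contrastive learning objective without giving its form. As a rough illustration only, the sketch below shows one common choice, a symmetric cross-entropy over cosine-similarity logits, in NumPy. The function name, the temperature value, and the use of a single audio embedding per clip are assumptions made for this example; how the paper combines its two audio encoders is not modeled.

```python
import numpy as np

def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Row i of audio_emb is assumed to be paired with row i of text_emb.
    The temperature value is an illustrative assumption.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature           # pairwise cosine similarities
    idx = np.arange(len(logits))
    loss_a2t = -_log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> text
    loss_t2a = -_log_softmax(logits, axis=0)[idx, idx].mean()  # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)

# Toy batch: perfectly aligned pairs versus deliberately mismatched pairs.
aligned = symmetric_contrastive_loss(np.eye(4), np.eye(4))
mismatched = symmetric_contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

On this toy batch the aligned pairs receive a much lower loss than the mismatched ones, which is the behaviour a contrastive objective is meant to reward.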
Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong, Jian Jiao, Jingwen Lu et al.
Knowledge distillation is an effective way to transfer knowledge from a
strong teacher to an efficient student model. Ideally, we expect the better the
teacher is, the better the student. However, this expectation does not always
come true. It is common that a better teacher model results in a bad student
via distillation due to the nonnegligible gap between teacher and student. To
bridge the gap, we propose PROD, a PROgressive Distillation method, for dense
retrieval. PROD consists of a teacher progressive distillation and a data
progressive distillation to gradually improve the student. We conduct extensive
experiments on five widely-used benchmarks, MS MARCO Passage, TREC Passage 19,
TREC Document 19, MS MARCO Document and Natural Questions, where PROD achieves
the state-of-the-art within the distillation methods for dense retrieval. The
code and models will be released.
Authors' comments: Accepted by WWW2023
Dawn Lawrie, Eugene Yang, Douglas W. Oard, James Mayfield
Providing access to information across languages has been a goal of
Information Retrieval (IR) for decades. While progress has been made on Cross
Language IR (CLIR) where queries are expressed in one language and documents in
another, the multilingual (MLIR) task to create a single ranked list of
documents across many languages is considerably more challenging. This paper
investigates whether advances in neural document translation and pretrained
multilingual neural language models enable improvements in the state of the art
over earlier MLIR techniques. The results show that although combining neural
document translation with neural ranking yields the best Mean Average Precision
(MAP), 98% of that MAP score can be achieved with an 84% reduction in indexing
time by using a pretrained XLM-R multilingual language model to index documents
in their native language, and that 2% difference in effectiveness is not
statistically significant. Key to achieving these results for MLIR is to
fine-tune XLM-R using mixed-language batches from neural translations of MS
MARCO passages.
Authors' comments: 17 pages, 3 figures, accepted at ECIR 2023
Euna Jung, Jungwon Park, Jaekeol Choi, Sungyoon Kim, Wonjong Rhee
The recent advancement in language representation modeling has broadly
affected the design of dense retrieval models. In particular, many of the
high-performing dense retrieval models evaluate representations of query and
document using BERT, and subsequently apply a cosine-similarity based scoring
to determine the relevance. BERT representations, however, are known to follow
an anisotropic distribution of a narrow cone shape and such an anisotropic
distribution can be undesirable for the cosine-similarity based scoring. In
this work, we first show that BERT-based DR also follows an anisotropic
distribution. To cope with the problem, we introduce unsupervised
post-processing methods, Normalizing Flow and whitening, and develop a
token-wise method in addition to the sequence-wise method for applying the
post-processing methods to the representations of dense retrieval models. We
show that the proposed methods can effectively enhance the representations to
be isotropic, then we perform experiments with ColBERT and RepBERT to show that
the performance (NDCG at 10) of document re-ranking can be improved by
5.17\%$\sim$8.09\% for ColBERT and 6.88\%$\sim$22.81\% for RepBERT. To examine
the potential of isotropic representation for improving the robustness of DR
models, we investigate out-of-distribution tasks where the test dataset differs
from the training dataset. The results show that isotropic representation can
achieve a generally improved performance. For instance, when the training dataset
is MS-MARCO and the test dataset is Robust04, isotropy post-processing can improve
the baseline performance by up to 24.98\%. Furthermore, we show that an
isotropic model trained with an out-of-distribution dataset can even outperform
a baseline model trained with the in-distribution dataset.
Authors' comments: 9 pages, 4 figures
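Whitening, one of the two post-processing methods named in the abstract above, can be sketched in a few lines. The example below is a generic sequence-level whitening transform in NumPy, not the authors' implementation; the function names and the synthetic "embeddings" are assumptions made for illustration.

```python
import numpy as np

def fit_whitening(embeddings):
    """Return (mean, w) such that (x - mean) @ w has identity covariance."""
    mean = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(embeddings - mean, rowvar=False)
    # The covariance is symmetric positive semi-definite, so its SVD
    # coincides with its eigen-decomposition.
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + 1e-12))
    return mean, w

def apply_whitening(embeddings, mean, w):
    return (embeddings - mean) @ w

rng = np.random.default_rng(0)
# Synthetic anisotropic vectors: a random linear mixing plus a shared offset.
raw = rng.normal(size=(2000, 8)) @ rng.normal(size=(8, 8)) + 5.0
mean, w = fit_whitening(raw)
white = apply_whitening(raw, mean, w)
```

After the transform the vectors are centred and their covariance is close to the identity, so cosine similarity is no longer dominated by a shared offset direction.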
Xin Zhang, Yong Jiang, Xiaobin Wang, Xuming Hu, Yueheng Sun, Pengjun Xie, Meishan Zhang
Successful Machine Learning based Named Entity Recognition models could fail
on texts from some special domains, for instance, Chinese addresses and
e-commerce titles, which require adequate background knowledge. Such texts are
also difficult for human annotators. In fact, we can obtain some potentially
helpful information from correlated texts, which have some common entities, to
help the text understanding. Then, one can easily reason out the correct answer
by referencing correlated samples. In this paper, we suggest enhancing NER
models with correlated samples. We draw correlated samples by the sparse BM25
retriever from large-scale in-domain unlabeled data. To explicitly simulate the
human reasoning process, we perform training-free entity type calibration by
majority voting. To capture correlation features in the training stage, we
suggest modeling correlated samples with a transformer-based multi-instance
cross-encoder. Empirical results on datasets of the above two domains show the
efficacy of our methods.
Authors' comments: Accepted by COLING 2022, added dev results of the address data
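The training-free calibration step described above can be pictured as a vote over the types that a base NER model assigns to the same mention in retrieved correlated texts. The sketch below is a hypothetical minimal version: the function name, the data layout, counting the original prediction as one vote, and the tie-breaking rule are all assumptions, not details taken from the paper.

```python
from collections import Counter

def calibrate_entity_type(mention, predicted_type, correlated_predictions):
    """Majority vote over entity types predicted in correlated samples.

    correlated_predictions is a list of (mention_text, entity_type) pairs
    produced by the same model on retrieved texts (assumed layout).
    """
    votes = Counter(
        etype for text, etype in correlated_predictions if text == mention
    )
    votes[predicted_type] += 1  # assumption: the original prediction also votes
    top_type, top_count = votes.most_common(1)[0]
    # Assumption: keep the original prediction when it ties for first place.
    if votes[predicted_type] == top_count:
        return predicted_type
    return top_type

calibrated = calibrate_entity_type(
    "Zhongshan Road",
    "POI",
    [("Zhongshan Road", "ROAD"), ("Zhongshan Road", "ROAD"), ("West Lake", "POI")],
)
```

Here the two correlated samples that tag the mention as `ROAD` outvote the single original `POI` prediction.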
Nima Sadri
Although representational retrieval models based on Transformers have been able to make major advances in the past few years, and despite the widely accepted conventions and best-practices for testing such models, a $\textit{standardized}$ evaluation framework for testing them has not been developed. In this work, we formalize the best practices and conventions followed by researchers in the literature, paving the path for more standardized evaluations - and therefore more fair comparisons between the models. Our framework (1) embeds the documents and queries; (2) for each query-document pair, computes the relevance score based on the dot product of the document and query embedding; (3) uses the $\texttt{dev}$ set of the MSMARCO dataset to evaluate the models; (4) uses the $\texttt{trec_eval}$ script to calculate MRR@100, which is the primary metric used to evaluate the models. Most importantly, we showcase the use of this framework by experimenting on some of the most well-known dense retrieval models.
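Steps (2) and (4) of the framework above can be illustrated with a small NumPy sketch: dot-product scoring followed by a reciprocal-rank metric. This is a toy stand-in, not the framework itself, which relies on `trec_eval`; the function names, the toy embeddings, and the relevance judgments are made up for the example.

```python
import numpy as np

def rank_documents(query_emb, doc_embs, k=100):
    """Score every document by dot product and return the top-k indices."""
    scores = doc_embs @ query_emb
    order = np.argsort(-scores, kind="stable")[:k]
    return order.tolist()

def mrr_at_k(rankings, relevant, k=100):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, ranked_doc_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(rankings)

# Toy 2-dimensional embeddings for three documents and two queries.
doc_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
queries = {"q1": np.array([1.0, 0.0]), "q2": np.array([0.0, 1.0])}
relevant = {"q1": {2}, "q2": {1}}

rankings = {qid: rank_documents(q, doc_embs) for qid, q in queries.items()}
score = mrr_at_k(rankings, relevant)
```

For `q1` the relevant document is ranked second and for `q2` it is ranked first, so the mean reciprocal rank on this toy data is (1/2 + 1) / 2 = 0.75.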
Dahlia Shehata, Negar Arabzadeh, Charles L. A. Clarke
Despite the advantages of their low-resource settings, traditional sparse retrievers depend on exact matching between high-dimensional bag-of-words (BoW) representations of both the queries and the collection. As a result, retrieval performance is restricted by semantic discrepancies and vocabulary gaps. On the other hand, transformer-based dense retrievers introduce significant improvements in information retrieval tasks by exploiting low-dimensional contextualized representations of the corpus. While dense retrievers are known for their relative effectiveness, they suffer from lower efficiency and weaker generalization than sparse retrievers. For a lightweight retrieval task, high computational resource requirements and time consumption are major barriers that discourage the use of dense models despite potential gains. In this work, we propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities in two formats for the entity names: 1) explicit and 2) hashed. We employ a zero-shot end-to-end dense entity linking system for entity recognition and disambiguation to augment the corpus. By leveraging advanced entity linking methods, we believe that the effectiveness gap between sparse and dense retrievers can be narrowed. We conduct our experiments on the MS MARCO passage dataset. Since we are concerned with early-stage retrieval in the cascaded ranking architectures of large information retrieval systems, we evaluate our results using recall@1000. Our approach is also capable of retrieving documents for query subsets judged to be particularly difficult in prior work. We further demonstrate that the non-expanded and the expanded runs with both explicit and hashed entities retrieve complementary results. Consequently, we adopt a run fusion approach to maximize the benefits of entity linking.
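The two expansion formats and the recall@k metric mentioned above can be sketched as follows. The abstract does not say how entity names are hashed, so the use of MD5 here is purely an assumption chosen to turn a multi-word name into a single opaque token; the function names and example strings are likewise hypothetical, and the entity linker itself is not modeled.

```python
import hashlib

def expand_with_entities(text, linked_entities, mode="explicit"):
    """Append linked entity names to a query or document.

    mode="explicit" appends the names as-is; mode="hashed" appends one
    opaque token per entity (MD5 is an illustrative assumption).
    """
    if mode == "explicit":
        extra = list(linked_entities)
    elif mode == "hashed":
        extra = [
            hashlib.md5(name.encode("utf-8")).hexdigest()
            for name in linked_entities
        ]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return " ".join([text, *extra]) if extra else text

def recall_at_k(ranked_doc_ids, relevant_ids, k=1000):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_doc_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

explicit_query = expand_with_entities("who founded apple", ["Apple Inc."])
hashed_query = expand_with_entities("who founded apple", ["Apple Inc."], mode="hashed")
toy_recall = recall_at_k(["d1", "d2", "d3"], {"d2", "d9"}, k=2)
```

The expanded text would then be indexed or queried with an ordinary sparse retriever such as BM25, so the same hashed token matches on both the query and the document side.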
Albert Fannjiang
This paper develops uniqueness theory for 3D phase retrieval with finite,
discrete measurement data for strong phase objects and weak phase objects,
including:
(i) {\em Unique determination of (phase) projections from diffraction
patterns} -- General measurement schemes with coded and uncoded apertures are
proposed and shown to ensure unique reduction of diffraction patterns to the
phase projection for a strong phase object (respectively, the projection for a
weak phase object) in each direction separately without the knowledge of
relative orientations and locations. (ii) {\em Uniqueness for 3D phase
unwrapping} -- General conditions for unique determination of a 3D strong phase
object from its phase projection data are established, including, but not
limited to, random tilt schemes densely sampled from a spherical triangle of
vertexes in three orthogonal directions and other deterministic tilt schemes.
(iii) {\em Uniqueness for projection tomography} -- Unique determination of an
object of $n^3$ voxels from generic $n$ projections or $n+1$ coded diffraction
patterns is proved.
This approach of reducing 3D phase retrieval to the problem of (phase)
projection tomography has the practical implication of enabling classification
and alignment, when relative orientations are unknown, to be carried out in
terms of (phase) projections, instead of diffraction patterns.
The applications with the measurement schemes such as single-axis tilt,
conical tilt, dual-axis tilt, random conical tilt and general random tilt are
discussed.
Authors' comments: Revision of the previously titled "3D UNWRAPPED PHASE RETRIEVAL WITH
CODED APERTURE IS REDUCIBLE TO PROJECTION TOMOGRAPHY"
Dhanasekar Sundararaman, Vivek Subramanian
Biases in culture, gender, ethnicity, etc. have existed for decades and have
affected many areas of human social interaction. These biases have been shown
to impact machine learning (ML) models, and for natural language processing
(NLP), this can have severe consequences for downstream tasks. Mitigating
gender bias in information retrieval (IR) is important to avoid propagating
stereotypes. In this work, we employ a dataset consisting of two components:
(1) relevance of a document to a query and (2) "gender" of a document, in which
pronouns are replaced by male, female, and neutral conjugations. We
definitively show that pre-trained models for IR do not perform well in
zero-shot retrieval tasks when full fine-tuning of a large pre-trained BERT
encoder is performed and that lightweight fine-tuning performed with adapter
networks improves zero-shot retrieval performance by almost 20% over baseline.
We also illustrate that pre-trained models have gender biases that result in
retrieved articles tending to be more often male than female. We overcome this
by introducing a debiasing technique that penalizes the model when it prefers
males over females, resulting in an effective model that retrieves articles in
a balanced fashion across genders.
Authors' comments: Updated title to be reflective of the methods
Stefan Lattner
Modern digital music production typically involves combining numerous
acoustic elements to compile a piece of music. Important types of such elements
are drum samples, which determine the characteristics of the percussive
components of the piece. Artists must use their aesthetic judgement to assess
whether a given drum sample fits the current musical context. However,
selecting drum samples from a potentially large library is tedious and may
interrupt the creative flow. In this work, we explore automatic drum sample
retrieval based on aesthetic principles learned from data. As a result, artists
can rank the samples in their library by fit to some musical context at
different stages of the production process (i.e., by fit to incomplete song
mixtures). To this end, we use contrastive learning to maximize the score of
drum samples originating from the same song as the mixture. We conduct a
listening test to determine whether the human ratings match the automatic
scoring function. We also perform objective quantitative analyses to evaluate
the efficacy of our approach.
Authors' comments: 8 pages, 3 figures, 1 table; Accepted at the ISMIR conference,
Bengaluru, India, 2022
Nicola Tonellotto
These lecture notes focus on the recent advancements in neural information retrieval, with particular emphasis on the systems and models exploiting transformer networks. These networks, originally proposed by Google in 2017, have seen a large success in many natural language processing and information retrieval tasks. While there are many fantastic textbooks on information retrieval and natural language processing, as well as specialised books for a more advanced audience, these lecture notes target people aiming to develop a basic understanding of the main information retrieval techniques and approaches based on deep learning. These notes have been prepared for an IR graduate course of the MSc program in Artificial Intelligence and Data Engineering at the University of Pisa, Italy.
Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Image captioning models aim at connecting Vision and Language by providing
natural language descriptions of input images. In the past few years, the task
has been tackled by learning parametric models and proposing visual feature
extraction advancements or by modeling better multi-modal connections. In this
paper, we investigate the development of an image captioning approach with a
kNN memory, with which knowledge can be retrieved from an external corpus to
aid the generation process. Our architecture combines a knowledge retriever
based on visual similarities, a differentiable encoder, and a kNN-augmented
attention layer to predict tokens based on the past context and on text
retrieved from the external memory. Experimental results, conducted on the COCO
dataset, demonstrate that employing an explicit external memory can aid the
generation process and increase caption quality. Our work opens up new avenues
for improving image captioning models at larger scale.
Authors' comments: CBMI 2022
Philipp Grohs, Lukas Liehr, Martin Rathmair
Short-time Fourier transform (STFT) phase retrieval refers to the
reconstruction of a function $f$ from its spectrogram, i.e., the magnitudes of
its short-time Fourier transform $V_gf$ with window function $g$. While it is
known that for appropriate windows, any function $f \in L^2(\mathbb{R})$ can be
reconstructed from the full spectrogram $|V_g f(\mathbb{R}^2)|$, in practical
scenarios, the reconstruction must be achieved from discrete samples, typically
taken on a lattice. It turns out that the sampled problem becomes much more
subtle: recent results have demonstrated that uniqueness via lattice-sampling
is unachievable, irrespective of the choice of the window function or the
lattice density. In the present paper, we initiate the study of multi-window
STFT phase retrieval as a way to effectively bypass the discretization barriers
encountered in the single-window case. By establishing a link between
multi-window Gabor systems, sampling in Fock space, and phase retrieval for
finite frames, we derive conditions under which square-integrable functions can
be uniquely recovered from spectrogram samples on a lattice. Specifically, we
provide conditions on window functions $g_1, \dots, g_4 \in L^2(\mathbb{R})$,
such that every $f \in L^2(\mathbb{R})$ is determined up to a global phase from
$$\left(|V_{g_1}f(A\mathbb{Z}^2)|, \, \dots, \, |V_{g_4}f(A\mathbb{Z}^2)|
\right)$$ whenever $A \in \mathrm{GL}_2(\mathbb{R})$ satisfies the density
condition $|\det A|^{-1} \geq 4$. For real-valued functions, a density of
$|\det A|^{-1} \geq 2$ is sufficient. Corresponding results for irregular
sampling are also shown.
Authors' comments: 19 pages, 2 figures, incorporated referee suggestions
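The phrase "determined up to a global phase" in the abstract above can be checked numerically in a finite, discrete analogue. The sketch below is only an illustration and does not reproduce the $L^2(\mathbb{R})$ setting of the paper: the function names, the signal length, the hop size, and the four Gaussian windows are assumptions, and these windows are not claimed to satisfy the paper's conditions.

```python
import numpy as np

def spectrogram_samples(signal, window, hop):
    """Magnitudes of a discrete STFT sampled on a time-frequency grid."""
    n = len(window)
    starts = range(0, len(signal) - n + 1, hop)
    frames = np.array([signal[s:s + n] * np.conj(window) for s in starts])
    return np.abs(np.fft.fft(frames, axis=1))

def multi_window_measurements(signal, windows, hop):
    """One sampled spectrogram per window, mimicking the multi-window setup."""
    return [spectrogram_samples(signal, g, hop) for g in windows]

rng = np.random.default_rng(0)
f = rng.normal(size=64) + 1j * rng.normal(size=64)
t = np.arange(16)
# Four hypothetical Gaussian windows of different widths.
windows = [np.exp(-0.5 * ((t - 7.5) / s) ** 2) for s in (1.5, 2.5, 3.5, 4.5)]

original = multi_window_measurements(f, windows, hop=4)
rotated = multi_window_measurements(np.exp(1j * 0.7) * f, windows, hop=4)
```

Multiplying `f` by a unimodular constant leaves every measurement unchanged, which is why uniqueness can only ever hold up to a global phase.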
Bytasandram Yaswanth Reddy, Shiv Ram Dubey, Rakesh Kumar Sanodiya, Ravi Ranjan Prasad Karn
Existing data-dependent hashing methods use large backbone networks with
millions of parameters and are computationally complex. Existing knowledge
distillation methods use logits and other features of the deep (teacher) model
as knowledge for the compact (student) model, which requires the teacher's
network to be fine-tuned on the context in parallel with the student model.
Training the teacher on the target context requires more time and
computational resources. In this paper, we propose context unaware knowledge
distillation that uses the knowledge of the teacher model without fine-tuning
it on the target context. We also propose a new efficient student model
architecture for knowledge distillation. The proposed approach follows a
two-step process. The first step involves pre-training the student model with
the help of context unaware knowledge distillation from the teacher model. The
second step involves fine-tuning the student model on the context of image
retrieval. In order to show the efficacy of the proposed approach, we compare
the retrieval results, no. of parameters and no. of operations of the student
models with the teacher models under different retrieval frameworks, including
deep cauchy hashing (DCH) and central similarity quantization (CSQ). The
experimental results confirm that the proposed approach provides a promising
trade-off between the retrieval results and efficiency. The code used in this
paper is released publicly at \url{https://github.com/satoru2001/CUKDFIR}.
Authors' comments: Accepted in International Conference on Computer Vision and Machine
Intelligence (CVMI), 2022
Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, Graham Neubig
Publicly available source-code libraries are continuously growing and
changing. This makes it impossible for models of code to keep current with all
available APIs by simply training these models on existing code repositories.
Thus, existing models inherently cannot generalize to using unseen functions
and libraries, because these would never appear in the training data. In
contrast, when human programmers use functions and libraries for the first
time, they frequently refer to textual resources such as code manuals and
documentation, to explore and understand the available functionality. Inspired
by this observation, we introduce DocPrompting: a natural-language-to-code
generation approach that explicitly leverages documentation by (1) retrieving
the relevant documentation pieces given an NL intent, and (2) generating code
based on the NL intent and the retrieved documentation. DocPrompting is
general: it can be applied to any programming language and is agnostic to the
underlying neural model. We demonstrate that DocPrompting consistently improves
NL-to-code models: DocPrompting improves strong base models such as CodeT5 by
2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in
execution-based evaluation on the popular Python CoNaLa benchmark; on a new
Bash dataset tldr, DocPrompting improves CodeT5 and GPT-Neo1.3B by up to
absolute 6.9% exact match.
Authors' comments: ICLR 2023 (notable-top-25%); code and data are available at
https://github.com/shuyanzhou/docprompting
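The two-step retrieve-then-generate pattern described in the DocPrompting abstract can be outlined with a toy retriever and a prompt builder. This is a sketch under stated assumptions, not the released code: the token-overlap retriever stands in for whatever retriever the authors use, the prompt layout is invented, and the call to a code generation model is left out because the approach is described as model-agnostic.

```python
def retrieve_docs(intent, doc_pool, top_k=2):
    """Rank documentation snippets by token overlap with the NL intent.

    Token overlap is a stand-in retriever chosen for illustration.
    """
    intent_tokens = set(intent.lower().split())

    def overlap(doc):
        return len(intent_tokens & set(doc.lower().split()))

    return sorted(doc_pool, key=overlap, reverse=True)[:top_k]

def build_prompt(intent, retrieved_docs):
    """Concatenate retrieved documentation and the intent (assumed layout)."""
    doc_block = "\n".join(f"# doc: {d}" for d in retrieved_docs)
    return f"{doc_block}\n# intent: {intent}\n"

doc_pool = [
    "tar -x extract files from an archive",
    "ls -l list directory contents in long format",
    "tar -f use the given archive file",
]
intent = "extract files from archive file data.tar"
docs = retrieve_docs(intent, doc_pool)
prompt = build_prompt(intent, docs)
```

The resulting `prompt` would be passed to any code generation model; because the documentation pool can be updated without retraining, newly added functions can in principle be surfaced at inference time.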
Jun Rao, Liang Ding, Shuhan Qi, Meng Fang, Yang Liu, Li Shen, Dacheng Tao
Although the vision-and-language pretraining (VLP) equipped cross-modal image-text retrieval (ITR) has achieved remarkable progress in the past two years, it suffers from a major drawback: the ever-increasing size of VLP models restricts its deployment to real-world search scenarios (where high latency is unacceptable). To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress the large VLP models for the ITR task. Technically, we face the following two challenges: 1) the typical uni-modal metric learning approach is difficult to apply directly to cross-modal tasks, because limited GPU memory prevents optimizing over many negative samples when handling cross-modal fusion features; 2) it is inefficient to statically optimize the student network from different hard samples, which have different effects on distillation learning and student network optimization. We try to overcome these challenges in two ways. First, to achieve multi-modal contrastive learning and to balance training costs and effects, we propose to use a teacher network to estimate the difficult samples for students, making the students absorb the powerful knowledge from pre-trained teachers and master the knowledge from hard samples. Second, to learn dynamically from hard sample pairs, we propose dynamic distillation to dynamically learn samples of different difficulties, from the perspective of better balancing the difficulty of knowledge and students' self-learning ability. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e. ViLT and METER. Extensive experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. Encouragingly, we can speed up inference by at least 129$\times$ compared to the existing ITR models.
Ayan Kumar Bhunia, Aneeshan Sain, Parth Shah, Animesh Gupta, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song
The recent focus on Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) has
shifted towards generalising a model to new categories without any training
data from them. In real-world applications, however, a trained FG-SBIR model is
often applied to both new categories and different human sketchers, i.e.,
different drawing styles. Although this complicates the generalisation problem,
fortunately, a handful of examples are typically available, enabling the model
to adapt to the new category/style. In this paper, we offer a novel perspective
-- instead of asking for a model that generalises, we advocate for one that
quickly adapts, with just very few samples during testing (in a few-shot
manner). To solve this new problem, we introduce a novel model-agnostic
meta-learning (MAML) based framework with several key modifications: (1) As a
retrieval task with a margin-based contrastive loss, we simplify the MAML
training in the inner loop to make it more stable and tractable. (2) The margin
in our contrastive loss is also meta-learned with the rest of the model. (3)
Three additional regularisation losses are introduced in the outer loop, to
make the meta-learned FG-SBIR model more effective for category/style
adaptation. Extensive experiments on public datasets suggest a large gain over
generalisation and zero-shot based approaches, and a few strong few-shot
baselines.
Authors' comments: Accepted in ECCV 2022. Minor typos and Eq.4 corrected
Xu Huang, Defu Lian, Jin Chen, Zheng Liu, Xing Xie, Enhong Chen
Deep recommender systems (DRS) are intensively applied in modern web
services. To deal with the massive web contents, DRS employs a two-stage
workflow: retrieval and ranking, to generate its recommendation results. The
retriever aims to efficiently select a small set of relevant candidates from the
entire item set, while the ranker, usually more precise but
time-consuming, is supposed to further refine the best items from the retrieved
candidates. Traditionally, the two components are trained either independently
or within a simple cascading pipeline, which is prone to poor collaboration
effect. Though some recent works suggest training the retriever and ranker
jointly, many severe limitations remain: item distribution shift
between training and inference, false negatives, and misalignment of ranking
order. As such, it remains to explore effective collaborations between
retriever and ranker.
Authors' comments: 12pages, 4 figures, WWW'23
Julien Flamant, Konstantin Usevich, Marianne Clausel, David Brie
This work introduces a novel Fourier phase retrieval model, called
polarimetric phase retrieval, that enables a systematic use of polarization
information in Fourier phase retrieval problems. We provide a complete
characterization of uniqueness properties of this new model by unraveling
equivalencies with a peculiar polynomial factorization problem. We introduce
two different but complementary categories of reconstruction methods. The first
one is algebraic and relies on the use of approximate greatest common divisor
computations using Sylvester matrices. The second one carefully adapts existing
algorithms for Fourier phase retrieval, namely semidefinite positive relaxation
and Wirtinger-Flow, to solve the polarimetric phase retrieval problem. Finally,
a set of numerical experiments permits a detailed assessment of the numerical
behavior and relative performances of each proposed reconstruction strategy. We
further highlight a reconstruction strategy that combines both approaches for
scalable, computationally efficient and asymptotically MSE optimal performance.
Authors' comments: 37 pages, 10 figures
Ying Wang
This paper addresses the construction of an inverted index for large-scale image retrieval. The inverted index proposed by J. Sivic brings a significant acceleration by restricting distance computations to only a small fraction of the database. State-of-the-art inverted indices aim to build finer partitions that produce a concise and accurate candidate list. However, partitioning in these frameworks is generally achieved by unsupervised clustering methods, which ignore the semantic information of images. In this paper, we replace the clustering method with image classification during the construction of the codebook. We then propose a merging and splitting method to solve the problem that the number of partitions is unchangeable in the inverted semantic-index. Next, we combine our semantic-index with product quantization (PQ) so as to alleviate the accuracy loss caused by PQ compression. Finally, we evaluate our model on large-scale image retrieval benchmarks. Experimental results demonstrate that our model can significantly improve the retrieval accuracy by generating high-quality candidate lists.
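The core idea above, partitioning the database by predicted class instead of by cluster, can be pictured with a small in-memory index. The sketch below is a hypothetical simplification: the class name, the method signatures, and the toy vectors are assumptions, the classifier is replaced by hand-written class scores, and the merging/splitting step and product quantization are left out.

```python
from collections import defaultdict

import numpy as np

class SemanticInvertedIndex:
    """Inverted index whose partitions are class labels (illustrative only)."""

    def __init__(self):
        # class label -> list of (image_id, feature vector)
        self.lists = defaultdict(list)

    def add(self, image_id, vector, class_label):
        self.lists[class_label].append((image_id, np.asarray(vector, dtype=float)))

    def search(self, query_vector, query_class_scores, n_probe=1, top_k=5):
        # Probe only the inverted lists of the most likely classes.
        probed = sorted(
            query_class_scores, key=query_class_scores.get, reverse=True
        )[:n_probe]
        candidates = [item for label in probed for item in self.lists.get(label, [])]
        q = np.asarray(query_vector, dtype=float)
        # Exact distances are computed for the candidate list only.
        candidates.sort(key=lambda item: float(np.linalg.norm(item[1] - q)))
        return [image_id for image_id, _ in candidates[:top_k]]

index = SemanticInvertedIndex()
index.add("cat_1", [0.9, 0.1], "cat")
index.add("cat_2", [0.7, 0.3], "cat")
index.add("dog_1", [0.1, 0.9], "dog")

results = index.search([1.0, 0.0], {"cat": 0.9, "dog": 0.1}, n_probe=1)
```

With `n_probe=1` only the `cat` list is scanned, so the `dog` entry is never compared against the query; raising `n_probe` trades speed for recall.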