Dhanasekar Sundararaman, Vivek Subramanian
Biases in culture, gender, ethnicity, etc. have existed for decades and have
affected many areas of human social interaction. These biases have been shown
to impact machine learning (ML) models, and for natural language processing
(NLP), this can have severe consequences for downstream tasks. Mitigating
gender bias in information retrieval (IR) is important to avoid propagating
stereotypes. In this work, we employ a dataset consisting of two components:
(1) relevance of a document to a query and (2) "gender" of a document, in which
pronouns are replaced by male, female, and neutral conjugations. We
definitively show that pre-trained models for IR do not perform well in
zero-shot retrieval tasks when full fine-tuning of a large pre-trained BERT
encoder is performed and that lightweight fine-tuning performed with adapter
networks improves zero-shot retrieval performance by almost 20% over the baseline.
We also illustrate that pre-trained models have gender biases that result in
retrieved articles tending to be more often male than female. We overcome this
by introducing a debiasing technique that penalizes the model when it prefers
males over females, resulting in an effective model that retrieves articles in
a balanced fashion across genders.
Authors' comments: Updated title to be reflective of the methods
Stefan Lattner
Modern digital music production typically involves combining numerous
acoustic elements to compile a piece of music. Important types of such elements
are drum samples, which determine the characteristics of the percussive
components of the piece. Artists must use their aesthetic judgement to assess
whether a given drum sample fits the current musical context. However,
selecting drum samples from a potentially large library is tedious and may
interrupt the creative flow. In this work, we explore automatic drum sample
retrieval based on aesthetic principles learned from data. As a result, artists
can rank the samples in their library by fit to some musical context at
different stages of the production process (i.e., by fit to incomplete song
mixtures). To this end, we use contrastive learning to maximize the score of
drum samples originating from the same song as the mixture. We conduct a
listening test to determine whether the human ratings match the automatic
scoring function. We also perform objective quantitative analyses to evaluate
the efficacy of our approach.
Authors' comments: 8 pages, 3 figures, 1 table; Accepted at the ISMIR conference,
Bengaluru, India, 2022
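The contrastive objective described above can be sketched as follows. This is a minimal illustration, assuming an InfoNCE-style loss over dot-product scores; the abstract does not specify the exact loss, encoder, or temperature, so `info_nce_loss`, `rank_samples`, the toy embeddings, and `temperature` are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(mixture_emb, sample_embs, positive_index, temperature=0.1):
    """Cross-entropy over similarity scores (assumed InfoNCE form).

    The loss is small when the drum sample taken from the same song as the
    mixture (the positive) scores higher than all other samples.
    """
    logits = [dot(mixture_emb, s) / temperature for s in sample_embs]
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[positive_index]


def rank_samples(mixture_emb, sample_embs):
    """Return sample indices sorted from best to worst fit to the mixture."""
    scores = [dot(mixture_emb, s) for s in sample_embs]
    return sorted(range(len(sample_embs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (hypothetical); a real system would obtain them from learned encoders.
mixture = [1.0, 0.0]
samples = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

loss_good = info_nce_loss(mixture, samples, positive_index=0)
loss_bad = info_nce_loss(mixture, samples, positive_index=2)
ranking = rank_samples(mixture, samples)
```

At retrieval time only `rank_samples` is needed: the library is sorted by the learned score against the current (possibly incomplete) mixture.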
Nicola Tonellotto
These lecture notes focus on recent advancements in neural information retrieval, with particular emphasis on systems and models exploiting transformer networks. These networks, originally proposed by Google in 2017, have seen great success in many natural language processing and information retrieval tasks. While there are many fantastic textbooks on information retrieval and natural language processing, as well as specialised books for a more advanced audience, these lecture notes target people aiming to develop a basic understanding of the main information retrieval techniques and approaches based on deep learning. These notes have been prepared for an IR graduate course of the MSc program in Artificial Intelligence and Data Engineering at the University of Pisa, Italy.
Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Image captioning models aim at connecting Vision and Language by providing
natural language descriptions of input images. In the past few years, the task
has been tackled by learning parametric models and proposing visual feature
extraction advancements or by modeling better multi-modal connections. In this
paper, we investigate the development of an image captioning approach with a
kNN memory, with which knowledge can be retrieved from an external corpus to
aid the generation process. Our architecture combines a knowledge retriever
based on visual similarities, a differentiable encoder, and a kNN-augmented
attention layer to predict tokens based on the past context and on text
retrieved from the external memory. Experimental results, conducted on the COCO
dataset, demonstrate that employing an explicit external memory can aid the
generation process and increase caption quality. Our work opens up new avenues
for improving image captioning models at larger scale.
Authors' comments: CBMI 2022
Philipp Grohs, Lukas Liehr, Martin Rathmair
Short-time Fourier transform (STFT) phase retrieval refers to the
reconstruction of a function $f$ from its spectrogram, i.e., the magnitudes of
its short-time Fourier transform $V_gf$ with window function $g$. While it is
known that for appropriate windows, any function $f \in L^2(\mathbb{R})$ can be
reconstructed from the full spectrogram $|V_g f(\mathbb{R}^2)|$, in practical
scenarios, the reconstruction must be achieved from discrete samples, typically
taken on a lattice. It turns out that the sampled problem becomes much more
subtle: recent results have demonstrated that uniqueness via lattice-sampling
is unachievable, irrespective of the choice of the window function or the
lattice density. In the present paper, we initiate the study of multi-window
STFT phase retrieval as a way to effectively bypass the discretization barriers
encountered in the single-window case. By establishing a link between
multi-window Gabor systems, sampling in Fock space, and phase retrieval for
finite frames, we derive conditions under which square-integrable functions can
be uniquely recovered from spectrogram samples on a lattice. Specifically, we
provide conditions on window functions $g_1, \dots, g_4 \in L^2(\mathbb{R})$,
such that every $f \in L^2(\mathbb{R})$ is determined up to a global phase from
$$\left(|V_{g_1}f(A\mathbb{Z}^2)|, \, \dots, \, |V_{g_4}f(A\mathbb{Z}^2)|
\right)$$ whenever $A \in \mathrm{GL}_2(\mathbb{R})$ satisfies the density
condition $|\det A|^{-1} \geq 4$. For real-valued functions, a density of
$|\det A|^{-1} \geq 2$ is sufficient. Corresponding results for irregular
sampling are also shown.
Authors' comments: 19 pages, 2 figures, incorporated referee suggestions
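For reference, the "up to a global phase" qualifier can be made explicit. Assuming one common convention for the STFT (the abstract does not fix a normalization), we have
$$V_g f(x,\omega) = \int_{\mathbb{R}} f(t)\,\overline{g(t-x)}\,e^{-2\pi i \omega t}\,dt,$$
and, by linearity, for every $\theta \in \mathbb{R}$,
$$|V_g(e^{i\theta} f)(x,\omega)| = |e^{i\theta}|\,|V_g f(x,\omega)| = |V_g f(x,\omega)|.$$
Hence $f$ and $e^{i\theta} f$ always produce the same spectrogram samples, and uniqueness can only hold modulo this ambiguity.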
Bytasandram Yaswanth Reddy, Shiv Ram Dubey, Rakesh Kumar Sanodiya, Ravi Ranjan Prasad Karn
Existing data-dependent hashing methods use large backbone networks with
millions of parameters and are computationally complex. Existing knowledge
distillation methods use logits and other features of the deep (teacher) model
as knowledge for the compact (student) model, which requires the teacher
network to be fine-tuned on the context in parallel with the student model.
Training the teacher on the target context requires more time and
computational resources. In this paper, we propose context unaware knowledge
distillation that uses the knowledge of the teacher model without fine-tuning
it on the target context. We also propose a new efficient student model
architecture for knowledge distillation. The proposed approach follows a
two-step process. The first step involves pre-training the student model with
the help of context unaware knowledge distillation from the teacher model. The
second step involves fine-tuning the student model on the context of image
retrieval. In order to show the efficacy of the proposed approach, we compare
the retrieval results, number of parameters, and number of operations of the student
models with the teacher models under different retrieval frameworks, including
deep Cauchy hashing (DCH) and central similarity quantization (CSQ). The
experimental results confirm that the proposed approach provides a promising
trade-off between the retrieval results and efficiency. The code used in this
paper is released publicly at https://github.com/satoru2001/CUKDFIR.
Authors' comments: Accepted in International Conference on Computer Vision and Machine
Intelligence (CVMI), 2022
Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, Graham Neubig
Publicly available source-code libraries are continuously growing and
changing. This makes it impossible for models of code to keep current with all
available APIs by simply training these models on existing code repositories.
Thus, existing models inherently cannot generalize to using unseen functions
and libraries, because these would never appear in the training data. In
contrast, when human programmers use functions and libraries for the first
time, they frequently refer to textual resources such as code manuals and
documentation, to explore and understand the available functionality. Inspired
by this observation, we introduce DocPrompting: a natural-language-to-code
generation approach that explicitly leverages documentation by (1) retrieving
the relevant documentation pieces given an NL intent, and (2) generating code
based on the NL intent and the retrieved documentation. DocPrompting is
general: it can be applied to any programming language and is agnostic to the
underlying neural model. We demonstrate that DocPrompting consistently improves
NL-to-code models: DocPrompting improves strong base models such as CodeT5 by
2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in
execution-based evaluation on the popular Python CoNaLa benchmark; on a new
Bash dataset tldr, DocPrompting improves CodeT5 and GPT-Neo1.3B by up to
absolute 6.9% exact match.
Authors' comments: ICLR 2023 (notable-top-25%); code and data are available at
https://github.com/shuyanzhou/docprompting
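The pass@k figures quoted above are execution-based metrics. A minimal sketch of how such a metric is commonly computed is given below, assuming the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ generated programs of which $c$ pass the tests; the abstract does not state which estimator the authors use, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k sampled programs passes.

    n: number of programs generated for a problem
    c: number of those programs that pass the unit tests
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing programs: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 generations and 3 correct ones, pass@1 is the plain success rate.
example = pass_at_k(10, 3, 1)
```

The benchmark score is then the mean of this quantity over all problems.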
Jun Rao, Liang Ding, Shuhan Qi, Meng Fang, Yang Liu, Li Shen, Dacheng Tao
Although vision-and-language pretraining (VLP) equipped cross-modal image-text retrieval (ITR) has achieved remarkable progress in the past two years, it suffers from a major drawback: the ever-increasing size of VLP models restricts their deployment in real-world search scenarios (where high latency is unacceptable). To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face the following two challenges: 1) the typical uni-modal metric learning approach is difficult to apply directly to cross-modal tasks, because limited GPU memory prevents optimizing over many negative samples when handling cross-modal fusion features; 2) it is inefficient to statically optimize the student network on different hard samples, which have different effects on distillation learning and student network optimization. We try to overcome these challenges in two ways. First, to achieve multi-modal contrastive learning and balance training costs and effects, we propose to use a teacher network to estimate the difficult samples for students, making the students absorb the powerful knowledge from pre-trained teachers and master the knowledge from hard samples. Second, to learn dynamically from hard sample pairs, we propose dynamic distillation to dynamically learn samples of different difficulties, from the perspective of better balancing the difficulty of knowledge and students' self-learning ability. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. Encouragingly, we can speed up inference by at least 129$\times$ compared to the existing ITR models.
Ayan Kumar Bhunia, Aneeshan Sain, Parth Shah, Animesh Gupta, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song
The recent focus on Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) has
shifted towards generalising a model to new categories without any training
data from them. In real-world applications, however, a trained FG-SBIR model is
often applied to both new categories and different human sketchers, i.e.,
different drawing styles. Although this complicates the generalisation problem,
fortunately, a handful of examples are typically available, enabling the model
to adapt to the new category/style. In this paper, we offer a novel perspective
-- instead of asking for a model that generalises, we advocate for one that
quickly adapts, with just very few samples during testing (in a few-shot
manner). To solve this new problem, we introduce a novel model-agnostic
meta-learning (MAML) based framework with several key modifications: (1) As a
retrieval task with a margin-based contrastive loss, we simplify the MAML
training in the inner loop to make it more stable and tractable. (2) The margin
in our contrastive loss is also meta-learned with the rest of the model. (3)
Three additional regularisation losses are introduced in the outer loop, to
make the meta-learned FG-SBIR model more effective for category/style
adaptation. Extensive experiments on public datasets suggest a large gain over
generalisation and zero-shot based approaches, and a few strong few-shot
baselines.
Authors' comments: Accepted in ECCV 2022. Minor typos and Eq.4 corrected
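To make the margin-based loss in modification (1) concrete, here is a minimal sketch assuming a standard triplet formulation with squared Euclidean distance. The abstract does not give the exact form, so the function names and toy embeddings are hypothetical; `margin` is passed as an argument because modification (2) treats it as a learned quantity rather than a fixed constant.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_margin_loss(sketch, positive_photo, negative_photo, margin):
    """Hinge loss: the matching photo should be closer to the sketch than a
    non-matching photo by at least `margin` (assumed triplet form)."""
    d_pos = squared_distance(sketch, positive_photo)
    d_neg = squared_distance(sketch, negative_photo)
    return max(0.0, margin + d_pos - d_neg)


# Toy embeddings (hypothetical): the positive is at distance 1, the negative at 4.
sketch = [0.0, 0.0]
positive = [1.0, 0.0]
negative = [2.0, 0.0]

satisfied = triplet_margin_loss(sketch, positive, negative, margin=0.5)
violated = triplet_margin_loss(sketch, positive, negative, margin=4.0)
```

A larger margin turns an already separated triplet into an active training signal, which is one reason the choice of margin matters for adaptation.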
Xu Huang, Defu Lian, Jin Chen, Zheng Liu, Xing Xie, Enhong Chen
Deep recommender systems (DRS) are intensively applied in modern web
services. To deal with the massive web contents, DRS employs a two-stage
workflow: retrieval and ranking, to generate its recommendation results. The
retriever aims to select a small set of relevant candidates from the entire
items with high efficiency; while the ranker, usually more precise but
time-consuming, is supposed to further refine the best items from the retrieved
candidates. Traditionally, the two components are trained either independently
or within a simple cascading pipeline, which is prone to poor collaboration
effect. Although some recent works suggest training the retriever and ranker
jointly, many severe limitations remain: item distribution shift
between training and inference, false negatives, and misalignment of ranking
order. As such, effective collaboration between the retriever and ranker
remains to be explored.
Authors' comments: 12 pages, 4 figures, WWW'23
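The two-stage workflow described above can be sketched as follows. This is a toy illustration only: the dot-product retriever, the distance-based `ranker_score`, and the item embeddings are all assumptions made for the example, not the models used in the paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query, items, k):
    """Stage 1: cheap scoring over the entire item set, keeping the top-k ids."""
    ordered = sorted(items, key=lambda item_id: dot(query, items[item_id]), reverse=True)
    return ordered[:k]


def ranker_score(query, item):
    """Stage 2 scorer (hypothetical): negative squared distance, standing in
    for a more precise but more expensive model."""
    return -sum((q - x) ** 2 for q, x in zip(query, item))


def rank(query, items, candidates):
    """Stage 2: re-order only the retrieved candidates with the precise scorer."""
    return sorted(candidates, key=lambda i: ranker_score(query, items[i]), reverse=True)


items = {"a": [1.0, 0.0], "b": [2.0, 2.0], "c": [0.0, 1.0], "d": [-1.0, -1.0]}
query = [1.0, 0.2]

candidates = retrieve(query, items, k=2)  # the retriever prefers "b"
ranked = rank(query, items, candidates)   # the ranker promotes "a"
```

The disagreement between the two stages on this toy data is a small instance of the misalignment of ranking order that the abstract lists as a limitation.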
Julien Flamant, Konstantin Usevich, Marianne Clausel, David Brie
This work introduces a novel Fourier phase retrieval model, called
polarimetric phase retrieval, that enables a systematic use of polarization
information in Fourier phase retrieval problems. We provide a complete
characterization of uniqueness properties of this new model by unraveling
equivalencies with a peculiar polynomial factorization problem. We introduce
two different but complementary categories of reconstruction methods. The first
one is algebraic and relies on the use of approximate greatest common divisor
computations using Sylvester matrices. The second one carefully adapts existing
algorithms for Fourier phase retrieval, namely semidefinite positive relaxation
and Wirtinger-Flow, to solve the polarimetric phase retrieval problem. Finally,
a set of numerical experiments permits a detailed assessment of the numerical
behavior and relative performances of each proposed reconstruction strategy. We
further highlight a reconstruction strategy that combines both approaches for
scalable, computationally efficient and asymptotically MSE optimal performance.
Authors' comments: 37 pages, 10 figures
Ying Wang
This paper addresses the construction of an inverted index for large-scale image retrieval. The inverted index proposed by J. Sivic brings a significant acceleration by restricting distance computations to only a small fraction of the database. State-of-the-art inverted indices aim to build finer partitions that produce a concise and accurate candidate list. However, partitioning in these frameworks is generally achieved by unsupervised clustering methods, which ignore the semantic information of images. In this paper, we replace the clustering method with image classification during the construction of the codebook. We then propose a merging and splitting method to solve the problem that the number of partitions is unchangeable in the inverted semantic-index. Next, we combine our semantic-index with product quantization (PQ) so as to alleviate the accuracy loss caused by PQ compression. Finally, we evaluate our model on large-scale image retrieval benchmarks. Experimental results demonstrate that our model can significantly improve retrieval accuracy by generating high-quality candidate lists.
Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Binxing Jiao, Daxin Jiang
A ranker plays an indispensable role in the de facto 'retrieval & rerank'
pipeline, but its training still lags behind -- learning from moderate
negatives and/or serving as an auxiliary module for a retriever. In this work,
we first identify two major barriers to a robust ranker, i.e., inherent label
noise caused by a well-trained retriever and non-ideal negatives sampled for a
highly capable ranker. We therefore propose using multiple retrievers as negative
generators to improve the ranker's robustness, where i) involving extensive
out-of-distribution label noise renders the ranker robust against each noise
distribution, and ii) diverse hard negatives from a joint distribution are
relatively close to the ranker's negative distribution, leading to more
challenging and thus more effective training. To evaluate our robust ranker (dubbed
R$^2$anker), we conduct experiments in various settings on the popular passage
retrieval benchmark, including BM25-reranking, full-ranking, retriever
distillation, etc. The empirical results verify the new state-of-the-art
effectiveness of our model.
Authors' comments: 11 pages of main content, 4 tables, 3 figures
Peter C. Humphreys, Arthur Guez, Olivier Tieleman, Laurent Sifre, Théophane Weber, Timothy Lillicrap
Effective decision making involves flexibly relating past experiences and
relevant contextual information to a novel situation. In deep reinforcement
learning (RL), the dominant paradigm is for an agent to amortise information
that helps decision making into its network weights via gradient descent on
training losses. Here, we pursue an alternative approach in which agents can
utilise large-scale context sensitive database lookups to support their
parametric computations. This allows agents to directly learn in an end-to-end
manner to utilise relevant information to inform their outputs. In addition,
new information can be attended to by the agent, without retraining, by simply
augmenting the retrieval dataset. We study this approach for offline RL in 9x9
Go, a challenging game for which the vast combinatorial state space privileges
generalisation over direct matching to past experiences. We leverage fast,
approximate nearest neighbor techniques in order to retrieve relevant data from
a set of tens of millions of expert demonstration states. Attending to this
information provides a significant boost to prediction accuracy and game-play
performance over simply using these demonstrations as training trajectories,
providing a compelling demonstration of the value of large-scale retrieval in
offline RL agents.
Authors' comments: Thirty-sixth Annual Conference on Neural Information Processing
Systems (NeurIPS 2022), 16 pages
Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia et al.
Current state-of-the-art document retrieval solutions mainly follow an
index-retrieve paradigm, where the index is hard to optimize directly for
the final retrieval target. In this paper, we aim to show that an end-to-end
deep neural network unifying training and indexing stages can significantly
improve the recall performance of traditional methods. To this end, we propose
Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates
relevant document identifiers directly for a designated query. To optimize the
recall performance of NCI, we invent a prefix-aware weight-adaptive decoder
architecture, and leverage tailored techniques including query generation,
semantic document identifiers, and consistency-based regularization. Empirical
studies demonstrated the superiority of NCI on two commonly used academic
benchmarks, achieving +21.4% and +16.8% relative enhancement for Recall@1 on
NQ320k dataset and R-Precision on TriviaQA dataset, respectively, compared to
the best baseline method.
Authors' comments: 19 pages, 6 figures, accepted by NeurIPS 2022
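For readers unfamiliar with the reported metric, a minimal sketch of Recall@k is given below. It assumes the variant commonly used when each query has one (or a few) relevant documents, where a query counts as a hit if any relevant identifier appears in the top k; the abstract does not spell out the exact definition, and the document identifiers here are made up.

```python
def recall_at_k(ranked_ids_per_query, relevant_ids_per_query, k):
    """Fraction of queries with at least one relevant id in the top-k results."""
    hits = 0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        if any(doc_id in relevant for doc_id in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids_per_query)


# Hypothetical model output: one ranked list of document identifiers per query.
ranked = [["d3", "d1"], ["d2", "d9"], ["d7", "d4"]]
relevant = [{"d1"}, {"d2"}, {"d5"}]

r_at_1 = recall_at_k(ranked, relevant, k=1)
r_at_2 = recall_at_k(ranked, relevant, k=2)
```

In a generative retriever of the kind described, each ranked list would come from decoding identifiers (for example with beam search) instead of from an index lookup.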
Devansh Gupta, Aditya Saini, Drishti Bhasin, Sarthak Bhagat, Shagun Uppal, Rishi Raj Jain, Ponnurangam Kumaraguru, Rajiv Ratn Shah
Retrieving facial images from attributes plays a vital role in various systems such as face recognition and suspect identification. Compared to other image retrieval tasks, facial image retrieval is more challenging due to the high subjectivity involved in describing a person's facial features. Existing methods do so by comparing specific characteristics from the user's mental image against the suggested images via high-level supervision such as natural language. In contrast, we propose a method that uses a simpler form of binary supervision, utilizing the user's feedback to label images as either similar or dissimilar to the target image. Such supervision enables us to exploit the contrastive learning paradigm for encapsulating each user's personalized notion of similarity. For this, we propose a novel loss function optimized online via user feedback. We validate the efficacy of our proposed approach using a carefully designed testbed to simulate user feedback and a large-scale user study. Our experiments demonstrate that our method iteratively improves personalization, leading to faster convergence and enhanced recommendation relevance, thereby improving user satisfaction. Our proposed framework is also equipped with a user-friendly web interface with a real-time experience for facial image retrieval.
Philippe Jaming, Martin Rathmair
In this paper we consider the question of finding a family of operators $(T_j)_{j\in J}$ on $L^2(R)$, as small as possible, that does phase retrieval: every $\varphi$ is uniquely determined (up to a constant phase factor) by the phaseless data $(|T_j\varphi|)_{j\in J}$. This problem arises in various fields of applied sciences where usually the operators obey further restrictions. Of particular interest here are so-called {\em coded diffraction patterns} where the operators are of the form $T_j\varphi=\mathcal{F}m_j\varphi$, $\mathcal{F}$ the Fourier transform and $m_j\in L^\infty(R)$ are "masks". Here we explicitly construct three real-valued masks $m_1,m_2,m_3\in L^\infty(R)$ so that the associated coded diffraction patterns do phase retrieval. This implies that the three self-adjoint operators $T_j\varphi=\mathcal{F}[m_j\mathcal{F}^{-1}\varphi]$ also do phase retrieval. The proof uses complex analysis. We then show that some natural analogues of these operators in the finite dimensional setting do not always lead to the same uniqueness result due to an undersampling effect.
Avinash Madasu, Junier Oliva, Gedas Bertasius
The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be sub-optimal if the initial query has ambiguities, which would lead to many falsely retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog, where the user refines retrieved results by answering questions generated by an AI agent. Our novel multimodal question generator learns to ask questions that maximize the subsequent video retrieval performance using (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions, to generate questions that incorporate both visual and linguistic cues relevant to video retrieval. Furthermore, to generate maximally informative questions, we propose Information-Guided Supervision (IGS), which guides the question generator to ask questions that would boost subsequent video retrieval accuracy. We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems. We also demonstrate that our proposed approach generalizes to real-world settings that involve interactions with real humans, thus demonstrating the robustness and generality of our framework.
Sayan Nath, Nikhil Nayak
Interaction with images has increased in recent years. Image similarity involves fetching images that look similar to a given reference image. The goal is to determine whether an image submitted as a query returns similar pictures. We use the BigTransfer model, which is itself a state-of-the-art model. BigTransfer (BiT) is essentially a ResNet, but pre-trained on a larger dataset such as ImageNet or ImageNet-21k, with additional modifications. Using the fine-tuned pre-trained Convolutional Neural Network model, we extract the key features and fit a K-Nearest Neighbor model on them to obtain the nearest neighbors. Our model is intended for finding similar images, which is hard to achieve through text queries, within a low inference time. We benchmark our model on this application.
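The feature-extraction-plus-nearest-neighbor pipeline can be sketched as below. The feature vectors are hard-coded stand-ins for what a CNN backbone would produce, and cosine similarity is an assumed choice of metric; neither the image names nor the metric come from the abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def nearest_neighbors(query_features, gallery, k):
    """Return the names of the k gallery images most similar to the query."""
    ordered = sorted(
        gallery,
        key=lambda name: cosine_similarity(query_features, gallery[name]),
        reverse=True,
    )
    return ordered[:k]


# Hypothetical 3-dimensional features; real backbone features are much larger.
gallery = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}
query_features = [1.0, 0.1, 0.0]

neighbors = nearest_neighbors(query_features, gallery, k=2)
```

This exhaustive search is linear in the gallery size; a library nearest-neighbor index would normally replace it when low inference time matters.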
Jihyuk Kim, Minsso Kim, Seung-won Hwang
Deep learning for Information Retrieval (IR) requires a large amount of
high-quality query-document relevance labels, but such labels are inherently
sparse. Label smoothing redistributes some observed probability mass over
unobserved instances, often uniformly, uninformed of the true distribution. In
contrast, we propose knowledge distillation for informed labeling, without
incurring high computation overheads at evaluation time. Our contribution is
designing a simple but efficient teacher model which utilizes collective
knowledge, to outperform state-of-the-art models distilled from a more complex
teacher model. Specifically, we train up to 8x faster than the state-of-the-art
teacher, while distilling the rankings better. Our code is publicly available
at https://github.com/jihyukkim-nlp/CollectiveKD.
Authors' comments: NAACL 2022
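The contrast between uniform label smoothing and teacher-informed labels can be illustrated as follows. The smoothing convention $(1-\epsilon)\,y + \epsilon/K$ is one common choice, and the interpolation in `informed_labels`, the weight `alpha`, and the teacher probabilities are hypothetical; this is not the paper's distillation objective.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Uniform label smoothing: move a share epsilon of the mass onto all K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


def informed_labels(one_hot, teacher_probs, alpha=0.5):
    """Mix observed labels with a teacher's distribution (assumed interpolation),
    so that unlabeled but plausible documents receive more mass than implausible ones."""
    return [(1.0 - alpha) * y + alpha * p for y, p in zip(one_hot, teacher_probs)]


observed = [1.0, 0.0, 0.0, 0.0]     # only document 0 is labeled relevant
teacher = [0.6, 0.3, 0.05, 0.05]    # hypothetical teacher scores, summing to 1

uniform = smooth_labels(observed, epsilon=0.2)
informed = informed_labels(observed, teacher, alpha=0.5)
```

Uniform smoothing gives documents 1 to 3 identical targets, whereas the informed targets keep the teacher's preference for document 1.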