Vivek Ramanujan, Pavan Kumar Anasosalu Vasu, Ali Farhadi, Oncel Tuzel, Hadi Pouransari
In visual retrieval systems, updating the embedding model requires
recomputing features for every piece of data. This expensive process is
referred to as backfilling. Recently, the idea of backward compatible training
(BCT) was proposed. To avoid the cost of backfilling, BCT modifies training of
the new model to make its representations compatible with those of the old
model. However, BCT can significantly hinder the performance of the new model.
In this work, we propose a new learning paradigm for representation learning:
forward compatible training (FCT). In FCT, when the old model is trained, we
also prepare for a future unknown version of the model. We propose learning
side-information, an auxiliary feature for each sample which facilitates future
updates of the model. To develop a powerful and flexible framework for model
compatibility, we combine side-information with a forward transformation from
old to new embeddings. Training of the new model is not modified, hence, its
accuracy is not degraded. We demonstrate significant retrieval accuracy
improvement compared to BCT for various datasets: ImageNet-1k (+18.1%),
Places-365 (+5.4%), and VGG-Face2 (+8.3%). FCT obtains model compatibility when
the new and old models are trained across different datasets, losses, and
architectures.
Authors' comments: 14 pages with appendix. In proceedings at the conference on Computer
Vision and Pattern Recognition 2022
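The data flow described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' method: the forward transformation is assumed here to be a single linear map `W` with hand-set weights, whereas in practice it would be learned, and the names `forward_transform`, `gallery`, and `W` are hypothetical. The point is only that the stored old embedding and its side-information are mapped into the new embedding space, so the gallery does not need to be backfilled.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward_transform(W, old_embedding, side_info):
    """Map (old embedding, side-information) into the new embedding space."""
    return matvec(W, old_embedding + side_info)

# Stored gallery: old 2-d embeddings plus 1-d side-information (toy values).
gallery = {
    "a": ([1.0, 0.0], [0.5]),
    "b": ([0.0, 1.0], [-0.5]),
}

# Hypothetical transformation weights; assumed learned once the new model exists.
W = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, -1.0],
]

# Transform stored features instead of recomputing them with the new model.
transformed = {
    key: forward_transform(W, emb, side) for key, (emb, side) in gallery.items()
}

# A query embedded by the (unmodified) new model searches the transformed gallery.
new_query = [1.0, -0.2]
best = max(transformed, key=lambda key: dot(new_query, transformed[key]))
```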
Guillaume Couairon, Matthieu Cord, Matthijs Douze, Holger Schwenk
Latent text representations exhibit geometric regularities, such as the
famous analogy: queen is to king what woman is to man. Such structured semantic
relations were not demonstrated on image representations. Recent works aiming
at bridging this semantic gap embed images and text into a multimodal space,
enabling the transfer of text-defined transformations to the image modality. We
introduce the SIMAT dataset to evaluate the task of Image Retrieval with
Multimodal queries. SIMAT contains 6k images and 18k textual transformation
queries that aim at either replacing scene elements or changing pairwise
relationships between scene elements. The goal is to retrieve an image
consistent with the (source image, text transformation) query. We use an
image/text matching oracle (OSCAR) to assess whether the image transformation
is successful. The SIMAT dataset will be publicly available. We use SIMAT to
evaluate the geometric properties of multimodal embedding spaces trained with
an image/text matching objective, like CLIP. We show that vanilla CLIP
embeddings are not very well suited to transform images with delta vectors, but
that a simple finetuning on the COCO dataset can bring dramatic improvements.
We also study whether it is beneficial to leverage pretrained universal
sentence encoders (FastText, LASER and LaBSE).
Authors' comments: accepted at O-DRUM (CVPR workshop 2022)
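Transforming an image embedding with a text-defined delta vector, as the abstract above describes, can be sketched as follows. This is a toy illustration under stated assumptions: the 3-d embeddings are made up, the scaling factor `scale` and the helper names are hypothetical, and cosine similarity is assumed as the retrieval score.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def delta_query(image_emb, source_text_emb, target_text_emb, scale=1.0):
    """Shift an image embedding by the delta (target text - source text)."""
    return [
        i + scale * (t - s)
        for i, s, t in zip(image_emb, source_text_emb, target_text_emb)
    ]

def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))

# Toy multimodal space: made-up image and text embeddings.
gallery = {
    "dog_on_grass": [1.0, 0.0, 1.0],
    "cat_on_grass": [0.0, 1.0, 1.0],
    "cat_on_sofa": [0.0, 1.0, -1.0],
}
text = {"dog": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0]}

# Query: source image of a dog on grass, transformation "dog -> cat".
query = delta_query(gallery["dog_on_grass"], text["dog"], text["cat"])
best = retrieve(query, gallery)
```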
Jiwei Zhang, Yi Yu, Suhua Tang, Jianming Wu, Wei Li
Cross-modal retrieval uses one modality as a query to retrieve data from another modality, and has become a popular topic in information retrieval, machine learning, and databases. How to effectively measure the similarity between data of different modalities is the major challenge of cross-modal retrieval. Although several research works have calculated the correlation between different modality data by learning a common subspace representation, the encoder's ability to extract features from multi-modal information is not satisfactory. In this paper, we present a novel variational autoencoder (VAE) architecture for audio-visual cross-modal retrieval, which learns paired audio-visual correlation embedding and category correlation embedding as constraints to reinforce the mutuality of audio-visual information. On the one hand, an audio encoder and a visual encoder separately encode audio data and visual data into two different latent spaces. Further, two mutual latent spaces are respectively constructed by canonical correlation analysis (CCA). On the other hand, probabilistic modeling methods are used to deal with possible noise and missing information in the data. In this way, the cross-modal discrepancies from intra-modal and inter-modal information are simultaneously eliminated in the joint embedding subspace. We conduct extensive experiments on two benchmark datasets. The experimental results show that the proposed architecture is effective in learning audio-visual correlation and is appreciably better than existing cross-modal retrieval methods.
Dariusz R. Kowalski, Dominik Pajak
Quantitative Group Testing (QGT) is about learning a (hidden) subset $K$ of some large domain $N$ using a sequence of queries, where the result of a query provides information about the size of the intersection of the query with the unknown subset $K$. Almost all previous work focused on randomized algorithms minimizing the number of queries; however, in the case of large domains $N$, randomization may result in a significant deviation from the expected precision. Others assumed unlimited computational power (existential results) or adaptiveness of queries. In this work we propose efficient non-adaptive deterministic QGT algorithms for constructing queries and reconstructing a hidden set $K$ from the results of the queries, without using randomization, adaptiveness or unlimited computational power. The efficiency is three-fold. First, in terms of an almost-optimal number of queries: we improve it by a factor of nearly $|K|$ compared to previous constructive results. Second, our algorithms construct the queries and reconstruct set $K$ in polynomial time. Third, they work for any hidden set $K$, as well as multi-sets, and even if the results of the queries are capped at $\sqrt{|K|}$. We also analyze how often elements occur in queries and its impact on parallelization and fault-tolerance of the query system.
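The query model in the abstract above (each query returns the size of its intersection with the hidden set, optionally capped) can be sketched as follows. This illustrates only the model, not the paper's construction: the bit-based query family and the decoder below are a textbook design that is assumed here for illustration and recovers the hidden set only when it contains a single element. All function names are hypothetical.

```python
def query_result(query, hidden, cap=None):
    """Size of the intersection of a query with the hidden set, optionally capped."""
    size = len(set(query) & set(hidden))
    return size if cap is None else min(size, cap)

def bit_queries(domain_size):
    """Non-adaptive queries fixed in advance: query b holds elements with bit b set."""
    bits = max(1, (domain_size - 1).bit_length())
    return [
        [x for x in range(domain_size) if (x >> b) & 1]
        for b in range(bits)
    ]

def decode_single(results):
    """Recover a single hidden element from its bit-query results."""
    return sum(r << b for b, r in enumerate(results))

queries = bit_queries(8)           # all queries are chosen before any result is seen
hidden = {5}                       # hidden set with one element
results = [query_result(q, hidden) for q in queries]
recovered = decode_single(results)

# A capped oracle reports at most `cap`, even if the true intersection is larger.
capped = query_result(range(8), {1, 2, 3, 4}, cap=2)
```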
Chengyin Xu, Zenghao Chai, Zhengzhuo Xu, Hongjia Li, Qiruyi Zuo, Lingyu Yang, Chun Yuan
Deep hashing has shown promising performance in large-scale image retrieval. However, latent codes extracted by Deep Neural Networks (DNNs) inevitably lose semantic information during the binarization process, which damages retrieval accuracy and makes the task challenging. Although many existing approaches perform regularization to alleviate quantization errors, we identify an inherent conflict between metric learning and quantization learning. The metric loss penalizes inter-class distances to push different classes apart without constraint. Worse still, it tends to make the latent code deviate from the ideal binarization point and generates severe ambiguity in the binarization process. Based on the minimum distance of the binary linear code, we propose the Hashing-guided Hinge Function (HHF) to avoid this conflict. In detail, a carefully designed inflection point, which depends on the hash bit length and the number of categories, is explicitly adopted to balance the metric term and the quantization term. This modification prevents the network from falling into local metric optima in deep hashing. Extensive experiments on CIFAR-10, CIFAR-100, ImageNet, and MS-COCO show that HHF consistently outperforms existing techniques, and is robust and flexible enough to be transplanted into other methods. Code is available at https://github.com/JerryXu0129/HHF.
Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, Matei Zaharia
Neural information retrieval (IR) has greatly advanced search and other
knowledge-intensive language tasks. While many neural IR methods encode queries
and documents into single-vector representations, late interaction models
produce multi-vector representations at the granularity of each token and
decompose relevance modeling into scalable token-level computations. This
decomposition has been shown to make late interaction more effective, but it
inflates the space footprint of these models by an order of magnitude. In this
work, we introduce ColBERTv2, a retriever that couples an aggressive residual
compression mechanism with a denoised supervision strategy to simultaneously
improve the quality and space footprint of late interaction. We evaluate
ColBERTv2 across a wide range of benchmarks, establishing state-of-the-art
quality within and outside the training domain while reducing the space
footprint of late interaction models by 6--10$\times$.
Authors' comments: NAACL 2022. Omar and Keshav contributed equally to this work
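The token-level relevance computation that late interaction decomposes into can be sketched with the MaxSim operator commonly associated with ColBERT-style models: each query token vector is matched to its most similar document token vector, and the maxima are summed. The abstract does not spell out the scoring function, so this is an assumption, and the 2-d token vectors below are made up. The sketch also shows why the space footprint grows: each document stores one vector per token rather than a single vector.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy multi-vector representations: one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]               # covers only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
ranking = sorted(["doc_a", "doc_b"],
                 key=lambda name: {"doc_a": score_a, "doc_b": score_b}[name],
                 reverse=True)
```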
Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou
Recently, by introducing large-scale datasets and strong transformer networks,
video-language pre-training has shown great success, especially for retrieval.
Yet, existing video-language transformer models do not explicitly perform
fine-grained semantic alignment. In this work, we present Object-aware Transformers, an
object-centric approach that extends the video-language transformer to incorporate
object representations. The key idea is to leverage the bounding boxes and
object tags to guide the training process. We evaluate our model on three
standard sub-tasks of video-text matching on four widely used benchmarks. We
also provide deep analysis and detailed ablation about the proposed method. We
show clear improvement in performance across all tasks and datasets considered,
demonstrating the value of a model that incorporates object representations
into a video-language architecture. The code will be released at
https://github.com/FingerRec/OA-Transformer.
Authors' comments: CVPR2022; Code: https://github.com/FingerRec/OA-Transformer
Stan Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, Lingmin Ran, Mike Zheng Shou
It is still a pipe dream that personal AI assistants on the phone and AR
glasses can assist our daily life in addressing our questions like "how to
adjust the date for this watch?" and "how to set its heating duration? (while
pointing at an oven)". The queries used in conventional tasks (i.e. Video
Question Answering, Video Retrieval, Moment Localization) are often factoid and
based on pure text. In contrast, we present a new task called Task-oriented
Question-driven Video Segment Retrieval (TQVSR). Each of our questions is an
image-box-text query that focuses on affordance of items in our daily life and
expects relevant answer segments to be retrieved from a corpus of instructional
video-transcript segments. To support the study of this TQVSR task, we
construct a new dataset called AssistSR. We design novel guidelines to create
high-quality samples. This dataset contains 3.2k multimodal questions on 1.6k
video segments from instructional videos on diverse daily-used items. To
address TQVSR, we develop a simple yet effective model called Dual Multimodal
Encoders (DME) that significantly outperforms several baseline methods while
still having large room for improvement in the future. Moreover, we present
detailed ablation analyses. Code and data are available at
https://github.com/StanLei52/TQVSR.
Authors' comments: 20 pages, 12 figures
Konstantin Schall, Kai Uwe Barthel, Nico Hezel, Klaus Jung
Even though it has been extensively shown that retrieval-specific training of deep neural networks is beneficial for nearest neighbor image search quality, most of these models are trained and tested in the domain of landmark images. However, some applications use images from various other domains and therefore need a network with good generalization properties: a general-purpose CBIR model. To the best of our knowledge, no testing protocol has so far been introduced to benchmark models with respect to general image retrieval quality. After analyzing popular image retrieval test sets, we decided to manually curate GPR1200, an easy-to-use and accessible but challenging benchmark dataset with a broad range of image categories. This benchmark is subsequently used to evaluate various pretrained models of different architectures on their generalization qualities. We show that large-scale pretraining significantly improves retrieval performance and present experiments on how to further increase these properties by appropriate fine-tuning. With these promising results, we hope to increase interest in the research topic of general-purpose CBIR.
Gustavo Penha, Arthur Câmara, Claudia Hauff
Heavily pre-trained transformers for language modelling, such as BERT, have
been shown to be remarkably effective for Information Retrieval (IR) tasks,
typically applied to re-rank the results of a first-stage retrieval model. IR
benchmarks evaluate the effectiveness of retrieval pipelines based on the
premise that a single query is used to instantiate the underlying information
need. However, previous research has shown that (I) queries generated by users
for a fixed information need are extremely variable and, in particular, (II)
neural models are brittle and often make mistakes when tested with modified
inputs. Motivated by those observations we aim to answer the following
question: how robust are retrieval pipelines with respect to different
variations in queries that do not change the queries' semantics? In order to
obtain queries that are representative of users' querying variability, we first
created a taxonomy based on the manual annotation of transformations occurring
in a dataset (UQV100) of user-created query variations. For each
syntax-changing category of our taxonomy, we employed different automatic
methods that when applied to a query generate a query variation. Our
experimental results across two datasets for two IR tasks reveal that retrieval
pipelines are not robust to these query variations, with effectiveness drops of
$\approx20\%$ on average. The code and datasets are available at
https://github.com/Guzpenha/query_variation_generators.
Authors' comments: Accepted for publication in the 44th European Conference on
Information Retrieval (ECIR'22). V3: Fixed Table 2
Dingrong Wang, Hitesh Sapkota, Xumin Liu, Qi Yu
Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims at finding a
specific image from a large gallery given a query sketch. Despite the
widespread applicability of FG-SBIR in many critical domains (e.g., crime
activity tracking), existing approaches still suffer from a low accuracy while
being sensitive to external noises such as unnecessary strokes in the sketch.
The retrieval performance will further deteriorate under a more practical
on-the-fly setting, where only a partially complete sketch with a few
(noisy) strokes is available to retrieve corresponding images. We propose a
novel framework that leverages a uniquely designed deep reinforcement learning
model that performs a dual-level exploration to deal with partial sketch
training and attention region selection. By enforcing the model's attention on
the important regions of the original sketches, it remains robust to
unnecessary stroke noise and improves the retrieval accuracy by a large margin.
To sufficiently explore partial sketches and locate the important regions to
attend to, the model performs bootstrapped policy gradient for global exploration
while adjusting a standard deviation term that governs a locator network for
local exploration. The training process is guided by a hybrid loss that
integrates a reinforcement loss and a supervised loss. A dynamic ranking reward
is developed to fit the on-the-fly image retrieval process using partial
sketches. The extensive experimentation performed on three public datasets
shows that our proposed approach achieves the state-of-the-art performance on
partial sketch based image retrieval.
Authors' comments: 2021 IEEE International Conference on Data Mining (ICDM)
Rong Fu, Tianyao Huang, Lei Wang, Yimin Liu
As a typical signal processing problem, multidimensional harmonic retrieval
(MHR) has been adapted to a wide range of applications in signal processing.
Block-sparse signals, whose nonzero entries appear in clusters, have
received much attention recently. An unfolded network, named Ada-BlockLISTA,
was proposed to recover a block-sparse signal at a small computational cost,
which learns an individual weight matrix for each block. However, as the number
of network parameters is increasingly associated with the number of blocks, the
demand for parameter reduction becomes very significant, especially for
large-scale MHR. Based on the dictionary characteristics in two-dimensional
(2D) harmonic retrieval problems, we introduce a weight coupling structure to
shrink Ada-BlockLISTA, which significantly reduces the number of weights
without performance degradation. In simulations, our proposed block-sparse
reconstruction network, named AdaBLISTA-CP, shows excellent recovery
performance and convergence speed in 2D harmonic retrieval problems.
Authors' comments: 2 pages, 2 figures, 13 conferences
Sheng Jin, Xiaojian Ding, Su Wang, Yao Dong, Jianghui Ji
Here we present an open source Python-based Bayesian orbit retrieval code
(Nii) that implements an automatic parallel tempering Markov chain Monte Carlo
(APT-MCMC) strategy. Nii provides a module to simulate the observations of a
space-based astrometry mission in the search for exoplanets, a signal
extraction process for differential astrometric measurements using multiple
reference stars, and an orbital parameter retrieval framework using APT-MCMC.
We further verify the orbit retrieval ability of the code through two examples
corresponding to a single-planet system and a dual-planet system. In both
cases, efficient convergence on the posterior probability distribution can be
achieved. Although this code specifically focuses on the orbital parameter
retrieval problem of differential astrometry, Nii can also be widely used in
other Bayesian analysis applications.
Authors' comments: Accepted for publication in MNRAS
Zijian Gao, Jingyu Liu, Weiqi Sun, Sheng Chen, Dedan Chang, Lili Zhao
Modern video-text retrieval frameworks basically consist of three parts: a video encoder, a text encoder, and a similarity head. With the success of both visual and textual representation learning, transformer-based encoders and fusion methods have also been adopted in the field of video-text retrieval. In this report, we present CLIP2TV, which aims to explore where the critical elements lie in transformer-based methods. To achieve this, we first revisit some recent works on multi-modal learning, then introduce some techniques into video-text retrieval, and finally evaluate them through extensive experiments in different configurations. Notably, CLIP2TV achieves 52.9@R1 on the MSR-VTT dataset, outperforming the previous SOTA result by 4.1%.
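The R@1 figure quoted above is a recall-at-K metric: the fraction of text queries whose matching video appears among the top K results. A minimal sketch of how such a number can be computed from a similarity matrix is given below. It assumes that query `i` is paired with video `i`, and the similarity values are made up; nothing here reflects the CLIP2TV evaluation code.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy text-to-video similarity matrix: rows are queries, columns are videos.
similarity = [
    [0.9, 0.1, 0.2],   # query 0: correct video ranked first
    [0.3, 0.2, 0.8],   # query 1: correct video ranked last
    [0.1, 0.4, 0.7],   # query 2: correct video ranked first
]

r_at_1 = recall_at_k(similarity, 1)
r_at_3 = recall_at_k(similarity, 3)
```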
Yauhen Yakimenka, Hsuan-Yin Lin, Eirik Rosnes, Jörg Kliewer
Private information retrieval protocols guarantee that a user can privately
and losslessly retrieve a single file from a database stored across multiple
servers. In this work, we propose to simultaneously relax the conditions of
perfect retrievability and privacy in order to obtain improved download rates
when all files are stored uncoded on a single server. Information leakage is
measured in terms of the average success probability for the server of
correctly guessing the identity of the desired file. The main findings are: i)
The derivation of the optimal tradeoff between download rate, distortion, and
information leakage when the file size is infinite. Closed-form expressions of
the optimal tradeoff for the special cases of "no-leakage" and "no-privacy" are
also given. ii) A novel approach based on linear programming (LP) to construct
schemes for a finite file size and an arbitrary number of files. The proposed
LP approach can be leveraged to find provably optimal schemes with
corresponding closed-form expressions for the rate-distortion-leakage tradeoff
when the database contains at most four bits.
Finally, for a database that contains 320 bits, we compare two construction
methods based on the LP approach with a nonconstructive scheme downloading
subsets of files using a finite-length lossy compressor based on random coding.
Authors' comments: 14 pages, 3 figures. Accepted for publication in IEEE Journal on
Selected Areas in Communications, Special Issue on Private Information
Retrieval, Private Coded Computing over Distributed Servers, and Privacy in
Distributed Learning
Salima Mdhaffar, Jean-François Bonastre, Marc Tommasi, Natalia Tomashenko, Yannick Estève
The widespread availability of powerful personal devices capable of collecting their users' voices has opened the opportunity to build speaker-adapted automatic speech recognition (ASR) systems or to participate in collaborative learning of ASR. In both cases, personalized acoustic models (AM), i.e. AM fine-tuned with specific speaker data, can be built. A question that naturally arises is whether the dissemination of personalized acoustic models can leak personal information. In this paper, we show that it is possible to retrieve the gender of the speaker, and also their identity, by just exploiting the weight matrix changes of a neural acoustic model locally adapted to this speaker. Incidentally, we observe phenomena that may be useful towards explainability of deep neural networks in the context of speech processing. Gender can be identified almost surely using only the first layers, and speaker verification performs well when using middle-up layers. Our experimental study on the TED-LIUM 3 dataset with HMM/TDNN models shows an accuracy of 95% for gender detection, and an Equal Error Rate of 9.07% for a speaker verification task, by only exploiting the weights from personalized models that could be exchanged instead of user data.
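The Equal Error Rate reported above is the operating point at which the false acceptance rate and the false rejection rate of a verification system coincide. A minimal sketch of one way to estimate it from trial scores follows; the threshold sweep over observed scores is an assumed, simple procedure, and the scores are made up rather than taken from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping thresholds over all observed scores."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Toy verification scores (higher means "same speaker").
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]

eer = equal_error_rate(genuine_scores, impostor_scores)
```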
Ding Li, Rui Wu, Yongqiang Tang, Zhizhong Zhang, Wensheng Zhang
Video moment retrieval aims to search for the moment most relevant to a given
language query. However, most existing methods in this community require
temporal boundary annotations, which are expensive and time-consuming to label.
Hence, weakly supervised methods that use only coarse video-level labels have
been put forward recently. Despite their effectiveness, these methods usually process
moment candidates independently, ignoring a critical issue: the
natural temporal dependencies between candidates at different temporal scales.
To cope with this issue, we propose a Multi-scale 2D Representation Learning
method for weakly supervised video moment retrieval. Specifically, we first
construct a two-dimensional map for each temporal scale to capture the temporal
dependencies between candidates. Two dimensions in this map indicate the start
and end time points of these candidates. Then, we select top-K candidates from
each scale-varied map with a learnable convolutional neural network. With a
newly designed Moments Evaluation Module, we obtain the alignment scores of the
selected candidates. Finally, the similarity between captions and the language
query serves as supervision for further training the candidate selector.
Experiments on two benchmark datasets Charades-STA and ActivityNet Captions
demonstrate that our approach achieves superior performance to state-of-the-art
results.
Authors' comments: 8 pages, 4 figures. Accepted for publication in 2020 25th
International Conference on Pattern Recognition (ICPR)
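The two-dimensional candidate map described in the abstract above, whose axes index the start and end of each moment, can be sketched as follows. This is an illustration under stated assumptions: the video is assumed to be split into equal-length clips, each temporal scale is assumed to correspond to a different number of clips, and the candidate scores are hand-set, whereas the abstract describes selection with a learnable convolutional network. The function names are hypothetical.

```python
def build_2d_map(num_clips, clip_len):
    """Entry (i, j) is the moment from the start of clip i to the end of clip j."""
    return [
        [(i * clip_len, (j + 1) * clip_len) if j >= i else None
         for j in range(num_clips)]
        for i in range(num_clips)
    ]

def top_k_candidates(scores, k):
    """Pick the k best-scoring valid candidates (end index >= start index)."""
    n = len(scores)
    valid = [(scores[i][j], i, j) for i in range(n) for j in range(i, n)]
    valid.sort(reverse=True)
    return [(i, j) for _, i, j in valid[:k]]

# Two temporal scales for an 8-second video: number of clips -> seconds per clip.
scales = {4: 2.0, 2: 4.0}
maps = {n: build_2d_map(n, length) for n, length in scales.items()}

# Hand-set scores for the 4-clip scale; entries below the diagonal are unused.
scores = [
    [0.1, 0.2, 0.3, 0.4],
    [0.0, 0.9, 0.5, 0.2],
    [0.0, 0.0, 0.6, 0.8],
    [0.0, 0.0, 0.0, 0.7],
]
top2 = top_k_candidates(scores, 2)
top2_moments = [maps[4][i][j] for i, j in top2]
```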
Philipp Grohs, Lukas Liehr
We prove that there exists no window function $g \in L^2(\mathbb{R})$ and no
lattice $\mathcal{L} \subset \mathbb{R}^2$ such that every $f \in
L^2(\mathbb{R})$ is determined up to a global phase by spectrogram samples
$|V_gf(\mathcal{L})|$ where $V_gf$ denotes the short-time Fourier transform of
$f$ with respect to $g$. Consequently, the forward operator $f \mapsto
|V_gf(\mathcal{L})|$ mapping a square-integrable function to its spectrogram
samples on a lattice is never injective on the quotient space $L^2(\mathbb{R})
/ {\sim}$ with $f \sim h$ identifying two functions which agree up to a
multiplicative constant of modulus one. We further elaborate on this result
and point out that under mild conditions on the lattice $\mathcal{L}$,
functions which produce identical spectrogram samples but do not agree up to a
unimodular constant can be chosen to be real-valued. The derived results
highlight that in the discretization of the STFT phase retrieval problem from
lattice measurements, a prior restriction of the underlying signal space to a
proper subspace of $L^2(\mathbb{R})$ is inevitable.
Authors' comments: 19 pages, 3 figures
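The phrase "determined up to a global phase" in the abstract above refers to the fact that $f$ and $e^{i\theta} f$ always produce the same spectrogram, which is why the forward operator is considered on the quotient space. A small numerical sketch of this trivial ambiguity follows. It uses a finite discrete STFT as an assumed stand-in for the continuous transform $V_g f$, with a made-up signal and window, and it does not reproduce the paper's result, which concerns functions that differ by more than a global phase.

```python
import cmath
import math

def stft(signal, window, hop):
    """Discrete short-time Fourier transform: one DFT per windowed frame."""
    n = len(window)
    frames = []
    for start in range(0, len(signal) - n + 1, hop):
        segment = [signal[start + k] * window[k] for k in range(n)]
        frames.append([
            sum(segment[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)
        ])
    return frames

def spectrogram(signal, window, hop):
    """Magnitudes of the STFT coefficients (phase is discarded)."""
    return [[abs(c) for c in frame] for frame in stft(signal, window, hop)]

# Made-up complex signal f and a Hann-like window g.
f = [complex(math.sin(0.3 * t), math.cos(0.7 * t)) for t in range(16)]
g = [0.5 - 0.5 * math.cos(2 * math.pi * k / 8) for k in range(8)]

# h agrees with f up to a multiplicative constant of modulus one.
phase = cmath.exp(1j * 1.234)
h = [phase * x for x in f]

spec_f = spectrogram(f, g, hop=4)
spec_h = spectrogram(h, g, hop=4)
max_diff = max(
    abs(a - b)
    for row_f, row_h in zip(spec_f, spec_h)
    for a, b in zip(row_f, row_h)
)
```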
Eunji Lee, Sundong Kim, Sihyun Kim, Sungwon Park, Meeyoung Cha, Soyeon Jung, Suyoung Yang, Yeonsoo Choi et al.
The task of assigning internationally accepted commodity codes (HS codes) to traded goods and validating them is one of the critical functions of the customs office. This decision is crucial to importers and exporters, as it determines the tariff rate. However, similar to court decisions made by judges, the task can be non-trivial even for experienced customs officers. The current paper proposes a deep learning model to assist with this challenging HS code classification. Together with the Korea Customs Service, we built a decision model based on KoELECTRA that suggests the most likely heading and subheadings (i.e., the first four and six digits) of the HS code. Evaluation on 129,084 past cases shows that the top-3 suggestions made by our model have an accuracy of 95.5% in classifying 265 subheadings. This promising result implies that algorithms may substantially reduce the time and effort taken by customs officers by assisting with the HS code classification task.
Jinbao Zhu, Qifa Yan, Xiaohu Tang
The problem of Multi-user Blind $X$-secure $T$-colluding Symmetric Private Information Retrieval from a Maximum Distance Separable (MDS) coded storage system with $B$ Byzantine and $U$ unresponsive servers (U-B-MDS-MB-XTSPIR) is studied in this paper. Specifically, a database consisting of multiple files, each labeled by $M$ indices, is stored in a distributed system with $N$ servers according to $(N,K+X)$ MDS codes over $\mathbb{F}_q$ such that any group of up to $X$ colluding servers learns nothing about the data files. There are $M$ users, where each user $m$, $m=1,\ldots,M$, privately selects an index $\theta_m$ and wishes to jointly retrieve the file specified by the $M$ users' indices $(\theta_1,\ldots,\theta_M)$ from the storage system, while keeping its index $\theta_m$ private from any $T_m$ colluding servers. There exist $B$ Byzantine servers that can maliciously send arbitrary responses to confuse the users retrieving the desired file, and $U$ unresponsive servers that will not respond at all. In addition, each user must not learn any information about the other users' indices or about the database beyond the desired file. A U-B-MDS-MB-XTSPIR scheme is constructed based on Lagrange encoding. The scheme achieves a retrieval rate of $1-\frac{K+X+T_1+\ldots+T_M+2B-1}{N-U}$ with secrecy rate $\frac{K+X+T_1+\ldots+T_M-1}{ N-(K+X+T_1+\ldots+T_M+2B+U-1)}$ over a finite field of size $q\geq N+\max\{K, N-(K+X+T_1+\ldots+T_M+2B+U-1)\}$ for any number of files.
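The closed-form expressions in the abstract above can be evaluated with exact arithmetic for a concrete parameter choice. The sketch below simply transcribes the three formulas; the function name and the example parameters ($N=20$, $K=2$, $X=1$, $T_1=T_2=1$, $B=1$, $U=2$) are assumptions made for illustration. The positivity check on the denominator is added here so that the rates are well defined.

```python
from fractions import Fraction

def xtspir_rates(N, K, X, T, B, U):
    """Evaluate the retrieval rate, secrecy rate, and field-size lower bound.

    T is the list (T_1, ..., T_M) of per-user collusion parameters.
    """
    s = K + X + sum(T)                       # K + X + T_1 + ... + T_M
    slack = N - (s + 2 * B + U - 1)          # denominator of the secrecy rate
    if slack <= 0:
        raise ValueError("parameters leave no room for a positive rate")
    retrieval_rate = 1 - Fraction(s + 2 * B - 1, N - U)
    secrecy_rate = Fraction(s - 1, slack)
    min_field_size = N + max(K, slack)       # lower bound on q
    return retrieval_rate, secrecy_rate, min_field_size

# Example: 20 servers, 2 users each tolerating one colluding server.
rate, secrecy, min_q = xtspir_rates(N=20, K=2, X=1, T=[1, 1], B=1, U=2)
```

For these example parameters the retrieval rate evaluates to $1-\frac{6}{18}=\frac{2}{3}$, the secrecy rate to $\frac{4}{12}=\frac{1}{3}$, and the field-size bound to $q\geq 20+\max\{2,12\}=32$.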