Meilin Yang, Jian Xu, Yang Liu, Wenbo Ding
Deep hashing has been widely applied in large-scale data retrieval due to its superior retrieval efficiency and low storage cost. However, data are often scattered in data silos with privacy concerns, so performing centralized data storage and retrieval is not always possible. Leveraging the concept of federated learning (FL) to perform deep hashing is a recent research trend. However, existing frameworks mostly rely on the aggregation of the local deep hashing models, which are trained by performing similarity learning with local skewed data only. Therefore, they cannot work well for non-IID clients in a real federated environment. To overcome these challenges, we propose a novel federated hashing framework that enables participating clients to jointly train the shared deep hashing model by leveraging the prototypical hash codes for each class. Globally, the transmission of global prototypes with only one prototypical hash code per class will minimize the impact of communication cost and privacy risk. Locally, the use of global prototypes is maximized by jointly training a discriminator network and the local hashing network. Extensive experiments on benchmark datasets are conducted to demonstrate that our method can significantly improve the performance of the deep hashing model in federated environments with non-IID data distributions.
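As a rough illustration of what "one prototypical hash code per class" could look like, the sketch below builds a prototype by a per-bit majority vote over the binary codes of each class and assigns a query code to the nearest prototype in Hamming distance. The function names and the majority-vote rule are assumptions made for illustration; the abstract does not say how the prototypes are computed, and the toy codes are made up.

```python
def class_prototypes(codes, labels):
    """Per-bit majority vote over the +/-1 hash codes of each class.

    The majority-vote rule is an illustrative assumption, not the paper's method.
    """
    sums = {}
    for code, label in zip(codes, labels):
        acc = sums.setdefault(label, [0] * len(code))
        for i, bit in enumerate(code):
            acc[i] += bit
    # Ties (sum == 0) are resolved to +1 by convention in this sketch.
    return {label: [1 if s >= 0 else -1 for s in acc] for label, acc in sums.items()}


def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def nearest_prototype(query, prototypes):
    """Label of the prototype closest to the query in Hamming distance."""
    return min(prototypes, key=lambda label: hamming(query, prototypes[label]))


# Toy 4-bit codes for two classes (made-up data).
codes = [[1, 1, -1, -1], [1, 1, 1, -1], [1, 1, -1, -1],
         [-1, -1, 1, 1], [-1, 1, 1, 1], [-1, -1, 1, 1]]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
prototypes = class_prototypes(codes, labels)
print(prototypes)
print(nearest_prototype([1, -1, -1, -1], prototypes))
```

Only the per-class prototypes, not the individual codes, would need to be shared in such a scheme, which is the property the abstract ties to lower communication cost.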
Philipp Grohs, Lukas Liehr
The reconstruction of a function from its spectrogram (i.e., the absolute
value of its short-time Fourier transform (STFT)) arises as a key problem in
several important applications, including coherent diffraction imaging and
audio processing. It is a classical result that for suitable windows any
function can, in principle, be uniquely recovered up to a global phase factor
from its spectrogram. However, for most practical applications only discrete
samples - typically from a lattice - of the spectrogram are available. This
raises the question of whether lattice samples of the spectrogram contain
sufficient information for determining a function $f\in L^2(\mathbb{R}^d)$ up
to a global phase factor. In the present paper, we answer this question in the
negative by providing general non-identifiability results which lead to a
non-uniqueness theory for the sampled STFT phase retrieval problem. Precisely,
given any dimension $d$, any window function $g$ and any (symplectic or
separable) lattice $\mathcal{L} \subseteq \mathbb{R}^d$, we construct pairs of
functions $f,h\in L^2(\mathbb{R}^d)$ that do not agree up to a global phase
factor, but whose spectrograms agree on $\mathcal{L}$. Our techniques are
sufficiently flexible to produce counterexamples to unique recoverability under
even more stringent assumptions; for example, if the window function is
real-valued, the functions $f,h$ can even be chosen to satisfy $|f|=|h|$. Our
results thus reveal the non-existence of a critical sampling density in the
absence of phase information, a property which is in stark contrast to
uniqueness results in time-frequency analysis.
Authors' comments: 35 pages, 3 figures
Mengxue Du, Shasha Li, Jie Yu, Jun Ma, Bin Ji, Huijun Liu, Wuhang Lin, Zibo Yi
Document retrieval enables users to find their required documents accurately
and quickly. To satisfy the requirement of retrieval efficiency, prevalent deep
neural methods adopt a representation-based matching paradigm, which saves
online matching time by pre-storing document representations offline. However,
the above paradigm consumes vast local storage space, especially when storing
the document as word-grained representations. To tackle this, we present TGTR,
a Topic-Grained Text Representation-based Model for document retrieval.
Following the representation-based matching paradigm, TGTR stores the document
representations offline to ensure retrieval efficiency, whereas it
significantly reduces the storage requirements by using novel topic-grained
representations rather than traditional word-grained ones. Experimental results
demonstrate that TGTR is consistently competitive with word-grained baselines
on TREC CAR and MS MARCO in terms of retrieval accuracy, while requiring less
than 1/10 of the storage space they require. Moreover,
TGTR overwhelmingly surpasses global-grained baselines in terms of retrieval
accuracy.
Authors' comments: Accepted to ICANN2022
Jianan Chen, Lu Zhang, Qiong Wang, Cong Bai, Kidiyo Kpalma
Cross-modal retrieval has drawn much attention in both the computer vision and natural language processing domains. With the development of convolutional and recurrent neural networks, the bottleneck of retrieval across image-text modalities is no longer the extraction of image and text features but the learning of an efficient loss function in the embedding space. Many loss functions try to pull pairwise features from heterogeneous modalities closer together. This paper proposes a method for learning a joint embedding of images and texts using an intra-modal constraint loss function to reduce the violation of negative pairs from the same homogeneous modality. Experimental results show that our approach outperforms state-of-the-art bi-directional image-text retrieval methods on the Flickr30K and Microsoft COCO datasets. Our code is publicly available: https://github.com/CanonChen/IMC.
Min Yang, Cheng Cui, Xuetong Xue, Hui Ren, Kai Wei
This paper presents the 2nd place solution to the Google Landmark Retrieval Competition 2020. We propose a method for training a global feature model for landmark retrieval without post-processing such as local features and spatial verification. Our retrieval method in this competition has two parts. The training scheme mainly consists of increasing the margin value of the arcmargin loss and increasing the image resolution step by step. Models are trained with the PaddlePaddle and PyTorch frameworks and then converted to TensorFlow 2.2. Using this method, we obtained a public score of 0.40176 and a private score of 0.36278 and achieved 2nd place in the Google Landmark Retrieval Competition 2020.
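To make the "increasing margin" idea concrete, here is a minimal sketch of an additive angular margin (ArcFace-style) logit computation evaluated at several margin values. It assumes that "arcmargin loss" refers to this common formulation; the scale of 30, the margin values 0.1 to 0.3, and the toy cosines are hypothetical and are not taken from the solution described above.

```python
import math


def arc_margin_logits(cosines, target, margin, scale=30.0):
    """Add an angular margin to the target-class angle, then scale all cosines.

    `scale=30.0` is an illustrative default, not a value from the paper.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))
        if j == target:
            theta = math.acos(c)
            # Clamp so theta + margin stays within [0, pi]; a simplification for this sketch.
            c = math.cos(min(math.pi, theta + margin))
        logits.append(scale * c)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


# Cosine similarities between one embedding and three class centres (made-up values).
cosines = [0.8, 0.3, -0.1]
for stage, margin in enumerate([0.1, 0.2, 0.3]):  # hypothetical margin schedule
    loss = cross_entropy(arc_margin_logits(cosines, 0, margin), 0)
    print(f"stage {stage}: margin={margin:.1f} loss={loss:.6f}")
```

For a fixed embedding, a larger margin lowers the target logit and raises the loss, so a growing margin makes the objective progressively stricter.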
Jinbin Bai, Chunhui Liu, Feiyue Ni, Haofan Wang, Mengying Hu, Xiaofeng Guo, Lele Cheng
Video-text retrieval is a class of cross-modal representation learning problems, where the goal is to select, from a pool of candidate videos, the video that corresponds to a given text query. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and unified transformer architecture, and demonstrated the power of a joint latent space. Despite this, the intrinsic divergence between the visual domain and textual domain is still far from being eliminated, and projecting different modalities into a joint latent space might distort the information within a single modality. To overcome the above issue, we present a novel mechanism for learning the translation relationship from a source modality space $\mathcal{S}$ to a target modality space $\mathcal{T}$ without the need for a joint latent space, which bridges the gap between visual and textual domains. Furthermore, to keep cycle consistency between translations, we adopt a cycle loss involving both forward translations from $\mathcal{S}$ to the predicted target space $\mathcal{T'}$, and backward translations from $\mathcal{T'}$ back to $\mathcal{S}$. Extensive experiments conducted on the MSR-VTT, MSVD, and DiDeMo datasets demonstrate the superiority and effectiveness of our LaT approach compared with vanilla state-of-the-art methods.
Clive Gomes, Hyejin Park, Patrick Kollman, Yi Song, Iffanice Houndayi, Ankit Shah
This project involved participation in the DCASE 2022 Competition (Task 6)
which had two subtasks: (1) Automated Audio Captioning and (2) Language-Based
Audio Retrieval. The first subtask involved the generation of a textual
description for audio samples, while the goal of the second was to find audio
samples within a fixed dataset that match a given description. For both
subtasks, the Clotho dataset was used. The models were evaluated on BLEU1,
BLEU2, BLEU3, ROUGEL, METEOR, CIDEr, SPICE, and SPIDEr scores for audio
captioning and R1, R5, R10 and mARP10 scores for audio retrieval. We have
conducted a handful of experiments that modify the baseline models for these
tasks. Our final architecture for Automated Audio Captioning is close to the
baseline performance, while our model for Language-Based Audio Retrieval has
surpassed its counterpart.
Authors' comments: DCASE 2022 Competition (Task 6)
Alistair Moffat
A sequence of recent papers has considered the role of measurement scales in information retrieval (IR) experimentation, and presented the argument that (only) uniform-step interval scales should be used, and hence that well-known metrics such as reciprocal rank, expected reciprocal rank, normalized discounted cumulative gain, and average precision, should be either discarded as measurement tools, or adapted so that their metric values lie at uniformly-spaced points on the number line. These papers paint a rather bleak picture of past decades of IR evaluation, at odds with the community's overall emphasis on practical experimentation and measurable improvement. Our purpose in this work is to challenge that position. In particular, we argue that mappings from categorical and ordinal data to sets of points on the number line are valid provided there is an external reason for each target point to have been selected. We first consider the general role of measurement scales, and of categorical, ordinal, interval, ratio, and absolute data collections. In connection with the first two of those categories we also provide examples of the knowledge that is captured and represented by numeric mappings to the real number line. Focusing then on information retrieval, we argue that document rankings are categorical data, and that the role of an effectiveness metric is to provide a single value that represents the usefulness to a user or population of users of any given ranking, with usefulness able to be represented as a continuous variable on a ratio scale. That is, we argue that current IR metrics are well-founded, and, moreover, that those metrics are more meaningful in their current form than in the proposed "intervalized" versions.
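For readers unfamiliar with the metrics under discussion, the following sketch computes reciprocal rank and a simplified nDCG, and shows that the values reciprocal rank can take (1, 1/2, 1/3, ...) are not uniformly spaced on the number line. The nDCG here normalizes by the ideal reordering of the same list, which is a simplification; the function names are illustrative.

```python
import math


def reciprocal_rank(relevance):
    """1/rank of the first relevant item, or 0.0 if nothing relevant is retrieved."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the same gains in ideal order (simplified normalization)."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Reciprocal rank when the first relevant document sits at rank 1, 2, 3, 4.
rr_values = [reciprocal_rank([0] * k + [1]) for k in range(4)]
# Differences between consecutive attainable values shrink with depth.
steps = [a - b for a, b in zip(rr_values, rr_values[1:])]
print(rr_values)
print(steps)
print(ndcg([1, 2, 3]))
```

The shrinking step sizes are the non-uniformity that the "uniform-step interval scale" argument objects to, and that this paper argues is acceptable when each target value has an external justification.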
Sebastian Hofstätter, Jiecao Chen, Karthik Raman, Hamed Zamani
This paper studies multi-task training of retrieval-augmented generation
models for knowledge-intensive tasks. We propose to clean the training set by
utilizing a distinct property of knowledge-intensive generation: The connection
of query-answer pairs to items in the knowledge base. We filter training
examples via a threshold of confidence on the relevance labels, whether a pair
is answerable by the knowledge base or not. We train a single Fusion-in-Decoder
(FiD) generator on seven combined tasks of the KILT benchmark. The experimental
results suggest that our simple yet effective approach substantially improves
over competitive baselines on two strongly imbalanced tasks, and shows either
smaller improvements or no significant regression on the remaining tasks.
Furthermore, we demonstrate our multi-task training with relevance label
sampling scales well with increased model capacity and achieves
state-of-the-art results in five out of seven KILT tasks.
Authors' comments: Accepted at the ICML 2022 Workshop on Knowledge Retrieval and
Language Models (KRLM)
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei
In this paper, we propose SimLM (Similarity matching with Language Model
pre-training), a simple yet effective pre-training method for dense passage
retrieval. It employs a simple bottleneck architecture that learns to compress
the passage information into a dense vector through self-supervised
pre-training. We use a replaced language modeling objective, which is inspired
by ELECTRA, to improve the sample efficiency and reduce the mismatch of the
input distribution between pre-training and fine-tuning. SimLM only requires
access to an unlabeled corpus, and is more broadly applicable when there are no
labeled data or queries. We conduct experiments on several large-scale passage
retrieval datasets, and show substantial improvements over strong baselines
under various settings. Remarkably, SimLM even outperforms multi-vector
approaches such as ColBERTv2, which incurs significantly more storage cost. Our
code and model checkpoints are available at
https://github.com/microsoft/unilm/tree/master/simlm .
Authors' comments: Accepted to ACL 2023
Elias Ramzi, Nicolas Audebert, Nicolas Thome, Clément Rambour, Xavier Bitot
Image retrieval is commonly evaluated with Average Precision (AP) or Recall@k. Yet those metrics are limited to binary labels and do not take into account the severity of errors. This paper introduces a new hierarchical AP training method for pertinent image retrieval (HAP-PIER). HAPPIER is based on a new H-AP metric, which leverages a concept hierarchy to refine AP by integrating the importance of errors and to better evaluate rankings. To train deep models with H-AP, we carefully study the problem's structure and design a smooth lower bound surrogate combined with a clustering loss that ensures consistent ordering. Extensive experiments on 6 datasets show that HAPPIER significantly outperforms state-of-the-art methods for hierarchical retrieval, while being on par with the latest approaches when evaluating fine-grained ranking performance. Finally, we show that HAPPIER leads to better organization of the embedding space, and prevents the most severe failure cases of non-hierarchical methods. Our code is publicly available at: https://github.com/elias-ramzi/HAPPIER.
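The limitation of binary labels can be seen in a small sketch: two rankings whose single mistake differs in severity receive identical AP and Recall@k once labels are reduced to relevant / not relevant. The class names are made up, and Recall@k is taken here as the fraction of relevant items found in the top k (some retrieval papers use a hit-or-miss variant instead). This shows only the standard binary metrics, not the H-AP metric itself.

```python
def average_precision(relevance):
    """AP for a binary relevance list, assuming every relevant item appears in the ranking."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def recall_at_k(relevance, k, num_relevant):
    """Fraction of the relevant items that appear in the top k positions."""
    return sum(relevance[:k]) / num_relevant if num_relevant else 0.0


query_class = "labrador"
# Two rankings with one error each: a mild one (another dog) and a severe one (a truck).
ranking_mild = ["labrador", "poodle", "labrador"]
ranking_severe = ["labrador", "truck", "labrador"]

binary_mild = [int(c == query_class) for c in ranking_mild]
binary_severe = [int(c == query_class) for c in ranking_severe]

print(average_precision(binary_mild), average_precision(binary_severe))
print(recall_at_k(binary_mild, 2, 2), recall_at_k(binary_severe, 2, 2))
```

Both rankings collapse to the same binary list, so neither metric can tell the mild error from the severe one.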
Yuxin Song, Ruolin Zhu, Min Yang, Dongliang He
Deeply learned representations have achieved superior image retrieval
performance in a retrieve-then-rerank manner. The recent state-of-the-art
single-stage model, which heuristically fuses local and global features, achieves
a promising trade-off between efficiency and effectiveness. However, we notice
that the efficiency of existing solutions is still restricted by their
multi-scale inference paradigm. In this paper, we follow the single-stage art
and obtain a better complexity-effectiveness balance by getting rid
of multi-scale testing. To achieve this goal, we abandon the widely used
convolutional network, given its limitations in exploring diverse visual patterns,
and resort to a fully attention-based framework for robust representation
learning, motivated by the success of the Transformer. Besides applying a Transformer
for global feature extraction, we devise a local branch composed of
window-based multi-head attention and spatial attention to fully exploit local
image patterns. Furthermore, we propose to combine the hierarchical local and
global features via a cross-attention module, instead of using heuristic
fusion as previous art does. With our Deep Attentive Local and Global modeling
framework (DALG), extensive experimental results show that efficiency can be
significantly improved while maintaining results competitive with the state of
the art.
Authors' comments: 8 pages, 6 figures
Maik Fröbe, Christopher Akiki, Martin Potthast, Matthias Hagen
Neural retrieval models are often trained on (subsets of) the millions of
queries of the MS MARCO / ORCAS datasets and then tested on the 250 Robust04
queries or other TREC benchmarks with often only 50 queries. In such setups,
many of the few test queries can be very similar to queries from the huge
training data -- in fact, 69% of the Robust04 queries have near-duplicates in
MS MARCO / ORCAS. We investigate the impact of this unintended train-test
leakage by training neural retrieval models on combinations of a fixed number
of MS MARCO / ORCAS queries that are highly similar to the actual test queries
and an increasing number of other queries. We find that leakage can improve
effectiveness and even change the ranking of systems. However, these effects
diminish as the amount of leakage among all training instances decreases and
thus becomes more realistic.
Authors' comments: To appear at the 29th International Symposium on String Processing
and Information Retrieval (SPIRE 2022)
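A minimal sketch of how near-duplicate test queries might be flagged is given below. It uses token-set Jaccard similarity with a hypothetical threshold of 0.8; the abstract does not state which similarity measure or threshold the authors used, and the queries are toy strings.

```python
def jaccard(a, b):
    """Jaccard similarity between the lower-cased token sets of two queries."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def leaked_test_queries(test_queries, train_queries, threshold=0.8):
    """Test queries with at least one training query at or above the threshold.

    The threshold value is a hypothetical choice for this sketch.
    """
    return [q for q in test_queries
            if any(jaccard(q, t) >= threshold for t in train_queries)]


# Toy data, not drawn from any of the datasets named above.
train = ["international organized crime", "what is the hubble telescope"]
test = ["International Organized Crime", "hubble telescope achievements"]

leaked = leaked_test_queries(test, train)
print(leaked)
print(len(leaked) / len(test))
```

The quadratic comparison is fine for a few hundred test queries against a small sample; matching against millions of training queries would call for an index or hashing scheme.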
Dacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao et al.
This paper proposes a new "decompose-and-edit" paradigm for the text-based
speech insertion task that facilitates arbitrary-length speech insertion and
even full sentence generation. In the proposed paradigm, global and local
factors in speech are explicitly decomposed and separately manipulated to
achieve high speaker similarity and continuous prosody. Specifically, we
propose to represent the global factors by multiple tokens, which are
extracted by cross-attention operation and then injected back by link-attention
operation. Due to the rich representation of global factors, we manage to
achieve high speaker similarity in a zero-shot manner. In addition, we
introduce a prosody smoothing task to make the local prosody factor
context-aware and therefore achieve satisfactory prosody continuity. We further
achieve high voice quality with an adversarial training stage. In the
subjective test, our method achieves state-of-the-art performance in both
naturalness and similarity. Audio samples can be found at
https://ydcustc.github.io/retrieverTTS-demo/.
Authors' comments: 5 pages, 1 figure, 3 tables. Accepted by Interspeech 2022
Zhongzheng Lin, Jianqi Hu, Yujie Chen, Camille-Sophie Brès, Siyuan Yu
Orbital angular momentum (OAM) spectrum diagnosis is a fundamental building block for diverse OAM-based systems. Among others, the simple on-axis interferometric measurement can retrieve the amplitude and phase information of complex OAM spectra in a few shots. Yet, its single-shot retrieval remains elusive, due to the signal-signal beat interference inherent in the measurement. Here, we introduce the concept of the Kramers-Kronig (KK) receiver in coherent communications to the OAM domain, enabling rigorous, single-shot OAM spectrum measurement. We explain in detail the working principle and the requirements of the KK method, and then apply the technique to precisely measure various characteristic OAM states. In addition, we discuss the effects of the carrier-to-signal power ratio and the number of sampling points essential for rigorous retrieval, and evaluate the performance on a large set of random OAM spectra and high-dimensional spaces. Single-shot KK interferometry shows enormous potential for characterizing complex OAM states in real time.
Burak Satar, Hongyuan Zhu, Xavier Bresson, Joo Hwee Lim
With the emergence of social media, voluminous video clips are uploaded every
day, and retrieving the most relevant visual content with a language query
becomes critical. Most approaches aim to learn a joint embedding space for
plain textual and visual contents without adequately exploiting their
intra-modality structures and inter-modality correlations. This paper proposes
a novel transformer that explicitly disentangles the text and video into
semantic roles of objects, spatial contexts and temporal contexts with an
attention scheme to learn the intra- and inter-role correlations among the
three roles to discover discriminative features for matching at different
levels. Preliminary results on the popular YouCook2 dataset indicate that our
approach surpasses a current state-of-the-art method by a large margin in all
metrics. It also surpasses two SOTA methods in terms of two metrics.
Authors' comments: Camera-ready for ICIP 2021
Guile Wu, Chao Zhang, Stephan Liwicki
Unsupervised image retrieval aims to learn an efficient retrieval system
without expensive data annotations, but most existing methods rely heavily on
handcrafted feature descriptors or pre-trained feature extractors. To minimize
human supervision, a recent advance proposes deep fully unsupervised image
retrieval, which aims to train a deep model from scratch to jointly optimize
visual features and quantization codes. However, the existing approach mainly
focuses on instance contrastive learning without considering underlying
semantic structure information, resulting in sub-optimal performance. In this
work, we propose a novel self-supervised consistent quantization approach to
deep fully unsupervised image retrieval, which consists of part consistent
quantization and global consistent quantization. In part consistent
quantization, we devise part neighbor semantic consistency learning with
codeword diversity regularization. This allows the model to discover the
underlying neighbor structure information of sub-quantized representations as
self-supervision. In global consistent quantization, we employ contrastive
learning for both embedding and quantized representations and fuse these
representations for
consistent contrastive regularization between instances. This can make up for
the loss of useful representation information during quantization and
regularize consistency between instances. With a unified learning objective of
part and global consistent quantization, our approach exploits richer
self-supervision cues to facilitate model learning. Extensive experiments on
three benchmark datasets show the superiority of our approach over the
state-of-the-art methods.
Authors' comments: 10 pages, 5 figures
Vishal Batchu, Grey Nearing, Varun Gulshan
We develop a deep learning based convolutional-regression model that
estimates the volumetric soil moisture content in the top ~5 cm of soil. Input
predictors include Sentinel-1 (active radar), Sentinel-2 (optical imagery), and
SMAP (passive radar) as well as geophysical variables from SoilGrids and
modelled soil moisture fields from GLDAS. The model was trained and evaluated
on data from ~1300 in-situ sensors globally over the period 2015 - 2021 and
obtained an average per-sensor correlation of 0.727 and ubRMSE of 0.054, and
can be used to produce a soil moisture map at a nominal 320m resolution. These
results are benchmarked against 13 other soil moisture works at different
locations, and an ablation study was used to identify important predictors.
Authors' comments: 58 pages, 21 tables, 26 figures
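The two reported scores can be computed per sensor as in the sketch below, which uses the usual definitions: Pearson correlation, and unbiased RMSE (ubRMSE), i.e. the RMSE after removing the mean bias between prediction and observation. The time series are made-up numbers, and averaging the per-sensor values over all sensors is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def rmse(pred, obs):
    """Root-mean-square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred))


def ubrmse(pred, obs):
    """RMSE computed on mean-centred series, so a constant offset does not count."""
    mp, mo = sum(pred) / len(pred), sum(obs) / len(obs)
    return math.sqrt(
        sum(((p - mp) - (o - mo)) ** 2 for p, o in zip(pred, obs)) / len(pred)
    )


# Made-up soil moisture series for a single sensor.
obs = [0.10, 0.20, 0.30, 0.40]
pred = [0.12, 0.18, 0.35, 0.41]
bias = sum(p - o for p, o in zip(pred, obs)) / len(obs)

print(pearson(pred, obs))
print(rmse(pred, obs), ubrmse(pred, obs), bias)
```

The three quantities are linked by RMSE^2 = bias^2 + ubRMSE^2, so ubRMSE isolates the error that remains once a constant offset is removed.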
Lawrence Yunliang Chen, Huang Huang, Michael Danielczuk, Jeffrey Ichnowski, Ken Goldberg
Shelves are commonly used to store objects in homes, stores, and warehouses.
We formulate the problem of Optimal Shelf Arrangement (OSA), where the goal is
to optimize the arrangement of objects on a shelf for access time given an
access frequency and movement cost for each object. We propose OSA-MIP, a
mixed-integer program (MIP), show that it finds an optimal solution for OSA
under certain conditions, and provide bounds on its suboptimal solutions in
general cost settings. We analytically characterize a necessary and sufficient
shelf density condition for which there exists an arrangement such that any
object can be retrieved without removing objects from the shelf. Experimental
data from 1,575 simulated shelf trials and 54 trials with a physical Fetch
robot equipped with a pushing blade and suction grasping tool suggest that
arranging the objects optimally reduces the expected retrieval cost by 60-80%
in fully-observed configurations and reduces the expected search cost by 50-70%
while increasing the search success rate by up to 2x in partially-observed
configurations.
Authors' comments: 2022 IEEE 18th International Conference on Automation Science and
Engineering (CASE)
Gyungin Shin, Weidi Xie, Samuel Albanie
Semantic segmentation has a broad range of applications, but its real-world
impact has been significantly limited by the prohibitive annotation costs
necessary to enable deployment. Segmentation methods that forgo supervision can
side-step these costs, but exhibit the inconvenient requirement to provide
labelled examples from the target distribution to assign concept names to
predictions. An alternative line of work in language-image pre-training has
recently demonstrated the potential to produce models that can both assign
names across large vocabularies of concepts and enable zero-shot transfer for
classification, but do not demonstrate commensurate segmentation abilities. In
this work, we strive to achieve a synthesis of these two approaches that
combines their strengths. We leverage the retrieval abilities of one such
language-image pre-trained model, CLIP, to dynamically curate training sets
from unlabelled images for arbitrary collections of concept names, and leverage
the robust correspondences offered by modern image representations to
co-segment entities among the resulting collections. The synthetic segment
collections are then employed to construct a segmentation model (without
requiring pixel labels) whose knowledge of concepts is inherited from the
scalable pre-training process of CLIP. We demonstrate that our approach, termed
Retrieve and Co-segment (ReCo), performs favourably compared to unsupervised segmentation
approaches while inheriting the convenience of nameable predictions and
zero-shot transfer. We also demonstrate ReCo's ability to generate specialist
segmenters for extremely rare objects.
Authors' comments: Tech report. Code: https://github.com/NoelShin/reco