Carlos Lassance, Thibault Formal, Stephane Clinchant
We propose a Composite Code Sparse Autoencoder (CCSA) approach for
Approximate Nearest Neighbor (ANN) search of document representations based on
Siamese-BERT models. In Information Retrieval (IR), the ranking pipeline is
generally decomposed into two stages: the first stage focuses on retrieving a
candidate set from the whole collection. The second stage re-ranks the
candidate set by relying on more complex models. Recently, Siamese-BERT models
have been used as first-stage rankers to replace or complement the traditional
bag-of-words models. However, indexing and searching a large document collection
requires efficient similarity search on dense vectors, and this is why ANN
techniques come into play. Since composite codes are naturally sparse, we first
show how CCSA can learn an efficient parallel inverted index thanks to a
uniformity regularizer. Second, CCSA can be used as a binary quantization
method and we propose to combine it with the recent graph-based ANN techniques.
Our experiments on the MSMARCO dataset reveal that CCSA outperforms IVF with
product quantization. Furthermore, CCSA binary quantization is beneficial for
the index size and memory usage of the graph-based HNSW method, while
maintaining a good level of recall and MRR. Third, we compare with recent
supervised quantization methods for image retrieval and find that CCSA is able
to outperform them.
Authors' comments: 24 pages, longer version of short paper accepted at SIGIR 2021
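As a rough illustration of why binary quantization shrinks an index, the sketch below packs sign-binarized vectors into integers and ranks them by Hamming distance. The sign-based binarizer and the names `binarize`, `hamming`, and `search` are assumptions made for this example; CCSA learns its codes with an autoencoder, which is not reproduced here.

```python
from typing import List, Sequence


def binarize(vec: Sequence[float]) -> int:
    """Pack the sign pattern of a dense vector into one integer (1 bit per dimension)."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def search(query: Sequence[float], index: List[int], k: int) -> List[int]:
    """Return the ids of the k codes closest to the query in Hamming distance."""
    q = binarize(query)
    return sorted(range(len(index)), key=lambda i: hamming(q, index[i]))[:k]


# Toy 4-dimensional "document embeddings" (hypothetical values).
docs = [[0.9, -0.2, 0.4, -0.7], [-0.5, 0.3, -0.1, 0.8], [0.7, -0.1, 0.2, 0.6]]
index = [binarize(d) for d in docs]
top = search([1.0, -1.0, 1.0, -1.0], index, k=2)
```

Each document here costs 4 bits instead of four floats; in practice such codes would be stored in packed byte arrays and combined with an ANN structure such as HNSW.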
Javier Morlana, J. M. M. Montiel
We propose a compact pipeline to unify all the steps of Visual Localization:
image retrieval, candidate re-ranking and initial pose estimation, and camera
pose refinement. Our key assumption is that the deep features used for these
individual tasks share common characteristics, so we should reuse them in all
the procedures of the pipeline. Our DRAN (Deep Retrieval and image Alignment
Network) is able to extract global descriptors for efficient image retrieval,
use intermediate hierarchical features to re-rank the retrieval list and
produce an initial pose guess, which is finally refined by means of a
feature-metric optimization based on learned deep multi-scale dense features.
DRAN is the first single network able to produce the features for the three
steps of visual localization. DRAN achieves competitive performance in terms of
robustness and accuracy under challenging conditions in public benchmarks,
outperforming other unified approaches and incurring lower computational and
memory cost than its counterparts using multiple networks. Code and models will
be publicly available at https://github.com/jmorlana/DRAN.
Authors' comments: ICRA 2023
Suyu Ouyang, Yingxia Shao, Ang Li
Institutions of higher learning, research institutes and other scientific
research units have abundant scientific and technological resources of experts
and scholars, and these talents with great scientific and technological
innovation ability are an important force to promote industrial upgrading. The
scientific and technological resources of experts and scholars are mainly
composed of basic attributes and scientific research achievements. The basic
attributes include information such as research interests, institutions, and
educational work experience. However, due to information asymmetry and other
reasons, the scientific and technological resources of experts and scholars
cannot be connected with the society in a timely manner, and social needs
cannot be accurately matched with experts and scholars. Therefore, it is very
necessary to build an expert and scholar information database and provide
relevant expert and scholar retrieval services. This paper sorts out the
related research work in this field from four aspects: text relation
extraction, text knowledge representation learning, text vector retrieval and
visualization system.
Authors' comments: 9 pages
Yang Jiang, Zhe Xue, Ang Li
Since the era of big data, the Internet has been flooded with all kinds of
information. Browsing information through the Internet has become an integral
part of people's daily life. Unlike the news data and social data on the
Internet, the cross-media technology information data has different
characteristics. This data has become an important basis for researchers and
scholars to track the current hot spots and explore the future direction of
technology development. As the volume of science and technology information
data becomes richer, the traditional science and technology information
retrieval system, which only supports unimodal data retrieval and uses an outdated
keyword matching model, can no longer meet the daily retrieval needs of
science and technology scholars. Therefore, in view of the above research
background, it is of profound practical significance to study the cross-media
science and technology information data retrieval system based on deep semantic
features, which is in line with the development trend of domestic and
international technologies.
Authors' comments: We found some errors in the algorithm and need to withdraw this paper
Alejandro Delgado, Charalampos Saitis, Emmanouil Benetos, Mark Sandler
Imitating musical instruments with the human voice is an efficient way of
communicating ideas between music producers, from sketching melody lines to
clarifying desired sonorities. For this reason, there is an increasing interest
in building applications that allow artists to efficiently pick target samples
from big sound libraries just by imitating them vocally. In this study, we
investigated the potential of conditional autoencoder models to learn
informative features for Drum Sample Retrieval by Vocalisation (DSRV). We
assessed the usefulness of their embeddings using four evaluation metrics, two
of them relative to their acoustic properties and two of them relative to their
perceptual properties via human listeners' similarity ratings. Results suggest
that models conditioned on both sound-type labels (drum vs imitation) and
drum-type labels (kick vs snare vs closed hi-hat vs opened hi-hat) learn the
most informative embeddings for DSRV. We finally looked into individual
differences in vocal imitation style via the Mantel test and found salient
differences among participants, highlighting the importance of user information
when designing DSRV systems.
Authors' comments: Submitted to Interspeech 2022 (under review)
Ziyang Yuan, Hongxia Wang, Zhiwei Li, Tao Wang, Hui Wang, Xinchao Huang, Tianjun Li, Ziru Ma et al.
Light-matter interaction is exploited in spectroscopic techniques to access information about molecular, atomic or nuclear constituents of the sample of interest. While scattered light carries both amplitude and phase information of the electromagnetic field, most of the time the latter is lost in intensity measurements. However, often the phase information is paramount to reconstruct the desired information of the target, as it is well known from coherent x-ray imaging. Here we introduce a new phase retrieval algorithm which allows us to reconstruct the field phase information from two-dimensional time- and energy-resolved spectra. We apply this method to the particular case of x-ray scattering off Mössbauer nuclei at a synchrotron radiation source. Knowledge of the phase allows also for an excellent reconstruction of the energy spectra from experimental data, which could not be achieved with this resolution otherwise. Our approach provides an efficient novel data analysis tool which will benefit x-ray quantum optics and Mössbauer spectroscopy with synchrotron radiation alike.
Jonathan Dong, Lorenzo Valzania, Antoine Maillard, Thanh-an Pham, Sylvain Gigan, Michael Unser
Phase retrieval consists in the recovery of a complex-valued signal from intensity-only measurements. As it pervades a broad variety of applications, many researchers have striven to develop phase-retrieval algorithms. Classical approaches involve techniques as varied as generic gradient-descent routines or specialized spectral methods, to name a few. Yet, the phase-recovery problem remains a challenge to this day. Recently, however, advances in machine learning have revitalized the study of phase retrieval in two ways: significant theoretical advances have emerged from the analogy between phase retrieval and single-layer neural networks; practical breakthroughs have been obtained thanks to deep-learning regularization. In this tutorial, we review phase retrieval under a unifying framework that encompasses classical and machine-learning methods. We focus on three key elements: applications, overview of recent reconstruction algorithms, and the latest theoretical results.
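To make the measurement model concrete, the following minimal sketch computes intensity-only measurements of a small complex signal and shows that a global phase rotation of the signal fits the data equally well, which is one reason the problem is hard. The measurement matrix `A`, the signal values, and the function names are hypothetical; the linear map is written without conjugation for brevity.

```python
import cmath
from typing import List


def intensities(A: List[List[complex]], x: List[complex]) -> List[float]:
    """Intensity-only measurements y_i = |a_i . x|^2 for each row a_i of A."""
    return [abs(sum(a * v for a, v in zip(row, x))) ** 2 for row in A]


def loss(A: List[List[complex]], x: List[complex], y: List[float]) -> float:
    """Quadratic data-fit between predicted and observed intensities."""
    return sum((p - t) ** 2 for p, t in zip(intensities(A, x), y))


# Hypothetical measurement vectors and ground-truth signal.
A = [[1 + 0j, 1j], [1 + 0j, -1 + 0j], [2 + 0j, 1 + 1j]]
x_true = [1 + 2j, 0.5 - 1j]
y = intensities(A, x_true)

# Rotating every entry by the same phase leaves all intensities unchanged.
x_rot = [cmath.exp(0.7j) * v for v in x_true]
loss_rot = loss(A, x_rot, y)

# A genuinely different signal does not fit the measurements.
loss_wrong = loss(A, [1 + 0j, 0j], y)
```

Reconstruction algorithms, classical or learned, minimize a loss of this kind and can therefore only recover the signal up to such a global phase.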
Xiaoyuan Guo, Jiali Duan, Saptarshi Purkayastha, Hari Trivedi, Judy Wawira Gichoya, Imon Banerjee
Improving the retrieval relevance on noisy datasets is an emerging need for
the curation of a large-scale clean dataset in the medical domain. While
existing methods can be applied for class-wise retrieval (a.k.a. inter-class),
they cannot distinguish the granularity of likeness within the same class (a.k.a.
intra-class). The problem is exacerbated on medical external datasets, where
noisy samples of the same class are treated equally during training. Our goal
is to identify both intra/inter-class similarities for fine-grained retrieval.
To achieve this, we propose an Outlier-Sensitive Content-based rAdiology
Retrieval System (OSCARS), consisting of two steps. First, we train an outlier
detector on a clean internal dataset in an unsupervised manner. Then we use the
trained detector to generate the anomaly scores on the external dataset, whose
distribution will be used to bin intra-class variations. Second, we propose a
quadruplet (a, p, nintra, ninter) sampling strategy, where intra-class
negatives nintra are sampled from bins of the same class other than the bin
anchor a belongs to, while ninter are randomly sampled from inter-classes. We
suggest a weighted metric learning objective to balance the intra and
inter-class feature learning. We experimented on two representative public
radiography datasets. Experiments show the effectiveness of our approach. The
training and evaluation code can be found in
https://github.com/XiaoyuanGuo/oscars.
Authors' comments: 12 pages, 6 figures, 2 tables
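The quadruplet sampling described in the abstract above can be sketched as follows. This is only an illustration: the `(id, class, bin)` tuple layout, the function name `sample_quadruplet`, and the choice to draw positives from the anchor's own class and bin are assumptions, not details confirmed by the abstract.

```python
import random
from typing import List, Tuple

# (sample_id, class_label, bin_index); bins would be derived from anomaly scores.
Sample = Tuple[str, str, int]


def sample_quadruplet(
    anchor: Sample, data: List[Sample], rng: random.Random
) -> Tuple[Sample, Sample, Sample, Sample]:
    """Draw (a, p, n_intra, n_inter) for one anchor."""
    _, cls, bin_idx = anchor
    # Assumption: positives share both the class and the bin of the anchor.
    positives = [s for s in data if s[1] == cls and s[2] == bin_idx and s != anchor]
    # Intra-class negatives: same class, different bin.
    n_intra = [s for s in data if s[1] == cls and s[2] != bin_idx]
    # Inter-class negatives: any other class.
    n_inter = [s for s in data if s[1] != cls]
    return anchor, rng.choice(positives), rng.choice(n_intra), rng.choice(n_inter)


data = [
    ("a1", "classA", 0),
    ("a2", "classA", 0),
    ("a3", "classA", 1),
    ("b1", "classB", 0),
]
a, p, n_intra, n_inter = sample_quadruplet(data[0], data, random.Random(0))
```

A weighted metric learning loss would then pull `p` toward `a` while pushing `n_intra` and `n_inter` away with different strengths.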
Yupeng Shi, Xiao Liu, Yuxiang Wei, Zhongqin Wu, Wangmeng Zuo
Semantic image synthesis is a challenging task with many practical applications. Although remarkable progress has been made in semantic image synthesis with spatially-adaptive normalization, existing methods normalize the feature activations under coarse-level guidance (e.g., semantic class). However, different parts of a semantic object (e.g., the wheel and window of a car) differ considerably in structure and texture, so blurry synthesis results are usually inevitable when fine-grained guidance is missing. In this paper, we propose a novel normalization module, termed REtrieval-based Spatially AdaptIve normaLization (RESAIL), for introducing pixel level fine-grained guidance to the normalization architecture. Specifically, we first present a retrieval paradigm that finds, for each test semantic mask, a content patch of the same semantic class from the training set with the most similar shape. Then, RESAIL uses the retrieved patch to guide the feature normalization of the corresponding region and can provide pixel level fine-grained guidance, thereby greatly mitigating blurry synthesis results. Moreover, distorted ground-truth images are also utilized as alternatives to retrieval-based guidance for feature normalization, further benefiting model training and improving the visual quality of generated images. Experiments on several challenging datasets show that our RESAIL performs favorably against state-of-the-art methods in terms of quantitative metrics, visual quality, and subjective evaluation. The source code and pre-trained models will be publicly available.
Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius
We introduce an audiovisual method for long-range text-to-video retrieval.
Unlike previous approaches designed for short video retrieval (e.g., 5-15
seconds in duration), our approach aims to retrieve minute-long videos that
capture complex human actions. One challenge of standard video-only approaches
is the large computational cost associated with processing hundreds of densely
extracted frames from such long videos. To address this issue, we propose to
replace parts of the video with compact audio cues that succinctly summarize
dynamic audio events and are cheap to process. Our method, named ECLIPSE
(Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an
audiovisual video setting, by adding a unified audiovisual transformer block
that captures complementary cues from the video and audio streams. In addition
to being 2.92x faster and 2.34x more memory-efficient than long-range video-only
approaches, our method also achieves better text-to-video retrieval accuracy on
several diverse long-range video datasets such as ActivityNet, QVHighlights,
YouCook2, DiDeMo and Charades.
Authors' comments: ECCV 2022 Oral. Project page: https://yanbo.ml/project_page/eclipse/
Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, Yaniv Taigman
Recent text-to-image models have achieved impressive results. However, since they require large-scale datasets of text-image pairs, it is impractical to train them on new domains where data is scarce or not labeled. In this work, we propose using large-scale retrieval methods, in particular, efficient k-Nearest-Neighbors (kNN), which offers novel capabilities: (1) training a substantially small and efficient text-to-image diffusion model without any text, (2) generating out-of-distribution images by simply swapping the retrieval database at inference time, and (3) performing text-driven local semantic manipulations while preserving object identity. To demonstrate the robustness of our method, we apply our kNN approach on two state-of-the-art diffusion backbones, and show results on several different datasets. As evaluated by human studies and automatic metrics, our method achieves state-of-the-art results compared to existing approaches that train text-to-image generation models using images only (without paired text data).
Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, Jimmy Lin
Dense retrieval models using a transformer-based bi-encoder design have emerged as an active area of research. In this work, we focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design. Although recent work with multilingual transformers demonstrates that they exhibit strong cross-lingual generalization capabilities, there remain many open research questions, which we tackle here. Our study is organized as a "best practices" guide for training multilingual dense retrieval models, broken down into three main scenarios: where a multilingual transformer is available, but relevance judgments are not available in the language of interest; where both models and training data are available; and where training data are available but models are not. In considering these scenarios, we gain a better understanding of the role of multi-stage fine-tuning, the strength of cross-lingual transfer under various conditions, the usefulness of out-of-language data, and the advantages of multilingual vs. monolingual transformers. Our recommendations offer a guide for practitioners building search applications, particularly for low-resource languages, and while our work leaves open a number of research questions, we provide a solid foundation for future work.
Robert Litschko, Ivan Vulić, Goran Glavaš
State-of-the-art neural (re)rankers are notoriously data-hungry which --
given the lack of large-scale training data in languages other than English --
makes them rarely used in multilingual and cross-lingual retrieval settings.
Current approaches therefore commonly transfer rankers trained on English data
to other languages and cross-lingual setups by means of multilingual encoders:
they fine-tune all parameters of pretrained massively multilingual Transformers
(MMTs, e.g., multilingual BERT) on English relevance judgments, and then deploy
them in the target language(s). In this work, we show that two
parameter-efficient approaches to cross-lingual transfer, namely Sparse
Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more
effective zero-shot transfer to multilingual and cross-lingual retrieval tasks.
We first train language adapters (or SFTMs) via Masked Language Modelling and
then train retrieval (i.e., reranking) adapters (SFTMs) on top, while keeping
all other parameters fixed. At inference, this modular design allows us to
compose the ranker by applying the (re)ranking adapter (or SFTM) trained with
source language data together with the language adapter (or SFTM) of a target
language. We carry out a large-scale evaluation on the CLEF-2003 and HC4
benchmarks and additionally, as another contribution, extend the former with
queries in three new languages: Kyrgyz, Uyghur and Turkish. The proposed
parameter-efficient methods outperform standard zero-shot transfer with full
MMT fine-tuning, while being more modular and reducing training times. The
gains are particularly pronounced for low-resource languages, where our
approaches also substantially outperform the competitive machine
translation-based rankers.
Authors' comments: COLING 2022
Anna Lueber, Daniel Kitzmann, Brendan P. Bowler, Adam J. Burgasser, Kevin Heng
A large suite of 228 atmospheric retrievals is performed on a curated sample
of 19 brown dwarfs spanning the L0 to T8 spectral types using the open-source
Helios-r2 retrieval code, which implements the method of short characteristics
for radiative transfer and a finite-element description of the
temperature-pressure profile. Surprisingly, we find that cloud-free and cloudy
(both gray and non-gray) models are equally consistent with the archival SpeX
data from the perspective of Bayesian model comparison. Only upper limits for
cloud properties are inferred if log-uniform priors are assumed, but the cloud
optical depth becomes constrained if a uniform prior is used.
Water is detected in all 19 objects and methane is detected in all of the T
dwarfs, but no obvious trend exists across effective temperature. As carbon
monoxide is only detected in a handful of objects, the inferred
carbon-to-oxygen ratios are unreliable. The retrieved radius generally
decreases with effective temperature, but the values inferred for some T dwarfs
are implausibly low and may indicate missing physics or chemistry in the
models. For the early L dwarfs, the retrieved surface gravity depends on
whether the gray or non-gray cloud model is preferred. Future data are
necessary for constraining cloud properties and the vertical variation of
chemical abundances, the latter of which is needed for distinguishing between
the chemical instability versus traditional cloud interpretation of the L-T
transition.
Authors' comments: Accepted for publication in ApJ. The complete figure sets will be
available in the online journal
Pierre-Hugo Vial, Paul Magron, Thomas Oberlin, Cédric Févotte
This paper considers the phase retrieval (PR) problem, which aims to
reconstruct a signal from phaseless measurements such as magnitude or power
spectrograms. PR is generally handled as a minimization problem involving a
quadratic loss. Recent works have considered alternative discrepancy measures,
such as the Bregman divergences, but it is still challenging to tailor the
optimal loss for a given setting. In this paper we propose a novel strategy to
automatically learn the optimal metric for PR. We unfold a recently introduced
ADMM algorithm into a neural network, and we emphasize that the information
about the loss used to formulate the PR problem is conveyed by the proximity
operator involved in the ADMM updates. Therefore, we replace this proximity
operator with trainable activation functions: learning these in a supervised
setting is then equivalent to learning an optimal metric for PR. Experiments
conducted with speech signals show that our approach outperforms the baseline
ADMM, using a light and interpretable neural architecture.
Authors' comments: 10 pages, 5 figures, submitted to IEEE SPL
Shengyao Zhuang, Hang Li, Guido Zuccon
In this paper we study how to effectively exploit implicit feedback in Dense
Retrievers (DRs). We consider the specific case in which click data from a
historic click log is available as implicit feedback. We then exploit such
historic implicit interactions to improve the effectiveness of a DR. A key
challenge that we study is the effect that biases in the click signal, such as
position bias, have on the DRs. To overcome the problems associated with the
presence of such bias, we propose the Counterfactual Rocchio (CoRocchio)
algorithm for exploiting implicit feedback in Dense Retrievers. We demonstrate
both theoretically and empirically that dense query representations learnt with
CoRocchio are unbiased with respect to position bias and lead to higher
retrieval effectiveness. We make available the implementations of the proposed
methods and the experimental framework, along with all results at
https://github.com/ielab/Counterfactual-DR.
Authors' comments: Full paper, accepted at SIGIR 2022
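The general idea of debiasing click feedback before a Rocchio-style query update can be sketched as below. This is not the paper's exact estimator: the function name `corocchio_like`, the position-bias model `(1 / rank) ** eta`, the averaging over clicks, and the default weights are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def corocchio_like(
    query: Sequence[float],
    clicks: List[Tuple[Sequence[float], int]],
    alpha: float = 1.0,
    beta: float = 0.5,
    eta: float = 1.0,
) -> List[float]:
    """Rocchio update with inverse-propensity-weighted clicked documents.

    clicks holds (document_vector, rank) pairs, with rank starting at 1.
    """
    dim = len(query)
    feedback = [0.0] * dim
    for doc, rank in clicks:
        # Assumed examination probability of a result shown at this rank.
        propensity = (1.0 / rank) ** eta
        for j in range(dim):
            # Clicks at rarely examined positions are weighted up.
            feedback[j] += doc[j] / propensity
    n = max(len(clicks), 1)
    return [alpha * query[j] + beta * feedback[j] / n for j in range(dim)]


# One click at rank 1 and one at rank 2 (hypothetical 2-d embeddings).
new_query = corocchio_like([1.0, 0.0], [([0.0, 1.0], 1), ([1.0, 1.0], 2)])
```

With `eta = 1`, the click at rank 2 counts twice as much as the click at rank 1, which compensates for users examining lower positions less often.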
Feifei Pan, Mustafa Canim, Michael Glass, Alfio Gliozzo, James Hendler
Most existing end-to-end Table Question Answering (Table QA) models consist of a two-stage framework with a retriever to select relevant table candidates from a corpus and a reader to locate the correct answers from table candidates. Even though the accuracy of the reader models is significantly improved with the recent transformer-based approaches, the overall performance of such frameworks still suffers from the poor accuracy of using traditional information retrieval techniques as retrievers. To alleviate this problem, we introduce T-RAG, an end-to-end Table QA model, where a non-parametric dense vector index is fine-tuned jointly with BART, a parametric sequence-to-sequence model to generate answer tokens. Given any natural language question, T-RAG utilizes a unified pipeline to automatically search through a table corpus to directly locate the correct answer from the table cells. We apply T-RAG to recent open-domain Table QA benchmarks and demonstrate that the fine-tuned T-RAG model is able to achieve state-of-the-art performance in both the end-to-end Table QA and the table retrieval tasks.
Mengjun Cheng, Yipeng Sun, Longchao Wang, Xiongwei Zhu, Kun Yao, Jie Chen, Guoli Song, Junyu Han et al.
Visual appearance is considered to be the most important cue to understand
images for cross-modal retrieval, while sometimes the scene text appearing in
images can provide valuable information to understand the visual semantics.
Most existing cross-modal retrieval approaches ignore scene text
information, and directly adding this information may lead to performance
degradation in scene text free scenarios. To address this issue, we propose a
full transformer architecture to unify these cross-modal retrieval scenarios in
a single $\textbf{Vi}$sion and $\textbf{S}$cene $\textbf{T}$ext
$\textbf{A}$ggregation framework (ViSTA). Specifically, ViSTA utilizes
transformer blocks to directly encode image patches and fuse scene text
embedding to learn an aggregated visual representation for cross-modal
retrieval. To tackle the modality missing problem of scene text, we propose a
novel fusion token based transformer aggregation approach to exchange the
necessary scene text information only through the fusion token and concentrate
on the most important features in each modality. To further strengthen the
visual modality, we develop dual contrastive learning losses to embed both
image-text pairs and fusion-text pairs into a common cross-modal space.
Compared to existing methods, ViSTA can aggregate relevant scene text
semantics with visual appearance, and hence improves results under both scene
text free and scene text aware scenarios. Experimental results show that ViSTA
outperforms other methods by at least $\bf{8.4}\%$ at Recall@1 on the scene text
aware retrieval task. Compared with state-of-the-art scene text free retrieval
methods, ViSTA can achieve better accuracy on Flickr30K and MSCOCO while
running at least three times faster during the inference stage, which validates
the effectiveness of the proposed framework.
Authors' comments: Accepted by CVPR 2022
Riku Togashi, Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkila, Tetsuya Sakai
Evaluation measures have a crucial impact on the direction of research.
Therefore, it is of utmost importance to develop appropriate and reliable
evaluation measures for new applications where conventional measures are not
well suited. Video Moment Retrieval (VMR) is one such application, and the
current practice is to use R@$K,\theta$ for evaluating VMR systems. However,
this measure has two disadvantages. First, it is rank-insensitive: It ignores
the rank positions of successfully localised moments in the top-$K$ ranked list
by treating the list as a set. Second, it binarizes the Intersection over Union
(IoU) of each retrieved video moment using the threshold $\theta$, thereby
ignoring the fine-grained localisation quality of ranked moments.
We propose an alternative measure for evaluating VMR, called Average Max IoU
(AxIoU), which is free from the above two problems. We show that AxIoU
satisfies two important axioms for VMR evaluation, namely, \textbf{Invariance
against Redundant Moments} and \textbf{Monotonicity with respect to the Best
Moment}, and also that R@$K,\theta$ satisfies the first axiom only. We also
empirically examine how AxIoU agrees with R@$K,\theta$, as well as its
stability with respect to change in the test data and human-annotated temporal
boundaries.
Authors' comments: Accepted by CVPR 2022
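The contrast between the two measures can be illustrated with a small sketch. The AxIoU formula used here (the mean, over cutoffs 1 to K, of the best IoU seen so far) is our reading of the name "Average Max IoU" and should be treated as an assumption; the paper's formal definition may differ. The function names and the toy moments are hypothetical.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start, end) in seconds


def iou(a: Moment, b: Moment) -> float:
    """Temporal Intersection over Union of two moments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k_theta(ranked: List[Moment], gt: Moment, k: int, theta: float) -> float:
    """1.0 if any of the top-k moments reaches IoU >= theta, else 0.0."""
    return 1.0 if any(iou(m, gt) >= theta for m in ranked[:k]) else 0.0


def axiou(ranked: List[Moment], gt: Moment, k: int) -> float:
    """Assumed definition: mean over cutoffs 1..k of the best IoU seen so far."""
    best, total = 0.0, 0.0
    for m in ranked[:k]:
        best = max(best, iou(m, gt))
        total += best
    return total / k


gt = (10.0, 20.0)
good_first = [(10.0, 20.0), (12.0, 20.0), (0.0, 5.0)]
good_last = list(reversed(good_first))

r_first = recall_at_k_theta(good_first, gt, k=3, theta=0.7)
r_last = recall_at_k_theta(good_last, gt, k=3, theta=0.7)
ax_first = axiou(good_first, gt, k=3)
ax_last = axiou(good_last, gt, k=3)
```

Both rankings receive the same R@3,0.7 because the list is treated as a set, whereas the rank-sensitive measure rewards placing the best moment first.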
Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
Audio-text retrieval aims at retrieving a target audio clip or caption from a
pool of candidates given a query in another modality. Solving such a cross-modal
retrieval task is challenging because it not only requires learning robust
feature representations for both modalities, but also requires capturing the
fine-grained alignment between these two modalities. Existing cross-modal
retrieval models are mostly optimized by metric learning objectives as both of
them attempt to map data to an embedding space, where similar data are close
together and dissimilar data are far apart. Unlike other cross-modal retrieval
tasks such as image-text and video-text retrieval, audio-text retrieval is
still an unexplored task. In this work, we aim to study the impact of different
metric learning objectives on the audio-text retrieval task. We present an
extensive evaluation of popular metric learning objectives on the AudioCaps and
Clotho datasets. We demonstrate that NT-Xent loss adapted from self-supervised
learning shows stable performance across different datasets and training
settings, and outperforms the popular triplet-based losses. Our code is
available at https://github.com/XinhaoMei/audio-text_retrieval.
Authors' comments: 5 pages, accepted to Interspeech 2022
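For readers unfamiliar with NT-Xent, the sketch below shows one common cross-modal formulation: temperature-scaled cosine similarities followed by a symmetric cross-entropy over audio-to-text and text-to-audio directions. The symmetric form, the temperature value, and the function names are assumptions for illustration and may not match the authors' implementation.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def _mean_cross_entropy(logits: List[List[float]]) -> float:
    """Average of -log softmax(row)[i], where entry i is the matching pair."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def nt_xent(audio: List[Sequence[float]], text: List[Sequence[float]], tau: float = 0.07) -> float:
    """Symmetric NT-Xent over a batch in which audio[i] is paired with text[i]."""
    sim = [[cosine(a, t) / tau for t in text] for a in audio]
    audio_to_text = _mean_cross_entropy(sim)
    text_to_audio = _mean_cross_entropy([list(col) for col in zip(*sim)])
    return 0.5 * (audio_to_text + text_to_audio)


audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = nt_xent(audio, [[1.0, 0.0], [0.0, 1.0]])
mismatched = nt_xent(audio, [[0.0, 1.0], [1.0, 0.0]])
```

Every other item in the batch acts as a negative, which differs from triplet-based losses that compare one negative at a time.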