Hanqing Lu, Youna Hu, Tong Zhao, Tony Wu, Yiwei Song, Bing Yin
Nowadays, with many e-commerce platforms conducting global business,
e-commerce search systems are required to handle product retrieval under
multilingual scenarios. Moreover, compared with maintaining separate
country-specific e-commerce search systems, a universal system across countries
can further reduce the operational and computational costs, and facilitate
business expansion to new countries. In this paper, we introduce a universal
end-to-end multilingual retrieval system, and discuss our learnings and
technical details when training and deploying the system to serve billion-scale
product retrieval for e-commerce search. In particular, we propose a
multilingual graph attention based retrieval network by leveraging recent
advances in transformer-based multilingual language models and graph neural
network architectures to capture the interactions between search queries and
items in e-commerce search. Offline experiments on data from five countries show
that our algorithm outperforms the state-of-the-art baselines by 35% recall and
25% mAP on average. Moreover, the proposed model shows a significant increase in
conversion/revenue in online A/B experiments and has been deployed in
production for multiple countries.
Authors' comments: Accepted by 2021 Annual Conference of the North American Chapter of
the Association for Computational Linguistics (NAACL 2021)
Zhiyu Chen, Shuo Zhang, Brian D. Davison
We describe the development, characteristics and availability of a test
collection for the task of Web table retrieval, which uses a large-scale Web
Table Corpora extracted from the Common Crawl. Since a Web table usually has
rich context information such as the page title and surrounding paragraphs, we
not only provide relevance judgments of query-table pairs, but also the
relevance judgments of query-table context pairs with respect to a query, which
are ignored by previous test collections. To facilitate future research with
this benchmark, we provide details about how the dataset is pre-processed and
also baseline results from both traditional and recently proposed table
retrieval methods. Our experimental results show that proper usage of context
labels can benefit previous table retrieval methods.
Authors' comments: Accepted as a resource paper in SIGIR 2021
Hiren Galiyawala, Mehul S Raval
Recent advancement of research in biometrics, computer vision, and natural
language processing has discovered opportunities for person retrieval from
surveillance videos using textual query. The prime objective of a surveillance
system is to locate a person using a description, e.g., a short woman with a
pink t-shirt and white skirt carrying a black purse. She has brown hair. Such a
description contains attributes like gender, height, type of clothing, colour
of clothing, hair colour, and accessories. Such attributes are formally known
as soft biometrics. They help bridge the semantic gap between a human
description and a machine as a textual query contains the person's soft
biometric attributes. It is also not feasible to manually search through huge
volumes of surveillance footage to retrieve a specific person. Hence, automatic
person retrieval using vision and language-based algorithms is becoming
popular. In comparison to other state-of-the-art reviews, the contributions of
the paper are as follows: 1. It recommends the most discriminative soft
biometrics for specific challenging conditions. 2. It integrates benchmark
datasets and retrieval methods for objective performance evaluation. 3. It
gives a complete snapshot of techniques based on features, classifiers, number
of soft biometric attributes, type of the deep neural networks, and performance
measures. 4. It provides comprehensive coverage of person retrieval, from
handcrafted feature-based methods to end-to-end approaches based on natural
language description.
Authors' comments: 45 pages, 17 figures, 6 Tables
Yongbiao Chen, Sheng Zhang, Fangxin Liu, Zhigang Chang, Mang Ye, Zhengwei Qi
Deep Hamming hashing has gained growing popularity in approximate nearest neighbour search for large-scale image retrieval. Until now, deep hashing for image retrieval has been dominated by convolutional neural network architectures, e.g. ResNet (He et al., 2016). In this paper, inspired by the recent advancements of vision transformers, we present Transhash, a pure transformer-based framework for deep hashing learning. Concretely, our framework is composed of two major modules: (1) Based on the Vision Transformer (ViT), we design a siamese vision transformer backbone for image feature extraction. To learn fine-grained features, we devise dual-stream feature learning on top of the transformer to learn discriminative global and local features. (2) We adopt a Bayesian learning scheme with a dynamically constructed similarity matrix to learn compact binary hash codes. The entire framework is jointly trained in an end-to-end manner. To the best of our knowledge, this is the first work to tackle deep hashing learning problems without convolutional neural networks (CNNs). We perform comprehensive experiments on three widely studied datasets: CIFAR-10, NUSWIDE and IMAGENET. The experiments demonstrate the superiority of our method over the existing state-of-the-art deep hashing methods. Specifically, we achieve 8.2%, 2.6% and 12.7% performance gains in terms of average mAP for different hash bit lengths on the three public datasets, respectively.
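To make the retrieval side of deep hashing concrete, here is a minimal sketch of ranking by Hamming distance over binary codes. It is not the Transhash model: the feature vectors are made up, and the sign-based binarization and the helper names (to_hash_code, hamming_distance, rank_by_hamming) are assumptions chosen for illustration only.

```python
from typing import List, Sequence, Tuple


def to_hash_code(features: Sequence[float]) -> int:
    """Binarize a real-valued vector by sign and pack the bits into an int."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two packed codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: int, database: List[int]) -> List[Tuple[int, int]]:
    """Return (index, distance) pairs sorted by increasing Hamming distance."""
    distances = [(i, hamming_distance(query, code)) for i, code in enumerate(database)]
    return sorted(distances, key=lambda pair: pair[1])


# Toy 4-bit example with made-up feature vectors (hypothetical data).
database = [
    to_hash_code(v)
    for v in (
        [0.9, -0.2, 0.4, -0.7],
        [-0.5, 0.3, -0.1, 0.8],
        [0.7, -0.6, -0.3, -0.9],
    )
]
query = to_hash_code([0.8, -0.1, 0.5, -0.4])
ranking = rank_by_hamming(query, database)
```

Because each comparison is an XOR followed by a bit count, ranking stays cheap even for large databases, which is the usual motivation for learning compact binary codes.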
Fei Wang, Kexuan Sun, Muhao Chen, Jay Pujara, Pedro Szekely
The task of natural language table retrieval (NLTR) seeks to retrieve
semantically relevant tables based on natural language queries. Existing
learning systems for this task often treat tables as plain text based on the
assumption that tables are structured as dataframes. However, tables can have
complex layouts which indicate diverse dependencies between subtable
structures, such as nested headers. As a result, queries may refer to different
spans of relevant content that is distributed across these structures.
Moreover, such systems fail to generalize to novel scenarios beyond those seen
in the training set. Prior methods are still distant from a generalizable
solution to the NLTR problem, as they fall short in handling complex table
layouts or queries over multiple granularities. To address these issues, we
propose Graph-based Table Retrieval (GTR), a generalizable NLTR framework with
multi-granular graph representation learning. In our framework, a table is
first converted into a tabular graph, with cell nodes, row nodes and column
nodes to capture content at different granularities. Then the tabular graph is
input to a Graph Transformer model that can capture both table cell content and
the layout structures. To enhance the robustness and generalizability of the
model, we further incorporate a self-supervised pre-training task based on
graph-context matching. Experimental results on two benchmarks show that our
method leads to significant improvements over the current state-of-the-art
systems. Further experiments demonstrate promising performance of our method on
cross-dataset generalization, and enhanced capability of handling complex
tables and fulfilling diverse query intents. Code and data are available at
https://github.com/FeiWang96/GTR.
Authors' comments: Accepted by SIGIR 2021
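The idea of converting a table into a graph with cell, row and column nodes can be sketched as follows. This is only an illustration under assumptions: the abstract does not specify the edge scheme, so connecting each cell to its row node and its column node is a guess, and the function name build_tabular_graph and the node encoding are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (kind, row index, column index); -1 marks an index that does not apply.
Node = Tuple[str, int, int]


def build_tabular_graph(
    table: List[List[str]],
) -> Tuple[Dict[Node, str], Set[Tuple[Node, Node]]]:
    """Create cell, row and column nodes and link each cell to its row and column."""
    labels: Dict[Node, str] = {}
    edges: Set[Tuple[Node, Node]] = set()
    for r, row in enumerate(table):
        row_node: Node = ("row", r, -1)
        labels[row_node] = " ".join(row)
        for c, text in enumerate(row):
            col_node: Node = ("col", -1, c)
            cell_node: Node = ("cell", r, c)
            labels[cell_node] = text
            # A column label accumulates the text of every cell in that column.
            labels[col_node] = (labels.get(col_node, "") + " " + text).strip()
            edges.add((cell_node, row_node))
            edges.add((cell_node, col_node))
    return labels, edges


# Made-up table used only for illustration.
table = [["City", "Population"], ["Oslo", "700k"], ["Bergen", "290k"]]
labels, edges = build_tabular_graph(table)
```

Row and column nodes give a model coarse-grained units to match against a query, while cell nodes keep the fine-grained content.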
Soyeong Jeong, Jinheon Baek, ChaeHun Park, Jong C. Park
One of the challenges in information retrieval (IR) is the vocabulary
mismatch problem, which happens when the terms between queries and documents
are lexically different but semantically similar. While recent work has
proposed to expand the queries or documents by enriching their representations
with additional relevant terms to address this challenge, they usually require
a large volume of query-document pairs to train an expansion model. In this
paper, we propose an Unsupervised Document Expansion with Generation (UDEG)
framework with a pre-trained language model, which generates diverse
supplementary sentences for the original document without using labels on
query-document pairs for training. For generating sentences, we further
stochastically perturb their embeddings to generate more diverse sentences for
document expansion. We validate our framework on two standard IR benchmark
datasets. The results show that our framework significantly outperforms
relevant expansion baselines for IR.
Authors' comments: SDP@NAACL2021
Golsa Tahmasebzadeh, Endri Kacupaj, Eric Müller-Budack, Sherzod Hakimov, Jens Lehmann, Ralph Ewerth
In the context of social media, geolocation inference on news or events has
become a very important task. In this paper, we present the GeoWINE
(Geolocation-based Wiki-Image-News-Event retrieval) demonstrator, an effective
modular system for multimodal retrieval which expects only a single image as
input. The GeoWINE system consists of five modules in order to retrieve related
information from various sources. The first module is a state-of-the-art model
for geolocation estimation of images. The second module performs a
geospatial-based query for entity retrieval using the Wikidata knowledge graph.
The third module exploits four different image embedding representations, which
are used to retrieve the entities most similar to the input image. The
embeddings are derived from the tasks of geolocation estimation, place
recognition, ImageNet-based image classification, and their combination. The
last two modules perform news and event retrieval from EventRegistry and the
Open Event Knowledge Graph (OEKG). GeoWINE provides an intuitive interface for
end-users and is insightful for experts for reconfiguration to individual
setups. GeoWINE achieves promising results in entity label prediction for
images on the Google Landmarks dataset. The demonstrator is publicly available at
http://cleopatra.ijs.si/geowine/.
Authors' comments: Accepted for publication in: International ACM SIGIR Conference on
Research and Development in Information Retrieval 2021
Tarun Krishna, Kevin McGuinness, Noel O'Connor
In this work, we evaluate contrastive models for the task of image retrieval.
We hypothesise that models that learn to encode semantic similarity among
instances via discriminative learning should perform well on the task of image
retrieval, where relevancy is defined in terms of instances of the same object.
Through our extensive evaluation, we find that representations from models
trained using contrastive methods perform on par with (and can outperform) a
pre-trained supervised baseline trained on the ImageNet labels in retrieval
tasks under various configurations. This is remarkable given that the
contrastive models require no explicit supervision. Thus, we conclude that
these models can be used to bootstrap base models to build more robust image
retrieval engines.
Authors' comments: Accepted In Proceedings of the 2021 International Conference on
Multimedia Retrieval (ICMR 21)
Mikolaj Wieczorek, Barbara Rychalska, Jacek Dabrowski
The image retrieval task consists of finding images similar to a query image in
a set of gallery (database) images. Such systems are used in various
applications, e.g. person re-identification (ReID) or visual product search.
Despite active development of retrieval models, it remains a challenging
task mainly due to large intra-class variance caused by changes in view angle,
lighting, background clutter or occlusion, while inter-class variance may be
relatively low. A large portion of current research focuses on creating more
robust features and modifying objective functions, usually based on Triplet
Loss. Some works experiment with using a centroid/proxy representation of a
class to alleviate problems with computing speed and hard sample mining used with
Triplet Loss. However, these approaches are used for training alone and
discarded during the retrieval stage. In this paper we propose to use the mean
centroid representation both during training and retrieval. Such an aggregated
representation is more robust to outliers and assures more stable features. As
each class is represented by a single embedding - the class centroid - both
retrieval time and storage requirements are reduced significantly. Aggregating
multiple embeddings results in a significant reduction of the search space due
to lowering the number of candidate target vectors, which makes the method
especially suitable for production deployments. Comprehensive experiments
conducted on two ReID and Fashion Retrieval datasets demonstrate the
effectiveness of our method, which outperforms the current state-of-the-art. We propose
centroid training and retrieval as a viable method for both Fashion Retrieval
and ReID applications.
Authors' comments: 5 pages, 2 figures
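The retrieval-time half of the centroid idea can be sketched in a few lines: average the gallery embeddings of each class, then match a query against one centroid per class instead of every gallery item. This is a simplified sketch, not the authors' implementation; the embeddings are made up, Euclidean distance is an assumed metric, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def class_centroids(
    embeddings: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, List[float]]:
    """Average all embeddings that share a label into one centroid per class."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def nearest_class(query: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is closest to the query (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(query, centroids[label]))


# Made-up 2-D gallery with two classes (hypothetical data).
gallery = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
gallery_labels = ["a", "a", "b", "b"]
centroids = class_centroids(gallery, gallery_labels)
prediction = nearest_class([4.5, 3.0], centroids)
```

Here four gallery vectors collapse into two candidates, which illustrates why the search space and storage shrink with the number of items per class.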
Guilherme Moraes Rosa, Ruan Chaves Rodrigues, Roberto Lotufo, Rodrigo Nogueira
We describe our single submission to task 1 of COLIEE 2021. Our vanilla BM25 got second place, well above the median of submissions. Code is available at https://github.com/neuralmind-ai/coliee.
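For readers unfamiliar with the scoring function, a generic Okapi BM25 sketch follows. It is not the code from the linked repository: the tokenization is plain whitespace splitting, the parameter values k1 = 1.2 and b = 0.75 are common defaults assumed here, the IDF uses a smoothed non-negative variant, and the documents are made up.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.2, b: float = 0.75
) -> List[float]:
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up documents and query (hypothetical data).
docs = [
    "the court dismissed the appeal".split(),
    "the contract was breached by the seller".split(),
    "appeal allowed and the case remanded".split(),
]
scores = bm25_scores("appeal dismissed".split(), docs)
```

The first document matches both query terms and so scores highest, the third matches one term, and the second matches none and scores zero.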
Bijita Sarma, Thomas Busch, Jason Twamley
We show how a quantum state in a microwave cavity mode can be transferred to
and stored in a phononic mode via an intermediate magnon mode in a
magnomechanical system. For this we consider a ferrimagnetic yttrium iron
garnet (YIG) sphere inserted in a microwave cavity, where the microwave and
magnon modes are coupled via a magnetic-dipole interaction and the magnon and
phonon modes in the YIG sphere are coupled via magnetostrictive forces. By
modulating the cavity and magnon detunings and the driving of the magnon mode
in time, a Stimulated Raman Adiabatic Passage (STIRAP)-like coherent transfer
becomes possible between the cavity mode and the phonon mode. The phononic mode
can be used to store the photonic quantum state for long periods as it
possesses lower damping than the photonic and magnon modes. Thus our proposed
scheme offers a possibility of using magnomechanical systems as quantum memory
for photonic quantum information.
Authors' comments: 15 pages, 7 figures
Xiaohan Wang, Linchao Zhu, Yi Yang
Text-video retrieval is a challenging task that aims to search relevant video
contents based on natural language descriptions. The key to this problem is to
measure text-video similarities in a joint embedding space. However, most
existing methods only consider the global cross-modal similarity and overlook
the local details. Some works incorporate the local comparisons through
cross-modal local matching and reasoning. These complex operations introduce
tremendous computation. In this paper, we design an efficient global-local
alignment method. The multi-modal video sequences and text features are
adaptively aggregated with a set of shared semantic centers. The local
cross-modal similarities are computed between the video feature and text
feature within the same center. This design enables the meticulous local
comparison and reduces the computational cost of the interaction between each
text-video pair. Moreover, a global alignment method is proposed to provide a
global cross-modal measurement that is complementary to the local perspective.
The global aggregated visual features also provide additional supervision,
which is indispensable to the optimization of the learnable semantic centers.
We achieve consistent improvements on three standard text-video retrieval
benchmarks and outperform the state-of-the-art by a clear margin.
Authors' comments: Accepted to CVPR 2021
Pablo Torres, Jose M. Saavedra
Sketch-based image retrieval (SBIR) has attracted increasing interest in the computer vision community, with high impact in real applications. For instance, SBIR benefits eCommerce search engines because it allows users to formulate a query just by drawing what they need to buy. However, current methods showing high precision in retrieval work in a high-dimensional space, which negatively affects aspects like memory consumption and processing time. Although some authors have also proposed compact representations, these drastically degrade performance in low dimensions. Therefore, in this work, we present results of evaluating methods for producing compact embeddings in the context of sketch-based image retrieval. Our main interest is in strategies aiming to keep the local structure of the original space. The recent unsupervised local-topology preserving dimension reduction method UMAP fits our requirements and shows outstanding performance, improving even the precision achieved by SOTA methods. We evaluate six methods on two different datasets: Flickr15K and eCommerce; the latter is another contribution of this work. We show that UMAP allows us to have feature vectors of 16 bytes while improving precision by more than 35%.
Xiuwen Zheng, Dheeraj Mekala, Amarnath Gupta, Jingbo Shang
Hashtag annotation for microblog posts has been recently formulated as a sequence generation problem to handle emerging hashtags that are unseen in the training set. The state-of-the-art method leverages conversations initiated by posts to enrich contextual information for the short posts. However, it is unrealistic to assume the existence of conversations before the hashtag annotation itself. Therefore, we propose to leverage news articles published before the microblog post to generate hashtags following a Retriever-Generator framework. Extensive experiments on English Twitter datasets demonstrate superior performance and significant advantages of leveraging news articles to generate hashtags.
Baban Gain, Dibyanayan Bandyopadhyay, Tanik Saikh, Asif Ekbal
Natural Language Processing (NLP) and Information Retrieval (IR) in the
judicial domain are essential tasks. With the availability of domain-specific
data in electronic form and the aid of various artificial intelligence (AI)
technologies, automated language processing becomes easier, and hence it
becomes feasible for researchers and developers to provide various automated
tools to the legal community to reduce the human burden. The Competition on
Legal Information Extraction/Entailment (COLIEE-2019), run in association with
the International Conference on Artificial Intelligence and Law (ICAIL)-2019,
has come up with a few challenging tasks. The shared task defined four
sub-tasks (i.e. Task1, Task2, Task3 and Task4), which could provide a few
automated systems to the judicial system. The paper presents our working note
on the experiments carried out as part of our participation in all the
sub-tasks defined in this shared task. We make use of different Information
Retrieval (IR) and deep learning based approaches to tackle these problems. We
obtain encouraging results in all four sub-tasks.
Authors' comments: 5 pages
Sewon Min, Kenton Lee, Ming-Wei Chang, Kristina Toutanova, Hannaneh Hajishirzi
We study multi-answer retrieval, an under-explored problem that requires
retrieving passages to cover multiple distinct answers for a given question.
This task requires joint modeling of retrieved passages, as models should not
repeatedly retrieve passages containing the same answer at the cost of missing
a different valid answer. In this paper, we introduce JPR, the first joint
passage retrieval model for multi-answer retrieval. JPR makes use of an
autoregressive reranker that selects a sequence of passages, each conditioned
on previously selected passages. JPR is trained to select passages that cover
new answers at each timestep and uses a tree-decoding algorithm to enable
flexibility in the degree of diversity. Compared to prior approaches, JPR
achieves significantly better answer coverage on three multi-answer datasets.
When combined with downstream question answering, the improved retrieval
enables larger answer generation models since they need to consider fewer
passages, establishing a new state-of-the-art.
Authors' comments: 13 pages; Published as a conference paper at EMNLP 2021 (long)
Daniel Heestermans Svendsen, Pablo Morales-Alvarez, Ana Belen Ruescas, Rafael Molina, Gustau Camps-Valls
Parameter retrieval and model inversion are key problems in remote sensing and Earth observation. Currently, different approximations exist: a direct, yet costly, inversion of radiative transfer models (RTMs); the statistical inversion with in situ data that often results in problems with extrapolation outside the study area; and the most widely adopted hybrid modeling by which statistical models, mostly nonlinear and non-parametric machine learning algorithms, are applied to invert RTM simulations. We will focus on the latter. Among the different existing algorithms, in the last decade kernel based methods, and Gaussian Processes (GPs) in particular, have provided useful and informative solutions to such RTM inversion problems. This is in large part due to the confidence intervals they provide, and their predictive accuracy. However, RTMs are very complex, highly nonlinear, and typically hierarchical models, so that often a shallow GP model cannot capture complex feature relations for inversion. This motivates the use of deeper hierarchical architectures, while still preserving the desirable properties of GPs. This paper introduces the use of deep Gaussian Processes (DGPs) for bio-geo-physical model inversion. Unlike shallow GP models, DGPs account for complicated (modular, hierarchical) processes, provide an efficient solution that scales well to big datasets, and improve prediction accuracy over their single layer counterpart. In the experimental section, we provide empirical evidence of performance for the estimation of surface temperature and dew point temperature from infrared sounding data, as well as for the prediction of chlorophyll content, inorganic suspended matter, and coloured dissolved matter from multispectral data acquired by the Sentinel-3 OLCI sensor. The presented methodology allows for more expressive forms of GPs in remote sensing model inversion problems.
Ioana Croitoru, Simion-Vlad Bogolin, Marius Leordeanu, Hailin Jin, Andrew Zisserman, Samuel Albanie, Yang Liu
In recent years, considerable progress on the task of text-video retrieval
has been achieved by leveraging large-scale pretraining on visual and audio
datasets to construct powerful video encoders. By contrast, despite the natural
symmetry, the design of effective algorithms for exploiting large-scale
language pretraining remains under-explored. In this work, we are the first to
investigate the design of such algorithms and propose a novel generalized
distillation method, TeachText, which leverages complementary cues from
multiple text encoders to provide an enhanced supervisory signal to the
retrieval model. Moreover, we extend our method to video side modalities and
show that we can effectively reduce the number of used modalities at test time
without compromising performance. Our approach advances the state of the art on
several video retrieval benchmarks by a significant margin and adds no
computational overhead at test time. Last but not least, we show an effective
application of our method for eliminating noise from retrieval datasets. Code
and data can be found at https://www.robots.ox.ac.uk/~vgg/research/teachtext/.
Authors' comments: ICCV 2021
Shir Gur, Natalia Neverova, Chris Stauffer, Ser-Nam Lim, Douwe Kiela, Austin Reiter
Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement in performance on image-caption retrieval w.r.t. similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel applications for inference time such as hot-swapping indices.
Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma
Ranking has always been one of the top concerns in information retrieval
research. For decades, the lexical matching signal has dominated the ad-hoc
retrieval process, but solely using this signal in retrieval may cause the
vocabulary mismatch problem. In recent years, with the development of
representation learning techniques, many researchers turn to Dense Retrieval
(DR) models for better ranking performance. Although several existing DR models
have already obtained promising results, their performance improvement heavily
relies on the sampling of training examples. Many effective sampling strategies
are not efficient enough for practical usage, and for most of them, a
theoretical analysis of how and why the performance improvement happens is
still lacking. To
shed light on these research questions, we theoretically investigate different
training strategies for DR models and try to explain why hard negative sampling
performs better than random sampling. Through the analysis, we also find that
there are many potential risks in static hard negative sampling, which is
employed by many existing training methods. Therefore, we propose two training
strategies named a Stable Training Algorithm for dense Retrieval (STAR) and a
query-side training Algorithm for Directly Optimizing Ranking pErformance
(ADORE), respectively. STAR improves the stability of the DR training process by
introducing random negatives. ADORE replaces the widely-adopted static hard
negative sampling method with a dynamic one to directly optimize the ranking
performance. Experimental results on two publicly available retrieval benchmark
datasets show that either strategy gains significant improvements over existing
competitive baselines and a combination of them leads to the best performance.
Authors' comments: To be published in SIGIR2021
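The contrast between hard and random negatives can be illustrated with a small sampling sketch. This is not STAR or ADORE: the abstract does not describe how the negatives are combined, so the simple scheme below (top-ranked non-relevant documents plus uniformly sampled ones) is an assumption, and the function name and data are hypothetical.

```python
import random
from typing import List, Sequence, Set


def sample_negatives(
    ranked_doc_ids: Sequence[int],
    positives: Set[int],
    corpus_size: int,
    n_hard: int,
    n_random: int,
    rng: random.Random,
) -> List[int]:
    """Take top-ranked non-relevant documents as hard negatives, then add random ones."""
    hard = [d for d in ranked_doc_ids if d not in positives][:n_hard]
    excluded = positives | set(hard)
    # Random negatives come from the rest of the corpus, sampled without replacement.
    pool = [d for d in range(corpus_size) if d not in excluded]
    return hard + rng.sample(pool, n_random)


# Hypothetical retrieval result for one query over a 20-document corpus.
ranked = [7, 3, 12, 5, 9]
negatives = sample_negatives(
    ranked, positives={3}, corpus_size=20, n_hard=2, n_random=3, rng=random.Random(0)
)
```

If the ranked list is computed once and reused, the hard negatives are static; recomputing it as the model changes would make them dynamic, which is the distinction the abstract draws.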