Xulu Zhang, Zhenqun Yang, Hao Tian, Qing Li, Xiaoyong Wei
Deep learning became a game changer for image retrieval soon after it was introduced. It promotes feature extraction (by representation learning) as the core of image retrieval, with relevance/matching evaluation reduced to simple similarity metrics. In many applications, we need the matching evidence to be indicated rather than just a ranked list (e.g., the locations of the target proteins/cells/lesions in medical images), much as matched words are highlighted in search engines. However, this is not easy to implement without explicit relevance/matching modeling. Deep representation learning models are not suitable for this because of their black-box nature. In this paper, we revisit the importance of relevance/matching modeling in the deep learning era with an indicative retrieval setting. The study shows that it is possible to skip representation learning and model the matching evidence directly. By removing the dependency on pre-trained models, this approach avoids many related issues (e.g., the domain gap between classification and retrieval, the detail diffusion caused by convolution, and so on). More importantly, the study demonstrates that the matching can be explicitly modeled and later backtracked to generate indications of the matching evidence, which can improve the explainability of deep inference. Our method obtains the best performance in the literature on both Oxford-5k and Paris-6k, setting a new record of 97.77% on Oxford-5k (97.81% on Paris-6k) without extracting any deep features.
Jiaohao Weng, Chao Zhang, Xi Yang, Haoran Xie
With the emergence of large-scale open online courses and online academic
conferences, it has become increasingly feasible and convenient to access
online educational resources. However, it is time consuming and challenging to
effectively retrieve and present numerous lecture videos for common users. In
this work, we propose a hierarchical visual interface for retrieving and
summarizing lecture videos. Users can utilize the proposed interface to
effectively explore the required video information through the results of the
video summary generation in different layers. Given the input keywords, we
retrieve a corresponding video layer with timestamps, a frame layer with slides,
and a poster layer summarizing the lecture videos. We verified the
proposed interface with our user study by comparing it with other conventional
interfaces. The results from our user study confirmed that the proposed
interface can achieve high retrieval accuracy and good user experience. See the
video here: https://www.youtube.com/watch?v=zrnejwsOVpc .
Authors' comments: 6 pages, 3 figures, accepted in Proceedings of International Workshop
on Advanced Image Technology 2022
Haoming Zhang, Xiao-Jun Wu, Tianyang Xu, Donglin Zhang
Measuring similarity between heterogeneous data remains an open problem for cross-modal retrieval. The core of cross-modal retrieval is how to measure the similarity between different types of data. Many approaches have been developed to solve the problem. As one of the mainstream directions, approaches based on subspace learning focus on learning a common subspace where the similarity among multi-modal data can be measured directly. However, many of the existing approaches only focus on learning a latent subspace. They do not make full use of discriminative information, so the semantically structural information is not well preserved and satisfactory results cannot be achieved. In this paper, we propose discriminative supervised subspace learning for cross-modal retrieval (DS2L), which makes full use of discriminative information and better preserves the semantically structural information. Specifically, we first construct a shared semantic graph to preserve the semantic structure within each modality. Second, the Hilbert-Schmidt Independence Criterion (HSIC) is introduced to preserve the consistency between the feature similarity and semantic similarity of samples. Third, we introduce a similarity preservation term, so that our model can compensate for the insufficient use of discriminative data and better preserve the semantically structural information within each modality. The experimental results obtained on three well-known benchmark datasets demonstrate the effectiveness and competitiveness of the proposed method against the compared classic subspace learning approaches.
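To make the HSIC term in the abstract above concrete, here is a minimal sketch of the standard biased empirical HSIC estimator, tr(KHLH) / (n - 1)^2. The linear kernels, the function names, and the toy inputs are assumptions chosen for illustration; the abstract does not state which kernels or estimator DS2L uses.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(rows):
    """Gram matrix of a linear kernel (an assumed kernel choice)."""
    return [[dot(a, b) for b in rows] for a in rows]


def center(K):
    """Return HKH, where H = I - (1/n) * ones is the centering matrix."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(features, labels):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = len(features)
    Kc = center(linear_kernel(features))
    L = linear_kernel(labels)
    trace = sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Hypothetical toy data: three samples with one-dimensional features.
features = [[1.0], [2.0], [3.0]]
dependent = hsic(features, features)                       # strongly dependent pair
independent = hsic(features, [[1.0], [1.0], [1.0]])        # constant "labels"
```

In this toy setting the estimate is large when the second argument varies with the features and vanishes when it is constant, which is the behavior that makes HSIC usable as a consistency term between feature similarity and semantic similarity.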
Ellen M. Voorhees, Ian Soboroff, Jimmy Lin
Neural retrieval models are generally regarded as fundamentally different from the retrieval techniques used in the late 1990s when the TREC ad hoc test collections were constructed. They thus provide the opportunity to empirically test the claim that pooling-built test collections can reliably evaluate retrieval systems that did not contribute to the construction of the collection (in other words, that such collections can be reusable). To test the reusability claim, we asked TREC assessors to judge new pools created from new search results for the TREC-8 ad hoc collection. These new search results consisted of five new runs (one each from three transformer-based models and two baseline runs that use BM25) plus the set of TREC-8 submissions that did not previously contribute to pools. The new runs did retrieve previously unseen documents, but the vast majority of those documents were not relevant. The rankings of all runs by mean evaluation score when evaluated using the official TREC-8 relevance judgment set and the newly expanded relevance set are almost identical, with Kendall's tau correlations greater than 0.99. Correlations for individual topics are also high. The TREC-8 ad hoc collection was originally constructed using deep pools over a diverse set of runs, including several effective manual runs. Its judgment budget, and hence construction cost, was relatively large. However, it does appear that the expense was well spent: even with the advent of neural techniques, the collection has stood the test of time and remains a reliable evaluation instrument as retrieval techniques have advanced.
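The comparison of system rankings described above rests on Kendall's tau. The sketch below computes the simple tau-a variant (no tie correction) between two sets of mean scores for the same runs. The run names and scores are hypothetical and are not taken from the TREC-8 study; which tau variant the authors used is not stated in the abstract.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by the same run names."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both score sets must cover the same runs")
    runs = sorted(scores_a)
    concordant = discordant = 0
    for r1, r2 in combinations(runs, 2):
        # Positive product: both judgment sets order the pair the same way.
        product = (scores_a[r1] - scores_a[r2]) * (scores_b[r1] - scores_b[r2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(runs) * (len(runs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical mean scores under the original and the expanded judgments.
original = {"run_a": 0.31, "run_b": 0.27, "run_c": 0.22, "run_d": 0.18}
expanded = {"run_a": 0.30, "run_b": 0.28, "run_c": 0.19, "run_d": 0.20}
tau = kendall_tau(original, expanded)
```

Here a single swapped pair among four runs already lowers tau to about 0.67, which gives a sense of how little reordering a correlation above 0.99 allows.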
Nazifa Rumman, Tianhong Wang, Kaitlin Jennings, Pascal Bassène, Finn Buldt, Moussa N'Gom
We present an optical wavefront shaping approach that allows tracking and
localization of a signal hidden inside or behind a scattering medium. The method
combines traditional feedback-based wavefront shaping with a switch
function, controlled by two different signals. A simple in-transmission
imaging system is used with two detectors: one monitors the speckle signature
and the other tracks the fully hidden signal (e.g., fluorescent beads). The
algorithm initially finds the optimal incident wavefront to maximize light
transmission to generate a focus in the scattering medium. This modulation
process redirects the scattered input signal inducing instantaneous changes in
both monitored signals, which in turn locates the hidden objects. Once the
response from the hidden target becomes distinct, the algorithm switches to use
this signal as the feedback. We provide experimental demonstrations as a proof
of concept of our approach. Potential applications of our method include
extracting information from biological samples and developing non-invasive
diagnosis methods.
Authors' comments: 4 pages, 5 figures, submitted to Applied Physics Letters
Sumrit Gupta, Ushasi Chaudhuri, Biplab Banerjee
The performance of a zero-shot sketch-based image retrieval (ZS-SBIR) task is
primarily affected by two challenges. The substantial domain gap between image
and sketch features needs to be bridged, while at the same time the side
information has to be chosen tactfully. Existing literature has shown that
varying the semantic side information greatly affects the performance of
ZS-SBIR. To this end, we propose a novel graph transformer based zero-shot
sketch-based image retrieval (GTZSR) framework for solving ZS-SBIR tasks which
uses a novel graph transformer to preserve the topology of the classes in the
semantic space and propagates the context-graph of the classes within the
embedding features of the visual space. To bridge the domain gap between the
visual features, we propose minimizing the Wasserstein distance between images
and sketches in a learned domain-shared space. We also propose a novel
compatibility loss that further aligns the two visual domains by bridging the
domain gap of one class with respect to the domain gap of all other classes in
the training set. Experimental results obtained on the extended Sketchy,
TU-Berlin, and QuickDraw datasets exhibit sharp improvements over the existing
state-of-the-art methods in both ZS-SBIR and generalized ZS-SBIR.
Authors' comments: Accepted at ICPR 2022
B. Miao, L. Feder, J. E. Shrock, H. M. Milchberg
Bessel beams generated with non-ideal axicons are affected by aberrations. We introduce a method to retrieve the complex amplitude of a Bessel beam from intensity measurements alone, and then use this information to correct the wavefront and intensity profile using a deformable mirror.
Cash Costello, Eugene Yang, Dawn Lawrie, James Mayfield
While there are high-quality software frameworks for information retrieval
experimentation, they do not explicitly support cross-language information
retrieval (CLIR). To fill this gap, we have created Patapsco, a Python CLIR
framework. This framework specifically addresses the complexity that comes with
running experiments in multiple languages. Patapsco is designed to be
extensible to many language pairs, to be scalable to large document
collections, and to support reproducible experiments driven by a configuration
file. We include Patapsco results on standard CLIR collections using multiple
settings.
Authors' comments: 5 pages, accepted at ECIR 2022 as a demo paper
Vilém Zouhar, Marius Mosbach, Debanjali Biswas, Dietrich Klakow
Many NLP models gain performance by having access to a knowledge base. A lot
of research has been devoted to devising and improving the way the knowledge
base is accessed and incorporated into the model, resulting in a number of
mechanisms and pipelines. Despite the diversity of proposed mechanisms, there
are patterns in the designs of such systems. In this paper, we systematically
describe the typology of artefacts (items retrieved from a knowledge base),
retrieval mechanisms and the way these artefacts are fused into the model. This
further allows us to uncover combinations of design decisions that had not yet
been tried. Most of the focus is given to language models, though we also show
how question answering, fact-checking and knowledgeable dialogue models fit into
this system as well. Having an abstract model which can describe the
architecture of specific models also helps with transferring these
architectures between multiple NLP tasks.
Authors' comments: 11 pages of main content, 7 pages of appendix; presented at AKBC CSRR
2021
Jurijs Nazarovs, Cristian Lumezanu, Qianying Ren, Yuncong Chen, Takehiko Mizoguchi, Dongjin Song, Haifeng Chen
In this paper, we propose an ordered time series classification framework that is robust against missing classes in the training data, i.e., during testing we can predict classes that are missing during training. This framework relies on two main components: (1) our newly proposed ordinal-quadruplet loss, which forces the model to learn a latent representation while preserving the ordinal relation among labels, and (2) a testing procedure, which utilizes the order-preservation property of the latent representation. We conduct experiments on real-world multivariate time series data and show a significant improvement in the prediction of missing labels even when 40% of the classes are missing from training. Compared with the well-known triplet loss optimization augmented with interpolation for missing information, in some cases we nearly double the accuracy.
Suraj Nair, Eugene Yang, Dawn Lawrie, Kevin Duh, Paul McNamee, Kenton Murray, James Mayfield, Douglas W. Oard
The advent of transformer-based models such as BERT has led to the rise of
neural ranking models. These models have improved the effectiveness of
retrieval systems well beyond that of lexical term matching models such as
BM25. While monolingual retrieval tasks have benefited from large-scale
training collections such as MS MARCO and advances in neural architectures,
cross-language retrieval tasks have fallen behind these advancements. This
paper introduces ColBERT-X, a generalization of the ColBERT
multi-representation dense retrieval model that uses the XLM-RoBERTa (XLM-R)
encoder to support cross-language information retrieval (CLIR). ColBERT-X can
be trained in two ways. In zero-shot training, the system is trained on the
English MS MARCO collection, relying on the XLM-R encoder for cross-language
mappings. In translate-train, the system is trained on the MS MARCO English
queries coupled with machine translations of the associated MS MARCO passages.
Results on ad hoc document ranking tasks in several languages demonstrate
substantial and statistically significant improvements of these trained dense
retrieval models over traditional lexical CLIR baselines.
Authors' comments: Accepted at ECIR 2022 (Full paper)
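As a rough illustration of what "multi-representation" scoring means here, the sketch below implements the late-interaction (MaxSim) score from the original ColBERT model: each query token vector is matched to its most similar document token vector, and the maxima are summed. That ColBERT-X keeps this scoring unchanged is an assumption based on the abstract calling it a generalization of ColBERT; the two-dimensional toy vectors stand in for XLM-R token embeddings and are purely hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical token embeddings: one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_1": [[0.9, 0.1], [0.2, 0.8]],
    "doc_2": [[0.5, 0.5], [0.4, 0.3]],
}

# Rank documents by descending late-interaction score.
ranked = sorted(docs, key=lambda name: maxsim_score(query, docs[name]), reverse=True)
```

Because documents are encoded independently of the query, their token vectors can be indexed offline, which is what keeps this style of model practical for ad hoc ranking.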
Cheng-En Wu, Farley Lai, Yu Hen Hu, Asim Kadav
Self-supervised video representation learning has been shown to effectively
improve downstream tasks such as video retrieval and action recognition. In
this paper, we present Cascade Positive Retrieval (CPR), which successively
mines positive examples w.r.t. the query for contrastive learning in a cascade
of stages. Specifically, CPR exploits multiple views of a query example in
different modalities, where an alternative view may help find another positive
example dissimilar in the query view. We explore the effects of possible CPR
configurations in ablations including the number of mining stages, the top
similar example selection ratio in each stage, and progressive training with an
incremental number of the final Top-k selection. The overall mining quality is
measured to reflect the recall across training set classes. CPR reaches a
median class mining recall of 83.3%, outperforming previous work by 5.5%.
Implementation-wise, CPR is complementary to pretext tasks and can be easily
applied to previous work. In the evaluation of pretraining on UCF101, CPR
consistently improves existing work and even achieves state-of-the-art R@1 of
56.7% and 24.4% in video retrieval as well as 83.8% and 54.8% in action
recognition on UCF101 and HMDB51. The code is available at
https://github.com/necla-ml/CPR.
Authors' comments: To appear in CVPR 2022 L3D-IVU Workshop
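The cascade described above can be pictured as successive filtering: each stage ranks the surviving candidates by similarity to the query in one view, keeps a fixed ratio of them, and the last stage yields the final Top-k positives. The sketch below is only one plausible reading of the abstract, not the released implementation; the function name, the dot-product similarity, the per-stage ratios, and the toy embeddings are all assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cascade_mine(query_views, candidate_views, ratios, top_k):
    """Filter candidates stage by stage, one view (modality) per stage.

    query_views: list of query embeddings, one per stage.
    candidate_views: dict mapping candidate id to its per-stage embeddings.
    ratios: fraction of survivors kept after each stage.
    """
    survivors = list(candidate_views)
    for stage, ratio in enumerate(ratios):
        survivors.sort(
            key=lambda c: dot(query_views[stage], candidate_views[c][stage]),
            reverse=True,
        )
        # Never keep fewer than top_k candidates at any stage.
        keep = max(top_k, int(len(survivors) * ratio))
        survivors = survivors[:keep]
    return survivors[:top_k]


# Hypothetical embeddings in two views (stage 0 and stage 1).
query_views = [[1.0, 0.0], [0.0, 1.0]]
candidate_views = {
    "a": [[0.9, 0.1], [0.1, 0.2]],
    "b": [[0.8, 0.2], [0.1, 0.9]],
    "c": [[0.7, 0.3], [0.2, 0.6]],
    "d": [[0.1, 0.9], [0.0, 1.0]],
}
positives = cascade_mine(query_views, candidate_views, ratios=[0.75, 0.5], top_k=1)
```

In this toy run, candidate "d" is the closest match in the second view but is discarded in the first stage, so the cascade returns "b", which is reasonably similar in both views.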
Yue Ruan, Han-Hung Lee, Yiming Zhang, Ke Zhang, Angel X. Chang
Text-to-shape retrieval is an increasingly relevant problem with the growth
of 3D shape data. Recent work on contrastive losses for learning joint
embeddings over multimodal data has been successful at tasks such as retrieval
and classification. Thus far, work on joint representation learning for 3D
shapes and text has focused on improving embeddings through modeling of complex
attention between representations, or multi-task learning. We propose a
trimodal learning scheme over text, multi-view images and 3D shape voxels, and
show that with large batch contrastive learning we achieve good performance on
text-to-shape retrieval without complex attention mechanisms or losses. Our
experiments serve as a foundation for follow-up work on building trimodal
embeddings for text-image-shape.
Authors' comments: Accepted by WACV 2024
Zeyu Zhang, Thuy Vu, Alessandro Moschitti
Recent work has shown that an answer verification step introduced in Transformer-based answer selection models can significantly improve the state of the art in Question Answering. This step is performed by aggregating the embeddings of the top $k$ answer candidates to support the verification of a target answer. Although the approach is intuitive and sound, it still shows two limitations: (i) the supporting candidates are ranked only according to their relevance to the question and not to the answer, and (ii) the support provided by the other answer candidates is suboptimal, as these are retrieved independently of the target answer. In this paper, we address both drawbacks by proposing (i) a double reranking model, which, for each target answer, selects the best support; and (ii) a second neural retrieval stage designed to encode the question-answer pair as the query, which finds more specific verification information. The results on three well-known datasets for AS2 show consistent and significant improvement over the state of the art.
Weiwei Song, Zhi Gao, Renwei Dian, Pedram Ghamisi, Yongjun Zhang, Jón Atli Benediktsson
Remote sensing image retrieval (RSIR), aiming at searching for a set of
similar items to a given query image, is a very important task in remote
sensing applications. Deep hashing learning as the current mainstream method
has achieved satisfactory retrieval performance. On one hand, various deep
neural networks are used to extract semantic features of remote sensing images.
On the other hand, the hashing techniques are subsequently adopted to map the
high-dimensional deep features to the low-dimensional binary codes. This kind
of method attempts to learn one hash function for both the query and database
samples in a symmetric way. However, with the number of database samples
increasing, it is typically time-consuming to generate the hash codes of
large-scale database images. In this paper, we propose a novel deep hashing
method, named asymmetric hash code learning (AHCL), for RSIR. The proposed AHCL
generates the hash codes of query and database images in an asymmetric way. In
more detail, the hash codes of query images are obtained by binarizing the
output of the network, while the hash codes of database images are directly
learned by solving the designed objective function. In addition, we combine the
semantic information of each image and the similarity information of pairs of
images as supervised information to train a deep hashing network, which
improves the representation ability of deep features and hash codes. The
experimental results on three public datasets demonstrate that the proposed
method outperforms symmetric methods in terms of retrieval accuracy and
efficiency. The source code is available at
https://github.com/weiweisong415/Demo_AHCL_for_TGRS2022.
Authors' comments: 14 pages, 12 figures, and 2 tables
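The asymmetric setup in the abstract above can be sketched at retrieval time as follows: a query code comes from binarizing the network output, database codes are simply looked up (they would have been learned directly during training), and ranking uses Hamming distance. The sign threshold, the 4-bit codes, and the image names are hypothetical; the objective that actually learns the database codes is not reproduced here.

```python
def binarize(outputs):
    """Map real-valued network outputs to a {-1, +1} hash code via the sign."""
    return [1 if x >= 0 else -1 for x in outputs]


def hamming(code_a, code_b):
    """Number of positions where two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Database codes are stored as-is: in the asymmetric scheme they are learned
# directly rather than produced by passing each database image through the network.
database_codes = {
    "img_0": [1, -1, 1, 1],
    "img_1": [-1, -1, 1, -1],
    "img_2": [1, 1, -1, 1],
}

# Hypothetical network output for one query image.
query_output = [0.7, -0.2, 0.4, 0.1]
query_code = binarize(query_output)

# Rank database images by ascending Hamming distance to the query code.
ranking = sorted(database_codes, key=lambda name: hamming(query_code, database_codes[name]))
```

Only query images need a forward pass at search time, which is where the efficiency argument for large databases comes from.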
Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, Xiaohu Qie, Ping Luo
Pre-training a model to learn transferable video-text representation for
retrieval has attracted a lot of attention in recent years. Previous dominant
works mainly adopt two separate encoders for efficient retrieval, but ignore
local associations between videos and texts. Another line of research uses a
joint encoder to model video-text interactions, but results in low efficiency since
each text-video pair needs to be fed into the model. In this work, we enable
fine-grained video-text interactions while maintaining high efficiency for
retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ),
where a parametric module BridgeFormer is trained to answer the "questions"
constructed by the text features via resorting to the video features.
Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to
build questions, with which the video encoder can be trained to capture more
regional content and temporal dynamics. In the form of questions and answers,
the semantic associations between local video-text features can be properly
established. BridgeFormer can be removed for downstream retrieval,
rendering an efficient and flexible model with only two encoders. Our method
outperforms state-of-the-art methods on the popular text-to-video retrieval
task in five datasets with different experimental setups (i.e., zero-shot and
fine-tune), including HowTo100M (one million videos). We further conduct
zero-shot action recognition, which can be cast as video-to-text retrieval, and
our approach also significantly surpasses its counterparts. As an additional
benefit, our method achieves competitive results with much shorter pre-training
videos on single-modality downstream tasks, e.g., action recognition with
linear evaluation.
Authors' comments: Accepted by CVPR 2022
Thomas Gerald, Laure Soulier
In information retrieval (IR) systems, trends and users' interests may change over time, altering the distribution of either the requests or the contents to be recommended. Since neural ranking approaches heavily depend on the training data, it is crucial to understand the transfer capacity of recent IR approaches to address new domains in the long term. In this paper, we first propose a dataset based upon the MSMarco corpus that models a long stream of topics as well as IR property-driven controlled settings. We then analyze in depth the ability of recent neural IR models while continually learning those streams. Our empirical study highlights the particular cases in which catastrophic forgetting occurs (e.g., level of similarity between tasks, peculiarities of text length, and ways of learning models) to provide future directions in terms of model design.
Eva Breznik, Elisabeth Wetzer, Joakim Lindblad, Nataša Sladoje
In tissue characterization and cancer diagnostics, multimodal imaging has emerged as a powerful technique. Thanks to computational advances, large datasets can be exploited to discover patterns in pathologies and improve diagnosis. However, this requires efficient and scalable image retrieval methods. Cross-modality image retrieval is particularly challenging, since images of similar (or even the same) content captured by different modalities might share few common structures. We propose a new application-independent content-based image retrieval (CBIR) system for reverse (sub-)image search across modalities, which combines deep learning to generate representations (embedding the different modalities in a common space) with classical feature extraction and bag-of-words models for efficient and reliable retrieval. We illustrate its advantages through a replacement study, exploring a number of feature extractors and learned representations, as well as through comparison to recent (cross-modality) CBIR methods. For the task of (sub-)image retrieval on a (publicly available) dataset of brightfield and second harmonic generation microscopy images, the results show that our approach is superior to all tested alternatives. We discuss the shortcomings of the compared methods and observe the importance of equivariance and invariance properties of the learned representations and feature extractors in the CBIR pipeline. Code is available at: https://github.com/MIDA-group/CrossModal_ImgRetrieval.
Amanda Duarte, Samuel Albanie, Xavier Giró-i-Nieto, Gül Varol
Systems that can efficiently search collections of sign language videos have
been highlighted as a useful application of sign language technology. However,
the problem of searching videos beyond individual keywords has received limited
attention in the literature. To address this gap, in this work we introduce the
task of sign language retrieval with free-form textual queries: given a written
query (e.g., a sentence) and a large collection of sign language videos, the
objective is to find the signing video in the collection that best matches the
written query. We propose to tackle this task by learning cross-modal
embeddings on the recently introduced large-scale How2Sign dataset of American
Sign Language (ASL). We identify that a key bottleneck in the performance of
the system is the quality of the sign video embedding which suffers from a
scarcity of labeled training data. We, therefore, propose SPOT-ALIGN, a
framework for interleaving iterative rounds of sign spotting and feature
alignment to expand the scope and scale of available training data. We validate
the effectiveness of SPOT-ALIGN for learning a robust sign video embedding
through improvements in both sign recognition and the proposed video retrieval
task.
Authors' comments: In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) 2022
Ryoma Sato
Suppose we have a black-box function (e.g., a deep neural network) that takes
an image as input and outputs a value that indicates preference. How can we
retrieve optimal images with respect to this function from an external database
on the Internet? Standard retrieval problems in the literature (e.g., item
recommendations) assume that an algorithm has full access to the set of items.
In other words, such algorithms are designed for service providers. In this
paper, we consider the retrieval problem under different assumptions.
Specifically, we consider how users with limited access to an image database
can retrieve images using their own black-box functions. This formulation
enables a flexible and finer-grained image search defined by each user. We
assume the user can access the database through a search query with tight API
limits. Therefore, a user needs to efficiently retrieve optimal images in terms
of the number of queries. We propose an efficient retrieval algorithm, Tiara, for
this problem. In the experiments, we confirm that our proposed method performs
better than several baselines under various settings.
Authors' comments: WSDM 2022