Yida Zhao, Yuqing Song, Qin Jin
Image retrieval with hybrid-modality queries, also known as composing text
and image for image retrieval (CTI-IR), is a retrieval task where the search
intention is expressed in a more complex query format, involving both vision
and text modalities. For example, a target product image is searched using a
reference product image along with text about changing certain attributes of
the reference image as the query. It is a more challenging image retrieval task
that requires both semantic space learning and cross-modal fusion. Previous
approaches that attempt to deal with both aspects achieve unsatisfactory
performance. In this paper, we decompose the CTI-IR task into a three-stage
learning problem to progressively learn the complex knowledge for image
retrieval with hybrid-modality queries. We first leverage the semantic
embedding space for open-domain image-text retrieval, and then transfer the
learned knowledge to the fashion-domain with fashion-related pre-training
tasks. Finally, we enhance the pre-trained model from single-query to
hybrid-modality query for the CTI-IR task. Furthermore, as the contribution of
each modality in the hybrid-modality query varies across retrieval
scenarios, we propose a self-supervised adaptive weighting strategy to
dynamically determine the importance of image and text in the hybrid-modality
query for better retrieval. Extensive experiments show that our proposed model
significantly outperforms state-of-the-art methods in the mean of Recall@K by
24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.
Authors' comments: Accepted by SIGIR 2022
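To make the idea of weighting the two modalities in a hybrid query concrete, the following minimal sketch composes a query embedding as a convex combination of an image embedding and a text embedding. The function names, the softmax over two importance logits, and the toy vectors are all assumptions for illustration; the abstract does not say how the weights are computed or how the fused query is formed.

```python
import math

def softmax2(a, b):
    """Numerically stable softmax over two logits."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def compose_query(img_emb, txt_emb, img_logit, txt_logit):
    """Hypothetical fusion: weighted sum of normalized image and text embeddings.

    img_logit and txt_logit stand in for whatever importance scores a model
    would predict per query; here they are plain inputs.
    """
    w_img, w_txt = softmax2(img_logit, txt_logit)
    img_n, txt_n = l2_normalize(img_emb), l2_normalize(txt_emb)
    fused = [w_img * i + w_txt * t for i, t in zip(img_n, txt_n)]
    return l2_normalize(fused), (w_img, w_txt)

# Equal logits give equal weight to the reference image and the modifying text.
query, weights = compose_query([1.0, 0.0], [0.0, 1.0], img_logit=0.0, txt_logit=0.0)
```

Raising `txt_logit` relative to `img_logit` moves the fused query toward the text embedding, which is the behavior one would want when the text describes a large change to the reference image.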
Eric Dodds, Jack Culpepper, Gaurav Srivastava
Retrieving relevant images from a catalog based on a query image together with a modifying caption is a challenging multimodal task that can particularly benefit domains like apparel shopping, where fine details and subtle variations may be best expressed through natural language. We introduce a new evaluation dataset, Challenging Fashion Queries (CFQ), as well as a modeling approach that achieves state-of-the-art performance on the existing Fashion IQ (FIQ) dataset. CFQ complements existing benchmarks by including relative captions with positive and negative labels of caption accuracy and conditional image similarity, where others provided only positive labels with a combined meaning. We demonstrate the importance of multimodal pretraining for the task and show that domain-specific weak supervision based on attribute labels can augment generic large-scale pretraining. While previous modality fusion mechanisms lose the benefits of multimodal pretraining, we introduce a residual attention fusion mechanism that improves performance. We release CFQ and our code to the research community.
Yifan Qiao, Yingrui Yang, Haixin Lin, Tianbo Xiong, Xiyue Wang, Tao Yang
This paper proposes a dual skipping guidance scheme with hybrid scoring to accelerate document retrieval that uses learned sparse representations while still delivering good relevance. This scheme uses both lexical BM25 and learned neural term weights to bound and compose the rank score of a candidate document separately for skipping and final ranking, and maintains two top-k thresholds during inverted index traversal. This paper evaluates time efficiency and ranking relevance of the proposed scheme in searching MS MARCO TREC datasets.
Boshu Lei, Wenjie Ding, Limeng Qiao, Xi Qiu
Visual place retrieval aims to search the database for images that depict
places similar to that of the query image. However, global descriptors encoded by the
network usually fall into a low dimensional principal space, which is harmful
to the retrieval performance. We first analyze the cause of this phenomenon,
pointing out that it is due to degraded distribution of the gradients of
descriptors. Then, we propose the Gradient Rectification Module (GRM) to alleviate
this issue. GRM is appended after the final pooling layer and can rectify
gradients to the complementary space of the principal space. With GRM, the
network is encouraged to generate descriptors more uniformly in the whole
space. Finally, we conduct experiments on multiple datasets and generalize our
method to the classification task under the prototype learning framework.
Authors' comments: Accepted to the 2023 International Conference on Robotics and
Automation (ICRA 2023)
Jianwei Niu, Hok Shing Wong, Tieyong Zeng
Phase retrieval is an important problem with significant physical and industrial applications. In this paper, we consider the case where the magnitude of the measurement of an underlying signal is corrupted by Gaussian noise. We introduce a convex augmentation approach for phase retrieval based on total variation regularization. In contrast to popular convex relaxation models like PhaseLift, our model can be efficiently solved by a modified semi-proximal alternating direction method of multipliers (sPADMM). The modified sPADMM is more general and flexible than the standard one, and its convergence is also established in this paper. Extensive numerical experiments are conducted to showcase the effectiveness of the proposed method.
Leila Pishdad, Ran Zhang, Konstantinos G. Derpanis, Allan Jepson, Afsaneh Fazly
Probabilistic embeddings have proven useful for capturing polysemous word
meanings, as well as ambiguity in image matching. In this paper, we study the
advantages of probabilistic embeddings in a cross-modal setting (i.e., text and
images), and propose a simple approach that replaces the standard vector point
embeddings in extant image-text matching models with probabilistic
distributions that are parametrically learned. Our guiding hypothesis is that
the uncertainty encoded in the probabilistic embeddings captures the
cross-modal ambiguity in the input instances, and that it is through capturing
this uncertainty that the probabilistic models can perform better at downstream
tasks, such as image-to-text or text-to-image retrieval. Through extensive
experiments on standard and new benchmarks, we show a consistent advantage for
probabilistic representations in cross-modal retrieval, and validate the
ability of our embeddings to capture uncertainty.
Authors' comments: 13 pages, 7 figures
Ran Cui, Tianwen Qian, Pai Peng, Elena Daskalaki, Jingjing Chen, Xiaowei Guo, Huyang Sun, Yu-Gang Jiang
Video moment retrieval aims at finding the start and end timestamps of a
moment (part of a video) described by a given natural language query. Fully
supervised methods need complete temporal boundary annotations to achieve
promising results, which is costly since the annotator needs to watch the whole
moment. Weakly supervised methods only rely on the paired video and query, but
the performance is relatively poor. In this paper, we look closer into the
annotation process and propose a new paradigm called "glance annotation". This
paradigm requires the timestamp of only one single random frame, which we refer
to as a "glance", within the temporal boundary of the fully supervised
counterpart. We argue this is beneficial because, compared to weak supervision,
it adds only trivial cost yet offers more potential in performance. Under the
glance annotation setting, we propose a method named Video moment retrieval
via Glance Annotation (ViGA) based on contrastive learning. ViGA cuts the input
video into clips and contrasts clips with queries, assigning glance-guided
Gaussian-distributed weights to all clips. Our extensive
experiments indicate that ViGA achieves better results than the
state-of-the-art weakly supervised methods by a large margin, even comparable
to fully supervised methods in some cases.
Authors' comments: Accepted as full paper in SIGIR 2022
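The glance-guided Gaussian weighting of clips mentioned above can be sketched as follows. This is only an illustration under stated assumptions: fixed-length clips, a glance timestamp mapped to the clip that contains it, and a hand-picked `sigma`. The helper names `clip_index` and `glance_weights` are hypothetical, and the abstract does not specify how the weights enter the contrastive loss.

```python
import math

def clip_index(timestamp, clip_len):
    """Index of the fixed-length clip that contains the given timestamp (seconds)."""
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return int(timestamp // clip_len)

def glance_weights(num_clips, glance_idx, sigma):
    """Unnormalized Gaussian weight per clip, peaking at the glance clip."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [
        math.exp(-((i - glance_idx) ** 2) / (2 * sigma ** 2))
        for i in range(num_clips)
    ]

# A 14-second video cut into 2-second clips, with a glance annotated at 7.2 s.
idx = clip_index(7.2, clip_len=2.0)
weights = glance_weights(num_clips=7, glance_idx=idx, sigma=1.5)
```

Clips near the glance receive weights close to 1 and distant clips receive weights close to 0, so a clip-query contrastive objective weighted this way would trust clips near the annotated frame most.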
Mustafa Shukor, Guillaume Couairon, Asya Grechka, Matthieu Cord
Cross-modal image-recipe retrieval has gained significant attention in recent
years. Most work focuses on improving cross-modal embeddings using unimodal
encoders, which allow for efficient retrieval in large-scale databases, leaving
aside cross-attention between modalities which is more computationally
expensive. We propose a new retrieval framework, T-Food (Transformer Decoders
with MultiModal Regularization for Cross-Modal Food Retrieval) that exploits
the interaction between modalities in a novel regularization scheme, while
using only unimodal encoders at test time for efficient retrieval. We also
capture the intra-dependencies between recipe entities with a dedicated recipe
encoder, and propose new variants of triplet losses with dynamic margins that
adapt to the difficulty of the task. Finally, we leverage the power of the
recent Vision and Language Pretraining (VLP) models such as CLIP for the image
encoder. Our approach outperforms existing approaches by a large margin on the
Recipe1M dataset. Specifically, we achieve absolute improvements of +8.1% (72.6
R@1) and +10.9% (44.6 R@1) on the 1k and 10k test sets respectively. The code
is available here: https://github.com/mshukor/TFood
Authors' comments: Accepted at CVPR 2022, MULA Workshop. Code is available at
https://github.com/mshukor/TFood
Nazia Afroz Choudhury, Bariscan Yonel, Birsen Yazici
Robustness to noise and outliers is a desirable trait in phase retrieval algorithms for many applications in imaging and signal processing. In this paper, we develop novel robust phase retrieval algorithms based on the minimization of reverse Kullback-Leibler divergence (RKLD) within the Wirtinger Flow (WF) framework. We use RKLD over intensity-only measurements in two distinct ways: i) to design a novel initial estimate based on minimum distortion design of spectral estimates, and ii) as a loss function for iterative refinement based on WF. The RKLD-based loss function offers implicit regularization by processing data at the logarithmic scale and provides the following benefits: suppressing the influence of outliers and promoting projections orthogonal to noise subspace. We perform a quantitative analysis demonstrating the robustness of RKLD-based minimization as compared to that of the $\ell_2$ and Poisson loss-based minimization. We present three algorithms based on RKLD minimization, including two with truncation schemes to enhance the robustness to significant contamination. Our numerical study uses data generated based on synthetic coded diffraction patterns and real optical imaging data. The results demonstrate the advantages of our algorithms in terms of sample efficiency, convergence speed, and robustness with respect to outliers over the state-of-the-art techniques.
Georgii Mikriukov, Mahdyar Ravanbakhsh, Begüm Demir
The development of cross-modal retrieval systems that can search and retrieve
semantically relevant data across different modalities based on a query in any
modality has attracted great attention in remote sensing (RS). In this paper,
we focus our attention on cross-modal text-image retrieval, where queries from
one modality (e.g., text) can be matched to archive entries from another (e.g.,
image). Most of the existing cross-modal text-image retrieval systems in RS
require a high number of labeled training samples and also do not allow fast
and memory-efficient retrieval. These issues limit the applicability of the
existing cross-modal retrieval systems for large-scale applications in RS. To
address this problem, in this paper we introduce a novel unsupervised
cross-modal contrastive hashing (DUCH) method for text-image retrieval in RS.
To this end, the proposed DUCH is made up of two main modules: 1) feature
extraction module, which extracts deep representations of two modalities; 2)
hashing module that learns to generate cross-modal binary hash codes from the
extracted representations. We introduce a novel multi-objective loss function
including: i) contrastive objectives that enable similarity preservation in
intra- and inter-modal similarities; ii) an adversarial objective that is
enforced across two modalities for cross-modal representation consistency; and
iii) binarization objectives for generating hash codes. Experimental results
show that the proposed DUCH outperforms state-of-the-art methods. Our code is
publicly available at https://git.tu-berlin.de/rsim/duch.
Authors' comments: Our code is publicly available at https://git.tu-berlin.de/rsim/duch.
arXiv admin note: substantial text overlap with arXiv:2201.08125
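As a rough illustration of why binary hash codes allow fast, memory-efficient cross-modal retrieval, the sketch below binarizes real-valued features by sign and ranks archive images by Hamming distance to a text query code. The feature values, the sign-based binarization, and the function names are assumptions for illustration; in the method described above the codes are produced by a learned hashing module.

```python
def binarize(features):
    """Map real-valued features to a binary code via their sign (+1 / -1)."""
    return [1 if x >= 0 else -1 for x in features]

def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have equal length")
    return sum(x != y for x, y in zip(a, b))

def rank_by_hamming(query_code, archive):
    """Return archive keys sorted by Hamming distance to the query code."""
    return sorted(archive, key=lambda k: hamming(query_code, archive[k]))

# Toy 4-bit codes: one text query against three archive images.
text_query = binarize([0.8, -0.3, 0.1, -0.9])
image_archive = {
    "img_a": binarize([0.5, -0.7, 0.2, -0.1]),
    "img_b": binarize([-0.4, 0.6, 0.3, -0.2]),
    "img_c": binarize([-0.9, 0.2, -0.5, 0.7]),
}
ranking = rank_by_hamming(text_query, image_archive)
```

In practice the codes would be packed into machine words so that the distance reduces to an XOR followed by a population count, but the list-based version keeps the logic visible.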
Ananya Mukherjee, Nimmala Narendra
This work addresses the viability of \textit{Dirac phase leptogenesis} in a
scenario where the light Majorana neutrinos acquire masses by the inverse
seesaw (ISS) mechanism. We show that successful leptogenesis in the ISS,
driven (only) by the Dirac CP phase can be achieved with the involvement of an
unorthodox form of the rotational matrix $R = e^{i{\bf A}} \,\,\,(e^{{\bf A}})$
in the Casas-Ibarra parametrisation. This particular structure of $R$ turns out
to be an artefact in explaining the observed baryon asymmetry of the Universe
in a pure ISS scenario. We detail here the confined regions of the $R$ matrix
parameter space, essential for a successful leptogenesis. The $R$-matrix
parameter space assists in rescuing the ISS parameter space needed for
successful leptogenesis. This finding is otherwise unprecedented in the ISS
setup. Making use of the resulting $R$-matrix parameter space, we calculate
the branching ratio for the LFV decay $\mu \rightarrow e\gamma$. This accounts
for an indirect probe of the $R$-matrix parameter space. The branching ratio
obtained from the leptogenesis parameter space surpasses the existing bound on
the branching ratio that resulted in a scenario of combined effect of linear
and inverse seesaw. We also report here that, for the choice $R = e^{i{\bf A}}$,
leptogenesis demands the Dirac CP phase ($\delta$) to oscillate around $\pi/2$,
whereas for the latter choice, $R = e^{{\bf A}}$, the constraint on $\delta$ is much relaxed.
Authors' comments: 28 pages, 12 figures
Md Rizwan Parvez, Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, Kai-Wei Chang
Prior studies in privacy policies frame the question answering (QA) task as
identifying the most relevant text segment or a list of sentences from a policy
document given a user query. Existing labeled datasets are heavily imbalanced
(only a few relevant segments), limiting the QA performance in this domain. In
this paper, we develop a data augmentation framework based on ensembling
retriever models that captures the relevant text segments from unlabeled policy
documents and expands the positive examples in the training set. In addition, to
improve the diversity and quality of the augmented data, we leverage multiple
pre-trained language models (LMs) and cascade them with noise reduction filter
models. Using our augmented data on the PrivacyQA benchmark, we elevate the
existing baseline by a large margin (10% F1) and achieve a new
state-of-the-art F1 score of 50%. Our ablation studies provide further
insights into the effectiveness of our approach.
Authors' comments: EACL 2023
Leon Gugel, Shai Dekel
Phase retrieval is a well-known ill-posed inverse problem where one tries to recover images given only the magnitude values of their Fourier transform as input. In recent years, new algorithms based on deep learning have been proposed, providing breakthrough results that surpass the results of the classical methods. In this work we provide a novel deep learning architecture, PR-DAD (Phase Retrieval Using Deep Auto-Decoders), whose components are carefully designed based on mathematical modeling of the phase retrieval problem. The architecture provides experimental results that surpass all current results.
Constantinos Chamzas, Aedan Cullen, Anshumali Shrivastava, Lydia E. Kavraki
Recent work has demonstrated that motion planners' performance can be
significantly improved by retrieving past experiences from a database.
Typically, the experience database is queried for past similar problems using a
similarity function defined over the motion planning problems. However, to
date, most works rely on simple hand-crafted similarity functions and fail to
generalize outside their corresponding training dataset. To address this
limitation, we propose FIRE, a framework that extracts local representations
of planning problems and learns a similarity function over them. To generate
the training data we introduce a novel self-supervised method that identifies
similar and dissimilar pairs of local primitives from past solution paths. With
these pairs, a Siamese network is trained with the contrastive loss and the
similarity function is realized in the network's latent space. We evaluate FIRE
on an 8-DOF manipulator in five categories of motion planning problems with
sensed environments. Our experiments show that FIRE retrieves relevant
experiences which can informatively guide sampling-based planners even in
problems outside its training distribution, outperforming other baselines.
Authors' comments: Accepted in International Conference on Robotics and Automation
(ICRA), 2022. Code will be released soon
Megan Leszczynski, Daniel Y. Fu, Mayee F. Chen, Christopher Ré
Entity retrieval--retrieving information about entity mentions in a query--is
a key step in open-domain tasks, such as question answering or fact checking.
However, state-of-the-art entity retrievers struggle to retrieve rare entities
for ambiguous mentions due to biases towards popular entities. Incorporating
knowledge graph types during training could help overcome popularity biases,
but there are several challenges: (1) existing type-based retrieval methods
require mention boundaries as input, but open-domain tasks run on unstructured
text, (2) type-based methods should not compromise overall performance, and (3)
type-based methods should be robust to noisy and missing types. In this work,
we introduce TABi, a method to jointly train bi-encoders on knowledge graph
types and unstructured text for entity retrieval for open-domain tasks. TABi
leverages a type-enforced contrastive loss to encourage entities and queries of
similar types to be close in the embedding space. TABi improves retrieval of
rare entities on the Ambiguous Entity Retrieval (AmbER) sets, while maintaining
strong overall retrieval performance on open-domain tasks in the KILT benchmark
compared to state-of-the-art retrievers. TABi is also robust to incomplete type
systems, improving rare entity retrieval over baselines with only 5% type
coverage of the training dataset. We make our code publicly available at
https://github.com/HazyResearch/tabi.
Authors' comments: Accepted to Findings of ACL 2022
Hwanhee Lee, Cheoneum Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Juae Kim, Kyomin Jung
Despite the recent advancements in abstractive summarization systems
leveraged from large-scale datasets and pre-trained language models, the
factual correctness of the summary is still insufficient. One line of work to
mitigate this problem is to include a post-editing process that can detect and
correct factual errors in the summary. In building such a post-editing system,
it is strongly required that the process 1) has a high success rate and
interpretability and 2) has a fast running time. Previous approaches focus on
regenerating the summary using autoregressive models, which lack
interpretability and require high computing resources. In this paper, we
propose an efficient factual error correction system, RFEC, based on an
entity-retrieval post-editing process. RFEC first retrieves the evidence sentences
from the original document by comparing the sentences with the target summary.
This approach greatly reduces the length of text for a system to analyze. Next,
RFEC detects the entity-level errors in the summaries by considering the
evidence sentences and substitutes the wrong entities with the accurate
entities from the evidence sentences. Experimental results show that our
proposed error correction system achieves more competitive performance than
baseline methods in correcting factual errors, at a much faster speed.
Authors' comments: 6 pages, 3 figures
Yunhao Du, Binyu Zhang, Xiangning Ruan, Fei Su, Zhicheng Zhao, Hong Chen
Retrieving tracked vehicles by natural language descriptions plays a critical
role in smart city construction. It aims to find the best match for the given
texts from a set of tracked vehicles in surveillance videos. Existing works
generally solve it by a dual-stream framework, which consists of a text
encoder, a visual encoder and a cross-modal loss function. Although some
progress has been made, they fail to fully exploit the information at various
levels of granularity. To tackle this issue, we propose a novel framework for
the natural language-based vehicle retrieval task, OMG, which Observes Multiple
Granularities with respect to visual representation, textual representation and
objective functions. For the visual representation, target features, context
features and motion features are encoded separately. For the textual
representation, one global embedding, three local embeddings and a color-type
prompt embedding are extracted to represent various granularities of semantic
features. Finally, the overall framework is optimized by a cross-modal
multi-granularity contrastive loss function. Experiments demonstrate the
effectiveness of our method. OMG significantly outperforms all previous
methods and ranks 9th on Track 2 of the 6th AI City Challenge. The code is
available at https://github.com/dyhBUPT/OMG.
Authors' comments: CVPR 2022 Workshop
Xunguang Wang, Yiqun Lin, Xiaomeng Li
Deep hashing has been extensively utilized in massive image retrieval because of its efficiency and effectiveness. However, deep hashing models are vulnerable to adversarial examples, making it essential to develop adversarial defense methods for image retrieval. Existing solutions achieve limited defense performance because they use weak adversarial samples for training and lack discriminative optimization objectives to learn robust features. In this paper, we present a min-max based Center-guided Adversarial Training, namely CgAT, to improve the robustness of deep hashing networks through worst adversarial examples. Specifically, we first formulate the center code as a semantically-discriminative representative of the input image content, which preserves the semantic similarity with positive samples and dissimilarity with negative examples. We prove that a mathematical formula can calculate the center code immediately. Once obtained in each optimization iteration of the deep hashing network, the center codes are adopted to guide the adversarial training process. On the one hand, CgAT generates the worst adversarial examples as augmented data by maximizing the Hamming distance between the hash codes of the adversarial examples and the center codes. On the other hand, CgAT learns to mitigate the effects of adversarial samples by minimizing the Hamming distance to the center codes. Extensive experiments on the benchmark datasets demonstrate the effectiveness of our adversarial training algorithm in defending against adversarial attacks for deep hashing-based retrieval. Compared with the current state-of-the-art defense method, we significantly improve the defense performance by an average of 18.61%, 12.35%, and 11.56% on FLICKR-25K, NUS-WIDE, and MS-COCO, respectively. The code is available at https://github.com/xunguangwang/CgAT.
Johannes Villmow, Viola Campos, Adrian Ulges, Ulrich Schwanecke
We address contextualized code retrieval, the search for code snippets
helpful to fill gaps in a partial input program. Our approach facilitates a
large-scale self-supervised contrastive training by splitting source code
randomly into contexts and targets. To combat leakage between the two, we
suggest a novel approach based on mutual identifier masking, dedentation, and
the selection of syntax-aligned targets. Our second contribution is a new
dataset for direct evaluation of contextualized code retrieval, based on a
dataset of manually aligned subpassages of code clones. Our experiments
demonstrate that our approach improves retrieval substantially, and yields new
state-of-the-art results for code clone and defect detection.
Authors' comments: 4 pages, 5 figures
Devendra Singh Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen-tau Yih, Joelle Pineau, Luke Zettlemoyer
We propose a simple and effective re-ranking method for improving passage
retrieval in open question answering. The re-ranker re-scores retrieved
passages with a zero-shot question generation model, which uses a pre-trained
language model to compute the probability of the input question conditioned on
a retrieved passage. This approach can be applied on top of any retrieval
method (e.g. neural or keyword-based), does not require any domain- or
task-specific training (and therefore is expected to generalize better to data
distribution shifts), and provides rich cross-attention between query and
passage (i.e. it must explain every token in the question). When evaluated on a
number of open-domain retrieval datasets, our re-ranker improves strong
unsupervised retrieval models by 6%-18% absolute and strong supervised models
by up to 12% in terms of top-20 passage retrieval accuracy. We also obtain new
state-of-the-art results on full open-domain question answering by simply
adding the new re-ranker to existing models with no further changes.
Authors' comments: EMNLP 2022 camera-ready version. Code is available at:
https://github.com/DevSinghSachan/unsupervised-passage-reranking
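The re-ranking rule described above, scoring each retrieved passage by the probability of the question conditioned on that passage, can be sketched as below. The abstract uses a pre-trained language model for this probability; the add-one-smoothed unigram model here is only a toy stand-in so that the example runs without external dependencies, and the function names and `vocab_size` value are assumptions.

```python
import math
from collections import Counter

def question_log_likelihood(question, passage, vocab_size=10_000):
    """Average log p(question token | passage) under a smoothed unigram model.

    Toy stand-in for a pre-trained language model's conditional likelihood.
    """
    passage_tokens = passage.lower().split()
    counts = Counter(passage_tokens)
    total = len(passage_tokens)
    q_tokens = question.lower().split()
    if not q_tokens:
        raise ValueError("question must contain at least one token")
    log_probs = [
        math.log((counts[t] + 1) / (total + vocab_size)) for t in q_tokens
    ]
    return sum(log_probs) / len(log_probs)

def rerank(question, passages):
    """Sort passages so that those that best 'explain' the question come first."""
    return sorted(
        passages,
        key=lambda p: question_log_likelihood(question, p),
        reverse=True,
    )

question = "who painted the mona lisa"
retrieved = [
    "the eiffel tower is in paris",
    "who painted the mona lisa da vinci painted the mona lisa",
    "paris is the capital of france",
]
reranked = rerank(question, retrieved)
```

Because every question token contributes to the score, a passage that covers only part of the question is penalized, which mirrors the "must explain every token in the question" property noted in the abstract.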