Meng Huang, Shixiang Sun, Zhiqiang Xu
Affine phase retrieval is the problem of recovering signals from the
magnitude-only measurements with a priori information. In this paper, we use
the $\ell_1$ minimization to exploit the sparsity of signals for affine phase
retrieval, showing that $O(k\log(en/k))$ Gaussian random measurements are
sufficient to recover all $k$-sparse signals by solving a natural $\ell_1$
minimization program, where $n$ is the dimension of the signals. When the
measurements are corrupted by noise, reconstruction error bounds are given
for both real-valued and complex-valued signals. Our results demonstrate that
the natural $\ell_1$ minimization program for affine phase retrieval is stable.
Authors' comments: 22 pages
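As a hedged illustration of the "natural $\ell_1$ minimization program" for affine phase retrieval named in the abstract above, one plausible formulation is sketched below. The notation is assumed here and does not come from the abstract: $a_j$ are the measurement vectors (rows of $A$), $b_j$ the known affine offsets, $y_j$ the observed magnitudes, $m$ the number of measurements, and $\epsilon$ a noise level.

```latex
% Noiseless case; \mathbb{F} denotes \mathbb{R} or \mathbb{C}.
\min_{x \in \mathbb{F}^n} \|x\|_1
\quad \text{subject to} \quad
|\langle a_j, x \rangle + b_j| = y_j, \qquad j = 1, \dots, m.

% Noisy case, assuming y = |Ax + b| + \eta with \|\eta\|_2 \le \epsilon.
\min_{x \in \mathbb{F}^n} \|x\|_1
\quad \text{subject to} \quad
\big\| \, |Ax + b| - y \, \big\|_2 \le \epsilon.
```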
Ling Luo, Yulia Gryaditskaya, Tao Xiang, Yi-Zhe Song
We study the practical task of fine-grained 3D-VR-sketch-based 3D shape
retrieval. This task is of particular interest as 2D sketches were shown to be
effective queries for 2D images. However, due to the domain gap, it remains
hard to achieve strong performance in 3D shape retrieval from 2D sketches.
Recent work demonstrated the advantage of 3D VR sketching on this task. In our
work, we focus on the challenge caused by inherent inaccuracies in 3D VR
sketches. We observe that retrieval results obtained with a triplet loss with a
fixed margin value, commonly used for retrieval tasks, contain many irrelevant
shapes and often just one or a few with a similar structure to the query. To
mitigate this problem, we for the first time draw a connection between adaptive
margin values and shape similarities. In particular, we propose to use a
triplet loss with an adaptive margin value driven by a "fitting gap", which is
the similarity of two shapes under structure-preserving deformations. We also
conduct a user study which confirms that this fitting gap is indeed a suitable
criterion to evaluate the structural similarity of shapes. Furthermore, we
introduce a dataset of 202 VR sketches for 202 3D shapes drawn from memory
rather than from observation. The code and data are available at
https://github.com/Rowl1ng/Structure-Aware-VR-Sketch-Shape-Retrieval.
Authors' comments: Accepted by 3DV 2022
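The idea of a triplet loss whose margin depends on shape similarity, as described in the abstract above, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the fitting gap is assumed to be a precomputed non-negative dissimilarity score, and the linear mapping `scale * fitting_gap` is a hypothetical choice made only for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def adaptive_triplet_loss(anchor, positive, negative, fitting_gap, scale=1.0):
    """Triplet loss whose margin grows with the fitting gap.

    fitting_gap: assumed precomputed, non-negative dissimilarity between the
                 positive and negative shapes.
    scale:       hypothetical factor that maps the gap to a margin.
    """
    margin = scale * fitting_gap
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

With a fixed margin, every negative is pushed away by the same amount; here, a structurally similar negative (small gap) is penalized less than a dissimilar one.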
Dayou Yang, Susana F. Huelga, Martin B. Plenio
Continuous monitoring of driven-dissipative quantum optical systems is a
crucial element in the implementation of quantum metrology, providing essential
strategies for achieving highly precise measurements beyond the classical
limit. In this context, the relevant figure of merit is the quantum Fisher
information of the radiation field emitted by the driven-dissipative sensor.
Saturation of the corresponding precision limit as defined by the quantum
Cramer-Rao bound is typically not achieved by conventional, temporally local
continuous measurement schemes such as counting or homodyning. To address the
outstanding open challenge of efficient retrieval of the quantum Fisher
information of the emission field, we design a novel continuous measurement
strategy featuring temporally quasilocal measurement bases as captured by
matrix product states. Such measurement can be implemented effectively by
injecting the emission field of the sensor into an auxiliary open system, a
'quantum decoder' module, which 'decodes' specific input matrix product states
into simple product states as its output field, and performing conventional
continuous measurement at the output. We devise a universal recipe for the
construction of the decoder by exploiting time reversal transformation of
quantum optical input-output channels, thereby establishing a universal method
to achieve the quantum Cramer-Rao precision limit for generic sensors based on
continuous measurement. As a by-product, we establish an effective formula for
the evaluation of the quantum Fisher information of the emission field of
generic driven-dissipative open sensors. We illustrate the power of our scheme
with paramagnetic open sensor designs including linear force sensors,
fibre-interfaced nonlinear emitters, and driven-dissipative many-body sensors,
and demonstrate that it can be robustly implemented under realistic
experimental imperfections.
Authors' comments: published version
Qiang Wang, Rongxiang Weng, Ming Chen
K-Nearest Neighbor Neural Machine Translation (kNN-MT) successfully
incorporates an external corpus by retrieving word-level representations at test
time. Generally, kNN-MT borrows the off-the-shelf context representation in the
translation task, e.g., the output of the last decoder layer, as the query
vector of the retrieval task. In this work, we highlight that coupling the
representations of these two tasks is sub-optimal for fine-grained retrieval.
To alleviate this, we leverage supervised contrastive learning to learn the
distinctive retrieval representation derived from the original context
representation. We also propose a fast and effective approach to constructing
hard negative samples. Experimental results on five domains show that our
approach improves the retrieval accuracy and BLEU score compared to vanilla
kNN-MT.
Authors' comments: Accepted by COLING 2022
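For context, the retrieval step of vanilla kNN-MT that the abstract above builds on can be sketched as below. This is an illustrative sketch of the commonly described baseline (nearest-neighbor lookup in a datastore, softmax over negative distances, linear interpolation with the model distribution), not the contrastive method proposed in the paper. The function name, the toy datastore layout, and the default values of `k`, `temperature`, and `lam` are assumptions.

```python
import math


def knn_mt_distribution(query, datastore, mt_probs, k=2, temperature=10.0, lam=0.5):
    """Interpolate a translation model's next-token distribution with a kNN one.

    datastore: list of (key_vector, target_token) pairs.
    mt_probs:  dict mapping token -> probability from the translation model.
    """
    # Squared L2 distance from the query to every key; keep the k nearest.
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    )[:k]
    # Softmax over negative distances, aggregated per target token.
    weights = [math.exp(-dist / temperature) for dist, _ in scored]
    total = sum(weights)
    knn_probs = {}
    for weight, (_, token) in zip(weights, scored):
        knn_probs[token] = knn_probs.get(token, 0.0) + weight / total
    # Linear interpolation of the two distributions.
    tokens = set(mt_probs) | set(knn_probs)
    return {
        t: lam * knn_probs.get(t, 0.0) + (1 - lam) * mt_probs.get(t, 0.0)
        for t in tokens
    }
```

In this baseline the query vector is the same context representation used for translation, which is the coupling the abstract argues is sub-optimal.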
Jiawen Wu, Xinyu Zhang, Yutao Zhu, Zheng Liu, Zikai Guo, Zhaoye Fei, Ruofei Lai, Yongkang Wu et al.
Recent years have witnessed great progress on applying pre-trained language
models, e.g., BERT, to information retrieval (IR) tasks. Hyperlinks, which are
commonly used in Web pages, have been leveraged for designing pre-training
objectives. For example, anchor texts of the hyperlinks have been used for
simulating queries, thus constructing tremendous query-document pairs for
pre-training. However, as a bridge across two web pages, the potential of
hyperlinks has not been fully explored. In this work, we focus on modeling the
relationship between two documents that are connected by hyperlinks and
designing a new pre-training objective for ad-hoc retrieval. Specifically, we
categorize the relationships between documents into four groups: no link,
unidirectional link, symmetric link, and the most relevant symmetric link. By
comparing two documents sampled from adjacent groups, the model can gradually
improve its capability of capturing matching signals. We propose a progressive
hyperlink prediction (PHP) framework to explore the utilization of
hyperlinks in pre-training. Experimental results on two large-scale ad-hoc
retrieval datasets and six question-answering datasets demonstrate its
superiority over existing pre-training methods.
Authors' comments: work in progress
Meng Huang, Zhiqiang Xu
Fourier phase retrieval, which seeks to reconstruct a signal from its Fourier
magnitude, is of fundamental importance in fields of engineering and science.
In this paper, we give a theoretical understanding of algorithms for Fourier
phase retrieval. In particular, we show that if there exists an algorithm that could
reconstruct an arbitrary signal ${\mathbf x}\in {\mathbb C}^N$ in $
\mbox{Poly}(N) \log(1/\epsilon)$ time to reach $\epsilon$-precision from its
magnitude of discrete Fourier transform and its initial value $x(0)$, then
$\mathcal{P}=\mathcal{NP}$. This demystifies the phenomenon that, although
almost all signals are determined uniquely by their Fourier magnitude under a
priori conditions, no algorithm with theoretical guarantees has been
proposed over the past few decades. Our proofs employ the result from
computational complexity theory that the Product Partition problem is
NP-complete in the strong sense.
Authors' comments: 18 pages
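To make the Product Partition problem mentioned in the abstract above concrete, here is a small brute-force checker. It assumes the usual statement of the problem (split a list of positive integers into two parts with equal products); the function name is hypothetical, and the exhaustive search takes exponential time, so it is only meant for tiny inputs.

```python
from itertools import combinations
from math import prod


def has_product_partition(numbers):
    """Return True if the positive integers can be split into two parts
    whose products are equal."""
    total = prod(numbers)
    indices = range(len(numbers))
    for size in range(len(numbers) + 1):
        for subset in combinations(indices, size):
            p = prod(numbers[i] for i in subset)
            # The complement has product total / p, so the two parts match
            # exactly when p * p == total.
            if p * p == total:
                return True
    return False
```

For example, `[2, 8, 4]` splits into `{8}` and `{2, 4}`, while `[2, 3, 5]` admits no such split because its total product is not a perfect square.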
Tomáš Nekvinda, Ondřej Dušek
We introduce AARGH, an end-to-end task-oriented dialog system combining
retrieval and generative approaches in a single model, aiming at improving
dialog management and lexical diversity of outputs. The model features a new
response selection method based on an action-aware training objective and a
simplified single-encoder retrieval architecture which allow us to build an
end-to-end retrieval-enhanced generation model where retrieval and generation
share most of the parameters. On the MultiWOZ dataset, we show that our
approach produces more diverse outputs while maintaining or improving state
tracking and context-to-response generation performance, compared to
state-of-the-art baselines.
Authors' comments: SIGDIAL 2022, with updated examples in Table 4
Feyza Yavuz, Sinan Kalkan
Logo retrieval is a challenging problem since the definition of similarity is
more subjective compared to image retrieval tasks and the set of known
similarities is very scarce. To tackle this challenge, in this paper, we
propose a simple but effective segment-based augmentation strategy to introduce
artificially similar logos for training deep networks for logo retrieval. In
this novel augmentation strategy, we first find segments in a logo and apply
transformations such as rotation, scaling, and color change, on the segments,
unlike the conventional image-level augmentation strategies. Moreover, we
evaluate whether the recently introduced ranking-based loss function,
Smooth-AP, is a better approach for learning similarity for logo retrieval. On
the large scale METU Trademark Dataset, we show that (i) our segment-based
augmentation strategy improves retrieval performance compared to the baseline
model or image-level augmentation strategies, and (ii) Smooth-AP indeed
performs better than conventional losses for logo retrieval.
Authors' comments: ICPR2022, Poster Presentation
Andreas Specker, Mickael Cormier, Jürgen Beyerer
Recognizing soft-biometric pedestrian attributes is essential in video surveillance and fashion retrieval. Recent works show promising results on single datasets. Nevertheless, the generalization ability of these methods under different attribute distributions, viewpoints, varying illumination, and low resolutions remains poorly understood due to strong biases and varying attributes in current datasets. To close this gap and support a systematic investigation, we present UPAR, the Unified Person Attribute Recognition Dataset. It is based on four well-known person attribute recognition datasets: PA100K, PETA, RAPv2, and Market1501. We unify those datasets by providing 3.3M additional annotations to harmonize 40 important binary attributes over 12 attribute categories across the datasets. We thus enable research on generalizable pedestrian attribute recognition as well as attribute-based person retrieval for the first time. Due to the vast variance of the image distribution, pedestrian pose, scale, and occlusion, existing approaches are greatly challenged both in terms of accuracy and efficiency. Furthermore, we develop a strong baseline for pedestrian attribute recognition (PAR) and attribute-based person retrieval based on a thorough analysis of regularization methods. Our models achieve state-of-the-art performance in cross-domain and specialization settings on PA100K, PETA, RAPv2, Market1501-Attributes, and UPAR. We believe UPAR and our strong baseline will contribute to the artificial intelligence community and promote research on large-scale, generalizable attribute recognition systems.
Kung-Hsiang Huang, ChengXiang Zhai, Heng Ji
Fact-checking has gained increasing attention due to the wide spread of
falsified information. Most fact-checking approaches focus on claims made in
English only due to the data scarcity issue in other languages. The lack of
fact-checking datasets in low-resource languages calls for an effective
cross-lingual transfer technique for fact-checking. Additionally, trustworthy
information in different languages can be complementary and helpful in
verifying facts. To this end, we present the first fact-checking framework
augmented with cross-lingual retrieval that aggregates evidence retrieved from
multiple languages through a cross-lingual retriever. Given the absence of
cross-lingual information retrieval datasets with claim-like queries, we train
the retriever with our proposed Cross-lingual Inverse Cloze Task (X-ICT), a
self-supervised algorithm that creates training instances by translating the
title of a passage. The goal for X-ICT is to learn cross-lingual retrieval in
which the model learns to identify the passage corresponding to a given
translated title. On the X-Fact dataset, our approach achieves 2.23% absolute
F1 improvement in the zero-shot cross-lingual setup over prior systems. The
source code and data are publicly available at
https://github.com/khuangaf/CONCRETE.
Authors' comments: Accepted by COLING 2022
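The X-ICT idea described in the abstract above (translate a passage title and train the retriever to find the passage it came from) can be sketched as a data-construction step. This is a rough illustration under stated assumptions: the function name, the dictionary layout, and the `translate` callable are hypothetical, and `fake_translate` is a stand-in for a real machine translation system.

```python
def build_xict_instances(passages, translate, target_langs):
    """Pair each translated title (query) with its source passage (positive).

    passages:     list of (title, body) tuples.
    translate:    hypothetical callable (text, lang) -> translated text.
    target_langs: language codes to translate titles into.
    """
    instances = []
    for title, body in passages:
        for lang in target_langs:
            instances.append(
                {"query": translate(title, lang), "positive": body, "lang": lang}
            )
    return instances


def fake_translate(text, lang):
    """Stand-in translator for demonstration only."""
    return f"[{lang}] {text}"
```

Each instance supplies a cross-lingual (query, positive passage) pair; other passages in a training batch could then serve as negatives, which is a common choice rather than something the abstract specifies.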
Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang
In large-scale retrieval, the lexicon-weighting paradigm, learning weighted
sparse representations in vocabulary space, has shown promising results with
high quality and low latency. Although it deeply exploits the
lexicon-representing capability of pre-trained language models, a crucial gap
remains between language modeling and lexicon-weighting retrieval: the former
prefers certain or low-entropy words, whereas the latter favors pivot or
high-entropy words. This gap is the main barrier to lexicon-weighting
performance for large-scale retrieval. To bridge this gap, we propose a
brand-new pre-training framework, lexicon-bottlenecked masked autoencoder
(LexMAE), to learn importance-aware lexicon representations. Essentially, we
present a lexicon-bottlenecked module between a normal language modeling
encoder and a weakened decoder, where a continuous bag-of-words bottleneck is
constructed to learn a lexicon-importance distribution in an unsupervised
fashion. The pre-trained LexMAE is readily transferred to the lexicon-weighting
retrieval via fine-tuning. On the ad-hoc retrieval benchmark MS-Marco, it
achieves 42.6% MRR@10 with 45.8 QPS for the passage dataset and 44.4% MRR@100
with 134.8 QPS for the document dataset on a CPU machine. LexMAE also shows
state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12
datasets.
Authors' comments: Appeared at ICLR 2023
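The MRR@10 and MRR@100 figures quoted in the abstract above are mean reciprocal rank scores truncated at a cutoff. A minimal sketch of how such a metric is typically computed follows; the function name and input layout are assumptions made for illustration.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank, counting only the first relevant hit in the top k.

    ranked_lists:  one ranked list of document ids per query.
    relevant_sets: one set of relevant document ids per query.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

A query whose first relevant document sits at rank 2 contributes 0.5, and a query with no relevant document in the top k contributes 0.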
Ningze Wang, Anoosheh Heidarzadeh, Alex Sprintson
In recent years, the Multi-message Private Information Retrieval (MPIR) problem has received significant attention from the research community. In this problem, a user wants to privately retrieve $D$ messages out of $K$ messages whose identical copies are stored on $N$ remote servers, while maximizing the download rate. MPIR schemes can find applications in many practical scenarios and can serve as an important building block for private computation and private machine learning applications. The existing solutions for MPIR require a large degree of subpacketization, which can result in large overheads and high complexity and can impose constraints on the system parameters. These factors can limit practical applications of the existing solutions. In this paper, we present a methodology for the design of scalar-linear MPIR schemes. Such schemes are easy to implement in practical systems as they do not require partitioning of messages into smaller sub-messages and do not impose any constraints on the minimum required size of the messages. Focusing on the case of $N=D+1$, we show that when $D$ divides $K$, our scheme achieves the capacity, where the capacity is defined as the maximum achievable download rate. When the divisibility condition does not hold, the performance of our scheme is the same as, or within a small additive margin of, the best known scheme that requires a high degree of subpacketization.
Yabing Wang, Jianfeng Dong, Tianxiang Liang, Minsong Zhang, Rui Cai, Xun Wang
Despite the recent developments in the field of cross-modal retrieval, there
has been less research focusing on low-resource languages due to the lack of
manually annotated datasets. In this paper, we propose a noise-robust
cross-lingual cross-modal retrieval method for low-resource languages. To this
end, we use Machine Translation (MT) to construct pseudo-parallel sentence
pairs for low-resource languages. However, as MT is not perfect, it tends to
introduce noise during translation, rendering textual embeddings corrupted and
thereby compromising the retrieval performance. To alleviate this, we introduce
a multi-view self-distillation method to learn noise-robust target-language
representations, which employs a cross-attention module to generate soft
pseudo-targets to provide direct supervision from the similarity-based view and
feature-based view. Besides, inspired by the back-translation in unsupervised
MT, we minimize the semantic discrepancies between original sentences and
back-translated sentences to further improve the noise robustness of the
textual encoder. Extensive experiments are conducted on three video-text and
image-text cross-modal retrieval benchmarks across different languages, and the
results demonstrate that our method significantly improves the overall
performance without using extra human-labeled data. In addition, equipped with
a pre-trained visual encoder from a recent vision-and-language pre-training
framework, i.e., CLIP, our model achieves a significant performance gain,
showing that our method is compatible with popular pre-training models. Code
and data are available at https://github.com/HuiGuanLab/nrccr.
Authors' comments: Accepted by ACM MM 2022. Code and data are available at
https://github.com/HuiGuanLab/nrccr
Zhengyang Tang, Benyou Wang, Ting Yao
Deep prompt tuning (DPT) has gained great success in most natural language
processing (NLP) tasks. However, it is not well-investigated in dense retrieval
where fine-tuning (FT) still dominates. When deploying multiple retrieval tasks
using the same backbone model (e.g., RoBERTa), FT-based methods are unfriendly
in terms of deployment cost: each new retrieval model needs to repeatedly
deploy the backbone model without reuse. To reduce the deployment cost in such
a scenario, this work investigates applying DPT in dense retrieval. The
challenge is that directly applying DPT in dense retrieval largely
underperforms FT methods. To compensate for the performance drop, we propose
two model-agnostic and task-agnostic strategies for DPT-based retrievers,
namely retrieval-oriented intermediate pretraining and unified negative mining,
as a general approach that could be compatible with any pre-trained language
model and retrieval task. The experimental results show that the proposed
method (called DPTDR) outperforms previous state-of-the-art models on both
MS-MARCO and Natural Questions. We also conduct ablation studies to examine the
effectiveness of each strategy in DPTDR. We believe this work benefits the
industry, as it saves enormous deployment effort and cost and increases
the utility of computing resources. Our code is available at
https://github.com/tangzhy/DPTDR.
Authors' comments: Accepted in COLING 2022
Xu Yan, Chunhui Ai, Ziqiang Cao, Min Cao, Sujian Li, Wenjie Chen, Guohong Fu
An outstanding image-text retrieval model depends on high-quality labeled
data. While the builders of existing image-text retrieval datasets strive to
ensure that the caption matches the linked image, they cannot prevent a caption
from fitting other images. We observe that such a many-to-many matching
phenomenon is quite common in the widely-used retrieval datasets, where one
caption can describe up to 178 images. This large amount of matching-lost data
not only confuses the model during training but also weakens the evaluation accuracy. Inspired
by visual and textual entailment tasks, we propose a multi-modal entailment
classifier to determine whether a sentence is entailed by an image plus its
linked captions. Subsequently, we revise the image-text retrieval datasets by
adding these entailed captions as additional weak labels of an image and
develop a universal variable learning rate strategy to teach a retrieval model
to distinguish the entailed captions from other negative samples. In
experiments, we manually annotate an entailment-corrected image-text retrieval
dataset for evaluation. The results demonstrate that the proposed entailment
classifier achieves about 78% accuracy and consistently improves the
performance of image-text retrieval baselines.
Authors' comments: 10 pages
Xing Wu, Guangyuan Ma, Meng Lin, Zijia Lin, Zhongyuan Wang, Songlin Hu
Dense passage retrieval aims to retrieve the relevant passages of a query
from a large corpus based on dense representations (i.e., vectors) of the query
and the passages. Recent studies have explored improving pre-trained language
models to boost dense retrieval performance. This paper proposes CoT-MAE
(ConTextual Masked Auto-Encoder), a simple yet effective generative
pre-training method for dense passage retrieval. CoT-MAE employs an asymmetric
encoder-decoder architecture that learns to compress the sentence semantics
into a dense vector through self-supervised and context-supervised masked
auto-encoding. Precisely, self-supervised masked auto-encoding learns to model
the semantics of the tokens inside a text span, and context-supervised masked
auto-encoding learns to model the semantic correlation between the text
spans. We conduct experiments on large-scale passage retrieval benchmarks and
show considerable improvements over strong baselines, demonstrating the high
efficiency of CoT-MAE. Our code is available at
https://github.com/caskcsg/ir/tree/main/cotmae.
Authors' comments: This paper has been accepted by AAAI2023
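The dense retrieval setting described at the start of the abstract above (ranking passages by the similarity of dense query and passage vectors) can be illustrated with a small sketch. It assumes inner-product scoring, which is a common choice but is not stated in the abstract, and it takes the vectors as given rather than producing them with an encoder such as CoT-MAE. The function name is hypothetical.

```python
def rank_passages(query_vec, passage_vecs, top_k=3):
    """Rank passage ids by inner product with the query vector.

    passage_vecs: dict mapping passage id -> dense vector.
    """
    scores = {
        pid: sum(q * p for q, p in zip(query_vec, vec))
        for pid, vec in passage_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

In practice the passage vectors are indexed ahead of time and searched with an approximate nearest-neighbor library; the exhaustive scan here only shows the scoring rule.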
Blaž Škrlj, Boshko Koloski, Senja Pollak
Efficiently identifying keyphrases that represent a given document is a challenging task. In recent years, a plethora of keyword detection approaches have been proposed. These approaches can be based on statistical (frequency-based) properties of, e.g., tokens, specialized neural language models, or a graph-based structure derived from a given document. Graph-based methods can be among the most computationally efficient while maintaining retrieval performance. One of the main properties common to graph-based methods is their immediate conversion of token space into graphs, followed by subsequent processing. In this paper, we explore a novel unsupervised approach which merges parts of a document in sequential form prior to construction of the token graph. Further, by leveraging personalized PageRank, which considers frequencies of such sub-phrases alongside token lengths during node ranking, we demonstrate state-of-the-art retrieval capabilities while being up to two orders of magnitude faster than current state-of-the-art unsupervised detectors such as YAKE and MultiPartiteRank. The proposed method's scalability was also demonstrated by computing keyphrases for a biomedical corpus comprising 14 million documents in less than a minute.
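The node-ranking step named in the abstract above, personalized PageRank over a token graph, can be sketched with a plain power iteration. This is a generic illustration rather than the authors' method: the function name, the graph layout, and the default damping factor are assumptions, and the personalization weights merely stand in for the frequency and length information the abstract mentions.

```python
def personalized_pagerank(edges, personalization, damping=0.85, iterations=100):
    """Power iteration for personalized PageRank on a weighted directed graph.

    edges:           dict node -> dict of neighbour -> positive edge weight.
    personalization: dict node -> non-negative teleport preference
                     (at least one value must be positive).
    """
    nodes = set(edges) | set(personalization)
    for neighbours in edges.values():
        nodes |= set(neighbours)
    norm = sum(personalization.values())
    teleport = {n: personalization.get(n, 0.0) / norm for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for u in nodes:
            out = edges.get(u, {})
            weight = sum(out.values())
            if weight == 0:
                # Dangling node: spread its mass along the teleport vector.
                for n in nodes:
                    nxt[n] += damping * rank[u] * teleport[n]
            else:
                for v, w in out.items():
                    nxt[v] += damping * rank[u] * w / weight
        rank = nxt
    return rank
```

Nodes with a larger personalization weight receive more teleport mass, so tokens or sub-phrases that are favored a priori end up ranked higher, all else being equal.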
Guoping Zhao, Bingqing Zhang, Mingyu Zhang, Yaxian Li, Jiajun Liu, Ji-Rong Wen
We propose a video feature representation learning framework called STAR-GNN,
which applies a pluggable graph neural network component on a multi-scale
lattice feature graph. The essence of STAR-GNN is to exploit both the temporal
dynamics and spatial contents as well as visual connections between regions at
different scales in the frames. It models a video with a lattice feature graph
in which the nodes represent regions of different granularity, with weighted
edges that represent the spatial and temporal links. The contextual nodes are
aggregated simultaneously by graph neural networks with parameters trained with
retrieval triplet loss. In the experiments, we show that STAR-GNN effectively
implements a dynamic attention mechanism on video frame sequences, resulting in
an emphasis on dynamic and semantically rich content in the video, and is
robust to noise and redundancies. Empirical results show that STAR-GNN achieves
state-of-the-art performance for Content-Based Video Retrieval.
Authors' comments: 6 pages, 2 figures, ICME 2022 accepted paper
Sophia Althammer, Sebastian Hofstätter, Suzan Verberne, Allan Hanbury
Robust test collections are crucial for Information Retrieval research.
Recently, there has been growing interest in evaluating retrieval systems for
domain-specific retrieval tasks; however, these tasks often lack a reliable test
collection with human-annotated relevance assessments following the Cranfield
paradigm. In the medical domain, the TripClick collection was recently
proposed, which contains click log data from the Trip search engine and
includes two click-based test sets. However, the clicks are biased toward the
retrieval model used, which remains unknown, and a previous study shows that
the test sets have a low judgement coverage for the Top-10 results of lexical
and neural retrieval models. In this paper, we present TripJudge, a novel
relevance-judgement test collection for TripClick health retrieval. We collect
relevance judgements in an annotation campaign and ensure the quality and
reusability of TripJudge by a variety of ranking methods for pool creation, by
multiple judgements per query-document pair and by an at least moderate
inter-annotator agreement. We compare system evaluation with TripJudge and
TripClick and find that click and judgement-based evaluation can lead to
substantially different system rankings.
Authors' comments: To be published at CIKM 2022 as resource paper
Zhongyan Zhang, Lei Wang, Yang Wang, Luping Zhou, Jianjia Zhang, Peng Wang, Fang Chen
Quality feature representation is key to instance image retrieval. To attain it, existing methods usually resort to a deep model pre-trained on benchmark datasets or even fine-tune the model with a task-dependent labelled auxiliary dataset. Although achieving promising results, this approach is restricted by two issues: 1) the domain gap between benchmark datasets and the dataset of a given retrieval task; 2) the required auxiliary dataset cannot be readily obtained. In light of this situation, this work looks into a different approach which has not been well investigated for instance image retrieval previously: can we learn a feature representation specific to a given retrieval task in order to achieve excellent retrieval? Our finding is encouraging. By adding an object proposal generator to generate image regions for self-supervised learning, the investigated approach can successfully learn a feature representation specific to a given dataset for retrieval. This representation can be made even more effective by boosting it with image similarity information mined from the dataset. As experimentally validated, such a simple "self-supervised learning + self-boosting" approach can compete well with the relevant state-of-the-art retrieval methods. An ablation study is conducted to show the appealing properties of this approach and its limitation on generalisation across datasets.