Albert Fannjiang
3D tomographic phase retrieval under the Born approximation for discrete objects supported on an $n\times n\times n$ grid is analyzed. It is proved that $n$ projections are sufficient and necessary for unique determination by computed tomography (CT) with full projected field measurements and that $n+1$ coded projected diffraction patterns are sufficient for unique determination, up to a global phase factor, in tomographic phase retrieval. Hence $n+1$ is nearly, if not exactly, the minimum number of diffraction patterns needed for 3D tomographic phase retrieval under the Born approximation.
Mariya Hendriksen, Maurits Bleeker, Svitlana Vakulenko, Nanne van Noord, Ernst Kuiper, Maarten de Rijke
E-commerce provides rich multimodal data that is barely leveraged in
practice. One aspect of this data is a category tree that is being used in
search and recommendation. However, in practice, during a user's session there
is often a mismatch between a textual and a visual representation of a given
category. Motivated by the problem, we introduce the task of category-to-image
retrieval in e-commerce and propose a model for the task, CLIP-ITA. The model
leverages information from multiple modalities (textual, visual, and attribute
modality) to create product representations. We explore how adding information
from multiple modalities (textual, visual, and attribute modality) impacts the
model's performance. In particular, we observe that CLIP-ITA significantly
outperforms a comparable model that leverages only the visual modality and a
comparable model that leverages the visual and attribute modality.
Authors' comments: 15 pages, accepted as a full paper at ECIR 2022
Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš
In this work we present a systematic empirical study focused on the
suitability of the state-of-the-art multilingual encoders for cross-lingual
document and sentence retrieval tasks across a number of diverse language
pairs. We first treat these models as multilingual text encoders and benchmark
their performance in unsupervised ad-hoc sentence- and document-level CLIR. In
contrast to supervised language understanding, our results indicate that for
unsupervised document-level CLIR -- a setup with no relevance judgments for
IR-specific fine-tuning -- pretrained multilingual encoders on average fail to
significantly outperform earlier models based on CLWEs. For sentence-level
retrieval, we do obtain state-of-the-art performance: the peak scores, however,
are met by multilingual encoders that have been further specialized, in a
supervised fashion, for sentence understanding tasks, rather than using their
vanilla 'off-the-shelf' variants. Following these results, we introduce
localized relevance matching for document-level CLIR, where we independently
score a query against document sections. In the second part, we evaluate
multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to
rank) on English relevance data in a series of zero-shot language and domain
transfer CLIR experiments. Our results show that supervised re-ranking rarely
improves the performance of multilingual transformers as unsupervised base
rankers. Finally, only with in-domain contrastive fine-tuning (i.e., same
domain, only language transfer) do we manage to improve the ranking quality. We
uncover substantial empirical differences between cross-lingual retrieval
results and results of (zero-shot) cross-lingual transfer for monolingual
retrieval in target languages, which point to "monolingual overfitting" of
retrieval models trained on monolingual data.
Authors' comments: to appear in IRJ ECIR 2021 Special Issue. arXiv admin note:
substantial text overlap with arXiv:2101.08370
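The localized relevance matching idea in the abstract above (scoring a query independently against document sections) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy embedding vectors, the cosine similarity, and the choice to aggregate section scores with a maximum are all assumptions made here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def localized_score(query_vec, section_vecs):
    """Score the query against each section independently.

    Aggregating with max is an illustrative assumption, not taken from the paper.
    """
    return max((cosine(query_vec, s) for s in section_vecs), default=0.0)

# Hypothetical 2-d embeddings: only the second section matches the query exactly.
query = [1.0, 0.0]
sections = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
best = localized_score(query, sections)
```

With this aggregation a long document is ranked by its single most relevant section, so unrelated sections do not dilute the score.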
Sarah Ibrahimi, Arnaud Sors, Rafael Sampaio de Rezende, Stéphane Clinchant
Learning with noisy labels is an active research area for image
classification. However, the effect of noisy labels on image retrieval has been
less studied. In this work, we propose a noise-resistant method for image
retrieval named Teacher-based Selection of Interactions, T-SINT, which
identifies noisy interactions, i.e., elements in the distance matrix, and selects
correct positive and negative interactions to be considered in the retrieval
loss by using a teacher-based training setup which contributes to the
stability. As a result, it consistently outperforms state-of-the-art methods on
high noise rates across benchmark datasets with synthetic noise and more
realistic noise.
Authors' comments: Accepted at WACV 2022. 13 pages, 5 figures
Mark Iwen, Michael Perlmutter, Mark Philip Roach
Ptychography is an imaging technique in which a sample is illuminated by a coherent, localized probe. When the probe interacts with the sample, the light is diffracted and a diffraction pattern is detected. Then the sample (or probe) is shifted laterally in space to illuminate a new area of the sample whilst ensuring sufficient overlap. Near-field Ptychography (NFP) occurs when the sample is placed at a short defocus distance having a large Fresnel number. In this paper, we prove that certain NFP measurements are robustly invertible (up to an unavoidable global phase ambiguity) by constructing a point spread function and physical mask which lead to a well-conditioned lifted linear system. We then apply a block phase retrieval algorithm using weighted angular synchronization and prove that the proposed approach accurately recovers the measured sample. Finally, we also propose using Wirtinger Flow for NFP problems and numerically evaluate that alternative approach both against our main proposed approach and on NFP measurements to which our main approach does not apply.
Jheng-Hong Yang, Xueguang Ma, Jimmy Lin
Sparse lexical representation learning has demonstrated much progress in
improving passage retrieval effectiveness in recent models such as DeepImpact,
uniCOIL, and SPLADE. This paper describes a straightforward yet effective
approach for sparsifying lexical representations for passage retrieval,
building on SPLADE by introducing a top-$k$ masking scheme to control sparsity
and a self-learning method to coax masked representations to mimic unmasked
representations. A basic implementation of our model is competitive with more
sophisticated approaches and achieves a good balance between effectiveness and
efficiency. The simplicity of our methods opens the door for future
explorations in lexical representation learning for passage retrieval.
Authors' comments: 8 pages, 1 figure
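The two ingredients named in the abstract above, a top-$k$ mask that controls sparsity and an objective that pushes masked representations toward unmasked ones, can be sketched on a single toy vector. This is a sketch under stated assumptions: the function names are hypothetical, and the mean squared difference stands in for whatever self-learning objective the paper actually uses.

```python
def top_k_mask(weights, k):
    """Keep the k largest term weights and zero out the rest."""
    if k <= 0:
        return [0.0] * len(weights)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set(order[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

def mimic_loss(masked, unmasked):
    """Mean squared difference between masked and unmasked vectors.

    Assumption: a simple stand-in for the self-learning objective.
    """
    return sum((m - u) ** 2 for m, u in zip(masked, unmasked)) / len(unmasked)

# Hypothetical term weights over a five-word vocabulary.
unmasked = [0.1, 2.0, 0.0, 1.5, 0.3]
masked = top_k_mask(unmasked, k=2)
loss = mimic_loss(masked, unmasked)
```

Smaller `k` gives sparser vectors (cheaper inverted-index lookups) at the cost of a larger gap to the unmasked representation.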
A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, Zeynep Akata, Samuel Albanie
The objectives of this work are cross-modal text-audio and audio-text
retrieval, in which the goal is to retrieve the audio content from a pool of
candidates that best matches a given written description and vice versa.
Text-audio retrieval enables users to search large databases through an
intuitive interface: they simply issue free-form natural language descriptions
of the sound they would like to hear. To study the tasks of text-audio and
audio-text retrieval, which have received limited attention in the existing
literature, we introduce three challenging new benchmarks. We first construct
text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho
audio captioning datasets. Additionally, we introduce the SoundDescs benchmark,
which consists of paired audio and natural language descriptions for a diverse
collection of sounds that are complementary to those found in AudioCaps and
Clotho. We employ these three benchmarks to establish baselines for cross-modal
text-audio and audio-text retrieval, where we demonstrate the benefits of
pre-training on diverse audio tasks. We hope that our benchmarks will inspire
further research into audio retrieval with free-form text queries. Code, audio
features for all datasets used, and the SoundDescs dataset are publicly
available at https://github.com/akoepke/audio-retrieval-benchmark.
Authors' comments: Submitted to Transactions on Multimedia. arXiv admin note:
substantial text overlap with arXiv:2105.02192
Ohad Rubin, Jonathan Herzig, Jonathan Berant
In-context learning is a recent paradigm in natural language understanding,
where a large pre-trained language model (LM) observes a test instance and a
few training examples as its input, and directly decodes the output without any
update to its parameters. However, performance has been shown to strongly
depend on the selected training examples (termed prompt). In this work, we
propose an efficient method for retrieving prompts for in-context learning
using annotated data and an LM. Given an input-output pair, we estimate the
probability of the output given the input and a candidate training example as
the prompt, and label training examples as positive or negative based on this
probability. We then train an efficient dense retriever from this data, which
is used to retrieve training examples as prompts at test time. We evaluate our
approach on three sequence-to-sequence tasks where language utterances are
mapped to meaning representations, and find that it substantially outperforms
prior work and multiple baselines across the board.
Authors' comments: NAACL-HLT 2022
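The labeling step in the abstract above (marking candidate training examples as positive or negative according to an LM-estimated probability) might look like the sketch below. The probabilities are hard-coded placeholders rather than real LM outputs, and taking the top-$k$ and bottom-$k$ candidates is an assumption made here; the abstract only says the labels are based on the probability.

```python
def label_candidates(scored, k):
    """Split (candidate, probability) pairs into positives and negatives.

    Assumption: the k highest-scoring candidates are positives and the
    k lowest-scoring ones are negatives.
    """
    if k <= 0:
        return [], []
    if 2 * k > len(scored):
        raise ValueError("need at least 2*k candidates to avoid overlap")
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    positives = [cand for cand, _ in ranked[:k]]
    negatives = [cand for cand, _ in ranked[-k:]]
    return positives, negatives

# Placeholder probabilities standing in for P(output | input, candidate prompt).
scored = [("ex_a", 0.9), ("ex_b", 0.1), ("ex_c", 0.5), ("ex_d", 0.7)]
positives, negatives = label_candidates(scored, k=1)
```

The resulting positive and negative sets are the kind of supervision a dense retriever could then be trained on.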
Luis Welbanks, Nikku Madhusudhan
The complexity of atmospheric retrieval models is largely data-driven and
one-dimensional models have generally been considered adequate with current
data quality. However, recent studies have suggested that using 1D models in
retrievals can result in anomalously cool terminator temperatures and biased
abundance estimates even with existing transmission spectra of hot Jupiters.
Motivated by these claims and upcoming high-quality transmission spectra we
systematically explore the limitations of 1D models using synthetic and current
observations. We use 1D models of varying complexity, both analytic and
numerical, to revisit claims of biases when interpreting transmission spectra
of hot Jupiters with inhomogeneous terminator compositions. Overall, we find
the reported biases to result from specific model assumptions rather than
intrinsic limitations of 1D atmospheric models in retrieving current
observations of asymmetric terminators. Additionally, we revise atmospheric
retrievals of the hot Jupiter WASP-43b ($T_{\rm eq}=1440$ K) and the ultra-hot
Jupiter WASP-103b ($T_{\rm eq}=2484$ K) for which previous studies inferred
abnormally cool atmospheric temperatures. We retrieve temperatures consistent
with expectations. We note, however, that in the limit of extreme terminator
inhomogeneities and high data quality some atmospheric inferences may
conceivably be biased, although to a lesser extent than previously claimed. To
address such cases, we implement a 2D retrieval framework for transmission
spectra which allows accurate constraints on average atmospheric properties and
provides insights into the spectral ranges where the imprints of atmospheric
inhomogeneities are strongest. Our study highlights the need for careful
considerations of model assumptions and data quality before attributing biases
in retrieved estimates to unaccounted atmospheric inhomogeneities.
Authors' comments: Replaced with version accepted for publication in The Astrophysical
Journal
Xin Liu, Dayiheng Liu, Baosong Yang, Haibo Zhang, Junwei Ding, Wenqing Yao, Weihua Luo, Haiying Zhang et al.
Generative commonsense reasoning, which requires machines to generate sentences describing an everyday scenario given several concepts, has attracted much attention recently. However, existing models cannot perform as well as humans, since the sentences they produce are often implausible and grammatically incorrect. In this paper, inspired by the process by which humans create sentences, we propose a novel Knowledge-enhanced Commonsense Generation framework, termed KGR^4, consisting of four stages: Retrieval, Retrospect, Refine, Rethink. Under this framework, we first perform retrieval to search for relevant sentences from an external corpus as the prototypes. Then, we train a generator that either edits or copies these prototypes to generate candidate sentences, whose potential errors are fixed by an autoencoder-based refiner. Finally, we select the output sentence from candidate sentences produced by generators with different hyper-parameters. Experimental results and in-depth analysis on the CommonGen benchmark strongly demonstrate the effectiveness of our framework. In particular, KGR^4 obtains 33.56 SPICE points on the official leaderboard, outperforming the previously reported best result by 2.49 SPICE points and achieving state-of-the-art performance.
Jian-Feng Cai, Meng Huang, Dong Li, Yang Wang
A fundamental problem in phase retrieval is to reconstruct an unknown signal
from a set of magnitude-only measurements. In this work we introduce three
novel quotient intensity-based models (QIMs) based on a deep modification of the
traditional intensity-based models. A remarkable feature of the new loss
functions is that the corresponding geometric landscape is benign under the
optimal sampling complexity. When the measurements $a_i\in \mathbb{R}^n$ are Gaussian
random vectors and the number of measurements $m\ge Cn$, the QIMs admit no
spurious local minimizers with high probability, i.e., the target solution $x$
is the unique global minimizer (up to a global phase) and the loss function has
a negative directional curvature around each saddle point. Such a benign
geometric landscape allows gradient descent methods to find the global
solution $x$ (up to a global phase) without spectral initialization.
Authors' comments: 41 pages
Jian-Feng Cai, Meng Huang, Dong Li, Yang Wang
A fundamental task in phase retrieval is to recover an unknown signal $\mathbf{x}\in
\mathbb{R}^n$ from a set of magnitude-only measurements $y_i=|\langle \mathbf{a}_i,\mathbf{x}\rangle|, \;
i=1,\ldots,m$. In this paper, we propose two novel perturbed amplitude models
(PAMs) which have non-convex and quadratic-type loss functions. When the
measurements $\mathbf{a}_i \in \mathbb{R}^n$ are Gaussian random vectors and the number of
measurements $m\ge Cn$, we rigorously prove that the PAMs admit no spurious
local minimizers with high probability, i.e., the target solution $\mathbf{x}$ is the
unique global minimizer (up to a global phase) and the loss function has a
negative directional curvature around each saddle point. Thanks to the
well-tamed benign geometric landscape, one can employ the vanilla gradient
descent method to locate the global minimizer $\mathbf{x}$ (up to a global phase)
without spectral initialization. We carry out extensive numerical experiments
to show that the gradient descent algorithm with random initialization
outperforms state-of-the-art algorithms with spectral initialization in
empirical success rate and convergence speed.
Authors' comments: 60 pages
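The qualifier "up to a global phase" in the abstract above follows directly from the measurement model: in the real case, a signal and its negation produce identical magnitude-only measurements, so no algorithm can tell them apart. A small sketch of that fact, using a hypothetical fixed matrix in place of Gaussian random vectors so the numbers are easy to check by hand:

```python
def measure(A, x):
    """Magnitude-only measurements y_i = |<a_i, x>| for each row a_i of A."""
    return [abs(sum(a_ij * x_j for a_ij, x_j in zip(row, x))) for row in A]

# Hypothetical measurement vectors (rows) and signal, chosen for easy arithmetic.
A = [[1.0, 2.0],
     [3.0, -1.0]]
x = [1.0, -2.0]
neg_x = [-v for v in x]

y = measure(A, x)          # |1 - 4| = 3 and |3 + 2| = 5
y_neg = measure(A, neg_x)  # same magnitudes: the sign of x is lost
```

Recovery guarantees in this setting are therefore stated for the pair $\{\mathbf{x}, -\mathbf{x}\}$ rather than for $\mathbf{x}$ alone.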
Jian-Feng Cai, Yuling Jiao, Xiliang Lu, Juntao You
In this work we propose a nonconvex two-stage stochastic alternating minimizing (SAM) method for sparse phase retrieval. The proposed algorithm is guaranteed to achieve exact recovery from $O(s\log n)$ samples, provided the initial guess lies in a local neighbourhood of the ground truth. The algorithm therefore has two stages: first we estimate a suitable initial guess (e.g., via a spectral method), and then we introduce a randomized alternating minimization strategy for local refinement. The hard-thresholding pursuit algorithm is employed to solve the sparsity-constrained least-squares subproblems. We give theoretical justification that SAM finds the underlying signal exactly in a finite number of iterations (no more than $O(\log m)$ steps) with high probability. Further, numerical experiments illustrate that SAM requires fewer measurements than state-of-the-art algorithms for the sparse phase retrieval problem.
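Hard-thresholding pursuit, mentioned in the abstract above, is built around a hard-thresholding operator that projects a vector onto the set of $s$-sparse vectors by keeping its $s$ largest-magnitude entries. The sketch below shows only that operator, not the full pursuit (which also re-solves a least-squares problem on the selected support) and not the SAM method itself; the function name is hypothetical.

```python
def hard_threshold(x, s):
    """Keep the s entries of x with largest magnitude; set the others to zero."""
    if s <= 0:
        return [0.0] * len(x)
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    support = set(order[:s])
    return [v if i in support else 0.0 for i, v in enumerate(x)]

# Toy vector: the two largest magnitudes are -3.0 and 2.0.
x = [0.5, -3.0, 1.0, 2.0]
x_sparse = hard_threshold(x, s=2)
```

Selection is by magnitude, so negative entries survive with their sign intact.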
Yulong Li, Martin Franz, Md Arafat Sultan, Bhavani Iyer, Young-Suk Lee, Avirup Sil
We present DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual
Representation), a new cross-lingual information retrieval (CLIR) system
trained using multi-stage knowledge distillation (KD). The teacher of DR.DECR
relies on a highly effective but computationally expensive two-stage inference
process consisting of query translation and monolingual IR, while the student,
DR.DECR, executes a single CLIR step. We teach DR.DECR powerful multilingual
representations as well as CLIR by optimizing two corresponding KD objectives.
Learning useful representations of non-English text from an English-only
retriever is accomplished through a cross-lingual token alignment algorithm
that relies on the representation capabilities of the underlying multilingual
encoders. In both in-domain and zero-shot out-of-domain evaluation, DR.DECR
demonstrates far superior accuracy over direct fine-tuning with labeled CLIR
data. It is also the best single-model retriever on the XOR-TyDi benchmark at
the time of this writing.
Authors' comments: Presented at NAACL 2022 main conference. Code can be found at:
https://github.com/primeqa/primeqa
Zeqiu Wu, Yi Luan, Hannah Rashkin, David Reitter, Hannaneh Hajishirzi, Mari Ostendorf, Gaurav Singh Tomar
Compared to standard retrieval tasks, passage retrieval for conversational
question answering (CQA) poses new challenges in understanding the current user
question, as each question needs to be interpreted within the dialogue context.
Moreover, it can be expensive to re-train well-established retrievers such as
search engines that are originally developed for non-conversational queries. To
facilitate their use, we develop a query rewriting model CONQRR that rewrites a
conversational question in the context into a standalone question. It is
trained with a novel reward function to directly optimize towards retrieval
using reinforcement learning and can be adapted to any off-the-shelf retriever.
CONQRR achieves state-of-the-art results on a recent open-domain CQA dataset
containing conversations from three different sources, and is effective for two
different off-the-shelf retrievers. Our extensive analysis also shows the
robustness of CONQRR to out-of-domain dialogues as well as to zero query
rewriting supervision.
Authors' comments: EMNLP 2022 camera-ready
Mingfei Gao, Le Xue, Chetan Ramaiah, Chen Xing, Ran Xu, Caiming Xiong
We propose value retrieval with arbitrary queries for form-like documents to reduce the human effort of processing forms. Unlike previous methods that only address a fixed set of field items, our method predicts the target value for an arbitrary query based on an understanding of the layout and semantics of a form. To further boost model performance, we propose a simple document language modeling (SimpleDLM) strategy to improve document understanding in large-scale model pre-training. Experimental results show that our method outperforms previous designs significantly and that SimpleDLM further improves our performance on value retrieval by around 17% F1 score compared with the state-of-the-art pre-training method. Code is available at https://github.com/salesforce/QVR-SimpleDLM.
Anjali A. A. Piette, Nikku Madhusudhan, Avi M. Mandell
Emission spectroscopy is a promising technique to observe atmospheres of
rocky exoplanets, probing both their chemistry and thermal profiles. We present
HyDRo, an atmospheric retrieval framework for thermal emission spectra of rocky
exoplanets. HyDRo does not make prior assumptions about the background
atmospheric composition, and can therefore be used to interpret spectra of
secondary atmospheres with unknown compositions. We use HyDRo to assess the
chemical constraints which can be placed on rocky exoplanet atmospheres using
JWST. Firstly, we identify the best currently-known rocky exoplanet candidates
for spectroscopic observations in thermal emission with JWST, finding >30 known
rocky exoplanets whose thermal emission will be detectable by JWST/MIRI in
fewer than 10 eclipses at R~10. We then consider the observations required to
characterise the atmospheres of three promising rocky exoplanets across the
~400-800 K equilibrium temperature range: Trappist-1 b, GJ 1132 b, and LHS 3844
b. Considering a range of CO_2- to H_2O-rich atmospheric compositions, we find
that as few as 8 eclipses of LHS 3844 b or GJ 1132 b with MIRI will be able to
place important constraints on the chemical compositions of their atmospheres.
This includes confident detections of CO_2 and H_2O in the case of a cloud-free
CO_2-rich composition, besides ruling out a bare rock scenario. Similarly, 30
eclipses of Trappist-1 b with MIRI/LRS can allow detections of a cloud-free
CO_2-rich or CO_2-H_2O atmosphere. HyDRo will allow important atmospheric
constraints for rocky exoplanets using JWST observations, providing clues about
their geochemical environments.
Authors' comments: Accepted for publication in MNRAS. 21 pages, 10 figures, 4 tables
Sheng-Chieh Lin, Jimmy Lin
Learned sparse and dense representations capture different successful approaches to text retrieval and the fusion of their results has proven to be more effective and robust. Prior work combines dense and sparse retrievers by fusing their model scores. As an alternative, this paper presents a simple approach to densifying sparse representations for text retrieval that does not involve any training. Our densified sparse representations (DSRs) are interpretable and can be easily combined with dense representations for end-to-end retrieval. We demonstrate that our approach can jointly learn sparse and dense representations within a single model and then combine them for dense retrieval. Experimental results suggest that combining our DSRs and dense representations yields a balanced tradeoff between effectiveness and efficiency.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau et al.
We enhance auto-regressive language models by conditioning on document chunks
retrieved from a large corpus, based on local similarity with preceding tokens.
With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO)
obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite
using 25$\times$ fewer parameters. After fine-tuning, RETRO performance
translates to downstream knowledge-intensive tasks such as question answering.
RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked
cross-attention mechanism to predict tokens based on an order of magnitude more
data than what is typically consumed during training. We typically train RETRO
from scratch, yet can also rapidly RETROfit pre-trained transformers with
retrieval and still achieve good performance. Our work opens up new avenues for
improving language models through explicit memory at unprecedented scale.
Authors' comments: Fix incorrect reported numbers in Table 14
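The retrieval idea in the abstract above (splitting text into chunks and fetching the database chunks most similar to the preceding tokens) can be illustrated with a toy sketch. It is not RETRO: the word-overlap (Jaccard) similarity is a deliberately simple stand-in for the frozen BERT retriever, and the chunk size, function names, and example text are all hypothetical.

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def jaccard(a, b):
    """Word-overlap similarity; an assumption standing in for embedding similarity."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def retrieve(query_chunk, database_chunks, k=1):
    """Return the k database chunks most similar to the query chunk."""
    ranked = sorted(database_chunks,
                    key=lambda c: jaccard(query_chunk, c),
                    reverse=True)
    return ranked[:k]

# Toy "database" built from one sentence, split into 3-token chunks.
database = chunk("the cat sat on the mat while the dog slept".split(), 3)
neighbours = retrieve(["the", "dog", "barked"], database, k=1)
```

In a retrieval-enhanced model the neighbours would be fed to the network as extra conditioning context; here they are simply returned.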
Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass et al.
Multi-modal learning from video data has seen increased attention recently as
it allows training semantically meaningful embeddings without human annotation,
enabling tasks like zero-shot retrieval and classification. In this work, we
present a multi-modal, modality agnostic fusion transformer approach that
learns to exchange information between multiple modalities, such as video,
audio, and text, and integrate them into a joint multi-modal representation to
obtain an embedding that aggregates multi-modal temporal information. We
propose to train the system with a combinatorial loss on everything at once,
single modalities as well as pairs of modalities, explicitly leaving out any
add-ons such as position or modality encoding. At test time, the resulting
model can process and fuse any number of input modalities. Moreover, the
implicit properties of the transformer allow processing inputs of different
lengths. To evaluate the proposed approach, we train the model on the large
scale HowTo100M dataset and evaluate the resulting embedding space on four
challenging benchmark datasets obtaining state-of-the-art results in zero-shot
video retrieval and zero-shot video action localization.
Authors' comments: CVPR 2022. The final published version of the proceedings will be
available on IEEE Xplore
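The combinatorial training scheme in the abstract above covers single modalities as well as pairs of modalities. A minimal sketch of enumerating those input combinations is given below. Only the enumeration is shown; the loss applied to each combination is not specified in the abstract and is not modelled here, and the function name is hypothetical.

```python
from itertools import combinations

def modality_inputs(modalities):
    """List every single modality and every unordered pair of modalities."""
    singles = [(m,) for m in modalities]
    pairs = list(combinations(modalities, 2))
    return singles + pairs

inputs = modality_inputs(["video", "audio", "text"])
```

For three modalities this yields three single-modality inputs and three pairs, six combinations in total.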