Huang Xie, Okko Räsänen, Tuomas Virtanen
This paper investigates negative sampling for contrastive learning in the
context of audio-text retrieval. The strategy for negative sampling refers to
selecting negatives (either audio clips or textual descriptions) from a pool of
candidates for a positive audio-text pair. We explore sampling strategies via
model-estimated within-modality and cross-modality relevance scores for audio
and text samples. With a constant training setting on the retrieval system from
[1], we study eight sampling strategies, including hard and semi-hard negative
sampling. Experimental results show that retrieval performance varies
dramatically among different strategies. Particularly, by selecting semi-hard
negatives with cross-modality scores, the retrieval system gains improved
performance in both text-to-audio and audio-to-text retrieval. In addition, we
show that feature collapse occurs when sampling hard negatives with
cross-modality scores.
Authors' comments: Accepted at ICASSP2023
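The difference between hard and semi-hard negative sampling described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the paper's exact selection rule, so the definition of "semi-hard" used here (a negative scoring below the positive but within a margin of it) is an assumption borrowed from common metric-learning practice, and `pick_negative`, `margin`, and the toy scores are hypothetical.

```python
from typing import List, Optional


def pick_negative(scores: List[float], pos_index: int,
                  strategy: str = "semi-hard",
                  margin: float = 0.2) -> Optional[int]:
    """Return the index of one negative candidate for a positive pair.

    scores[i] is a model-estimated relevance score between the anchor and
    candidate i; scores[pos_index] belongs to the true match.
    """
    pos_score = scores[pos_index]
    negatives = [i for i in range(len(scores)) if i != pos_index]
    if strategy == "hard":
        # Hard negative: any non-matching candidate may be chosen,
        # and the highest-scoring one wins.
        pool = negatives
    elif strategy == "semi-hard":
        # Semi-hard negative (assumed definition): scores below the
        # positive but no more than `margin` below it.
        pool = [i for i in negatives
                if pos_score - margin < scores[i] < pos_score]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if not pool:
        return None  # no candidate satisfies the rule
    return max(pool, key=lambda i: scores[i])


# Toy cross-modality scores; index 0 is the positive item.
toy_scores = [0.90, 0.95, 0.80, 0.30, 0.75]
hard = pick_negative(toy_scores, pos_index=0, strategy="hard")
semi_hard = pick_negative(toy_scores, pos_index=0, strategy="semi-hard")
```

In this toy case the hard strategy selects the candidate that outscores the positive, while the semi-hard strategy skips it and selects the closest candidate that still scores below the positive.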
Yue Guo, Wei Qiu, Gondy Leroy, Sheng Wang, Trevor Cohen
Recent lay language generation systems have used Transformer models trained on a parallel corpus to increase health information accessibility. However, the applicability of these models is constrained by the limited size and topical breadth of available corpora. We introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. The abstract and the corresponding lay language summary are written by domain experts, assuring the quality of our dataset. Furthermore, qualitative evaluation of expert-authored plain language summaries has revealed background explanation as a key strategy to increase accessibility. Such explanation is challenging for neural models to generate because it goes beyond simplification by adding content absent from the source. We derive two specialized paired corpora from CELLS to address key challenges in lay language generation: generating background explanations and simplifying the original abstract. We adopt retrieval-augmented models as an intuitive fit for the task of background explanation generation, and show improvements in summary quality and simplicity while maintaining factual correctness. Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the path for disseminating scientific knowledge to a broader audience. CELLS is publicly available at: https://github.com/LinguisticAnomalies/pls_retrieval.
Anuj Diwan, Puyuan Peng, Raymond J. Mooney
For the majority of the machine learning community, the expensive nature of
collecting high-quality human-annotated data and the inability to efficiently
finetune very large state-of-the-art pretrained models on limited compute are
major bottlenecks for building models for new tasks. We propose a simple
zero-shot approach for one such task, Video Moment Retrieval (VMR), that does not
perform any additional finetuning and simply repurposes off-the-shelf models
trained on other tasks. Our three-step approach consists of moment proposal,
moment-query matching and postprocessing, all using only off-the-shelf models.
On the QVHighlights benchmark for VMR, we vastly improve the performance of
previous zero-shot approaches by at least 2.5x on all metrics and reduce the
gap between zero-shot and state-of-the-art supervised approaches by over 74%. Further, we
also show that our zero-shot approach beats non-pretrained supervised models on
the Recall metrics and comes very close on mAP metrics; and that it also
performs better than the best pretrained supervised model on shorter moments.
Finally, we ablate and analyze our results and propose interesting future
directions.
Authors' comments: Accepted to the NeurIPS 2022 Workshop on Transfer Learning for NLP
(TL4NLP). 12 pages, 5 figures
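The three-step structure named in the abstract above (moment proposal, moment-query matching, postprocessing) can be sketched as follows. This is only an illustration of the pipeline shape: the abstract does not say how each step is realized, so the sliding-window proposals, the `score_fn` callback standing in for an off-the-shelf matching model, and the temporal non-maximum suppression used for postprocessing are all assumptions.

```python
from typing import Callable, List, Tuple

Moment = Tuple[float, float]  # (start, end) in seconds


def propose_moments(duration: float, window: float,
                    stride: float) -> List[Moment]:
    """Step 1 (assumed): fixed-length sliding windows as proposals."""
    moments = []
    start = 0.0
    while start + window <= duration:
        moments.append((start, start + window))
        start += stride
    return moments


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def retrieve(duration: float, score_fn: Callable[[Moment], float],
             window: float = 10.0, stride: float = 5.0,
             iou_threshold: float = 0.3) -> List[Moment]:
    moments = propose_moments(duration, window, stride)
    # Step 2: rank proposals by the moment-query matching score.
    ranked = sorted(moments, key=score_fn, reverse=True)
    # Step 3 (assumed): temporal NMS drops proposals that overlap a
    # higher-ranked kept proposal by more than the threshold.
    kept: List[Moment] = []
    for m in ranked:
        if all(temporal_iou(m, k) <= iou_threshold for k in kept):
            kept.append(m)
    return kept


def toy_score(m: Moment) -> float:
    # Hypothetical stand-in for a pretrained video-text model: prefers
    # moments centred near t = 12 s.
    return -abs((m[0] + m[1]) / 2 - 12.0)


results = retrieve(30.0, toy_score)
```

With a real system, `toy_score` would be replaced by the similarity between the query and the video frames inside each proposed moment.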
Atal Tewari, Vikrant Jain, Nitin Khanna
Impact craters are formed due to continuous impacts on the surface of planetary bodies. Most recent deep learning-based crater detection methods treat craters as circular shapes, and less attention is paid to extracting the exact shapes of craters. Extracting precise crater shapes can be helpful for many advanced analyses, such as the study of crater formation. This paper proposes a combination of unsupervised non-deep learning and semi-supervised deep learning approaches to accurately extract the shapes of craters and detect craters missing from the existing catalog. For the unsupervised non-deep learning part, we propose an adaptive rim extraction algorithm to extract craters' shapes. This algorithm utilizes the elevation profiles of DEMs and applies morphological operations on DEM-derived slopes. The extracted shapes are then used in semi-supervised deep learning to obtain the locations, sizes, and refined shapes of craters. Further, the extracted shapes are utilized to improve the estimates of the craters' diameter, depth, and other morphological factors. The craters' shapes, estimated diameters, and depths, along with other morphological factors, will be made publicly available.
Shujian Zhang, Chengyue Gong, Xingchao Liu
Retriever-reader models achieve competitive performance across many different
NLP tasks such as open question answering and dialogue conversations. In this
work, we observe that these models easily overfit to the top-ranked retrieved
passages and that standard training fails to reason over the entire set of
retrieved passages. We introduce a learnable passage mask mechanism which
reduces the influence of the top-ranked passages and prevents the model from overfitting.
Controlling the gradient variance with fewer mask candidates and selecting the
mask candidates with one-shot bi-level optimization, our learnable
regularization strategy enforces the answer generation to focus on the entire
retrieval passages. Experiments on different tasks across open question
answering, dialogue conversation, and fact verification show that our method
consistently outperforms its baselines. Extensive experiments and ablation
studies demonstrate that our method can be general, effective, and beneficial
for many NLP tasks.
Authors' comments: EMNLP 2022
Zhong Zhuang, David Yang, Felix Hofmann, David Barmherzig, Ju Sun
Phase retrieval (PR) concerns the recovery of complex phases from complex magnitudes. We identify the connection between the difficulty level and the number and variety of symmetries in PR problems. We focus on the most difficult far-field PR (FFPR), and propose a novel method using double deep image priors. In realistic evaluation, our method outperforms all competing methods by large margins. As a single-instance method, our method requires no training data and minimal hyperparameter tuning, and hence enjoys good practicality.
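The symmetries mentioned in the abstract above can be made concrete with a minimal sketch. It assumes a 1D discrete, circular Fourier model (a simplification of the far-field setting) and checks the three classic ambiguities of Fourier phase retrieval: circular shift, conjugate flip, and global phase. All function names are hypothetical, and the naive DFT is used only to keep the example self-contained.

```python
import cmath
from typing import List


def dft_magnitudes(x: List[complex]) -> List[float]:
    """Magnitudes of the discrete Fourier transform (naive O(n^2) DFT)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)))
        for k in range(n)
    ]


def circular_shift(x: List[complex], s: int) -> List[complex]:
    n = len(x)
    return [x[(t - s) % n] for t in range(n)]


def conjugate_flip(x: List[complex]) -> List[complex]:
    # y[t] = conj(x[-t mod n]); its DFT is the conjugate of the DFT of x.
    n = len(x)
    return [x[(-t) % n].conjugate() for t in range(n)]


def global_phase(x: List[complex], theta: float) -> List[complex]:
    return [cmath.exp(1j * theta) * v for v in x]


def same_magnitudes(x: List[complex], y: List[complex],
                    tol: float = 1e-9) -> bool:
    """True if x and y have the same Fourier magnitudes up to tol."""
    return all(abs(a - b) < tol
               for a, b in zip(dft_magnitudes(x), dft_magnitudes(y)))


signal = [1 + 2j, -0.5j, 3 + 0j, 0.25 - 1j, 2j]
shift_ok = same_magnitudes(signal, circular_shift(signal, 2))
flip_ok = same_magnitudes(signal, conjugate_flip(signal))
phase_ok = same_magnitudes(signal, global_phase(signal, 0.7))
```

Because all three transformed signals produce identical magnitude measurements, no method can distinguish them from the original using magnitudes alone.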
Si Sun, Chenyan Xiong, Yue Yu, Arnold Overwijk, Zhiyuan Liu, Jie Bao
In this paper, we investigate the instability in standard dense retrieval
training, which iterates between model training and hard negative selection
using the model being trained. We show the catastrophic forgetting phenomenon
behind the training instability, where models learn and forget different
negative groups during training iterations. We then propose ANCE-Tele, which
accumulates momentum negatives from past iterations and approximates future
iterations using lookahead negatives, as "teleportations" along the time axis
to smooth the learning process. On web search and OpenQA, ANCE-Tele outperforms
previous state-of-the-art systems of similar size, eliminates the dependency on
sparse retrieval negatives, and is competitive among systems using
significantly more (50x) parameters. Our analysis demonstrates that
teleportation negatives reduce catastrophic forgetting and improve convergence
speed for dense retrieval training. Our code is available at
https://github.com/OpenMatch/ANCE-Tele.
Authors' comments: Accepted to EMNLP 2022 main conference
Yan Wang, Xin Luo, Zhen-Duo Chen, Peng-Fei Zhang, Meng Liu, Xin-Shun Xu
Despite the great success achieved, existing video moment retrieval (VMR) methods are developed under the assumption that data are stored centrally. However, in real-world applications, due to the inherent nature of data generation and privacy concerns, data are often distributed across different silos, which poses major challenges for effective large-scale training. In this work, we try to overcome the above limitation by leveraging the recent success of federated learning. As the first such exploration in the VMR field, we define a new task: video moment retrieval with distributed data. Then, a novel federated learning method named FedVMR is proposed to facilitate large-scale and secure training of VMR models in a decentralized environment. Experiments on benchmark datasets demonstrate its effectiveness. This work is the very first attempt to enable safe and efficient VMR training in decentralized settings, and we hope it paves the way for further study in related research fields.
Rihao Chang, Yongtao Ma, Tong Hao, Weizhi Nie
The surge in 3D modeling has led to a pronounced research emphasis on the field of 3D shape retrieval. Numerous contemporary approaches have been put forth to tackle this intricate challenge. Nevertheless, effectively addressing the intricacies of cross-modal 3D shape retrieval remains a formidable undertaking, owing to inherent modality-based disparities. This study presents an innovative notion, termed "geometric words", which functions as elemental constituents for representing entities through combinations. To establish the knowledge graph, we employ geometric words as nodes, connecting them via shape categories and geometry attributes. Subsequently, we devise a unique graph embedding method for knowledge acquisition. Finally, an effective similarity measure is introduced for retrieval purposes. Importantly, each 3D or 2D entity can anchor its geometric terms within the knowledge graph, thereby serving as a link between cross-domain data. As a result, our approach facilitates multiple cross-domain 3D shape retrieval tasks. We evaluate the proposed method's performance on the ModelNet40 and ShapeNetCore55 datasets, encompassing scenarios related to 3D shape retrieval and cross-domain retrieval. Furthermore, we employ the established cross-modal dataset (MI3DOR) to assess cross-modal 3D shape retrieval. The resulting experimental outcomes, in conjunction with comparisons against state-of-the-art techniques, clearly highlight the superiority of our approach.
Lahari Poddar, György Szarvas, Cheng Wang, Jorge Balazs, Pavel Danchenko, Patrick Ernst
Task-oriented dialogue systems in industry settings need to have high
conversational capability, be easily adaptable to changing situations and
conform to business constraints. This paper describes a 3-step procedure to
develop a conversational model that satisfies these criteria and can
efficiently scale to rank a large set of response candidates. First, we provide
a simple algorithm to semi-automatically create a high-coverage template set
from historic conversations without any annotation. Second, we propose a neural
architecture that encodes the dialogue context and applicable business
constraints as profile features for ranking the next turn. Third, we describe a
two-stage learning strategy with self-supervised training, followed by
supervised fine-tuning on limited data collected through a human-in-the-loop
platform. Finally, we describe offline experiments and present results of
deploying our model with human-in-the-loop to converse with live customers
online.
Authors' comments: Accepted at EMNLP 2022
Junren Chen, Michael K. Ng
The main aim of this paper is to study quaternion phase retrieval (QPR),
i.e., the recovery of a quaternion signal from the magnitude of quaternion linear
measurements. We show that all $d$-dimensional quaternion signals can be
reconstructed up to a global right quaternion phase factor from $O(d)$
phaseless measurements. We also develop the scalable algorithm quaternion
Wirtinger flow (QWF) for solving QPR, and establish its linear convergence
guarantee. Compared with the analysis of complex Wirtinger flow, a series of
different treatments are employed to overcome the difficulties of the
non-commutativity of quaternion multiplication. Moreover, we develop a variant
of QWF that can effectively utilize a pure quaternion prior (e.g., for color
images) by incorporating a quaternion phase factor estimate into QWF
iterations. The estimate can be computed efficiently as it amounts to finding a
singular vector of a $4\times 4$ real matrix. Motivated by the variants of
Wirtinger flow in prior work, we further propose quaternion truncated Wirtinger
flow (QTWF), quaternion truncated amplitude flow (QTAF) and their pure
quaternion versions. Experimental results on synthetic data and color images
are presented to validate our theoretical results. In particular, for pure
quaternion signal recovery, our quaternion method often succeeds with
notably fewer measurements than real-valued methods based on the monochromatic
model or the concatenation model.
Authors' comments: IEEE Transactions on Signal Processing, camera ready
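Two points from the abstract above, the non-commutativity of quaternion multiplication and the recovery "up to a global right quaternion phase factor", can be checked with a small sketch. It assumes measurements of the form |sum_k conj(a_k) x_k|, which is one natural reading of "magnitude of quaternion linear measurements" rather than a form stated in the abstract; the helper names and toy vectors are hypothetical.

```python
import math
from typing import List, Tuple

# A quaternion w + xi + yj + zk stored as (w, x, y, z).
Quaternion = Tuple[float, float, float, float]

ONE: Quaternion = (1.0, 0.0, 0.0, 0.0)
I: Quaternion = (0.0, 1.0, 0.0, 0.0)
J: Quaternion = (0.0, 0.0, 1.0, 0.0)
K: Quaternion = (0.0, 0.0, 0.0, 1.0)


def qmul(p: Quaternion, q: Quaternion) -> Quaternion:
    """Hamilton product p * q (order matters)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def qconj(q: Quaternion) -> Quaternion:
    w, x, y, z = q
    return (w, -x, -y, -z)


def qnorm(q: Quaternion) -> float:
    return math.sqrt(sum(c * c for c in q))


def measure(a: List[Quaternion], x: List[Quaternion]) -> float:
    """Assumed phaseless measurement |sum_k conj(a_k) * x_k|."""
    total = (0.0, 0.0, 0.0, 0.0)
    for ak, xk in zip(a, x):
        term = qmul(qconj(ak), xk)
        total = tuple(t + u for t, u in zip(total, term))
    return qnorm(total)


a = [ONE, I]  # toy sensing vector
x = [ONE, I]  # toy signal
right = [qmul(xk, J) for xk in x]  # signal times a right phase factor
left = [qmul(J, xk) for xk in x]   # signal times a left phase factor

m_original = measure(a, x)
m_right = measure(a, right)
m_left = measure(a, left)
```

In this toy case the right phase factor leaves the measurement unchanged while the left one does not, which illustrates why the ambiguity is specifically a right phase factor under the assumed measurement model.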
Timo Breuer, Narges Tavakolpoursaleh, Johann Schaible, Daniel Hienert, Philipp Schaer, Leyla Jael Castro
Involving users in early phases of software development has become a common
strategy as it enables developers to consider user needs from the beginning.
Once a system is in production, new opportunities to observe, evaluate and
learn from users emerge as more information becomes available. Gathering
information from users to continuously evaluate their behavior is a common
practice for commercial software, while the Cranfield paradigm remains the
preferred option for Information Retrieval (IR) and recommendation systems in
the academic world. Here we introduce the Infrastructures for Living Labs
(STELLA) project, which aims to create an evaluation infrastructure allowing
experimental systems to run alongside production web-based academic search systems
with real users. STELLA combines user interactions and log file analyses to
enable large-scale A/B experiments for academic search.
Authors' comments: arXiv admin note: text overlap with arXiv:2203.05430
Zhaopeng Dou, Zhongdao Wang, Weihua Chen, Yali Li, Shengjin Wang
Current person image retrieval methods have achieved great improvements in accuracy metrics. However, they rarely describe the reliability of the prediction. In this paper, we propose an Uncertainty-Aware Learning (UAL) method to remedy this issue. UAL aims at providing reliability-aware predictions by considering data uncertainty and model uncertainty simultaneously. Data uncertainty captures the "noise" inherent in the sample, while model uncertainty depicts the model's confidence in the sample's prediction. Specifically, in UAL, (1) we propose a sampling-free data uncertainty learning method to adaptively assign weights to different samples during training, down-weighting the low-quality ambiguous samples; (2) we leverage the Bayesian framework to model the model uncertainty by assuming the parameters of the network follow a Bernoulli distribution; and (3) the data uncertainty and the model uncertainty are jointly learned in a unified network, and they serve as two fundamental criteria for the reliability assessment: if a probe is high-quality (low data uncertainty) and the model is confident in the prediction of the probe (low model uncertainty), the final ranking will be assessed as reliable. Experiments under the risk-controlled settings and the multi-query settings show the proposed reliability assessment is effective. Our method also shows superior performance on three challenging benchmarks under the vanilla single query settings.
Gyuwan Kim, Jinhyuk Lee, Barlas Oguz, Wenhan Xiong, Yizhe Zhang, Yashar Mehdad, William Yang Wang
Building dense retrievers requires a series of standard procedures, including
training and validating neural models and creating indexes for efficient
search. However, these procedures are often misaligned in that training
objectives do not exactly reflect the retrieval scenario at inference time. In
this paper, we explore how the gap between training and inference in dense
retrieval can be reduced, focusing on dense phrase retrieval (Lee et al., 2021)
where billions of representations are indexed at inference. Since validating
every dense retriever with a large-scale index is practically infeasible, we
propose an efficient way of validating dense retrievers using a small subset of
the entire corpus. This allows us to validate various training strategies
including unifying contrastive loss terms and using hard negatives for phrase
retrieval, which largely reduces the training-inference discrepancy. As a
result, we improve top-1 phrase retrieval accuracy by 2-3 points and top-20
passage retrieval accuracy by 2-4 points for open-domain question answering.
Our work urges modeling dense retrievers with careful consideration of training
and inference via efficient validation while advancing phrase retrieval as a
general solution for dense retrieval.
Authors' comments: Findings of EMNLP 2022; 12 pages, 3 figures
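The top-1 and top-20 accuracies quoted in the abstract above are instances of top-k retrieval accuracy: the fraction of queries for which at least one correct item appears among the first k results. A minimal sketch of that metric follows; the function name, the identifier format, and the toy rankings are hypothetical, and the paper's own evaluation code may differ in detail.

```python
from typing import List, Set


def top_k_accuracy(rankings: List[List[str]], gold: List[Set[str]],
                   k: int) -> float:
    """Fraction of queries with a gold item among the top-k results.

    rankings[q] is the ranked list of retrieved ids for query q and
    gold[q] is the set of ids counted as correct for that query.
    """
    hits = sum(
        1 for ranked, answers in zip(rankings, gold)
        if answers & set(ranked[:k])
    )
    return hits / len(gold)


# Toy rankings for three queries over a small subset index.
toy_rankings = [
    ["p3", "p1", "p7"],
    ["p2", "p9", "p4"],
    ["p5", "p6", "p8"],
]
toy_gold = [{"p1"}, {"p2"}, {"p0"}]

top1 = top_k_accuracy(toy_rankings, toy_gold, k=1)
top2 = top_k_accuracy(toy_rankings, toy_gold, k=2)
```

Running the same function against rankings produced from a small subset of the corpus is one way to get a cheap validation signal, in the spirit of the subset-based validation the abstract describes.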
Emily Calamari, Jacqueline K. Faherty, Ben Burningham, Eileen Gonzales, Daniella Bardalez-Gagliuffi, Johanna M. Vos, Marina Gemma, Niall Whiteford et al.
We present results from an atmospheric retrieval analysis of Gl 229B using
the BREWSTER retrieval code. We find the best fit model to be cloud-free,
consistent with the T dwarf retrieval work of Line et al. 2017, Zalesky et al.
2022 and Gonzales et al. 2020. Fundamental parameters (mass, radius,
$\log(L_{Bol}/L_{Sun})$, $\log(g)$) determined from our model agree within $1\sigma$ with
SED-derived values except for $T_{eff}$, where our retrieved $T_{eff}$ is
approximately 100 K cooler than the evolutionary model-based SED value. We find
a retrieved mass of $50^{+12}_{-9}$ $M_{Jup}$; however, we also find that the
observables of Gl 229B can be explained by a cloud-free model with a prior on
mass at the dynamical value, 70 $M_{Jup}$. We are able to constrain abundances
for H$_2$O, CO, CH$_4$, NH$_3$, Na and K and find a supersolar C/O ratio as compared
to its primary, Gl 229A. We report an overall subsolar metallicity due to
atmospheric oxygen depletion but find a solar [C/H], which matches that of the
primary. We find that this work contributes to a growing trend in
retrieval-based studies, particularly for brown dwarfs, toward supersolar C/O
ratios and discuss the implications of this result on formation mechanisms,
internal physical processes as well as model biases.
Authors' comments: 26 pages, 8 tables, 8 figures. Accepted for publication in ApJ
Minjoon Jung, Seongho Choi, Joochan Kim, Jin-Hwa Kim, Byoung-Tak Zhang
Video corpus moment retrieval (VCMR) is the task of retrieving the most
relevant video moment from a large video corpus using a natural language query.
For narrative videos, e.g., dramas or movies, the holistic understanding of
temporal dynamics and multimodal reasoning is crucial. Previous works have
shown promising results; however, they relied on the expensive query
annotations for VCMR, i.e., the corresponding moment intervals. To overcome
this problem, we propose a self-supervised learning framework: Modal-specific
Pseudo Query Generation Network (MPGN). First, MPGN selects candidate temporal
moments via subtitle-based moment sampling. Then, it generates pseudo queries
exploiting both visual and textual information from the selected temporal
moments. Through the multimodal information in the pseudo queries, we show that
MPGN successfully learns to localize the video corpus moment without any
explicit annotation. We validate the effectiveness of MPGN on the TVR dataset,
showing competitive results compared with both supervised models and
unsupervised setting models.
Authors' comments: Accepted by EMNLP 2022 main conference
Liyuan Ma, Hongxia Wang, Ningyi Leng, Ziyang Yuan
Fourier phase retrieval (FPR) is an inverse problem that recovers a signal from its Fourier magnitude measurements; it is ill-posed, especially when the sampling rate is low. In this paper, an untrained generative prior is introduced to address the ill-posedness. Based on the alternating direction method of multipliers (ADMM), an algorithm utilizing the untrained generative network, called Net-ADM, is proposed to solve the FPR problem. First, the objective function is smoothed and the dimension of the variable is raised to facilitate calculation. Then an untrained generative network is embedded in the iterative process of ADMM to project an estimated signal into the generative space, and the projected signal is passed to the next iteration of ADMM. We theoretically analyze the two projections included in the algorithm: one makes the objective function decrease, and the other brings the estimate closer to the optimal solution. Numerical experiments show that the reconstruction performance and robustness of the proposed algorithm are superior to prior works, especially when the sampling rates are low.
Wenhao Yu, Chenguang Zhu, Zhihan Zhang, Shuohang Wang, Zhuosheng Zhang, Yuwei Fang, Meng Jiang
A common thread of retrieval-augmented methods in the existing literature
focuses on retrieving encyclopedic knowledge, such as Wikipedia, which
facilitates well-defined entity and relation spaces that can be modeled.
However, applying such methods to commonsense reasoning tasks faces two unique
challenges, i.e., the lack of a general large-scale corpus for retrieval and a
corresponding effective commonsense retriever. In this paper, we systematically
investigate how to leverage commonsense knowledge retrieval to improve
commonsense reasoning tasks. We propose a unified framework of
retrieval-augmented commonsense reasoning (called RACo), including a newly
constructed commonsense corpus with over 20 million documents and novel
strategies for training a commonsense retriever. We conduct experiments on
four different commonsense reasoning tasks. Extensive evaluation results show
that our proposed RACo can significantly outperform other knowledge-enhanced
counterparts, achieving new SoTA performance on the CommonGen and CREAK
leaderboards.
Authors' comments: EMNLP 2022 (main)
Leonard Adolphs, Michelle Chen Huebscher, Christian Buck, Sertan Girgin, Olivier Bachem, Massimiliano Ciaramita, Thomas Hofmann
Neural retrieval models have superseded classic bag-of-words methods such as BM25 as the retrieval framework of choice. However, neural systems lack the interpretability of bag-of-words models; it is not trivial to connect a query change to a change in the latent space that ultimately determines the retrieval results. To shed light on this embedding space, we learn a "query decoder" that, given a latent representation of a neural search engine, generates the corresponding query. We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph. In particular, the query decoder can be useful to understand "what should have been asked" to retrieve a particular paragraph from the collection. We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco, leading to improved retrieval performance. On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion that outperforms both query reformulation and PRF information retrieval baselines.
Hong Xuan, Xi Chen
Visual-Semantic Embedding (VSE) is a prevalent approach in image-text
retrieval by learning a joint embedding space between the image and language
modalities where semantic similarities would be preserved. The triplet loss
with hard-negative mining has become the de-facto objective for most VSE
methods. Inspired by recent progress in deep metric learning (DML) in the image
domain which gives rise to new loss functions that outperform triplet loss, in
this paper, we revisit the problem of finding better objectives for VSE in
image-text matching. Despite some attempts in designing losses based on
gradient movement, most DML losses are defined empirically in the embedding
space. Instead of directly applying these loss functions which may lead to
sub-optimal gradient updates in model parameters, in this paper we present a
novel Gradient-based Objective AnaLysis framework, or GOAL, to
systematically analyze the combinations and reweighting of the gradients in
existing DML functions. With the help of this analysis framework, we further
propose a new family of objectives in the gradient space exploring different
gradient combinations. In the event that the gradients are not integrable to a
valid loss function, we implement our proposed objectives such that they would
directly operate in the gradient space instead of on the losses in the
embedding space. Comprehensive experiments demonstrate that our novel
objectives consistently improve performance over baselines across
different visual/text features and model frameworks. We also show the
generalizability of the GOAL framework by extending it to other models using
triplet-family losses, including vision-language models with heavy cross-modal
interactions, and achieve state-of-the-art results on the image-text
retrieval tasks on COCO and Flickr30K.
Authors' comments: arXiv admin note: text overlap with arXiv:2201.11307
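The baseline objective that the abstract above calls the de-facto choice, a triplet loss with hard-negative mining, can be sketched in a few lines. The version below is one common formulation (a hinge on the hardest in-batch negative in each retrieval direction) and is not the GOAL objective itself; the function name, the margin value, and the toy similarity matrix are hypothetical.

```python
from typing import List


def hardest_negative_triplet_loss(sim: List[List[float]],
                                  margin: float = 0.2) -> float:
    """Triplet loss using the hardest in-batch negative per direction.

    sim[i][j] is the similarity between image i and text j, and the
    diagonal holds the matching pairs. Requires at least two pairs.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative text for image i (image-to-text direction).
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative image for text i (text-to-image direction).
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        loss += max(0.0, margin + hardest_text - pos)
        loss += max(0.0, margin + hardest_image - pos)
    return loss


toy_sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.6, 0.5],
]
toy_loss = hardest_negative_triplet_loss(toy_sim)
```

Only negatives that come within the margin of the positive contribute to the loss, so a batch of well-separated pairs yields zero loss and therefore zero gradient.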