Tengtao Song, Nuo Chen, Ji Jiang, Zhihong Zhu, Yuexian Zou
Multi-turn response selection is a challenging task due to its high demands on efficient extraction of the matching features from abundant information provided by context utterances. Since incorporating syntactic information like dependency structures into neural models can promote a better understanding of the sentences, such a method has been widely used in NLP tasks. Though syntactic information helps models achieve pleasing results, its application in retrieval-based dialogue systems has not been fully explored. Meanwhile, previous works focus on intra-sentence syntax alone, which is far from satisfactory for multi-turn response selection, where dialogues usually contain multiple sentences. To this end, we propose SIA, Syntax-Informed Attention, considering both intra- and inter-sentence syntax information. While the former restricts attention scope to only between tokens and corresponding dependents in the syntax tree, the latter allows attention in cross-utterance pairs for those syntactically important tokens. We evaluate our method on three widely used benchmarks and experimental results demonstrate the general superiority of our method on dialogue response selection.
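The intra-sentence restriction described above can be pictured as an attention mask derived from a dependency tree. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `heads` encoding, the symmetric head/dependent links, the self-attention on the diagonal, and the function names are all hypothetical choices made for illustration.

```python
import numpy as np

def intra_sentence_mask(heads):
    """Build a boolean attention mask from a dependency tree.

    heads[i] is the index of token i's head, or -1 for the root.
    mask[i, j] is True when token i may attend to token j.
    Assumption: links are symmetric and every token attends to itself.
    """
    n = len(heads)
    mask = np.eye(n, dtype=bool)
    for child, head in enumerate(heads):
        if head >= 0:
            mask[head, child] = True  # a token attends to its dependent
            mask[child, head] = True  # assumed: the dependent attends back
    return mask

def masked_softmax(scores, mask):
    """Row-wise softmax over the positions allowed by the mask."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy sentence "the cat sat": the -> cat, cat -> sat, sat is the root.
heads = [1, 2, -1]
mask = intra_sentence_mask(heads)
weights = masked_softmax(np.zeros((3, 3)), mask)
```

In this toy example "the" and "sat" are not directly linked in the tree, so the attention weight between them is zero, while each row of `weights` still sums to one.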
Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, Huchuan Lu
All instance perception tasks aim at finding certain objects specified by
some queries such as category names, language expressions, and target
annotations, but this complete field has been split into multiple independent
subtasks. In this work, we present a universal instance perception model of the
next generation, termed UNINEXT. UNINEXT reformulates diverse instance
perception tasks into a unified object discovery and retrieval paradigm and can
flexibly perceive different types of objects by simply changing the input
prompts. This unified formulation brings the following benefits: (1) enormous
data from different tasks and label vocabularies can be exploited for jointly
training general instance-level representations, which is especially beneficial
for tasks lacking in training data. (2) the unified model is
parameter-efficient and can save redundant computation when handling multiple
tasks simultaneously. UNINEXT shows superior performance on 20 challenging
benchmarks from 10 instance-level tasks including classical image-level tasks
(object detection and instance segmentation), vision-and-language tasks
(referring expression comprehension and segmentation), and six video-level
object tracking tasks. Code is available at
https://github.com/MasterBin-IIAU/UNINEXT.
Authors' comments: CVPR2023
Sunwoo Kim, Kyuhong Shim, Luong Trung Nguyen, Byonghyo Shim
Image text retrieval is a task to search for the proper textual descriptions
of the visual world and vice versa. One challenge of this task is the
vulnerability to input image and text corruptions. Such corruptions are often
unobserved during the training, and degrade the retrieval model decision
quality substantially. In this paper, we propose a novel image text retrieval
technique, referred to as robust visual semantic embedding (RVSE), which
consists of novel image-based and text-based augmentation techniques called
semantic preserving augmentation for image (SPAugI) and text (SPAugT). Since
SPAugI and SPAugT change the original data in a way that preserves its semantic
information, we force the feature extractors to generate
semantic-aware embedding vectors regardless of the corruption, improving the
model robustness significantly. From extensive experiments using benchmark
datasets, we show that RVSE outperforms conventional retrieval schemes in terms
of image-text retrieval performance.
Authors' comments: Accepted to ICASSP 2023
Xianghao Xu, Paul Guerrero, Matthew Fisher, Siddhartha Chaudhuri, Daniel Ritchie
Representing a 3D shape with a set of primitives can aid perception of
structure, improve robotic object manipulation, and enable editing,
stylization, and compression of 3D shapes. Existing methods either use simple
parametric primitives or learn a generative shape space of parts. Both have
limitations: parametric primitives lead to coarse approximations, while learned
parts offer too little control over the decomposition. We instead propose to
decompose shapes using a library of 3D parts provided by the user, giving full
control over the choice of parts. The library can contain parts with
high-quality geometry that are suitable for a given category, resulting in
meaningful decompositions with clean geometry. The type of decomposition can
also be controlled through the choice of parts in the library. Our method works
via a self-supervised approach that iteratively retrieves parts from the
library and refines their placements. We show that this approach gives higher
reconstruction accuracy and more desirable decompositions than existing
approaches. Additionally, we show how the decomposition can be controlled
through the part library by using different part libraries to reconstruct the
same shapes.
Authors' comments: CVPR 2023
N. Weiße, J. Esslinger, S. Howard, F. M. Foerster, F. Haberstroh, L. Doyle, P. Norreys, J. Schreiber et al.
Knowledge of spatio-temporal couplings such as pulse-front tilt or curvature is important to determine the focused intensity of high-power lasers. Common techniques to diagnose these couplings are either qualitative or require hundreds of measurements. Here we present both a new algorithm for retrieving spatio-temporal couplings, as well as novel experimental implementations. Our method is based on the expression of the spatio-spectral phase in terms of a Zernike-Taylor basis, allowing us to directly quantify the coefficients for common spatio-temporal couplings. We take advantage of this method to perform quantitative measurements using a simple experimental setup, consisting of different bandpass filters in front of a Shack-Hartmann wavefront sensor. This fast acquisition of laser couplings using narrowband filters, abbreviated FALCON, is easy and cheap to implement in existing facilities. To this end, we present a measurement of spatio-temporal couplings at the ATLAS-3000 petawatt laser using our technique.
Cyril Zakka, Akash Chaurasia, Rohan Shad, Alex R. Dalal, Jennifer L. Kim, Michael Moor, Kevin Alexander, Euan Ashley et al.
Large language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In this study, we develop Almanac, a large language model framework augmented with retrieval capabilities for medical guideline and treatment recommendations. Performance on a novel dataset of clinical scenarios (n = 130) evaluated by a panel of 5 board-certified and resident physicians demonstrates significant increases in factuality (mean of 18% at p-value < 0.05) across all specialties, with improvements in completeness and safety. Our results demonstrate the potential for large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings.
Zhiqi Huang, Puxuan Yu, James Allan
In this paper, we introduce the approach behind our submission for the MIRACL challenge, a WSDM 2023 Cup competition that centers on ad-hoc retrieval across 18 diverse languages. Our solution contains two neural-based models. The first model is a bi-encoder re-ranker, on which we apply a cross-lingual distillation technique to transfer ranking knowledge from English to the target language space. The second model is a cross-encoder re-ranker trained on multilingual retrieval data generated using neural machine translation. We further fine-tune both models using MIRACL training data and ensemble multiple rank lists to obtain the final result. According to the MIRACL leaderboard, our approach ranks 8th for the Test-A set and 2nd for the Test-B set among the 16 known languages.
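The abstract above does not say how the rank lists are ensembled. As one common way to combine ranked lists, the sketch below uses reciprocal rank fusion (RRF); this is a swapped-in technique chosen for illustration, and the constant `k=60`, the function name, and the document IDs are assumptions rather than details from the submission.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rank_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of two re-rankers for the same query.
bi_encoder_ranking = ["d1", "d2", "d3"]
cross_encoder_ranking = ["d2", "d3", "d1"]
fused = reciprocal_rank_fusion([bi_encoder_ranking, cross_encoder_ranking])
```

Here `d2` comes first in the fused list because it is ranked near the top by both models, even though neither list agrees fully with the other.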
Cunjuan Zhu, Qi Jia, Wei Chen, Yanming Guo, Yu Liu
Video-Text Retrieval (VTR) aims to search for the most relevant video related
to the semantics in a given sentence, and vice versa. In general, this
retrieval task is composed of four successive steps: video feature extraction,
textual feature extraction, feature embedding and matching, and objective
functions. Finally, a list of samples retrieved from the dataset is ranked
based on their matching similarities to the query. In recent years, significant
progress has been achieved by deep learning techniques;
however, VTR is still a challenging task due to problems such as how to learn
an efficient spatial-temporal video feature and how to narrow the cross-modal
gap. In this survey, we review and summarize over 100 research papers related
to VTR, demonstrate state-of-the-art performance on several commonly
benchmarked datasets, and discuss potential challenges and directions, with the
expectation to provide some insights for researchers in the field of video-text
retrieval.
Authors' comments: International Journal of Multimedia Information Retrieval (IJMIR)
Tobias Norlund, Ehsan Doostmohammadi, Richard Johansson, Marco Kuhlmann
Recent work on the Retrieval-Enhanced Transformer (RETRO) model has shown that off-loading memory from trainable weights to a retrieval database can significantly improve language modeling and match the performance of non-retrieval models that are an order of magnitude larger in size. It has been suggested that at least some of this performance gain is due to non-trivial generalization based on both model weights and retrieval. In this paper, we try to better understand the relative contributions of these two components. We find that the performance gains from retrieval largely originate from overlapping tokens between the database and the test data, suggesting less non-trivial generalization than previously assumed. More generally, our results point to the challenges of evaluating the generalization of retrieval-augmented language models such as RETRO, as even limited token overlap may significantly decrease test-time loss. We release our code and model at https://github.com/TobiasNorlund/retro
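To make the notion of token overlap between retrieved text and test data concrete, here is a small illustrative measure: the fraction of n-grams in a test chunk that also occur in a retrieved chunk. This is a sketch of one possible overlap statistic, not the metric used in the paper; the function name, the choice of bigrams, and the whitespace tokenization are assumptions.

```python
def ngram_overlap(test_tokens, retrieved_tokens, n=2):
    """Fraction of distinct n-grams in the test chunk found in the retrieved chunk."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    test_ngrams = ngrams(test_tokens)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & ngrams(retrieved_tokens)) / len(test_ngrams)

# Toy example with whitespace tokenization (an assumption for illustration).
test_chunk = "the cat sat on the mat".split()
retrieved_chunk = "a cat sat on a mat".split()
overlap = ngram_overlap(test_chunk, retrieved_chunk)
```

Bucketing test chunks by such an overlap score and comparing the loss per bucket is one way to see how much of a retrieval gain comes from copied tokens.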
Benno Weck, Xavier Serra
The recent progress in text-based audio retrieval was largely propelled by
the release of suitable datasets. Since the manual creation of such datasets is
a laborious task, obtaining data from online resources can be a cheap solution
to create large-scale datasets. We study the recently proposed SoundDesc
benchmark dataset, which was automatically sourced from the BBC Sound Effects
web page. In our analysis, we find that SoundDesc contains several duplicates
that cause leakage of training data to the evaluation data. This data leakage
ultimately leads to overly optimistic retrieval performance estimates in
previous benchmarks. We propose new training, validation, and testing splits
for the dataset that we make available online. To avoid weak contamination of
the test data, we pool audio files that share similar recording setups. In our
experiments, we find that the new splits serve as a more challenging benchmark.
Authors' comments: 5 pages. Accepted at ICASSP2023
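The idea of pooling related audio files so that they cannot straddle the training and evaluation data can be sketched as a group-aware split. This is a minimal illustration under assumptions, not the procedure used to build the proposed splits: the `group_id` labels, the hash-based assignment, the 20% test fraction, and the omission of a validation split are all hypothetical simplifications.

```python
import hashlib

def assign_split(group_id, test_fraction=0.2):
    """Deterministically map a whole group to 'train' or 'test'."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # value in [0, 1]
    return "test" if bucket < test_fraction else "train"

def split_by_group(items, test_fraction=0.2):
    """items: iterable of (file_name, group_id) pairs.

    Every file inherits the split of its group, so duplicates or recordings
    from the same setup never end up on both sides.
    """
    splits = {"train": [], "test": []}
    for file_name, group_id in items:
        splits[assign_split(group_id, test_fraction)].append(file_name)
    return splits

# Hypothetical files; the first two share a recording setup.
items = [("a.wav", "session-1"), ("b.wav", "session-1"), ("c.wav", "session-2")]
splits = split_by_group(items)
```

Because the assignment depends only on the group label, adding new files to an existing group never moves that group across the split boundary.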
Haoxiang Zhang, He Jiang, Ziqiang Wang, Deqiang Cheng
Zero-Shot Sketch-Based Image Retrieval (ZSSBIR) is an emerging task. The
pioneering work focused on the modal gap but ignored inter-class information.
Although recent work has begun to consider the triplet-based or contrast-based
loss to mine inter-class information, positive and negative samples need to be
carefully selected, or the model is prone to lose modality-specific
information. To respond to these issues, an Ontology-Aware Network (OAN) is
proposed. Specifically, the smooth inter-class independence learning mechanism
is put forward to maintain inter-class peculiarity. Meanwhile,
distillation-based consistency preservation is utilized to keep
modality-specific information. Extensive experiments have demonstrated the
superior performance of our algorithm on two challenging datasets, Sketchy and
TU-Berlin.
Authors' comments: 4 pages, 3 figures
Zhixin Ma, Chong-Wah Ngo
Known-item video search is effective with human-in-the-loop to interactively
investigate the search result and refine the initial query. Nevertheless, when
the first few pages of results are swamped with visually similar items, or the
search target is hidden deep in the ranked list, finding the known-item target
usually requires a long duration of browsing and result inspection. This paper
tackles the problem by reinforcement learning, aiming to reach a search target
within a few rounds of interaction by long-term learning from user feedback.
Specifically, the system interactively plans a navigation path based on
feedback and recommends a potential target that maximizes the long-term reward
for user comment. We conduct experiments for the challenging task of video
corpus moment retrieval (VCMR) to localize moments from a large video corpus.
The experimental results on TVR and DiDeMo datasets verify that our proposed
work is effective in retrieving the moments that are hidden deep inside the
ranked lists of CONQUER and HERO, which are the state-of-the-art auto-search
engines for VCMR.
Authors' comments: Accepted by ACM Multimedia 2022
Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yaowei Li, Yuexian Zou
With the rise of short videos, the demand for selecting appropriate
background music (BGM) for a video has increased significantly, and the video-music
retrieval (VMR) task has gradually drawn much attention from the research community. As
in other cross-modal learning tasks, existing VMR approaches usually attempt to
measure the similarity between the video and music in the feature space.
However, they (1) neglect the inevitable label noise; (2) neglect to enhance
the ability to capture critical video clips. In this paper, we propose a novel
saliency-based self-training framework, which is termed SSVMR. Specifically, we
first make full use of the information contained in the training
dataset by applying a semi-supervised self-training method to suppress the
adverse impact of label noise. In addition, we
propose to capture the saliency of the video by mixing two videos at span level
and preserving the locality of the two original videos. Inspired by back
translation in NLP, we also conduct back retrieval to obtain more training
data. Experimental results on the MVD dataset show that our SSVMR achieves
state-of-the-art performance by a large margin, obtaining a relative
improvement of 34.8% over the previous best model in terms of R@1.
Authors' comments: Accepted by ICASSP 2023
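For readers unfamiliar with the metric quoted above, R@1 is recall at rank 1, and a relative improvement compares two scores as a ratio rather than a difference. The sketch below shows both computations on toy values; the query and music IDs and the scores 0.30 and 0.20 are made up for illustration and are not results from the paper.

```python
def recall_at_k(ranked_lists, ground_truth, k=1):
    """Share of queries whose correct item appears in the top-k results.

    ranked_lists: dict mapping query ID to a ranked list of candidate IDs.
    ground_truth: dict mapping query ID to the single correct candidate ID.
    """
    hits = sum(1 for q, ranked in ranked_lists.items() if ground_truth[q] in ranked[:k])
    return hits / len(ranked_lists)

def relative_improvement(new_score, old_score):
    """Improvement of new_score over old_score as a fraction of old_score."""
    return (new_score - old_score) / old_score

# Toy retrieval results: video queries ranked against music tracks.
ranked = {"v1": ["m1", "m2"], "v2": ["m3", "m2"], "v3": ["m1", "m3"]}
truth = {"v1": "m1", "v2": "m2", "v3": "m3"}
r_at_1 = recall_at_k(ranked, truth, k=1)
r_at_2 = recall_at_k(ranked, truth, k=2)
gain = relative_improvement(0.30, 0.20)  # toy scores, not from the paper
```

Note that a relative gain can look large when the baseline score is small, which is why absolute R@1 values are usually reported alongside it.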
Yimu Wang, Peng Shi
While recent progress in video-text retrieval has been advanced by the
exploration of better representation learning, in this paper, we present a
novel multi-grained sparse learning framework, S3MA, to learn an aligned sparse
space shared between the video and the text for video-text retrieval. The
shared sparse space is initialized with a finite number of sparse concepts,
each of which refers to a number of words. With the text data at hand, we learn
and update the shared sparse space in a supervised manner using the proposed
similarity and alignment losses. Moreover, to enable multi-grained alignment,
we incorporate frame representations for better modeling the video modality and
calculating fine-grained and coarse-grained similarities. Benefiting from the
learned shared sparse space and multi-grained similarities, extensive
experiments on several video-text retrieval benchmarks demonstrate the
superiority of S3MA over existing methods. Our code is available at
https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.
Authors' comments: Findings of EMNLP 2023
Rima Alaifari, Francesca Bartolucci, Matthias Wellershoff
We study the problem of recovering a signal from magnitudes of its wavelet
frame coefficients when the analyzing wavelet is real-valued. We show that
every real-valued signal can be uniquely recovered, up to global sign, from its
multi-wavelet frame coefficients \[ \{\lvert \mathcal{W}_{\phi_i}
f(\alpha^{m}\beta n,\alpha^{m}) \rvert: i\in\{1,2,3\}, m,n\in\mathbb{Z}\} \]
for every $\alpha>1,\beta>0$ with $\beta\ln(\alpha)\leq 4\pi/(1+4p)$, $p>0$,
when the three wavelets $\phi_i$ are suitable linear combinations of the
Poisson wavelet $P_p$ of order $p$ and its Hilbert transform $\mathscr{H}P_p$.
For complex-valued signals we find that this is not possible for any choice of
the parameters $\alpha>1,\beta>0$, and for any window. In contrast to the
existing literature on wavelet sign retrieval, our uniqueness results do not
require any bandlimiting constraints or other a priori knowledge on the
real-valued signals to guarantee their unique recovery from the absolute values
of their wavelet coefficients.
Authors' comments: 14 pages, 2 figures
Niall Whiteford, Alistair Glasse, Katy L. Chubb, Daniel Kitzmann, Shrishmoy Ray, Mark W. Phillips, Beth A. Biller, Paul I. Palmer et al.
Retrieval methods are a powerful analysis technique for modelling exoplanetary atmospheres by estimating the bulk physical and chemical properties that combine in a forward model to best-fit an observed spectrum, and they are increasingly being applied to observations of directly-imaged exoplanets. We have adapted TauREx3, the Bayesian retrieval suite, for the analysis of near-infrared spectrophotometry from directly-imaged gas giant exoplanets and brown dwarfs. We demonstrate TauREx3's applicability to sub-stellar atmospheres by presenting results for brown dwarf benchmark GJ 570D which are consistent with previous retrieval studies, whilst also exhibiting systematic biases associated with the presence of alkali lines. We also present results for the cool exoplanet 51 Eri b, the first application of a free chemistry retrieval analysis to this object, using spectroscopic observations from GPI and SPHERE. While our retrieval analysis is able to explain spectroscopic and photometric observations without employing cloud extinction, we conclude this may be a result of employing a flexible temperature-pressure profile which is able to mimic the presence of clouds. We present Bayesian evidence for an ammonia detection with a 2.7$\sigma$ confidence, the first indication of ammonia in an exoplanetary atmosphere. This is consistent with this molecule being present in brown dwarfs of a similar spectral type. We demonstrate the chemical similarities between 51 Eri b and GJ 570D in relation to their retrieved molecular abundances. Finally, we show that overall retrieval conclusions for 51 Eri b can vary when employing different spectral data and modelling components, such as temperature-pressure and cloud structures.
NallappaBhavithran G, Selvakumar R
DNA is a promising storage medium, but its stability and the occurrence of indel
errors pose a significant challenge. The relative occurrence of Guanine (G) and
Cytosine (C) in DNA is crucial for its longevity, and reverse complementary base
pairs should be avoided to prevent the formation of a secondary structure in
DNA strands. We overcome these challenges by selecting appropriate group
homomorphisms. For storing and retrieving information in DNA strings we use
kernel code and the Varshamov-Tenengolts algorithm. The Varshamov-Tenengolts
algorithm corrects single indel errors. Additionally, we construct codes of any
desired length (n) while calculating its reverse complement distance based on
the value of n.
Authors' comments: 7 pages, 3 figures
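The single-indel correction property mentioned above can be illustrated with the classical binary Varshamov-Tenengolts (VT) code, in which a word x of length n is a codeword when the weighted sum of i * x_i (positions counted from 1) is congruent to a modulo n + 1. The sketch below is a simplified binary illustration with brute-force decoding of one deletion; it is not the paper's DNA construction, and the function names and the choice a = 0 are assumptions.

```python
from itertools import product

def vt_syndrome(bits):
    """Weighted checksum sum(i * x_i) mod (n + 1), positions counted from 1."""
    n = len(bits)
    return sum(i * b for i, b in enumerate(bits, start=1)) % (n + 1)

def vt_codewords(n, a=0):
    """All binary words of length n whose syndrome equals a."""
    return [bits for bits in product((0, 1), repeat=n) if vt_syndrome(bits) == a]

def correct_single_deletion(received, n, a=0):
    """Brute-force decoder: re-insert one bit at every position and keep
    the candidates that satisfy the VT checksum."""
    candidates = set()
    for pos in range(n):
        for bit in (0, 1):
            candidate = received[:pos] + (bit,) + received[pos:]
            if vt_syndrome(candidate) == a:
                candidates.add(candidate)
    return candidates

codeword = (1, 0, 0, 0, 0, 1)               # syndrome (1 + 6) mod 7 == 0
received = codeword[:2] + codeword[3:]      # one symbol deleted in transit
recovered = correct_single_deletion(received, n=6)
```

Because a VT code corrects any single deletion, the candidate set contains exactly the transmitted codeword; practical decoders reach the same answer in linear time without enumeration.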
Minsik Oh, Joosung Lee, Jiwei Li, Guoyin Wang
Identifying relevant persona or knowledge for conversational systems is
critical to grounded dialogue response generation. However, each grounding has
been mostly researched in isolation with more practical multi-context dialogue
tasks introduced in recent works. We define Persona and Knowledge Dual Context
Identification as the task to identify persona and knowledge jointly for a
given dialogue, which could be of elevated importance in complex multi-context
dialogue settings. We develop a novel grounding retrieval method that utilizes
all contexts of dialogue simultaneously. Our method requires less computational
power via utilizing neural QA retrieval models. We further introduce our novel
null-positive rank test which measures ranking performance on semantically
dissimilar samples (i.e. hard negatives) in relation to data augmentation.
Authors' comments: Accepted to EMNLP 2023 main conference (Oral). Code available at
https://github.com/minsik-ai/PK-ICR
Xu Wang, Dezhong Peng, Ming Yan, Peng Hu
Cross-domain image retrieval aims at retrieving images across different
domains to excavate cross-domain classificatory or correspondence
relationships. This paper studies a less-touched problem of cross-domain image
retrieval, i.e., unsupervised cross-domain image retrieval, considering the
following practical assumptions: (i) no correspondence relationship, and (ii)
no category annotations. It is challenging to align and bridge distinct domains
without cross-domain correspondence. To tackle the challenge, we present a
novel Correspondence-free Domain Alignment (CoDA) method to effectively
eliminate the cross-domain gap through In-domain Self-matching Supervision
(ISS) and Cross-domain Classifier Alignment (CCA). To be specific, ISS is
presented to encapsulate discriminative information into the latent common
space by elaborating a novel self-matching supervision mechanism. To alleviate
the cross-domain discrepancy, CCA is proposed to align distinct domain-specific
classifiers. Thanks to the ISS and CCA, our method could encode the
discrimination into the domain-invariant embedding space for unsupervised
cross-domain image retrieval. To verify the effectiveness of the proposed
method, extensive experiments are conducted on four benchmark datasets compared
with six state-of-the-art methods.
Authors' comments: AAAI 2023
Dawei Dai, Yutang Li, Liang Wang, Shiyu Fu, Shuyin Xia, Guoyin Wang
In some specific scenarios, face sketches are used to identify a person.
However, drawing a complete face sketch often requires skill and takes time,
which hinders its widespread applicability in practice. In this study, we
propose a new task named sketch less face image retrieval (SLFIR), in which
retrieval is carried out at each stroke and aims to retrieve the target
face photo using a partial sketch with as few strokes as possible (see Fig.1).
First, we develop a method to generate sketch data with the drawing
process and release this dataset. Second, we propose a two-stage method as
the baseline for SLFIR: (1) a triplet network is first adopted to learn the
joint embedding space shared between the complete sketch and its target face
photo; (2) regarding the sketch drawing episode as a sequence, we design an
LSTM module to optimize the representation of the incomplete face sketch.
Experiments indicate that the new framework can complete the retrieval using a
partial or poorly drawn sketch.
Authors' comments: 5 pages, 6 figs
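The first stage above relies on a triplet network, which is typically trained with a triplet loss that pulls a sketch embedding toward its matching photo and pushes it away from a non-matching one. The following is a minimal sketch of that loss on fixed vectors, assuming Euclidean distance and a margin of 0.2; neither choice is stated in the abstract, and the embeddings are toy values.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: a sketch, its matching photo, and a non-matching photo.
sketch = np.array([0.0, 0.0])
matching_photo = np.array([0.0, 1.0])
other_photo = np.array([3.0, 4.0])

loss_ok = triplet_loss(sketch, matching_photo, other_photo)   # already well separated
loss_bad = triplet_loss(sketch, other_photo, matching_photo)  # roles swapped
```

The loss is zero once the matching photo is closer than the non-matching one by at least the margin, so training effort concentrates on triplets that are still confused.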