Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oğuz, Yashar Mehdad, Mohit Bansal
There is growing interest in searching for information from large video
corpora. Prior works have studied relevant tasks, such as text-based video
retrieval, moment retrieval, video summarization, and video captioning in
isolation, without an end-to-end setup that can jointly search from video
corpora and generate summaries. Such an end-to-end setup would allow for many
interesting applications, e.g., a text-based search that finds a relevant video
from a video corpus, extracts the most relevant moment from that video, and
segments the moment into important steps with captions. To address this, we
present the HiREST (HIerarchical REtrieval and STep-captioning) dataset and
propose a new benchmark that covers hierarchical information retrieval and
visual/textual stepwise summarization from an instructional video corpus.
HiREST consists of 3.4K text-video pairs from an instructional video dataset,
where 1.1K videos have annotations of moment spans relevant to the text query and
a breakdown of each moment into key instruction steps with captions and timestamps
(totaling 8.6K step captions). Our hierarchical benchmark consists of video
retrieval, moment retrieval, and two novel moment segmentation and step
captioning tasks. In moment segmentation, models break down a video moment into
instruction steps and identify start-end boundaries. In step captioning, models
generate a textual summary for each step. We also present starting point
task-specific and end-to-end joint baseline models for our new benchmark. While
the baseline models show some promising results, there is still large room
for future improvement by the community. Project website:
https://hirest-cvpr2023.github.io
Authors' comments: CVPR 2023 (15 pages; the first two authors contributed equally;
Project website: https://hirest-cvpr2023.github.io)
Luca Zancato, Alessandro Achille, Tian Yu Liu, Matthew Trager, Pramuditha Perera, Stefano Soatto
We introduce Train/Test-Time Adaptation with Retrieval (${\rm T^3AR}$), a method to adapt models both at train and test time by means of a retrieval module and a searchable pool of external samples. Before inference, ${\rm T^3AR}$ adapts a given model to the downstream task using refined pseudo-labels and a self-supervised contrastive objective function whose noise distribution leverages retrieved real samples to improve feature adaptation on the target data manifold. The retrieval of real images is key to ${\rm T^3AR}$ since it does not rely solely on synthetic data augmentations to compensate for the lack of adaptation data, as typically done by other adaptation algorithms. Furthermore, thanks to the retrieval module, our method gives the user or service provider the possibility to improve model adaptation on the downstream task by incorporating further relevant data or to fully remove samples that may no longer be available due to changes in user preference after deployment. First, we show that ${\rm T^3AR}$ can be used at training time to improve downstream fine-grained classification over standard fine-tuning baselines, and the fewer the adaptation data the higher the relative improvement (up to 13%). Second, we apply ${\rm T^3AR}$ for test-time adaptation and show that exploiting a pool of external images at test-time leads to more robust representations over existing methods on DomainNet-126 and VISDA-C, especially when few adaptation data are available (up to 8%).
Alexander Black, Simon Jenni, Tu Bui, Md. Mehrab Tanjim, Stefano Petrangeli, Ritwik Sinha, Viswanathan Swaminathan, John Collomosse
We propose VADER, a spatio-temporal matching, alignment, and change summarization method to help fight misinformation spread via manipulated videos. VADER matches and coarsely aligns partial video fragments to candidate videos using a robust visual descriptor and scalable search over adaptively chunked video content. A transformer-based alignment module then refines the temporal localization of the query fragment within the matched video. A space-time comparator module identifies regions of manipulation between aligned content, invariant to any changes due to any residual temporal misalignments or artifacts arising from non-editorial changes of the content. Robustly matching video to a trusted source enables conclusions to be drawn on video provenance, enabling informed trust decisions on content encountered.
Thong Nguyen, Sean MacAvaney, Andrew Yates
Learned sparse retrieval (LSR) is a family of first-stage retrieval methods that are trained to generate sparse lexical representations of queries and documents for use with an inverted index. Many LSR methods have been recently introduced, with Splade models achieving state-of-the-art performance on MSMarco. Despite similarities in their model architectures, many LSR methods show substantial differences in effectiveness and efficiency. Differences in the experimental setups and configurations used make it difficult to compare the methods and derive insights. In this work, we analyze existing LSR methods and identify key components to establish an LSR framework that unifies all LSR methods under the same perspective. We then reproduce all prominent methods using a common codebase and re-train them in the same environment, which allows us to quantify how components of the framework affect effectiveness and efficiency. We find that (1) including document term weighting is most important for a method's effectiveness, (2) including query weighting has a small positive impact, and (3) document expansion and query expansion have a cancellation effect. As a result, we show how removing query expansion from a state-of-the-art model can reduce latency significantly while maintaining effectiveness on MSMarco and TripClick benchmarks. Our code is publicly available at https://github.com/thongnt99/learned-sparse-retrieval
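As an illustrative sketch only (not taken from the authors' codebase), the scoring step that LSR methods share can be written as a dot product between sparse term-weight vectors served from an inverted index. The term weights, document ids, and function names below are hypothetical; in practice a trained encoder would produce the weights.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, term_weights in docs.items():
        for term, weight in term_weights.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def search(index, query_weights, top_k=10):
    """Score documents by the dot product of query and document term weights."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical sparse representations (term -> weight), made up for illustration.
docs = {
    "d1": {"sparse": 1.2, "retrieval": 0.8, "index": 0.5},
    "d2": {"dense": 1.0, "retrieval": 0.9},
}
index = build_inverted_index(docs)

# A query without expansion: only terms from the query text carry weight.
results = search(index, {"sparse": 1.0, "retrieval": 0.5})
```

Because only query terms with non-zero weight touch the index, a query without expansion visits fewer posting lists, which is one plausible reading of why dropping query expansion can reduce latency.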
Xinnian Liang, Shuangzhi Wu, Hui Huang, Jiaqi Bai, Chao Bian, Zhoujun Li
Retrieval augmented methods have shown promising results in various
classification tasks. However, existing methods focus on retrieving extra
context to enrich the input, which is noise sensitive and non-expandable. In
this paper, following this line, we propose a $k$-nearest-neighbor (KNN)-based
method for retrieval augmented classifications, which interpolates the
predicted label distribution with retrieved instances' label distributions.
Different from the standard KNN process, we propose a decoupling mechanism as
we find that shared representation for classification and retrieval hurts
performance and leads to training instability. We evaluate our method on a wide
range of classification datasets. Experimental results demonstrate the
effectiveness and robustness of our proposed method. We also conduct extra
experiments to analyze the contributions of different components in our
model. Code: https://github.com/xnliang98/knn-cls-w-decoupling
Authors' comments: preprint
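The interpolation idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance-based weighting, the mixing coefficient `lam`, and the function name are all hypothetical, and `query_vec` is assumed to come from a retrieval representation kept separate from the one the classifier uses.

```python
import numpy as np


def knn_interpolate(classifier_probs, query_vec, store_vecs, store_labels,
                    num_classes, k=2, lam=0.5, temperature=1.0):
    """Mix a classifier's label distribution with one built from k nearest neighbors."""
    # Euclidean distance from the query to every stored instance.
    dists = np.linalg.norm(store_vecs - query_vec, axis=1)
    nearest = np.argsort(dists)[:k]

    # Closer neighbors get larger weights (assumed softmax over negative distance).
    weights = np.exp(-dists[nearest] / temperature)
    weights /= weights.sum()

    # Accumulate neighbor weights per label to form the KNN label distribution.
    knn_probs = np.zeros(num_classes)
    np.add.at(knn_probs, store_labels[nearest], weights)

    return lam * knn_probs + (1.0 - lam) * classifier_probs


store_vecs = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
store_labels = np.array([0, 0, 1, 1])
classifier_probs = np.array([0.4, 0.6])  # the classifier alone prefers class 1

mixed = knn_interpolate(classifier_probs, np.array([0.0, 0.2]),
                        store_vecs, store_labels, num_classes=2)
```

In this toy case both nearest neighbors carry label 0, so the interpolated distribution flips the prediction to class 0.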
Ryan J. MacDonald, Natasha E. Batalha
Exoplanet atmospheric retrieval is a computational technique widely used to
infer properties of planetary atmospheres from remote spectroscopic
observations. Retrieval codes typically employ Bayesian sampling algorithms or
machine learning approaches to explore the range of atmospheric properties
(e.g., chemical composition, temperature structure, aerosols) compatible with
an observed spectrum. However, despite the wide adoption of exoplanet retrieval
techniques, there is currently no systematic summary of exoplanet retrieval
codes in the literature. Here, we provide a catalogue of the atmospheric
retrieval codes published to date, alongside links to their respective code
repositories where available. Our catalogue will be continuously updated via a
Zenodo archive.
Authors' comments: 5 pages, 1 giant Table. Published in RNAAS. Live catalogue will be
updated at https://doi.org/10.5281/zenodo.7675743
Yidan Zhang, Ting Zhang, Dong Chen, Yujing Wang, Qi Chen, Xing Xie, Hao Sun, Weiwei Deng et al.
While generative modeling has become prevalent across numerous research
fields, its integration into the realm of image retrieval remains largely
unexplored and underjustified. In this paper, we present a novel methodology,
reframing image retrieval as a variant of generative modeling and employing a
sequence-to-sequence model. This approach is harmoniously aligned with the
current trend towards unification in research, presenting a cohesive framework
that allows for end-to-end differentiable searching. This, in turn, facilitates
superior performance via direct optimization techniques. The development of our
model, dubbed IRGen, addresses the critical technical challenge of converting
an image into a concise sequence of semantic units, which is pivotal for
enabling efficient and effective search. Extensive experiments demonstrate that
our model achieves state-of-the-art performance on three widely-used image
retrieval benchmarks as well as two million-scale datasets, yielding
significant improvement compared to prior competitive retrieval methods. In
addition, the notable surge in precision scores facilitated by generative
modeling presents the potential to bypass the reranking phase, which is
traditionally indispensable in practical retrieval workflows.
Authors' comments: Accepted by ECCV 2024
Abhra Chaudhuri, Ayan Kumar Bhunia, Yi-Zhe Song, Anjan Dutta
Rising concerns about privacy and anonymity preservation of deep learning
models have facilitated research in data-free learning (DFL). For the first
time, we identify that for data-scarce tasks like Sketch-Based Image Retrieval
(SBIR), where the difficulty in acquiring paired photos and hand-drawn sketches
limits data-dependent cross-modal learning algorithms, DFL can prove to be a
much more practical paradigm. We thus propose Data-Free (DF)-SBIR, where,
unlike existing DFL problems, pre-trained, single-modality classification
models have to be leveraged to learn a cross-modal metric-space for retrieval
without access to any training data. The widespread availability of pre-trained
classification models, along with the difficulty in acquiring paired
photo-sketch datasets for SBIR justify the practicality of this setting. We
present a methodology for DF-SBIR, which can leverage knowledge from models
independently trained to perform classification on photos and sketches. We
evaluate our model on the Sketchy, TU-Berlin, and QuickDraw benchmarks,
designing a variety of baselines based on state-of-the-art DFL literature, and
observe that our method surpasses all of them by significant margins. Our
method also achieves mAPs competitive with data-dependent approaches, all the
while requiring no training data. Implementation is available at
https://github.com/abhrac/data-free-sbir.
Authors' comments: Computer Vision and Pattern Recognition (CVPR) 2023
Tongwen Huang, Xihua Li, Chao Yi, Xuemin Zhao, Yunbo Cao
When students make a mistake in an exercise, they can consolidate the
underlying knowledge by practicing ``similar exercises'' that share the same
concepts, purposes and methods. For a given subject and study stage, the
exercise bank typically contains millions or even tens of millions of
exercises, so finding similar exercises for a given exercise becomes a crucial
technical problem. One option is to assign a variety of explicit labels to each
exercise and then query through the labels, but label annotation is
time-consuming, laborious and costly, with limited precision and granularity,
so it is not feasible. In practice, we define ``similar exercises'' as a
retrieval process of finding a set of similar exercises based on recall,
ranking and re-rank procedures, called the FSE (Finding Similar Exercises)
problem. Furthermore,
comprehensive representation of the semantic information of exercises was
obtained through representation learning. In addition to the reasonable
architecture, we also explore what kind of tasks are more conducive to the
learning of exercise semantic information from pre-training and supervised
learning. It is difficult to annotate similar exercises and the annotation
consistency among experts is low. Therefore, this paper also provides solutions
to the problem of low-quality annotated data. Compared with other
methods, our approach has clear advantages in both architectural soundness and
algorithmic precision, and it now serves the daily teaching of hundreds of
schools.
Authors' comments: 37th Conference on AAAI 2023 Artificial Intelligence for
Education(AI4Edu)
Feng He, Qi Wang, Zhifan Feng, Wenbin Jiang, Yajuan Lv, Yong Zhu, Xiao Tan
Video retrieval is becoming increasingly important owing to the rapid
emergence of videos on the Internet. The dominant paradigm for video retrieval
learns video-text representations by pushing the similarity of positive pairs
above that of negative pairs by a fixed margin. However, the negative pairs
used for training are sampled randomly, so the two items in a negative pair may
be semantically related or even equivalent, yet most methods still force their
representations apart to decrease their similarity. This phenomenon leads to
inaccurate supervision and
poor performance in learning video-text representations.
While most video retrieval methods overlook that phenomenon, we propose an
adaptive margin that changes with the distance between positive and negative
pairs to solve the aforementioned issue. First, we design the calculation framework
of the adaptive margin, including the method of distance measurement and the
function between the distance and the margin. Then, we explore a novel
implementation called "Cross-Modal Generalized Self-Distillation" (CMGSD),
which can be built on top of most video retrieval models with few
modifications. Notably, CMGSD adds little computational overhead at train time
and none at test time. Experimental results on three
widely used datasets demonstrate that the proposed method can yield
significantly better performance than the corresponding backbone model, and it
outperforms state-of-the-art methods by a large margin.
Authors' comments: Accepted by SIGIR 2021
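To make the contrast between a fixed and an adaptive margin concrete, here is a minimal sketch of a hinge ranking loss whose margin scales with a distance between the positive and the negative item. The linear, clipped mapping from distance to margin and the name `adaptive_margin_loss` are assumptions for illustration; the abstract does not specify the distance measure or the function the authors use, and this is not the CMGSD implementation.

```python
import numpy as np


def adaptive_margin_loss(sim_pos, sim_neg, pair_distance, max_margin=0.2):
    """Hinge ranking loss with a margin that grows with the pair distance.

    sim_pos, sim_neg: model similarities for positive and negative video-text pairs.
    pair_distance: values in [0, 1]; 0 means the negative is semantically
        equivalent to the positive, 1 means it is unrelated (assumed scale).
    """
    # Assumed mapping: margin is linear in the clipped distance.
    margin = max_margin * np.clip(pair_distance, 0.0, 1.0)
    return np.maximum(0.0, margin + sim_neg - sim_pos).mean()


sim_pos = np.array([0.8, 0.8])
sim_neg = np.array([0.7, 0.7])

# First negative is unrelated (distance 1), second is a near-duplicate (distance 0).
adaptive = adaptive_margin_loss(sim_pos, sim_neg, np.array([1.0, 0.0]))

# A fixed margin corresponds to treating every negative as fully unrelated.
fixed = adaptive_margin_loss(sim_pos, sim_neg, np.array([1.0, 1.0]))
```

With the adaptive margin, the near-duplicate negative contributes no loss, whereas the fixed margin still penalizes it.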
Chang Ma, Haiteng Zhao, Lin Zheng, Jiayi Xin, Qintong Li, Lijun Wu, Zhihong Deng, Yang Lu et al.
Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in functions and structures, and current state-of-the-art models including the latest version of AlphaFold rely on Multiple Sequence Alignments (MSA) to feed in the evolutionary knowledge. Despite their success, heavy computational overheads, as well as de novo and orphan proteins, remain great challenges in protein representation learning. In this work, we show that MSA-augmented models inherently belong to retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links query protein sequences to a set of sequences with similar structures or properties in the database and combines these sequences for downstream prediction. We show that protein language models benefit from the retrieval enhancement on both structure prediction and property prediction tasks, with a 5% improvement on MSA Transformer on average while being 373 times faster. In addition, we show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available at https://github.com/HKUNLP/RSA.
Christopher Richardson, Sudipta Kar, Anjishnu Kumar, Anand Ramachandran, Omar Zia Khan, Zeynab Raeesy, Abhinav Sethy
Open domain conversational agents can answer a broad range of targeted
queries. However, the sequential nature of interaction with these systems makes
knowledge exploration a lengthy task which burdens the user with asking a chain
of well-phrased questions. In this paper, we present a retrieval-based system
and associated dataset for predicting the next questions that the user might
have. Such a system can proactively assist users in knowledge exploration
leading to a more engaging dialog. The retrieval system is trained on a dataset
which contains ~14K multi-turn information-seeking conversations with a valid
follow-up question and a set of invalid candidates. The invalid candidates are
generated to simulate various syntactic and semantic confounders such as
paraphrases, partial entity match, irrelevant entity, and ASR errors. We use
confounder specific techniques to simulate these negative examples on the
OR-QuAC dataset and develop a dataset called the Follow-up Query Bank
(FQ-Bank). Then, we train ranking models on FQ-Bank and present results
comparing supervised and unsupervised approaches. The results suggest that we
can retrieve the valid follow-ups by ranking them in higher positions compared
to confounders, but further knowledge grounding can improve ranking
performance.
Authors' comments: EACL 2023
Yukang Gan, Yixiao Ge, Chang Zhou, Shupeng Su, Zhouchuan Xu, Xuyuan Xu, Quanchao Hui, Xiang Chen et al.
Large-scale embedding-based retrieval (EBR) is the cornerstone of search-related industrial applications. Given a user query, an EBR system aims to identify relevant information from a large corpus of documents that may be tens or hundreds of billions in size. Storage and computation turn out to be expensive and inefficient with massive document collections and highly concurrent queries, making it difficult to scale up further. To tackle this challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension. Specifically, we compress the full-precision query and document embeddings, generally formulated as float vectors, into a composition of multiple binary vectors using a lightweight transformation model with residual multilayer perceptron (MLP) blocks. We can therefore tailor the number of bits for different applications to trade off accuracy loss and cost savings. Importantly, we enable task-agnostic efficient training of the binarization model using a new embedding-to-embedding strategy. We also exploit the compatible training of binary embeddings so that the BEBR engine can support indexing among multiple embedding versions within a unified system. To further realize efficient search, we propose Symmetric Distance Calculation (SDC) to achieve lower response time than Hamming codes. We have successfully deployed BEBR in Tencent products, including Sogou, Tencent Video, QQ World, etc. The binarization algorithm can be seamlessly generalized to various tasks with multiple modalities. Extensive experiments on offline benchmarks and online A/B tests demonstrate the efficiency and effectiveness of our method, saving 30% to 50% of index costs with almost no loss of accuracy at the system level.
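The idea of representing a float vector as a composition of several binary vectors can be illustrated with a simple residual scheme. This sketch is an assumption-laden stand-in: it uses a fixed sign function and a mean-absolute-value scale at each level, whereas the abstract describes a learned transformation with residual MLP blocks. The function names are hypothetical.

```python
import numpy as np


def residual_binarize(x, num_levels=2):
    """Approximate x as a sum of scaled {-1, +1} vectors, one per level."""
    codes, scales = [], []
    residual = x.astype(np.float64)
    for _ in range(num_levels):
        scale = np.abs(residual).mean()          # per-level scalar scale
        code = np.where(residual >= 0, 1.0, -1.0)  # binary code for this level
        codes.append(code)
        scales.append(scale)
        residual = residual - scale * code       # next level encodes what is left
    return codes, scales


def reconstruct(codes, scales):
    """Rebuild the float approximation from the binary codes and their scales."""
    return sum(scale * code for code, scale in zip(codes, scales))


x = np.array([0.9, -0.3, 0.1, -0.7])
err_one_level = np.linalg.norm(x - reconstruct(*residual_binarize(x, 1)))
err_two_levels = np.linalg.norm(x - reconstruct(*residual_binarize(x, 2)))
```

Each extra level spends one more bit per dimension and lowers the reconstruction error, which mirrors the accuracy-versus-cost trade-off the abstract describes.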
Jinkuan Zhu, Hao Huang, Qiao Deng, Xiyao Li
Fashion image retrieval aims to find clothing items relevant to a query image in a gallery. Previous approaches focus on designing different distance-based loss functions, pulling relevant pairs close and pushing irrelevant images apart. However, these methods ignore fine-grained features (e.g. neckband, cuff) of clothing images. In this paper, we propose a novel fashion image retrieval method leveraging both global and fine-grained features, dubbed Multi-Granular Alignment (MGA). Specifically, we design a Fine-Granular Aggregator (FGA) to capture and aggregate detailed patterns. Then we propose Attention-based Token Alignment (ATA) to align image features at the multi-granular level in a coarse-to-fine manner. To demonstrate the effectiveness of our proposed method, we conduct experiments on two sub-tasks (In-Shop & Consumer2Shop) of the public fashion dataset DeepFashion. The experimental results show that our MGA outperforms the state-of-the-art methods by 1.8% and 0.6% in the two sub-tasks on the R@1 metric, respectively.
Igor O. Zavadskyi
The variable-length Reverse Multi-Delimiter (RMD) codes are known to
represent sequences of unbounded and unordered integers. When applied to data
compression, they combine a good compression ratio with fast decoding. In this
paper, we investigate another property of RMD-codes: the ability of direct
access to codewords in the encoded bitstream. We present a method that allows us
to extract and decode a codeword from an RMD-bitstream in almost constant time
with a tiny space overhead, and run experiments on its application to
natural language text compression.
Authors' comments: 18 pages, 5 figures, 2 algorithms, 1 table
Yanwen Fang, Yuxi Cai, Jintai Chen, Jingyu Zhao, Guangjian Tian, Guodong Li
More and more evidence has shown that strengthening layer interactions can
enhance the representation power of a deep neural network, while self-attention
excels at learning interdependencies by retrieving query-activated information.
Motivated by this, we devise a cross-layer attention mechanism, called
multi-head recurrent layer attention (MRLA), that sends a query representation
of the current layer to all previous layers to retrieve query-related
information from different levels of receptive fields. A lightweight version
of MRLA is also proposed to reduce the quadratic computation cost. The proposed
layer attention mechanism can enrich the representation power of many
state-of-the-art vision networks, including CNNs and vision transformers. Its
effectiveness has been extensively evaluated in image classification, object
detection and instance segmentation tasks, where improvements can be
consistently observed. For example, our MRLA improves Top-1 accuracy by 1.6% on
ResNet-50, while only introducing 0.16M parameters and 0.07B FLOPs.
Surprisingly, it boosts performance by a large margin of 3-4% box AP
and mask AP in dense prediction tasks. Our code is available at
https://github.com/joyfang1106/MRLA.
Authors' comments: Published as a conference paper at ICLR 2023
Peter Vouras, Kumar Vijay Mishra, Alexandra Artusio-Glimpse
Rydberg-aided atomic electrometry using alkali-metal atoms is gaining increased research interest for detecting external electric fields. However, the inability of Rydberg probes to detect phase is a serious impediment to their realistic deployment. In this paper, we derive a novel phase retrieval algorithm for use in phased array or synthetic aperture applications where only measurements of electric field intensity are possible at each spatial sample. These array configurations exist if a Rydberg atom probe is used in place of an antenna. We employ three-stage alternating projections to solve the resulting optimization problem. Our numerical experiments demonstrate the effectiveness of the proposed algorithm in terms of beamformed array output.
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham
Retrieval-Augmented Language Modeling (RALM) methods, which condition a
language model (LM) on relevant documents from a grounding corpus during
generation, were shown to significantly improve language modeling performance.
In addition, they can mitigate the problem of factually inaccurate text
generation and provide a natural source attribution mechanism. Existing RALM
approaches focus on modifying the LM architecture in order to facilitate the
incorporation of external information, significantly complicating deployment.
This paper considers a simple alternative, which we dub In-Context RALM:
leaving the LM architecture unchanged and prepending grounding documents to the
input, without any further training of the LM. We show that In-Context RALM
that builds on off-the-shelf general purpose retrievers provides surprisingly
large LM gains across model sizes and diverse corpora. We also demonstrate that
the document retrieval and ranking mechanism can be specialized to the RALM
setting to further boost performance. We conclude that In-Context RALM has
considerable potential to increase the prevalence of LM grounding, particularly
in settings where a pretrained LM must be used without modification or even via
API access.
Authors' comments: Accepted for publication in Transactions of the Association for
Computational Linguistics (TACL). pre-MIT Press publication version
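The core recipe described above, prepending grounding documents to the input while leaving the LM untouched, can be sketched in a few lines. This is a toy illustration under stated assumptions: the word-overlap retriever, the two-sentence corpus, and the function names are hypothetical stand-ins, and the resulting string would be passed to whatever unmodified LM or API is in use.

```python
def retrieve(query, corpus, top_k=1):
    """Toy lexical retriever: rank documents by word overlap with the query."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_in_context_input(prefix, corpus, top_k=1):
    """Prepend retrieved documents to the prefix; the LM itself is unchanged."""
    documents = retrieve(prefix, corpus, top_k=top_k)
    return "\n\n".join(documents + [prefix])


# Hypothetical grounding corpus, made up for illustration.
corpus = [
    "The Eiffel Tower is located in Paris.",
    "Mount Everest is the highest mountain on Earth.",
]
prefix = "The Eiffel Tower is located in"
prompt = build_in_context_input(prefix, corpus)
```

Because the only change is to the input text, the same construction applies when the LM is reachable only through an API.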
Sushuang Ma, Yuichi Ito, Ahmed Faris Al-Refaie, Quentin Changeat, Billy Edwards, Giovanna Tinetti
In this paper, we present YunMa, an exoplanet cloud simulation and retrieval
package, which enables the study of cloud microphysics and radiative properties
in exoplanetary atmospheres. YunMa simulates the vertical distribution and
sizes of cloud particles and their corresponding scattering signature in
transit spectra. We validated YunMa against results from the literature. When
coupled to the TauREx 3 platform, an open Bayesian framework for spectral
retrievals, YunMa enables the retrieval of the cloud properties and parameters
from transit spectra of exoplanets. The sedimentation efficiency
($f_{\mathrm{sed}}$), which controls the cloud microphysics, is set as a free
parameter in retrievals. We assess the retrieval performances of YunMa through
28 instances of a K2-18 b-like atmosphere with different fractions of H$_2$/He
and N$_2$, and assuming water clouds. Our results show a substantial
improvement in retrieval performances when using YunMa instead of a simple
opaque cloud model and highlight the need to include cloud radiative transfer
and microphysics to interpret the next-generation data for exoplanet
atmospheres. This work also inspires instrumental development for future
flagships by demonstrating retrieval performances with different data quality.
Authors' comments: 24 pages, 12 figures, accepted in ApJ
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih
We introduce REPLUG, a retrieval-augmented language modeling framework that treats the language model (LM) as a black box and augments it with a tuneable retrieval model. Unlike prior retrieval-augmented LMs that train language models with special cross attention mechanisms to encode the retrieved text, REPLUG simply prepends retrieved documents to the input for the frozen black-box LM. This simple design can be easily applied to any existing retrieval and language models. Furthermore, we show that the LM can be used to supervise the retrieval model, which can then find documents that help the LM make better predictions. Our experiments demonstrate that REPLUG with the tuned retriever significantly improves the performance of GPT-3 (175B) on language modeling by 6.3%, as well as the performance of Codex on five-shot MMLU by 5.1%.