Kexun Zhang, Xianjun Yang, William Yang Wang, Lei Li
Diffusion models show promising generation capability for a variety of data. Despite their high generation quality, inference for diffusion models is still time-consuming due to the numerous sampling iterations required. To accelerate inference, we propose ReDi, a simple, learning-free Retrieval-based Diffusion sampling framework. From a precomputed knowledge base, ReDi retrieves a trajectory similar to the partially generated trajectory at an early stage of generation, skips a large portion of intermediate steps, and continues sampling from a later step in the retrieved trajectory. We theoretically prove that the generation performance of ReDi is guaranteed. Our experiments demonstrate that ReDi speeds up model inference by 2x. Furthermore, ReDi generalizes well in zero-shot cross-domain image generation such as image stylization.
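The retrieve-and-skip step described above can be illustrated with a minimal sketch. It assumes a knowledge base stored as a NumPy array of precomputed trajectories and a Euclidean nearest-neighbour lookup; the array layout, the function names, and the step indices are hypothetical and are not taken from the ReDi implementation.

```python
import numpy as np

def build_knowledge_base(trajectories, key_step, value_step):
    """Split precomputed trajectories into (early state, later state) pairs.

    trajectories: array of shape (num_trajectories, num_steps, dim).
    This layout is an assumption made for the sketch.
    """
    keys = trajectories[:, key_step, :]
    values = trajectories[:, value_step, :]
    return keys, values

def retrieve_later_state(partial_state, keys, values):
    """Find the stored trajectory whose early state is nearest to the
    partially generated state and return its later-step state."""
    distances = np.linalg.norm(keys - partial_state, axis=1)
    nearest = int(np.argmin(distances))
    return nearest, values[nearest]

# Toy data standing in for real sampling trajectories.
rng = np.random.default_rng(0)
trajectories = rng.normal(size=(100, 50, 8))
keys, values = build_knowledge_base(trajectories, key_step=5, value_step=30)

# A query close to the early state of trajectory 42.
query = trajectories[42, 5] + 0.01 * rng.normal(size=8)
index, later_state = retrieve_later_state(query, keys, values)
# Sampling would then continue from `later_state`, skipping steps 6 to 29.
```

In this sketch the steps between `key_step` and `value_step` are never evaluated for the query, which is where the speedup would come from.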
Shunyu Zhang, Yaobo Liang, Ming Gong, Daxin Jiang, Nan Duan
Recently, multilingual pre-trained language models (PLMs) such as mBERT and
XLM-R have made impressive strides in cross-lingual dense retrieval.
Despite their successes, they are general-purpose PLMs, while multilingual PLMs
tailored for cross-lingual retrieval remain unexplored. Motivated by an
observation that the sentences in parallel documents are approximately in the
same order, which is universal across languages, we propose to model this
sequential sentence relation to facilitate cross-lingual representation
learning. Specifically, we propose a multilingual PLM called masked sentence
model (MSM), which consists of a sentence encoder to generate the sentence
representations, and a document encoder applied to a sequence of sentence
vectors from a document. The document encoder is shared for all languages to
model the universal sequential sentence relation across languages. To train the
model, we propose a masked sentence prediction task, which masks and predicts
the sentence vector via a hierarchical contrastive loss with sampled negatives.
Comprehensive experiments on four cross-lingual retrieval tasks show MSM
significantly outperforms existing advanced pre-training models, demonstrating
the effectiveness and stronger cross-lingual retrieval capabilities of our
approach. Code and model will be available.
Authors' comments: Published at ICLR 2023
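As a rough illustration of predicting a masked sentence vector against sampled negatives, the sketch below implements a plain InfoNCE-style contrastive loss in NumPy. It is not the hierarchical loss used by MSM; the cosine similarity, the temperature value, and the function name are assumptions made for the example.

```python
import numpy as np

def contrastive_loss(predicted, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: the true sentence vector (positive) competes
    against sampled negative sentence vectors for the predicted vector."""
    candidates = np.vstack([positive[None, :], negatives])  # positive is row 0
    # Cosine similarity between the prediction and every candidate.
    p = predicted / np.linalg.norm(predicted)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = (c @ p) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_positive

positive = np.array([1.0, 0.0, 0.0])
negatives = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

# A prediction that matches the masked sentence gives a small loss,
# a prediction that matches a negative gives a large one.
good = contrastive_loss(positive, positive, negatives)
bad = contrastive_loss(negatives[0], positive, negatives)
```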
Frederik Warburg, Marco Miani, Silas Brack, Soren Hauberg
We propose the first Bayesian encoder for metric learning. Rather than
relying on neural amortization as done in prior works, we learn a distribution
over the network weights with the Laplace Approximation. We actualize this by
first proving that the contrastive loss is a valid log-posterior. We then
propose three methods that ensure a positive definite Hessian. Lastly, we
present a novel decomposition of the Generalized Gauss-Newton approximation.
Empirically, we show that our Laplacian Metric Learner (LAM) estimates
well-calibrated uncertainties, reliably detects out-of-distribution examples,
and yields state-of-the-art predictive performance.
Authors' comments: Code: https://github.com/FrederikWarburg/bayesian-metric-learning
Maciej Wiatrak, Eirini Arvaniti, Angus Brayne, Jonas Vetterle, Aaron Sim
A recent advancement in the domain of biomedical Entity Linking is the development of powerful two-stage algorithms: an initial candidate retrieval stage that generates a shortlist of entities for each mention, followed by a candidate ranking stage. However, the effectiveness of both stages is inextricably dependent on computationally expensive components. Specifically, in candidate retrieval via dense representation retrieval it is important to have hard negative samples, which require repeated forward passes and nearest neighbour searches across the entire entity label set throughout training. In this work, we show that pairing a proxy-based metric learning loss with an adversarial regularizer provides an efficient alternative to hard negative sampling in the candidate retrieval stage. In particular, we show competitive performance on the recall@1 metric, thereby providing the option to leave out the expensive candidate ranking step. Finally, we demonstrate how the model can be used in a zero-shot setting to discover out-of-knowledge-base biomedical entities.
Bolin Zhang, Yunzhe Xu, Zhiying Tu, Dianhui Chu
Using natural language, conversational bots offer unprecedented ways to address many challenges in areas such as information searching, item recommendation, and question answering. Existing bots are usually developed through retrieval-based or generation-based approaches, each of which has its own advantages and disadvantages. To combine these two approaches, we propose a hybrid retrieval-generation network (HeroNet) built on three ideas: 1) To produce high-quality sentence representations, HeroNet performs multi-task learning on two subtasks: Similar Queries Discovery and Query-Response Matching. Specifically, the retrieval performance is improved while the model size is reduced by training two lightweight, task-specific adapter modules that share a single underlying T5-Encoder model. 2) By introducing adversarial training, HeroNet is able to solve both retrieval and generation tasks simultaneously while maximizing the performance of each. 3) The retrieval results are used as prior knowledge to improve the generation performance, while the generated results are scored by the discriminator and their scores are integrated into the generator's cross-entropy loss function. The experimental results on an open dataset demonstrate the effectiveness of HeroNet, and our code is available at https://github.com/TempHero/HeroNet.git
Seungyeon Kim, Ankit Singh Rawat, Manzil Zaheer, Sadeep Jayasumana, Veeranjaneyulu Sadhanala, Wittawat Jitkrittum, Aditya Krishna Menon, Rob Fergus et al.
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the resource-efficient deployment of such models in practice. Inspired by our theoretical analysis of the teacher-student generalization gap for IR models, we propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model. Unlike existing teacher score-based distillation methods, our proposed approach employs embedding matching tasks to provide a stronger signal to align the representations of the teacher and student models. In addition, it utilizes query generation to explore the data manifold to reduce the discrepancies between the student and the teacher where training data is sparse. Furthermore, our analysis also motivates novel asymmetric architectures for student models that realize better embedding alignment without increasing online inference cost. On standard benchmarks like MSMARCO, we show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
Asha Vishwanathan, Rajeev Unnikrishnan Warrier, Gautham Vadakkekara Suresh, Chandra Shekhar Kandpal
Business-specific Frequently Asked Questions (FAQ) retrieval in task-oriented
dialog systems poses unique challenges vis-à-vis community-based FAQs. Each
FAQ question represents an intent which is usually an umbrella term for many
related user queries. We evaluate performance for such Business FAQs both with
standard FAQ retrieval techniques using query-Question (q-Q) similarity and
few-shot intent detection techniques. Implementing a real world solution for
FAQ retrieval in order to support multiple tenants (FAQ sets) entails
optimizing speed, accuracy and cost. We propose a novel approach to scale
multi-tenant FAQ applications in real-world context by contrastive fine-tuning
of the last layer in sentence Bi-Encoders along with tenant-specific weight
switching.
Authors' comments: Accepted at EMNLP 2022
Simon Lupart, Stéphane Clinchant
Neural retrieval models have acquired significant effectiveness gains over
the last few years compared to term-based methods. Nevertheless, those models
may be brittle when faced with typos or distribution shifts, or vulnerable to
malicious attacks. For instance, several recent papers demonstrated that such
variations severely impacted model performance, and then tried to train more
resilient models. Usual approaches include synonym replacement or typo
injection -- as data augmentation -- and the use of more robust tokenizers
(characterBERT, BPE-dropout). To further complement the literature, we
investigate in this paper adversarial training as another possible solution to
this robustness issue. Our comparison includes the two main families of
BERT-based neural retrievers, i.e. dense and sparse, with and without
distillation techniques. We then demonstrate that one of the simplest
adversarial training techniques -- the Fast Gradient Sign Method (FGSM) -- can
improve the robustness and effectiveness of first-stage rankers. In particular,
FGSM increases model performance on both in-domain and out-of-domain
distributions, and also on queries with typos, for multiple neural retrievers.
Authors' comments: Accepted at ECIR 2023
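For readers unfamiliar with FGSM, the following sketch shows the perturbation rule, x_adv = x + epsilon * sign(grad_x loss), on a toy logistic-regression classifier. The model, weights, and input are illustrative assumptions; the paper applies the idea to BERT-based retrievers, which this example does not reproduce.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic-regression model on one example."""
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm_perturb(x, y, w, b, epsilon):
    """Move x by epsilon in the direction of the sign of the loss gradient.

    For logistic regression with cross-entropy, the gradient of the loss
    with respect to the input is (sigmoid(w.x + b) - y) * w.
    """
    grad_x = (sigmoid(x @ w + b) - y) * w
    return x + epsilon * np.sign(grad_x)

# Hypothetical toy model and input.
w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([0.2, -0.4, 1.0])
y = 1.0

x_adv = fgsm_perturb(x, y, w, b, epsilon=0.1)
clean_loss = bce_loss(x, y, w, b)
adv_loss = bce_loss(x_adv, y, w, b)
```

Adversarial training would then include `x_adv` (or the loss computed on it) alongside the clean example in each training step.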
Robert Beinert, Saghar Rezaei
Phase retrieval consists in the recovery of an unknown signal from phaseless measurements of its usually complex-valued Fourier transform. Without further assumptions, this problem is notoriously severely ill-posed, such that the recovery of the true signal is nearly impossible. In certain applications like crystallography, speckle imaging in astronomy, or blind channel estimation in communications, the unknown signal has a specific, sparse structure. In this paper, we exploit this sparse structure to recover the unknown signal uniquely up to inevitable ambiguities such as global phase shifts, transitions, and conjugated reflections. Although we use a constructive proof essentially based on Prony's method, our focus lies on the derivation of a recovery guarantee for multivariate signals using an adaptive sampling scheme. Instead of sampling the entire multivariate Fourier intensity, we only employ Fourier samples along certain adaptively chosen lines. For bivariate signals, an analogous result can be established for samples in generic directions. The number of samples here scales quadratically with the sparsity level of the unknown signal.
Xiaojie Jin, Bowen Zhang, Weibo Gong, Kai Xu, XueQing Deng, Peng Wang, Zhao Zhang, Xiaohui Shen et al.
State-of-the-art video-text retrieval (VTR) methods typically involve fully fine-tuning a pre-trained model (e.g. CLIP) on specific datasets. However, this can result in significant storage costs in practical applications, as a separate model per task must be stored. To address this issue, we present our pioneering work that enables parameter-efficient VTR using a pre-trained model, with only a small number of tunable parameters during training. Towards this goal, we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text. Specifically, MV-Adapter utilizes bottleneck structures in both video and text branches, along with two novel components. The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts. We also train weight calibrations to adjust to dynamic variations across frames. The second is Cross Modality Tying, which generates weights for the video and text branches by sharing cross-modality factors, for better alignment between modalities. Thanks to the above innovations, MV-Adapter can achieve comparable or better performance than standard full fine-tuning with negligible parameter overhead. Notably, MV-Adapter consistently outperforms various competing methods in V2T/T2V tasks with large margins on five widely used VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet).
Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae Lee, Chunyuan Li
Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability. The high generality and usability of these visual models are achieved via a web-scale data collection process to ensure broad concept coverage, followed by expensive pre-training to feed all the knowledge into model weights. Alternatively, we propose REACT, REtrieval-Augmented CusTomization, a framework to acquire the relevant web knowledge to build customized visual models for target domains. We retrieve the most relevant image-text pairs (~3% of CLIP pre-training data) from the web-scale database as external knowledge, and propose to customize the model by only training new modularized blocks while freezing all the original weights. The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero, few, and full-shot settings. Particularly, on the zero-shot classification task, compared with CLIP, it achieves up to 5.4% improvement on ImageNet and 3.7% on the ELEVATER benchmark (20 datasets).
Yu Zhang, Yue Wang, Zhi Tian, Geert Leus, Gong Zhang
This paper proposes a super-resolution harmonic retrieval method for uncorrelated strictly non-circular signals, whose covariance and pseudo-covariance present Toeplitz and Hankel structures, respectively. Accordingly, the augmented covariance matrix constructed by the covariance and pseudo-covariance matrices is not only low rank but also jointly Toeplitz-Hankel structured. To efficiently exploit such a desired structure for high estimation accuracy, we develop a low-rank Toeplitz-Hankel covariance reconstruction (LRTHCR) solution employed over the augmented covariance matrix. Further, we design a fitting error constraint to flexibly implement the LRTHCR algorithm without knowing the noise statistics. In addition, performance analysis is provided for the proposed LRTHCR in practical settings. Simulation results reveal that the LRTHCR outperforms the benchmark methods in terms of lower estimation errors.
Manh-Duy Nguyen, Binh T. Nguyen, Cathal Gurrin
Many models have been proposed for vision and language tasks, especially the image-text retrieval task. All state-of-the-art (SOTA) models in this challenge contain hundreds of millions of parameters. They were also pretrained on large external datasets, which has been proven to improve overall performance substantially. It is not easy to propose a new model with a novel architecture and intensively train it on a massive dataset with many GPUs to surpass the many SOTA models that are already available on the Internet. In this paper, we propose a compact graph-based framework, named HADA, which combines pretrained models to produce a better result rather than building from scratch. First, we create a graph structure in which the nodes are the features extracted from the pretrained models and the edges connect them. The graph structure is employed to capture and fuse the information from the pretrained models. Then a graph neural network is applied to update the connections between the nodes to obtain a representative embedding vector for an image and a text. Finally, we use cosine similarity to match images with their relevant texts and vice versa, which keeps inference time low. Our experiments show that, although HADA contains a tiny number of trainable parameters, it increases baseline performance by more than 3.6% in terms of evaluation metrics on the Flickr30k dataset. Additionally, the proposed model is not trained on any external dataset and requires only one GPU to train due to its small number of parameters. The source code is available at https://github.com/m2man/HADA.
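The final matching step, cosine similarity between image and text embeddings in both directions, can be sketched as follows. The embeddings are made-up toy values and the function names are hypothetical; the sketch covers only the matching step, not the graph-based fusion.

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Entry (i, j) is the cosine similarity of image i and text j."""
    image_norm = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_norm = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return image_norm @ text_norm.T

def retrieve(similarity, top_k=1):
    """For each row (query), return the indices of the top_k most similar columns."""
    return np.argsort(-similarity, axis=1)[:, :top_k]

# Toy embeddings standing in for the fused representations.
image_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text_emb = np.array([[0.1, 2.0], [3.0, 3.1], [5.0, 0.2]])

sim = cosine_similarity_matrix(image_emb, text_emb)
image_to_text = retrieve(sim)      # best text for each image
text_to_image = retrieve(sim.T)    # best image for each text
```

Because matching reduces to one matrix product over precomputed embeddings, retrieval in both directions reuses the same similarity matrix.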
Xindi Wu, KwunFung Lau, Francesco Ferroni, Aljoša Ošep, Deva Ramanan
Self-driving vehicles rely on urban street maps for autonomous navigation. In
this paper, we introduce Pix2Map, a method for inferring urban street map
topology directly from ego-view images, as needed to continually update and
expand existing maps. This is a challenging task, as we need to infer a complex
urban road topology directly from raw image data. The main insight of this
paper is that this problem can be posed as cross-modal retrieval by learning a
joint, cross-modal embedding space for images and existing maps, represented as
discrete graphs that encode the topological layout of the visual surroundings.
We conduct our experimental evaluation using the Argoverse dataset and show
that it is indeed possible to accurately retrieve street maps corresponding to
both seen and unseen roads solely from image data. Moreover, we show that our
retrieved maps can be used to update or expand existing maps and even show
proof-of-concept results for visual localization and image retrieval from
spatial graphs.
Authors' comments: 12 pages, 8 figures
Seonguk Seo, Mustafa Gokhan Uzunbas, Bohyung Han, Sara Cao, Ser-Nam Lim
Backfilling is the process of re-extracting all gallery embeddings from upgraded models in image retrieval systems. It inevitably incurs a prohibitively large computational cost and even entails service downtime. Although backward-compatible learning sidesteps this challenge by tackling query-side representations, this leads to suboptimal solutions in principle because gallery embeddings cannot benefit from model upgrades. We address this dilemma by introducing an online backfilling algorithm, which enables us to achieve a progressive performance improvement during the backfilling process while not sacrificing the final performance of the new model after the completion of backfilling. To this end, we first propose a simple distance rank merge technique for online backfilling. Then, we incorporate a reverse transformation module for more effective and efficient merging, which is further enhanced by adopting a metric-compatible contrastive learning approach. These two components help to make the distances of old and new models compatible, resulting in desirable merge results during backfilling with no extra computational overhead. Extensive experiments show the effectiveness of our framework on four standard benchmarks in various settings.
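One plausible reading of distance rank merge is sketched below: while backfilling is in progress, part of the gallery still holds old-model embeddings and part already holds new-model embeddings, so the query is embedded with both models, distances are computed within each embedding space, and the two result lists are merged by sorting on distance. The function name, the Euclidean distance, and the toy data are assumptions and may differ from the paper's formulation.

```python
import numpy as np

def distance_rank_merge(query_old, query_new, gallery_old, gallery_new,
                        ids_old, ids_new):
    """Rank all gallery items by distance, comparing each item to the query
    embedding produced by the same model that produced the item."""
    dist_old = np.linalg.norm(gallery_old - query_old, axis=1)
    dist_new = np.linalg.norm(gallery_new - query_new, axis=1)
    distances = np.concatenate([dist_old, dist_new])
    ids = np.concatenate([ids_old, ids_new])
    order = np.argsort(distances, kind="stable")  # ties keep old-before-new order
    return ids[order], distances[order]

# Toy state midway through backfilling: items 10 and 11 are not yet
# re-extracted, items 20 and 21 already use the new model.
query_old = np.array([0.0, 0.0])
query_new = np.array([1.0, 1.0])
gallery_old = np.array([[3.0, 4.0], [1.0, 0.0]])
gallery_new = np.array([[1.0, 3.0], [4.0, 5.0]])

ranked_ids, ranked_distances = distance_rank_merge(
    query_old, query_new, gallery_old, gallery_new,
    ids_old=np.array([10, 11]), ids_new=np.array([20, 21]),
)
```

The merge is only meaningful if distances from the two models are on comparable scales, which is the compatibility problem the abstract's later components address.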
Chuhao Jin, Hongteng Xu, Ruihua Song, Zhiwu Lu
Poster generation is a significant task for a wide range of applications,
which is often time-consuming and requires lots of manual editing and artistic
experience. In this paper, we propose a novel data-driven framework, called
Text2Poster, to automatically generate visually-effective posters from
textual information. Imitating the process of manual poster editing, our
framework leverages a large-scale pretrained visual-textual model to retrieve
background images from given texts, lays out the texts on the images
iteratively by cascaded auto-encoders, and finally, stylizes the texts by a
matching-based method. We learn the modules of the framework by weakly- and
self-supervised learning strategies, mitigating the demand for labeled data.
Both objective and subjective experiments demonstrate that our Text2Poster
outperforms state-of-the-art methods, including academic research and
commercial software, on the quality of generated posters.
Authors' comments: 5 pages, Accepted to ICASSP 2022
Hangfeng He, Hongming Zhang, Dan Roth
Despite the success of large language models (LLMs) in various natural language processing (NLP) tasks, the stored knowledge in these models may inevitably be incomplete, out-of-date, or incorrect. This motivates the need to utilize external knowledge to assist LLMs. Unfortunately, current methods for incorporating external knowledge often require additional training or fine-tuning, which can be costly and may not be feasible for LLMs. To address this issue, we propose a novel post-processing approach, rethinking with retrieval (RR), which retrieves relevant external knowledge based on the decomposed reasoning steps obtained from the chain-of-thought (CoT) prompting. This lightweight approach does not require additional training or fine-tuning and is not limited by the input length of LLMs. We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce more faithful explanations and improve the performance of LLMs.
Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
Most existing text-video retrieval methods focus on cross-modal matching
between the visual content of videos and textual query sentences. However, in
real-world scenarios, online videos are often accompanied by relevant text
information such as titles, tags, and even subtitles, which can be utilized to
match textual queries. This insight has motivated us to propose a novel
approach to text-video retrieval, where we directly generate associated
captions from videos using zero-shot video captioning with knowledge from
web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated
captions, a natural question arises: what benefits do they bring to text-video
retrieval? To answer this, we introduce Cap4Video, a new framework that
leverages captions in three ways: i) Input data: video-caption pairs can
augment the training data. ii) Intermediate feature interaction: we perform
cross-modal feature interaction between the video and caption to produce
enhanced video representations. iii) Output score: the Query-Caption matching
branch can complement the original Query-Video matching branch for text-video
retrieval. We conduct comprehensive ablation studies to demonstrate the
effectiveness of our approach. Without any post-processing, Cap4Video achieves
state-of-the-art performance on four standard text-video retrieval benchmarks:
MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is
available at https://github.com/whwu95/Cap4Video .
Authors' comments: Accepted by CVPR 2023. Selected as a Highlight (Top 2.5% of ALL
submissions)
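The output-score idea in iii) can be illustrated with a small sketch in which the Query-Caption similarity is combined with the Query-Video similarity. The weighted sum, the weight value, and the toy scores are assumptions for illustration; the abstract does not state how Cap4Video fuses the two branches.

```python
import numpy as np

def fuse_scores(query_video_sim, query_caption_sim, caption_weight=0.5):
    """Combine the two matching branches with a weighted sum (assumed fusion rule)."""
    return ((1.0 - caption_weight) * query_video_sim
            + caption_weight * query_caption_sim)

# Hypothetical similarities of one text query against three candidate videos.
query_video_sim = np.array([0.50, 0.52, 0.10])
query_caption_sim = np.array([0.90, 0.30, 0.20])

video_only_best = int(np.argmax(query_video_sim))
fused = fuse_scores(query_video_sim, query_caption_sim)
fused_best = int(np.argmax(fused))
```

In this toy case the caption branch changes the top-ranked video, which is the kind of complementary signal the abstract describes.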
Kanishak Vaidya, B Sundar Rajan
We consider the problem of cache-aided multi-user private information
retrieval (MuPIR). In this problem, $N$ independent files are replicated across
$S \geq 2$ non-colluding servers. There are $K$ users, each equipped with cache
memory which can store $M$ files. Each user wants to retrieve a file from the
servers, but the users don't want any of the servers to get any information
about their demand. The user caches are filled with some arbitrary function of
the files before the users decide their demands, known as the placement phase.
After deciding their demands, users cooperatively send queries to the servers
to retrieve their desired files privately. Upon receiving the queries, servers
broadcast coded transmissions which are a function of the queries they received
and the files, known as the delivery phase. Conveying queries to the servers
incurs an upload cost for the users, and downloading the answers broadcast by
the servers incurs a download cost. To implement cache-aided MuPIR schemes,
each file has to be split into $F$ packets. In this paper, we propose MuPIR
schemes that utilize placement delivery arrays (PDAs) to characterize placement
and delivery. Proposed MuPIR schemes significantly reduce subpacketization
levels while slightly increasing the download cost. The proposed scheme also
substantially reduces the upload cost for the users. For PDAs based on the
Ali-Niesen scheme for centralized coded caching, we show that our scheme is
order optimal in terms of download cost. We recover the optimal single-user PIR
scheme presented by Tian et al. as a special case. Our scheme also
achieves optimal rate for single-user cache-aided PIR setup reported by R.
Tondon.
Authors' comments: 30 pages, 7 figures and 5 tables
Yali Du, Yinwei Wei, Wei Ji, Fan Liu, Xin Luo, Liqiang Nie
The booming development and huge market of micro-videos bring new e-commerce
channels for merchants. Currently, more micro-video publishers prefer to embed
relevant ads into their micro-videos, which not only provides them with
business income but also helps audiences discover products of interest.
However, because micro-videos are recorded with unprofessional equipment,
involve various topics, and include multiple modalities, it is challenging
to locate the products related to micro-videos efficiently, appropriately, and
accurately. We formulate the microvideo-product retrieval task, which is the
first attempt to explore retrieval between multi-modal and multi-modal
instances.
A novel approach named Multi-Queue Momentum Contrast (MQMC) network is
proposed for bidirectional retrieval, consisting of the uni-modal feature and
multi-modal instance representation learning. Moreover, a discriminative
selection strategy with a multi-queue is used to distinguish the importance of
different negatives based on their categories. We collect two large-scale
microvideo-product datasets (MVS and MVS-large) for evaluation and manually
construct the hierarchical category ontology, which covers sundry products in
daily life. Extensive experiments show that MQMC outperforms the
state-of-the-art baselines. Our replication package (including code, dataset,
etc.) is publicly available at https://github.com/duyali2000/MQMC.
Authors' comments: Proceedings of the Sixteenth ACM International Conference on Web
Search and Data Mining (WSDM '23), February 27-March 3, 2023, Singapore,
Singapore