Jon Eskreis-Winkler, Yubin Kim, Andrew Stanton
In e-commerce, head queries account for the vast majority of gross merchandise sales, and improvements to head queries are highly impactful to the business. While most supervised approaches to search perform better on head queries than on tail queries, we propose a method that further improves head query performance dramatically. We propose XWalk, a random-walk-based graph approach to candidate retrieval for product search that borrows from recommendation system techniques. XWalk is highly efficient for both training and inference in a large-scale, high-traffic e-commerce setting, and shows substantial improvements in head query performance over state-of-the-art neural retrievers. Ensembling XWalk with a neural and/or lexical retriever combines the best of both worlds, and the resulting retrieval system outperforms all other methods in both offline relevance-based evaluation and online A/B tests.
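The abstract does not describe how XWalk traverses its graph, so the following is only a minimal sketch of the general idea behind random-walk candidate retrieval. It assumes a bipartite query-product graph built from hypothetical click pairs; the function names, the walk length, and the visit-count scoring are illustrative assumptions, not the authors' method.

```python
import random
from collections import Counter, defaultdict


def build_graph(clicks):
    """Build adjacency lists for a bipartite query-product graph.

    `clicks` is an iterable of (query, product) pairs (hypothetical data).
    """
    q2p = defaultdict(list)
    p2q = defaultdict(list)
    for query, product in clicks:
        q2p[query].append(product)
        p2q[product].append(query)
    return q2p, p2q


def random_walk_candidates(query, q2p, p2q, num_walks=200, walk_len=4,
                           top_k=3, seed=0):
    """Rank products by how often short random walks from `query` visit them."""
    if query not in q2p:
        return []  # unseen query: the walk has nowhere to start
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        current_query = query
        for _ in range(walk_len):
            product = rng.choice(q2p[current_query])    # query -> product
            visits[product] += 1
            current_query = rng.choice(p2q[product])    # product -> query
    return [product for product, _ in visits.most_common(top_k)]
```

In this sketch a head query has many edges, so its walks concentrate on well-supported products, while a query absent from the graph yields no candidates, which is one reason such a retriever would typically be ensembled with a lexical or neural one.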
F. Lesjak, L. Nortmann, F. Yan, D. Cont, A. Reiners, N. Piskunov, A. Hatzes, L. Boldt-Christmas et al.
Accurately estimating the C/O ratio of hot Jupiter atmospheres is a promising
pathway towards understanding planet formation and migration, as well as the
formation of clouds and the overall atmospheric composition. The atmosphere of
the hot Jupiter WASP-43b has been extensively analysed using low-resolution
observations with HST and Spitzer, but these previous observations did not
cover the K band, which hosts prominent spectral features of major
carbon-bearing species such as CO and CH$_{4}$. As a result, the ability to
establish precise constraints on the C/O ratio was limited. Moreover, the
planet has not been studied at high spectral resolution, which can provide
insights into the atmospheric dynamics.
In this study, we present the first high-resolution dayside spectra of
WASP-43b with the new CRIRES$^+$ spectrograph. By observing the planet in the K
band, we successfully detected the presence of CO and provide evidence for the
existence of H$_2$O using the cross-correlation method. This discovery
represents the first direct detection of CO in the atmosphere of WASP-43b.
Furthermore, we retrieved the temperature-pressure profile, abundances of CO
and H$_2$O, and a super-solar C/O ratio of 0.78 by applying a Bayesian
retrieval framework to the data. Our findings also shed light on the
atmospheric characteristics of WASP-43b. We found no evidence for a cloud deck
on the dayside, and recovered a line broadening indicative of an equatorial
super-rotation corresponding to a jet with a wind speed of $\sim$ 5 km
s$^{-1}$, matching the results of previous forward models and low-resolution
atmospheric retrievals for this planet.
Authors' comments: 15 pages, 14 figures
Abdelrahman Abdallah, Adam Jatowt
Open-domain question answering (QA) tasks usually require the retrieval of relevant information from a large corpus to generate accurate answers. We propose a novel approach called Generator-Retriever-Generator (GRG) that combines document retrieval techniques with a large language model (LLM), by first prompting the model to generate contextual documents based on a given question. In parallel, a dual-encoder network retrieves documents that are relevant to the question from an external corpus. The generated and retrieved documents are then passed to a second LLM, which generates the final answer. By combining document retrieval and LLM generation, our approach addresses the challenges of open-domain QA, such as generating informative and contextually relevant answers. GRG outperforms the state-of-the-art generate-then-read and retrieve-then-read pipelines (GENREAD and RFiD), improving on their performance by at least +5.2, +4.2, and +1.6 on the TriviaQA, NQ, and WebQ datasets, respectively. We provide code, datasets, and checkpoints at https://github.com/abdoelsayed2016/GRG.
Liyuan Ma, Hongxia Wang, Ningyi Leng, Ziyang Yuan
Fourier phase retrieval (FPR) is a challenging task widely used in various applications. It involves recovering an unknown signal from its Fourier phaseless measurements. FPR with few measurements is important for reducing time and hardware costs, but it suffers from serious ill-posedness. Recently, untrained neural networks have offered new approaches by introducing learned priors to alleviate the ill-posedness without requiring any external data. However, they may not be ideal for reconstructing fine details in images and can be computationally expensive. This paper proposes an untrained neural network (NN) embedded algorithm based on the alternating direction method of multipliers (ADMM) framework to solve FPR with few measurements. Specifically, we use a generative network to represent the image to be recovered, which confines the image to the space defined by the network structure. To improve the ability to represent high-frequency information, total variation (TV) regularization is imposed to facilitate the recovery of local structures in the image. Furthermore, to reduce the computational cost mainly caused by the parameter updates of the untrained NN, we develop an accelerated algorithm that adaptively trades off between explicit and implicit regularization. Experimental results indicate that the proposed algorithm outperforms existing untrained NN-based algorithms with fewer computational resources and even performs competitively against trained NN-based algorithms.
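The ill-posedness mentioned in this abstract can be seen in a small example: Fourier magnitudes discard phase, so distinct signals can share identical measurements. The sketch below uses a plain discrete Fourier transform and shows that a circularly shifted signal has the same magnitudes as the original. The helper names are illustrative, and this is not the ADMM algorithm proposed in the paper.

```python
import cmath


def dft_magnitudes(signal):
    """Return |X[k]| for the discrete Fourier transform of `signal`."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def circular_shift(signal, s):
    """Shift a list to the right by `s` positions, wrapping around."""
    s %= len(signal)
    return signal[-s:] + signal[:-s] if s else list(signal)


x = [1.0, 2.0, 0.0, -1.0, 3.0]
y = circular_shift(x, 2)

# A shift multiplies each X[k] by a unit-modulus phase factor,
# so the phaseless measurements cannot tell x and y apart.
same = all(abs(a - b) < 1e-9
           for a, b in zip(dft_magnitudes(x), dft_magnitudes(y)))
```

Priors such as a generative network or total variation regularization are ways of choosing among the many signals consistent with the same magnitudes.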
Ben Morris, Hans Oberschelp, Hamilton Samraj Santhakumar
In the bounded retrieval model, the adversary can leak a certain amount of information from the message sender's computer (e.g., 10 percent of the hard drive). Bellare, Kane and Rogaway give an efficient symmetric encryption scheme in the bounded retrieval model. Their scheme uses a giant key (a key so large that only a fraction of it can be leaked). One property of their scheme is that the encrypted message is larger than the original message. Rogaway asked whether an efficient scheme exists that does not increase the size of the message. In this paper we present such a scheme.
Zuozhuo Dai, Fangtao Shao, Qingkun Su, Zilong Dong, Siyu Zhu
State-of-the-art text-video retrieval (TVR) methods typically utilize CLIP and cosine similarity for efficient retrieval. Meanwhile, cross attention methods, which employ a transformer decoder to compute attention between each text query and all frames in a video, offer a more comprehensive interaction between text and videos. However, these methods lack important fine-grained spatial information as they directly compute attention between text and video-level tokens. To address this issue, we propose CrossTVR, a two-stage text-video retrieval architecture. In the first stage, we leverage existing TVR methods with a cosine similarity network for efficient text/video candidate selection. In the second stage, we propose a novel decoupled video-text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions. Additionally, we employ the frozen CLIP model strategy in fine-grained retrieval, enabling scalability to larger pre-trained vision models like ViT-G, resulting in improved retrieval performance. Experiments on text-video retrieval datasets demonstrate the effectiveness and scalability of our proposed CrossTVR compared to state-of-the-art approaches.
Fan Ni, Xu Zhang, Jianhui Wu, Guan-Nan Dong, Aichun Zhu, Hui Liu, Yue Zhang
Most existing methods for text-based person retrieval focus on text-to-image person retrieval. Nevertheless, due to the lack of dynamic information provided by isolated frames, performance is hampered when the person is obscured in isolated frames or when variable motion details are given in the textual description. In this paper, we propose a new task called Text-to-Video Person Retrieval (TVPR), which aims to effectively overcome the limitations of isolated frames. Since there is no dataset or benchmark that describes person videos with natural language, we construct a large-scale cross-modal person video dataset containing detailed natural language annotations, such as a person's appearance, actions, and interactions with the environment, termed the Text-to-Video Person Re-identification (TVPReid) dataset. To this end, a Text-to-Video Person Retrieval Network (TVPRN) is proposed. Specifically, TVPRN acquires video representations by fusing visual and motion representations of person videos, which can deal with temporal occlusion and the absence of variable motion details in isolated frames. Meanwhile, we employ the pre-trained BERT to obtain caption representations and the relationship between caption and video representations to reveal the most relevant person videos. To evaluate the effectiveness of the proposed TVPRN, extensive experiments have been conducted on the TVPReid dataset. To the best of our knowledge, TVPRN is the first successful attempt to use video for the text-based person retrieval task and has achieved state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be publicly available to benefit future research.
Liang Wang, Nan Yang, Furu Wei
Large language models (LLMs) have demonstrated their ability to learn
in-context, allowing them to perform various tasks based on a few input-output
examples. However, the effectiveness of in-context learning is heavily reliant
on the quality of the selected examples. In this paper, we propose a novel
framework to iteratively train dense retrievers that can identify high-quality
in-context examples for LLMs. Our framework initially trains a reward model
based on LLM feedback to evaluate the quality of candidate examples, followed
by knowledge distillation to train a bi-encoder based dense retriever. Our
experiments on a suite of $30$ tasks demonstrate that our framework
significantly enhances in-context learning performance. Furthermore, we show
the generalization ability of our framework to unseen tasks during training. An
in-depth analysis reveals that our model improves performance by retrieving
examples with similar patterns, and the gains are consistent across LLMs of
varying sizes. The code and data are available at
https://github.com/microsoft/LMOps/tree/main/llm_retriever .
Authors' comments: Accepted by EACL 2024
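The distillation step in the entry above can be illustrated with a small sketch. It assumes that the reward model and the bi-encoder each assign a scalar score to the same list of candidate examples, and that the retriever is trained to match the reward model's softmax distribution under a KL divergence. This is a common distillation setup and only an assumption here; the abstract does not state the exact loss, and the function names are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def kl_distillation_loss(reward_scores, retriever_scores):
    """KL(teacher || student) over one query's candidate examples.

    teacher = softmax of reward-model scores (assumed setup)
    student = softmax of bi-encoder similarity scores
    """
    teacher = softmax(reward_scores)
    student = softmax(retriever_scores)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


# Hypothetical scores for three candidate in-context examples.
agreeing = kl_distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
disagreeing = kl_distillation_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

The loss is zero when the retriever ranks candidates exactly as the reward model does and grows as the two distributions diverge.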
Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang et al.
Generating videos for visual storytelling can be a tedious and complex
process that typically requires either live-action filming or graphics
animation rendering. To bypass these challenges, our key idea is to utilize the
abundance of existing video clips and synthesize a coherent storytelling video
by customizing their appearances. We achieve this by developing a framework
comprised of two functional modules: (i) Motion Structure Retrieval, which
provides video candidates with desired scene or motion context described by
query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates
plot-aligned videos under the guidance of motion structure and text prompts.
For the first module, we leverage an off-the-shelf video retrieval system and
extract video depths as motion structure. For the second module, we propose a
controllable video generation model that offers flexible controls over
structure and characters. The videos are synthesized by following the
structural guidance and appearance instruction. To ensure visual consistency
across clips, we propose an effective concept personalization approach, which
allows the specification of the desired character identities through text
prompts. Extensive experiments demonstrate that our approach exhibits
significant advantages over various existing baselines.
Authors' comments: Github: https://github.com/VideoCrafter/Animate-A-Story Project page:
https://videocrafter.github.io/Animate-A-Story
Enrique Mas-Candela, Antonio Ros-Vila, Jorge Calvo-Zaragoza
In this work, the novel Image Transformation Sequence Retrieval (ITSR) task is presented, in which a model must retrieve the sequence of transformations between two given images that act as source and target, respectively. Given certain characteristics of the challenge, such as the multiplicity of correct sequences or the correlation between consecutive steps of the process, we propose a solution to ITSR using a general model-based Reinforcement Learning method such as Monte Carlo Tree Search (MCTS), which is combined with a deep neural network. Our experiments provide a benchmark in both synthetic and real domains, where the proposed approach is compared with supervised training. The results show that a model trained with MCTS is able to outperform its supervised counterpart in both the simplest and the most complex cases. Our work draws interesting conclusions about the nature of ITSR and its associated challenges.
Abhinav Joshi, Akshat Sharma, Sai Kiran Tanikella, Ashutosh Modi
The task of Prior Case Retrieval (PCR) in the legal domain is about
automatically citing relevant (based on facts and precedence) prior legal cases
in a given query case. To further promote research in PCR, in this paper, we
propose a new large benchmark (in English) for the PCR task: IL-PCR (Indian
Legal Prior Case Retrieval) corpus. Given the complex nature of case relevance
and the long size of legal documents, BM25 remains a strong baseline for
ranking the cited prior documents. In this work, we explore the role of events
in legal case retrieval and propose an unsupervised retrieval method-based
pipeline U-CREAT (Unsupervised Case Retrieval using Events Extraction). We find
that the proposed unsupervised retrieval method significantly increases
performance compared to BM25 and makes retrieval faster by a considerable
margin, making it applicable to real-time case retrieval systems. Our proposed
system is generic: we show that it generalizes across two different legal
systems (Indian and Canadian), and it shows state-of-the-art performance on the
benchmarks for both the legal systems (IL-PCR and COLIEE corpora).
Authors' comments: Accepted at ACL 2023, 15 pages (12 main + 3 Appendix)
Xiaozhong Lyu, Stefan Grafberger, Samantha Biegel, Shaopeng Wei, Meng Cao, Sebastian Schelter, Ce Zhang
Retrieval augmentation enables large language models to take advantage of external knowledge, for example on tasks like question answering and data imputation. However, the performance of such retrieval-augmented models is limited by the data quality of their underlying retrieval corpus. In this paper, we propose an algorithm based on multilinear extension for evaluating the data importance of retrieved data points. There are exponentially many terms in the multilinear extension, and one key contribution of this paper is a polynomial time algorithm that computes exactly, given a retrieval-augmented model with an additive utility function and a validation set, the data importance of data points in the retrieval corpus using the multilinear extension of the model's utility function. We further propose an even more efficient $(\epsilon, \delta)$-approximation algorithm. Our experimental results illustrate that we can enhance the performance of large language models by only pruning or reweighting the retrieval corpus, without requiring further training. For some tasks, this even allows a small model (e.g., GPT-JT), augmented with a search engine API, to outperform GPT-3.5 (without retrieval augmentation). Moreover, we show that weights based on multilinear extension can be computed efficiently in practice (e.g., in less than ten minutes for a corpus with 100 million elements).
Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun Zheng, Deli Zhao, Yongdong Zhang
Video moment retrieval pursues an efficient and generalized solution to
identify the specific temporal segments within an untrimmed video that
correspond to a given language description. To achieve this goal, we provide a
generative diffusion-based framework called MomentDiff, which simulates a
typical human retrieval process from random browsing to gradual localization.
Specifically, we first diffuse the real span to random noise, and learn to
denoise the random noise to the original span with the guidance of similarity
between text and video. This allows the model to learn a mapping from arbitrary
random locations to real moments, enabling the ability to locate segments from
random initialization. Once trained, MomentDiff could sample random temporal
segments as initial guesses and iteratively refine them to generate an accurate
temporal boundary. Different from discriminative works (e.g., based on
learnable proposals or queries), MomentDiff with random initialized spans could
resist the temporal location biases from datasets. To evaluate the influence of
the temporal location biases, we propose two anti-bias datasets with location
distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The
experimental results demonstrate that our efficient framework consistently
outperforms state-of-the-art methods on three public benchmarks, and exhibits
better generalization and robustness on the proposed anti-bias datasets. The
code, model, and anti-bias evaluation datasets are available at
https://github.com/IMCCretrieval/MomentDiff.
Authors' comments: 19 pages, 6 figures
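The "diffuse the real span to random noise" step of the MomentDiff entry above can be sketched with the standard DDPM forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The sketch assumes a span encoded as normalized (center, width) values and a linear beta schedule; both choices, and all names below, are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse_span(span, t, alpha_bars, noise=None, rng=random):
    """Sample a noisy span x_t from a clean span x_0 at step t."""
    if noise is None:
        noise = [rng.gauss(0.0, 1.0) for _ in span]
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
            for x, e in zip(span, noise)]


alpha_bars = alpha_bar_schedule()
clean_span = [0.5, 0.2]  # hypothetical (center, width) as fractions of the video
almost_clean = diffuse_span(clean_span, 0, alpha_bars, noise=[0.0, 0.0])
almost_noise = diffuse_span(clean_span, 999, alpha_bars, noise=[1.0, -1.0])
```

At small t the sample stays close to the annotated span, and at the final step it is nearly pure noise, which is what lets a trained denoiser start from random spans at inference time.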
Shuo Li, Sangdon Park, Insup Lee, Osbert Bastani
When applied to open-domain question answering, large language models (LLMs)
frequently generate incorrect responses based on made-up facts, which are
called $\textit{hallucinations}$. Retrieval augmented generation (RAG) is a
promising strategy to avoid hallucinations, but it does not provide guarantees
on its correctness. To address this challenge, we propose the Trustworthy
Retrieval Augmented Question Answering, or $\textit{TRAQ}$, which provides the
first end-to-end statistical correctness guarantee for RAG. TRAQ uses conformal
prediction, a statistical technique for constructing prediction sets that are
guaranteed to contain the semantically correct response with high probability.
Additionally, TRAQ leverages Bayesian optimization to minimize the size of the
constructed sets. In an extensive experimental evaluation, we demonstrate that
TRAQ provides the desired correctness guarantee while reducing prediction set
size by 16.2% on average compared to an ablation. The implementation is
available at https://github.com/shuoli90/TRAQ.git.
Authors' comments: 23 pages, 17 figures, 2024 Annual Conference of the North American
Chapter of the Association for Computational Linguistics
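The conformal prediction idea in the TRAQ entry above can be illustrated with plain split conformal prediction. Given n calibration nonconformity scores and a target miscoverage level alpha, the threshold is the k-th smallest score with k = ceil((n + 1)(1 - alpha)), and the prediction set keeps every candidate whose score does not exceed it. This is a generic sketch with made-up scores; it does not reproduce TRAQ's retriever and generator sets or its Bayesian optimization step.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold from calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: keep every candidate
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Keep candidates whose nonconformity score is within the threshold."""
    return {c for c, s in candidate_scores.items() if s <= threshold}


# Hypothetical calibration scores (lower means the true answer looked more plausible).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90]
threshold = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical candidate answers for a new question.
answers = prediction_set({"Paris": 0.10, "Lyon": 0.50, "Rome": 0.95}, threshold)
```

Under exchangeability of calibration and test examples, a set built this way contains the correct answer with probability at least 1 - alpha; smaller sets are more useful, which is the quantity the abstract reports reducing.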
Mayank Goel, Venktesh V, Vikram Goyal
Math Word Problems (MWPs) in online assessments help test the ability of the
learner to make critical inferences by interpreting the linguistic information
in them. To test the mathematical reasoning capabilities of the learners,
sometimes the problem is rephrased or the thematic setting of the original MWP
is changed. Since manual identification of MWPs with similar problem models is
cumbersome, we propose a tool in this work for MWP retrieval. We propose a
hybrid approach to retrieve similar MWPs with the same problem model. In our
work, the problem model refers to the sequence of operations to be performed to
arrive at the solution. We demonstrate that our tool is useful for the
mentioned tasks and better than semantic similarity-based approaches, which
fail to capture the arithmetic and logical sequence of the MWPs. A demo of the
tool can be found at https://www.youtube.com/watch?v=gSQWP3chFIs
Authors' comments: Accepted to ECML-PKDD 2023
Brendan King, Jeffrey Flanigan
There has been significant interest in zero and few-shot learning for
dialogue state tracking (DST) due to the high cost of collecting and annotating
task-oriented dialogues. Recent work has demonstrated that in-context learning
requires very little data and zero parameter updates, and even outperforms
trained methods in the few-shot setting (Hu et al. 2022). We propose RefPyDST,
which advances the state of the art with three improvements to in-context
learning for DST. First, we formulate DST as a Python programming task,
explicitly modeling language coreference as variable reference in Python.
Second, since in-context learning depends highly on the context examples, we
propose a method to retrieve a diverse set of relevant examples to improve
performance. Finally, we introduce a novel re-weighting method during decoding
that takes into account probabilities of competing surface forms, and produces
a more accurate dialogue state prediction. We evaluate our approach using
MultiWOZ and achieve state-of-the-art multi-domain joint-goal accuracy in zero
and few-shot settings.
Authors' comments: 14 pages, 2 figures, to appear in Findings of the ACL 2023
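To make the first idea in the RefPyDST entry above concrete, the sketch below shows how a dialogue state might be written as Python, with a coreferring expression ("get there") resolved as a variable reference. The `Hotel` and `Taxi` classes, the slot names, and the dialogue turns are hypothetical illustrations and are not the schema or prompts used in the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hotel:
    name: Optional[str] = None
    area: Optional[str] = None


@dataclass
class Taxi:
    destination: Optional[str] = None
    leave_at: Optional[str] = None


# Turn 1: "I'd like to stay at the Acorn Guest House in the north."
hotel = Hotel(name="acorn guest house", area="north")

# Turn 2: "I also need a taxi to get there, leaving at 17:15."
# "there" corefers with the hotel, so it becomes a variable reference.
taxi = Taxi(destination=hotel.name, leave_at="17:15")


def as_slot_values(**entities):
    """Flatten entities into 'domain-slot' -> value pairs, skipping empty slots."""
    return {
        f"{domain}-{slot}": value
        for domain, obj in entities.items()
        for slot, value in vars(obj).items()
        if value is not None
    }


state = as_slot_values(hotel=hotel, taxi=taxi)
```

Executing the generated program resolves the reference, so the taxi destination ends up holding the hotel's name without the model having to copy the string.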
Timothy Ossowski, Junjie Hu
Recent years have witnessed impressive results of pre-trained vision-language models on knowledge-intensive tasks such as visual question answering (VQA). Despite the recent advances in VQA, existing methods mainly adopt a discriminative formulation that predicts answers within a pre-defined label set, leading to easy overfitting on low-resource domains with limited labeled data (e.g., medicine) and poor generalization under domain shift to another dataset. To tackle this limitation, we propose a novel generative model enhanced by multimodal prompt retrieval (MPR) that integrates retrieved prompts and multimodal features to generate answers in free text. Our generative model enables rapid zero-shot dataset adaptation to unseen data distributions and open-set answer labels across datasets. Our experiments on medical VQA tasks show that MPR outperforms its non-retrieval counterpart by up to 30% accuracy points in a few-shot domain adaptation setting.
Aaron Mueller, Kanika Narang, Lambert Mathias, Qifan Wang, Hamed Firooz
Large language models show impressive results on few-shot NLP tasks. However,
these models are memory and computation-intensive. Meta-training allows one to
leverage smaller models for few-shot generalization in a domain-general and
task-agnostic manner; however, these methods alone result in models that may
not have sufficient parameterization or knowledge to adapt quickly to a large
variety of tasks. To overcome this issue, we propose meta-training with
demonstration retrieval, where we use a dense passage retriever to retrieve
semantically similar labeled demonstrations to each example for more varied
supervision. By separating external knowledge from model parameters, we can use
meta-training to train parameter-efficient models that generalize well on a
larger variety of tasks. We construct a meta-training set from UnifiedQA and
CrossFit, and propose a demonstration bank based on UnifiedQA tasks. To our
knowledge, our work is the first to combine retrieval with meta-training, to
use DPR models to retrieve demonstrations, and to leverage demonstrations from
many tasks simultaneously, rather than randomly sampling demonstrations from
the training set of the target task. Our approach outperforms a variety of
targeted parameter-efficient and retrieval-augmented few-shot methods on QA,
NLI, and text classification tasks (including SQuAD, QNLI, and TREC). Our
approach can be meta-trained and fine-tuned quickly on a single GPU.
Authors' comments: Accepted to Findings of ACL 2023
Avinash Madasu, Vasudev Lal
Video retrieval (VR) involves retrieving the ground truth video from the video database given a text caption, or vice versa. The two important components of compositionality, objects & attributes and actions, are joined using correct syntax to form a proper text query. These components (objects & attributes, actions, and syntax) each play an important role in distinguishing among videos and retrieving the correct ground truth video. However, it is unclear what effect these components have on video retrieval performance. We therefore conduct a systematic study to evaluate the compositional and syntactic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD, and DIDEMO. The study is performed on two categories of video retrieval models: (i) those pre-trained on video-text pairs and fine-tuned on downstream video retrieval datasets (e.g., Frozen-in-Time, Violet, MCQ) and (ii) those that adapt pre-trained image-text representations like CLIP for video retrieval (e.g., CLIP4Clip, XCLIP, CLIP2Video). Our experiments reveal that actions and syntax play a minor role compared to objects & attributes in video understanding. Moreover, video retrieval models that use pre-trained image-text representations (CLIP) have better syntactic and compositional understanding than models pre-trained on video-text data. The code is available at https://github.com/IntelLabs/multimodal_cognitive_ai/tree/main/ICSVR
Tanjida Kabir, Luyao Chen, Muhammad F Walji, Luca Giancardo, Xiaoqian Jiang, Shayan Shams
Learning about diagnostic features and related clinical information from
dental radiographs is important for dental research. However, the lack of
expert-annotated data and convenient search tools poses challenges. Our primary
objective is to design a search tool that uses a user's query for oral-related
research. The proposed framework, Contrastive LAnguage Image REtrieval Search
for dental research, Dental CLAIRES, utilizes periapical radiographs and
associated clinical details, such as periodontal diagnosis and demographic
information, to retrieve the best-matched images based on the text query. We
applied a contrastive representation learning method to find images described
by the user's text by maximizing the similarity score of positive pairs (true
pairs) and minimizing the score of negative pairs (random pairs). Our model
achieved a hit@3 ratio of 96% and a Mean Reciprocal Rank (MRR) of 0.82. We also
designed a graphical user interface that allows researchers to verify the
model's performance with interactions.
Authors' comments: 10 pages, 7 figures, 4 tables
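The two metrics reported in the Dental CLAIRES entry above, hit@3 and Mean Reciprocal Rank, can be computed from the rank at which the correct image appears for each query. The sketch below uses made-up ranks purely for illustration; the numbers are not the paper's data.

```python
def hit_at_k(ranks, k):
    """Fraction of queries whose correct item appears within the top k (ranks start at 1)."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the correct item over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the correct radiograph for four text queries.
ranks = [1, 2, 4, 1]
hit3 = hit_at_k(ranks, 3)
mrr = mean_reciprocal_rank(ranks)
```

Hit@k only asks whether the correct image made the top k, while MRR also rewards placing it higher within the list, so the two are usually reported together.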