Minjoon Jung, Youwon Jang, Seongho Choi, Joochan Kim, Jin-Hwa Kim, Byoung-Tak Zhang
Video moment retrieval (VMR) identifies a specific moment in an untrimmed
video for a given natural language query. This task is prone to the weak
alignment problem innate in video datasets. Due to the ambiguity, a query does
not fully cover the relevant details of the corresponding moment, or the moment
may contain misaligned and irrelevant frames, potentially limiting further
performance gains. To tackle this problem, we propose a background-aware moment
detection transformer (BM-DETR). Our model adopts a contrastive approach,
carefully utilizing the negative queries matched to other moments in the video.
Specifically, our model learns to predict the target moment from the joint
probability of each frame given the positive query and the complement of
negative queries. This leads to effective use of the surrounding background,
improving moment sensitivity and enhancing overall alignments in videos.
Extensive experiments on four benchmarks demonstrate the effectiveness of our
approach. Our code is available at:
https://github.com/minjoong507/BM-DETR
Authors' comments: Accepted by WACV 2025
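The joint probability described above can be illustrated with a minimal sketch. It assumes the model emits one logit per frame for the positive query and one for a negative query, and that the two events are treated as independent; the function names and the toy logits are hypothetical and are not taken from the BM-DETR code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def joint_frame_probability(pos_logits, neg_logits):
    """Per-frame P(matches positive query) * (1 - P(matches negative query)).

    Assumption: independence between the two events, so the joint
    probability is a simple product.
    """
    if len(pos_logits) != len(neg_logits):
        raise ValueError("both inputs need one logit per frame")
    return [
        sigmoid(p) * (1.0 - sigmoid(n))
        for p, n in zip(pos_logits, neg_logits)
    ]


# Toy example with three frames (hypothetical values).
# Frame 0 matches only the positive query, frame 1 matches both queries,
# frame 2 matches neither.
pos_logits = [2.0, 2.0, -1.0]
neg_logits = [-2.0, 2.0, -2.0]
probs = joint_frame_probability(pos_logits, neg_logits)
```

In this toy case frame 0 receives the highest joint probability, while frame 1 is suppressed because it also matches the negative query.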
Xu Zhang, Zhedong Zheng, Xiaohan Wang, Yi Yang
Language-guided image retrieval enables users to search for images and
interact with the retrieval system more naturally and expressively by using a
reference image and a relative caption as a query. Most existing studies mainly
focus on designing image-text composition architecture to extract
discriminative visual-linguistic relations. Despite great success, we identify
an inherent problem that obstructs the extraction of discriminative features
and considerably compromises model training: triplet ambiguity. This
problem stems from the annotation process wherein annotators view only one
triplet at a time. As a result, they often describe simple attributes, such as
color, while neglecting fine-grained details like location and style. This
leads to multiple false-negative candidates matching the same modification
text. We propose a novel Consensus Network (Css-Net) that self-adaptively
learns from noisy triplets to minimize the negative effects of triplet
ambiguity. Inspired by the psychological finding that groups perform better
than individuals, Css-Net comprises 1) a consensus module featuring four
distinct compositors that generate diverse fused image-text embeddings and 2) a
Kullback-Leibler divergence loss, which fosters learning among the compositors,
enabling them to reduce biases learned from noisy triplets and reach a
consensus. The decisions from four compositors are weighted during evaluation
to further achieve consensus. Comprehensive experiments on three datasets
demonstrate that Css-Net can alleviate triplet ambiguity, achieving competitive
performance on benchmarks, such as $+2.77\%$ R@10 and $+6.67\%$ R@50 on
FashionIQ.
Authors' comments: 11 pages
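As a rough illustration of the consensus idea, the sketch below computes an average pairwise Kullback-Leibler divergence between the candidate distributions of several compositors, and a weighted sum of their scores at evaluation time. The exact loss form, the weights, and all names here are assumptions for illustration, not the Css-Net implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consensus_kl_loss(all_logits):
    """Average KL divergence over all ordered pairs of compositors.

    all_logits[c][k] is the similarity logit that compositor c assigns
    to candidate image k (assumed layout).
    """
    dists = [softmax(logits) for logits in all_logits]
    pairs = [(i, j) for i in range(len(dists)) for j in range(len(dists)) if i != j]
    return sum(kl_divergence(dists[i], dists[j]) for i, j in pairs) / len(pairs)


def weighted_scores(all_logits, weights):
    """Combine the compositors' scores per candidate with fixed weights."""
    n_candidates = len(all_logits[0])
    return [
        sum(w * logits[k] for w, logits in zip(weights, all_logits))
        for k in range(n_candidates)
    ]


# Four hypothetical compositors scoring three candidates.
logits = [
    [2.0, 1.0, 0.1],
    [1.8, 1.2, 0.0],
    [2.2, 0.9, 0.3],
    [0.5, 2.0, 0.1],  # this compositor disagrees with the other three
]
loss = consensus_kl_loss(logits)
final = weighted_scores(logits, [0.25, 0.25, 0.25, 0.25])
```

The loss is zero when all compositors agree and grows as their distributions diverge, which is the pressure toward consensus that the abstract describes.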
Jintao Rong, Hao Chen, Linlin Ou, Tianxiao Chen, Xinyi Yu, Yifan Liu
The Contrastive Language-Image Pretraining (CLIP) model has been widely used in various downstream vision tasks. The few-shot learning paradigm has been widely adopted to augment its capacity for these tasks. However, current paradigms may struggle with fine-grained classification, such as satellite image recognition, due to widening domain gaps. To address this limitation, we propose retrieval-enhanced visual prompt learning (RePrompt), which introduces retrieval mechanisms to cache and reuse the knowledge of downstream tasks. RePrompt constructs a retrieval database from either training examples or external data if available, and uses a retrieval mechanism to enhance multiple stages of a simple prompt learning baseline, thus narrowing the domain gap. During inference, our enhanced model can reference similar samples brought by retrieval to make more accurate predictions. A detailed analysis reveals that retrieval helps to improve the distribution of late features, thus improving generalization for downstream tasks. RePrompt attains state-of-the-art performance on a wide range of vision datasets, including 11 image datasets, 3 video datasets, 1 multi-view dataset, and 4 domain generalization benchmarks.
Wang-Chiew Tan, Yuliang Li, Pedro Rodriguez, Richard James, Xi Victoria Lin, Alon Halevy, Scott Yih
We present a reality check on large language models and inspect the promise of retrieval-augmented language models in comparison. Such language models are semi-parametric, where models integrate model parameters and knowledge from external data sources to make their predictions, as opposed to the parametric nature of vanilla large language models. We give initial experimental findings that semi-parametric architectures can be enhanced with views, a query analyzer/planner, and provenance to make a significantly more powerful system for question answering in terms of accuracy and efficiency, and potentially for other NLP tasks.
Alexandru Ghita, Radu Tudor Ionescu
The performance of neural networks in content-based image retrieval (CBIR) is highly influenced by the chosen loss (objective) function. The majority of objective functions for neural models can be divided into metric learning and statistical learning. Metric learning approaches require a pair mining strategy that often lacks efficiency, while statistical learning approaches do not generate highly compact features due to their indirect feature optimization. To this end, we propose a novel repeller-attractor loss that falls in the metric learning paradigm, yet directly optimizes for the L2 metric without the need to generate pairs. Our loss is formed of three components. One leading objective ensures that the learned features are attracted to each designated learnable class anchor. The second loss component regulates the anchors and forces them to be separable by a margin, while the third objective ensures that the anchors do not collapse to zero. Furthermore, we develop a more efficient two-stage retrieval system by harnessing the learned class anchors during the first stage of the retrieval process, eliminating the need to compare the query with every image in the database. We establish a set of four datasets (CIFAR-100, Food-101, SVHN, and Tiny ImageNet) and evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures. Compared to existing objective functions, our empirical evidence shows that the proposed objective generates superior and more consistent results.
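The three components of the loss can be sketched as follows. The abstract does not give the formulas, so the squared-distance and hinge forms, the margin, and the minimum-norm threshold below are assumptions chosen for illustration; they are not necessarily the paper's exact definitions.

```python
import math


def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def attractor_loss(features, labels, anchors):
    """Pull each feature toward the anchor of its own class."""
    return sum(l2(f, anchors[y]) ** 2 for f, y in zip(features, labels)) / len(features)


def repeller_loss(anchors, margin):
    """Penalize pairs of anchors that are closer than `margin`."""
    total, count = 0.0, 0
    for i in range(len(anchors)):
        for j in range(i + 1, len(anchors)):
            total += max(0.0, margin - l2(anchors[i], anchors[j])) ** 2
            count += 1
    return total / count if count else 0.0


def min_norm_loss(anchors, min_norm):
    """Penalize anchors whose norm falls below `min_norm` (avoids collapse to zero)."""
    origin = [0.0] * len(anchors[0])
    return sum(max(0.0, min_norm - l2(a, origin)) ** 2 for a in anchors) / len(anchors)


def total_loss(features, labels, anchors, margin=2.0, min_norm=1.0):
    """Unweighted sum of the three components (weights are an assumption)."""
    return (
        attractor_loss(features, labels, anchors)
        + repeller_loss(anchors, margin)
        + min_norm_loss(anchors, min_norm)
    )


# Two hypothetical class anchors and two features in 2-D.
anchors = [[1.0, 0.0], [0.0, 1.0]]
features = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 1]
loss = total_loss(features, labels, anchors)
```

No pairs of samples are mined here: each feature interacts only with its class anchor, and the pairwise term runs over anchors, whose number equals the number of classes.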
Man Luo, Zhiyuan Fang, Tejas Gokhale, Yezhou Yang, Chitta Baral
We investigate knowledge retrieval with multi-modal queries, i.e. queries
containing information split across image and text inputs, a challenging task
that differs from previous work on cross-modal retrieval. We curate a new
dataset called ReMuQ for benchmarking progress on this task. ReMuQ requires a
system to retrieve knowledge from a large corpus by integrating contents from
both text and image queries. We introduce a retriever model ``ReViz'' that can
directly process input text and images to retrieve relevant knowledge in an
end-to-end fashion without being dependent on intermediate modules such as
object detectors or caption generators. We introduce a new pretraining task
that is effective for learning knowledge retrieval with multimodal queries and
also improves performance on downstream tasks. We demonstrate superior
performance in retrieval on two datasets (ReMuQ and OK-VQA) under zero-shot
settings as well as further improvements when finetuned on these datasets.
Authors' comments: ACL 2023
Shiyang Li, Yifan Gao, Haoming Jiang, Qingyu Yin, Zheng Li, Xifeng Yan, Chao Zhang, Bing Yin
Answering complex questions often requires reasoning over knowledge graphs
(KGs). State-of-the-art methods often utilize entities in questions to retrieve
local subgraphs, which are then fed into a KG encoder, e.g., graph neural networks
(GNNs), to model their local structures and integrated into language models for
question answering. However, this paradigm constrains retrieved knowledge in
local subgraphs and discards more diverse triplets buried in KGs that are
disconnected but useful for question answering. In this paper, we propose a
simple yet effective method to first retrieve the most relevant triplets from
KGs and then rerank them, which are then concatenated with questions to be fed
into language models. Extensive results on both CommonsenseQA and OpenbookQA
datasets show that our method can outperform the state of the art by up to 4.6%
absolute accuracy.
Authors' comments: Findings of ACL 2023
Jian Chen, Ruiyi Zhang, Tong Yu, Rohan Sharma, Zhiqiang Xu, Tong Sun, Changyou Chen
Learning from noisy labels is an important and long-standing problem in
machine learning for real applications. One of the main research lines focuses
on learning a label corrector to purify potential noisy labels. However, these
methods typically rely on strict assumptions and are limited to certain types
of label noise. In this paper, we reformulate the label-noise problem from a
generative-model perspective, i.e., labels are generated by
gradually refining an initial random guess. This new perspective immediately
enables existing powerful diffusion models to seamlessly learn the stochastic
generative process. Once the generative uncertainty is modeled, we can perform
classification inference using maximum likelihood estimation of labels. To
mitigate the impact of noisy labels, we propose the
Label-Retrieval-Augmented (LRA) diffusion
model, which leverages neighbor consistency to effectively construct
pseudo-clean labels for diffusion training. Our model is flexible and general,
allowing easy incorporation of different types of conditional information,
e.g., use of pre-trained models, to further boost model performance.
Extensive experiments are conducted for evaluation. Our model achieves new
state-of-the-art (SOTA) results on all the standard real-world benchmark
datasets. Remarkably, by incorporating conditional information from the
powerful CLIP model, our method can boost the current SOTA accuracy by 10-20
absolute points in many cases.
Authors' comments: Accepted by NeurIPS 2023
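One simple way to picture neighbor consistency is a majority vote over the noisy labels of a sample's nearest neighbors in some feature space. The sketch below assumes precomputed feature vectors (for example from a pre-trained encoder) and uses a plain majority vote; the vote rule, the value of `k`, and the function name are assumptions and may differ from how LRA actually constructs its pseudo-clean labels.

```python
import math
from collections import Counter


def knn_pseudo_labels(features, noisy_labels, k):
    """Relabel each sample with the majority noisy label of its k nearest neighbors."""
    pseudo = []
    for i, f in enumerate(features):
        # Distances to every other sample, nearest first (ties broken by index).
        dists = sorted(
            (math.dist(f, g), j) for j, g in enumerate(features) if j != i
        )
        neighbor_labels = [noisy_labels[j] for _, j in dists[:k]]
        pseudo.append(Counter(neighbor_labels).most_common(1)[0][0])
    return pseudo


# Two well-separated toy clusters; the third sample carries a flipped label.
features = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
]
noisy_labels = [0, 0, 1, 0, 1, 1, 1, 1]
pseudo_labels = knn_pseudo_labels(features, noisy_labels, k=3)
```

With these toy inputs the flipped label in the first cluster is voted back to class 0, because all of that sample's neighbors carry class 0.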
Shengchao Liu, Jiongxiao Wang, Yijin Yang, Chengpeng Wang, Ling Liu, Hongyu Guo, Chaowei Xiao
Recent advancements in conversational large language models (LLMs), such as ChatGPT, have demonstrated remarkable promise in various domains, including drug discovery. However, existing works mainly focus on investigating the capabilities of conversational LLMs on chemical reaction and retrosynthesis, while drug editing, a critical task in the drug discovery pipeline, remains largely unexplored. To bridge this gap, we propose ChatDrug, a framework to facilitate the systematic investigation of drug editing using LLMs. ChatDrug jointly leverages a prompt module, a retrieval and domain feedback (ReDF) module, and a conversation module to streamline effective drug editing. We empirically show that ChatDrug reaches the best performance on 33 out of 39 drug editing tasks, encompassing small molecules, peptides, and proteins. We further demonstrate, through 10 case studies, that ChatDrug can successfully identify the key substructures (e.g., the molecule functional groups, peptide motifs, and protein structures) for manipulation, generating diverse and valid suggestions for drug editing. Promisingly, we also show that ChatDrug can offer insightful explanations from a domain-specific perspective, enhancing interpretability and enabling informed decision-making. This research sheds light on the potential of ChatGPT and conversational LLMs for drug editing. It paves the way for a more efficient and collaborative drug discovery pipeline, contributing to the advancement of pharmaceutical research and development.
Zhicheng Guo, Sijie Cheng, Yile Wang, Peng Li, Yang Liu
Retrieval-augmented methods have received increasing attention to support downstream tasks by leveraging useful information from external resources. Recent studies mainly focus on exploring retrieval to solve knowledge-intensive (KI) tasks. However, the potential of retrieval for most non-knowledge-intensive (NKI) tasks remains under-explored. There are two main challenges to leveraging retrieval-augmented methods for NKI tasks: 1) the demand for diverse relevance score functions and 2) the dilemma between training cost and task performance. To address these challenges, we propose a two-stage framework for NKI tasks, named PGRA. In the first stage, we adopt a task-agnostic retriever to build a shared static index and select candidate evidence efficiently. In the second stage, we design a prompt-guided reranker to rerank the nearest evidence according to task-specific relevance for the reader. Experimental results show that PGRA outperforms other state-of-the-art retrieval-augmented methods. Our analyses further investigate the factors that influence model performance and demonstrate the generality of PGRA. Code is available at https://github.com/THUNLP-MT/PGRA.
Chaeeun Kim, Soyoung Yoon, Hyunji Lee, Joel Jang, Sohee Yang, Minjoon Seo
Benchmarking the performance of information retrieval (IR) is mostly
conducted with a fixed set of documents (static corpora). However, in realistic
scenarios, this is rarely the case and the documents to be retrieved are
constantly updated and added. In this paper, we focus on Generative Retrievals
(GR), which apply autoregressive language models to IR problems, and explore
their adaptability and robustness in dynamic scenarios. We also conduct an
extensive evaluation of computational and memory efficiency, crucial factors
for real-world deployment of IR systems handling vast and ever-changing
document collections. Our results on the StreamingQA benchmark demonstrate that
GR is more adaptable to evolving knowledge (4-11%), robust in learning
knowledge with temporal information, and efficient in terms of inference FLOPs
(x2), indexing time (x6), and storage footprint (x4) compared to Dual Encoders
(DE), which are commonly used in retrieval systems. Our paper highlights the
potential of GR for future use in practical IR systems within dynamic
environments.
Authors' comments: published at EMNLP 2024
Xuming Hu, Zhijiang Guo, Zhiyang Teng, Irwin King, Philip S. Yu
Multimodal relation extraction (MRE) is the task of identifying the semantic
relationships between two entities based on the context of the sentence-image
pair. Existing retrieval-augmented approaches mainly focus on modeling the
retrieved textual knowledge, but this may not be able to accurately identify
complex relations. To improve the prediction, this research proposes to
retrieve textual and visual evidence based on the object, sentence, and whole
image. We further develop a novel approach to synthesize the object-level,
image-level, and sentence-level information for better reasoning between the
same and different modalities. Extensive experiments and analyses show that the
proposed method is able to effectively select and compare evidence across
modalities and significantly outperforms state-of-the-art models.
Authors' comments: Accepted to ACL 2023
Zheng Li, Caili Guo, Xin Wang, Zerun Feng, Yanjun Wang
Image-Text Retrieval (ITR) is essentially a ranking problem. Given a query caption, the goal is to rank candidate images by relevance, from large to small. The current ITR datasets are constructed in a pairwise manner. Image-text pairs are annotated as positive or negative. Correspondingly, ITR models mainly use pairwise losses, such as triplet loss, to learn to rank. Pairwise-based ITR increases positive pair similarity while decreasing negative pair similarity indiscriminately. However, the relevance between dissimilar negative pairs is different. Pairwise annotations cannot reflect this difference in relevance. In the current datasets, pairwise annotations miss many correlations. There are many potential positive pairs among the pairs labeled as negative. Pairwise-based ITR can only rank positive samples before negative samples, but cannot rank negative samples by relevance. In this paper, we integrate listwise ranking into conventional pairwise-based ITR. Listwise ranking optimizes the entire ranking list based on relevance scores. Specifically, we first propose a Relevance Score Calculation (RSC) module to calculate the relevance score of the entire ranked list. Then we choose the ranking metric, Normalized Discounted Cumulative Gain (NDCG), as the optimization objective. We transform the non-differentiable NDCG into a differentiable listwise loss, named Smooth-NDCG (S-NDCG). Our listwise ranking approach can be plug-and-play integrated into current pairwise-based ITR models. Experiments on ITR benchmarks show that integrating listwise ranking can improve the performance of current ITR models and provide more user-friendly retrieval results. The code is available at https://github.com/AAA-Zheng/Listwise_ITR.
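To make the idea of a differentiable NDCG surrogate concrete, the sketch below computes exact NDCG and a smoothed variant in which each item's rank is approximated by a sum of sigmoids over score differences. This sigmoid-based rank relaxation is a common smoothing technique and is used here as an assumption; it is not necessarily the S-NDCG formulation from the paper. The temperature `tau` and the toy scores are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dcg(relevances_in_rank_order):
    """Discounted cumulative gain for relevances listed from rank 1 downward."""
    return sum(
        (2 ** r - 1) / math.log2(i + 2)
        for i, r in enumerate(relevances_in_rank_order)
    )


def ndcg(scores, relevances):
    """Exact (non-differentiable) NDCG of the ranking induced by `scores`."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg([relevances[i] for i in order]) / ideal if ideal > 0 else 0.0


def smooth_ranks(scores, tau):
    """Soft rank: 1 + sum over other items of sigmoid((s_j - s_i) / tau)."""
    return [
        1.0 + sum(sigmoid((sj - si) / tau) for j, sj in enumerate(scores) if j != i)
        for i, si in enumerate(scores)
    ]


def smooth_ndcg_loss(scores, relevances, tau=0.1):
    """1 - smoothed NDCG; lower is better, smooth in the scores."""
    ranks = smooth_ranks(scores, tau)
    smooth_dcg = sum(
        (2 ** r - 1) / math.log2(1.0 + rank) for r, rank in zip(relevances, ranks)
    )
    ideal = dcg(sorted(relevances, reverse=True))
    return 1.0 - smooth_dcg / ideal if ideal > 0 else 0.0


# Three candidate images with graded relevance to one caption (toy values).
relevances = [2, 0, 1]
good_scores = [0.9, 0.2, 0.5]  # ordering agrees with the relevances
bad_scores = [0.2, 0.9, 0.5]   # the irrelevant image is ranked first
loss_good = smooth_ndcg_loss(good_scores, relevances, tau=0.01)
loss_bad = smooth_ndcg_loss(bad_scores, relevances, tau=0.01)
```

As `tau` shrinks, the soft ranks approach the true ranks and the loss approaches one minus the exact NDCG; larger values of `tau` give smoother gradients at the cost of a looser approximation.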
Xuming Hu, Junzhe Chen, Zhijiang Guo, Philip S. Yu
Evidence plays a crucial role in automated fact-checking. When verifying
real-world claims, existing fact-checking systems either assume the evidence
sentences are given or use the search snippets returned by the search engine.
Such methods ignore the challenges of collecting evidence and may not provide
sufficient information to verify real-world claims. Aiming at building a better
fact-checking system, we propose to incorporate full text from source documents
as evidence and introduce two enriched datasets. The first one is a
multilingual dataset, while the second one is monolingual (English). We further
develop a latent variable model to jointly extract evidence sentences from
documents and perform claim verification. Experiments indicate that including
source documents can provide sufficient contextual clues even when gold
evidence sentences are not annotated. The proposed system is able to achieve
significant improvements upon best-reported models under different settings.
Authors' comments: Fixed minor issues, 11 pages
Kevin Lin, Kyle Lo, Joseph E. Gonzalez, Dan Klein
When re-finding items, users who forget or are uncertain about identifying details often rely on creative strategies for expressing their information needs -- complex queries that describe content elements (e.g., book characters or events), information beyond the document text (e.g., descriptions of book covers), or personal context (e.g., when they read a book). This retrieval setting, called tip of the tongue (TOT), is especially challenging for models heavily reliant on lexical and semantic overlap between query and document text. In this work, we introduce a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results. This approach allows us to take advantage of off-the-shelf retrievers (e.g., CLIP for retrieving images of book covers) or incorporate retriever-specific logic (e.g., date constraints). We show that our framework incorporating query decompositions into retrievers can improve gold book recall up to 7% relative gain for Recall@5 on a new collection of 14,441 real-world query-book pairs from an online community for resolving TOT inquiries.
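The decompose, route, and ensemble steps can be sketched as below. The clue types, the toy retrievers, and the book identifiers are hypothetical, and the ensembling step uses reciprocal rank fusion as a stand-in, since the abstract does not say how the results are combined.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each item scores sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve_with_decomposition(clues, retrievers):
    """Route each (clue_type, text) sub-query to its retriever and fuse the results."""
    rankings = [
        retrievers[clue_type](text)
        for clue_type, text in clues
        if clue_type in retrievers
    ]
    return reciprocal_rank_fusion(rankings)


# Hypothetical clues extracted from one tip-of-the-tongue query.
clues = [
    ("plot", "a girl who befriends a dragon"),
    ("cover", "blue cover with a silver moon"),
    ("date", "read it around 2005"),
]

# Toy retrievers returning fixed rankings; real ones would query an index.
retrievers = {
    "plot": lambda text: ["book_a", "book_b", "book_c"],
    "cover": lambda text: ["book_b", "book_a"],
    "date": lambda text: ["book_b", "book_c"],
}

ranked_books = retrieve_with_decomposition(clues, retrievers)
```

Here `book_b` ends up first because two of the three specialized retrievers rank it at the top, even though the plot retriever alone prefers `book_a`.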
Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, Nan Duan
Large Language Models (LLMs) act as powerful, black-box readers in the
retrieve-then-read pipeline, making remarkable progress in knowledge-intensive
tasks. This work introduces a new framework, Rewrite-Retrieve-Read, in place of
the previous retrieve-then-read pipeline for retrieval-augmented LLMs, from the
perspective of query rewriting. Unlike prior studies focusing on adapting
either the retriever or the reader, our approach pays attention to the
adaptation of the search query itself, for there is inevitably a gap between
the input text and the needed knowledge in retrieval. We first prompt an LLM to
generate the query, then use a web search engine to retrieve contexts.
Furthermore, to better align the query to the frozen modules, we propose a
trainable scheme for our pipeline. A small language model is adopted as a
trainable rewriter to cater to the black-box LLM reader. The rewriter is
trained using the feedback of the LLM reader by reinforcement learning.
Evaluation is conducted on downstream tasks, open-domain QA and multiple-choice
QA. Experimental results show consistent performance improvement, indicating
that our framework is effective and scalable, and offers a new approach
for retrieval-augmented LLMs.
Authors' comments: EMNLP 2023
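The inference-time flow of the three steps can be sketched as a small pipeline of callables. Everything concrete below is a hypothetical stand-in: the rewriter is a stop-word filter rather than a language model, the search function is a dictionary lookup rather than a web search engine, and the reader is a trivial rule rather than a black-box LLM. The reinforcement-learning training of the rewriter is not shown.

```python
from typing import Callable, List


def rewrite_retrieve_read(
    question: str,
    rewriter: Callable[[str], str],
    search: Callable[[str], List[str]],
    reader: Callable[[str], str],
    top_k: int = 2,
) -> str:
    """Rewrite the question into a query, retrieve contexts, then read."""
    query = rewriter(question)
    contexts = search(query)[:top_k]
    prompt = "\n".join(contexts + [f"Question: {question}", "Answer:"])
    return reader(prompt)


def toy_rewriter(question: str) -> str:
    """Stand-in rewriter: lowercases and drops a few filler words."""
    stop = {"what", "is", "the", "of", "please", "tell", "me"}
    words = question.lower().strip("?").split()
    return " ".join(w for w in words if w not in stop)


# Stand-in for a web search engine: maps queries to retrieved passages.
TOY_WEB = {"capital france": ["Paris is the capital of France."]}


def toy_search(query: str) -> List[str]:
    return TOY_WEB.get(query, [])


def toy_reader(prompt: str) -> str:
    """Stand-in reader: answers only if the first context line mentions Paris."""
    first_line = prompt.split("\n")[0]
    return "Paris" if "Paris" in first_line else "unknown"


answer = rewrite_retrieve_read(
    "Please tell me what is the capital of France?",
    toy_rewriter,
    toy_search,
    toy_reader,
)
```

Because the retriever and reader are passed in as opaque callables, only the rewriter needs to change when adapting the pipeline, which mirrors the abstract's point about keeping the other modules frozen.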
Livio Baldini Soares, Daniel Gillick, Jeremy R. Cole, Tom Kwiatkowski
Neural document rerankers are extremely effective in terms of accuracy. However, the best models require dedicated hardware for serving, which is costly and often not feasible. To avoid this serving-time requirement, we present a method of capturing up to 86% of the gains of a Transformer cross-attention model with a lexicalized scoring function that only requires $10^{-6}\%$ of the Transformer's FLOPs per document and can be served using commodity CPUs. When combined with a BM25 retriever, this approach matches the quality of a state-of-the-art dual encoder retriever, which still requires an accelerator for query encoding. We introduce NAIL (Non-Autoregressive Indexing with Language models) as a model architecture that is compatible with recent encoder-decoder and decoder-only large language models, such as T5, GPT-3 and PaLM. This model architecture can leverage existing pre-trained checkpoints and can be fine-tuned for efficiently constructing document representations that do not require neural processing of queries.
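A lexicalized scoring function of this kind can be pictured as a lookup and sum over precomputed per-document token weights. In the sketch below the weights are written by hand; in the setting the abstract describes they would be produced offline by a language model. The whitespace tokenizer, the index layout, and the function names are assumptions for illustration.

```python
def score(query: str, doc_weights: dict) -> float:
    """Sum the document's precomputed weights for the tokens in the query.

    No neural network runs at query time: scoring is a dictionary
    lookup per query token.
    """
    return sum(doc_weights.get(token, 0.0) for token in query.lower().split())


def rerank(query: str, index: dict) -> list:
    """Order document ids by descending lexical score (ties broken by id)."""
    return sorted(index, key=lambda doc_id: (-score(query, index[doc_id]), doc_id))


# Hypothetical offline index: document id -> {token: weight}.
index = {
    "d1": {"paris": 2.0, "france": 1.5, "capital": 1.0},
    "d2": {"berlin": 2.0, "germany": 1.5, "capital": 1.0},
}

ranking = rerank("capital of france", index)
```

Since all model computation happens at indexing time, serving reduces to sparse lookups, which is why such a scorer can run on commodity CPUs.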
Alexander Scarlatos, Andrew Lan
Many recent developments in large language models focus on prompting them to perform specific tasks. One effective prompting method is in-context learning, where the model performs a (possibly new) generation/prediction task given one (or more) examples. Past work has shown that the choice of examples can make a large impact on task performance. However, finding good examples is not straightforward since the definition of a representative group of examples can vary greatly depending on the task. While there are many existing methods for selecting in-context examples, they generally score examples independently, ignoring the dependency between them and the order in which they are provided to the large language model. In this work, we propose Retrieval for In-Context Learning (RetICL), a learnable method for modeling and optimally selecting examples sequentially for in-context learning. We frame the problem of sequential example selection as a Markov decision process, design an example retriever model using an LSTM, and train it using proximal policy optimization (PPO). We validate RetICL on math problem solving datasets and show that it outperforms both heuristic and learnable baselines, and achieves state-of-the-art accuracy on the TabMWP dataset. We also use case studies to show that RetICL implicitly learns representations of math problem solving strategies.
Raghav Gupta, Renat Aksitov, Samrat Phatale, Simral Chaudhary, Harrison Lee, Abhinav Rastogi
Conversational recommendation systems (CRS) aim to recommend suitable items
to users through natural language conversation. However, most CRS approaches do
not effectively utilize the signal provided by these conversations. They rely
heavily on explicit external knowledge (e.g., knowledge graphs) to augment the
models' understanding of the items and attributes, which is quite hard to
scale. To alleviate this, we propose an alternative information retrieval
(IR)-styled approach to the CRS item recommendation task, where we represent
conversations as queries and items as documents to be retrieved. We expand the
document representation used for retrieval with conversations from the training
set. With a simple BM25-based retriever, we show that our task formulation
compares favorably with much more complex baselines using complex external
knowledge on a popular CRS benchmark. We demonstrate further improvements using
user-centric modeling and data augmentation to counter the cold start problem
for CRSs.
Authors' comments: To appear at the 5th NLP4ConvAI workshop
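The retrieval formulation can be sketched with a small self-contained BM25 scorer in which each item's document is expanded with training conversations that mention it. The item names, conversations, whitespace tokenizer, and parameter values (`k1`, `b`) are hypothetical; this is a generic Okapi BM25 variant, not the authors' system.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 over a dict of {doc_id: text}."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_ids = list(docs)
        self.term_freqs = [Counter(tokenize(docs[d])) for d in self.doc_ids]
        self.lengths = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        doc_freq = Counter(term for tf in self.term_freqs for term in tf)
        n = len(self.doc_ids)
        self.idf = {
            t: math.log(1.0 + (n - c + 0.5) / (c + 0.5)) for t, c in doc_freq.items()
        }

    def rank(self, query):
        """Return (doc_id, score) pairs, best first (ties broken by id)."""
        results = []
        for doc_id, tf, length in zip(self.doc_ids, self.term_freqs, self.lengths):
            s = 0.0
            for term in tokenize(query):
                f = tf.get(term, 0)
                if f == 0:
                    continue
                norm = f + self.k1 * (1.0 - self.b + self.b * length / self.avg_len)
                s += self.idf[term] * f * (self.k1 + 1.0) / norm
            results.append((doc_id, s))
        return sorted(results, key=lambda r: (-r[1], r[0]))


# Items as documents (hypothetical metadata).
items = {
    "inception": "inception sci-fi thriller",
    "notebook": "the notebook romance drama",
}

# Training conversations paired with the item that was recommended.
train_conversations = [
    ("inception", "i want a mind bending movie about dreams"),
    ("notebook", "looking for a tearjerker love story"),
]

# Document expansion: append each conversation to its item's text.
expanded = dict(items)
for item_id, conversation in train_conversations:
    expanded[item_id] = expanded[item_id] + " " + conversation

query = "something about dreams that is mind bending"
plain_ranking = BM25(items).rank(query)
expanded_ranking = BM25(expanded).rank(query)
```

With the unexpanded documents the query shares no terms with either item, so every score is zero; after expansion the conversation vocabulary lets the lexical retriever match the right item.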
Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, Ashish Sabharwal
Large language models (LLMs) exhibit remarkable performance across various NLP tasks. However, they often generate incorrect or hallucinated information, which hinders their practical applicability in real-world scenarios. Human feedback has been shown to effectively enhance the factuality and quality of generated content, addressing some of these limitations. However, this approach is resource-intensive, involving manual input and supervision, which can be time-consuming and expensive. Moreover, it cannot be provided during inference, further limiting its practical utility in dynamic and interactive applications. In this paper, we introduce ReFeed, a novel pipeline designed to enhance LLMs by providing automatic retrieval feedback in a plug-and-play framework without the need for expensive fine-tuning. ReFeed first generates initial outputs, then utilizes a retrieval model to acquire relevant information from large document collections, and finally incorporates the retrieved information into the in-context demonstration for output refinement, thereby addressing the limitations of LLMs in a more efficient and cost-effective manner. Experiments on four knowledge-intensive benchmark datasets demonstrate that ReFeed improves performance by more than 6.0% under the zero-shot setting and 2.5% under the few-shot setting, compared to baselines without retrieval feedback.