Yizhou Chi, Jessy Lin, Kevin Lin, Dan Klein
Users often make ambiguous requests that require clarification. We study the problem of asking clarification questions in an information retrieval setting, where systems often face ambiguous search queries and it is challenging to turn the uncertainty in the retrieval model into a natural language question. We present CLARINET, a system that asks informative clarification questions by choosing questions whose answers would maximize certainty in the correct candidate. Our approach works by augmenting a large language model (LLM) to condition on a retrieval distribution, finetuning end-to-end to generate the question that would have maximized the rank of the true candidate at each turn. When evaluated on a real-world retrieval dataset of users searching for books, our system outperforms traditional heuristics such as information gain on retrieval success by 17% and vanilla-prompted LLMs by 39% relative.
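For reference, the information-gain heuristic that the abstract names as a baseline can be sketched as follows. This is not CLARINET itself, only a minimal illustration of the baseline idea. It assumes a retrieval distribution over candidates and yes/no questions whose answer is known for each candidate. All function names and the toy book data are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_information_gain(prior, answers):
    """Expected entropy reduction from asking one yes/no question.

    prior:   dict candidate -> probability under the retrieval model
    answers: dict candidate -> True/False, the answer each candidate implies
    """
    h_prior = entropy(prior.values())
    expected_posterior = 0.0
    for answer in (True, False):
        # Probability mass of the candidates consistent with this answer.
        mass = sum(p for c, p in prior.items() if answers[c] == answer)
        if mass == 0:
            continue
        posterior = [p / mass for c, p in prior.items() if answers[c] == answer]
        expected_posterior += mass * entropy(posterior)
    return h_prior - expected_posterior

def pick_question(prior, questions):
    """Return the question with the highest expected information gain."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))

# Hypothetical toy example: four books with a uniform retrieval distribution.
prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
questions = {
    "Is it a novel?": {"A": True, "B": True, "C": False, "D": False},
    "Was it published before 1900?": {"A": True, "B": False, "C": False, "D": False},
}
best = pick_question(prior, questions)
```

In this toy case the first question splits the candidates evenly and is therefore preferred over the second, which isolates only one candidate.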
Huanshuo Liu, Bo Chen, Menghui Zhu, Jianghao Lin, Jiarui Qin, Yang Yang, Hao Zhang, Ruiming Tang
Click-through rate (CTR) prediction is crucial for personalized online
services. Sample-level retrieval-based models, such as RIM, have demonstrated
remarkable performance. However, they face challenges including inference
inefficiency and high resource consumption due to the retrieval process, which
hinder their practical application in industrial settings. To address this, we
propose a universal plug-and-play retrieval-oriented
knowledge (ROK) framework that bypasses the real
retrieval process. The framework features a knowledge base that preserves and
imitates the retrieved and aggregated representations using a
decomposition-reconstruction paradigm. Knowledge distillation and contrastive
learning optimize the knowledge base, enabling the integration of
retrieval-enhanced representations with various CTR models. Experiments on
three large-scale datasets demonstrate ROK's exceptional compatibility and
performance, with the neural knowledge base serving as an effective surrogate
for the retrieval pool. ROK surpasses the teacher model while maintaining
superior inference efficiency and demonstrates the feasibility of distilling
knowledge from non-parametric methods using a parametric approach. These
results highlight ROK's strong potential for real-world applications and its
ability to transform retrieval-based methods into practical solutions. Our
implementation code is available to support reproducibility at
https://github.com/HSLiu-Initial/ROK.git.
Authors' comments: 11 pages, 6 figures, 6 tables. Accepted by CIKM'24
Guozheng Li, Peng Wang, Wenjun Ke, Yikai Guo, Ke Ji, Ziyu Shang, Jiajun Liu, Zijie Xu
Relation extraction (RE) aims to identify relations between entities
mentioned in texts. Although large language models (LLMs) have demonstrated
impressive in-context learning (ICL) abilities in various tasks, they still
suffer from poor performance compared to most supervised fine-tuned RE
methods. Utilizing ICL for RE with LLMs encounters two challenges: (1)
retrieving good demonstrations from training examples, and (2) enabling LLMs
to exhibit strong ICL abilities in RE. On the one hand, retrieving good
demonstrations is a non-trivial process in RE, which easily results in low
relevance regarding entities and relations. On the other hand, ICL with an LLM
achieves poor performance in RE when RE differs from language modeling in
nature or the LLM is not large enough. In this work, we propose a novel
recall-retrieve-reason RE framework that synergizes LLMs with retrieval corpora
(training examples) to enable relevant retrieving and reliable in-context
reasoning. Specifically, we distill the consistent ontological knowledge from
training datasets to let LLMs generate relevant entity pairs grounded by
retrieval corpora as valid queries. These entity pairs are then used to
retrieve relevant training examples from the retrieval corpora as
demonstrations for LLMs to conduct better ICL via instruction tuning. Extensive
experiments on different LLMs and RE datasets demonstrate that our method
generates relevant and valid entity pairs and boosts ICL abilities of LLMs,
achieving competitive or new state-of-the-art performance on sentence-level RE
compared to previous supervised fine-tuning methods and ICL-based methods.
Authors' comments: IJCAI 2024
Lucas Ventura, Cordelia Schmid, Gül Varol
We describe a protocol to study text-to-video retrieval training with
unlabeled videos, where we assume (i) no access to labels for any videos, i.e.,
no access to the set of ground-truth captions, but (ii) access to labeled
images in the form of text. Using image expert models is a realistic scenario
given that annotating images is cheaper and therefore more scalable, in contrast to
expensive video labeling schemes. Recently, zero-shot image experts such as
CLIP have established a new strong baseline for video understanding tasks. In
this paper, we make use of this progress and instantiate the image experts from
two types of models: a text-to-image retrieval model to provide an initial
backbone, and image captioning models to provide a supervision signal for
unlabeled videos. We show that automatically labeling video frames with image
captioning allows text-to-video retrieval training. This process adapts the
features to the target domain at no manual annotation cost, consequently
outperforming the strong zero-shot CLIP baseline. During training, we sample
captions from multiple video frames that best match the visual content, and
perform a temporal pooling over frame representations by scoring frames
according to their relevance to each caption. We conduct extensive ablations to
provide insights and demonstrate the effectiveness of this simple framework by
outperforming the CLIP zero-shot baselines on text-to-video retrieval on three
standard datasets, namely ActivityNet, MSR-VTT, and MSVD.
Authors' comments: A short version of this work appeared at CVPR 2023 Workshops. Project
page: https://imagine.enpc.fr/~ventural/multicaps/
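The temporal pooling step described in this abstract, where frames are scored by their relevance to a caption, can be illustrated with a minimal sketch. The abstract does not specify the scoring or weighting function. The dot-product score, the softmax weighting, the `temperature` parameter, and the toy embeddings below are all assumptions made for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pool_frames(frame_embs, caption_emb, temperature=1.0):
    """Average frame embeddings, weighted by their relevance to the caption."""
    scores = [dot(f, caption_emb) for f in frame_embs]
    weights = softmax(scores, temperature)
    dim = len(caption_emb)
    pooled = [sum(w * f[i] for w, f in zip(weights, frame_embs)) for i in range(dim)]
    return pooled, weights

# Hypothetical 2-D embeddings for three frames and one caption.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
caption = [1.0, 0.0]
video_emb, weights = pool_frames(frames, caption, temperature=0.1)
```

A lower temperature concentrates the pooled representation on the frames that best match the caption, while a higher one approaches plain mean pooling.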
Quan Zhang, Binqi Zeng, Chijin Zhou, Gwihwan Go, Heyuan Shi, Yu Jiang
Presently, with the assistance of advanced LLM application development frameworks, more and more LLM-powered applications can effortlessly augment the LLMs' knowledge with external content using the retrieval augmented generation (RAG) technique. However, these frameworks' designs do not sufficiently consider the risk of external content, thereby allowing attackers to undermine the applications developed with these frameworks. In this paper, we reveal a new threat to LLM-powered applications, termed retrieval poisoning, where attackers can guide the application to yield malicious responses during the RAG process. Specifically, through the analysis of LLM application frameworks, attackers can craft documents visually indistinguishable from benign ones. Despite the documents providing correct information, once they are used as reference sources for RAG, the application is misled into generating incorrect responses. Our preliminary experiments indicate that attackers can mislead LLMs with an 88.33% success rate, and achieve a 66.67% success rate in the real-world application, demonstrating the potential impact of retrieval poisoning.
Yunpeng Xu, Mufang Ying, Wenge Guo, Zhi Wei
Practical machine learning systems often operate in multiple sequential
stages, as seen in ranking and recommendation systems, which typically include
a retrieval phase followed by a ranking phase. Effectively assessing prediction
uncertainty and ensuring effective risk control in such systems pose
significant challenges due to their inherent complexity. To address these
challenges, we developed two-stage risk control methods based on the recently
proposed learn-then-test (LTT) and conformal risk control (CRC) frameworks.
Unlike the methods in prior work that address multiple risks, our approach
leverages the sequential nature of the problem, resulting in reduced
computational burden. We provide theoretical guarantees for our proposed
methods and design novel loss functions tailored for ranked retrieval tasks.
The effectiveness of our approach is validated through experiments on two
large-scale, widely-used datasets: MSLR-Web and Yahoo LTRC.
Authors' comments: 20 pages, 3 figures, 2 tables; 7 supplementary pages
Ryoya Nara, Yu-Chieh Lin, Yuji Nozawa, Youyang Ng, Goh Itoh, Osamu Torii, Yusuke Matsui
Many image retrieval studies use metric learning to train an image encoder.
However, metric learning cannot handle differences in users' preferences, and
requires data to train an image encoder. To overcome these limitations, we
revisit relevance feedback, a classic technique for interactive retrieval
systems, and propose an interactive CLIP-based image retrieval system with
relevance feedback. Our retrieval system first executes the retrieval, collects
each user's unique preferences through binary feedback, and returns images the
user prefers. Even when users have various preferences, our retrieval system
learns each user's preference through the feedback and adapts to the
preference. Moreover, our retrieval system leverages CLIP's zero-shot
transferability and achieves high accuracy without training. We empirically
show that our retrieval system competes well with state-of-the-art metric
learning in category-based image retrieval, despite not training image encoders
specifically for each dataset. Furthermore, we set up two additional
experimental settings where users have various preferences: one-label-based
image retrieval and conditioned image retrieval. In both cases, our retrieval
system effectively adapts to each user's preferences, resulting in improved
accuracy compared to image retrieval without feedback. Overall, our work
highlights the potential benefits of integrating CLIP with classic relevance
feedback techniques to enhance image retrieval.
Authors' comments: Accepted to ECCV 2024 Workshops: 2nd Workshop on Traditional Computer
Vision in the Age of Deep Learning (TradiCV)
Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Multimodal LLMs are the natural evolution of LLMs, and enlarge their
capabilities so as to work beyond the pure textual modality. As research is
being carried out to design novel architectures and vision-and-language
adapters, in this paper we concentrate on endowing such models with the
capability of answering questions that require external knowledge. Our
approach, termed Wiki-LLaVA, aims at integrating an external knowledge source
of multimodal documents, which is accessed through a hierarchical retrieval
pipeline. Relevant passages, using this approach, are retrieved from the
external knowledge source and employed as additional context for the LLM,
augmenting the effectiveness and precision of generated dialogues. We conduct
extensive experiments on datasets tailored for visual question answering with
external data and demonstrate the appropriateness of our approach.
Authors' comments: CVPR 2024 Workshop on What is Next in Multimodal Foundation Models
Kuicai Dong, Derrick Goh Xin Deik, Yi Quan Lee, Hao Zhang, Xiangyang Li, Cong Zhang, Yong Liu
Long document question answering (DocQA) aims to answer questions from long documents of over 10k words. Such documents usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, indexing methods for long documents remain under-explored, and existing systems generally employ fixed-length chunking. Because fixed-length chunking does not consider content structures, the resultant chunks can exclude vital information or include irrelevant content. Motivated by this, we propose Multi-view Content-aware indexing (MC-indexing) for more effective long DocQA by (i) segmenting a structured document into content chunks, and (ii) representing each content chunk in raw-text, keywords, and summary views. We highlight that MC-indexing requires neither training nor fine-tuning. Having plug-and-play capability, it can be seamlessly integrated with any retriever to boost its performance. In addition, we propose a long DocQA dataset that includes not only question-answer pairs, but also document structure and answer scope. When compared to state-of-the-art chunking schemes, MC-indexing significantly increases recall by 42.8%, 30.0%, 23.9%, and 16.3% at top k = 1.5, 3, 5, and 10, respectively. These improvements are averaged over 8 widely used retrievers (2 sparse and 6 dense) in extensive experiments.
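The contrast between fixed-length chunking and content-aware chunking with multiple views can be sketched as follows. This is a minimal illustration, not the MC-indexing implementation. It assumes section headings are marked with a leading '#', and it uses naive stand-ins for the keyword view (most frequent longer words) and the summary view (first sentence), since the abstract does not say how those views are produced. All function names are hypothetical.

```python
import re
from collections import Counter

def fixed_length_chunks(text, size):
    """Baseline: split into chunks of `size` words, ignoring structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def content_chunks(document):
    """Split a document at heading lines (assumed to start with '#')."""
    chunks, title, body = [], None, []
    for line in document.splitlines():
        if line.startswith("#"):
            # A new heading closes the previous section, if there was one.
            if title is not None or body:
                chunks.append((title, " ".join(body)))
            title, body = line.lstrip("# ").strip(), []
        elif line.strip():
            body.append(line.strip())
    if title is not None or body:
        chunks.append((title, " ".join(body)))
    return chunks

def build_views(title, text, num_keywords=3):
    """Represent one content chunk in raw-text, keywords, and summary views."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 3)
    keywords = [w for w, _ in counts.most_common(num_keywords)]
    # Naive summary stand-in: the first sentence of the chunk.
    summary = re.split(r"(?<=[.!?])\s+", text)[0] if text else ""
    return {"title": title, "raw_text": text, "keywords": keywords, "summary": summary}

doc = """# Introduction
Retrieval helps question answering. Retrieval needs good chunks.
# Method
We split documents at section boundaries."""

index = [build_views(title, text) for title, text in content_chunks(doc)]
```

Each entry in `index` could then be embedded or indexed per view by whichever retriever is in use, whereas `fixed_length_chunks` may cut across the section boundary.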
Sanat Sharma, David Seunghyun Yoon, Franck Dernoncourt, Dewang Sultania, Karishma Bagga, Mengjiao Zhang, Trung Bui, Varun Kotte
Question answering (QA) has become an important application in the advanced
development of large language models. General pre-trained large language models
for question-answering are not trained to properly understand the knowledge or
terminology for a specific domain, such as finance, healthcare, education, and
customer service for a product. To better cater to domain-specific
understanding, we build an in-house question-answering system for Adobe
products. We propose a novel framework to compile a large question-answer
database and develop the approach for retrieval-aware finetuning of a Large
Language model. We showcase that fine-tuning the retriever leads to major
improvements in the final generation. Our overall approach reduces
hallucinations during generation while keeping in context the latest retrieval
information for contextual grounding.
Authors' comments: AAAI 2024 (Association for the Advancement of Artificial
Intelligence) Scientific Document Understanding Workshop
Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, Zhicheng Dou
Information Retrieval (IR) systems, which are crucial tools for users to access information, have long been dominated by traditional methods relying on similarity matching. With the advancement of pre-trained language models, generative information retrieval (GenIR) emerges as a novel paradigm, attracting increasing attention. Based on the form of information provided to users, current research in GenIR can be categorized into two aspects: (1) Generative Document Retrieval (GR) leverages the generative model's parameters for memorizing documents, enabling retrieval by directly generating relevant document identifiers without explicit indexing. (2) Reliable Response Generation employs language models to directly generate information users seek, breaking the limitations of traditional IR in terms of document granularity and relevance matching while offering flexibility, efficiency, and creativity to meet practical needs. This paper aims to systematically review the latest research progress in GenIR. We will summarize the advancements in GR regarding model training and structure, document identifier, incremental learning, etc., as well as progress in reliable response generation in aspects of internal knowledge memorization, external knowledge augmentation, etc. We also review the evaluation, challenges and future developments in GenIR systems. This review aims to offer a comprehensive reference for researchers, encouraging further development in the GenIR field. GitHub Repository: https://github.com/RUC-NLPIR/GenIR-Survey
Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig
Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
Mathias Thorsager, Victor Croisfelt, Junya Shiraishi, Petar Popovski
This paper introduces EcoPull, a sustainable Internet of Things (IoT) framework empowered by tiny machine learning (TinyML) models for fetching images from wireless visual sensor networks. Two types of learnable TinyML models are installed in the IoT devices: i) a behavior model and ii) an image compressor model. The first filters out irrelevant images for the current task, reducing unnecessary transmission and resource competition among the devices. The second allows IoT devices to communicate with the receiver via latent representations of images, reducing communication bandwidth usage. However, integrating learnable modules into IoT devices comes at the cost of increased energy consumption due to inference. The numerical results show that the proposed framework can save more than 70% of the energy compared to the baseline while maintaining the quality of the retrieved images at the ES.
Xuzheng Yu, Chen Jiang, Xingning Dong, Tian Gan, Ming Yang, Qingpei Guo
The user base of short video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing approaches treat texts merely as discrete tokens and neglect their syntax structures. Moreover, the abundant spatial and temporal clues in videos are often underutilized due to the lack of interaction with text. To address these issues, we argue that using texts as guidance to focus on relevant temporal frames and spatial regions within videos is beneficial. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntax hierarchy of texts to bridge the modality gap from two perspectives. First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation. We evaluated our method on four public text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.
Donghuo Zeng, Yanan Wang, Kazushi Ikeda, Yi Yu
Metric learning minimizes the gap between similar (positive) pairs of data
points and increases the separation of dissimilar (negative) pairs, aiming at
capturing the underlying data structure and enhancing the performance of tasks
like audio-visual cross-modal retrieval (AV-CMR). Recent works employ sampling
methods to select impactful data points from the embedding space during
training. However, the model training fails to fully explore the space due to
the scarcity of training data points, resulting in an incomplete representation
of the overall positive and negative distributions. In this paper, we propose
an innovative Anchor-aware Deep Metric Learning (AADML) method to address this
challenge by uncovering the underlying correlations among existing data points,
which enhances the quality of the shared embedding space. Specifically, our
method establishes a correlation graph-based manifold structure by considering
the dependencies between each sample as the anchor and its semantically similar
samples. Through dynamic weighting of the correlations within this underlying
manifold structure using an attention-driven mechanism, Anchor Awareness (AA)
scores are obtained for each anchor. These AA scores serve as data proxies to
compute relative distances in metric learning approaches. Extensive experiments
conducted on two audio-visual benchmark datasets demonstrate the effectiveness
of our proposed AADML method, significantly surpassing state-of-the-art models.
Furthermore, we investigate the integration of AA proxies with various metric
learning methods, further highlighting the efficacy of our approach.
Authors' comments: 9 pages, 5 figures. Accepted by ACM ICMR 2024
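The generic metric learning objective that opens this abstract, pulling positive pairs together and pushing negative pairs apart, is often written as a triplet margin loss. The sketch below shows that standard loss only. It is not the AADML method or its Anchor Awareness scores. The `margin` value and the toy vectors are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical embeddings: the positive lies near the anchor, the negative far away.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The first triplet already satisfies the margin and contributes no loss, while the second is penalized in proportion to how far the ordering is violated.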
SeungHeon Doh, Jongpil Lee, Dasaem Jeong, Juhan Nam
Word embedding has become an essential means for text-based information
retrieval. Typically, word embeddings are learned from large quantities of
general and unstructured text data. However, in the domain of music, the word
embedding may have difficulty understanding musical contexts or recognizing
music-related entities like artists and tracks. To address this issue, we
propose a new approach called Musical Word Embedding (MWE), which involves
learning from various types of texts, including both everyday and music-related
vocabulary. We integrate MWE into an audio-word joint representation framework
for tagging and retrieving music, using words like tag, artist, and track that
have different levels of musical specificity. Our experiments show that using a
more specific musical word like track results in better retrieval performance,
while using a less specific term like tag leads to better tagging performance.
To balance this compromise, we suggest multi-prototype training that uses words
with different levels of musical specificity jointly. We evaluate both word
embedding and audio-word joint embedding on four tasks (tag rank prediction,
music tagging, query-by-tag, and query-by-track) across two datasets (Million
Song Dataset and MTG-Jamendo). Our findings show that the suggested MWE is more
efficient and robust than the conventional word embedding.
Authors' comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language
Processing (TASLP)
Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N. Ioannidis, Karthik Subbian et al.
Answering real-world user queries, such as product search, often requires
accurate retrieval of information from semi-structured knowledge bases or
databases that involve a blend of unstructured (e.g., textual descriptions of
products) and structured (e.g., entity relations of products) information.
However, previous works have mostly studied textual and relational retrieval
tasks as separate topics. To address the gap, we develop STARK, a large-scale
Semi-structure retrieval benchmark on Textual and Relational Knowledge Bases.
We design a novel pipeline to synthesize natural and realistic user queries
that integrate diverse relational information and complex textual properties,
as well as their ground-truth answers. Moreover, we rigorously conduct human
evaluation to validate the quality of our benchmark, which covers a variety of
practical applications, including product recommendations, academic paper
searches, and precision medicine inquiries. Our benchmark serves as a
comprehensive testbed for evaluating the performance of retrieval systems, with
an emphasis on retrieval approaches driven by large language models (LLMs). Our
experiments suggest that the STARK datasets present significant challenges to
the current retrieval and LLM systems, indicating the demand for building more
capable retrieval systems that can handle both textual and relational aspects.
Authors' comments: 25 pages, 7 figures
Guanhua Chen, Wenhan Yu, Lei Sha
While Retrieval-Augmented Generation (RAG) plays a crucial role in the application of Large Language Models (LLMs), existing retrieval methods in knowledge-dense domains like law and medicine still suffer from a lack of multi-perspective views, which are essential for improving interpretability and reliability. Previous research on multi-view retrieval often focused solely on different semantic forms of queries, neglecting the expression of specific domain knowledge perspectives. This paper introduces a novel multi-view RAG framework, MVRAG, tailored for knowledge-dense domains that utilizes intention-aware query rewriting from multiple domain viewpoints to enhance retrieval precision, thereby improving the effectiveness of the final inference. Experiments conducted on legal and medical case retrieval demonstrate significant improvements in recall and precision rates with our framework. Our multi-perspective retrieval approach unleashes the potential of multi-view information enhancing RAG tasks, accelerating the further application of LLMs in knowledge-intensive fields.
Han Fang, Xianghao Zang, Chao Ban, Zerun Feng, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun
Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, the model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representation and maintain the feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).
Md Adnan Arefeen, Biplob Debnath, Md Yusuf Sarwar Uddin, Srimat Chakradhar
Retrieval augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications like chatbots. Use of RAG for combined understanding of multimodal data such as text, images and videos is appealing, but two critical limitations exist: one-time, upfront capture of all content in large multimodal data as text descriptions entails high processing times, and not all information in the rich multimodal data is typically in the text descriptions. Since the user queries are not known a priori, developing a system for multimodal-to-text conversion and interactive querying of multimodal data is challenging. To address these limitations, we propose iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of a large corpus of multimodal data. Unlike traditional RAG, iRAG quickly indexes large repositories of multimodal data, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the multimodal data to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long multimodal-to-text conversion times, overcomes information loss issues by doing on-demand query-specific extraction of details in multimodal data, and ensures high quality of responses to interactive user queries that are often not known a priori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of large, real-world multimodal data. Experimental results on real-world long videos demonstrate 23x to 25x faster video-to-text ingestion, while ensuring that quality of responses to interactive user queries is comparable to responses from a traditional RAG where all video data is converted to text upfront before any querying.