Ashutosh Joshi, Sheikh Muhammad Sarwar, Samarth Varshney, Sreyashi Nag, Shrivats Agrawal, Juhi Naik
Complex dialog systems often use retrieved evidence to facilitate factual responses. Such RAG (Retrieval Augmented Generation) systems retrieve from massive heterogeneous data stores that are usually architected as multiple indexes or APIs instead of a single monolithic source. For a given query, relevant evidence needs to be retrieved from one or a small subset of possible retrieval sources. Complex queries can even require multi-step retrieval. For example, a conversational agent on a retail site answering customer questions about past orders will need to retrieve the appropriate customer order first and then the evidence relevant to the customer's question in the context of the ordered product. Most RAG Agents handle such Chain-of-Thought (CoT) tasks by interleaving reasoning and retrieval steps. However, each reasoning step directly adds to the latency of the system. For large models (>100B parameters) this latency cost is significant -- on the order of multiple seconds. Multi-agent systems may classify the query to a single Agent associated with a retrieval source, though this means that a (small) classification model dictates the performance of a large language model. In this work we present REAPER (REAsoning-based PlannER), an LLM-based planner to generate retrieval plans in conversational systems. We show significant gains in latency over Agent-based systems and are able to scale easily to new and unseen use cases as compared to classification-based planning. Though our method can be applied to any RAG system, we show our results in the context of Rufus -- Amazon's conversational shopping assistant.
Thomas Hummel, Shyamgopal Karthik, Mariana-Iuliana Georgescu, Zeynep Akata
In Composed Video Retrieval, a video and a textual description which modifies
the video content are provided as inputs to the model. The aim is to retrieve
the relevant video with the modified content from a database of videos. In this
challenging task, the first step is to acquire large-scale training datasets
and collect high-quality benchmarks for evaluation. In this work, we introduce
EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval
using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries
that specifically focus on high-quality temporal video understanding. We find
that existing Composed Video Retrieval frameworks do not achieve the necessary
high-quality temporal video understanding for this task. To address this
shortcoming, we adapt a simple training-free method, propose a generic
re-ranking framework for Composed Video Retrieval, and demonstrate that this
achieves strong results on EgoCVR. Our code and benchmark are freely available
at https://github.com/ExplainableML/EgoCVR.
Authors' comments: ECCV 2024
Longtao Jiang, Min Wang, Zecheng Li, Yao Fang, Wengang Zhou, Houqiang Li
Different from traditional video retrieval, sign language retrieval is more
biased towards understanding the semantic information of human actions
contained in video clips. Previous works typically only encode RGB videos to
obtain high-level semantic features, resulting in local action details drowned
in a large amount of visual information redundancy. Furthermore, existing
RGB-based sign retrieval works suffer from the huge memory cost of dense visual
data embedding in end-to-end training, and adopt an offline RGB encoder instead,
leading to suboptimal feature representation. To address these issues, we
propose a novel sign language representation framework called Semantically
Enhanced Dual-Stream Encoder (SEDS), which integrates Pose and RGB modalities
to represent the local and global information of sign language videos.
Specifically, the Pose encoder embeds the coordinates of keypoints
corresponding to human joints, effectively capturing detailed action features.
For better context-aware fusion of two video modalities, we propose a Cross
Gloss Attention Fusion (CGAF) module to aggregate the adjacent clip features
with similar semantic information from intra-modality and inter-modality.
Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance
the aggregated fusion feature by contextual matching of fine-grained
dual-stream features. Besides the offline RGB encoder, the whole framework only
contains learnable lightweight networks, which can be trained end-to-end.
Extensive experiments demonstrate that our framework significantly outperforms
state-of-the-art methods on various datasets.
Authors' comments: Accepted to ACM International Conference on Multimedia (MM) 2024
Xiaowan Hu, Yiyi Chen, Yan Li, Minquan Wang, Haoqian Wang, Quan Chen, Han Li, Peng Jiang
With the rapid expansion of e-commerce, more consumers have become accustomed
to making purchases via livestreaming. Accurately identifying the products
being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a
fundamental and daunting challenge. The LPR task encompasses three primary
dilemmas in real-world scenarios: 1) the recognition of intended products from
distractor products present in the background; 2) the video-image heterogeneity,
in that the appearance of products showcased in live streams often deviates
substantially from standardized product images in stores; 3) the numerous
confusing products with subtle visual nuances in the shop. To tackle these
challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN).
First, we employ a text-guided attention mechanism that leverages the spoken
content of salespeople to guide the model to focus toward intended products,
emphasizing their salience over cluttered background products. Second, a
long-range spatiotemporal graph network is further designed to achieve both
instance-level interaction and frame-level matching, solving the misalignment
caused by video-image heterogeneity. Third, we propose a multi-modal hard
example mining strategy that helps the model distinguish highly similar products
with fine-grained features across the video-image-text domain. Through
extensive quantitative and qualitative experiments, we demonstrate the superior
performance of our proposed SGMN model, surpassing the state-of-the-art methods
by a substantial margin. The code is available at
https://github.com/Huxiaowan/SGMN.
Authors' comments: 16 pages, 12 figures
Kent Fujiwara, Mikihiro Tanaka, Qing Yu
With the release of large-scale motion datasets with textual annotations, the
task of establishing a robust latent space for language and 3D human motion has
recently witnessed a surge of interest. Methods have been proposed to convert
human motion and texts into features to achieve accurate correspondence between
them. Despite these efforts to align language and motion representations, we
claim that the temporal element is often overlooked, especially for compound
actions, resulting in chronological inaccuracies. To shed light on the temporal
alignment in motion-language latent spaces, we propose Chronologically Accurate
Retrieval (CAR) to evaluate the chronological understanding of the models. We
decompose textual descriptions into events, and prepare negative text samples
by shuffling the order of events in compound action descriptions. We then
design a simple task for motion-language models to retrieve the more likely
text from the ground truth and its chronologically shuffled version. CAR
reveals many cases where current motion-language models fail to distinguish the
event chronology of human motion, despite their impressive performance in terms
of conventional evaluation metrics. To achieve better temporal alignment
between text and motion, we further propose to use these texts with shuffled
sequence of events as negative samples during training to reinforce the
motion-language models. We conduct experiments on text-motion retrieval and
text-to-motion generation using the reinforced motion-language models, which
demonstrate improved performance over conventional approaches, indicating the
necessity to consider temporal elements in motion-language alignment.
Authors' comments: To appear at ECCV 2024. Project page: https://kfworks.com/CAR-WP/
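The shuffled-negative construction described in the abstract above can be illustrated with a minimal sketch. The event splitter (a plain split on ", then "), the function names, and the `score` callable are all hypothetical stand-ins; the abstract does not say how descriptions are decomposed into events or how models score text against motion.

```python
import random

def split_events(description, separator=", then "):
    """Split a compound description into events (hypothetical splitting rule)."""
    return [e.strip() for e in description.split(separator) if e.strip()]

def shuffled_negative(description, rng, separator=", then "):
    """Return the same events in a different order, or None if no reordering exists."""
    events = split_events(description, separator)
    if len(set(events)) < 2:
        return None  # a single (or repeated) event cannot be reordered
    shuffled = events[:]
    while shuffled == events:
        rng.shuffle(shuffled)
    return separator.join(shuffled)

def car_accuracy(score, triples):
    """Fraction of (motion, text, negative) triples where the true text scores higher.

    `score(motion, text)` is an assumed similarity function supplied by the caller.
    """
    hits = sum(score(m, t) > score(m, n) for m, t, n in triples)
    return hits / len(triples)
```

A model that ignores event order would assign the original and the shuffled text similar scores, so its accuracy on such pairs would sit near chance.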
Zhao-Heng Yin, Pieter Abbeel
Imitation learning is a powerful machine learning algorithm for a robot to
acquire manipulation skills. Nevertheless, many real-world manipulation tasks
involve precise and dexterous robot-object interactions, which make it
difficult for humans to collect high-quality expert demonstrations. As a
result, a robot has to learn skills from suboptimal demonstrations and
unstructured interactions, which remains a key challenge. Existing works
typically use offline deep reinforcement learning (RL) to solve this challenge,
but in practice these algorithms are unstable and fragile due to the deadly
triad issue. To overcome this problem, we propose GSR, a simple yet effective
algorithm that learns from suboptimal demonstrations through Graph Search and
Retrieval. We first use a pretrained representation to organize the interaction
experience into a graph and perform a graph search to calculate the values of
different behaviors. Then, we apply a retrieval-based procedure to identify the
best behavior (actions) on each state and use behavior cloning to learn that
behavior. We evaluate our method in both simulation and real-world robotic
manipulation tasks with complex visual inputs, covering various precise and
dexterous manipulation skills with objects of different physical properties.
GSR can achieve a 10% to 30% higher success rate and over 30% higher
proficiency compared to baselines. Our project page is at
https://zhaohengyin.github.io/gsr.
Authors' comments: Robotics: Science and Systems (RSS) 2024
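As a rough illustration of the two stages named in the abstract above (values computed over an experience graph, then selection of the best behavior per state), here is a toy sketch. It is not the authors' procedure: the discounted value propagation, the edge format `(state, action, next_state, reward)`, and all names are assumptions, and each (state, action) pair is assumed to have a single outcome.

```python
def edge_values(edges, gamma=0.9, sweeps=100):
    """Estimate a value for every (state, action) edge in a transition graph.

    edges: list of (state, action, next_state, reward) tuples (assumed format).
    States without outgoing edges are treated as terminal with value 0.
    """
    q = {(s, a): 0.0 for s, a, _, _ in edges}
    for _ in range(sweeps):
        # Best value currently reachable from each state.
        best = {}
        for (s, _a), v in q.items():
            best[s] = max(best.get(s, float("-inf")), v)
        q = {(s, a): r + gamma * best.get(s2, 0.0) for s, a, s2, r in edges}
    return q

def best_actions(q):
    """Pick, for each state, the action with the highest estimated value."""
    chosen = {}
    for (s, a), v in q.items():
        if s not in chosen or v > q[(s, chosen[s])]:
            chosen[s] = a
    return chosen
```

The selected (state, action) pairs would then serve as targets for behavior cloning; that training step is outside this sketch.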
Gonçalo Vinagre Martins, João Magalhães, Afonso Quinaz, Carla Viegas, Sofia Cavaco
SLVideo is a video moment retrieval system for Sign Language videos that
incorporates facial expressions, addressing this gap in existing technology.
The system extracts embedding representations for the hand and face signs from
video frames to capture the signs in their entirety, enabling users to search
for a specific sign language video segment with text queries. A collection of
eight hours of annotated Portuguese Sign Language videos is used as the
dataset, and a CLIP model is used to generate the embeddings. The initial
results are promising in a zero-shot setting. In addition, SLVideo incorporates
a thesaurus that enables users to search for similar signs to those retrieved,
using the video segment embeddings, and also supports the editing and creation
of video sign language annotations. Project web page:
https://novasearch.github.io/SLVideo/
Authors' comments: 4 pages, 1 figure, 1 table
Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang
The memory and computational demands of Key-Value (KV) cache present significant challenges for deploying long-context language models. Previous approaches attempt to mitigate this issue by selectively dropping tokens, which irreversibly erases critical information that might be needed for future queries. In this paper, we propose a novel compression technique for KV cache that preserves all token information. Our investigation reveals that: i) Most attention heads primarily focus on the local context; ii) Only a few heads, denoted as retrieval heads, can essentially pay attention to all input tokens. These key observations motivate us to use separate caching strategies for different attention heads. Therefore, we propose RazorAttention, a training-free KV cache compression algorithm, which maintains a full cache for these crucial retrieval heads and discards the remote tokens in non-retrieval heads. Furthermore, we introduce a novel mechanism involving a "compensation token" to further recover the information in the dropped tokens. Extensive evaluations across a diverse set of large language models (LLMs) demonstrate that RazorAttention achieves a reduction in KV cache size by over 70% without noticeable impacts on performance. Additionally, RazorAttention is compatible with FlashAttention, rendering it an efficient and plug-and-play solution that enhances LLM inference efficiency without overhead or retraining of the original model.
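The head-dependent caching idea in the abstract above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the cache layout (a dict from head index to a list of per-token entries), the fixed `local_window`, and the function names are hypothetical, and the "compensation token" is deliberately left out because the abstract does not describe how it is built.

```python
def compress_kv_cache(cache, retrieval_heads, local_window):
    """Keep the full cache for retrieval heads and only recent tokens elsewhere.

    cache: dict mapping head index -> list of (key, value) entries, oldest first
           (an assumed layout, chosen for readability).
    retrieval_heads: set of head indices that keep every token.
    local_window: number of most recent tokens kept by all other heads.
    """
    if local_window < 1:
        raise ValueError("local_window must be at least 1")
    compressed = {}
    for head, entries in cache.items():
        if head in retrieval_heads:
            compressed[head] = list(entries)
        else:
            compressed[head] = list(entries[-local_window:])
    return compressed

def cache_size(cache):
    """Total number of cached (key, value) entries across heads."""
    return sum(len(entries) for entries in cache.values())
```

With 4 heads, 10 cached tokens each, one retrieval head, and a window of 2, the cache shrinks from 40 entries to 16; real savings depend on how many heads are retrieval heads and on the context length.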
Mariya Hendriksen, Shuo Zhang, Ridho Reinanda, Mohamed Yahya, Edgar Meij, Maarten de Rijke
Image-Text Retrieval (ITR) systems are central to multimodal information
access, with Vision-Language Models (VLMs) showing strong performance on
standard benchmarks. However, these benchmarks predominantly rely on
coarse-grained annotations, limiting their ability to reveal how models perform
under real-world conditions, where query granularity varies. Motivated by this
gap, we examine how dataset granularity and query perturbations affect
retrieval performance and robustness across four architecturally diverse VLMs
(ALIGN, AltCLIP, CLIP, and GroupViT). Using both standard benchmarks (MS-COCO,
Flickr30k) and their fine-grained variants, we show that richer captions
consistently enhance retrieval, especially in text-to-image tasks, where we
observe an average improvement of 16.23%, compared to 6.44% in image-to-text.
To assess robustness, we introduce a taxonomy of perturbations and conduct
extensive experiments, revealing that while perturbations typically degrade
performance, they can also unexpectedly improve retrieval, exposing nuanced
model behaviors. Notably, word order emerges as a critical factor --
contradicting prior assumptions of model insensitivity to it. Our results
highlight variation in model robustness and a dataset-dependent relationship
between caption granularity and perturbation sensitivity and emphasize the
necessity of evaluating models on datasets of varying granularity.
Authors' comments: accepted at SIGIR 2025
Sean Wu, Michael Koo, Li Yo Kao, Andy Black, Lesley Blum, Fabien Scalzo, Ira Kurtz
Open-source LLMs have shown great potential as fine-tuned chatbots, and
demonstrate robust abilities in reasoning and surpass many existing benchmarks.
Retrieval-Augmented Generation (RAG) is a technique for improving the
performance of LLMs on tasks that the models weren't explicitly trained on, by
leveraging external knowledge databases. Numerous studies have demonstrated the
effectiveness of RAG to more successfully accomplish downstream tasks when
using vector datasets that consist of relevant background information. It has
been implicitly assumed in the field that if adversarial background
information is used in this context, a RAG-based approach would yield
no benefit or would even negatively impact the results. To address
this assumption, we tested several open-source LLMs on the ability of RAG to
improve their success in answering multiple-choice questions (MCQ) in the
medical subspecialty field of Nephrology. Unlike previous studies, we examined
the effect of RAG in utilizing both relevant and adversarial background
databases. We set up several open-source LLMs, including Llama 3, Phi-3,
Mixtral 8x7b, Zephyr β, and Gemma 7B Instruct, in a zero-shot RAG
pipeline. As adversarial sources of information, text from the Bible and a
Random Words generated database were used for comparison. Our data show that
most of the open-source LLMs improve their multiple-choice test-taking success
as expected when incorporating relevant information vector databases.
Surprisingly, however, adversarial Bible text significantly improved the success
of many LLMs, and even random word text improved the test-taking ability of some of
the models. In summary, our results demonstrate for the first time the
counterintuitive ability of adversarial information datasets to improve
RAG-based LLM success.
Authors' comments: 24 pages, 3 figures, 11 tables
Akash Kumar Mohankumar, Gururaj K, Gagan Madan, Amit Singh
Accurately retrieving relevant bid keywords for user queries is critical in
Sponsored Search but remains challenging, particularly for short, ambiguous
queries. Existing dense and generative retrieval models often fail to capture
nuanced user intent in these cases. To address this, we propose an approach to
enhance query understanding by augmenting queries with rich contextual signals
derived from web search results and large language models, stored in an online
cache. Specifically, we use web search titles and snippets to ground queries in
real-world information and utilize GPT-4 to generate query rewrites and
explanations that clarify user intent. These signals are efficiently integrated
through a Fusion-in-Decoder based Unity architecture, enabling both dense and
generative retrieval with serving costs on par with traditional context-free
models. To address scenarios where context is unavailable in the cache, we
introduce context glancing, a curriculum learning strategy that improves model
robustness and performance even without contextual signals during inference.
Extensive offline experiments demonstrate that our context-aware approach
substantially outperforms context-free models. Furthermore, online A/B testing
on a prominent search engine across 160+ countries shows significant
improvements in user engagement and revenue.
Authors' comments: 8 pages, 8 tables, 1 figure
Gilles Orban de Xivry, Olivier Absil
High contrast imaging (HCI) is fundamentally limited by wavefront
aberrations, and the ability to perform wavefront sensing from focal plane
images is key to reach the full potential of ground and space-based
instruments. A vortex focal plane mask coupled with a downstream pupil (Lyot) stop
stands as one of the best small-angle coronagraphs, but is also sensitive to
low-order aberrations. Here, we revisit the behavior of the vortex phase mask,
from entrance pupil down to the final detector plane, with Zernike polynomials
as input phase aberrations. In particular we develop a second-order expansion
that allows us to analyze the phase retrieval properties in a more intuitive
and accurate way than previously proposed. With this formalism, we show how the
azimuthal vortex modulation modifies the phase retrieval properties compared to
normal imaging. In particular, our results suggest that images obtained with a
scalar vortex coronagraph can be used for unambiguous focal-plane wavefront
sensing in any practical situation. We compare our results with numerical
simulations and discuss practical implementation in coronagraphic instruments.
Authors' comments: 7 pages, 7 figures, paper presented at SPIE Astronomical Telescopes +
Instrumentation 2024
Wufei Ma, Kai Li, Zhongshi Jiang, Moustafa Meshry, Qihao Liu, Huiyu Wang, Christian Häne, Alan Yuille
Recent video-text foundation models have demonstrated strong performance on a
wide variety of downstream video understanding tasks. Can these video-text
models genuinely understand the contents of natural videos? Standard video-text
evaluations could be misleading as many questions can be inferred merely from
the objects and contexts in a single frame or biases inherent in the datasets.
In this paper, we aim to better assess the capabilities of current video-text
models and understand their limitations. We propose a novel evaluation task for
video-text understanding, namely retrieval from counterfactually augmented data
(RCAD), and a new Feint6K dataset. To succeed on our new evaluation task,
models must derive a comprehensive understanding of the video from cross-frame
reasoning. Analyses show that previous video-text foundation models can be
easily fooled by counterfactually augmented data and are far behind human-level
performance. In order to narrow the gap between video-text models and human
performance on RCAD, we identify a key limitation of current contrastive
approaches on video-text data and introduce LLM-teacher, a more effective
approach to learn action semantics by leveraging knowledge obtained from a
pretrained large language model. Experiments and analyses show that our
approach successfully learns more discriminative action embeddings and improves
results on Feint6K when applied to multiple video-text models. Our Feint6K
dataset and project page are available at https://feint6k.github.io.
Authors' comments: ECCV 2024. Project page: https://feint6k.github.io
Fedor Borisyuk, Qingquan Song, Mingzhou Zhou, Ganesh Parameswaran, Madhu Arun, Siva Popuri, Tugrul Bingol, Zhuotao Pei et al.
This paper introduces LiNR, LinkedIn's large-scale, GPU-based retrieval system. LiNR supports a billion-sized index on GPU models. We discuss our experiences and challenges in creating scalable, differentiable search indexes using TensorFlow and PyTorch at production scale. In LiNR, both items and model weights are integrated into the model binary. Viewing index construction as a form of model training, we describe scaling our system for large indexes, incorporating full scans and efficient filtering. A key focus is on enabling attribute-based pre-filtering for exhaustive GPU searches, addressing the common challenge of post-filtering in KNN searches that often reduces system quality. We further provide multi-embedding retrieval algorithms and strategies for tackling cold start issues in retrieval. Our advancements in supporting larger indexes through quantization are also discussed. We believe LiNR represents one of the industry's first Live-updated model-based retrieval indexes. Applied to out-of-network post recommendations on LinkedIn Feed, LiNR has contributed to a 3% relative increase in professional daily active users. We envisage LiNR as a step towards integrating retrieval and ranking into a single GPU model, simplifying complex infrastructures and enabling end-to-end optimization of the entire differentiable infrastructure through gradient descent.
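The difference between attribute-based pre-filtering and post-filtering mentioned in the abstract above can be shown with a small CPU sketch. It is not LiNR's implementation: the item layout (`vec` and `attrs` fields), the dot-product scoring, and the function names are assumptions made for illustration.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def knn_prefilter(query, items, predicate, k):
    """Exhaustive top-k search restricted to items whose attributes pass the filter."""
    candidates = (item for item in items if predicate(item["attrs"]))
    return heapq.nlargest(k, candidates, key=lambda item: dot(query, item["vec"]))

def knn_postfilter(query, items, predicate, k):
    """Top-k search over all items, with the attribute filter applied afterwards."""
    top = heapq.nlargest(k, items, key=lambda item: dot(query, item["vec"]))
    return [item for item in top if predicate(item["attrs"])]
```

When the best-scoring items all fail the filter, post-filtering can return fewer than `k` results (possibly none), while pre-filtering still returns the best `k` among the eligible items.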
Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu et al.
Large language models (LLMs) have demonstrated great success in various fields, benefiting from their huge number of parameters that store knowledge. However, LLMs still suffer from several key issues, such as hallucination problems, knowledge update issues, and a lack of domain-specific expertise. The appearance of retrieval-augmented generation (RAG), which leverages an external knowledge database to augment LLMs, makes up for those drawbacks of LLMs. This paper reviews all significant techniques of RAG, especially in the retriever and the retrieval fusions. Besides, tutorial codes are provided for implementing the representative techniques in RAG. This paper further discusses the RAG update, including RAG with/without knowledge update. Then, we introduce RAG evaluation and benchmarking, as well as the application of RAG in representative NLP tasks and industrial scenarios. Finally, this paper discusses RAG's future directions and challenges for promoting this field's development.
Hamin Koo, Minseon Kim, Sung Ju Hwang
Large Language Models (LLMs) excel in various language tasks but they often generate incorrect information, a phenomenon known as "hallucinations". Retrieval-Augmented Generation (RAG) aims to mitigate this by using document retrieval for accurate responses. However, RAG still faces hallucinations due to vague queries. This study aims to improve RAG by optimizing query generation with a query-document alignment score, refining queries using LLMs for better precision and efficiency of document retrieval. Experiments have shown that our approach improves document retrieval, resulting in an average accuracy gain of 1.6%.
Jeonghyun Park, Hwanhee Lee
Conversational search seeks to retrieve relevant passages for the given
questions in conversational question answering. Conversational Query
Reformulation (CQR) improves conversational search by refining the original
queries into de-contextualized forms to resolve the issues in the original
queries, such as omissions and coreferences. Previous CQR methods focus on
imitating human written queries which may not always yield meaningful search
results for the retriever. In this paper, we introduce GuideCQR, a framework
that refines queries for CQR by leveraging key information from the initially
retrieved documents. Specifically, GuideCQR extracts keywords and generates
expected answers from the retrieved documents, then unifies them with the
queries after filtering to add useful information that enhances the search
process. Experimental results demonstrate that our proposed method achieves
state-of-the-art performance across multiple datasets, outperforming previous
CQR methods. Additionally, we show that GuideCQR can get additional performance
gains in conversational search using various types of queries, even for queries
written by humans.
Authors' comments: 18 pages, 3 figures, 16 tables
Ingeol Baek, Jimin Lee, Joonho Yang, Hwanhee Lee
Query rewriting aims to generate a new query that can complement the original
query to improve the information retrieval system. Recent studies on query
rewriting, such as query2doc, query2expand and query2cot, rely on the internal
knowledge of Large Language Models (LLMs) to generate a relevant passage to add
information to the query. Nevertheless, the efficacy of these methodologies may
markedly decline in instances where the requisite knowledge is not encapsulated
within the model's intrinsic parameters. In this paper, we propose a novel
structured query rewriting method called Crafting the Path tailored for
retrieval systems. Crafting the Path involves a three-step process that crafts
query-related information necessary for finding the passages to be searched in
each step. Specifically, Crafting the Path begins with Query Concept
Comprehension, proceeds to Query Type Identification, and finally conducts
Expected Answer Extraction. Experimental results show that our method
outperforms previous rewriting methods, especially in less familiar domains for
LLMs. We demonstrate that our method is less dependent on the internal
parameter knowledge of the model and generates queries with fewer factual
inaccuracies. Furthermore, we observe that Crafting the Path demonstrates superior
performance in the retrieval-augmented generation scenarios.
Authors' comments: 3 figures, 13 tables
Naoya Sogi, Takashi Shibata, Makoto Terao
The pre-trained vision and language (V&L) models have substantially improved
the performance of cross-modal image-text retrieval. In general, however, V&L
models have limited retrieval performance for small objects because of the
rough alignment between words and the small objects in the image. In contrast,
it is known that human cognition is object-centric, and we pay more attention
to important objects, even if they are small. To bridge this gap between the
human cognition and the V&L model's capability, we propose a cross-modal
image-text retrieval framework based on "object-aware query perturbation."
The proposed method generates a key feature subspace of the detected objects
and perturbs the corresponding queries using this subspace to improve the
object awareness in the image. In our proposed method, object-aware cross-modal
image-text retrieval is possible while keeping the rich expressive power and
retrieval performance of existing V&L models without additional fine-tuning.
Comprehensive experiments on four public datasets show that our method
outperforms conventional algorithms. Our code is publicly available at
https://github.com/NEC-N-SOGI/query-perturbation.
Authors' comments: ECCV 2024. Code: https://github.com/NEC-N-SOGI/query-perturbation
Georgios Papagiannis, Norman Di Palo, Pietro Vitiello, Edward Johns
We present R+X, a framework which enables robots to learn skills from long,
unlabelled, first-person videos of humans performing everyday tasks. Given a
language command from a human, R+X first retrieves short video clips containing
relevant behaviour, and then executes the skill by conditioning an in-context
imitation learning method (KAT) on this behaviour. By leveraging a Vision
Language Model (VLM) for retrieval, R+X does not require any manual annotation
of the videos, and by leveraging in-context learning for execution, robots can
perform commanded skills immediately, without requiring a period of training on
the retrieved videos. Experiments studying a range of everyday household tasks
show that R+X succeeds at translating unlabelled human videos into robust robot
skills, and that R+X outperforms several recent alternative methods. Videos and
code are available at https://www.robot-learning.uk/r-plus-x.
Authors' comments: Published at the IEEE International Conference on Robotics and
Automation (ICRA) 2025