Weitong Cai, Jiabo Huang, Shaogang Gong
Video moment retrieval (VMR) is to search for a visual temporal moment in an
untrimmed raw video given a text query description (sentence). Existing
studies either start from collecting exhaustive frame-wise annotations on the
temporal boundary of target moments (fully-supervised), or learn with only the
video-level video-text pairing labels (weakly-supervised). The former is poor
in generalisation to unknown concepts and/or novel scenes due to restricted
dataset scale and diversity under expensive annotation costs; the latter is
subject to visual-textual mis-correlations from incomplete labels. In this
work, we introduce a new approach called hybrid-learning video moment retrieval
to solve the problem by knowledge transfer through adapting the video-text
matching relationships learned from a fully-supervised source domain to a
weakly-labelled target domain when they do not share a common label space. Our
aim is to explore shared universal knowledge between the two domains in order
to improve model learning in the weakly-labelled target domain. Specifically,
we introduce a multiplE branch Video-text Alignment model (EVA) that performs
cross-modal (visual-textual) matching information sharing and multi-modal
feature alignment to optimise domain-invariant visual and textual features as
well as per-task discriminative joint video-text representations. Experiments
show EVA's effectiveness in exploring temporal segment annotations in a source
domain to help learn video moment retrieval without temporal labels in a target
domain.
Authors' comments: Accepted by BMVC2022
Zijian Li, Qingyan Guo, Jiawei Shao, Lei Song, Jiang Bian, Jun Zhang, Rui Wang
Retrieval augmented generation has revolutionized large language model (LLM)
outputs by providing factual support. Nevertheless, it struggles to capture
all the necessary knowledge for complex reasoning questions. Existing retrieval
methods typically divide reference documents into passages, treating them in
isolation. These passages, however, are often interrelated, such as passages
that are contiguous or share the same keywords. Therefore, recognizing the
relatedness is crucial for enhancing the retrieval process. In this paper, we
propose a novel retrieval method, called GNN-Ret, which leverages graph neural
networks (GNNs) to enhance retrieval by considering the relatedness between
passages. Specifically, we first construct a graph of passages by connecting
passages that are structure-related and keyword-related. A graph neural network
(GNN) is then leveraged to exploit the relationships between passages and
improve the retrieval of supporting passages. Furthermore, we extend our method
to handle multi-hop reasoning questions using a recurrent graph neural network
(RGNN), named RGNN-Ret. At each step, RGNN-Ret integrates the graphs of
passages from previous steps, thereby enhancing the retrieval of supporting
passages. Extensive experiments on benchmark datasets demonstrate that GNN-Ret
achieves higher accuracy for question answering with a single LLM query
than strong baselines that require multiple queries, and RGNN-Ret further
improves accuracy and achieves state-of-the-art performance, with up to 10.4%
accuracy improvement on the 2WikiMQA dataset.
Authors' comments: Under review
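The graph construction described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the keyword sets, and the `alpha` weight are assumptions, and the mean-neighbour smoothing step is a simple stand-in for a trained GNN layer.

```python
from collections import defaultdict

def build_passage_graph(passages, keywords):
    """Connect contiguous passages and passages that share a keyword.

    passages: list of (doc_id, text); keywords: one keyword set per passage.
    """
    edges = defaultdict(set)
    # Structure-related edges: neighbouring passages of the same document.
    for i in range(len(passages) - 1):
        if passages[i][0] == passages[i + 1][0]:
            edges[i].add(i + 1)
            edges[i + 1].add(i)
    # Keyword-related edges: passages sharing at least one keyword.
    by_keyword = defaultdict(list)
    for i, kws in enumerate(keywords):
        for kw in kws:
            by_keyword[kw].append(i)
    for members in by_keyword.values():
        for i in members:
            for j in members:
                if i != j:
                    edges[i].add(j)
    return edges

def propagate(scores, edges, alpha=0.5):
    """One round of mean-neighbour smoothing (stand-in for a GNN layer)."""
    out = []
    for i, s in enumerate(scores):
        nbrs = edges.get(i, ())
        nbr_mean = sum(scores[j] for j in nbrs) / len(nbrs) if nbrs else s
        out.append((1 - alpha) * s + alpha * nbr_mean)
    return out

# Toy data (hypothetical): two passages from d1, one from d2.
passages = [("d1", "p0"), ("d1", "p1"), ("d2", "p2")]
keywords = [{"paris"}, {"france"}, {"paris", "louvre"}]
graph = build_passage_graph(passages, keywords)
smoothed = propagate([0.9, 0.1, 0.2], graph)
```

In this toy case, passage 1 has a low initial score but is contiguous with the highly scored passage 0, so its smoothed score rises; that is the intuition behind using relatedness between passages during retrieval.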
Viktor Shcherbakov, Fedor Krasnov
Product search is uniquely different from search for documents, Internet
resources or vacancies, therefore it requires the development of specialized
search systems. The present work describes the H1 embedding model, designed
for offline term indexing of product descriptions on e-commerce platforms.
The model is compared to other state-of-the-art (SoTA) embedding models within
the framework of a hybrid product search system that combines the advantages of
lexical methods for product retrieval and semantic embedding-based methods. We
propose an approach to building semantically rich term vocabularies for search
indexes. Compared to other production semantic models, H1 paired with the
proposed approach stands out due to its ability to process multi-word product
terms as one token. For example, in the search queries "new balance shoes" and
"gloria jeans kids wear", each brand entity is represented as a single token:
"new balance" and "gloria jeans". This increases the precision of the system
without affecting the recall. The hybrid search system with the proposed model
scores mAP@12 = 56.1% and R@1k = 86.6% on the WANDS public dataset, beating
other SoTA analogues.
Authors' comments: 10 pages, 4 figures
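To illustrate what treating a multi-word product term as one token means, here is a minimal sketch using greedy longest-match against a term vocabulary. The matching strategy, the function name, and the vocabulary contents are assumptions for illustration only; the abstract does not state how H1 builds or applies its vocabulary.

```python
def tokenize_with_terms(query, multiword_terms, max_len=3):
    """Greedy longest-match tokenizer that keeps known multi-word terms whole."""
    words = query.lower().split()
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate first; a single word always matches.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in multiword_terms:
                tokens.append(candidate)
                i += n
                break
    return tokens

# Hypothetical vocabulary entries taken from the examples in the abstract.
terms = {"new balance", "gloria jeans"}
tokens_a = tokenize_with_terms("new balance shoes", terms)
tokens_b = tokenize_with_terms("gloria jeans kids wear", terms)
```

With this vocabulary, the brand names survive as single tokens while the remaining words are split as usual.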
Aleksander Theo Strand, Sushant Gautam, Cise Midoglu, Pål Halvorsen
The rapid evolution of digital sports media necessitates sophisticated
information retrieval systems that can efficiently parse extensive multimodal
datasets. This paper introduces SoccerRAG, an innovative framework designed to
harness the power of Retrieval Augmented Generation (RAG) and Large Language
Models (LLMs) to extract soccer-related information through natural language
queries. By leveraging a multimodal dataset, SoccerRAG supports dynamic
querying and automatic data validation, enhancing user interaction and
accessibility to sports archives. Our evaluations indicate that SoccerRAG
effectively handles complex queries, offering significant improvements over
traditional retrieval systems in terms of accuracy and user engagement. The
results underscore the potential of using RAG and LLMs in sports analytics,
paving the way for future advancements in the accessibility and real-time
processing of sports data.
Authors' comments: accepted to CBMI 2024 as a regular paper;
https://github.com/simula/soccer-rag
Aleksander Theo Strand, Sushant Gautam, Cise Midoglu, Pål Halvorsen
The rapid evolution of digital sports media necessitates sophisticated
information retrieval systems that can efficiently parse extensive multimodal
datasets. This paper demonstrates SoccerRAG, an innovative framework designed
to harness the power of Retrieval Augmented Generation (RAG) and Large Language
Models (LLMs) to extract soccer-related information through natural language
queries. By leveraging a multimodal dataset, SoccerRAG supports dynamic
querying and automatic data validation, enhancing user interaction and
accessibility to sports archives. We present a novel interactive user interface
(UI) based on the Chainlit framework, which wraps around the core functionality
and enables users to interact with the SoccerRAG framework in a chatbot-like
visual manner.
Authors' comments: accepted to CBMI 2024 as a demonstration;
https://github.com/simula/soccer-rag. arXiv admin note: text overlap with
arXiv:2406.01273
Shicheng Xu, Liang Pang, Huawei Shen, Xueqi Cheng
Retrieval-augmented generation (RAG) utilizes retrieved texts to enhance
large language models (LLMs). Studies show that while RAG provides valuable
external information (benefit), it may also mislead LLMs (detriment) with noisy
or incorrect retrieved texts. Although many existing methods attempt to
preserve benefit and avoid detriment, they lack a theoretical explanation for
RAG. The benefit and detriment in the next token prediction of RAG remain a
black box that cannot be quantified or compared in an explainable manner, so
existing methods are data-driven, need additional utility evaluators, or are
post-hoc. This paper takes the first step towards providing a theory to explain
and trade off the benefit and detriment in RAG. First, we model RAG as the
fusion between the distribution of LLM knowledge and the distribution of retrieved
texts. Then, we formalize the trade-off between the value of external knowledge
(benefit) and its potential risk of misleading LLMs (detriment) in next token
prediction of RAG by distribution difference in this fusion. Finally, we prove
that the actual effect of RAG on the token, which is the comparison between
benefit and detriment, can be predicted without any training or accessing the
utility of retrieval. Based on our theory, we propose a practical novel method,
Tok-RAG, which achieves collaborative generation between the pure LLM and RAG
at token level to preserve benefit and avoid detriment. Experiments in
real-world tasks using LLMs such as OPT, LLaMA-2, and Mistral show the
effectiveness of our method and support our theoretical findings.
Authors' comments: ICLR 2025
Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A. Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, Alina Oprea
Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs) in chatbot applications, enabling developers to adapt and personalize the LLM output without expensive training or fine-tuning. RAG systems use an external knowledge database to retrieve the most relevant documents for a given query, providing this context to the LLM generator. While RAG achieves impressive utility in many applications, its adoption to enable personalized generative models introduces new security risks. In this work, we propose new attack surfaces for an adversary to compromise a victim's RAG system, by injecting a single malicious document into its knowledge database. We design Phantom, a general two-step attack framework against RAG-augmented LLMs. The first step involves crafting a poisoned document designed to be retrieved by the RAG system within the top-k results only when an adversarial trigger, a specific sequence of words acting as a backdoor, is present in the victim's queries. In the second step, a specially crafted adversarial string within the poisoned document triggers various adversarial attacks in the LLM generator, including denial of service, reputation damage, privacy violations, and harmful behaviors. We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama.
Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram et al.
Contrastive Language-Image Pretraining (CLIP) is widely used to train models
to align images and texts in a common embedding space by mapping them to
fixed-sized vectors. These models are key to multimodal information retrieval
and related tasks. However, CLIP models generally underperform in text-only
tasks compared to specialized text models. This creates inefficiencies for
information retrieval systems that keep separate embeddings and models for
text-only and multimodal tasks. We propose a novel, multi-task contrastive
training method to address this issue, which we use to train the jina-clip-v1
model to achieve state-of-the-art performance on both text-image and
text-text retrieval tasks.
Authors' comments: 4 pages, ICML2024 workshop submission
Costas Mavromatis, George Karypis
Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural questions by grounding the reasoning in the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for KGQA as they can handle the complex graph information stored in the KG. In this work, we introduce GNN-RAG, a novel method for combining language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and answer candidates are extracted to represent KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for ultimate KGQA. Furthermore, we develop a retrieval augmentation (RA) technique to further boost KGQA performance with GNN-RAG. Experimental results show that GNN-RAG achieves state-of-the-art performance in two widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching GPT-4 performance with a 7B tuned LLM. In addition, GNN-RAG excels on multi-hop and multi-entity questions, outperforming competing approaches by 8.9--15.5% points at answer F1.
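The path-extraction and verbalization steps mentioned in this abstract can be sketched as follows. This is an illustrative sketch, not the GNN-RAG code: it assumes the answer candidate is already given (the GNN step is omitted), treats the KG as undirected for path finding, and uses a made-up verbalization template and toy triples.

```python
from collections import deque

def shortest_path(triples, source, target):
    """BFS over an undirected view of (head, relation, tail) triples.

    Returns the triples along one shortest path, or None if unreachable.
    """
    adj = {}
    for h, r, t in triples:
        adj.setdefault(h, []).append((t, (h, r, t)))
        adj.setdefault(t, []).append((h, (h, r, t)))
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt, triple in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, triple)
                queue.append(nxt)
    if target not in parent:
        return None
    # Walk back from the target to the source, collecting triples.
    path, node = [], target
    while parent[node] is not None:
        node, triple = parent[node]
        path.append(triple)
    return path[::-1]

def verbalize(path):
    """Turn a path of triples into text for an LLM prompt (assumed format)."""
    return " ; ".join(f"{h} -> {r} -> {t}" for h, r, t in path)

# Toy KG with hypothetical entities and relations.
kg = [
    ("A", "born_in", "B"),
    ("B", "located_in", "C"),
    ("A", "works_for", "D"),
]
path = shortest_path(kg, "A", "C")
prompt_context = verbalize(path)
```

The resulting string can be placed in the LLM prompt as retrieved context alongside the question.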
Wei Cheng, Yuhan Wu, Wei Hu
Recent years have witnessed the deployment of code language models (LMs) in
various code intelligence tasks such as code completion. Yet, it is challenging
for pre-trained LMs to generate correct completions in private repositories.
Previous studies retrieve cross-file context based on import relations or text
similarity, which is insufficiently relevant to completion targets. In this
paper, we propose a dataflow-guided retrieval augmentation approach, called
DraCo, for repository-level code completion. DraCo parses a private repository
into code entities and establishes their relations through an extended dataflow
analysis, forming a repo-specific context graph. Whenever triggering code
completion, DraCo precisely retrieves relevant background knowledge from the
repo-specific context graph and generates well-formed prompts to query code
LMs. Furthermore, we construct a large Python dataset, ReccEval, with more
diverse completion targets. Our experiments demonstrate the superior accuracy
and applicable efficiency of DraCo, improving code exact match by 3.43% and
identifier F1-score by 3.27% on average compared to the state-of-the-art
approach.
Authors' comments: Accepted in the 62nd Annual Meeting of the Association for
Computational Linguistics (ACL 2024)
Xuan Wu, Hongxiang Li, Yuanjiang Luo, Xuxin Cheng, Xianwei Zhuang, Meng Cao, Keren Fu
Sign language video retrieval plays a key role in facilitating information access for the deaf community. Despite significant advances in video-text retrieval, the complexity and inherent uncertainty of sign language preclude the direct application of these techniques. Previous methods achieve the mapping between sign language video and text through fine-grained modal alignment. However, due to the scarcity of fine-grained annotation, the uncertainty inherent in sign language video is underestimated, limiting the further development of sign language retrieval tasks. To address this challenge, we propose a novel Uncertainty-aware Probability Distribution Retrieval (UPRet), that conceptualizes the mapping process of sign language video and text in terms of probability distributions, explores their potential interrelationships, and enables flexible mappings. Experiments on three benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and CSL-Daily (78.4%).
Laxman Dhulipala, Majid Hadian, Rajesh Jayaram, Jason Lee, Vahab Mirrokni
Neural embedding models have become a fundamental component of modern information retrieval (IR) pipelines. These models produce a single embedding $x \in \mathbb{R}^d$ per data-point, allowing for fast retrieval via highly optimized maximum inner product search (MIPS) algorithms. Recently, beginning with the landmark ColBERT paper, multi-vector models, which produce a set of embeddings per data point, have achieved markedly superior performance for IR tasks. Unfortunately, using these models for IR is computationally expensive due to the increased complexity of multi-vector retrieval and scoring. In this paper, we introduce MUVERA (MUlti-VEctor Retrieval Algorithm), a retrieval mechanism which reduces multi-vector similarity search to single-vector similarity search. This enables the usage of off-the-shelf MIPS solvers for multi-vector retrieval. MUVERA asymmetrically generates Fixed Dimensional Encodings (FDEs) of queries and documents, which are vectors whose inner product approximates multi-vector similarity. We prove that FDEs give high-quality $\epsilon$-approximations, thus providing the first single-vector proxy for multi-vector similarity with theoretical guarantees. Empirically, we find that FDEs achieve the same recall as prior state-of-the-art heuristics while retrieving 2-5$\times$ fewer candidates. Compared to prior state-of-the-art implementations, MUVERA achieves consistently good end-to-end recall and latency across a diverse set of the BEIR retrieval datasets, achieving an average of 10$\%$ improved recall with $90\%$ lower latency.
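The idea of a fixed-dimensional encoding whose inner product approximates multi-vector similarity can be sketched in simplified form. The details below are assumptions not stated in the abstract: buckets come from SimHash-style random hyperplanes, queries sum their vectors per bucket while documents average them, and the target similarity is the sum over query vectors of the maximum inner product with any document vector. Other refinements a full system would need (repetitions, dimensionality reduction, handling of empty document buckets) are left out.

```python
import random

def make_hyperplanes(dim, k, seed=0):
    """k random Gaussian hyperplanes defining 2**k buckets."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bucket(vec, planes):
    """SimHash-style bucket index: one sign bit per hyperplane."""
    idx = 0
    for p in planes:
        idx = (idx << 1) | int(dot(vec, p) > 0)
    return idx

def fde(vectors, planes, is_query):
    """Concatenate per-bucket sums (queries) or per-bucket means (documents)."""
    dim = len(vectors[0])
    n_buckets = 1 << len(planes)
    sums = [[0.0] * dim for _ in range(n_buckets)]
    counts = [0] * n_buckets
    for v in vectors:
        b = bucket(v, planes)
        counts[b] += 1
        for i, x in enumerate(v):
            sums[b][i] += x
    out = []
    for b in range(n_buckets):
        scale = 1.0 if is_query or counts[b] == 0 else 1.0 / counts[b]
        out.extend(x * scale for x in sums[b])
    return out

def multi_vector_similarity(query_vecs, doc_vecs):
    """Sum over query vectors of the best inner product with a document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

planes = make_hyperplanes(dim=3, k=2)
x = [1.0, 2.0, -0.5]
q_fde = fde([x], planes, is_query=True)
d_fde = fde([x], planes, is_query=False)
```

Because each encoding has a fixed length of `2**k * dim`, a standard single-vector MIPS index can be built over the document encodings; the asymmetry (sum versus mean) is what lets the inner product mimic a max over document vectors within each bucket.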
Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang et al.
Text-Video Retrieval (TVR) aims to align relevant video content with natural
language queries. To date, most state-of-the-art TVR methods perform
image-to-video transfer learning based on large-scale pre-trained
vision-language models (e.g., CLIP). However, fully fine-tuning these
pre-trained models for TVR incurs prohibitively expensive computation costs. To
this end, we propose to conduct efficient text-video Retrieval with a
sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model
with a few parameterized layers. To accommodate the text-video scenario, we
equip our RAP with two indispensable characteristics: temporal sparsity and
correlation. Specifically, we propose a low-rank modulation module to refine
the per-image features from the frozen CLIP backbone, which accentuates salient
frames within the video features while alleviating temporal redundancy.
Besides, we introduce an asynchronous self-attention mechanism that first
selects the top responsive visual patches and augments the correlation modeling
between them with learnable temporal and patch offsets. Extensive experiments
on four TVR datasets demonstrate that RAP achieves superior or comparable
performance compared to the fully fine-tuned counterpart and other
parameter-efficient fine-tuning methods.
Authors' comments: Accepted by ACL 2024 Findings
Johannes Buchner, Hattie Starck, Mara Salvato, Hagai Netzer, Zsofi Igo, Brivael Laloux, Antonis Georgakakis, Isabelle Gauger et al.
The assembly and co-evolution of supermassive black holes (SMBH) and their
host galaxy stellar population is a key open question in galaxy evolution.
Stellar mass ($M_\star$) and star formation rate (SFR) are inferred by
modeling the spectral energy distribution (SED). For galaxies triggering SMBH
activity, the active galactic nucleus (AGN) contaminates the light at all
wavelengths, hampering the inference of galaxy parameters. Incomplete AGN
templates can lead to systematic overestimates of the stellar mass, biasing our
understanding of AGN-galaxy co-evolution. This challenge has gained further
impetus with the advent of sensitive wide-area surveys with millions of
luminous AGN, including by eROSITA, Euclid and LSST. We aim to estimate the
accuracy and bias of AGN host galaxy parameters and improve upon existing
techniques. This work makes two contributions: 1) a new SED fitting code,
GRAHSP, with a flexible, empirically motivated AGN model including a power law
continuum, emission lines, a FeII forest and a flexible infrared torus. We
verify that our model reproduces published X-ray to infrared SEDs of AGN to
better than 20\% accuracy. A fully Bayesian fit with nested sampling includes
uncertainties in the model and the data, making the inference highly robust. 2)
we created a benchmark photometric dataset where pure quasars are merged with
non-AGN pure galaxies into a hybrid (Chimera) object but with known galaxy and
AGN properties. Comparing the true and retrieved $M_\star$, SFR and AGN
luminosities shows that previous codes systematically over-estimate $M_\star$
and SFR by 0.5 dex with a wide scatter of 0.7 dex, at AGN luminosities above
$10^{44}$ erg/s. In contrast, GRAHSP shows no bias on $M_\star$ and SFR. GRAHSP
also estimates more realistic uncertainties. GRAHSP enables characterization of
the environmental conditions conducive to black hole growth. (abridged)
Authors' comments: version resubmitted to A&A after a first positive referee report
Ridong Wu, Shuhong Chen, Xiangbiao Su, Yuankai Zhu, Yifei Liao, Jianming Wu
With the rapid development of large-scale language models,
Retrieval-Augmented Generation (RAG) has been widely adopted. However, existing
RAG paradigms are inevitably influenced by erroneous retrieval information,
thereby reducing the reliability and correctness of generated results.
Therefore, to improve the relevance of retrieval information, this study
proposes a method that replaces traditional retrievers with GPT-3.5, leveraging
its vast corpus knowledge to generate retrieval information. We also propose a
web-retrieval-based method to implement fine-grained knowledge retrieval,
utilizing the powerful reasoning capability of GPT-3.5 to realize semantic
partitioning of the problem. In order to mitigate the hallucination of GPT
retrieval and reduce noise in web retrieval, we propose a multi-source retrieval
framework, named MSRAG, which combines GPT retrieval with web retrieval.
Experiments on multiple knowledge-intensive QA datasets demonstrate that the
proposed framework performs better than existing RAG frameworks in
enhancing the overall efficiency and accuracy of QA systems.
Authors' comments: 4 pages, 3 figures
Xintong Jiang, Yaxiong Wang, Mengjian Li, Yujiao Wu, Bingwen Hu, Xueming Qian
Composed Image Retrieval (CIR) involves searching for target images based on
an image-text pair query. While current methods treat this as a query-target
matching problem, we argue that CIR triplets contain additional associations
beyond this primary relation. In our paper, we identify two new relations
within triplets, treating each triplet as a graph node. Firstly, we introduce
the concept of text-bridged image alignment, where the query text serves as a
bridge between the query image and the target image. We propose a hinge-based
cross-attention mechanism to incorporate this relation into network learning.
Secondly, we explore complementary text reasoning, considering CIR as a form of
cross-modal retrieval where two images compose to reason about complementary
text. To integrate these perspectives effectively, we design a twin
attention-based compositor. By combining these complementary associations with
the explicit query pair-target image relation, we establish a comprehensive set
of constraints for CIR. Our framework, CaLa (Complementary Association Learning
for Augmenting Composed Image Retrieval), leverages these insights. We evaluate
CaLa on CIRR and FashionIQ benchmarks with multiple backbones, demonstrating
its superiority in composed image retrieval.
Authors' comments: arXiv admin note: text overlap with arXiv:2309.02169
Jialiang Xu, Michael Moor, Jure Leskovec
Despite impressive advances in recent multimodal large language models (MLLMs), state-of-the-art models such as those from the GPT-4 suite still struggle with knowledge-intensive tasks. To address this, we consider Reverse Image Retrieval (RIR) augmented generation, a simple yet effective strategy to augment MLLMs with web-scale reverse image search results. RIR robustly improves knowledge-intensive visual question answering (VQA) of GPT-4V by 37-43%, GPT-4 Turbo by 25-27%, and GPT-4o by 18-20% in terms of open-ended VQA evaluation metrics. To our surprise, we discover that RIR helps the model to better access its own world knowledge. Concretely, our experiments suggest that RIR augmentation helps by providing further visual and textual cues without necessarily containing the direct answer to a query. In addition, we elucidate cases in which RIR can hurt performance and conduct a human evaluation. Finally, we find that the overall advantage of using RIR makes it difficult for an agent that can choose to use RIR to perform better than an approach where RIR is the default setting.
Sihe Zhang, Qingdong He, Jinlong Peng, Yuxi Li, Zhengkai Jiang, Jiafu Wu, Mingmin Chi, Yabiao Wang et al.
Image retrieval aims to identify visually similar images within a database using a given query image. Traditional methods typically employ both global and local features extracted from images for matching, and may also apply re-ranking techniques to enhance accuracy. However, these methods often fail to account for the noise present in query images, which can stem from natural or human-induced factors, thereby negatively impacting retrieval performance. To mitigate this issue, we introduce a novel setting for low-quality image retrieval, and propose an Adaptive Noise-Based Network (AdapNet) to learn robust abstract representations. Specifically, we devise a quality compensation block trained to compensate for various low-quality factors in input images. Besides, we introduce an innovative adaptive noise-based loss function, which dynamically adjusts its focus on the gradient in accordance with image quality, thereby augmenting the learning of unknown noisy samples during training and enhancing intra-class compactness. To assess the performance, we construct two datasets with low-quality queries, which are built by applying various types of noise to clean query images from the standard Revisited Oxford and Revisited Paris datasets. Comprehensive experimental results illustrate that AdapNet surpasses state-of-the-art methods on the Noise Revisited Oxford and Noise Revisited Paris benchmarks, while maintaining competitive performance on high-quality datasets. The code and constructed datasets will be made available.
Huanshuo Liu, Hao Zhang, Zhijiang Guo, Jing Wang, Kuicai Dong, Xiangyang Li, Yi Quan Lee, Cong Zhang et al.
Retrieval-augmented generation (RAG) has emerged as a promising solution for
mitigating hallucinations of large language models (LLMs) with retrieved
external knowledge. Adaptive RAG enhances this approach by enabling dynamic
retrieval during generation, activating retrieval only when the query exceeds
the LLM's internal knowledge. Existing methods primarily focus on detecting the
LLM's confidence via statistical uncertainty. Instead, we present the first attempts
to solve adaptive RAG from a representation perspective and develop an inherent
control-based framework, termed CtrlA. Specifically, we extract the features
that represent the honesty and confidence directions of the LLM and adopt them to
control LLM behavior and guide retrieval timing decisions. We also design a
simple yet effective query formulation strategy to support adaptive retrieval.
Experiments show that CtrlA is superior to existing adaptive RAG methods on a
diverse set of tasks, that honesty steering can effectively make LLMs more
honest, and that confidence monitoring is a promising indicator of when to
trigger retrieval. Our code is available at https://github.com/HSLiu-Initial/CtrlA.
Authors' comments: 29 pages, 10 figures, 11 tables
Kevin Dela Rosa
In this work, we propose the use of "aligned visual captions" as a mechanism
for integrating information contained within videos into retrieval augmented
generation (RAG) based chat assistant systems. These captions are able to
describe the visual and audio content of videos in a large corpus while having
the advantage of being in a textual format that is easy to reason about and
incorporate into large language model (LLM) prompts, and that typically requires
less multimedia content to be inserted into the multimodal LLM context window,
where typical configurations can aggressively fill up the context window by
sampling video frames from the source video. Furthermore, visual captions can
be adapted to specific use cases by prompting the original foundational model /
captioner for particular visual details or fine tuning. In the hope of
advancing progress in this area, we curate a dataset and describe automatic
evaluation procedures on common RAG tasks.
Authors' comments: SIGIR 2024 Workshop on Multimodal Representation and Retrieval (MRR
2024)