Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions, losing multimodal richness. To tackle these limitations, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both visual and textual information. The operation of VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. Also, motivated by the observations that the context size of LVLMs may not be sufficient to process all frames in extremely long videos and that not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showing that it is superior to relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.
Te-Lun Yang, Jyi-Shane Liu, Yuen-Hsien Tseng, Jyh-Shing Roger Jang
This study develops a question-answering system based on Retrieval-Augmented
Generation (RAG) using Chinese Wikipedia and Lawbank as retrieval sources.
Using TTQA and TMMLU+ as evaluation datasets, the system employs BGE-M3 for
dense vector retrieval to obtain highly relevant search results and
BGE-reranker to reorder these results based on query relevance. The most
pertinent retrieval outcomes serve as reference knowledge for a Large Language
Model (LLM), enhancing its ability to answer questions and establishing a
knowledge retrieval system grounded in generative AI.
The system's effectiveness is assessed through a two-stage evaluation:
automatic and assisted performance evaluations. The automatic evaluation
calculates accuracy by comparing the model's auto-generated labels with ground
truth answers, measuring performance under standardized conditions without
human intervention. The assisted performance evaluation involves 20
finance-related multiple-choice questions answered by 20 participants without
financial backgrounds. Initially, participants answer independently. Later,
they receive system-generated reference information to assist in answering,
examining whether the system improves accuracy when assistance is provided.
The main contributions of this research are: (1) Enhanced LLM Capability: By
integrating BGE-M3 and BGE-reranker, the system retrieves and reorders highly
relevant results, reduces hallucinations, and dynamically accesses authorized
or public knowledge sources. (2) Improved Data Privacy: A customized RAG
architecture enables local operation of the LLM, eliminating the need to send
private data to external servers. This approach enhances data security, reduces
reliance on commercial services, lowers operational costs, and mitigates
privacy risks.
Authors' comments: 8 pages, 13 figures, 1 table
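The retrieve-then-rerank flow described in this abstract can be illustrated with a minimal sketch. It does not call BGE-M3 or BGE-reranker: `embed`, `dense_score`, and `rerank_score` are toy stand-ins (bag-of-words cosine similarity and token overlap) assumed here only to make the two-stage control flow runnable, and the corpus is made up.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a dense encoder: L2-normalised bag-of-words counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

def dense_score(q_vec, d_vec):
    """Cosine similarity between two normalised sparse vectors."""
    return sum(w * d_vec.get(tok, 0.0) for tok, w in q_vec.items())

def rerank_score(query, doc):
    """Toy stand-in for a reranker: fraction of query tokens found in the doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def retrieve_then_rerank(query, corpus, k_retrieve=3, k_final=1):
    q_vec = embed(query)
    # Stage 1: first-pass retrieval over the whole corpus.
    candidates = sorted(
        corpus, key=lambda d: dense_score(q_vec, embed(d)), reverse=True
    )[:k_retrieve]
    # Stage 2: re-score only the shortlisted candidates against the query.
    return sorted(
        candidates, key=lambda d: rerank_score(query, d), reverse=True
    )[:k_final]

# Hypothetical mini-corpus; the top passage would be handed to the LLM as reference.
corpus = [
    "the central bank sets the policy interest rate",
    "a bond is a fixed income instrument",
    "interest rate risk affects bond prices",
    "wikipedia is a free online encyclopedia",
]
top = retrieve_then_rerank("what sets the interest rate", corpus)
```

In a real system the first stage would use learned dense vectors and the second a learned relevance model; only the shortlist-then-reorder structure is taken from the abstract.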
Hanna Zubkova, Ji-Hoon Park, Seong-Whan Lee
Bearing in mind the limited parametric knowledge of Large Language Models
(LLMs), retrieval-augmented generation (RAG) which supplies them with the
relevant external knowledge has served as an approach to mitigate the issue of
hallucinations to a certain extent. However, uniformly retrieving supporting
context makes response generation source-inefficient, as triggering the
retriever is not always necessary, or even inaccurate, when a model gets
distracted by noisy retrieved content and produces an unhelpful answer.
Motivated by these issues, we introduce Semantic Uncertainty Guided Adaptive
Retrieval (SUGAR), where we leverage context-based entropy to actively decide
whether to retrieve and to further determine between single-step and multi-step
retrieval. Our empirical results show that selective retrieval guided by
semantic uncertainty estimation improves the performance across diverse
question answering tasks, as well as achieves a more efficient inference.
Authors' comments: ICASSP2025
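As a rough illustration of uncertainty-guided retrieval, the sketch below computes an entropy over meaning clusters of sampled answers and maps it to one of three retrieval strategies. The clustering step, the threshold values `low` and `high`, and the strategy names are assumptions made for this example; the abstract does not specify them.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Entropy over meaning clusters of sampled answers.

    cluster_ids[i] is the (assumed, precomputed) cluster label of the i-th
    sampled answer; answers with the same meaning share a label.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def choose_strategy(cluster_ids, low=0.5, high=1.0):
    """Map entropy to a retrieval decision; thresholds are hypothetical."""
    h = semantic_entropy(cluster_ids)
    if h < low:
        return "no_retrieval"   # model is consistent, answer directly
    if h < high:
        return "single_step"    # moderate uncertainty, retrieve once
    return "multi_step"         # high uncertainty, retrieve iteratively

confident = choose_strategy([0, 0, 0, 0])
unsure = choose_strategy([0, 0, 0, 1])
lost = choose_strategy([0, 1, 2, 3])
```

The design point is that retrieval is triggered only when sampled answers disagree in meaning, which is what would save retriever calls on questions the model already answers consistently.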
Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A. Rossi et al.
Retrieval-augmented generation (RAG) is a powerful technique that enhances downstream task execution by retrieving additional information, such as knowledge, skills, and tools from external sources. Graph, by its intrinsic "nodes connected by edges" nature, encodes massive heterogeneous and relational information, making it a golden resource for RAG in tremendous real-world applications. As a result, we have recently witnessed increasing attention on equipping RAG with Graph, i.e., GraphRAG. However, unlike conventional RAG, where the retriever, generator, and external data sources can be uniformly designed in the neural-embedding space, the uniqueness of graph-structured data, such as diverse-formatted and domain-specific relational knowledge, poses unique and significant challenges when designing GraphRAG for different domains. Given the broad applicability, the associated design challenges, and the recent surge in GraphRAG, a systematic and up-to-date survey of its key concepts and techniques is urgently desired. Following this motivation, we present a comprehensive and up-to-date survey on GraphRAG. Our survey first proposes a holistic GraphRAG framework by defining its key components, including query processor, retriever, organizer, generator, and data source. Furthermore, recognizing that graphs in different domains exhibit distinct relational patterns and require dedicated designs, we review GraphRAG techniques uniquely tailored to each domain. Finally, we discuss research challenges and brainstorm directions to inspire cross-disciplinary opportunities. Our survey repository is publicly maintained at https://github.com/Graph-RAG/GraphRAG/.
Haitian Chen, Qingyao Ai, Xiao Wang, Yiqun Liu, Fen Lin, Qin Liu
Efficiently retrieving a concise set of candidates from a large document corpus remains a pivotal challenge in Information Retrieval (IR). Neural retrieval models, particularly dense retrieval models built with transformers and pretrained language models, have been popular due to their superior performance. However, criticisms have also been raised about their lack of explainability and vulnerability to adversarial attacks. In response to these challenges, we propose to improve the robustness of dense retrieval models by enhancing their sensitivity to fine-grained relevance signals. A model achieving sensitivity in this context should exhibit high variances when documents' key passages determining their relevance to queries have been modified, while maintaining low variances for other changes in irrelevant passages. This sensitivity allows a dense retrieval model to produce robust results with respect to attacks that try to promote documents without actually increasing their relevance. It also makes it possible to analyze which part of a document is actually relevant to a query, and thus improve the explainability of the retrieval model. Motivated by causality and counterfactual analysis, we propose a series of counterfactual regularization methods based on game theory and unsupervised learning with counterfactual passages. Experiments show that our method can extract key passages without relying on passage-level relevance annotations. Moreover, the regularized dense retrieval models exhibit heightened robustness against adversarial attacks, surpassing the state-of-the-art anti-attack methods.
Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Multi-step multimodal reasoning tasks pose significant challenges for
multimodal large language models (MLLMs), and finding effective ways to enhance
their performance in such scenarios remains an unresolved issue. In this paper,
we propose AR-MCTS, a universal framework designed to progressively improve the
reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo
Tree Search (MCTS). Our approach begins with the development of a unified
retrieval module that retrieves key supporting insights for solving complex
reasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in
automated multimodal reasoning verification, we employ the MCTS algorithm
combined with an active retrieval mechanism, which enables the automatic
generation of step-wise annotations. This strategy dynamically retrieves key
insights for each reasoning step, moving beyond traditional beam search
sampling to improve the diversity and reliability of the reasoning space.
Additionally, we introduce a process reward model that aligns progressively to
support the automatic verification of multimodal reasoning tasks. Experimental
results across three complex multimodal reasoning benchmarks confirm the
effectiveness of the AR-MCTS framework in enhancing the performance of various
multimodal models. Further analysis demonstrates that AR-MCTS can optimize
sampling diversity and accuracy, yielding reliable multimodal reasoning.
Authors' comments: Work in progress
Chi Liu, Jiangxia Cao, Rui Huang, Kuo Cai, Weifeng Ding, Qiang Luo, Kun Gai, Guorui Zhou
Recommendation systems (RecSys) are designed to connect users with relevant items from a vast pool of candidates while aligning with the business goals of the platform. A typical industrial RecSys is composed of two main stages, retrieval and ranking: (1) the retrieval stage aims at searching for hundreds of item candidates that satisfy user interests; (2) based on the retrieved items, the ranking stage aims at selecting the best dozen items by estimating multiple targets for each item candidate, including classification and regression targets. Compared with the ranking model, the retrieval model lacks item candidate information during inference; therefore, retrieval models are often trained with classification targets only (e.g., click-through rate) and fail to incorporate regression targets (e.g., the expected watch-time), which limits the effectiveness of retrieval. In this paper, we propose the Controllable Retrieval Model (CRM), which integrates regression information as conditional features into the two-tower retrieval paradigm. This modification enables the retrieval stage to close the target gap with the ranking model, enhancing the retrieval model's ability to search for item candidates that satisfy both the user interests and the condition. We validate the effectiveness of CRM through real-world A/B testing and demonstrate its successful deployment in the Kuaishou short-video recommendation system, which serves over 400 million users.
Jianlyu Chen, Nan Wang, Chaofan Li, Bo Wang, Shitao Xiao, Han Xiao, Hao Liao, Defu Lian et al.
Evaluation plays a crucial role in the advancement of information retrieval
(IR) models. However, current benchmarks, which are based on predefined domains
and human-labeled data, face limitations in addressing evaluation needs for
emerging domains both cost-effectively and efficiently. To address this
challenge, we propose the Automated Heterogeneous Information Retrieval
Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1)
Automated. The testing data in AIR-Bench is automatically generated by large
language models (LLMs) without human intervention. 2) Heterogeneous. The
testing data in AIR-Bench is generated with respect to diverse tasks, domains
and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are
constantly augmented to provide an increasingly comprehensive evaluation
benchmark for community developers. We develop a reliable and robust data
generation pipeline to automatically create diverse and high-quality evaluation
datasets based on real-world corpora. Our findings demonstrate that the
generated testing data in AIR-Bench aligns well with human-labeled testing
data, making AIR-Bench a dependable benchmark for evaluating IR models. The
resources in AIR-Bench are publicly available at
https://github.com/AIR-Bench/AIR-Bench.
Authors' comments: 31 pages, 6 figures; Update Table 4 and Figure 3
Effrosyni Sokli, Pranav Kasela, Georgios Peikos, Gabriella Pasi
While Dense Retrieval Models (DRMs) have advanced Information Retrieval (IR), one limitation of these neural models is their narrow generalizability and robustness. To cope with this issue, one can leverage the Mixture-of-Experts (MoE) architecture. While previous IR studies have incorporated MoE architectures within the Transformer layers of DRMs, our work investigates an architecture that integrates a single MoE block (SB-MoE) after the output of the final Transformer layer. Our empirical evaluation investigates how SB-MoE compares, in terms of retrieval effectiveness, to standard fine-tuning. In detail, we fine-tune three DRMs (TinyBERT, BERT, and Contriever) across four benchmark collections with and without adding the MoE block. Moreover, since MoE showcases performance variations with respect to its parameters (i.e., the number of experts), we conduct additional experiments to investigate this aspect further. The findings show the effectiveness of SB-MoE especially for DRMs with a low number of parameters (i.e., TinyBERT), as it consistently outperforms the fine-tuned underlying model on all four benchmarks. For DRMs with a higher number of parameters (i.e., BERT and Contriever), SB-MoE requires larger numbers of training samples to yield better retrieval performance.
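To make the architecture concrete, here is a minimal sketch of a single MoE block applied to the output of an encoder's final layer. It assumes dense softmax gating over all experts and linear experts with random weights; the class name `SingleMoEBlock`, the gating scheme, and the dimensions are illustrative choices, not details confirmed by the abstract.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class SingleMoEBlock:
    """One mixture-of-experts block placed after the final encoder layer."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]
        self.gate = rand_matrix(num_experts, dim)

    def forward(self, h):
        # Gate scores decide how much each expert contributes for this input.
        weights = softmax(matvec(self.gate, h))
        outs = [matvec(W, h) for W in self.experts]
        mixed = [sum(w * o[i] for w, o in zip(weights, outs)) for i in range(len(h))]
        return mixed, weights

# h stands in for the pooled output of the underlying dense retrieval model.
block = SingleMoEBlock(dim=4, num_experts=3)
embedding, gate_weights = block.forward([0.5, -1.0, 0.25, 2.0])
```

Varying `num_experts` in such a block corresponds to the parameter the abstract reports studying.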
Xiangyu Peng, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu
Existing evaluation frameworks for retrieval-augmented generation (RAG) systems focus on answerable queries, but they overlook the importance of appropriately rejecting unanswerable requests. In this paper, we introduce UAEval4RAG, a framework designed to evaluate whether RAG systems can handle unanswerable queries effectively. We define a taxonomy with six unanswerable categories, and UAEval4RAG automatically synthesizes diverse and challenging queries for any given knowledge base with unanswered ratio and acceptable ratio metrics. We conduct experiments with various RAG components, including retrieval models, rewriting methods, rerankers, language models, and prompting strategies, and reveal hidden trade-offs in performance of RAG systems. Our findings highlight the critical role of component selection and prompt design in optimizing RAG systems to balance the accuracy of answerable queries with high rejection rates of unanswerable ones. UAEval4RAG provides valuable insights and tools for developing more robust and reliable RAG systems.
Junhao Zhuang, Xuan Ju, Zhaoyang Zhang, Yong Liu, Shiyi Zhang, Chun Yuan, Ying Shan
Automatic black-and-white image sequence colorization while preserving
character and object identity (ID) is a complex task with significant market
demand, such as in cartoon or comic series colorization. Despite advancements
in visual colorization using large-scale generative models like diffusion
models, challenges with controllability and identity consistency persist,
making current solutions unsuitable for industrial application. To address this,
we propose ColorFlow, a three-stage diffusion-based framework tailored for
image sequence colorization in industrial applications. Unlike existing methods
that require per-ID finetuning or explicit ID embedding extraction, we propose
a novel robust and generalizable Retrieval Augmented Colorization pipeline for
colorizing images with relevant color references. Our pipeline also features a
dual-branch design: one branch for color identity extraction and the other for
colorization, leveraging the strengths of diffusion models. We utilize the
self-attention mechanism in diffusion models for strong in-context learning and
color identity matching. To evaluate our model, we introduce ColorFlow-Bench, a
comprehensive benchmark for reference-based colorization. Results show that
ColorFlow outperforms existing models across multiple metrics, setting a new
standard in sequential image colorization and potentially benefiting the art
industry. We release our codes and models on our project page:
https://zhuang2002.github.io/ColorFlow/.
Authors' comments: Project Page: https://zhuang2002.github.io/ColorFlow/
Madhu Kiran, Kartikey Vishnu, Rafael M. O. Cruz, Eric Granger
Image retrieval methods rely on metric learning to train backbone feature
extraction models that can extract discriminant queries and reference (gallery)
feature representations for similarity matching. Although state-of-the-art
accuracy has improved considerably with the advent of deep learning (DL) models
trained on large datasets, image retrieval remains challenging in many
real-world video analytics and surveillance applications, e.g., person
re-identification. Using the Euclidean space for matching limits the
performance in real-world applications due to the curse of dimensionality,
overfitting, and sensitivity to noisy data.
We argue that the feature dissimilarity space is more suitable for similarity
matching, and propose a dichotomy transformation to project query and reference
embeddings into a single embedding in the dissimilarity space.
We also advocate for end-to-end training of a backbone and binary
classification models for pair-wise matching. As opposed to comparing the
distance between queries and reference embeddings, we show the benefits of
classifying the single dissimilarity space embedding (as similar or
dissimilar), especially when trained end-to-end. We propose a method to train
the max-margin classifier together with the backbone feature extractor by
applying constraints to the L2 norm of the classifier weights along with the
hinge loss.
Our extensive experiments on challenging image retrieval datasets and using
diverse feature extraction backbones highlight the benefits of similarity
matching in the dissimilarity space. In particular, when jointly training the
feature extraction backbone and regularised classifier for matching, the
dissimilarity space provides a higher level of accuracy.
Authors' comments: 7 pages
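The idea of matching in the dissimilarity space can be sketched as follows. The element-wise absolute difference used for `dichotomy`, the linear classifier, and the soft L2 penalty weighted by `lam` are assumptions chosen for illustration; the abstract only states that query and reference embeddings are projected into a single embedding and classified with a hinge loss under an L2-norm constraint on the weights.

```python
def dichotomy(query, reference):
    """Map a (query, reference) pair to one vector in the dissimilarity space.

    Assumption: element-wise absolute difference of the two embeddings.
    """
    return [abs(q - r) for q, r in zip(query, reference)]

def decision(w, b, u):
    """Linear classifier score; positive means 'similar'."""
    return sum(wi * ui for wi, ui in zip(w, u)) + b

def hinge_objective(w, b, pairs, lam=0.01):
    """Mean hinge loss plus an L2 penalty on the classifier weights.

    pairs: list of (u, y) with y = +1 for similar and -1 for dissimilar.
    The penalty is a soft stand-in for a norm constraint.
    """
    hinge = sum(max(0.0, 1.0 - y * decision(w, b, u)) for u, y in pairs) / len(pairs)
    return hinge + lam * sum(wi * wi for wi in w)

# Hypothetical 2-D embeddings and hand-picked classifier weights.
w, b = [-2.0, -2.0], 1.0
u_same = dichotomy([0.1, 0.9], [0.1, 0.8])   # small differences
u_diff = dichotomy([0.1, 0.9], [0.9, 0.1])   # large differences
objective = hinge_objective(w, b, [(u_same, +1), (u_diff, -1)])
```

Instead of thresholding a distance, the pair is classified as a single point, which is what allows the classifier and the backbone to be trained jointly.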
Fabian Paischer, Liu Yang, Linfeng Liu, Shuai Shao, Kaveh Hassani, Jiacheng Li, Ricky Chen, Zhang Gabriel Li et al.
Sequential recommendation systems aim to provide personalized recommendations
for users based on their interaction history. To achieve this, they often
incorporate auxiliary information, such as textual descriptions of items and
auxiliary tasks, like predicting user preferences and intent. Despite numerous
efforts to enhance these models, they still suffer from limited
personalization. To address this issue, we propose a new paradigm, which we
term preference discerning. In preference discerning, we explicitly condition a
generative sequential recommendation system on user preferences within its
context. To this end, we generate user preferences using Large Language Models
(LLMs) based on user reviews and item-specific data. To evaluate preference
discerning capabilities of sequential recommendation systems, we introduce a
novel benchmark that provides a holistic evaluation across various scenarios,
including preference steering and sentiment following. We assess current
state-of-the-art methods using our benchmark and show that they struggle to
accurately discern user preferences. Therefore, we propose a new method named
Mender ($\textbf{M}$ultimodal Prefer$\textbf{en}$ce
$\textbf{d}$iscern$\textbf{er}$), which improves upon existing methods and
achieves state-of-the-art performance on our benchmark. Our results show that
Mender can be effectively guided by human preferences even though they have not
been observed during training, paving the way toward more personalized
sequential recommendation systems. We will open-source the code and benchmarks
upon publication.
Authors' comments: 11 pages + references and appendix
Haiyang Peng, Deren Han, Linbin Li, Meng Huang
This paper aims to address the phase retrieval problem from subgaussian measurements with arbitrary noise, with a focus on devising robust and efficient algorithms for solving non-convex problems. To ensure uniqueness of solutions in the subgaussian setting, we explore two commonly used assumptions: either the subgaussian measurements satisfy a fourth-moment condition or the target signals exhibit non-peakiness. For each scenario, we introduce a novel spectral initialization method that yields robust initial estimates. Building on this, we employ leave-one-out arguments to show that the classical Wirtinger flow algorithm achieves a linear rate of convergence for both real-valued and complex-valued cases, provided the sampling complexity $m\ge O(n \log^3 m)$, where $n$ is the dimension of the underlying signals. In contrast to existing work, our algorithms are regularization-free, requiring no truncation, trimming, or additional penalty terms, and they permit the algorithm step sizes as large as $O(1)$, compared to the $O(1/n)$ in previous literature. Furthermore, our results accommodate arbitrary noise vectors that meet certain statistical conditions, covering a wide range of noise scenarios, with sub-exponential noise as a notable special case. The effectiveness of our algorithms is validated through various numerical experiments. We emphasize that our findings provide the first theoretical guarantees for recovering non-peaky signals using non-convex methods from Bernoulli measurements, which is of independent interest.
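For readers unfamiliar with the algorithm, the following is a minimal real-valued sketch of spectral initialization followed by plain Wirtinger flow (gradient descent on the quartic least-squares loss, with no truncation or extra penalty). It assumes noise-free Gaussian measurements, a simple special case of the subgaussian setting; the problem sizes, the step size 0.1, and the iteration counts are illustrative and are not taken from the paper.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spectral_init(A, y, iters=100):
    """Leading eigenvector of (1/m) sum_i y_i a_i a_i^T by power iteration,
    rescaled so that ||z0||^2 equals the mean of y."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Yv = [0.0] * n
        for a, yi in zip(A, y):
            c = yi * dot(a, v) / m
            for j in range(n):
                Yv[j] += c * a[j]
        norm = math.sqrt(dot(Yv, Yv))
        v = [t / norm for t in Yv]
    scale = math.sqrt(sum(y) / m)
    return [scale * t for t in v]

def wirtinger_flow(A, y, step=0.1, iters=500):
    """Gradient descent on f(z) = (1/4m) sum_i ((a_i . z)^2 - y_i)^2."""
    m, n = len(A), len(A[0])
    z = spectral_init(A, y)
    mu = step / dot(z, z)  # step normalised by the initial signal energy
    for _ in range(iters):
        grad = [0.0] * n
        for a, yi in zip(A, y):
            az = dot(a, z)
            c = (az * az - yi) * az / m
            for j in range(n):
                grad[j] += c * a[j]
        z = [zj - mu * gj for zj, gj in zip(z, grad)]
    return z

random.seed(0)
n, m = 8, 240
x = [random.gauss(0, 1) for _ in range(n)]
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
y = [dot(a, x) ** 2 for a in A]  # phaseless (sign-free) measurements

z = wirtinger_flow(A, y)
# The signal is only identifiable up to a global sign in the real case.
err_plus = math.sqrt(sum((zi - xi) ** 2 for zi, xi in zip(z, x)))
err_minus = math.sqrt(sum((zi + xi) ** 2 for zi, xi in zip(z, x)))
rel_err = min(err_plus, err_minus) / math.sqrt(dot(x, x))
```

Recovery quality is measured by `rel_err`, which compares the estimate with both `x` and `-x`.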
Nadia Sheikh, Anne-Laure Jousse, Daniel Buades Marcos, Akintunde Oladipo, Olivier Rousseau, Jimmy Lin
Given the dominance of dense retrievers that do not generalize well beyond their training dataset distributions, domain-specific test sets are essential in evaluating retrieval. There are few test datasets for retrieval systems intended for use by healthcare providers in a point-of-care setting. To fill this gap we have collaborated with medical professionals to create CURE, an ad-hoc retrieval test dataset for passage ranking with 2000 queries spanning 10 medical domains with a monolingual (English) and two cross-lingual (French/Spanish -> English) conditions. In this paper, we describe how CURE was constructed and provide baseline results to showcase its effectiveness as an evaluation tool. CURE is published with a Creative Commons Attribution Non Commercial 4.0 license and can be accessed on Hugging Face.
Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou
Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs). In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance models' MT ability. However, a large amount of world knowledge is organized in unstructured documents, and might not be fully paired across different languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark to train and evaluate LLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples collected via GPT-4o and human translators. Besides, documents from different languages are also provided to supply the knowledge to these samples. Based on RAGtrans, we further propose a multi-task training method to teach LLMs how to use information from multilingual documents during their translation. The method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements. Extensive experiments show that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores.
Puxuan Yu, Luke Merrick, Gaurav Nuti, Daniel Campos
This paper presents the training methodology of Arctic-Embed 2.0, a set of
open-source text embedding models built for accurate and efficient multilingual
retrieval. While prior works have suffered from degraded English retrieval
quality, Arctic-Embed 2.0 delivers competitive retrieval quality on
multilingual and English-only benchmarks, and supports Matryoshka
Representation Learning (MRL) for efficient embedding storage with
significantly lower compressed quality degradation compared to alternatives. We
detail the design and implementation, presenting several important open
research questions that arose during model development. We conduct experiments
exploring these research questions and include extensive discussion aimed at
fostering further discussion in this field.
Authors' comments: 10 pages, 5 figures, 3 tables
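The storage saving that Matryoshka Representation Learning enables comes from keeping only a prefix of each embedding. A minimal sketch of that truncation step is given below; the vectors and the chosen dimension are made up, and nothing here reflects the actual Arctic-Embed 2.0 models or their API.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head]

def cosine(u, v):
    """Dot product of two unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 4-dimensional embeddings, compressed to 2 dimensions.
query_full = [3.0, 4.0, 1.0, 2.0]
doc_full = [4.0, 3.0, -2.0, 0.5]

query_small = truncate_and_normalize(query_full, 2)
doc_small = truncate_and_normalize(doc_full, 2)
score_small = cosine(query_small, doc_small)
```

With an MRL-trained model, similarity computed on the truncated prefix is intended to stay close to the full-dimension similarity; how close depends on the model and is not shown by this toy example.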
Joel Suro
Retrieval-Augmented Generation (RAG) architectures have recently garnered significant attention for their ability to improve truth grounding and coherence in natural language processing tasks. However, the reliability of RAG systems in producing accurate answers diminishes as the volume of data they access increases. Even with smaller datasets, these systems occasionally fail to address simple queries. This issue arises from their dependence on state-of-the-art large language models (LLMs), which can introduce uncertainty into the system's outputs. In this work, I propose a novel Comparative RAG system that introduces an evaluator module to bridge the gap between probabilistic RAG systems and deterministically verifiable responses. The evaluator compares external recommendations with the retrieved document chunks, adding a decision-making layer that enhances the system's reliability. This approach ensures that the chunks retrieved are both semantically relevant and logically consistent with deterministic insights, thereby improving the accuracy and overall efficiency of RAG systems. This framework paves the way for more reliable and scalable question-answering applications in domains requiring high precision and verifiability.
Hongji Yang, Yiru Li, Yingying Zhu
Information retrieval techniques have demonstrated exceptional capabilities in identifying semantic similarities across diverse domains through robust feature representations. However, their potential in guiding synthesis tasks, particularly cross-view image synthesis, remains underexplored. Cross-view image synthesis presents significant challenges in establishing reliable correspondences between drastically different viewpoints. To address this, we propose a novel retrieval-guided framework that reimagines how retrieval techniques can facilitate effective cross-view image synthesis. Unlike existing methods that rely on auxiliary information, such as semantic segmentation maps or preprocessing modules, our retrieval-guided framework captures semantic similarities across different viewpoints, trained through contrastive learning to create a smooth embedding space. Furthermore, a novel fusion mechanism leverages these embeddings to guide image synthesis while learning and encoding both view-invariant and view-specific features. To further advance this area, we introduce VIGOR-GEN, a new urban-focused dataset with complex viewpoint variations in real-world scenarios. Extensive experiments demonstrate that our retrieval-guided approach significantly outperforms existing methods on the CVUSA, CVACT and VIGOR-GEN datasets, particularly in retrieval accuracy (R@1) and synthesis quality (FID). Our work bridges information retrieval and synthesis tasks, offering insights into how retrieval techniques can address complex cross-domain synthesis challenges.
Batuhan Sariturk, Rabia Bayraktar, Merve Elmas Erdem
With the rise of online education platforms, there is a growing abundance of educational content across various domains. It can be difficult to navigate the numerous available resources to find the most suitable training, especially in domains that include many interconnected areas, such as ICT. In this study, we propose a domain-specific chatbot application that requires limited resources, utilizing versions of the Phi language model to help learners with educational content. In the proposed method, Phi-2 and Phi-3 models were fine-tuned using QLoRA. The data required for fine-tuning was obtained from the Huawei Talent Platform, where courses are available at different levels of expertise in the field of computer science. A RAG system was used to support the model, which was fine-tuned on 500 Q&A pairs. Additionally, a total of 420 Q&A pairs of content were extracted from different formats such as JSON, PPT, and DOC to create a vector database to be used in the RAG system. By using the fine-tuned model and the RAG approach together, chatbots with different competencies were obtained. The questions posed to the resulting chatbots and their answers were saved separately and evaluated using ROUGE, BERTScore, METEOR, and BLEU metrics. The precision value of the Phi-2 model supported by RAG was 0.84 and the F1 score was 0.82. In addition to a total of 13 different evaluation metrics in 4 different categories, the answers of each model were compared with the created content and the most appropriate method was selected for real-life applications.