Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, Siliang Tang
Recently, Retrieval-Augmented Generation (RAG) has achieved remarkable
success in addressing the challenges of Large Language Models (LLMs) without
necessitating retraining. By referencing an external knowledge base, RAG
refines LLM outputs, effectively mitigating issues such as "hallucination",
lack of domain-specific knowledge, and outdated information. However, the
complex structure of relationships among different entities in databases
presents challenges for RAG systems. In response, GraphRAG leverages structural
information across entities to enable more precise and comprehensive retrieval,
capturing relational knowledge and facilitating more accurate, context-aware
responses. Given the novelty and potential of GraphRAG, a systematic review of
current technologies is imperative. This paper provides the first comprehensive
overview of GraphRAG methodologies. We formalize the GraphRAG workflow,
encompassing Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced
Generation. We then outline the core technologies and training methods at each
stage. Additionally, we examine downstream tasks, application domains,
evaluation methodologies, and industrial use cases of GraphRAG. Finally, we
explore future research directions to inspire further inquiries and advance
progress in the field.
Authors' comments: Ongoing work
Samhaa R. El-Beltagy, Mohamed A. Abdallah
Recently, Retrieval Augmented Generation (RAG) has emerged as a powerful technique in natural language processing, combining the strengths of retrieval-based and generation-based models to enhance text generation tasks. However, the application of RAG in Arabic, a language with unique characteristics and resource constraints, remains underexplored. This paper presents a comprehensive case study on the implementation and evaluation of RAG for Arabic text. The work focuses on exploring various semantic embedding models in the retrieval stage and several LLMs in the generation stage, in order to investigate what works and what doesn't in the context of Arabic. The work also touches upon the issue of variations between document dialect and query dialect in the retrieval stage. Results show that existing semantic embedding models and LLMs can be effectively employed to build Arabic RAG pipelines.
Nicholas Rossi, Juexin Lin, Feng Liu, Zhen Yang, Tony Lee, Alessandro Magnani, Ciya Liao
In embedding-based retrieval, Approximate Nearest Neighbor (ANN) search
enables efficient retrieval of similar items from large-scale datasets. While
maximizing recall of relevant items is usually the goal of retrieval systems, a
low precision may lead to a poor search experience. Unlike lexical retrieval,
which inherently limits the size of the retrieved set through keyword matching,
dense retrieval via ANN search has no natural cutoff. Moreover, the cosine
similarity scores of embedding vectors are often optimized via contrastive or
ranking losses, which make them difficult to interpret. Consequently, relying
on a top-K or cosine-similarity cutoff is often insufficient to filter out
irrelevant results effectively. This issue is prominent in product search,
where the number of relevant products is often small. This paper introduces a
novel relevance filtering component (called "Cosine Adapter") for
embedding-based retrieval to address this challenge. Our approach maps raw
cosine similarity scores to interpretable scores using a query-dependent
mapping function. We then apply a global threshold on the mapped scores to
filter out irrelevant results. We are able to significantly increase the
precision of the retrieved set, at the expense of a small loss of recall. The
effectiveness of our approach is demonstrated through experiments on both
the public MS MARCO dataset and internal Walmart product search data. Furthermore,
online A/B testing on the Walmart site validates the practical value of our
approach in real-world e-commerce settings.
Authors' comments: 8 pages, 3 figures, CIKM 2024
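The filtering step described in the abstract above (map raw cosine scores through a query-dependent function, then apply one global threshold) can be illustrated with a minimal sketch. The abstract does not specify the form of the mapping, so the per-query sigmoid used here, along with the names `make_query_mapping`, `filter_results`, `scale`, and `shift`, are assumptions made purely for illustration and are not the authors' implementation.

```python
import math
from typing import Callable, List, Tuple


def make_query_mapping(scale: float, shift: float) -> Callable[[float], float]:
    """Build a hypothetical query-dependent mapping from cosine score to (0, 1).

    In a real system the (scale, shift) pair would be predicted per query;
    here it is supplied by hand. The sigmoid form is an assumption.
    """
    def mapping(cosine: float) -> float:
        return 1.0 / (1.0 + math.exp(-(scale * cosine + shift)))
    return mapping


def filter_results(
    scored_items: List[Tuple[str, float]],
    mapping: Callable[[float], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Keep items whose mapped score reaches the single global threshold."""
    kept = []
    for item, cosine in scored_items:
        mapped = mapping(cosine)
        if mapped >= threshold:
            kept.append((item, mapped))
    return kept


if __name__ == "__main__":
    results = [("item_a", 0.8), ("item_b", 0.6)]
    strict_query = make_query_mapping(scale=20.0, shift=-14.0)
    lenient_query = make_query_mapping(scale=20.0, shift=-10.0)
    print(filter_results(results, strict_query))
    print(filter_results(results, lenient_query))
```

The point of the sketch is that the same raw cosine score (0.6 here) can pass the global threshold for one query and fail it for another, because each query has its own mapping.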
Hanjia Lyu, Hanqing Zeng, Yinglong Xia, Ren Chen, Jiebo Luo
Many existing industrial recommender systems are sensitive to the patterns of user-item engagement. Light users, who interact less frequently, pose a data sparsity problem, making it difficult for the system to accurately learn and represent their preferences. On the other hand, heavy users with rich interaction history often demonstrate a variety of niche interests that are hard to capture precisely under the standard "user-item" similarity measurement. Moreover, implementing these systems in an industrial environment requires that they be resource-efficient and scalable enough to process web-scale data under strict latency constraints. In this paper, we address these challenges by introducing an intermediate "interest" layer between users and items. We propose a novel approach that efficiently constructs user interest and facilitates inference at low computational cost by clustering engagement graphs and incorporating user-interest attention. This method enhances the understanding of light users' preferences by linking them with heavy users. By integrating user-interest attention, our approach allows a more personalized similarity metric, adept at capturing the complex dynamics of user-item interactions. The use of interest as an intermediary layer fosters a balance between scalability and expressiveness in the model. Evaluations on two public datasets reveal that our method not only achieves improved recommendation performance but also demonstrates enhanced computational efficiency compared to item-level attention models. Our approach has also been deployed in multiple products at Meta, facilitating short-form video-related recommendation.
Hassan S. Shavarani, Anoop Sarkar
The similarity between the question and indexed documents is a crucial factor
in document retrieval for retrieval-augmented question answering. Although this
is typically the only method for obtaining the relevant documents, it is not
the sole approach when dealing with entity-centric questions. In this study, we
propose Entity Retrieval, a novel retrieval method which, rather than relying on
question-document similarity, depends on the salient entities within the
question to identify the retrieval documents. We conduct an in-depth analysis
of the performance of both dense and sparse retrieval methods in comparison to
Entity Retrieval. Our findings reveal that our method not only leads to more
accurate answers to entity-centric questions but also operates more
efficiently.
Authors' comments: 17 pages total, 10 Tables, 4 Figures
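As a rough illustration of the Entity Retrieval idea in the abstract above, the sketch below looks up documents through the entities mentioned in a question instead of scoring question-document similarity. The abstract does not say how salient entities are detected; the case-insensitive substring match and the names `find_entities`, `entity_retrieve`, and `entity_index` are hypothetical stand-ins, not the authors' method.

```python
from typing import Dict, List


def find_entities(question: str, entity_index: Dict[str, str]) -> List[str]:
    """Return known entity names that occur in the question.

    Uses a naive case-insensitive substring match as a placeholder for a
    real entity linker (an assumption made for this sketch).
    """
    lowered = question.lower()
    found = [name for name in entity_index if name.lower() in lowered]
    # Longer names first, so more specific mentions are ranked higher.
    return sorted(found, key=lambda name: (-len(name), name))


def entity_retrieve(question: str, entity_index: Dict[str, str], k: int = 2) -> List[str]:
    """Return up to k document ids keyed by the entities found in the question."""
    return [entity_index[name] for name in find_entities(question, entity_index)[:k]]


if __name__ == "__main__":
    index = {"Marie Curie": "doc_curie", "Warsaw": "doc_warsaw", "Paris": "doc_paris"}
    print(entity_retrieve("Was Marie Curie born in Warsaw?", index))
```

No embedding or term-matching score over the document text is computed at any point, which is where the efficiency argument in the abstract comes from.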
Arian Askari, Chuan Meng, Mohammad Aliannejadi, Zhaochun Ren, Evangelos Kanoulas, Suzan Verberne
Existing generative retrieval (GR) approaches rely on training-based indexing, i.e., fine-tuning a model to memorise the associations between a query and the document identifier (docid) of a relevant document. Training-based indexing has three limitations: high training overhead, under-utilization of the pre-trained knowledge of large language models (LLMs), and challenges in adapting to a dynamic document corpus. To address the above issues, we propose a novel few-shot indexing-based GR framework (Few-Shot GR). It has a novel few-shot indexing process, where we prompt an LLM to generate docids for all documents in a corpus, ultimately creating a docid bank for the entire corpus. During retrieval, we feed a query to the same LLM and constrain it to generate a docid within the docid bank created during indexing, and then map the generated docid back to its corresponding document. Few-Shot GR relies solely on prompting an LLM without requiring any training, making it more efficient. Moreover, we devise few-shot indexing with one-to-many mapping to further enhance Few-Shot GR. Experiments show that Few-Shot GR achieves superior performance to state-of-the-art GR methods that require heavy training.
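The retrieval step described in the abstract above (generate a docid, but only one that exists in the docid bank, then map it back to a document) amounts to constrained decoding. The sketch below shows that constraint with a toy scorer in place of an LLM. The names `allowed_next_tokens`, `constrained_decode`, and `overlap_score` are hypothetical, greedy decoding is an assumption, and the sketch assumes no docid is a prefix of another.

```python
from typing import Callable, Dict, Set, Tuple

DocId = Tuple[str, ...]
Scorer = Callable[[str, DocId, str], float]


def allowed_next_tokens(bank: Dict[DocId, str], prefix: DocId) -> Set[str]:
    """Tokens that extend `prefix` towards at least one docid in the bank."""
    n = len(prefix)
    return {docid[n] for docid in bank if len(docid) > n and docid[:n] == prefix}


def constrained_decode(query: str, bank: Dict[DocId, str], score_token: Scorer) -> DocId:
    """Greedily build a docid, choosing only tokens the bank allows.

    `score_token` stands in for the LLM's next-token score (an assumption).
    Assumes the bank is non-empty and prefix-free.
    """
    if not bank:
        raise ValueError("docid bank is empty")
    prefix: DocId = ()
    while prefix not in bank:
        candidates = sorted(allowed_next_tokens(bank, prefix))
        best = max(candidates, key=lambda tok: score_token(query, prefix, tok))
        prefix = prefix + (best,)
    return prefix


def overlap_score(query: str, prefix: DocId, token: str) -> float:
    """Toy scorer: 1.0 if the token appears in the query, else 0.0."""
    return 1.0 if token in query.lower().split() else 0.0


if __name__ == "__main__":
    bank = {
        ("solar", "panels"): "Document about photovoltaic panels",
        ("solar", "eclipse"): "Document about solar eclipses",
        ("lunar", "eclipse"): "Document about lunar eclipses",
    }
    docid = constrained_decode("when is the next solar eclipse", bank, overlap_score)
    print(docid, "->", bank[docid])
```

Because every step is restricted to tokens that continue some stored docid, the decoded identifier always maps back to a document in the corpus.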
Xi Wang, Procheta Sen, Ruizhe Li, Emine Yilmaz
Despite the success of integrating large language models into the development
of conversational systems, many studies have shown the effectiveness of
retrieving and augmenting external knowledge for informative responses. Hence,
many existing studies assume that Retrieval Augmented Generation (RAG) is
always needed in a conversational system, without explicit control. This
raises the research question of whether it is in fact necessary. In this study, we propose to
investigate the need for each turn of system response to be augmented with
external knowledge. In particular, by leveraging human judgements on the binary
choice of adaptive augmentation, we develop RAGate, a gating model, which
models conversation context and relevant inputs to predict if a conversational
system requires RAG for improved responses. We conduct extensive experiments on
devising and applying RAGate to conversational models and well-rounded analyses
of different conversational scenarios. Our experimental results and analysis
indicate that RAGate can be applied effectively in RAG-based conversational
systems to identify the system responses that warrant RAG, yielding high-quality
responses with high generation confidence. This study also identifies the
correlation between the generation's confidence level and the relevance of the
augmented knowledge.
Authors' comments: 12 pages, under review
To Eun Kim, Alireza Salemi, Andrew Drozdov, Fernando Diaz, Hamed Zamani
In the field of language modeling, models augmented with retrieval components have emerged as a promising solution to address several challenges faced in the natural language processing (NLP) field, including knowledge grounding, interpretability, and scalability. Despite the primary focus on NLP, we posit that the paradigm of retrieval-enhancement can be extended to a broader spectrum of machine learning (ML) such as computer vision, time series prediction, and computational biology. Therefore, this work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature in various domains in ML with a consistent notation, which is missing from the current literature. We also found that while a number of studies employ retrieval components to augment their models, there is a lack of integration with foundational Information Retrieval (IR) research. We bridge this gap between the seminal IR research and contemporary REML studies by investigating each component that comprises the REML framework. Ultimately, the goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
Mohammad Aliannejadi, Jacek Gwizdka, Hamed Zamani
At its core, information access and seeking is an interactive process. In
existing search engines, interactions are limited to a few pre-defined actions,
such as "requery", "click on a document", "scrolling up/down", "going to the
next result page", "leaving the search engine", etc. A major benefit of moving
towards generative IR systems is enabling users with a richer expression of
information need and feedback and free-form interactions in natural language
and beyond. In other words, the actions users take are no longer limited by the
clickable links and buttons available on the search engine result page and
users can express themselves freely through natural language. This can go even
beyond natural language, through images, videos, gestures, and sensors using
multi-modal generative IR systems. This chapter briefly discusses the role of
interaction in generative IR systems. We will first discuss different ways
users can express their information needs by interacting with generative IR
systems. We then explain how users can provide explicit or implicit feedback to
generative IR systems and how they can consume such feedback. Next, we will
cover how users can interactively refine retrieval results. We will expand upon
mixed-initiative interactions and discuss clarification and preference
elicitation in more detail. We then discuss proactive generative IR systems,
including context-aware recommendation, following up past conversations,
contributing to multi-party conversations, and feedback requests. Providing
explanation is another interaction type that we briefly discuss in this
chapter. We will also briefly describe multi-modal interactions in generative
information retrieval. Finally, we describe emerging frameworks and solutions
for user interfaces with generative AI systems.
Authors' comments: Draft of a chapter intended to appear in a forthcoming book on
generative information retrieval, co-edited by Chirag Shah and Ryen White
Yuxuan Wu, Xiao Yi, Yang Tan, Huiqun Yu, Guisheng Fan, Gaowei Zheng
Protein retrieval, which aims to untangle the relationship
between sequences, structures, and functions, helps advance biology.
The Basic Local Alignment Search Tool (BLAST), a sequence-similarity-based
algorithm, has proven effective in this field. However, existing tools
for protein retrieval prioritize sequence similarity and may
overlook proteins that are dissimilar in sequence but share homology or functionality. To
tackle this problem, we propose a novel protein retrieval framework
that mitigates the bias towards sequence similarity. Our framework first
harnesses protein language models (PLMs) to embed protein sequences within a
high-dimensional feature space, thereby enhancing the representation capacity
for subsequent analysis. Subsequently, an accelerated indexed vector database
is constructed to facilitate expedited access and retrieval of dense vectors.
Extensive experiments demonstrate that our framework can equally retrieve both
similar and dissimilar proteins. Moreover, this approach enables the
identification of proteins that conventional methods fail to uncover. This
framework will effectively assist in protein mining and empower the development
of biology.
Authors' comments: 16 pages, 12 figures
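The two-stage pipeline in the abstract above (embed sequences, then search a vector index) can be sketched with a brute-force cosine search. The hand-written vectors stand in for protein language model embeddings and the exhaustive scan stands in for the accelerated index; both, along with the names `cosine` and `nearest_proteins`, are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_proteins(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k protein ids whose embeddings are most similar to the query."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in index.items()]
    # Highest similarity first; ties broken by id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real PLM embeddings are far larger.
    index = {"P1": [1.0, 0.0], "P2": [0.9, 0.1], "P3": [0.0, 1.0]}
    print(nearest_proteins([1.0, 0.0], index))
```

Since ranking depends only on distance in embedding space, two proteins with little sequence overlap can still be retrieved together if their embeddings are close.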
Alex Oesterling, Claudio Mayrink Verdun, Carol Xuan Long, Alexander Glynn, Lucas Monteiro Paes, Sajani Vithana, Martina Cardone, Flavio P. Calmon
Image search and retrieval tasks can perpetuate harmful stereotypes, erase
cultural identities, and amplify social disparities. Current approaches to
mitigate these representational harms balance the number of retrieved items
across population groups defined by a small number of (often binary)
attributes. However, most existing methods overlook intersectional groups
determined by combinations of group attributes, such as gender, race, and
ethnicity. We introduce Multi-Group Proportional Representation (MPR), a novel
metric that measures representation across intersectional groups. We develop
practical methods for estimating MPR, provide theoretical guarantees, and
propose optimization algorithms to ensure MPR in retrieval. We demonstrate that
existing methods optimizing for equal and proportional representation metrics
may fail to promote MPR. Crucially, our work shows that optimizing MPR yields
more proportional representation across multiple intersectional groups
specified by a rich function class, often with minimal compromise in retrieval
accuracy.
Authors' comments: 48 pages, 33 figures. Accepted as poster at NeurIPS 2024. Code can be
found at
https://github.com/alex-oesterling/multigroup-proportional-representation
Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Yingmin Luo, Xu Li et al.
Understanding the content of events occurring in the video and their inherent
temporal logic is crucial for video-text retrieval. However, web-crawled
pre-training datasets often lack sufficient event information, and the widely
adopted video-level cross-modal contrastive learning also struggles to capture
detailed and complex video-text event alignment. To address these challenges,
we make improvements from both data and model perspectives. In terms of
pre-training data, we focus on supplementing the missing specific event content
and event temporal transitions with the proposed event augmentation strategies.
Based on the event-augmented data, we construct a novel Event-Aware Video-Text
Retrieval model, i.e., EA-VTR, which achieves powerful video-text retrieval
ability through superior video event awareness. EA-VTR can efficiently encode
frame-level and video-level visual representations simultaneously, enabling
detailed event content and complex event temporal cross-modal alignment,
ultimately enhancing the comprehensive understanding of video events. Our
method not only significantly outperforms existing approaches on multiple
datasets for Text-to-Video Retrieval and Video Action Recognition tasks, but
also demonstrates superior event content perception ability on Multi-event
Video-Text Retrieval and Video Moment Retrieval tasks, as well as outstanding
event temporal logic understanding ability on the Test of Time task.
Authors' comments: Accepted by ECCV 2024
Hao Sun, Yong Jiang, Bo Wang, Yingyan Hou, Yan Zhang, Pengjun Xie, Fei Huang
In-context learning (ICL) has been instrumental in adapting Large Language Models (LLMs) to downstream tasks using correct input-output examples. Recent advances have attempted to improve model performance through principles derived from mistakes, yet these approaches suffer from lack of customization and inadequate error coverage. To address these limitations, we propose Retrieved In-Context Principles (RICP), a novel teacher-student framework. In RICP, the teacher model analyzes mistakes from the student model to generate reasons and insights for preventing similar mistakes. These mistakes are clustered based on their underlying reasons for developing task-level principles, enhancing the error coverage of principles. During inference, the most relevant mistakes for each question are retrieved to create question-level principles, improving the customization of the provided guidance. RICP is orthogonal to existing prompting methods and does not require intervention from the teacher model during inference. Experimental results across seven reasoning benchmarks reveal that RICP effectively enhances performance when applied to various prompting strategies.
Jiaxin Ge, Xueying Jia, Vijay Viswanathan, Hongyin Luo, Graham Neubig
One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, such datasets often do not exist. Existing methods address this by creating such data from large language models (LLMs) and then distilling such knowledge into smaller models. However, these methods are limited by the quality of the LLMs' output, and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on 4 benchmarks, and the results show that our method significantly improves performance by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
Jacob Trzaska, Amit Ashok
We consider the problem of determining the spatial phase profile of a single-mode electromagnetic field. Our attention is on input states that are a statistical mixture of displaced and squeezed number states, a superset of Gaussian states. In particular, we derive the quantum Fisher information matrix (QFIM) for estimating the expansion coefficients of the wavefront in an orthonormal basis, finding that it is diagonal. Moreover, we show that a measurement saturating the QFIM always exists, and point to an adaptive strategy capable of implementing it. We then construct the optimal measurements for three particular states: mixtures of photon number, coherent, and single-mode squeezed vacuum states. Sensitivity of the measurements to nuisance parameters is explored.
João Rodrigues, António Branco
Retrieval-augmented generation resorts to content retrieved from external sources in order to improve the performance of large language models in downstream tasks. However, an excessive volume of retrieved content, the possible dispersion of its parts, or their lack of focus may end up having a detrimental rather than a beneficial effect. To mitigate this issue and improve retrieval-augmented generation, we propose a method to refine the retrieved content before it is included in the prompt by resorting to meta-prompting optimization. Put to empirical test with the demanding multi-hop question answering task from the StrategyQA dataset, the evaluation results indicate that this method outperforms a comparable retrieval-augmented system without this method by over 30%.
Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, Vassilina Nikoulina
Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e., with user queries and the datastore in 13 languages, and investigate which components, and with which adjustments, are needed to build a well-performing mRAG pipeline that can be used as a strong baseline in future work. Our findings highlight that despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjustments for the multilingual setting, to account for variations in the spelling of named entities. The main limitations to be addressed in future work include frequent code-switching in non-Latin alphabet languages, occasional fluency errors, wrong reading of the provided documents, and irrelevant retrieval. We release the code for the resulting mRAG baseline pipeline at https://github.com/naver/bergen.
Varun Nagaraj Rao, Siddharth Choudhary, Aditya Deshpande, Ravi Kumar Satzoda, Srikar Appalaraju
The scaling of large language models to encode all the world's knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored. Existing methods focus on models designed for single tasks. Furthermore, they are limited by the need for resource-intensive pre-training, additional parameter requirements, unaddressed modality prioritization, and a lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning. By integrating retrieval-augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non-retrieved baselines: +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps, and nearly +3% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.
Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang
Personalized dialogue generation, focusing on generating highly tailored
responses by leveraging persona profiles and dialogue context, has gained
significant attention in conversational AI applications. However, persona
profiles, a prevalent setting in current personalized dialogue datasets,
typically composed of merely four to five sentences, may not offer
comprehensive descriptions of the agent's persona, making it challenging
to generate truly personalized dialogues. To handle this problem, we propose
Learning Retrieval Augmentation for
Personalized DialOgue Generation
(LAPDOG), which studies the potential of leveraging external
knowledge for persona dialogue generation. Specifically, the proposed LAPDOG
model consists of a story retriever and a dialogue generator. The story
retriever uses a given persona profile as queries to retrieve relevant
information from the story document, which serves as a supplementary context to
augment the persona profile. The dialogue generator utilizes both the dialogue
history and the augmented persona profile to generate personalized responses.
For optimization, we adopt a joint training framework that collaboratively
learns the story retriever and dialogue generator, where the story retriever is
optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for
the dialogue generator to generate personalized responses. Experiments
conducted on the CONVAI2 dataset with ROCStory as a supplementary data source
show that the proposed LAPDOG method substantially outperforms the baselines,
indicating the effectiveness of the proposed method. The LAPDOG model code is
publicly available for further exploration at
https://github.com/hqsiswiliam/LAPDOG
Authors' comments: Accepted to EMNLP-2023
Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, Jong C. Park
Information retrieval models that aim to search for the documents relevant to the given query have achieved many successes and have been applied to diverse tasks. However, the query provided by the user is oftentimes very short, which makes it hard for retrievers to fetch relevant documents correctly. To tackle this, existing studies have proposed expanding the query with a couple of additional (user-related) features related to the query. Yet these features may be insufficient to augment the query effectively, even though plenty of information is available in a relational database to augment it. Motivated by this, we present a novel retrieval framework called Database-Augmented Query representation (DAQu), which augments the original query with various (query-related) metadata across multiple tables. In addition, as the number of features in the metadata can be very large and there is no order among them, we encode them with our graph-based set encoding strategy, which considers hierarchies of features in the database without order. We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database, demonstrating that our approach significantly enhances overall retrieval performance compared to existing query augmentation methods.