Jiarui Rao, Jionghao Lin
Massive Open Online Courses (MOOCs) have significantly enhanced educational
accessibility by offering a wide variety of courses and breaking down
traditional barriers related to geography, finance, and time. However, students
often face difficulties navigating the vast selection of courses, especially
when exploring new fields of study. Driven by this challenge, researchers have
been exploring course recommender systems to offer tailored guidance that
aligns with individual learning preferences and career aspirations. These
systems face particular challenges in effectively addressing the "cold start"
problem for new users. Recent advancements in recommender systems suggest
integrating large language models (LLMs) into the recommendation process to
enhance personalized recommendations and address the "cold start" problem.
Motivated by these advancements, our study introduces RAMO (Retrieval-Augmented
Generation for MOOCs), a system specifically designed to overcome the "cold
start" challenges of traditional course recommender systems. The RAMO system
leverages the capabilities of LLMs, along with Retrieval-Augmented Generation
(RAG)-facilitated contextual understanding, to provide course recommendations
through a conversational interface, aiming to enhance the e-learning
experience.
Authors' comments: 7 pages, this paper underwent a rigorous review process and was
officially accepted on May 31, 2024, for presentation at the Educational Data
Mining 2024 Workshop: Leveraging Large Language Models for Next Generation
Educational Technologies
Roman Novikov, Tianli Xu
We give a large class of examples of non-uniqueness for the phase retrieval
problem in multidimensions. Our examples include the case of functions with
strongly disconnected compact support.
Authors' comments: We substantially revised the first version
Xiangyang Li, Kuicai Dong, Yi Quan Lee, Wei Xia, Hao Zhang, Xinyi Dai, Yasheng Wang, Ruiming Tang
Despite the substantial success of Information Retrieval (IR) in various NLP
tasks, most IR systems predominantly handle queries and corpora in natural
language, neglecting the domain of code retrieval. Code retrieval is critically
important yet remains under-explored, with existing methods and benchmarks
inadequately representing the diversity of code in various domains and tasks.
Addressing this gap, we present COIR (Code Information Retrieval Benchmark), a
robust and comprehensive benchmark specifically designed to assess code
retrieval capabilities. COIR comprises ten meticulously curated code datasets,
spanning eight distinctive retrieval tasks across seven diverse domains. We
first discuss the construction of COIR and its diverse dataset composition.
Further, we evaluate nine widely used retrieval models using COIR, uncovering
significant difficulties in performing code retrieval tasks even with
state-of-the-art systems. To facilitate easy adoption and integration within
existing research workflows, COIR has been developed as a user-friendly Python
framework, readily installable via pip. It shares the same data schema as other
popular benchmarks like MTEB and BEIR, enabling seamless cross-benchmark
evaluations. Through COIR, we aim to invigorate research in the code retrieval
domain, providing a versatile benchmarking tool that encourages further
development and exploration of code retrieval systems.
The benchmark is available at https://github.com/CoIR-team/coir.
Authors' comments: ACL 2025 Main
Ali Safaya, Deniz Yuret
This paper introduces Neurocache, an approach to extend the effective context
size of large language models (LLMs) using an external vector cache to store
its past states. Like recent vector retrieval approaches, Neurocache uses an
efficient k-nearest-neighbor (kNN) algorithm to retrieve relevant past states
and incorporate them into the attention process. Neurocache improves upon
previous methods by (1) storing compressed states, which reduces cache size;
(2) performing a single retrieval operation per token, which increases inference
speed; and (3) extending the retrieval window to neighboring states, which
improves both language modeling and downstream task accuracy. Our experiments
show the effectiveness of Neurocache both for models trained from scratch and
for pre-trained models such as Llama2-7B and Mistral-7B when enhanced with the
cache mechanism. We also compare Neurocache with text retrieval methods and
show improvements in single-document question-answering and few-shot learning
tasks. The source code is available at
https://github.com/alisafaya/neurocache
Authors' comments: Long paper, published at the main conference NAACL'24
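As a rough illustration of the retrieval step described in the abstract above, the sketch below keeps compressed state vectors in a cache, finds the top-k states closest to a query state, and also returns the states that follow each hit. The names `StateCache`, `retrieve`, and `window` are hypothetical, the brute-force cosine search stands in for an efficient kNN index, and the choice to extend the window forward in sequence order is an assumption; this is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StateCache:
    """Hypothetical cache of compressed past states, kept in sequence order."""

    def __init__(self):
        self.states = []

    def add(self, state):
        self.states.append(list(state))

    def retrieve(self, query, k=2, window=1):
        """Return indices of the top-k states plus `window` states after each hit."""
        ranked = sorted(
            range(len(self.states)),
            key=lambda i: cosine(query, self.states[i]),
            reverse=True,
        )[:k]
        picked = []
        for i in ranked:
            # Assumption: neighbors are the states that follow the hit.
            for j in range(i, min(i + window + 1, len(self.states))):
                if j not in picked:
                    picked.append(j)
        return picked


cache = StateCache()
for s in [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]:
    cache.add(s)
hits = cache.retrieve([1.0, 0.1], k=1, window=1)
```

Here `hits` holds the index of the closest cached state followed by its successor; in a real system the returned states would be fed into the attention computation rather than inspected directly.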
Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, Bryan Catanzaro
Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG). In this work, we propose a novel instruction fine-tuning framework, RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG. In particular, the instruction-tuned LLMs work surprisingly well when a small fraction of ranking data is added to the training blend, and outperform existing expert ranking models, including the same LLM exclusively fine-tuned on a large amount of ranking data. For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-0409, and ChatQA-1.5, an open-source model with state-of-the-art performance on RAG benchmarks. Specifically, our Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks. It also performs comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating its superb capability for generalization to new domains.
Harrie Oosterhuis, Rolf Jagerman, Zhen Qin, Xuanhui Wang, Michael Bendersky
The traditional evaluation of information retrieval (IR) systems is generally
very costly as it requires manual relevance annotation from human experts.
Recent advancements in generative artificial intelligence -- specifically large
language models (LLMs) -- can generate relevance annotations at an enormous
scale with relatively small computational costs. Potentially, this could
alleviate the costs traditionally associated with IR evaluation and make it
applicable to numerous low-resource applications. However, generated relevance
annotations are not immune to (systematic) errors, and as a result, directly
using them for evaluation produces unreliable results.
In this work, we propose two methods based on prediction-powered inference
and conformal risk control that utilize computer-generated relevance
annotations to place reliable confidence intervals (CIs) around IR evaluation
metrics. Our proposed methods require a small number of reliable annotations
from which the methods can statistically analyze the errors in the generated
annotations. Using this information, we can place CIs around evaluation metrics
with strong theoretical guarantees. Unlike existing approaches, our conformal
risk control method is specifically designed for ranking metrics and can vary
its CIs per query and document. Our experimental results show that our CIs
accurately capture both the variance and bias in evaluation based on LLM
annotations, better than the typical empirical bootstrapping estimates. We hope
our contributions bring reliable evaluation to the many IR applications where
this was traditionally infeasible.
Authors' comments: KDD '24
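To make the idea of prediction-powered inference concrete, the sketch below estimates a mean relevance score from many generated labels and corrects it with the average error measured on a small human-labeled set, then attaches a normal-approximation confidence interval. The function name `ppi_mean_ci` and the toy labels are hypothetical, and this covers only the simple case of a mean; it is not the authors' method for ranking metrics.

```python
import math
from statistics import mean, variance


def ppi_mean_ci(pred_unlabeled, pred_labeled, true_labeled, z=1.96):
    """Prediction-powered estimate of a mean with a normal-approximation CI.

    pred_unlabeled: generated labels for items without human labels
    pred_labeled:   generated labels for the small human-labeled set
    true_labeled:   human labels for that same set
    """
    # Rectifier: average error of the generated labels on the labeled set.
    errors = [y - f for y, f in zip(true_labeled, pred_labeled)]
    estimate = mean(pred_unlabeled) + mean(errors)
    # The variance combines the uncertainty of both samples.
    var = variance(pred_unlabeled) / len(pred_unlabeled) + variance(errors) / len(errors)
    half_width = z * math.sqrt(var)
    return estimate, (estimate - half_width, estimate + half_width)


# Toy data: binary relevance labels (hypothetical values).
generated_only = [1, 1, 0, 1, 0, 1, 1, 0]
generated_on_labeled = [1, 1, 0, 1]
human_on_labeled = [1, 0, 0, 1]
estimate, (low, high) = ppi_mean_ci(generated_only, generated_on_labeled, human_on_labeled)
```

The interval widens when the generated labels disagree more with the human labels, which is how systematic annotation errors show up in the reported uncertainty.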
Aneeshan Sain, Pinaki Nath Chowdhury, Subhadeep Koley, Ayan Kumar Bhunia, Yi-Zhe Song
In this paper, we delve into the intricate dynamics of Fine-Grained
Sketch-Based Image Retrieval (FG-SBIR) by addressing a critical yet overlooked
aspect -- the choice of viewpoint during sketch creation. Unlike photo systems
that seamlessly handle diverse views through extensive datasets, sketch
systems, with limited data collected from fixed perspectives, face challenges.
Our pilot study, employing a pre-trained FG-SBIR model, highlights the system's
struggle when query-sketches differ in viewpoint from target instances.
Interestingly, however, a questionnaire shows that users desire autonomy, with a
significant percentage favouring view-specific retrieval. To reconcile this, we
advocate for a view-aware system, seamlessly accommodating both view-agnostic
and view-specific tasks. Overcoming dataset limitations, our first contribution
leverages multi-view 2D projections of 3D objects, instilling cross-modal view
awareness. The second contribution introduces a customisable cross-modal
feature through disentanglement, allowing effortless mode switching. Extensive
experiments on standard datasets validate the effectiveness of our method.
Authors' comments: Accepted in European Conference on Computer Vision (ECCV) 2024
Vitaly Bulgakov
In this paper, we focus on methods to reduce the size and improve the quality of the prompt context required for question-answering systems. Attempts to increase the number of retrieved chunked documents and thereby enlarge the context related to the query can significantly complicate the processing and decrease the performance of a Large Language Model (LLM) when generating responses to queries. It is well known that a large set of documents retrieved from a database in response to a query may contain irrelevant information, which often leads to hallucinations in the resulting answers. Our goal is to select the most semantically relevant documents, treating the discarded ones as outliers. We propose and evaluate several methods for identifying outliers by creating features that utilize the distances of embedding vectors, retrieved from the vector database, to both the centroid and the query vectors. The methods were evaluated by comparing the similarities of the retrieved LLM responses to ground-truth answers obtained using the OpenAI GPT-4o model. It was found that the greatest improvements were achieved with increasing complexity of the questions and answers.
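The distance-based filtering described above can be sketched as follows. Each retrieved embedding receives a feature built from its distance to the centroid of the retrieved set and its distance to the query vector, and documents whose feature is unusually large are discarded as outliers. The function name `filter_outliers`, the additive feature, and the mean-plus-one-standard-deviation cutoff are illustrative assumptions, not the specific methods evaluated in the paper.

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_outliers(query_vec, doc_vecs, n_std=1.0):
    """Return indices of documents kept after distance-based outlier removal."""
    dim = len(query_vec)
    centroid = [mean(v[d] for v in doc_vecs) for d in range(dim)]
    # Hypothetical feature: distance to centroid plus distance to query.
    scores = [euclidean(v, centroid) + euclidean(v, query_vec) for v in doc_vecs]
    cutoff = mean(scores) + n_std * pstdev(scores)
    return [i for i, s in enumerate(scores) if s <= cutoff]


# Four embeddings near the query and one far away (toy values).
docs = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [0.95, 0.05], [-3.0, 4.0]]
kept = filter_outliers([1.0, 0.0], docs)
```

Only the documents in `kept` would be placed in the prompt context, which shrinks the context while dropping the embedding that lies far from both the query and the rest of the retrieved set.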
Cole Brabec, Sivan Trajtenberg-Mills, Luca Daniel, Dirk Englund
We present the first phase retrieval algorithm guaranteed to solve the multidimensional phase retrieval problem in polynomial arithmetic complexity without prior information. The method successfully terminates in O(N log(N)) operations for Fourier measurements with cardinality N. The algorithm is guaranteed to succeed for a large class of objects, which we term "Schwarz objects". We further present an easy-to-calculate and well-conditioned diagonal operator that transforms any feasible phase-retrieval instance into one that is solved by our method. We derive our method by combining techniques from classical complex analysis, algebraic topology, and modern numerical analysis. Concretely, we pose the phase retrieval problem as a multiplicative Cousin problem, construct an approximate solution using a modified integral used for the Schwarz problem, and refine the approximate solution to an exact solution via standard optimization methods. We present numerical experiments demonstrating our algorithm's performance and its superiority to existing methods. Finally, we demonstrate that our method is robust against Gaussian noise.
Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang et al.
Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a "retrieval as generation" strategy.
David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, Stéphane Clinchant
Retrieval-Augmented Generation makes it possible to enhance Large Language Models with
external knowledge. In response to the recent popularity of generative LLMs,
many RAG approaches have been proposed, which involve a large number of
different configurations such as evaluation datasets, collections, metrics,
retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in
comparing approaches and understanding the impact of each component in the
pipeline. In this work, we study best practices that lay the groundwork for a
systematic evaluation of RAG and present BERGEN, an end-to-end library for
reproducible research standardizing RAG experiments. In an extensive study
focusing on QA, we benchmark different state-of-the-art retrievers, rerankers,
and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our
open-source library BERGEN is available at
https://github.com/naver/bergen.
Authors' comments: 29 pages
Yunqi Xu, Tianchi Cai, Jiyan Jiang, Xierui Song
The prevailing issue of factual inconsistency errors in conventional Retrieval Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE). Despite the various FCE methods proposed earlier, these methods are evaluated on datasets generated by specific Large Language Models (LLMs). Without a comprehensive benchmark, it remains unexplored how these FCE methods perform on other LLMs with different error distributions or even unseen error types, as these methods may fail to detect the error types generated by other LLMs. To fill this gap, in this paper, we propose Face4RAG, the first comprehensive FCE benchmark for RAG that is independent of the underlying LLM. Our benchmark consists of a synthetic dataset built upon a carefully designed typology of factual inconsistency errors and a real-world dataset constructed from six commonly used LLMs, enabling evaluation of FCE methods on specific error types or real-world error distributions. On the proposed benchmark, we discover that existing FCE methods fail to detect the logical fallacy, which refers to a mismatch of logic structures between the answer and the retrieved reference. To fix this issue, we further propose a new method called L-Face4RAG with two novel designs: logic-preserving answer decomposition and fact-logic FCE. Extensive experiments show that L-Face4RAG substantially outperforms previous methods for factual inconsistency detection on a wide range of tasks, notably beyond the RAG task that originally motivated it. Both the benchmark and our proposed method are publicly available at https://huggingface.co/datasets/yq27/Face4RAG.
Palak Jain, Livio Baldini Soares, Tom Kwiatkowski
We present RICHES, a novel approach that interleaves retrieval with sequence
generation tasks. RICHES offers an alternative to conventional RAG systems by
eliminating the need for separate retriever and generator. It retrieves
documents by directly decoding their contents, constrained on the corpus.
Unifying retrieval with generation allows us to adapt to diverse new tasks via
prompting alone. RICHES can work with any instruction-tuned model, without
additional training. It provides attributed evidence, supports multi-hop
retrievals, and interleaves thoughts to plan what to retrieve next, all
within a single decoding pass of the LLM. We demonstrate the strong performance
of RICHES across ODQA tasks including attributed and multi-hop QA.
Authors' comments: 18 pages, 3 figures, Preprint
Martin Rathmair
In recent work [P. Grohs and M. Rathmair. Stable Gabor Phase Retrieval and
Spectral Clustering. Communications on Pure and Applied Mathematics (2018)] and
[P. Grohs and M. Rathmair. Stable Gabor phase retrieval for multivariate
functions. Journal of the European Mathematical Society (2021)] the
instabilities of the Gabor phase retrieval problem, i.e. reconstructing $ f\in
L^2(\mathbb{R})$ from its spectrogram $|\mathcal{V}_g f|$ where $$\mathcal{V}_g
f(x,\xi) = \int_{\mathbb{R}} f(t)\overline{g(t-x)}e^{-2\pi i \xi
t}\,\mbox{d}t,$$ have been classified in terms of the connectivity of the
measurements. These findings were however crucially restricted to the case
where the window $g(t)=e^{-\pi t^2}$ is Gaussian. In this work we establish a
corresponding result for a number of other window functions including the
one-sided exponential $g(t)=e^{-t}\mathbb{1}_{[0,\infty)}(t)$ and
$g(t)=\exp(t-e^t)$. As a by-product we establish a modified version of
Poincar\'e's inequality which can be applied to non-differentiable functions
and may be of independent interest.
Authors' comments: 21 pages, 2 figures
Chenlong Deng, Kelong Mao, Zhicheng Dou
Legal case retrieval for sourcing similar cases is critical in upholding judicial fairness. Different from general web search, legal case retrieval involves processing lengthy, complex, and highly specialized legal documents. Existing methods in this domain often overlook the incorporation of legal expert knowledge, which is crucial for accurately understanding and modeling legal cases, leading to unsatisfactory retrieval performance. This paper introduces KELLER, a legal knowledge-guided case reformulation approach based on large language models (LLMs) for effective and interpretable legal case retrieval. By incorporating professional legal knowledge about crimes and law articles, we enable large language models to accurately reformulate the original legal case into concise sub-facts of crimes, which contain the essential information of the case. Extensive experiments on two legal case retrieval benchmarks demonstrate that KELLER achieves superior retrieval performance and robustness on complex legal case queries compared with existing methods.
Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo
Documents are visually rich structures that convey information through text,
but also figures, page layouts, tables, or even fonts. Since modern retrieval
systems mainly rely on the textual information they extract from document pages
to index documents (often through lengthy and brittle processes), they struggle
to exploit key visual cues efficiently. This limits their capabilities in many
practical document retrieval applications such as Retrieval Augmented
Generation (RAG). To benchmark current systems on visually rich document
retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe,
composed of various page-level retrieval tasks spanning multiple domains,
languages, and practical settings. The inherent complexity and performance
shortcomings of modern systems motivate a new concept: doing document retrieval
by directly embedding the images of the document pages. We release ColPali, a
Vision Language Model trained to produce high-quality multi-vector embeddings
from images of document pages. Combined with a late interaction matching
mechanism, ColPali largely outperforms modern document retrieval pipelines
while being drastically simpler, faster and end-to-end trainable. We release
models, data, code and benchmarks under open licenses at https://hf.co/vidore.
Authors' comments: Published as a conference paper at ICLR 2025
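The late interaction matching mentioned in the abstract above can be illustrated with a ColBERT-style scoring rule: every query vector is matched to its most similar page vector, and the resulting maxima are summed. The sketch below uses toy two-dimensional vectors and the hypothetical names `late_interaction_score`, `page_a`, and `page_b`; it shows the scoring rule only and is not the released ColPali code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs, page_vecs):
    """Sum, over query vectors, of the best dot product with any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy multi-vector embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

score_a = late_interaction_score(query, page_a)
score_b = late_interaction_score(query, page_b)
pages = {"page_a": score_a, "page_b": score_b}
ranking = sorted(pages, key=pages.get, reverse=True)
```

Because each query vector picks its own best match, a page can score well even when the relevant content is spread across different regions of the page image.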
Yuying Li, Gaoyang Liu, Chen Wang, Yang Yang
Retrieval-Augmented Generation (RAG) is a state-of-the-art technique that mitigates issues such as hallucinations and knowledge staleness in Large Language Models (LLMs) by retrieving relevant knowledge from an external database to assist in content generation. Existing research has demonstrated potential privacy risks associated with the LLMs of RAG. However, the privacy risks posed by the integration of an external database, which often contains sensitive data such as medical records or personal identities, have remained largely unexplored. In this paper, we aim to bridge this gap by focusing on the membership privacy of RAG's external database, with the aim of determining whether a given sample is part of the RAG's database. Our basic idea is that if a sample is in the external database, it will exhibit a high degree of semantic similarity to the text generated by the RAG system. We present S$^2$MIA, a Membership Inference Attack that utilizes the Semantic Similarity between a given sample and the content generated by the RAG system. With our proposed S$^2$MIA, we demonstrate the potential to breach the membership privacy of the RAG database. Extensive experimental results demonstrate that S$^2$MIA can achieve strong inference performance compared with five existing MIAs, and is able to evade three representative defenses.
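The basic idea stated in the abstract above, that a sample stored in the database tends to resemble the text the RAG system generates, can be sketched with a simple threshold rule. The token-overlap (Jaccard) measure, the `threshold` value, and the function names are stand-in assumptions chosen to keep the example self-contained; they are not the similarity measure or decision rule used in the paper.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def infer_membership(sample, generated, threshold=0.5):
    """Guess that `sample` is in the database if it resembles the generated text."""
    return jaccard(sample, generated) >= threshold


# Toy strings (hypothetical): generated text that echoes the sample vs. text that does not.
sample = "the patient has type 2 diabetes"
echoing_output = "the patient has type 2 diabetes and asthma"
unrelated_output = "no records were found for this query"

is_member = infer_membership(sample, echoing_output)
is_non_member = infer_membership(sample, unrelated_output)
```

In practice the threshold would need to be calibrated on samples with known membership, since generic or highly predictable text can score high without being in the database.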
Xiaoyu Shen, Rexhina Blloshmi, Dawei Zhu, Jiahuan Pei, Wei Zhang
Retrieval-augmented generation has gained popularity as a framework to enhance large language models with external knowledge. However, its effectiveness hinges on the retrieval robustness of the model. If the model lacks retrieval robustness, its performance is constrained by the accuracy of the retriever, resulting in significant compromises when the retrieved context is irrelevant. In this paper, we evaluate the "implicit" retrieval robustness of various large language models, instructing them to directly output the final answer without explicitly judging the relevance of the retrieved context. Our findings reveal that fine-tuning on a mix of gold and distracting context significantly enhances the model's robustness to retrieval inaccuracies, while still maintaining its ability to extract correct answers when retrieval is accurate. This suggests that large language models can implicitly handle relevant or irrelevant retrieved context by learning solely from the supervision of the final answer in an end-to-end manner. Introducing an additional process for explicit relevance judgment can be unnecessary and can disrupt the end-to-end approach.
Qiancheng Xu, Yongqi Li, Heming Xia, Wenjie Li
Tool learning, which aims to enhance and expand the capabilities of large language models (LLMs) with external tools, has gained significant attention recently. Current methods have shown that LLMs can effectively handle a certain number of tools through in-context learning or fine-tuning. However, in real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component. Tool retrieval is nontrivial due to the following challenges: 1) complex user instructions and tool descriptions; 2) misalignment between tool retrieval and tool usage models. To address the above issues, we propose to enhance tool retrieval with iterative feedback from the large language model. Specifically, we prompt the tool usage model, i.e., the LLM, to provide feedback for the tool retriever model over multiple rounds, which can progressively improve the tool retriever's understanding of instructions and tools and reduce the gap between the two standalone components. We build a unified and comprehensive benchmark to evaluate tool retrieval models. Extensive experiments indicate that our proposed approach achieves advanced performance in both in-domain and out-of-domain evaluation.
Robert Friel, Masha Belyi, Atindriyo Sanyal
Retrieval-Augmented Generation (RAG) has become a standard architectural pattern for incorporating domain-specific knowledge into user-facing chat applications powered by Large Language Models (LLMs). RAG systems are characterized by (1) a document retriever that queries a domain-specific corpus for context information relevant to an input query, and (2) an LLM that generates a response based on the provided query and context. However, comprehensive evaluation of RAG systems remains a challenge due to the lack of unified evaluation criteria and annotated datasets. In response, we introduce RAGBench: the first comprehensive, large-scale RAG benchmark dataset of 100k examples. It covers five unique industry-specific domains and various RAG task types. RAGBench examples are sourced from industry corpora such as user manuals, making it particularly relevant for industry applications. Further, we formalize the TRACe evaluation framework: a set of explainable and actionable RAG evaluation metrics applicable across all RAG domains. We release the labeled dataset at https://huggingface.co/datasets/rungalileo/ragbench. RAGBench explainable labels facilitate holistic evaluation of RAG systems, enabling actionable feedback for continuous improvement of production applications. Through extensive benchmarking, we find that LLM-based RAG evaluation methods struggle to compete with a finetuned RoBERTa model on the RAG evaluation task. We identify areas where existing approaches fall short and propose the adoption of RAGBench with TRACe towards advancing the state of RAG evaluation systems.