Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Vassilis N. Ioannidis, Karthik Subbian, Jure Leskovec et al.
Large language model (LLM) agents have demonstrated impressive capability in
utilizing external tools and knowledge to boost accuracy and reduce
hallucinations. However, developing the prompting techniques that make LLM
agents able to effectively use external tools and knowledge is a heuristic and
laborious task. Here, we introduce AvaTaR, a novel and automatic framework that
optimizes an LLM agent to effectively use the provided tools and improve its
performance on a given task/domain. During optimization, we design a comparator
module to iteratively provide insightful and holistic prompts to the LLM agent
via reasoning between positive and negative examples sampled from training
data. We demonstrate AvaTaR on four complex multimodal retrieval datasets
featuring textual, visual, and relational information. We find AvaTaR
consistently outperforms state-of-the-art approaches across all four
challenging tasks and exhibits strong generalization ability when applied to
novel cases, achieving an average relative improvement of 14% on the Hit@1
metric. Code and dataset are available at https://github.com/zou-group/avatar.
Authors' comments: 19 pages, 8 figures, 6 tables
Zhonghao Li, Xuming Hu, Aiwei Liu, Kening Zheng, Sirui Huang, Hui Xiong
Large Language Models (LLMs) are limited by their parametric knowledge,
leading to hallucinations in knowledge-extensive tasks. To address this,
Retrieval-Augmented Generation (RAG) incorporates external document chunks to
expand LLM knowledge. Furthermore, compressing information from document chunks
through extraction or summarization can improve LLM performance. Nonetheless,
LLMs still struggle to notice and utilize scattered key information, a problem
known as the "lost-in-the-middle" syndrome. Therefore, we typically need to
restructure the content for the LLM to recognize the key information. We propose
$\textit{Refiner}$, an end-to-end extract-and-restructure paradigm that
operates in the post-retrieval process of RAG. $\textit{Refiner}$ leverages a
single decoder-only LLM to adaptively extract query-relevant contents verbatim
along with the necessary context, and to section them based on their
interconnectedness, thereby highlighting information distinction and aligning
downstream LLMs with the original context effectively. Experiments show that a
trained $\textit{Refiner}$ (with 7B parameters) yields significant gains in
downstream LLM answer accuracy, and outperforms other
state-of-the-art advanced RAG and concurrent compressing approaches in various
single-hop and multi-hop QA tasks. Notably, $\textit{Refiner}$ achieves an 80.5%
token reduction and a 1.6-7.0% improvement margin in multi-hop tasks compared
to the next best solution. $\textit{Refiner}$ is a plug-and-play solution that
can be seamlessly integrated with RAG systems, facilitating its application
across diverse open-source frameworks.
Authors' comments: 8 pages
Xinhao Zhang, Jinghan Zhang, Fengran Mo, Yuzhong Chen, Kunpeng Liu
Feature generation can significantly enhance learning outcomes, particularly for tasks with limited data. An effective way to improve feature generation is by expanding the current feature space using existing features and enriching the informational content. However, generating new, interpretable features in application fields often requires domain-specific knowledge about the existing features. This paper introduces a new method RAFG for generating reasonable and explainable features specific to domain classification tasks. To generate new features with interpretability in domain knowledge, we perform information retrieval on existing features to identify potential feature associations, and utilize these associations to generate meaningful features. Furthermore, we develop a Large Language Model (LLM)-based framework for feature generation with reasoning to verify and filter features during the generation process. Experiments across several datasets in medical, economic, and geographic domains show that our RAFG method produces high-quality, meaningful features and significantly improves classification performance compared with baseline methods.
Zhenrui Yue, Huimin Zeng, Lanyu Shang, Yifan Liu, Yang Zhang, Dong Wang
The rapid propagation of misinformation poses substantial risks to public
interest. To combat misinformation, large language models (LLMs) are adapted to
automatically verify claim credibility. Nevertheless, existing methods heavily
rely on the embedded knowledge within LLMs and/or black-box APIs for evidence
collection, leading to subpar performance with smaller LLMs or with unreliable
context. In this paper, we propose retrieval augmented fact verification
through the synthesis of contrasting arguments (RAFTS). Upon input claims,
RAFTS starts with evidence retrieval, where we design a retrieval pipeline to
collect and re-rank relevant documents from verifiable sources. Then, RAFTS
forms contrastive arguments (i.e., supporting or refuting) conditioned on the
retrieved evidence. In addition, RAFTS leverages an embedding model to identify
informative demonstrations, followed by in-context prompting to generate the
prediction and explanation. Our method effectively retrieves relevant documents
as evidence and evaluates arguments from varying perspectives, incorporating
nuanced information for fine-grained decision-making. Combined with informative
in-context examples as prior, RAFTS achieves significant improvements to
supervised and LLM baselines without complex prompts. We demonstrate the
effectiveness of our method through extensive experiments, where RAFTS can
outperform GPT-based methods with a significantly smaller 7B LLM.
Authors' comments: Accepted to ACL 2024
Jari Kolehmainen, Aditya Gourav, Prashanth Gurunath Shivakumar, Yile Gu, Ankur Gandhe, Ariya Rastrow, Grant Strimel, Ivan Bulyko
Retrieval is a widely adopted approach for improving language models by leveraging external information. As the field moves towards multi-modal large language models, it is important to extend purely text-based methods to incorporate other modalities in retrieval as well, for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques. We demonstrate the effectiveness of our retrieval approaches empirically by applying them to automatic speech recognition tasks with access to external information. Under this setting, we show that speech-based multi-modal retrieval outperforms text-based retrieval, and yields up to 50% improvement in word error rate over the multi-modal language model baseline. Furthermore, we achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
Nimol Thuon
The search engine process is crucial for document content retrieval. For Khmer documents, a tool is needed to extract essential keywords. Despite the daily generation of significant Khmer content, Cambodians struggle to find necessary documents due to the lack of an effective semantic searching tool. Even Google does not deliver high accuracy for Khmer content. Semantic search engines improve search results by employing advanced algorithms to understand various content types. With the rise in Khmer digital content such as reports, articles, and social media feedback, enhanced search capabilities are essential. This research proposes the first Khmer Semantic Search Engine (KSE), designed to improve traditional Khmer search methods. Utilizing semantic matching techniques and formally annotated semantic content, our tool extracts meaningful keywords from user queries, performs precise matching, and provides the best matching offline documents and online URL documents. We propose two semantic search frameworks based on keyword extraction and semantic search matching. Additionally, we developed tools for data preparation, including document addition and manual keyword extraction. To evaluate performance, we created a ground truth dataset and discussed issues related to searching and semantic search. Our findings show how understanding search term semantics can lead to more accurate results.
Yuhan Quan, Huan Zhao, Jinfeng Yi, Yuqiang Chen
CAD (Computer-Aided Design) plays a crucial role in the mechanical industry, where large numbers of similar-shaped CAD parts are often created. Efficiently reusing these parts is key to reducing design and production costs for enterprises. Retrieval systems are vital for achieving CAD reuse, but the complex shapes of CAD models are difficult to accurately describe using text or keywords, making traditional retrieval methods ineffective. While existing representation learning approaches have been developed for CAD, manually labeling similar samples in these methods is expensive. Additionally, CAD models' unique parameterized data structure presents challenges for applying existing 3D shape representation learning techniques directly. In this work, we propose GC-CAD, a self-supervised contrastive graph neural network-based method for mechanical CAD retrieval that directly models parameterized CAD raw files. GC-CAD consists of two key modules: structure-aware representation learning and a contrastive graph learning framework. The method leverages graph neural networks to extract both geometric and topological information from CAD models, generating feature representations. We then introduce a simple yet effective contrastive graph learning framework, enabling the model to train without manual labels and generate retrieval-ready representations. Experimental results on four datasets, including human evaluation, demonstrate that the proposed method achieves significant accuracy improvements and up to 100 times efficiency improvement over the baseline methods.
Cheng Niu, Yang Guan, Yuanhao Wu, Juno Zhu, Juntong Song, Randy Zhong, Kaihua Zhu, Siliang Xu et al.
The proliferation of fake news poses a significant threat not only by disseminating misleading information but also by undermining the very foundations of democracy. The recent advance of generative artificial intelligence has further exacerbated the challenge of distinguishing genuine news from fabricated stories. In response to this challenge, we introduce VeraCT Scan, a novel retrieval-augmented system for fake news detection. This system operates by extracting the core facts from a given piece of news and subsequently conducting an internet-wide search to identify corroborating or conflicting reports. The credibility of the sources is then leveraged for information verification. Besides determining the veracity of news, the system also provides transparent evidence and reasoning to support its conclusions, improving the interpretability of and trust in the results. In addition to GPT-4 Turbo, Llama-2 13B is also fine-tuned for news content understanding, information verification, and reasoning. Both implementations have demonstrated state-of-the-art accuracy in the realm of fake news detection.
Maciej Pióro, Maciej Wołczyk, Razvan Pascanu, Johannes von Oswald, João Sacramento
A new breed of gated-linear recurrent neural networks has reached state-of-the-art performance on a range of sequence modeling problems. Such models naturally handle long sequences efficiently, as the cost of processing a new input is independent of sequence length. Here, we explore another advantage of these stateful sequence models, inspired by the success of model merging through parameter interpolation. Building on parallels between fine-tuning and in-context learning, we investigate whether we can treat internal states as task vectors that can be stored, retrieved, and then linearly combined, exploiting the linearity of recurrence. We study this form of fast model merging on Mamba-2.8b, a pretrained recurrent model, and present preliminary evidence that simple linear state interpolation methods suffice to improve next-token perplexity as well as downstream in-context learning task performance.
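The linear state combination described above can be illustrated with a minimal sketch. It assumes each stored state is a flat list of floats of equal length and that the mixing weights are chosen by the caller; the function name `interpolate_states` is hypothetical, and real recurrent states (for example in Mamba-2.8b) are multi-dimensional tensors handled by the model's own code.

```python
def interpolate_states(states, weights):
    """Linearly combine stored states: sum_j weights[j] * states[j].

    Assumption: every state is a flat list of floats with the same length.
    """
    if not states or len(states) != len(weights):
        raise ValueError("need one weight per state and at least one state")
    dim = len(states[0])
    if any(len(state) != dim for state in states):
        raise ValueError("all states must have the same dimension")
    # Combine coordinate by coordinate.
    return [
        sum(weight * state[i] for weight, state in zip(weights, states))
        for i in range(dim)
    ]
```

With uniform weights this reduces to a plain average of the retrieved states, which is the simplest instance of the interpolation the abstract mentions.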
Zile Qiao, Wei Ye, Yong Jiang, Tong Mo, Pengjun Xie, Weiping Li, Fei Huang, Shikun Zhang
Retrieval-augmented language models (RALMs) have recently shown great potential in mitigating the limitations of implicit knowledge in LLMs, such as untimely updating of the latest expertise and unreliable retention of long-tail knowledge. However, neither the external knowledge base nor the retriever can guarantee reliability, so the retrieved knowledge may be unhelpful or even misleading for LLM generation. In this paper, we introduce Supportiveness-based Knowledge Rewriting (SKR), a robust and pluggable knowledge rewriter inherently optimized for LLM generation. Specifically, we introduce the novel concept of "supportiveness"--which represents how effectively a knowledge piece facilitates downstream tasks--by considering the perplexity impact of augmented knowledge on the response text of a white-box LLM. Based on knowledge supportiveness, we first design a training data curation strategy for our rewriter model, effectively identifying and filtering out poor or irrelevant rewrites (e.g., with low supportiveness scores) to improve data efficacy. We then introduce the direct preference optimization (DPO) algorithm to align the generated rewrites to optimal supportiveness, guiding the rewriter model to summarize augmented content that better improves the final response. Comprehensive evaluations across six popular knowledge-intensive tasks and four LLMs have demonstrated the effectiveness and superiority of SKR. With only 7B parameters, SKR has shown better knowledge rewriting capability than GPT-4, the current state-of-the-art general-purpose LLM.
MohammadTaghi Hajiaghayi, Sébastien Lahaie, Keivan Rezaei, Suho Shin
In the field of computational advertising, the integration of ads into the
outputs of large language models (LLMs) presents an opportunity to support
these services without compromising content integrity. This paper introduces
novel auction mechanisms for ad allocation and pricing within the textual
outputs of LLMs, leveraging retrieval-augmented generation (RAG). We propose a
segment auction where an ad is probabilistically retrieved for each discourse
segment (paragraph, section, or entire output) according to its bid and
relevance, following the RAG framework, and priced according to competing bids.
We show that our auction maximizes logarithmic social welfare, a new notion of
welfare that balances allocation efficiency and fairness, and we characterize
the associated incentive-compatible pricing rule. These results are extended to
multi-ad allocation per segment. An empirical evaluation validates the
feasibility and effectiveness of our approach over several ad auction
scenarios, and exhibits inherent tradeoffs in metrics as we allow the LLM more
flexibility to allocate ads.
Authors' comments: NeurIPS 2024
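The probabilistic retrieval step of the segment auction can be sketched as follows. The abstract only states that an ad is retrieved "according to its bid and relevance"; the multiplicative score (bid times relevance) used here is an assumption for illustration, as are the function names. The pricing rule is not sketched.

```python
import random


def allocation_probabilities(bids, relevances):
    """Probability of each ad winning one segment.

    Assumption: the score of an ad is bid * relevance, normalized over all ads.
    """
    scores = [bid * rel for bid, rel in zip(bids, relevances)]
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one ad needs a positive score")
    return [score / total for score in scores]


def sample_ad(bids, relevances, rng):
    """Draw the index of the winning ad for one segment."""
    probs = allocation_probabilities(bids, relevances)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Calling `sample_ad(bids, relevances, random.Random(seed))` once per discourse segment gives a randomized allocation in which higher bids and higher relevance both raise the chance of being shown.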
Gabriel de Jesus
Tetun is one of Timor-Leste's official languages alongside Portuguese. It is a low-resource language with over 932,400 speakers that started developing when Timor-Leste restored its independence in 2002. The media mainly uses Tetun, and more than ten national online newspapers actively broadcast news in Tetun every day. However, since information retrieval-based solutions for Tetun do not exist, finding Tetun information on the internet is challenging. This work aims to investigate and develop solutions that can enable the application of information retrieval techniques to develop search solutions for Tetun. We present a preliminary result of an experiment conducted on the task of ad-hoc retrieval in Tetun.
Muhammad Shihab Rashid, Jannat Ara Meem, Yue Dong, Vagelis Hristidis
Query expansion has been employed for a long time to improve the accuracy of query retrievers. Earlier works relied on pseudo-relevance feedback (PRF) techniques, which augment a query with terms extracted from documents retrieved in a first stage. However, the documents may be noisy, hindering the effectiveness of the ranking. To avoid this, recent studies have instead used Large Language Models (LLMs) to generate additional content to expand a query. These techniques are prone to hallucination and also focus on the LLM usage cost. However, the cost may be dominated by the retrieval in several important practical scenarios, where the corpus is only available via APIs which charge a fee per retrieved document. We propose combining classic PRF techniques with LLMs and create a progressive query expansion algorithm, ProQE, that iteratively expands the query as it retrieves more documents. ProQE is compatible with both sparse and dense retrieval systems. Our experimental results on four retrieval datasets show that ProQE outperforms state-of-the-art baselines by 37% and is the most cost-effective.
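For readers unfamiliar with pseudo-relevance feedback, the classic idea of augmenting a query with terms from first-stage documents can be sketched as below. This is plain PRF, not ProQE: it uses no LLM and no iteration, and the whitespace tokenization, raw term counts, and the function name `expand_query` are assumptions made for illustration.

```python
from collections import Counter


def expand_query(query, feedback_docs, num_terms):
    """Append the most frequent new terms from first-stage documents to the query."""
    query_terms = query.lower().split()
    seen = set(query_terms)
    # Count terms in the feedback documents that are not already in the query.
    counts = Counter(
        term
        for doc in feedback_docs
        for term in doc.lower().split()
        if term not in seen
    )
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    new_terms = [term for term, _ in ranked[:num_terms]]
    return " ".join(query_terms + new_terms)
```

A progressive variant would call such a step repeatedly as more documents are retrieved, which matters when each retrieved document incurs a fee.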
Girma M. Yilma, Jose A. Ayala-Romero, Andres Garcia-Saavedra, Xavier Costa-Perez
Large Language Models (LLMs) have immense potential to transform the
telecommunications industry. They could help professionals understand complex
standards, generate code, and accelerate development. However, traditional LLMs
struggle with the precision and source verification essential for telecom work.
To address this, specialized LLM-based solutions tailored to telecommunication
standards are needed. Retrieval-augmented generation (RAG) offers a way to
create precise, fact-based answers. This paper proposes TelecomRAG, a framework
for a Telecommunication Standards Assistant that provides accurate, detailed,
and verifiable responses. Our implementation, using a knowledge base built from
3GPP Release 16 and Release 18 specification documents, demonstrates how this
assistant surpasses generic LLMs, offering superior accuracy, technical depth,
and verifiability, and thus significant value to the telecommunications field.
Authors' comments: 7 pages, 2 figures, 3 tables
Ashkan Alinejad, Krtin Kumar, Ali Vahdat
Question answering (QA) systems utilizing Large Language Models (LLMs) heavily depend on the retrieval component to provide them with domain-specific information and reduce the risk of generating inaccurate responses or hallucinations. Although the evaluation of retrievers dates back to the early research in Information Retrieval, assessing their performance within LLM-based chatbots remains a challenge. This study proposes a straightforward baseline for evaluating retrievers in Retrieval-Augmented Generation (RAG)-based chatbots. Our findings demonstrate that this evaluation framework provides a better picture of how the retriever performs and is more aligned with the overall performance of the QA system. Although conventional metrics such as precision, recall, and F1 score may not fully capture LLMs' capabilities - as they can yield accurate responses despite imperfect retrievers - our method considers LLMs' strengths to ignore irrelevant contexts, as well as potential errors and hallucinations in their responses.
Yan Gao, Zhiwei Cao, Zhongjian Miao, Baosong Yang, Shiyu Liu, Min Zhang, Jinsong Su
To achieve non-parametric NMT domain adaptation, $k$-Nearest-Neighbor Machine
Translation ($k$NN-MT) constructs an external datastore to store
domain-specific translation knowledge, which derives a $k$NN distribution to
interpolate the prediction distribution of the NMT model via a linear
interpolation coefficient $\lambda$. Despite its success, $k$NN retrieval at
each timestep leads to substantial time overhead. To address this issue,
dominant studies resort to $k$NN-MT with adaptive retrieval ($k$NN-MT-AR),
which dynamically estimates $\lambda$ and skips $k$NN retrieval if $\lambda$ is
less than a fixed threshold. Unfortunately, $k$NN-MT-AR does not yield
satisfactory results. In this paper, we first conduct a preliminary study to
reveal two key limitations of $k$NN-MT-AR: 1) the optimization gap leads to
inaccurate estimation of $\lambda$ for determining $k$NN retrieval skipping,
and 2) using a fixed threshold fails to accommodate the dynamic demands for
$k$NN retrieval at different timesteps. To mitigate these limitations, we then
propose $k$NN-MT with dynamic retrieval ($k$NN-MT-DR) that significantly
extends vanilla $k$NN-MT in two aspects. Firstly, we equip $k$NN-MT with an
MLP-based classifier for determining whether to skip $k$NN retrieval at each
timestep. Particularly, we explore several carefully-designed scalar features
to fully exert the potential of the classifier. Secondly, we propose a
timestep-aware threshold adjustment method to dynamically generate the
threshold, which further improves the efficiency of our model. Experimental
results on widely used datasets demonstrate the effectiveness and
generality of our model. Our code is available at
https://github.com/DeepLearnXMU/knn-mt-dr.
Authors' comments: Accepted to ACL 2024 Findings
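The interpolation and fixed-threshold skipping described in the abstract above can be written out as a small sketch. It covers only the $k$NN-MT-AR style baseline (skip retrieval when $\lambda$ is below a fixed threshold), not the MLP-based classifier or the timestep-aware threshold of $k$NN-MT-DR. Distributions are represented as token-to-probability dictionaries, and the function names and the `retrieve_knn` callback are hypothetical.

```python
def interpolate(p_nmt, p_knn, lam):
    """Return lam * p_knn + (1 - lam) * p_nmt over the union of tokens."""
    tokens = set(p_nmt) | set(p_knn)
    return {
        token: lam * p_knn.get(token, 0.0) + (1.0 - lam) * p_nmt.get(token, 0.0)
        for token in tokens
    }


def predict_step(p_nmt, retrieve_knn, lam, threshold):
    """One decoding step with fixed-threshold retrieval skipping.

    retrieve_knn is a zero-argument callable that returns the kNN
    distribution; it is only called when lam reaches the threshold.
    Returns (distribution, retrieval_was_used).
    """
    if lam < threshold:
        return dict(p_nmt), False  # retrieval skipped, NMT prediction used as-is
    return interpolate(p_nmt, retrieve_knn(), lam), True
```

Passing the retrieval as a callable makes the saving explicit: when the step is skipped, the datastore is never queried.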
Valentin Thomas, Junwei Ma, Rasa Hosseinzadeh, Keyvan Golestan, Guangwei Yu, Maksims Volkovs, Anthony Caterini
Tabular data is a pervasive modality spanning a wide range of domains, and the inherent diversity poses a considerable challenge for deep learning. Recent advancements using transformer-based in-context learning have shown promise on smaller and less complex datasets, but have struggled to scale to larger and more complex ones. To address this limitation, we propose a combination of retrieval and fine-tuning: we can adapt the transformer to a local subset of the data by collecting nearest neighbours, and then perform task-specific fine-tuning with this retrieved set of neighbours in context. Using TabPFN as the base model -- currently the best tabular in-context learner -- and applying our retrieval and fine-tuning scheme on top results in what we call a locally-calibrated PFN, or LoCalPFN. We conduct extensive evaluation on 95 datasets curated by TabZilla from OpenML, upon which we establish a new state-of-the-art with LoCalPFN -- even with respect to tuned tree-based models. Notably, we show a significant boost in performance compared to the base in-context model, demonstrating the efficacy of our approach and advancing the frontier of deep learning in tabular data.
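The retrieval half of this scheme, collecting nearest neighbours to form a local context for a query row, can be sketched as follows. Euclidean distance over numeric features and the function names are assumptions for illustration; the fine-tuning step and the TabPFN model itself are not shown.

```python
import math


def k_nearest(query, rows, k):
    """Indices of the k training rows closest to the query (Euclidean distance)."""
    order = sorted(range(len(rows)), key=lambda i: math.dist(query, rows[i]))
    return order[:k]


def local_context(query, rows, labels, k):
    """Pair each retrieved neighbour with its label to form the in-context set."""
    return [(rows[i], labels[i]) for i in k_nearest(query, rows, k)]
```

The returned pairs would then be supplied to the in-context learner as the local subset for that query, in place of the full training set.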
Caleb Ziems, William Held, Jane Dwivedi-Yu, Diyi Yang
Information Retrieval (IR) systems are designed to deliver relevant content,
but traditional systems may not optimize rankings for fairness, neutrality, or
the balance of ideas. Consequently, IR can often introduce indexical biases, or
biases in the positional order of documents. Although indexical bias can
demonstrably affect people's opinion, voting patterns, and other behaviors,
these issues remain understudied as the field lacks reliable metrics and
procedures for automatically measuring indexical bias. Towards this end, we
introduce the PAIR framework, which supports automatic bias audits for ranked
documents or entire IR systems. After introducing DUO, the first
general-purpose automatic bias metric, we run an extensive evaluation of 8 IR
systems on a new corpus of 32k synthetic and 4.7k natural documents, with 4k
queries spanning 1.4k controversial issue topics. A human behavioral study
validates our approach, showing that our bias metric can help predict when and
how indexical bias will shift a reader's opinion.
Authors' comments: ACL 2024
Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong
Multi-modal retrieval is becoming increasingly popular in practice. However,
existing retrievers are mostly text-oriented and lack the capability to
process visual information. Despite the presence of vision-language models like
CLIP, the current methods are severely limited in representing the text-only
and image-only data. In this work, we present a new embedding model VISTA for
universal multi-modal retrieval. Our work brings forth threefold technical
contributions. Firstly, we introduce a flexible architecture which extends a
powerful text encoder with the image understanding capability by introducing
visual token embeddings. Secondly, we develop two data generation strategies,
which bring high-quality composed image-text to facilitate the training of the
embedding model. Thirdly, we introduce a multi-stage training algorithm, which
first aligns the visual token embedding with the text encoder using massive
weakly labeled data, and then develops multi-modal representation capability
using the generated composed image-text data. In our experiments, VISTA
achieves superior performances across a variety of multi-modal retrieval tasks
in both zero-shot and supervised settings. Our model, data, and source code are
available at https://github.com/FlagOpen/FlagEmbedding.
Authors' comments: Accepted to ACL 2024 main conference
Kohei Makino, Makoto Miwa, Yutaka Sasaki
This paper addresses a crucial challenge in retrieval-augmented
generation-based relation extractors: end-to-end training is not applicable
to conventional retrieval-augmented generation due to the non-differentiable
nature of instance retrieval. This problem prevents the instance retrievers
from being optimized for the relation extraction task, and conventionally they
must be trained with an objective different from that of relation extraction.
To address this issue, we propose a novel End-to-end Trainable
Retrieval-Augmented Generation (ETRAG), which allows end-to-end optimization of
the entire model, including the retriever, for the relation extraction
objective by utilizing a differentiable selection of the $k$ nearest instances.
We evaluate the relation extraction performance of ETRAG on the TACRED dataset,
which is a standard benchmark for relation extraction. ETRAG demonstrates
consistent improvements against the baseline model as retrieved instances are
added. Furthermore, the analysis of instances retrieved by the end-to-end
trained retriever confirms that the retrieved instances contain common relation
labels or entities with the query and are specialized for the target task. Our
findings provide a promising foundation for future research on
retrieval-augmented generation and the broader applications of text generation
in Natural Language Processing.
Authors' comments: preprint