Mehrdad Farahani, Richard Johansson
Generative language models often struggle with specialized or less-discussed
knowledge. A potential solution is found in Retrieval-Augmented Generation
(RAG) models, which retrieve information before generating responses.
In this study, we explore how the \textsc{Atlas} approach, a RAG model, decides
between what it already knows (parametric) and what it retrieves
(non-parametric). We use causal mediation analysis and controlled experiments
to examine how internal representations influence information processing. Our
findings disentangle the effects of parametric knowledge and the retrieved
context. They indicate that in cases where the model can choose between both
types of information (parametric and non-parametric), it relies more on the
context than the parametric knowledge. Furthermore, the analysis investigates
the computations involved in \emph{how} the model uses the information from the
context. We find that multiple mechanisms are active within the model and can
be detected with mediation analysis: first, the decision of \emph{whether the
context is relevant}, and second, how the encoder computes output
representations to support copying when relevant.
Authors' comments: Accepted at EMNLP 2024
Francesco Maria Molfese, Simone Conia, Riccardo Orlando, Roberto Navigli
Current Large Language Models (LLMs) have shown strong reasoning capabilities
in commonsense question answering benchmarks, but the process underlying their
success remains largely opaque. As a consequence, recent approaches have
equipped LLMs with mechanisms for knowledge retrieval, reasoning and
introspection, not only to improve their capabilities but also to enhance the
interpretability of their outputs. However, these methods require additional
training, hand-crafted templates or human-written explanations. To address
these issues, we introduce ZEBRA, a zero-shot question answering framework that
combines retrieval, case-based reasoning and introspection and dispenses with
the need for additional training of the LLM. Given an input question, ZEBRA
retrieves relevant question-knowledge pairs from a knowledge base and generates
new knowledge by reasoning over the relationships in these pairs. This
generated knowledge is then used to answer the input question, improving the
model's performance and interpretability. We evaluate our approach across 8
well-established commonsense reasoning benchmarks, demonstrating that ZEBRA
consistently outperforms strong LLMs and previous knowledge integration
approaches, achieving an average accuracy improvement of up to 4.5 points.
Authors' comments: Accepted at EMNLP 2024 Main Conference
Olivia Saffer, Jesús Humberto Marines Cabello, Steven Becker, Andreas Geilen, Birgit Stiller
Photonic memory is an important building block to delay, route and buffer
optical information, for instance in optical interconnects or for recurrent
optical signal processing. Photonic-phononic memory based on stimulated
Brillouin-Mandelstam scattering (SBS) has been demonstrated as a coherent
optical storage approach with broad bandwidth, frequency selectivity and
intrinsic nonreciprocity. Here, we experimentally demonstrate the storage of
quadrature-phase encoded data at room temperature and at cryogenic
temperatures. We store and retrieve the 2-bit states $\{00, 01, 10, 11\}$
encoded as optical pulses with the phases $\{0, {\pi}/2, {\pi}, 3{\pi}/2\}$,
i.e., a quadrature phase shift keying (QPSK) signal. The 2-bit signals are retrieved
from the acoustic domain with a global phase rotation of ${\pi}$, which is
inherent in the process due to SBS. We also demonstrate full phase control over
the retrieved data based on two different handles: by detuning slightly from
the SBS resonance or by changing the storage time in the memory scheme, we can
cover the full range $[0, 2{\pi})$. At a cryogenic temperature of 3.9 K, we
obtain increased readout efficiency and gain access to longer storage
times, which results in a detectable signal at 140 ns. All in all, this work
lays the cornerstone for optoacoustic memory schemes with phase-encoded data.
Authors' comments: O.S. and J.H.M.C. contributed equally to this work
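The phase encoding described in this abstract can be illustrated with a minimal Python sketch. It assumes the 2-bit states map to the phases in the order listed (00 to 0, 01 to π/2, 10 to π, 11 to 3π/2) and models the memory only as an ideal phase rotation; the function names and the `extra_phase` parameter are hypothetical and not taken from the paper.

```python
import cmath
import math

# Assumed mapping of 2-bit states to QPSK phases (order as listed in the abstract).
SYMBOLS = ["00", "01", "10", "11"]
PHASES = {bits: i * math.pi / 2 for i, bits in enumerate(SYMBOLS)}

def encode(bits):
    """Return a unit-amplitude complex pulse carrying the phase for `bits`."""
    return cmath.exp(1j * PHASES[bits])

def store_and_retrieve(pulse, extra_phase=0.0):
    """Idealized memory: a global pi rotation plus an optional, tunable offset.

    `extra_phase` stands in for the phase control obtained by detuning or by
    changing the storage time; loss and noise are ignored in this sketch.
    """
    return pulse * cmath.exp(1j * (math.pi + extra_phase))

def decode(pulse, known_rotation=math.pi):
    """Undo the known global rotation and pick the nearest QPSK symbol."""
    phase = cmath.phase(pulse * cmath.exp(-1j * known_rotation)) % (2 * math.pi)
    index = round(phase / (math.pi / 2)) % 4
    return SYMBOLS[index]

recovered = [decode(store_and_retrieve(encode(b))) for b in SYMBOLS]
```

Because the rotation is global, the relative phases between symbols are preserved, so the receiver only needs to know the total rotation to recover the data.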
Yijiong Yu, Ma Xiufa, Fang Jianwei, Zhi Xu, Su Guangyao, Wang Jiancheng, Yongfeng Huang, Zhixiao Qi et al.
Long-context language models (LCLMs), characterized by their extensive
context window, are becoming popular. However, although they are nearly perfect
at standard long-context retrieval tasks, our evaluations demonstrate that they
perform poorly in two basic cases, "multi-matching retrieval" and "logic-based
retrieval", which are beyond LCLMs' ability boundary. We find, however, that
these cases can be well addressed with a sufficient number of reasoning steps,
guided by specific CoT prompts, indicating the necessity of combining
long-context tasks with CoT methods for more advanced long-context handling.
Current CoT methods are nevertheless too time-consuming when the context is
very long, which means efficient long-context handling still has a long way to go.
Authors' comments: Our code is publicly available at
https://github.com/yuyijiong/hard_retrieval_for_llm and the dataset is at
https://huggingface.co/datasets/yuyijiong/difficult_retrieval
Jakub Pokrywka
Passage retrieval has traditionally relied on lexical methods like TF-IDF and BM25. Recently, some neural network models have surpassed these methods in performance. However, these models face challenges, such as the need for large annotated datasets and adapting to new domains. This paper presents a winning solution to the PolEval 2023 Task 3: Passage Retrieval challenge, which involves retrieving passages of Polish texts in three domains: trivia, legal, and customer support. Only the trivia domain, however, was used for training and development data. The method used the Okapi BM25 algorithm to retrieve documents and an ensemble of publicly available multilingual cross-encoders for reranking. Fine-tuning the reranker models slightly improved performance, but only in the training domain; performance worsened in the other domains.
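The retrieve-then-rerank pattern mentioned in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the submitted system: the BM25 variant, the parameters `k1` and `b`, and the `overlap_reranker` function (a hypothetical stand-in for a cross-encoder ensemble) are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def overlap_reranker(query, doc):
    """Hypothetical stand-in for a cross-encoder: fraction of doc terms in the query."""
    return len(set(query) & set(doc)) / len(set(doc))

def retrieve_then_rerank(query, docs, reranker, top_k=2):
    """Take the BM25 top-k candidates, then reorder them by the reranker score."""
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: reranker(query, docs[i]), reverse=True)

docs = [
    "warsaw is the capital of poland".split(),
    "krakow is a city in poland".split(),
    "bm25 is a ranking function".split(),
]
query = "capital of poland".split()
ranking = retrieve_then_rerank(query, docs, overlap_reranker)
```

The cheap lexical stage limits how many passages the expensive reranker has to score, which is the usual motivation for the two-stage design.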
Dafa Li, Yao Liu, Hongchi Wang, Min Fang, Lei Wang
Investigating the dust grain size and its dependence on substructures in
protoplanetary disks is a crucial step in understanding the initial process of
planet formation. Spectral indices derived from millimeter observations are
used as a common probe for grain size. Converting observed spectral indices
into grain sizes is a complex task that involves solving the radiative transfer
equation, taking into account the disk structure and dust properties. In this
work, we ran reference radiative transfer models with known disk properties,
and generated four synthetic images at wavelengths of 0.8, 1.3, 3, and 7.8 mm,
representing high-resolution continuum observations. Rings and gaps were
considered in the setup. We fit the synthetic images using the analytic
solution of the radiative transfer equation to investigate the circumstances
under which the input grain sizes can be recovered. The results show that
fitting images at only two wavelengths is not sufficient to retrieve the grain
size. Fitting three images improves the retrieval of grain size, but the dust
surface density is still not well recovered. When all four images are taken
into account, degeneracies between different parameters are greatly reduced, and
consequently the best-fit grain sizes are consistent with the reference setup
at almost all radii. We find that the inclination angle has a significant
impact on the fitting results. For disks with low inclinations, the analytic
approach works quite well. However, when the disk is tilted above about 60
degrees, neither the grain size nor the dust surface density can be constrained,
as the inclination effect will smooth out all substructures in the radial
intensity profile of the disk.
Authors' comments: 10 pages, 9 figures, published in A&A (2024, A&A,
688, A204)
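The observable behind this analysis, the millimeter spectral index, can be computed from two fluxes with a short Python sketch. It uses the standard two-point definition α = ln(F_a/F_b) / ln(ν_a/ν_b) with ν proportional to 1/λ; the function name and the toy fluxes are illustrative assumptions, and this is not the radiative transfer fitting performed in the paper.

```python
import math

def spectral_index(flux_a, flux_b, wavelength_a_mm, wavelength_b_mm):
    """Two-point spectral index alpha, where flux is proportional to nu**alpha.

    Since nu = c / wavelength, the frequency ratio nu_a / nu_b equals
    wavelength_b / wavelength_a, so the speed of light cancels out.
    """
    return math.log(flux_a / flux_b) / math.log(wavelength_b_mm / wavelength_a_mm)

# Toy fluxes following F ~ nu**3 at two of the wavelengths used in the abstract.
alpha = spectral_index(1.3 ** -3, 3.0 ** -3, 1.3, 3.0)
```

In the optically thin Rayleigh-Jeans limit, α is approximately 2 + β, where β is the dust opacity index. Outside that limit the mapping to grain size is not direct, which is why fitting several wavelengths is needed.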
Teruaki Hayashi, Hiroki Sakaji, Jiayi Dai, Randy Goebel
Developing the capacity to effectively search for requisite datasets is an urgent requirement to assist data users in identifying relevant datasets considering the very limited available metadata. For this challenge, the utilization of third-party data is emerging as a valuable source for improvement. Our research introduces a new architecture for data exploration which employs a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery. The system integrates large language models (LLMs) with external vector databases to identify semantic relationships among diverse types of datasets. The proposed framework offers a new method for evaluating semantic similarity among heterogeneous data sources and for improving data exploration. Our study includes experimental results on four critical tasks: 1) recommending similar datasets, 2) suggesting combinable datasets, 3) estimating tags, and 4) predicting variables. Our results demonstrate that RAG can enhance the selection of relevant datasets, particularly from different categories, when compared to conventional metadata approaches. However, performance varied across tasks and models, which confirms the significance of selecting appropriate techniques based on specific use cases. The findings suggest that this approach holds promise for addressing challenges in data exploration and discovery, although further refinement is necessary for estimation tasks.
Youngwoo Kim, Razieh Rahimi, James Allan
Most efforts in interpreting neural relevance models have focused on local explanations, which explain the relevance of a document to a query but are not useful in predicting the model's behavior on unseen query-document pairs. We propose a novel method to globally explain neural relevance models by constructing a "relevance thesaurus" containing semantically relevant query and document term pairs. This thesaurus is used to augment lexical matching models such as BM25 to approximate the neural model's predictions. Our method involves training a neural relevance model to score the relevance of partial query and document segments, which is then used to identify relevant terms across the vocabulary space. We evaluate the obtained thesaurus explanation based on ranking effectiveness and fidelity to the target neural ranking model. Notably, our thesaurus reveals the existence of brand name bias in ranking models, demonstrating one advantage of our explanation method.
Jesper Knapp, Klas Moberg, Yuchuan Jin, Simin Sun, Miroslaw Staron
Autonomous driving software generates enormous amounts of data every second, which software development organizations save for future analysis and testing in the form of logs. However, given the vast size of this data, locating specific scenarios within a collection of vehicle logs can be challenging. Writing the correct SQL queries to find these scenarios requires engineers to have a strong background in SQL and the specific databases in question, further complicating the search process. This paper presents and evaluates a pipeline that allows searching for specific scenarios in log collections using natural language descriptions instead of SQL. The generated descriptions were evaluated by engineers working with vehicle logs at Zenseact on a scale from 1 to 5. Our approach achieved a mean score of 3.3, demonstrating the potential of using a multi-model architecture to improve the software development workflow. We also present an interface that visualizes both the query process and the results.
Monoshiz Mahbub Khan, Zhe Yu
Code search is vital in the maintenance and extension of software systems. Past works have used separate language models for the natural language and programming language artifacts on models with multiple encoders and different loss functions. Similarly, this work approaches code search for Python as a translation retrieval problem in which the natural language queries and the programming language are treated as two types of languages. By using dual encoders, these two types of language sequences are projected onto a shared embedding space, in which the distance reflects the similarity between a given pair of query and code. However, in contrast to previous work, this approach uses a unified language model and a dual encoder structure with a cosine similarity loss function. A unified language model helps the model take advantage of the considerable overlap of words between the artifacts, making the learning much easier. On the other hand, the dual encoders trained with cosine similarity loss help the model learn the underlying patterns of which terms are important for predicting linked pairs of artifacts. Evaluation shows the proposed model achieves performance better than state-of-the-art code search models. In addition, this model is much less expensive in terms of time and complexity, offering a cheaper, faster, and better alternative.
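A cosine similarity loss over a shared embedding space can be sketched as follows. This is an assumption-laden illustration: it uses the common formulation in which the loss is the mean squared error between the cosine similarity of a pair and its label (1 for linked, 0 for unlinked), and the toy vectors stand in for encoder outputs. The abstract does not specify the exact loss used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs):
    """Mean squared error between cosine(query_vec, code_vec) and the label.

    `pairs` is a list of (query_vec, code_vec, label) with label 1.0 for a
    linked query/code pair and 0.0 for an unlinked one (assumed formulation).
    """
    errors = [(cosine(q, c) - label) ** 2 for q, c, label in pairs]
    return sum(errors) / len(errors)

# Toy embeddings standing in for the outputs of the two encoders.
well_aligned = [([1.0, 0.0], [1.0, 0.0], 1.0), ([1.0, 0.0], [0.0, 1.0], 0.0)]
misaligned = [([1.0, 0.0], [0.0, 1.0], 1.0)]
loss_good = cosine_similarity_loss(well_aligned)
loss_bad = cosine_similarity_loss(misaligned)
```

At search time, the same cosine function can rank candidate code snippets for a query, so training and retrieval use the same notion of distance.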
SeungHeon Doh, Minhee Lee, Dasaem Jeong, Juhan Nam
Text-to-Music Retrieval, finding music based on a given natural language
query, plays a pivotal role in content discovery within extensive music
databases. To address this challenge, prior research has predominantly focused
on a joint embedding of music audio and text, utilizing it to retrieve music
tracks that exactly match descriptive queries related to musical attributes
(e.g., genre, instrument) and contextual elements (e.g., mood, theme). However,
users also articulate a need to explore music that shares similarities with
their favorite tracks or artists, such as \textit{I need a similar track to
Superstition by Stevie Wonder}. To address these concerns, this paper proposes
an improved Text-to-Music Retrieval model, denoted as TTMR++, which utilizes
rich text descriptions generated with a finetuned large language model and
metadata. To accomplish this, we obtained various types of seed text from
several existing music tag and caption datasets and a knowledge graph dataset
of artists and tracks. The experimental results show the effectiveness of
TTMR++ in comparison to state-of-the-art music-text joint embedding models
through a comprehensive evaluation involving various musical text queries.
Authors' comments: Accepted for publication at the IEEE ICASSP 2024
Huayang Li, Pat Verga, Priyanka Sen, Bowen Yang, Vijay Viswanathan, Patrick Lewis, Taro Watanabe, Yixuan Su
The context window of large language models (LLMs) has been extended significantly in recent years. However, while the context length that the LLM can process has grown, the capability of the model to accurately reason over that context degrades noticeably. This occurs because modern LLMs often become overwhelmed by the vast amount of information in the context; when answering questions, the model must identify and reason over relevant evidence sparsely distributed throughout the text. To alleviate the challenge of long-context reasoning, we develop a retrieve-then-reason framework, enabling LLMs to reason over relevant evidence collected during an intermediate retrieval step. We find that modern LLMs struggle to accurately retrieve relevant facts and instead, often hallucinate "retrieved facts", resulting in flawed reasoning and the production of incorrect answers. To address these issues, we introduce ALR$^2$, a method that augments the long-context reasoning capability of LLMs via an explicit two-stage procedure, i.e., aligning LLMs with the objectives of both retrieval and reasoning. We demonstrate the efficacy of ALR$^2$ for mitigating performance degradation in long-context reasoning tasks. Through extensive experiments on long-context QA benchmarks, we find our method to outperform competitive baselines by large margins, achieving at least 8.4 and 7.9 EM gains on the long-context versions of HotpotQA and SQuAD datasets, respectively.
Yuxiang Zhang, Xin Fan, Junjie Wang, Chongxian Chen, Fan Mo, Tetsuya Sakai, Hayato Yamana
Recent advancements in large language models (LLMs) integrated with external tools and APIs have successfully addressed complex tasks by using in-context learning or fine-tuning. Despite this progress, the vast scale of tool retrieval remains challenging due to stringent input length constraints. In response, we propose a pre-retrieval strategy from an extensive repository, effectively framing the problem as the massive tool retrieval (MTR) task. We introduce the MTRB (massive tool retrieval benchmark) to evaluate real-world tool-augmented LLM scenarios with a large number of tools. This benchmark is designed for low-resource scenarios and includes a diverse collection of tools with descriptions refined for consistency and clarity. It consists of three subsets, each containing 90 test samples and 10 training samples. To handle the low-resource MTR task, we propose a new query-tool alignment (QTA) framework that leverages LLMs to enhance query-tool alignment by rewriting user queries through ranking functions and the direct preference optimization (DPO) method. This approach consistently outperforms existing state-of-the-art models in top-5 and top-10 retrieval tasks across the MTRB benchmark, with improvements of up to 93.28% based on the metric Sufficiency@k, which measures the adequacy of tool retrieval within the first k results. Furthermore, ablation studies validate the efficacy of our framework, highlighting its capacity to optimize performance even with limited annotated samples. Specifically, our framework achieves up to 78.53% performance improvement in Sufficiency@k with just a single annotated sample. Additionally, QTA exhibits strong cross-dataset generalizability, emphasizing its potential for real-world applications.
Yilong Zhao, Daifeng Li
The drafting of documents in the procurement field has progressively become more complex and diverse, driven by the need to meet legal requirements, adapt to technological advancements, and address stakeholder demands. While large language models (LLMs) show potential in document generation, most LLMs lack specialized knowledge in procurement. To address this gap, we use retrieval-augmented techniques to achieve professional document generation, ensuring accuracy and relevance in procurement documentation.
Tobias Leemann, Periklis Petridis, Giuseppe Vietri, Dionysis Manousakas, Aaron Roth, Sergul Aydore
While retrieval-augmented generation (RAG) has been shown to enhance factuality of large language model (LLM) outputs, LLMs still suffer from hallucination, generating incorrect or irrelevant information. A common detection strategy involves prompting the LLM again to assess whether its response is grounded in the retrieved evidence, but this approach is costly. Alternatively, lightweight natural language inference (NLI) models for efficient grounding verification can be used at inference time. While existing pre-trained NLI models offer potential solutions, their performance remains subpar compared to larger models on realistic RAG inputs. RAG inputs are more complex than most datasets used for training NLI models and have characteristics specific to the underlying knowledge base, requiring adaptation of the NLI models to a specific target domain. Additionally, the lack of labeled instances in the target domain makes supervised domain adaptation, e.g., through fine-tuning, infeasible. To address these challenges, we introduce Automatic Generative Domain Adaptation (Auto-GDA). Our framework enables unsupervised domain adaptation through synthetic data generation. Unlike previous methods that rely on handcrafted filtering and augmentation strategies, Auto-GDA employs an iterative process to continuously improve the quality of generated samples using weak labels from less efficient teacher models and discrete optimization to select the most promising augmented samples. Experimental results demonstrate the effectiveness of our approach, with models fine-tuned on synthetic data using Auto-GDA often surpassing the performance of the teacher model and reaching the performance level of LLMs at 10% of their computational cost.
Shailja Gupta, Rajesh Ranjan, Surya Narayan Singh
This paper presents a comprehensive study of Retrieval-Augmented Generation
(RAG), tracing its evolution from foundational concepts to the current state of
the art. RAG combines retrieval mechanisms with generative language models to
enhance the accuracy of outputs, addressing key limitations of LLMs. The study
explores the basic architecture of RAG, focusing on how retrieval and
generation are integrated to handle knowledge-intensive tasks. A detailed
review of the significant technological advancements in RAG is provided,
including key innovations in retrieval-augmented language models and
applications across various domains such as question-answering, summarization,
and knowledge-based tasks. Recent research breakthroughs are discussed,
highlighting novel methods for improving retrieval efficiency. Furthermore, the
paper examines ongoing challenges such as scalability, bias, and ethical
concerns in deployment. Future research directions are proposed, focusing on
improving the robustness of RAG models, expanding the scope of application of
RAG models, and addressing societal implications. This survey aims to serve as
a foundational resource for researchers and practitioners in understanding the
potential of RAG and its trajectory in natural language processing.
Authors' comments: 4 Figures
Pengzhi Yang, Xinyu Wang, Ruipeng Zhang, Cong Wang, Frans Oliehoek, Jens Kober
Real-world environments require robots to continuously acquire new skills while retaining previously learned abilities, all without the need for clearly defined task boundaries. Storing all past data to prevent forgetting is impractical due to storage and privacy concerns. To address this, we propose a method that efficiently restores a robot's proficiency in previously learned tasks over its lifespan. Using an Episodic Memory (EM), our approach enables experience replay during training and retrieval during testing for local fine-tuning, allowing rapid adaptation to previously encountered problems without explicit task identifiers. Additionally, we introduce a selective weighting mechanism that emphasizes the most challenging segments of retrieved demonstrations, focusing local adaptation where it is most needed. This framework offers a scalable solution for lifelong learning in dynamic, task-unaware environments, combining retrieval-based adaptation with selective weighting to enhance robot performance in open-ended scenarios.
Ryan C. Barron, Ves Grantcharov, Selma Wanna, Maksim E. Eren, Manish Bhattarai, Nicholas Solovyev, George Tompkins, Charles Nicholas et al.
Large Language Models (LLMs) are pre-trained on large-scale corpora and excel
in numerous general natural language processing (NLP) tasks, such as question
answering (QA). Despite their advanced language capabilities, when it comes to
domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations,
knowledge cut-offs, and lack of knowledge attributions. Additionally,
fine-tuning LLMs' intrinsic knowledge to highly specific domains is an expensive and
time-consuming process. The retrieval-augmented generation (RAG) process has
recently emerged as a method capable of optimizing LLM responses by
referencing them to a predetermined ontology. It was shown that using a
Knowledge Graph (KG) ontology for RAG improves the QA accuracy, by taking into
account relevant sub-graphs that preserve the information in a structured
manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM
framework that integrates RAG with a KG and a vector store (VS) that store
factual domain-specific information. Importantly, to avoid hallucinations in
the KG, we build these highly domain-specific KGs and VSs without the use of
LLMs, but via NLP, data mining, and nonnegative tensor factorization with
automatic model selection. Pairing our RAG with a domain-specific: (i) KG
(containing structured information), and (ii) VS (containing unstructured
information) enables the development of domain-specific chat-bots that
attribute the source of information, mitigate hallucinations, lessen the need
for fine-tuning, and excel in highly domain-specific question answering tasks.
We pair SMART-SLIC with chain-of-thought prompting agents. The framework is
designed to be generalizable to adapt to any specific or specialized domain. In
this paper, we demonstrate the question answering capabilities of our framework
on a corpus of scientific publications on malware analysis and anomaly
detection.
Authors' comments: 9 pages, 7 figures, 1 table, 1 cypher code. Accepted to ICMLA 2024
Zixuan Li, Jing Xiong, Fanghua Ye, Chuanyang Zheng, Xun Wu, Jianqiao Lu, Zhongwei Wan, Xiaodan Liang et al.
We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented Generation (RAG) that utilizes Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate similarity between text chunks. This span uncertainty enhances model calibration, improving robustness and mitigating semantic inconsistencies introduced by random chunking. Leveraging this insight, we propose an efficient unsupervised learning technique to train the retrieval model, alongside an effective data sampling and scaling strategy. UncertaintyRAG outperforms baselines by 2.03% on LLaMA-2-7B, achieving state-of-the-art results while using only 4% of the training data compared to other advanced open-source retrieval models under distribution shift settings. Our method demonstrates strong calibration through span uncertainty, leading to improved generalization and robustness in long-context RAG tasks. Additionally, UncertaintyRAG provides a lightweight retrieval model that can be integrated into any large language model with varying context window lengths, without the need for fine-tuning, showcasing the flexibility of our approach.
Jiale Fu, Yaqing Wang, Simeng Han, Jiaming Fan, Xu Yang
In-context learning (ICL) enhances large language models (LLMs) by incorporating demonstration examples, yet its effectiveness heavily depends on the quality of selected examples. Current methods typically use text embeddings to measure semantic similarity, which often introduces bias in multi-step reasoning tasks. This occurs because text embeddings contain irrelevant semantic information and lack deeper reasoning structures. To address this, we propose GraphIC, a graph-based retrieval model that leverages a reasoning-aware representation and a specialized similarity metric for in-context example retrieval. GraphIC first constructs thought graphs (directed, node-attributed graphs that explicitly model reasoning steps and their dependencies) for candidate examples and queries. This approach filters out superficial semantics while preserving essential reasoning processes. Next, GraphIC retrieves examples using a novel similarity metric tailored for these graphs, capturing sequential reasoning patterns and asymmetry between examples. Comprehensive evaluations across mathematical reasoning, code generation, and logical reasoning tasks demonstrate that GraphIC outperforms 10 baseline methods. Our results highlight the importance of reasoning-aware retrieval in ICL, offering a robust solution for enhancing LLM performance in multi-step reasoning scenarios.