Chaoyi Ai, Yong Jiang, Shen Huang, Pengjun Xie, Kewei Tu
Named entity recognition (NER) models often struggle with noisy inputs, such as those with spelling mistakes or errors generated by Optical Character Recognition processes, and learning a robust NER model is challenging. Existing robust NER models utilize both noisy text and its corresponding gold text for training, which is infeasible in many real-world applications in which gold text is not available. In this paper, we consider a more realistic setting in which only noisy text and its NER labels are available. We propose to retrieve relevant text of the noisy text from a knowledge corpus and use it to enhance the representation of the original noisy input. We design three retrieval methods: sparse retrieval based on lexicon similarity, dense retrieval based on semantic similarity, and self-retrieval based on task-specific text. After retrieving relevant text, we concatenate the retrieved text with the original noisy text and encode them with a transformer network, utilizing self-attention to enhance the contextual token representations of the noisy text using the retrieved text. We further employ a multi-view training framework that improves robust NER without retrieving text during inference. Experiments show that our retrieval-augmented model achieves significant improvements in various noisy NER settings.
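The retrieve-then-concatenate step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: character-trigram Jaccard overlap stands in for the sparse retriever, and the function names, the `[SEP]` separator string, and the toy corpus are all assumptions. The concatenated string is what a transformer encoder would then consume.

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams in text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(noisy_text, corpus, top_k=1):
    """Rank corpus sentences by trigram overlap with the noisy input."""
    query = char_ngrams(noisy_text)
    ranked = sorted(corpus,
                    key=lambda doc: jaccard(query, char_ngrams(doc)),
                    reverse=True)
    return ranked[:top_k]

def build_input(noisy_text, corpus, top_k=1, sep=" [SEP] "):
    """Concatenate the noisy text with its retrieved context (assumed format)."""
    return sep.join([noisy_text] + retrieve(noisy_text, corpus, top_k))

# Hypothetical knowledge corpus and noisy input with spelling errors.
corpus = [
    "Angela Merkel visited Paris in 2017",
    "The stock market fell sharply",
    "Paris is the capital of France",
]
noisy = "Angela Merkl visted Paris"
print(build_input(noisy, corpus))
```

Character n-grams are used here because they tolerate misspellings better than whole-word matching; any lexical scorer could be swapped in.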
Eric Yang, Jonathan Amar, Jong Ha Lee, Bhawesh Kumar, Yugang Jia
Digital health chatbots powered by Large Language Models (LLMs) have the
potential to significantly improve personal health management for chronic
conditions by providing accessible and on-demand health coaching and
question-answering. However, these chatbots risk providing unverified and
inaccurate information because LLMs generate responses based on patterns
learned from diverse internet data. Retrieval Augmented Generation (RAG) can
help mitigate hallucinations and inaccuracies in LLM responses by grounding them
in reliable content. However, efficiently and accurately retrieving the most
relevant content for real-time user questions remains a challenge. In
this work, we introduce Query-Based Retrieval Augmented Generation (QB-RAG), a
novel approach that pre-computes a database of potential queries from a content
base using LLMs. For an incoming patient question, QB-RAG efficiently matches
it against this pre-generated query database using vector search, improving
alignment between user questions and the content. We establish a theoretical
foundation for QB-RAG and provide a comparative analysis of existing retrieval
enhancement techniques for RAG systems. Finally, our empirical evaluation
demonstrates that QB-RAG significantly improves the accuracy of healthcare
question answering, paving the way for robust and trustworthy LLM applications
in digital health.
Authors' comments: 22 pages
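The query-to-query matching idea in the abstract above can be sketched as follows. This is a toy illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, the pre-generated queries would in practice come from an LLM run offline over the content base, and all names and data here are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical content base and queries generated offline for each content item.
content = {
    "c1": "Article about glucose monitoring",
    "c2": "Article about weekly exercise",
}
pregenerated = [
    ("when should i check my blood sugar", "c1"),
    ("how much exercise do i need each week", "c2"),
]
# Pre-computed index: one vector per generated query, pointing at its content id.
index = [(embed(query), cid) for query, cid in pregenerated]

def retrieve(question):
    """Match the incoming question against the pre-generated queries."""
    qv = embed(question)
    _, best_cid = max(index, key=lambda item: cosine(qv, item[0]))
    return best_cid

print(content[retrieve("How much exercise should I do each week?")])
```

The point of the design is that the incoming question is compared with other questions rather than with the content itself, so both sides of the similarity share the same phrasing style.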
Chaofan Gan, Yuanpeng Tu, Yuxi Li, Weiyao Lin
With the recent burst of 2D and 3D data, cross-modal retrieval has attracted
increasing attention. However, manual labeling by non-experts will
inevitably introduce corrupted annotations given ambiguous 2D/3D content.
Though previous works have addressed this issue by designing a naive division
strategy with hand-crafted thresholds, their performance generally exhibits
great sensitivity to the threshold value. Besides, they fail to fully utilize
the valuable supervisory signals within each divided subset. To tackle this
problem, we propose a Divide-and-conquer 2D-3D cross-modal Alignment and
Correction framework (DAC), which comprises Multimodal Dynamic Division (MDD)
and Adaptive Alignment and Correction (AAC). Specifically, the former performs
accurate sample division by adaptive credibility modeling for each sample based
on the compensation information within multimodal loss distribution. Then in
AAC, samples in distinct subsets are exploited with different alignment
strategies to fully enhance the semantic compactness and meanwhile alleviate
over-fitting to noisy labels, where a self-correction strategy is introduced to
improve the quality of representation. Moreover, to evaluate the effectiveness
in real-world scenarios, we introduce a challenging noisy benchmark, namely
Objaverse-N200, which comprises 200k-level samples annotated with 1156
realistic noisy labels. Extensive experiments on both traditional and the newly
proposed benchmarks demonstrate the generality and superiority of our DAC,
where DAC outperforms state-of-the-art models by a large margin (i.e., with a
+5.9% gain on ModelNet40 and +5.8% on Objaverse-N200).
Authors' comments: accepted by ACM MM 2024
Yongqi Li, Hongru Cai, Wenjie Wang, Leigang Qu, Yinwei Wei, Wenjie Li, Liqiang Nie, Tat-Seng Chua
Text-to-image retrieval is a fundamental task in multimedia processing,
aiming to retrieve semantically relevant cross-modal content. Traditional
studies have typically approached this task as a discriminative problem,
matching the text and image via the cross-attention mechanism (one-tower
framework) or in a common embedding space (two-tower framework). Recently,
generative cross-modal retrieval has emerged as a new research line, which
assigns images with unique string identifiers and generates the target
identifier as the retrieval target. Despite its great potential, existing
generative approaches are limited due to the following issues: insufficient
visual information in identifiers, misalignment with high-level semantics, and
learning gap towards the retrieval target. To address the above issues, we
propose an autoregressive voken generation method, named AVG. AVG tokenizes
images into vokens, i.e., visual tokens, and innovatively formulates the
text-to-image retrieval task as a token-to-voken generation problem. AVG
discretizes an image into a sequence of vokens as the identifier of the image,
while maintaining the alignment with both the visual information and high-level
semantics of the image. Additionally, to bridge the learning gap between
generative training and the retrieval target, we incorporate discriminative
training to modify the learning direction during token-to-voken training.
Extensive experiments demonstrate that AVG achieves superior results in both
effectiveness and efficiency.
Authors' comments: Work in progress
Cui Long, Yongbin Liu, Chunping Ouyang, Ying Yu
Large Language Models (LLMs) have exhibited remarkable proficiency in natural language understanding, prompting extensive exploration of their potential applications across diverse domains. In the medical domain, open-source LLMs have demonstrated moderate efficacy following domain-specific fine-tuning; however, they remain substantially inferior to proprietary models such as GPT-4 and GPT-3.5. These open-source models encounter limitations in the comprehensiveness of domain-specific knowledge and exhibit a propensity for 'hallucinations' during text generation. To mitigate these issues, researchers have implemented the Retrieval-Augmented Generation (RAG) approach, which augments LLMs with background information from external knowledge bases while preserving the model's internal parameters. However, document noise can adversely affect performance, and the application of RAG in the medical field remains in its nascent stages. This study presents the Bailicai framework: a novel integration of retrieval-augmented generation with large language models optimized for the medical domain. The Bailicai framework augments the performance of LLMs in medicine through the implementation of four sub-modules. Experimental results demonstrate that the Bailicai approach surpasses existing medical domain LLMs across multiple medical benchmarks and exceeds the performance of GPT-3.5. Furthermore, the Bailicai method effectively attenuates the prevalent issue of hallucinations in medical applications of LLMs and ameliorates the noise-related challenges associated with traditional RAG techniques when processing irrelevant or pseudo-relevant documents.
Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky
Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts. However, recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities to understand long contexts directly. We conduct a comprehensive comparison between RAG and long-context (LC) LLMs, aiming to leverage the strengths of both. We benchmark RAG and LC across various public datasets using three of the latest LLMs. Results reveal that when resourced sufficiently, LC consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage. Based on this observation, we propose Self-Route, a simple yet effective method that routes queries to RAG or LC based on model self-reflection. Self-Route significantly reduces the computation cost while maintaining performance comparable to LC. Our findings provide a guideline for long-context applications of LLMs using RAG and LC.
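A routing step of this kind can be sketched as follows. The abstract states only that routing is based on model self-reflection; the concrete protocol here (ask the model to answer from the retrieved chunks or reply "unanswerable", then fall back to the full context) is an assumption for illustration, as are the prompt wording, the `llm` callable interface, and the `toy_llm` stub.

```python
UNANSWERABLE = "unanswerable"

def self_route(query, retrieved_chunks, full_context, llm):
    """Try the cheap RAG prompt first; fall back to long context if the model declines.

    `llm` is any callable mapping a prompt string to a response string
    (hypothetical interface, not a specific library API).
    """
    rag_prompt = (
        "Answer using only the passages below. "
        f"If they are insufficient, reply exactly '{UNANSWERABLE}'.\n\n"
        + "\n".join(retrieved_chunks)
        + f"\n\nQuestion: {query}"
    )
    answer = llm(rag_prompt)
    if answer.strip().lower() != UNANSWERABLE:
        return "rag", answer
    # The model judged the retrieved passages insufficient: pay for the full context.
    lc_prompt = f"{full_context}\n\nQuestion: {query}"
    return "lc", llm(lc_prompt)

def toy_llm(prompt):
    """Stub model for demonstration: 'knows' the answer only if it appears in the prompt."""
    return "Paris" if "Paris" in prompt else UNANSWERABLE

document = "France's capital is Paris. Berlin is in Germany."
print(self_route("What is the capital of France?", ["Berlin is in Germany."], document, toy_llm))
```

Under this sketch, the long-context call is made only for queries the model declines to answer from the retrieved chunks, which is where the cost saving would come from.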
Ioana Buhnila, Aman Sinha, Mathieu Constant
The recent surge in the accessibility of large language models (LLMs) to the
general population can lead to untrackable use of such models for
medical-related recommendations. Language generation via LLMs has two
key problems: firstly, LLMs are prone to hallucination and therefore require
scientific and factual grounding for any medical purpose; secondly, LLMs
pose a tremendous challenge to computational resources due to their gigantic
model size. In this work, we introduce pRAGe, a pipeline for Retrieval
Augmented Generation and evaluation of medical paraphrase generation using
Small Language Models (SLMs). We study the effectiveness of SLMs and the impact
of an external knowledge base for medical paraphrase generation in French.
Authors' comments: KnowledgeableLM 2024
Fengran Mo, Longxiang Zhao, Kaiyu Huang, Yue Dong, Degen Huang, Jian-Yun Nie
Personalized conversational information retrieval (CIR) combines
conversational and personalizable elements to satisfy various users' complex
information needs through multi-turn interaction based on their backgrounds.
The key promise is that the personal textual knowledge base (PTKB) can improve
the CIR effectiveness because the retrieval results can be more related to the
user's background. However, PTKB is noisy: not every piece of knowledge in PTKB
is relevant to the specific query at hand. In this paper, we explore and test
several ways to select knowledge from PTKB and use it for query reformulation
by using a large language model (LLM). The experimental results show that the PTKB
might not always improve the search results when used alone, but an LLM can help
generate a more appropriate personalized query when high-quality guidance is
provided.
Authors' comments: Accepted to CIKM 2024
Yannick Assogba, Donghao Ren
As language models support larger and larger context sizes, evaluating their
ability to make effective use of that context becomes increasingly important.
We analyze the ability of several code generation models to handle long range
dependencies using a suite of multi-step key retrieval tasks in context windows
up to 8k tokens in length. The tasks progressively increase in difficulty and
allow more nuanced evaluation of model capabilities than tests like the popular
needle-in-the-haystack test. We find that performance degrades significantly
(up to 2x) when a function references another function that is defined later in
the prompt. We also observe that models that use sliding window attention
mechanisms have difficulty handling references further than the size of a
single window. We perform simple prompt modifications using call graph
information to improve multi-step retrieval performance up to 3x. Our analysis
highlights different facets of long-context performance and is suggestive of
prompt construction strategies for code completion tools.
Authors' comments: 29 pages, 18 figures
Marco Simoni, Andrea Saracino, Vinod P., Mauro Conti
In this paper, we introduce MoRSE (Mixture of RAGs Security Experts), the first specialised AI chatbot for cybersecurity. MoRSE aims to provide comprehensive and complete knowledge about cybersecurity. MoRSE uses two RAG (Retrieval Augmented Generation) systems designed to retrieve and organize information from multidimensional cybersecurity contexts. MoRSE differs from traditional RAGs by using parallel retrievers that work together to retrieve semantically related information in different formats and structures. Unlike traditional Large Language Models (LLMs) that rely on Parametric Knowledge Bases, MoRSE retrieves relevant documents from Non-Parametric Knowledge Bases in response to user queries. Subsequently, MoRSE uses this information to generate accurate answers. In addition, MoRSE benefits from real-time updates to its knowledge bases, enabling continuous knowledge enrichment without retraining. We have evaluated the effectiveness of MoRSE against other state-of-the-art LLMs, evaluating the system on 600 cybersecurity specific questions. The experimental evaluation has shown that the improvement in terms of relevance and correctness of the answer is more than 10% compared to known solutions such as GPT-4 and Mixtral 7x8.
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang
The rapid growth of online video resources has significantly promoted the development of video retrieval methods. As a standard evaluation metric for video retrieval, Average Precision (AP) assesses the overall rankings of relevant videos at the top list, making the predicted scores a reliable reference for users. However, recent video retrieval methods utilize pair-wise losses that treat all sample pairs equally, leading to an evident gap between the training objective and evaluation metric. To effectively bridge this gap, in this work, we aim to address two primary challenges: a) The current similarity measure and AP-based loss are suboptimal for video retrieval; b) The noticeable noise from frame-to-frame matching introduces ambiguity in estimating the AP loss. In response to these challenges, we propose the Hierarchical learning framework for Average-Precision-oriented Video Retrieval (HAP-VR). For the former challenge, we develop the TopK-Chamfer Similarity and QuadLinear-AP loss to measure and optimize video-level similarities in terms of AP. For the latter challenge, we suggest constraining the frame-level similarities to achieve an accurate AP loss estimation. Experimental results show that HAP-VR outperforms existing methods on several benchmark datasets, providing a feasible solution for video retrieval tasks and thus offering potential benefits for multimedia applications.
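For reference, the Average Precision metric mentioned above can be computed for a single ranked list as in the sketch below. This is the standard evaluation-time definition, not the QuadLinear-AP training loss proposed in the paper. The function name is an assumption, and the normalisation assumes every relevant item appears somewhere in the ranked list.

```python
def average_precision(relevance):
    """AP for one ranked list of binary relevance labels (best-ranked item first).

    Averages precision@k over the ranks k at which a relevant item occurs.
    Assumes all relevant items are present in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

# Relevant videos at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(average_precision([1, 0, 1, 0]))
```

Because a relevant item contributes more when it is ranked higher, AP weights sample pairs unequally, which is the mismatch with pair-wise losses that the abstract points to.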
Mahesh Kandhare, Thibault Gisselbrecht
The numerous video frame sampling methodologies detailed in the literature
make it difficult to determine the optimal frame sampling method
for the Video RAG pattern without a comparative side-by-side analysis. In this
work, we investigate the trade-offs in frame sampling methods for Video & Frame
Retrieval using natural language questions. We explore the balance between the
quantity of sampled frames and the retrieval recall score, aiming to identify
efficient video frame sampling strategies that maintain high retrieval efficacy
with reduced storage and processing demands. Our study focuses on the storage
and retrieval of image data (video frames) within a vector database required by
the Video RAG pattern, comparing the effectiveness of various frame sampling
techniques. Our investigation indicates that the recall@k metric for both
text-to-video and text-to-frame retrieval tasks using various methods covered
as part of this work is comparable to or exceeds that of storing each frame
from the video. Our findings are intended to inform the selection of frame
sampling methods for practical Video RAG implementations, serving as a
springboard for innovative research in this domain.
Authors' comments: 19 pages, 24 figures (65 images)
Malick Ebiele, Malika Bendechache, Eamonn Clinton, Rob Brennan
In this paper, we propose a novel data valuation method for a Dataset
Retrieval (DR) use case in Ireland's national mapping agency. To the best of
our knowledge, data valuation has not yet been applied to Dataset Retrieval. By
leveraging metadata and a user's preferences, we estimate the personal value of
each dataset to facilitate dataset retrieval and filtering. We then validated
the data value-based ranking against the stakeholders' ranking of the datasets.
The proposed data valuation method and use case demonstrated that data
valuation is promising for dataset retrieval. For instance, the best-performing
dataset retrieval based on our approach obtained 0.8207 in terms of NDCG@5 (the
truncated Normalized Discounted Cumulative Gain at 5). This study is unique in
its exploration of a data valuation-based approach to dataset retrieval and
stands out because, unlike most existing methods, our approach is validated
using the stakeholders' ranking of the datasets.
Authors' comments: 5 pages, 1 figure
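The NDCG@5 figure quoted above is a truncated Normalized Discounted Cumulative Gain. A minimal sketch of the metric follows, assuming the linear-gain form (the abstract does not say which gain variant was used) and assuming the ideal ordering is obtained by sorting the same list of graded relevance scores. Function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of datasets in the order a system ranked them (made-up values).
print(ndcg_at_k([1, 0, 1], k=5))
```

A score of 1.0 means the system's top-k ordering matches the ideal ordering of the judged items.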
Daeun Lee, Jaewoong Choi, Hiroshi Mizuseki, Byungju Lee
Recent studies have increasingly applied natural language processing (NLP) to automatically extract experimental research data from the extensive battery materials literature. Despite the complex process involved in battery manufacturing -- from material synthesis to cell assembly -- there has been no comprehensive study systematically organizing this information. In response, we propose a language modeling-based protocol, Text-to-Battery Recipe (T2BR), for the automatic extraction of end-to-end battery recipes, validated using a case study on batteries containing LiFePO4 cathode material. We report machine learning-based paper filtering models, screening 2,174 relevant papers from the keyword-based search results, and unsupervised topic models to identify 2,876 paragraphs related to cathode synthesis and 2,958 paragraphs related to cell assembly. Then, focusing on the two topics, two deep learning-based named entity recognition models are developed to extract a total of 30 entities -- including precursors, active materials, and synthesis methods -- achieving F1 scores of 88.18% and 94.61%. The accurate extraction of entities enables the systematic generation of 165 end-to-end recipes of LiFePO4 batteries. Our protocol and results offer valuable insights into specific trends, such as associations between precursor materials and synthesis methods, or combinations between different precursor materials. We anticipate that our findings will serve as a foundational knowledge base for facilitating battery-recipe information retrieval. The proposed protocol will significantly accelerate the review of battery material literature and catalyze innovations in battery design and development.
Junha Song, Tae Soo Kim, Junha Kim, Gunhee Nam, Thijs Kooi, Jaegul Choo
This paper aims to adapt the source model to the target environment,
leveraging a small amount of user feedback (i.e., labeled target data) readily available in
real-world applications. We find that existing semi-supervised domain
adaptation (SemiSDA) methods often suffer from poorly improved adaptation
performance when directly utilizing such feedback data, as shown in Figure 1.
We analyze this phenomenon via a novel concept called Negatively Biased
Feedback (NBF), which stems from the observation that user feedback is more
likely for data points where the model produces incorrect predictions. To
leverage this feedback while avoiding the issue, we propose a scalable adapting
approach, Retrieval Latent Defending. This approach helps existing SemiSDA
methods to adapt the model with a balanced supervised signal by utilizing
latent defending samples throughout the adaptation process. We demonstrate the
problem caused by NBF and the efficacy of our approach across various
benchmarks, including image classification, semantic segmentation, and a
real-world medical imaging application. Our extensive experiments reveal that
integrating our approach with multiple state-of-the-art SemiSDA methods leads
to significant performance improvements.
Authors' comments: Accepted to ECCV 2024, Project page:
https://sites.google.com/view/junha/nbf-rld
Yang Xu, Yifan Feng, Yu Jiang
Existing methods of 3D cross-modal retrieval heavily lean on category
distribution priors within the training set, which diminishes their efficacy
when tasked with unseen categories under open-set environments. To tackle this
problem, we propose the Structure-Aware Residual-Center Representation (SRCR)
framework for self-supervised open-set 3D cross-modal retrieval. To address the
center deviation due to category distribution differences, we utilize the
Residual-Center Embedding (RCE) for each object by nested auto-encoders, rather
than directly mapping them to the modality or category centers. Besides, we
perform the Hierarchical Structure Learning (HSL) approach to leverage the
high-order correlations among objects for generalization, by constructing a
heterogeneous hypergraph structure based on hierarchical inter-modality,
intra-object, and implicit-category correlations. Extensive experiments and
ablation studies on four benchmarks demonstrate the superiority of our proposed
framework compared to state-of-the-art methods.
Authors' comments: ICME 2024
Gabriel de Souza P. Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, Even Oldridge
Text embedding models have been popular for information retrieval applications such as semantic search and Question-Answering systems based on Retrieval-Augmented Generation (RAG). Those models are typically Transformer models that are fine-tuned with contrastive learning objectives. One of the challenging aspects of fine-tuning embedding models is the selection of high-quality hard-negative passages for contrastive learning. In this paper we introduce a family of positive-aware mining methods that use the positive relevance score as an anchor for effective false negative removal, leading to faster training and more accurate retrieval models. We provide an ablation study on hard-negative mining methods over their configurations, exploring different teacher and base models. We further demonstrate the efficacy of our proposed mining methods at scale with the NV-Retriever-v1 model, which scores 60.9 on the MTEB Retrieval (BEIR) benchmark and placed 1st when it was published to MTEB Retrieval in July 2024.
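One way to use the positive score as an anchor for false-negative removal is sketched below. The abstract does not give the exact rule, so the thresholding scheme here (drop any candidate scoring at or above a fixed fraction of the positive's score, then keep the highest-scoring survivors) is an assumption, as are the function name, the `max_ratio` value, and the toy scores.

```python
def mine_hard_negatives(positive_score, candidates, max_ratio=0.95, k=4):
    """Select up to k hard negatives for one query.

    candidates: list of (passage_id, teacher_score) pairs, excluding the positive.
    Candidates scoring at or above max_ratio * positive_score are treated as
    likely false negatives and removed (assumed rule, for illustration only).
    """
    threshold = positive_score * max_ratio
    kept = [(pid, score) for pid, score in candidates if score < threshold]
    kept.sort(key=lambda item: item[1], reverse=True)  # hardest negatives first
    return [pid for pid, _ in kept[:k]]

# Toy teacher scores: "a" and "d" score almost as high as the positive (0.80),
# so they are suspected false negatives and filtered out.
candidates = [("a", 0.79), ("b", 0.70), ("c", 0.50), ("d", 0.77)]
print(mine_hard_negatives(0.80, candidates, k=2))
```

Tying the cut-off to the positive's own score, rather than using a single global threshold, lets the filter adapt to queries whose scores are uniformly high or low.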
Soroosh Tayebi Arasteh, Mahshad Lotfinia, Keno Bressem, Robert Siepmann, Lisa Adams, Dyke Ferber, Christiane Kuhl, Jakob Nikolas Kather et al.
Large language models (LLMs) often generate outdated or inaccurate information based on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference standard answers, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG in a zero-shot inference scenario. RadioRAG retrieved context-specific information from www.radiopaedia.org in real-time. Accuracy was investigated. Statistical analyses were performed using bootstrapping. The results were further compared with human performance. RadioRAG improved diagnostic accuracy across most LLMs, with relative accuracy increases ranging up to 54% for different LLMs. It matched or exceeded non-RAG models and the human radiologist in question answering across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, the degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in RadioRAG's effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.
Yuetong Zhao, Hongyu Cao, Xianyu Zhao, Zhijian Ou
Since the launch of ChatGPT at the end of 2022, generative dialogue models
represented by ChatGPT have quickly become essential tools in daily life. As
user expectations increase, enhancing the capability of generative dialogue
models to solve complex problems has become a focal point of current research.
This paper delves into the effectiveness of the RAFT (Retrieval Augmented
Fine-Tuning) method in improving the performance of generative dialogue models.
RAFT combines chain-of-thought with model supervised fine-tuning (SFT) and
retrieval augmented generation (RAG), which significantly enhances the model's
information extraction and logical reasoning abilities. We evaluated the RAFT
method across multiple datasets and analysed its performance in various
reasoning tasks, including long-form QA and short-form QA tasks, tasks in both
Chinese and English, and supportive and comparison reasoning tasks. Notably, it
addresses the gaps in previous research regarding long-form QA tasks and
Chinese datasets. Moreover, we also evaluate the benefit of the
chain-of-thought (CoT) in the RAFT method. This work offers valuable insights
for studies focused on enhancing the performance of generative dialogue models.
Authors' comments: Accepted by ISCSLP 2024
Yuan Pu, Zhuolun He, Tairu Qiu, Haoyuan Wu, Bei Yu
Retrieval augmented generation (RAG) enhances the accuracy and reliability of
generative AI models by sourcing factual information from external databases,
which is extensively employed in document-grounded question-answering (QA)
tasks. Off-the-shelf RAG flows are well pretrained on general-purpose
documents, yet they encounter significant challenges when being applied to
knowledge-intensive vertical domains, such as electronic design automation
(EDA). This paper addresses this issue by proposing a customized RAG framework
along with three domain-specific techniques for EDA tool documentation QA,
including a contrastive learning scheme for text embedding model fine-tuning, a
reranker distilled from a proprietary LLM, and a generative LLM fine-tuned with
a high-quality domain corpus. Furthermore, we have developed and released a
documentation QA evaluation benchmark, ORD-QA, for OpenROAD, an advanced
RTL-to-GDSII design platform. Experimental results demonstrate that our
proposed RAG flow and techniques have achieved superior performance on ORD-QA
as well as on a commercial tool, compared with state-of-the-art methods. The ORD-QA
benchmark and the training dataset for our customized RAG flow are open-source
at https://github.com/lesliepy99/RAG-EDA.
Authors' comments: Accepted by ICCAD 2024