Théodore Fougereux, Cédric Josz, Xiaopeng Li
This paper addresses the phase retrieval problem, which aims to recover a signal vector $x^{\natural}$ from $m$ measurements $y_i=|\langle a_i,x^{\natural}\rangle|^2$, $i=1,\ldots,m$. A standard approach is to solve a nonconvex least squares problem using gradient descent with random initialization, which is known to work efficiently given a sufficient number of measurements. However, whether $O(n)$ measurements suffice for gradient descent to recover the ground truth efficiently has remained an open question. Prior work has established that $O(n\,{\rm poly}(\log n))$ measurements are sufficient. In this paper, we resolve this open problem by proving that $m=O(n)$ Gaussian random measurements are sufficient to guarantee, with high probability, that the objective function has a benign global landscape. This sample complexity is optimal because at least $\Omega(n)$ measurements are required for exact recovery. The landscape result allows us to further show that gradient descent with a constant step size converges to the ground truth from almost any initial point.
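As an illustration of the setting in this abstract, the following is a minimal sketch of constant-step gradient descent on a nonconvex least squares objective with real Gaussian measurements. The specific objective $f(x)=\frac{1}{4m}\sum_i(\langle a_i,x\rangle^2-y_i)^2$, the problem sizes, the step size, and the iteration count are assumptions chosen for the example, not values taken from the paper. In the real-valued case the signal can only be recovered up to a global sign, since the measurements are unchanged when $x^{\natural}$ is replaced by $-x^{\natural}$.

```python
import random

def loss_and_grad(x, A, y):
    """Return f(x) = 1/(4m) * sum_i ((a_i . x)^2 - y_i)^2 and its gradient."""
    m, n = len(A), len(x)
    loss, grad = 0.0, [0.0] * n
    for a_i, y_i in zip(A, y):
        inner = sum(a * v for a, v in zip(a_i, x))
        resid = inner * inner - y_i
        loss += resid * resid / (4 * m)
        # d/dx of resid^2 / (4m) is resid * inner * a_i / m
        for j in range(n):
            grad[j] += resid * inner * a_i[j] / m
    return loss, grad

def gradient_descent(A, y, x0, step=0.02, iters=1000):
    """Plain gradient descent with a constant step size (assumed value)."""
    x = list(x0)
    for _ in range(iters):
        _, g = loss_and_grad(x, A, y)
        x = [v - step * gv for v, gv in zip(x, g)]
    return x

rng = random.Random(0)
n, m = 5, 60  # illustrative sizes; m is a constant multiple of n

# Unit-norm ground truth and Gaussian measurement vectors.
x_true = [rng.gauss(0, 1) for _ in range(n)]
norm = sum(v * v for v in x_true) ** 0.5
x_true = [v / norm for v in x_true]
A = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
y = [sum(a * v for a, v in zip(a_i, x_true)) ** 2 for a_i in A]

# Small random initialization.
x0 = [0.1 * rng.gauss(0, 1) for _ in range(n)]
initial_loss, _ = loss_and_grad(x0, A, y)
x_hat = gradient_descent(A, y, x0)
final_loss, _ = loss_and_grad(x_hat, A, y)

# Distance to the ground truth up to the unavoidable sign ambiguity.
err = min(
    sum((a - b) ** 2 for a, b in zip(x_hat, x_true)) ** 0.5,
    sum((a + b) ** 2 for a, b in zip(x_hat, x_true)) ** 0.5,
)
print(initial_loss, final_loss, err)
```

The printed error compares `x_hat` against both `x_true` and `-x_true`; a single small run like this says nothing about the sample complexity result itself.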
Alireza Salemi, Hamed Zamani
This paper investigates the design of a unified search engine to serve multiple retrieval-augmented generation (RAG) agents, each with a distinct task, backbone large language model (LLM), and retrieval-augmentation strategy. We introduce an iterative approach where the search engine generates retrieval results for these RAG agents and gathers feedback on the quality of the retrieved documents during an offline phase. This feedback is then used to iteratively optimize the search engine using a novel expectation-maximization algorithm, with the goal of maximizing each agent's utility function. Additionally, we adapt this approach to an online setting, allowing the search engine to refine its behavior based on real-time feedback from individual agents to better serve each of them. Experiments on diverse datasets from the Knowledge-Intensive Language Tasks (KILT) benchmark demonstrate that our approach significantly outperforms competitive baselines on average across 18 RAG models. We also demonstrate that our method effectively ``personalizes'' the retrieval process for each RAG agent based on the collected feedback. Finally, we provide a comprehensive ablation study to explore various aspects of our method.
Xiquan Li, Wenxi Chen, Ziyang Ma, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Qiuqiang Kong, Xie Chen
While automated audio captioning (AAC) has made notable progress, traditional fully supervised AAC models still face two critical challenges: the need for expensive audio-text pair data for training and performance degradation when transferring across domains. To overcome these limitations, we present DRCap, a data-efficient and flexible zero-shot audio captioning system that requires text-only data for training and can quickly adapt to new domains without additional fine-tuning. DRCap integrates a contrastive language-audio pre-training (CLAP) model and a large language model (LLM) as its backbone. During training, the model predicts the ground-truth caption with a fixed text encoder from CLAP, whereas, during inference, the text encoder is replaced with the audio encoder to generate captions for audio clips in a zero-shot manner. To mitigate the modality gap of the CLAP model, we use both the projection strategy from the encoder side and the retrieval-augmented generation strategy from the decoder side. Specifically, audio embeddings are first projected onto a text embedding support to absorb extensive semantic information within the joint multi-modal space of CLAP. At the same time, similar captions retrieved from a datastore are fed as prompts to instruct the LLM, incorporating external knowledge to take full advantage of its strong generative capability. Conditioned on both the projected CLAP embedding and the retrieved similar captions, the model is able to produce a more accurate and semantically rich textual description. By tailoring the text embedding support and the caption datastore to the target domain, DRCap acquires a robust ability to adapt to new domains in a training-free manner. Experimental results demonstrate that DRCap outperforms all other zero-shot models in in-domain scenarios and achieves state-of-the-art performance in cross-domain scenarios.
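One plausible reading of "projecting audio embeddings onto a text embedding support" can be sketched as a similarity-weighted average of stored text embeddings. The abstract does not specify the projection rule, so the softmax weighting, the `temperature` value, and the two-dimensional toy vectors below are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def project_onto_support(audio_emb, support, temperature=0.1):
    """Replace an audio embedding by a softmax-weighted mix of text embeddings.

    Assumed rule: weights are softmax(cosine similarity / temperature).
    """
    sims = [cosine(audio_emb, t) for t in support]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(audio_emb)
    projected = [sum(w * t[j] for w, t in zip(weights, support)) for j in range(dim)]
    return projected, weights

# Toy support of two text embeddings (hypothetical values).
support = [[1.0, 0.0], [0.0, 1.0]]
projected, weights = project_onto_support([0.9, 0.1], support)
print(projected, weights)
```

Because the output is a combination of text embeddings only, it lies in the region the decoder saw during text-only training, which is the intuition behind using a projection to reduce the modality gap.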
Pengfei He, Shaowei Wang, Shaiful Chowdhury, Tse-Hsun Chen
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by
integrating external knowledge bases, achieving state-of-the-art results in
various coding tasks. The core of RAG is retrieving demonstration examples,
which is essential to balance effectiveness (generation quality) and efficiency
(retrieval time) for optimal performance. However, the high-dimensional nature
of code representations and large knowledge bases often create efficiency
bottlenecks, which are overlooked in previous research. This paper
systematically evaluates the efficiency-effectiveness trade-off of retrievers
across three coding tasks: Program Synthesis, Commit Message Generation, and
Assertion Generation. We examined six retrievers: two sparse (BM25 and BM25L)
and four dense retrievers, including one exhaustive dense retriever (SBERT's
Semantic Search) and three approximate dense retrievers (ANNOY, LSH, and HNSW).
Our findings show that while BM25 excels in effectiveness, it suffers in
efficiency as the knowledge base grows beyond 1000 entries. In large-scale
retrieval, efficiency differences become more pronounced, with approximate
dense retrievers offering the greatest gains. For instance, in the Commit
Message Generation task, HNSW achieves a 44x speedup with only a 1.74% drop in
RougeL compared with BM25. Our results also show that increasing the number of
demonstrations in the prompt doesn't always improve the effectiveness and can
increase latency and lead to incorrect outputs. Our findings provide valuable
insights for practitioners aiming to build efficient and effective RAG systems
for coding tasks.
Authors' comments: 11 pages, 6 figures, 6 tables, accepted by SANER 2025
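To make the sparse side of the retriever comparison above concrete, here is a minimal sketch of Okapi BM25 scoring over a tiny corpus. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the three example commit messages are assumptions for illustration; they are not the configuration evaluated in the paper.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # IDF variant with +1 inside the log, so it is never negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical knowledge base of commit messages.
corpus = [
    "fix null pointer in parser",
    "add unit tests for parser",
    "update readme with install steps",
]
scores = bm25_scores("parser null pointer", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

Every query scans the whole corpus here, so the cost grows linearly with the knowledge base. Approximate dense indexes avoid that full scan by searching only part of the index, at some cost in accuracy.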
Seungwook Lee, Maulana Bisyir Azhari, Gyuree Kang, Ozan Günes, Donghun Han, David Hyunchul Shim
We present an integrated UAV-hexapod robotic system designed for GNSS-denied maritime operations, capable of autonomous deployment and retrieval of a hexapod robot via a winch mechanism installed on a UAV. This system is intended to address the challenges of localization, control, and mobility in dynamic maritime environments. Our solution leverages sensor fusion techniques, combining optical flow, LiDAR, and depth data for precise localization. Experimental results demonstrate the effectiveness of this system in real-world scenarios, validating its performance during field tests in both controlled and operational conditions in the MBZIRC 2023 Maritime Challenge.
Houlun Chen, Xin Wang, Hong Chen, Zeyang Zhang, Wei Feng, Bin Huang, Jia Jia, Wenwu Zhu
Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained
understanding, which hinders precise video moment localization when given
fine-grained queries. In this paper, we propose a more challenging fine-grained
VCMR benchmark requiring methods to localize the best-matched moment from the
corpus with other partially matched candidates. To improve the dataset
construction efficiency and guarantee high-quality data annotations, we propose
VERIFIED, an automatic \underline{V}id\underline{E}o-text annotation pipeline
to generate captions with \underline{R}el\underline{I}able
\underline{FI}n\underline{E}-grained statics and \underline{D}ynamics.
Specifically, we resort to large language models (LLMs) and large multimodal
models (LMMs) with our proposed Statics and Dynamics Enhanced Captioning modules
to generate diverse fine-grained captions for each video. To filter out the
inaccurate annotations caused by LLM hallucination, we propose a
Fine-Granularity Aware Noise Evaluator where we fine-tune a video foundation
model with disturbed hard-negatives augmented contrastive and matching losses.
With VERIFIED, we construct a more challenging fine-grained VCMR benchmark
containing Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG which demonstrate a
high level of annotation quality. We evaluate several state-of-the-art VCMR
models on the proposed dataset, revealing that there is still significant scope
for fine-grained video understanding in VCMR. Code and datasets are available at
\href{https://github.com/hlchen23/VERIFIED}{https://github.com/hlchen23/VERIFIED}.
Authors' comments: Accepted by 38th NeurIPS Datasets & Benchmarks Track (NeurIPS 2024)
Philipp Christmann, Svitlana Vakulenko, Ionut Teodor Sorodoc, Bill Byrne, Adrià de Gispert
Long-form question answering (LFQA) aims at generating in-depth answers to
end-user questions, providing relevant information beyond the direct answer.
However, existing retrievers are typically optimized towards information that
directly targets the question, missing out on such contextual information.
Furthermore, there is a lack of training data for relevant context. To this
end, we propose and compare different weak supervision techniques to optimize
retrieval for contextual information. Experiments demonstrate improvements on
the end-to-end QA performance on ASQA, a dataset for long-form question
answering. Importantly, as more contextual information is retrieved, we improve
the relevant page recall for LFQA by 14.7% and the groundedness of generated
long-form answers by 12.5%. Finally, we show that long-form answers often
anticipate likely follow-up questions, via experiments on a conversational QA
dataset.
Authors' comments: Accepted at EMNLP 2024 (Findings)
Peiran Wang, Xiaogeng Liu, Chaowei Xiao
In this study, we introduce RePD, an innovative Retrieval-based Prompt Decomposition framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs). Despite rigorous pretraining and finetuning focused on ethical alignment, LLMs are still susceptible to jailbreak exploits. RePD operates on a one-shot learning model, wherein it accesses a database of pre-collected jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. This process involves packaging the decomposition of the jailbreak prompt into the user's original query as a one-shot learning example that teaches the LLM to discern and separate malicious components. Consequently, the LLM is equipped to first neutralize any potentially harmful elements before addressing the user's prompt in a manner that aligns with its ethical guidelines. RePD is versatile and compatible with a variety of open-source LLMs acting as agents. Through comprehensive experimentation with both harmful and benign prompts, we demonstrate the efficacy of RePD in enhancing the resilience of LLMs against jailbreak attacks without compromising their performance in responding to typical user requests.
Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang
Current Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks during prefill, which requires a large volume of computation and therefore leads to significant time-to-first-token (TTFT) latency. To reduce the computation overhead as well as TTFT, we introduce TurboRAG, a novel RAG system that redesigns the inference paradigm of the current RAG system by first pre-computing and storing the key-value (KV) caches of documents offline, and then directly retrieving the saved KV cache for prefill. Hence, online computation of KV caches is eliminated during inference. In addition, we provide a number of insights into the mask matrix and positional embedding mechanisms, and fine-tune a pretrained language model to maintain the model accuracy of TurboRAG. Our approach is applicable to most existing large language models and their applications without requiring any modification of models or inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x compared to conventional RAG systems (8.6x on average), while preserving performance comparable to standard RAG systems.
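The precompute-then-look-up pattern described in this abstract can be sketched with a toy store. This is not TurboRAG's implementation: `KVCacheStore` and `fake_compute_kv` are hypothetical names, and the stand-in function returns a tuple of tokens rather than real attention key-value tensors. The mask and positional embedding handling mentioned in the abstract is not modeled at all.

```python
class KVCacheStore:
    """Toy store mapping document ids to precomputed per-document state."""

    def __init__(self, compute_kv):
        self.compute_kv = compute_kv  # expensive function, run offline only
        self.cache = {}

    def precompute(self, docs):
        """Offline phase: compute and store the state for every document."""
        for doc_id, text in docs.items():
            self.cache[doc_id] = self.compute_kv(text)

    def fetch(self, doc_ids):
        """Online phase: pure lookup, no recomputation."""
        return [self.cache[d] for d in doc_ids]

calls = {"n": 0}

def fake_compute_kv(text):
    """Hypothetical stand-in for a model forward pass over one document."""
    calls["n"] += 1
    return tuple(text.split())

docs = {
    "d1": "retrieved chunks are processed during prefill",
    "d2": "kv caches are stored offline",
}
store = KVCacheStore(fake_compute_kv)
store.precompute(docs)
offline_calls = calls["n"]

# At query time the retriever returns ids and the store returns saved state.
prefill_state = store.fetch(["d2", "d1"])
online_calls = calls["n"] - offline_calls
print(offline_calls, online_calls)
```

The counter shows the point of the design: all calls to the expensive function happen before any query arrives, so the online path only pays for lookups.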
Mengxuan Hu, Hongyi Wu, Zihan Guan, Ronghang Zhu, Dongliang Guo, Daiqing Qi, Sheng Li
Retrieval-Augmented Generation (RAG) is widely adopted for its effectiveness and cost-efficiency in mitigating hallucinations and enhancing the domain-specific generation capabilities of large language models (LLMs). However, is this effectiveness and cost-efficiency truly a free lunch? In this study, we comprehensively investigate the fairness costs associated with RAG by proposing a practical three-level threat model from the perspective of user awareness of fairness. Specifically, varying levels of user fairness awareness result in different degrees of fairness censorship on the external dataset. We examine the fairness implications of RAG using uncensored, partially censored, and fully censored datasets. Our experiments demonstrate that fairness alignment can be easily undermined through RAG without the need for fine-tuning or retraining. Even with fully censored and supposedly unbiased external datasets, RAG can lead to biased outputs. Our findings underscore the limitations of current alignment methods in the context of RAG-based LLMs and highlight the urgent need for new strategies to ensure fairness. We propose potential mitigations and call for further research to develop robust fairness safeguards in RAG-based LLMs.
Haocheng Xu, Haotian Hu, Sitao Huang
High-level synthesis (HLS) allows hardware designers to create hardware designs with high-level programming languages like C/C++/OpenCL, which greatly improves hardware design productivity. However, existing HLS flows require programmers' hardware design expertise and rely on programmers' manual code transformations and directive annotations to guide compiler optimizations. Optimizing HLS designs requires non-trivial HLS expertise and a tedious iterative process of HLS code optimization. Automating HLS code optimizations has become a burning need. Recently, large language models (LLMs) trained on massive code and programming tasks have demonstrated remarkable proficiency in comprehending code, showing the ability to handle domain-specific programming queries directly without labor-intensive fine-tuning. In this work, we propose a novel retrieval-augmented LLM-based approach to effectively optimize high-level synthesis (HLS) programs. Our proposed method leverages few-shot learning, enabling large language models to adopt domain-specific knowledge through natural language prompts. We propose a unique framework, Retrieve Augmented Large Language Model Aided Design (RALAD), designed to enhance LLMs' performance in HLS code optimization tasks. RALAD employs advanced embedding techniques and top-\emph{k} search algorithms to dynamically source relevant knowledge from extensive databases, thereby providing contextually appropriate responses to complex programming queries. Our implementation of RALAD on two specialized domains, utilizing comparatively smaller language models, achieves an impressive 80\% success rate in compilation tasks and outperforms general LLMs by 3.7 -- 19$\times$ in latency improvement.
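The embedding-plus-top-k retrieval step mentioned in this abstract can be illustrated with a brute-force cosine similarity search. The abstract does not say which embedding model or search algorithm RALAD uses, so the similarity measure, the three-dimensional toy vectors, and the knowledge-base entry names below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_vec, knowledge_base, k):
    """Return the names of the k entries most similar to the query."""
    ranked = sorted(
        knowledge_base.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical embedded notes about HLS optimizations.
knowledge_base = {
    "pipeline": [1.0, 0.0, 0.0],
    "unroll": [0.9, 0.1, 0.0],
    "partition": [0.0, 1.0, 0.0],
    "dataflow": [0.0, 0.0, 1.0],
}
retrieved = top_k([1.0, 0.0, 0.0], knowledge_base, k=2)
print(retrieved)
```

In a retrieval-augmented prompt, the text behind the retrieved entries would be placed in the prompt as few-shot context before the code to be optimized.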
Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, Sercan Ö. Arık
Retrieval-Augmented Generation (RAG), while effective in integrating external
knowledge to address the limitations of large language models (LLMs), can be
undermined by imperfect retrieval, which may introduce irrelevant, misleading,
or even malicious information. Despite its importance, previous studies have
rarely explored the behavior of RAG through joint analysis on how errors from
imperfect retrieval attribute and propagate, and how potential conflicts arise
between the LLMs' internal knowledge and external sources. We find that
imperfect retrieval augmentation might be inevitable and quite harmful, through
controlled analysis under realistic conditions. We identify the knowledge
conflicts between LLM-internal and external knowledge from retrieval as a
bottleneck to overcome in the post-retrieval stage of RAG. To render LLMs
resilient to imperfect retrieval, we propose Astute RAG, a novel RAG approach
that adaptively elicits essential information from LLMs' internal knowledge,
iteratively consolidates internal and external knowledge with source-awareness,
and finalizes the answer according to information reliability. Our experiments
using Gemini and Claude demonstrate that Astute RAG significantly outperforms
previous robustness-enhanced RAG methods. Notably, Astute RAG is the only
approach that matches or exceeds the performance of LLMs without RAG under
worst-case scenarios. Further analysis reveals that Astute RAG effectively
resolves knowledge conflicts, improving the reliability and trustworthiness of
RAG systems.
Authors' comments: Preprint
Jian Xiao, Zhenzhen Hu, Jia Li, Richang Hong
Text-video retrieval (TVR) has seen substantial advancements in recent years, fueled by the utilization of pre-trained models and large language models (LLMs). Despite these advancements, achieving accurate matching in TVR remains challenging due to inherent disparities between video and textual modalities and irregularities in data representation. In this paper, we propose Text-Video-ProxyNet (TV-ProxyNet), a novel framework designed to decompose the conventional 1-to-N relationship of TVR into N distinct 1-to-1 relationships. By replacing a single text query with a series of text proxies, TV-ProxyNet not only broadens the query scope but also achieves a more precise expansion. Each text proxy is crafted through a refined iterative process, controlled by mechanisms we term the director and dash, which regulate the proxy's direction and distance relative to the original text query. This setup not only facilitates more precise semantic alignment but also effectively manages the disparities and noise inherent in multimodal data. Our experiments on three representative video-text retrieval benchmarks, MSRVTT, DiDeMo, and ActivityNet Captions, demonstrate the effectiveness of TV-ProxyNet. The results show an improvement of 2.0% to 3.3% in R@1 over the baseline. TV-ProxyNet achieved state-of-the-art performance on MSRVTT and ActivityNet Captions, and a 2.0% improvement on DiDeMo compared to existing methods, validating our approach's ability to enhance semantic mapping and reduce error propensity.
Cheng Gao, Chaojun Xiao, Zhenghao Liu, Huimin Chen, Zhiyuan Liu, Maosong Sun
Legal case retrieval (LCR) aims to provide similar cases as references for a
given fact description. This task is crucial for promoting consistent judgments
in similar cases, effectively enhancing judicial fairness and improving work
efficiency for judges. However, existing works face two main challenges for
real-world applications: existing works mainly focus on case-to-case retrieval
using lengthy queries, which does not match real-world scenarios; and the
limited data scale, with current datasets containing only hundreds of queries,
is insufficient to satisfy the training requirements of existing data-hungry
neural models. To address these issues, we introduce an automated method to
construct synthetic query-candidate pairs and build the largest LCR dataset to
date, LEAD, which is hundreds of times larger than existing datasets. This data
construction method can provide ample training signals for LCR models.
Experimental results demonstrate that model training with our constructed data
can achieve state-of-the-art results on two widely-used LCR benchmarks.
Besides, the construction method can also be applied to civil cases and achieve
promising results. The data and code are available at
https://github.com/thunlp/LEAD.
Authors' comments: 15 pages, 3 figures, accepted by EMNLP 2024
Thea Aviss
In this paper we present APEX-Embedding-7B (Advanced Processing for Epistemic
eXtraction), a 7-billion parameter decoder-only text Feature Extraction Model,
specifically designed for Document Retrieval-Augmented Generation (RAG) tasks.
Our approach employs two training techniques that yield an emergent improvement
in factual focus: (1) Pre-convergence interrupted fine-tuning using Structured
Entity Relationship Maps as training data input: designed to shift the model's
attention and create a bias towards factual content rather than semantic style
- this enhances plain text performance despite not being directly trained for
it; and (2) Model-Aware Contrastive Sampling, creating a balanced and evenly
distributed collation map of hard and soft negatives directly informed by the
base model's competency. This combined methodology yields significant
improvements, enhancing plain text query/document pair retrieval to achieve an
absolute rank@1 accuracy of 90.86% (an increase of 6.26% compared to the next
leading model) in our evaluation, and reducing training data input context size
by an average of 37.71% compared to plain text for both queries and document
texts. Based on our evaluations, our model establishes a new state-of-the-art
standard in text feature extraction for longer context document retrieval
tasks.
Authors' comments: 10 Pages, 9 Figures
Wenyu Huang, Guancheng Zhou, Hongru Wang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan
Retrieval-Augmented Generation (RAG) is widely used to inject external
non-parametric knowledge into large language models (LLMs). Recent works
suggest that Knowledge Graphs (KGs) contain valuable external knowledge for
LLMs. Retrieving information from KGs differs from extracting it from document
sets. Most existing approaches seek to directly retrieve relevant subgraphs,
thereby eliminating the need for extensive SPARQL annotations, traditionally
required by semantic parsing methods. In this paper, we model the subgraph
retrieval task as a conditional generation task handled by small language
models. Specifically, we define a subgraph identifier as a sequence of
relations, each represented as a special token stored in the language models.
Our base generative subgraph retrieval model, consisting of only 220M
parameters, achieves competitive retrieval performance compared to
state-of-the-art models relying on 7B parameters, demonstrating that small
language models are capable of performing the subgraph retrieval task.
Furthermore, our largest 3B model, when plugged with an LLM reader, sets new
SOTA end-to-end performance on both the WebQSP and CWQ benchmarks. Our model
and data will be made available online: https://github.com/hwy9855/GSR.
Authors' comments: Accepted by EMNLP 2024 Findings
Ryota Tozuka, Hisashi Johno, Akitomo Amakawa, Junichi Sato, Mizuki Muto, Shoichiro Seki, Atsushi Komaba, Hiroshi Onishi
Purpose: In radiology, large language models (LLMs), including ChatGPT, have
recently gained attention, and their utility is being rapidly evaluated.
However, concerns have emerged regarding their reliability in clinical
applications due to limitations such as hallucinations and insufficient
referencing. To address these issues, we focus on the latest technology,
retrieval-augmented generation (RAG), which enables LLMs to reference reliable
external knowledge (REK). Specifically, this study examines the utility and
reliability of a recently released RAG-equipped LLM (RAG-LLM), NotebookLM, for
staging lung cancer.
Materials and methods: We summarized the current lung cancer staging
guideline in Japan and provided this as REK to NotebookLM. We then tasked
NotebookLM with staging 100 fictional lung cancer cases based on CT findings
and evaluated its accuracy. For comparison, we performed the same task using a
gold-standard LLM, GPT-4 Omni (GPT-4o), both with and without the REK.
Results: NotebookLM achieved 86% diagnostic accuracy in the lung cancer
staging experiment, outperforming GPT-4o, which recorded 39% accuracy with the
REK and 25% without it. Moreover, NotebookLM demonstrated 95% accuracy in
searching reference locations within the REK.
Conclusion: NotebookLM successfully performed lung cancer staging by
utilizing the REK, demonstrating superior performance compared to GPT-4o.
Additionally, it provided highly accurate reference locations within the REK,
allowing radiologists to efficiently evaluate the reliability of NotebookLM's
responses and detect possible hallucinations. Overall, this study highlights
the potential of NotebookLM, a RAG-LLM, in image diagnosis.
Authors' comments: 9 pages, 5 figures, 1 table, 3 ancillary files
Junya Shiraishi, Anders E. Kalør, Israel Leyva-Mayorga, Federico Chiariotti, Petar Popovski, Hiroyuki Yomo
Energy efficiency and information freshness are key requirements for sensor
nodes serving Industrial Internet of Things (IIoT) applications, where a sink
node collects informative and fresh data before a deadline, e.g., to control an
external actuator. Content-based wake-up (CoWu) activates a subset of nodes
that hold data relevant for the sink's goal, thereby offering an
energy-efficient way to attain objectives related to information freshness.
This paper focuses on a scenario where the sink collects fresh information on
top-k values, defined as data from the nodes observing the k highest readings
at the deadline. We introduce a new metric called top-k Query Age of
Information (k-QAoI), which allows us to characterize the performance of CoWu
by considering the characteristics of the physical process. Further, we show
how to select the CoWu parameters, such as its timing and threshold, to attain
both information freshness and energy efficiency. The numerical results reveal
the effectiveness of the CoWu approach, which is able to collect top-k data
with higher energy efficiency while reducing k-QAoI when compared to
round-robin scheduling, especially when the number of nodes is large and the
required size of k is small.
Authors' comments: Submitted to IEEE Transactions on Communications
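The idea of content-based wake-up for top-k collection in the abstract above can be sketched as a threshold filter followed by a top-k selection. The rule that a node wakes up when its reading is at or above the threshold is an assumption made for this example, as are the node names, readings, and threshold value. Wake-up timing and the k-QAoI metric are not modeled.

```python
def content_based_wakeup(readings, threshold):
    """Return only the nodes whose reading meets the threshold (assumed rule)."""
    return {node: value for node, value in readings.items() if value >= threshold}

def top_k_nodes(readings, k):
    """Return the k node names with the highest readings, highest first."""
    return sorted(readings, key=readings.get, reverse=True)[:k]

# Hypothetical sensor readings at the deadline.
readings = {"n1": 3.2, "n2": 7.5, "n3": 5.1, "n4": 9.0, "n5": 1.4}
k = 2

awake = content_based_wakeup(readings, threshold=5.0)
collected = top_k_nodes(awake, k)       # what the sink obtains with wake-up
true_top_k = top_k_nodes(readings, k)   # what polling every node would obtain

print(collected, true_top_k, len(awake), len(readings))
```

Here three nodes transmit instead of five, and the sink still obtains the same top-2 set. If the threshold were set too high, fewer than k nodes would wake up, which is why the threshold has to be chosen with the physical process in mind.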
Tianhui Cai, Yifan Liu, Zewei Zhou, Haoxuan Ma, Seth Z. Zhao, Zhiwen Wu, Jiaqi Ma
This work presents an interpretable decision-making framework for autonomous vehicles that integrates traffic regulations, norms, and safety guidelines comprehensively and enables seamless adaptation to different regions. While traditional rule-based methods struggle to incorporate the full scope of traffic rules, we develop a Traffic Regulation Retrieval (TRR) Agent based on Retrieval-Augmented Generation (RAG) to automatically retrieve relevant traffic rules and guidelines from extensive regulation documents and relevant records based on the ego vehicle's situation. Given the semantic complexity of the retrieved rules, we also design a reasoning module powered by a Large Language Model (LLM) to interpret these rules, differentiate between mandatory rules and safety guidelines, and assess actions on legal compliance and safety. Additionally, the reasoning is designed to be interpretable, enhancing both transparency and reliability. The framework demonstrates robust performance on both hypothesized and real-world cases across diverse scenarios, along with the ability to adapt to different regions with ease.
Ameer Hamza, Abdullah, Yong Hyun Ahn, Sungyoung Lee, Seong Tae Kim
Generating Natural Language Explanations (NLEs) for model predictions on medical images, particularly those depicting thoracic pathologies, remains a critical and challenging task. Existing methodologies often struggle due to general models' insufficient domain-specific medical knowledge and privacy concerns associated with retrieval-based augmentation techniques. To address these issues, we propose a novel Vision-Language framework augmented with a Knowledge Graph (KG)-based datastore, which enhances the model's understanding by incorporating additional domain-specific medical knowledge essential for generating accurate and informative NLEs. Our framework employs a KG-based retrieval mechanism that not only improves the precision of the generated explanations but also preserves data privacy by avoiding direct data retrieval. The KG datastore is designed as a plug-and-play module, allowing for seamless integration with various model architectures. We introduce and evaluate three distinct frameworks within this paradigm: KG-LLaVA, which integrates the pre-trained LLaVA model with KG-RAG; Med-XPT, a custom framework combining MedCLIP, a transformer-based projector, and GPT-2; and Bio-LLaVA, which adapts LLaVA by incorporating the Bio-ViT-L vision model. These frameworks are validated on the MIMIC-NLE dataset, where they achieve state-of-the-art results, underscoring the effectiveness of KG augmentation in generating high-quality NLEs for thoracic pathologies.