Adil Bahaj, Mounir Ghogho
Asthma rates have risen globally, driven by environmental and lifestyle
factors. Access to immediate medical care is limited, particularly in
developing countries, necessitating automated support systems. Large Language
Models like ChatGPT (Chat Generative Pre-trained Transformer) and Gemini have
advanced natural language processing in general and question answering in
particular; however, they are prone to producing factually incorrect responses
(i.e., hallucinations). Retrieval-augmented generation systems, integrating
curated documents, can improve large language models' performance and reduce
the incidence of hallucination. We introduce AsthmaBot, a multi-lingual,
multi-modal retrieval-augmented generation system for asthma support.
Evaluation on an asthma-related frequently asked questions dataset shows
AsthmaBot's efficacy. AsthmaBot has an added interactive and intuitive
interface that integrates different data modalities (text, images, videos) to
make it accessible to the larger public. AsthmaBot is available online via
\url{asthmabot.datanets.org}.
Authors' comments: 10 pages
Sung Yun Lee, Do Hyung Cho, Chulho Jung, Daeho Sung, Daewoong Nam, Sangsoo Kim, Changyong Song
Machine learning is attracting surging interest across nearly all scientific areas by enabling the analysis of large datasets and the extraction of scientific information from incomplete data. Data-driven science is rapidly growing, especially in X-ray methodologies, where advanced light sources and detection technologies accumulate vast amounts of data that exceed meticulous human inspection capabilities. Despite the increasing demands, the full application of machine learning has been hindered by the need for data-specific optimizations. In this study, we introduce a new deep-learning-based phase retrieval method for imperfect diffraction data. This method provides robust phase retrieval for simulated data and performs well on weak-signal single-pulse diffraction data from X-ray free-electron lasers. Moreover, the method significantly reduces data processing time, facilitating real-time image reconstructions that are crucial for high-repetition-rate data acquisition. Thus, this approach offers a reliable solution to the phase problem and is expected to be widely adopted across various research areas.
Zheng Liu, Chenyuan Wu, Ninglu Shao, Shitao Xiao, Chaozhuo Li, Defu Lian
The existing Retrieval-Augmented Generation (RAG) systems face significant challenges in terms of cost and effectiveness. On one hand, they need to encode the lengthy retrieved contexts before responding to the input tasks, which imposes substantial computational overhead. On the other hand, directly using generic Large Language Models (LLMs) often leads to sub-optimal answers, while task-specific fine-tuning may compromise the LLMs' general capabilities. To address these challenges, we introduce a novel approach called FlexRAG (Flexible Context Adaptation for RAG). In this approach, the retrieved contexts are compressed into compact embeddings before being encoded by the LLMs. Simultaneously, these compressed embeddings are optimized to enhance downstream RAG performance. A key feature of FlexRAG is its flexibility, which enables effective support for diverse compression ratios and selective preservation of important contexts. Thanks to these technical designs, FlexRAG achieves superior generation quality while significantly reducing running costs. Comprehensive experiments on various question-answering datasets validate our approach as a cost-effective and flexible solution for RAG systems.
Gyuree Kang, Ozan Güneş, Seungwook Lee, Maulana Bisyir Azhari, David Hyunchul Shim
In real-world field operations, aerial grasping systems face significant challenges in dynamic environments due to strong winds, shifting surfaces, and the need to handle heavy loads. Particularly when dealing with heavy objects, the powerful propellers of the drone can inadvertently blow the target object away as it approaches, making the task even more difficult. To address these challenges, we introduce SPIBOT, a novel drone-tethered mobile gripper system designed for robust and stable autonomous target retrieval. SPIBOT operates via a tether, much like a spider, allowing the drone to maintain a safe distance from the target. To ensure both stable mobility and secure grasping capabilities, SPIBOT is equipped with six legs and sensors to estimate the robot's and mission's states. It is designed with a reduced volume and weight compared to other hexapod robots, allowing it to be easily stowed under the drone and reeled in as needed. Designed for the 2024 MBZIRC Maritime Grand Challenge, SPIBOT is built to retrieve a 1kg target object in the highly dynamic conditions of the moving deck of a ship. This system integrates a real-time action selection algorithm that dynamically adjusts the robot's actions based on proximity to the mission goal and environmental conditions, enabling rapid and robust mission execution. Experimental results across various terrains, including a pontoon on a lake, a grass field, and rubber mats on coastal sand, demonstrate SPIBOT's ability to efficiently and reliably retrieve targets. SPIBOT swiftly converges on the target and completes its mission, even when dealing with irregular initial states and noisy information introduced by the drone.
Nirmal Roy, Leonardo F. R. Ribeiro, Rexhina Blloshmi, Kevin Small
Augmenting Large Language Models (LLMs) with information retrieval
capabilities (i.e., Retrieval-Augmented Generation (RAG)) has proven beneficial
for knowledge-intensive tasks. However, understanding users' contextual search
intent when generating responses is an understudied topic for conversational
question answering (QA). This conversational extension leads to additional
concerns when compared to single-turn QA as it is more challenging for systems
to comprehend conversational context and manage retrieved passages over
multiple turns. In this work, we propose a method for enabling LLMs to decide
when to retrieve in RAG settings given a conversational context. When retrieval
is deemed necessary, the LLM then rewrites the conversation for passage
retrieval and judges the relevance of returned passages before response
generation. Operationally, we build on the single-turn SELF-RAG framework (Asai
et al., 2023) and propose SELF-multi-RAG for conversational settings.
SELF-multi-RAG demonstrates improved capabilities over single-turn variants
with respect to retrieving relevant passages (by using summarized
conversational context) and assessing the quality of generated responses.
Experiments on three conversational QA datasets validate the enhanced response
generation capabilities of SELF-multi-RAG, with improvements of ~13% measured
by human annotation.
Authors' comments: Accepted in EMNLP (findings) 2024
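The control flow described in this abstract (decide whether to retrieve, rewrite the conversation into a query, filter passages by relevance, then generate) can be sketched as follows. This is a minimal illustration, not the SELF-multi-RAG implementation: `ToyLLM` and all of its methods are hypothetical rule-based placeholders standing in for decisions the paper assigns to the LLM.

```python
class ToyLLM:
    """Rule-based stand-in for an LLM; every method is a hypothetical placeholder."""

    def needs_retrieval(self, history, question):
        # Toy rule: greetings and thanks need no external knowledge.
        return question.lower().strip("!. ") not in {"hi", "thanks"}

    def rewrite(self, history, question):
        # Toy rewrite: prepend the previous turn so the query is self-contained.
        return f"{history[-1]} {question}" if history else question

    def is_relevant(self, query, passage):
        # Toy relevance judgment: any shared whitespace-separated token.
        return bool(set(query.lower().split()) & set(passage.lower().split()))

    def generate(self, history, question, passages):
        return f"answer using {len(passages)} passage(s)"


def answer_turn(history, question, llm, retriever):
    """Answer one conversational turn, retrieving only when the model asks for it."""
    if not llm.needs_retrieval(history, question):
        return llm.generate(history, question, passages=[])
    query = llm.rewrite(history, question)
    passages = [p for p in retriever(query) if llm.is_relevant(query, p)]
    return llm.generate(history, question, passages=passages)
```

Here `retriever` is any callable that maps a query string to a list of passages; in a real system each `ToyLLM` method would be a model call.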
Haoyu Huang, Tong Niu, Rui Yang, Luping Shi
Recently, many studies have focused on integrating large language models (LLMs) into educational dialogues. In liberal arts dialogues especially, educators must balance \textbf{H}umanized communication, \textbf{T}eaching expertise, and \textbf{S}afety-ethics (\textbf{HTS}), in addition to the subject knowledge itself. However, because collecting massive amounts of HTS-compliant teaching dialogues from the real world as a training corpus is expensive, the outputs of existing LLMs in teaching dialogues fall short of human standards. To address this, we design a Retrieval-augmented Multi-role Multi-expert Collaboration (RAM2C) framework to automatically generate such dialogue data. Specifically, we first establish HTS-guided knowledge bases covering three knowledge domains: teaching skills, psychology, and safety ethics. Then, RAM2C organizes LLMs, which are retrieval-augmented by these knowledge bases, into multi-expert groups with distinct roles to generate the HTS-compliant educational dialogue dataset. We then fine-tune the LLMs using this dataset. Empirical evaluations indicate that RAM2C-empowered LLMs excel in Chinese reading teaching, offering more personalized and ethically safe teaching responses, demonstrating RAM2C's practicality and high quality. We release the experiments at https://github.com/ram2c/ram2c.
Sean Kim, Raja Mazumder
The exponential growth in computational power and accessibility has
transformed the complexity and scale of bioinformatics research, necessitating
standardized documentation for transparency, reproducibility, and regulatory
compliance. The IEEE BioCompute Object (BCO) standard addresses this need but
faces adoption challenges due to the overhead of creating compliant
documentation, especially for legacy research. This paper presents a novel
approach to automate the creation of BCOs from scientific papers using
Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). We
describe the development of the BCO assistant tool that leverages RAG to
extract relevant information from source papers and associated code
repositories, addressing key challenges such as LLM hallucination and
long-context understanding. The implementation incorporates optimized retrieval
processes, including a two-pass retrieval with re-ranking, and employs
carefully engineered prompts for each BCO domain. We discuss the tool's
architecture, extensibility, and evaluation methods, including automated and
manual assessment approaches. The BCO assistant demonstrates the potential to
significantly reduce the time and effort required for retroactive documentation
of bioinformatics research while maintaining compliance with the standard. This
approach opens avenues for AI-assisted scientific documentation and knowledge
extraction from publications, thereby enhancing scientific reproducibility. The
BCO assistant tool and documentation are available at
https://biocompute-objects.github.io/bco-rag/.
Authors' comments: 21 pages, 8 figures
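The two-pass retrieval with re-ranking mentioned in this abstract can be illustrated with a small sketch. The scoring functions are assumptions chosen for simplicity (shared-token count for the first pass, Jaccard similarity as a stand-in for a stronger re-ranker); they are not the BCO assistant's actual retrievers, and all function names are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def first_pass(query, chunks, k):
    """Cheap, recall-oriented pass: keep the k chunks sharing the most tokens with the query."""
    q = tokens(query)
    return sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)[:k]


def rerank(query, candidates, top_n):
    """Second pass: re-score the shortlist with Jaccard similarity and keep top_n."""
    q = tokens(query)

    def jaccard(chunk):
        t = tokens(chunk)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(candidates, key=jaccard, reverse=True)[:top_n]


def two_pass_retrieve(query, chunks, k=3, top_n=1):
    """Shortlist with the cheap scorer, then re-rank the shortlist."""
    return rerank(query, first_pass(query, chunks, k), top_n)
```

The point of the design is that the expensive scorer only sees the `k` shortlisted chunks, so a long but loosely related chunk that survives the first pass can still be demoted in the second.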
Matthew Kolodner, Mingxuan Ju, Zihao Fan, Tong Zhao, Elham Ghazizadeh, Yan Wu, Neil Shah, Yozen Liu
Improving recommendation systems (RS) can greatly enhance the user experience
across many domains, such as social media. Many RS utilize embedding-based
retrieval (EBR) approaches to retrieve candidates for recommendation. In an EBR
system, the embedding quality is key. According to recent literature,
self-supervised multitask learning (SSMTL) has shown strong performance on
academic benchmarks in embedding learning and resulted in an overall
improvement in multiple downstream tasks, demonstrating a larger resilience to
the adverse conditions between each downstream task and thereby increased
robustness and task generalization ability through the training objective.
However, whether or not the success of SSMTL in academia as a robust training
objective translates to large-scale (i.e., hundreds of millions of users and
interactions between them) industrial RS still requires verification. Simply
adopting academic setups in industrial RS might entail two issues. Firstly,
many self-supervised objectives require data augmentations (e.g., embedding
masking/corruption) over a large portion of users and items, which is
prohibitively expensive in industrial RS. Furthermore, some self-supervised
objectives might not align with the recommendation task, which might lead to
redundant computational overheads or negative transfer. In light of these two
challenges, we evaluate using a robust training objective, specifically SSMTL,
through a large-scale friend recommendation system on a social media platform
in the tech sector, identifying whether this increase in robustness can work at
scale in enhancing retrieval in the production setting. Through online A/B
testing with SSMTL-based EBR, we observe statistically significant increases in
key metrics in the friend recommendations, with up to 5.45% improvements in new
friends made and 1.91% improvements in new friends made with cold-start users.
Authors' comments: RobustRecSys workshop @ RecSys 2024
Benjamin Clavié, Antoine Chaffin, Griffin Adams
Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space and memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. This method also allows for further reductions, reducing the vector count by 66% to 75%, with degradation remaining below 5% on a vast majority of datasets. Importantly, this approach requires neither architectural changes nor query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
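A minimal sketch of clustering-based token pooling follows. The clustering procedure here (greedy agglomerative merging of the most cosine-similar clusters, then mean pooling) and the parameter name `pool_factor` are assumptions made for illustration; the abstract does not specify the exact algorithm. Under this convention, a pool factor of 2 keeps about half of the vectors, and factors of 3 and 4 correspond to reductions of roughly 66% and 75%.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pool_tokens(token_vectors, pool_factor=2):
    """Reduce one document's token vectors to about len / pool_factor pooled vectors."""
    target = max(1, len(token_vectors) // pool_factor)
    clusters = [[v] for v in token_vectors]
    while len(clusters) > target:
        best = None  # (similarity, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(mean(clusters[i]), mean(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    # Each stored vector is the mean of the token vectors in its cluster.
    return [mean(c) for c in clusters]
```

Because pooling happens once per document at indexing time, the query side is untouched, which is consistent with the drop-in property the abstract describes. The cubic-time loop is only suitable for a demonstration on short documents.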
Lindsey Linxi Wei, Guorui Xiao, Magdalena Balazinska
As an important component of data exploration and integration, Column Type Annotation (CTA) aims to label columns of a table with one or more semantic types. With the recent development of Large Language Models (LLMs), researchers have started to explore the possibility of using LLMs for CTA, leveraging their strong zero-shot capabilities. In this paper, we build on this promising work and improve on LLM-based methods for CTA by showing how to use a Knowledge Graph (KG) to augment the context information provided to the LLM. Our approach, called RACOON, combines both pre-trained parametric and non-parametric knowledge during generation to improve LLMs' performance on CTA. Our experiments show that RACOON achieves up to a 0.21 micro F-1 improvement compared against vanilla LLM inference.
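For readers unfamiliar with the metric reported above, micro F1 for multi-label column type annotation pools true positives, false positives, and false negatives over all columns before computing F1. The sketch below uses the standard definition; the label names in the usage example are invented for illustration and do not come from the paper.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 where gold and pred are lists of label sets, one set per column."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical example: three columns, one of them with two gold types.
gold = [{"city"}, {"person", "athlete"}, {"date"}]
pred = [{"city"}, {"person"}, {"country"}]
score = micro_f1(gold, pred)  # tp=2, fp=1, fn=2, so F1 = 4/7
```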
Jiashuo Sun, Jihai Zhang, Yucheng Zhou, Zhaochen Su, Xiaoye Qu, Yu Cheng
Large Vision-Language Models (LVLMs) have become pivotal at the intersection
of computer vision and natural language processing. However, the full potential
of LVLMs' Retrieval-Augmented Generation (RAG) capabilities remains
underutilized. Existing works either focus solely on the text modality or are
limited to specific tasks. Moreover, most LVLMs struggle to selectively utilize
retrieved information and are sensitive to irrelevant or misleading references.
To address these challenges, we propose a self-refinement framework designed to
teach LVLMs to Selectively Utilize Retrieved Information (SURf). Specifically,
when given questions that are incorrectly answered by the LVLM backbone, we
obtain references that help correct the answers (positive references) and those
that do not (negative references). We then fine-tune the LVLM backbone using a
combination of these positive and negative references. Our experiments across
three tasks and seven datasets demonstrate that our framework significantly
enhances LVLMs' ability to effectively utilize retrieved multimodal references
and improves their robustness against irrelevant or misleading information. The
source code is available at https://github.com/GasolSun36/SURf.
Authors' comments: 19 pages, 9 tables, 11 figures
Georgios Sidiropoulos, Evangelos Kanoulas
Speech-based open-domain question answering (QA over a large corpus of text passages with spoken questions) has emerged as an important task due to the increasing number of users interacting with QA systems via speech interfaces. Passage retrieval is a key task in speech-based open-domain QA. So far, previous works have adopted pipelines consisting of an automatic speech recognition (ASR) model that transcribes the spoken question before feeding it to a dense text retriever. Such pipelines have several limitations. The need for an ASR model limits the applicability to low-resource languages and specialized domains with no annotated speech data. Furthermore, the ASR model propagates its errors to the retriever. In this work, we try to alleviate these limitations by proposing an ASR-free, end-to-end trained multimodal dense retriever that can work directly on spoken questions. Our experimental results show that, on shorter questions, our retriever is a promising alternative to the \textit{ASR and Retriever} pipeline, achieving better retrieval performance in cases where ASR would have mistranscribed important words in the question or produced a transcription with a high word error rate.
Sourav Verma
Large Language Models (LLMs) showcase remarkable abilities, yet they struggle
with limitations such as hallucinations, outdated knowledge, opacity, and
inexplicable reasoning. To address these challenges, Retrieval-Augmented
Generation (RAG) has proven to be a viable solution, leveraging external
databases to improve the consistency and coherence of generated content. It is
especially valuable for complex, knowledge-rich tasks, and facilitates
continuous improvement by leveraging domain-specific insights. By combining the
intrinsic knowledge of LLMs with the vast, dynamic repositories of external
databases, RAG achieves a synergistic effect. However, RAG is not without its
limitations, including a limited context window, irrelevant information, and
the high processing overhead for extensive contextual data. In this
comprehensive work, we explore the evolution of Contextual Compression
paradigms, providing an in-depth examination of the field. Finally, we outline
the current challenges and suggest potential research and development
directions, paving the way for future advancements in this area.
Authors' comments: Ongoing Work
Md Nakhla Rafi, Dong Jae Kim, Tse-Hsun Chen, Shaowei Wang
Identifying and resolving software faults remains a challenging and resource-intensive process. Traditional fault localization techniques, such as Spectrum-Based Fault Localization (SBFL), leverage statistical analysis of test coverage but often suffer from limited accuracy. While learning-based approaches improve fault localization, they demand extensive training datasets and high computational resources. Recent advances in Large Language Models (LLMs) offer new opportunities by enhancing code understanding and reasoning. However, existing LLM-based fault localization techniques face significant challenges, including token limitations, performance degradation with long inputs, and scalability issues in complex software systems. To overcome these obstacles, we propose LLM4FL, a multi-agent fault localization framework that utilizes three specialized LLM agents. First, the Context Extraction Agent applies an order-sensitive segmentation strategy to partition large coverage data within the LLM's token limit, analyze failure context, and prioritize failure-related methods. The Debugger Agent then processes the extracted data, employing graph-based retrieval-augmented code navigation to reason about failure causes and rank suspicious methods. Finally, the Reviewer Agent re-evaluates the identified faulty methods using verbal reinforcement learning, engaging in self-criticism and iterative refinement. Evaluated on the Defects4J (V2.0.0) benchmark, which includes 675 faults from 14 Java projects, LLM4FL achieves an 18.55% improvement in Top-1 accuracy over AutoFL and 4.82% over SoapFL. It outperforms supervised techniques such as DeepFL and Grace, all without requiring task-specific training. Furthermore, its coverage segmentation and prompt chaining strategies enhance performance, increasing Top-1 accuracy by up to 22%.
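As background for the SBFL baseline mentioned in this abstract, the sketch below ranks methods by suspiciousness from test coverage. It uses the Ochiai formula, one widely used SBFL metric; the abstract does not say which formula is used in the paper, so this choice is an assumption, and the method and test names in the usage example are invented.

```python
import math


def ochiai_ranking(coverage, passed):
    """Rank methods by Ochiai suspiciousness.

    coverage: {test name: set of methods the test executes}
    passed:   {test name: True if the test passed, False if it failed}
    """
    total_failed = sum(1 for ok in passed.values() if not ok)
    methods = set().union(*coverage.values())
    scores = {}
    for m in methods:
        # ef / ep: failing / passing tests that execute method m.
        ef = sum(1 for t, cov in coverage.items() if m in cov and not passed[t])
        ep = sum(1 for t, cov in coverage.items() if m in cov and passed[t])
        denom = math.sqrt(total_failed * (ef + ep))
        scores[m] = ef / denom if denom else 0.0
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical coverage: only the failing test t1 executes `eval`.
coverage = {"t1": {"parse", "eval"}, "t2": {"parse"}, "t3": {"parse", "fmt"}}
passed = {"t1": False, "t2": True, "t3": True}
ranking = ochiai_ranking(coverage, passed)
```

In this toy case `eval` ranks first because it is covered only by the failing test, while `parse` is diluted by two passing tests. Rankings like this depend purely on coverage statistics, which is the accuracy limitation the abstract attributes to SBFL.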
Jiliang Li, Yifan Zhang, Yu Huang, Kevin Leach
Recent growth and proliferation of malware have tested practitioners' ability to promptly classify new samples according to malware families. In contrast to labor-intensive reverse engineering efforts, machine learning approaches have demonstrated increased speed and accuracy. However, most existing deep-learning malware family classifiers must be calibrated using a large number of samples that are painstakingly manually analyzed before training. Furthermore, as novel malware samples arise that are beyond the scope of the training set, additional reverse engineering effort must be employed to update the training set. The sheer volume of new samples found in the wild creates substantial pressure on practitioners' ability to reverse engineer enough malware to adequately train modern classifiers. In this paper, we present MalMixer, a malware family classifier using semi-supervised learning that achieves high accuracy with sparse training data. We present a domain-knowledge-aware data augmentation technique for malware feature representations, enhancing few-shot performance of semi-supervised malware family classification. We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings. Our research confirms the feasibility and effectiveness of lightweight, domain-knowledge-aware data augmentation methods for malware features and shows the capabilities of similar semi-supervised classifiers in addressing malware classification issues.
Bryan Zhang, Taichi Nakatani, Stephan Walter
E-commerce stores enable multilingual product discovery, which requires
accurate product title translation. Multilingual large language models (LLMs)
have shown promising capacity to perform machine translation tasks, and they can
also enhance and translate product titles cross-lingually in one step. However,
product title translation often requires more than just language conversion
because titles are short, lack context, and contain specialized terminology.
This study proposes a retrieval-augmented generation (RAG) approach that
leverages existing bilingual product information in e-commerce by retrieving
similar bilingual examples and incorporating them as few-shot prompts to
enhance LLM-based product title translation. Experimental results show that our
proposed RAG approach improves product title translation quality with chrF score
gains of up to 15.3% for language pairs where the LLM has limited proficiency.
Authors' comments: 6 pages, in Proceedings of ACM CIKM Workshop on Data-Centric AI (CIKM
DCAI 2024)
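The retrieval step described in this abstract (find similar bilingual titles, then use them as few-shot examples) can be sketched as below. The similarity measure (token-level Jaccard), the prompt wording, and the example titles are all assumptions made for illustration; the paper's actual retriever and prompt template are not given in the abstract.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve_examples(title, bilingual_pairs, k=2):
    """Return the k (source, target) pairs whose source title is most similar to title."""
    q = tokens(title)

    def score(pair):
        t = tokens(pair[0])
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(bilingual_pairs, key=score, reverse=True)[:k]


def build_prompt(title, examples, src="English", tgt="German"):
    """Assemble a few-shot translation prompt from retrieved bilingual examples."""
    parts = [f"Translate the product title from {src} to {tgt}."]
    for s, t in examples:
        parts.append(f"{src}: {s}\n{tgt}: {t}")
    parts.append(f"{src}: {title}\n{tgt}:")
    return "\n\n".join(parts)


# Made-up catalog entries, used only to exercise the functions.
pairs = [
    ("Stainless steel water bottle 500 ml", "Edelstahl Trinkflasche 500 ml"),
    ("Wireless gaming mouse", "Kabellose Gaming-Maus"),
    ("Insulated steel water bottle 1 l", "Isolierte Edelstahl Trinkflasche 1 l"),
]
title = "Steel water bottle 750 ml"
prompt = build_prompt(title, retrieve_examples(title, pairs, k=2))
```

The resulting prompt would be sent to the LLM; the retrieved pairs supply the domain terminology that short, context-poor titles otherwise lack.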
Qi Fan, Hongyu Yuan, Haolin Zuo, Rui Liu, Guanglai Gao
Multimodal emotion recognition (MER) utilizes complete multimodal information and
robust multimodal joint representation to gain high performance. However, the
ideal condition of full modality integrity often does not hold in practice,
and some modalities are frequently missing. For
example, video, audio, or text data may be missing due to sensor failure or network
bandwidth problems, which presents a great challenge to MER research.
Traditional methods extract useful information from the complete modalities and
reconstruct the missing modalities to learn robust multimodal joint
representation. These methods have laid a solid foundation for research in this
field, and to a certain extent, alleviated the difficulty of multimodal emotion
recognition under missing modalities. However, relying solely on internal
reconstruction and multimodal joint learning has its limitations, especially
when the missing information is critical for emotion recognition. To address
this challenge, we propose a novel framework of Retrieval Augment for Missing
Modality Multimodal Emotion Recognition (RAMER), which introduces similar
multimodal emotion data to enhance the performance of emotion recognition under
missing modalities. By leveraging databases that contain related multimodal
emotion data, we can retrieve similar multimodal emotion information to fill in
the gaps left by missing modalities. Various experimental results demonstrate
that our framework is superior to existing state-of-the-art approaches in
missing modality MER tasks. Our whole project is publicly available on
https://github.com/WooyoohL/Retrieval_Augment_MER.
Authors' comments: Under review
Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui
Large Language Models (LLMs) have demonstrated significant performance
improvements across various cognitive tasks. An emerging application is using
LLMs to enhance retrieval-augmented generation (RAG) capabilities. These
systems require LLMs to understand user queries, retrieve relevant information,
and synthesize coherent and accurate responses. Given the increasing real-world
deployment of such systems, comprehensive evaluation becomes crucial. To this
end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set),
a high-quality evaluation dataset designed to test LLMs' ability to provide
factual responses, assess retrieval capabilities, and evaluate the reasoning
required to generate final answers. While previous work has provided datasets
and benchmarks to evaluate these abilities in isolation, FRAMES offers a
unified framework that provides a clearer picture of LLM performance in
end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions
that require the integration of information from multiple sources. We present
baseline results demonstrating that even state-of-the-art LLMs struggle with
this task, achieving 0.40 accuracy with no retrieval. The accuracy is
significantly improved with our proposed multi-step retrieval pipeline,
achieving an accuracy of 0.66 (>50% improvement). We hope our work will help
bridge evaluation gaps and assist in developing more robust and capable RAG
systems.
Authors' comments: Annual Conference of the Nations of the Americas Chapter of the
Association for Computational Linguistics (NAACL), 2025
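The ">50% improvement" quoted in this abstract is a relative gain over the no-retrieval baseline, not a difference in percentage points. A short check using only the two accuracies stated above:

```python
def relative_improvement(baseline, improved):
    """Relative gain of improved over baseline, as a fraction of baseline."""
    return (improved - baseline) / baseline


# Accuracies reported in the abstract: 0.40 without retrieval, 0.66 with the pipeline.
gain = relative_improvement(0.40, 0.66)
print(f"{gain:.0%}")  # prints 65%
```

The absolute gain is 0.26 (26 percentage points), which corresponds to a relative gain of 65% and is consistent with the ">50%" statement.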
Tzu-Lin Kuo, Feng-Ting Liao, Mu-Wei Hsieh, Fu-Chieh Chang, Po-Chun Hsu, Da-Shan Shiu
In real-world applications with Large Language Models (LLMs), external retrieval mechanisms - such as Search-Augmented Generation (SAG), tool utilization, and Retrieval-Augmented Generation (RAG) - are often employed to enhance the quality of augmented generations in dialogues. These approaches often come with multi-turn dialogue, where each interaction is enriched by relevant information retrieved from external sources. Existing benchmarks either assess LLMs' chat abilities in multi-turn dialogues or their use of retrieval for augmented responses in single-turn settings. However, there is a gap in evaluating LLMs' ability to leverage retrieval for more precise responses across multiple turns. To address this limitation, we introduce RAD-Bench (Retrieval Augmented Dialogue), a benchmark designed to evaluate LLMs' capabilities in multi-turn dialogues following retrievals, essential for their deployment in context-rich applications. RAD-Bench evaluates two key abilities of LLMs: Retrieval Synthesis and Retrieval Reasoning. These are measured using discriminative questions and retrieved contexts, and corresponding reference answers, assessing how effectively LLMs integrate and reason with context to maintain and enhance conversation quality over multiple turns. Our evaluation results on commonly used LLMs reveal that model performance deteriorates as additional layers of conditions or constraints are applied across conversation turns, even when accurate retrieved contexts are provided. The data and code are available at https://github.com/mtkresearch/RAD-Bench
Jicheng Wang, Yifeng He, Hao Chen
In real-world software engineering tasks, solving a problem often requires
understanding and modifying multiple functions, classes, and files across a
large codebase. Therefore, on the repository level, it is crucial to extract
the relevant information to achieve accurate code completion effectively.
Existing code completion tools have achieved some success, but they struggle to
optimize the retrieval and generation process dynamically. In this paper, we
propose RepoGenReflex, a generic, dynamic, effective framework to address this
challenge. By leveraging Retrieval-Augmented Generation (RAG) enhanced with
Verbal Reinforcement Learning (VRL), it can dynamically choose the optimal
results for repository-level code completion.
Authors' comments: Under review at AAAI 2025