Ching Nam Hang, Pei-Duo Yu, Chee Wei Tan
In the age of social media, the rapid spread of misinformation and rumors has led to the emergence of infodemics, where false information poses a significant threat to society. To combat this issue, we introduce TrumorGPT , a novel generative artificial intelligence solution designed for fact-checking in the health domain. TrumorGPT aims to distinguish "trumors", which are health-related rumors that turn out to be true, providing a crucial tool in differentiating between mere speculation and verified facts. This framework leverages a large language model (LLM) with few-shot learning for semantic health knowledge graph construction and semantic reasoning. TrumorGPT incorporates graph-based retrieval-augmented generation (GraphRAG) to address the hallucination issue common in LLMs and the limitations of static training data. GraphRAG involves accessing and utilizing information from regularly updated semantic health knowledge graphs that consist of the latest medical news and health information, ensuring that fact-checking by TrumorGPT is based on the most recent data. Evaluating with extensive healthcare datasets, TrumorGPT demonstrates superior performance in fact-checking for public health claims. Its ability to effectively conduct fact-checking across various platforms marks a critical step forward in the fight against health-related misinformation, enhancing trust and accuracy in the digital information age.
Yangguang Shao, Xinjie Lin, Haozheng Luo, Chengshang Hou, Gang Xiong, Jiahao Yu, Junzheng Shi
Large language models (LLMs) have achieved remarkable success in various
domains, primarily due to their strong capabilities in reasoning and generating
human-like text. Despite their impressive performance, LLMs are susceptible to
hallucinations, which can lead to incorrect or misleading outputs. This is
primarily due to the lack of up-to-date knowledge or domain-specific
information. Retrieval-augmented generation (RAG) is a promising approach to
mitigate hallucinations by leveraging external knowledge sources. However, the
security of RAG systems has not been thoroughly studied. In this paper, we
study a poisoning attack on RAG systems named POISONCRAFT, which can mislead
the model to refer to fraudulent websites. Compared to existing poisoning
attacks on RAG systems, our attack is more practical as it does not require
access to the target user query's info or edit the user query. It not only
ensures that injected texts can be retrieved by the model, but also ensures
that the LLM will be misled to refer to the injected texts in its response. We
demonstrate the effectiveness of POISONCRAFTacross different datasets,
retrievers, and language models in RAG pipelines, and show that it remains
effective when transferred across retrievers, including black-box systems.
Moreover, we present a case study revealing how the attack influences both the
retrieval behavior and the step-by-step reasoning trace within the generation
model, and further evaluate the robustness of POISONCRAFTunder multiple defense
mechanisms. These results validate the practicality of our threat model and
highlight a critical security risk for RAG systems deployed in real-world
applications. We release our
code\footnote{https://github.com/AndyShaw01/PoisonCraft} to support future
research on the security and robustness of RAG systems in real-world settings.
Authors' comments: 12 pages, 7 tables and 3 figures
Shuai Wang, Ivona Najdenkoska, Hongyi Zhu, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
Understanding visual art requires reasoning across multiple perspectives -- cultural, historical, and stylistic -- beyond mere object recognition. While recent multimodal large language models (MLLMs) perform well on general image captioning, they often fail to capture the nuanced interpretations that fine art demands. We propose ArtRAG, a novel, training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, movements, themes, and historical events into a rich, interpretable graph. At inference time, a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation. This enables MLLMs to produce contextually grounded, culturally informed art descriptions. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines. Human evaluations further confirm that ArtRAG generates coherent, insightful, and culturally enriched interpretations.
Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio Di Sciascio
We present a systematic study of provider-side data poisoning in retrieval-augmented recommender systems (RAG-based). By modifying only a small fraction of tokens within item descriptions -- for instance, adding emotional keywords or borrowing phrases from semantically related items -- an attacker can significantly promote or demote targeted items. We formalize these attacks under token-edit and semantic-similarity constraints, and we examine their effectiveness in both promotion (long-tail items) and demotion (short-head items) scenarios. Our experiments on MovieLens, using two large language model (LLM) retrieval modules, show that even subtle attacks shift final rankings and item exposures while eluding naive detection. The results underscore the vulnerability of RAG-based pipelines to small-scale metadata rewrites and emphasize the need for robust textual consistency checks and provenance tracking to thwart stealthy provider-side poisoning.
Yingyi Zhang, Pengyue Jia, Xianneng Li, Derong Xu, Maolin Wang, Yichao Wang, Zhaocheng Du, Huifeng Guo et al.
Cloud-device collaboration leverages on-cloud Large Language Models (LLMs) for handling public user queries and on-device Small Language Models (SLMs) for processing private user data, collectively forming a powerful and privacy-preserving solution. However, existing approaches often fail to fully leverage the scalable problem-solving capabilities of on-cloud LLMs while underutilizing the advantage of on-device SLMs in accessing and processing personalized data. This leads to two interconnected issues: 1) Limited utilization of the problem-solving capabilities of on-cloud LLMs, which fail to align with personalized user-task needs, and 2) Inadequate integration of user data into on-device SLM responses, resulting in mismatches in contextual user information. In this paper, we propose a Leader-Subordinate Retrieval framework for Privacy-preserving cloud-device collaboration (LSRP), a novel solution that bridges these gaps by: 1) enhancing on-cloud LLM guidance to on-device SLM through a dynamic selection of task-specific leader strategies named as user-to-user retrieval-augmented generation (U-U-RAG), and 2) integrating the data advantages of on-device SLMs through small model feedback Direct Preference Optimization (SMFB-DPO) for aligning the on-cloud LLM with the on-device SLM. Experiments on two datasets demonstrate that LSRP consistently outperforms state-of-the-art baselines, significantly improving question-answer relevance and personalization, while preserving user privacy through efficient on-device retrieval. Our code is available at: https://github.com/Zhang-Yingyi/LSRP.
Ramteja Sajja, Yusuf Sermet, Ibrahim Demir
Recent advances in AI have catalyzed the adoption of intelligent educational
tools, yet many semantic retrieval systems remain ill-suited to the unique
linguistic and structural characteristics of academic content. This study
presents two open-source embedding models fine-tuned for educational question
answering, particularly in the context of course syllabi. A synthetic dataset
of 3,197 sentence pairs, spanning synonymous terminology, paraphrased
questions, and implicit-explicit mappings, was constructed through a
combination of manual curation and large language model (LLM)-assisted
generation. Two training strategies were evaluated: (1) a baseline model
fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model
that combines MNRL with CosineSimilarityLoss to improve both semantic ranking
and similarity calibration. Evaluations were conducted on 28 university course
syllabi using a fixed set of natural language questions categorized into
course, faculty, and teaching assistant information. Results demonstrate that
both fine-tuned models outperform strong open-source baselines, including
all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model
narrows the performance gap with high-performing proprietary embeddings such as
OpenAI's text-embedding-3 series. This work contributes reusable,
domain-aligned embedding models and provides a replicable framework for
educational semantic retrieval, supporting downstream applications such as
academic chatbots, retrieval-augmented generation (RAG) systems, and learning
management system (LMS) integrations.
Authors' comments: 17 pages, 3 Tables
Mingruo Yuan, Ben Kao, Tien-Hsuan Wu
Retrieval of legal knowledge by the general public is a challenging problem due to the technicality of the professional knowledge and the lack of fundamental understanding by laypersons on the subject. Traditional information retrieval techniques assume that users are capable of formulating succinct and precise queries for effective document retrieval. In practice, however, the wide gap between the highly technical contents and untrained users makes legal knowledge retrieval very difficult. We propose a methodology, called QBR, which employs a Questions Bank (QB) as an effective medium for bridging the knowledge gap. We show how the QB is used to derive training samples to enhance the embedding of knowledge units within documents, which leads to effective fine-grained knowledge retrieval. We discuss and evaluate through experiments various advantages of QBR over traditional methods. These include more accurate, efficient, and explainable document retrieval, better comprehension of retrieval results, and highly effective fine-grained knowledge retrieval. We also present some case studies and show that QBR achieves social impact by assisting citizens to resolve everyday legal concerns.
Alexander Most, Joseph Winjum, Ayan Biswas, Shawn Jones, Nishath Rajiv Ranasinghe, Dan O'Malley, Manish Bhattarai
Retrieval-Augmented Generation (RAG) has become a popular technique for enhancing the reliability and utility of Large Language Models (LLMs) by grounding responses in external documents. Traditional RAG systems rely on Optical Character Recognition (OCR) to first process scanned documents into text. However, even state-of-the-art OCRs can introduce errors, especially in degraded or complex documents. Recent vision-language approaches, such as ColPali, propose direct visual embedding of documents, eliminating the need for OCR. This study presents a systematic comparison between a vision-based RAG system (ColPali) and more traditional OCR-based pipelines utilizing Llama 3.2 (90B) and Nougat OCR across varying document qualities. Beyond conventional retrieval accuracy metrics, we introduce a semantic answer evaluation benchmark to assess end-to-end question-answering performance. Our findings indicate that while vision-based RAG performs well on documents it has been fine-tuned on, OCR-based RAG is better able to generalize to unseen documents of varying quality. We highlight the key trade-offs between computational efficiency and semantic accuracy, offering practical guidance for RAG practitioners in selecting between OCR-dependent and vision-based document retrieval systems in production environments.
Giulio Cesare Mastrocinque Santo, Patrícia Izar, Irene Delval, Victor de Napole Gregolin, Nina S. T. Hirata
Video recordings of nonhuman primates in their natural habitat are a common source for studying their behavior in the wild. We fine-tune pre-trained video-text foundational models for the specific domain of capuchin monkeys, with the goal of developing useful computational models to help researchers to retrieve useful clips from videos. We focus on the challenging problem of training a model based solely on raw, unlabeled video footage, using weak audio descriptions sometimes provided by field collaborators. We leverage recent advances in Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs) to address the extremely noisy nature of both video and audio content. Specifically, we propose a two-folded approach: an agentic data treatment pipeline and a fine-tuning process. The data processing pipeline automatically extracts clean and semantically aligned video-text pairs from the raw videos, which are subsequently used to fine-tune a pre-trained Microsoft's X-CLIP model through Low-Rank Adaptation (LoRA). We obtained an uplift in $Hits@5$ of $167\%$ for the 16 frames model and an uplift of $114\%$ for the 8 frame model on our domain data. Moreover, based on $NDCG@K$ results, our model is able to rank well most of the considered behaviors, while the tested raw pre-trained models are not able to rank them at all. The code will be made available upon acceptance.
Komal Gilani, Marlo Verket, Christof Peters, Michel Dumontier, Hans-Peter Brunner-La Rocca, Visara Urovi
The standardization of clinical data elements (CDEs) aims to ensure
consistent and comprehensive patient information across various healthcare
systems. Existing methods often falter when standardizing CDEs of varying
representation and complex structure, impeding data integration and
interoperability in clinical research. We introduce CDE-Mapper, an innovative
framework that leverages Retrieval-Augmented Generation approach combined with
Large Language Models to automate the linking of CDEs to controlled
vocabularies. Our modular approach features query decomposition to manage
varying levels of CDEs complexity, integrates expert-defined rules within
prompt engineering, and employs in-context learning alongside multiple
retriever components to resolve terminological ambiguities. In addition, we
propose a knowledge reservoir validated by a human-in-loop approach, achieving
accurate concept linking for future applications while minimizing computational
costs. For four diverse datasets, CDE-Mapper achieved an average of 7.2\%
higher accuracy improvement compared to baseline methods. This work highlights
the potential of advanced language models in improving data harmonization and
significantly advancing capabilities in clinical decision support systems and
research.
Authors' comments: 25 pages (one column), 7 figures
Naphat Nithisopa, Teerapong Panboonyuen
Text recognition in natural images remains a challenging yet essential task, with broad applications spanning computer vision and natural language processing. This paper introduces a novel end-to-end framework that combines ResNet and Vision Transformer backbones with advanced methodologies, including Deformable Convolutions, Retrieval-Augmented Generation, and Conditional Random Fields (CRF). These innovations collectively enhance feature representation and improve Optical Character Recognition (OCR) performance. Specifically, the framework substitutes standard convolution layers in the third and fourth blocks with Deformable Convolutions, leverages adaptive dropout for regularization, and incorporates CRF for more refined sequence modeling. Extensive experiments conducted on six benchmark datasets IC13, IC15, SVT, IIIT5K, SVTP, and CUTE80 validate the proposed method's efficacy, achieving notable accuracies: 97.32% on IC13, 58.26% on IC15, 88.10% on SVT, 74.13% on IIIT5K, 82.17% on SVTP, and 66.67% on CUTE80, resulting in an average accuracy of 77.77%. These results establish a new state-of-the-art for text recognition, demonstrating the robustness of the approach across diverse and challenging datasets.
Mohammad Aqib, Mohd Hamza, Qipei Mei, Ying Hei Chui
Building codes are regulations that establish standards for the design, construction, and safety of buildings to ensure structural integrity, fire protection, and accessibility. They are often extensive, complex, and subject to frequent updates, making manual querying challenging and time-consuming. Key difficulties include navigating large volumes of text, interpreting technical language, and identifying relevant clauses across different sections. A potential solution is to build a Question-Answering (QA) system that answers user queries based on building codes. Among the various methods for building a QA system, Retrieval-Augmented Generation (RAG) stands out in performance. RAG consists of two components: a retriever and a language model. This study focuses on identifying a suitable retriever method for building codes and optimizing the generational capability of the language model using fine-tuning techniques. We conducted a detailed evaluation of various retrieval methods by performing the retrieval on the National Building Code of Canada (NBCC) and explored the impact of domain-specific fine-tuning on several language models using the dataset derived from NBCC. Our analysis included a comparative assessment of different retrievers and the performance of both pre-trained and fine-tuned models to determine the efficacy and domain-specific adaptation of language models using fine-tuning on the NBCC dataset. Experimental results showed that Elasticsearch proved to be the most robust retriever among all. The findings also indicate that fine-tuning language models on an NBCC-specific dataset can enhance their ability to generate contextually relevant responses. When combined with context retrieved by a powerful retriever like Elasticsearch, this improvement in LLM performance can optimize the RAG system, enabling it to better navigate the complexities of the NBCC.
Matan Orbach, Ohad Eytan, Benjamin Sznajder, Ariel Gera, Odellia Boni, Yoav Kantor, Gal Bloch, Omri Levy et al.
Finding the optimal Retrieval-Augmented Generation (RAG) configuration for a given use case can be complex and expensive. Motivated by this challenge, frameworks for RAG hyper-parameter optimization (HPO) have recently emerged, yet their effectiveness has not been rigorously benchmarked. To address this gap, we present a comprehensive study involving 5 HPO algorithms over 5 datasets from diverse domains, including a new one collected for this work on real-world product documentation. Our study explores the largest HPO search space considered to date, with two optimized evaluation metrics. Analysis of the results shows that RAG HPO can be done efficiently, either greedily or with iterative random search, and that it significantly boosts RAG performance for all datasets. For greedy HPO approaches, we show that optimizing models first is preferable to the prevalent practice of optimizing sequentially according to the RAG pipeline order.
Mohammad Shoaib Ansari, Mohd Sohail Ali Khan, Shubham Revankar, Aditya Varma, Anil S. Mokhade
This research paper investigates the application of Large Language Models
(LLMs) in healthcare, specifically focusing on enhancing medical decision
support through Retrieval-Augmented Generation (RAG) integrated with
hospital-specific data and fine-tuning using Quantized Low-Rank Adaptation
(QLoRA). The system utilizes Llama 3.2-3B-Instruct as its foundation model. By
embedding and retrieving context-relevant healthcare information, the system
significantly improves response accuracy. QLoRA facilitates notable parameter
efficiency and memory optimization, preserving the integrity of medical
information through specialized quantization techniques. Our research also
shows that our model performs relatively well on various medical benchmarks,
indicating that it can be used to make basic medical suggestions. This paper
details the system's technical components, including its architecture,
quantization methods, and key healthcare applications such as enhanced disease
prediction from patient symptoms and medical history, treatment suggestions,
and efficient summarization of complex medical reports. We touch on the ethical
considerations-patient privacy, data security, and the need for rigorous
clinical validation-as well as the practical challenges of integrating such
systems into real-world healthcare workflows. Furthermore, the lightweight
quantized weights ensure scalability and ease of deployment even in
low-resource hospital environments. Finally, the paper concludes with an
analysis of the broader impact of LLMs on healthcare and outlines future
directions for LLMs in medical settings.
Authors' comments: 12 pages
Tiantian Gan, Qiyao Sun
Large language models (LLMs) struggle to effectively utilize a growing number of external tools, such as those defined by the Model Context Protocol (MCP)\cite{IntroducingMCP}, due to prompt bloat and selection complexity. We introduce RAG-MCP, a Retrieval-Augmented Generation framework that overcomes this challenge by offloading tool discovery. RAG-MCP uses semantic retrieval to identify the most relevant MCP(s) for a given query from an external index before engaging the LLM. Only the selected tool descriptions are passed to the model, drastically reducing prompt size and simplifying decision-making. Experiments, including an MCP stress test, demonstrate RAG-MCP significantly cuts prompt tokens (e.g., by over 50%) and more than triples tool selection accuracy (43.13% vs 13.62% baseline) on benchmark tasks. RAG-MCP enables scalable and accurate tool integration for LLMs.
Yu-Ren Guo, Wen-Kai Tai
Large Language Models (LLMs) have demonstrated remarkable capabilities in
natural language processing (NLP) and multimodal learning, with successful
applications in text generation and speech synthesis, enabling a deeper
understanding and generation of multimodal content. In the field of sound
effects (SFX) generation, LLMs have been leveraged to orchestrate multiple
models for audio synthesis. However, due to the scarcity of annotated datasets,
and the complexity of temproal modeling. current SFX generation techniques
still fall short in achieving high-fidelity audio. To address these
limitations, this paper introduces a novel framework that integrates LLMs with
existing sound effect databases, allowing for the retrieval, recombination, and
synthesis of audio based on user requirements. By leveraging this approach, we
enhance the diversity and quality of generated sound effects while eliminating
the need for additional recording costs, offering a flexible and efficient
solution for sound design and application.
Authors' comments: 8 pages, 5 figures
Andrew D. McRae
We study a nonconvex algorithmic approach to phase retrieval and the more general problem of semidefinite low-rank matrix sensing. Specifically, we analyze the nonconvex landscape of a quartic Burer-Monteiro factorized least-squares optimization problem. We develop a new analysis framework, taking advantage of the semidefinite problem structure, to understand the properties of second-order critical points -- specifically, whether they (approximately) recover the ground truth matrix. We show that it can be helpful to overparametrize the problem, that is, to optimize over matrices of higher rank than the ground truth. We then apply this framework to several example problems: in addition to recovering existing state-of-the-art phase retrieval landscape guarantees (without overparametrization), we show that overparametrizing by a factor at most logarithmic in the dimension allows recovery with optimal statistical sample complexity for the well-known problems of (1) phase retrieval with sub-Gaussian measurements and (2) more general semidefinite matrix sensing with rank-1 Gaussian measurements. More generally, our analysis (optionally) uses the popular method of convex dual certificates, suggesting that our analysis could be applied to a much wider class of problems.
Xueguang Ma, Luyu Gao, Shengyao Zhuang, Jiaqi Samantha Zhan, Jamie Callan, Jimmy Lin
Recent advancements in large language models (LLMs) have driven interest in
billion-scale retrieval models with strong generalization across retrieval
tasks and languages. Additionally, progress in large vision-language models has
created new opportunities for multimodal retrieval. In response, we have
updated the Tevatron toolkit, introducing a unified pipeline that enables
researchers to explore retriever models at different scales, across multiple
languages, and with various modalities. This demo paper highlights the
toolkit's key features, bridging academia and industry by supporting efficient
training, inference, and evaluation of neural retrievers. We showcase a unified
dense retriever achieving strong multilingual and multimodal effectiveness, and
conduct a cross-modality zero-shot study to demonstrate its research potential.
Alongside, we release OmniEmbed, to the best of our knowledge, the first
embedding model that unifies text, image document, video, and audio retrieval,
serving as a baseline for future research.
Authors' comments: Accepted in SIGIR 2025 (Demo)
Zhichuan Wang, Yang Zhou, Jinhai Xiang, Yulong Wang, Xinwei He
Learning discriminative 3D representations that generalize well to unknown
testing categories is an emerging requirement for many real-world 3D
applications. Existing well-established methods often struggle to attain this
goal due to insufficient 3D training data from broader concepts. Meanwhile,
pre-trained large vision-language models (e.g., CLIP) have shown remarkable
zero-shot generalization capabilities. Yet, they are limited in extracting
suitable 3D representations due to substantial gaps between their 2D training
and 3D testing distributions. To address these challenges, we propose
Testing-time Distribution Alignment (TeDA), a novel framework that adapts a
pretrained 2D vision-language model CLIP for unknown 3D object retrieval at
test time. To our knowledge, it is the first work that studies the test-time
adaptation of a vision-language model for 3D feature learning. TeDA projects 3D
objects into multi-view images, extracts features using CLIP, and refines 3D
query embeddings with an iterative optimization strategy by confident
query-target sample pairs in a self-boosting manner. Additionally, TeDA
integrates textual descriptions generated by a multimodal language model
(InternVL) to enhance 3D object understanding, leveraging CLIP's aligned
feature space to fuse visual and textual cues. Extensive experiments on four
open-set 3D object retrieval benchmarks demonstrate that TeDA greatly
outperforms state-of-the-art methods, even those requiring extensive training.
We also experimented with depth maps on Objaverse-LVIS, further validating its
effectiveness. Code is available at https://github.com/wangzhichuan123/TeDA.
Authors' comments: Accepted by ICMR 2025
Justin Ho, Alexandra Colby, William Fisher
This paper presents a domain-specific implementation of Retrieval-Augmented
Generation (RAG) tailored to the Fair Use Doctrine in U.S. copyright law.
Motivated by the increasing prevalence of DMCA takedowns and the lack of
accessible legal support for content creators, we propose a structured approach
that combines semantic search with legal knowledge graphs and court citation
networks to improve retrieval quality and reasoning reliability. Our prototype
models legal precedents at the statutory factor level (e.g., purpose, nature,
amount, market effect) and incorporates citation-weighted graph representations
to prioritize doctrinally authoritative sources. We use Chain-of-Thought
reasoning and interleaved retrieval steps to better emulate legal reasoning.
Preliminary testing suggests this method improves doctrinal relevance in the
retrieval process, laying groundwork for future evaluation and deployment of
LLM-based legal assistance tools.
Authors' comments: Submitted to the 7th Workshop on Automated Semantic Analysis of
Information in Legal Text. 8 pages, 5 Figures