Yutong Song, Chenhan Lyu, Pengfei Zhang, Sabine Brunswicker, Nikil Dutt, Amir Rahmani
Mild-stage dementia patients primarily experience two critical symptoms:
severe memory loss and emotional instability. To address these challenges, we
propose DEMENTIA-PLAN, an innovative retrieval-augmented generation framework
that leverages large language models to enhance conversational support. Our
model employs a multiple knowledge graph architecture, integrating various
dimensional knowledge representations including daily routine graphs and life
memory graphs. Through this multi-graph architecture, DEMENTIA-PLAN both
comprehensively addresses immediate care needs and facilitates deeper
emotional resonance through personal memories, helping stabilize patient mood
while providing reliable memory support. Our notable innovation is the
self-reflection planning agent, which systematically coordinates knowledge
retrieval and semantic integration across multiple knowledge graphs, while
scoring retrieved content from daily routine and life memory graphs to
dynamically adjust their retrieval weights for optimized response generation.
DEMENTIA-PLAN represents a significant advancement in the clinical application
of large language models for dementia care, bridging the gap between AI tools
and caregiver interventions.
Authors' comments: Accepted by AAAI 2025 Workshop on Knowledge Graphs for Personalized
Public Health
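The abstract says the self-reflection planning agent scores content retrieved from the daily routine and life memory graphs and dynamically adjusts each graph's retrieval weight. A minimal sketch of such a feedback loop follows, assuming one relevance score in [0, 1] per retrieved item and a moving-average weight update; the class, function names, update rule, and learning rate `lr` are all hypothetical and are not the authors' method.

```python
from dataclasses import dataclass

@dataclass
class Retrieved:
    graph: str    # source graph name, e.g. "routine" or "memory" (hypothetical labels)
    text: str
    score: float  # relevance score in [0, 1], assumed to come from a reflection step

def update_weights(weights, items, lr=0.5):
    """Move each graph's weight toward the mean score of the items it returned,
    then renormalize so the weights sum to 1 (hypothetical update rule)."""
    new = dict(weights)
    for graph, w in weights.items():
        scores = [it.score for it in items if it.graph == graph]
        if scores:
            new[graph] = (1 - lr) * w + lr * sum(scores) / len(scores)
    total = sum(new.values())
    return {g: w / total for g, w in new.items()}

def select_context(weights, items, k=2):
    """Rank retrieved items by score times their graph's weight and keep the top k."""
    ranked = sorted(items, key=lambda it: it.score * weights[it.graph], reverse=True)
    return [it.text for it in ranked[:k]]
```

A graph that keeps returning well-scored content gains weight over successive turns, so its items are favored when the response context is assembled.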
Juntao Jian, Xiuping Liu, Zixuan Chen, Manyi Li, Jian Liu, Ruizhen Hu
Recent advances in dexterous grasping synthesis have demonstrated significant
progress in producing reasonable and plausible grasps for many task purposes.
However, generalizing to unseen object categories and diverse task
instructions remains challenging. In this paper, we propose G-DexGrasp, a
retrieval-augmented generation approach that can produce high-quality dexterous
hand configurations for unseen object categories and language-based task
instructions. The key is to retrieve generalizable grasping priors, including
the fine-grained contact part and the affordance-related distribution of
relevant grasping instances, for the downstream synthesis pipeline.
Specifically, the fine-grained contact part and affordance act as generalizable
guidance to infer reasonable grasping configurations for unseen objects with a
generative model, while the relevant grasping distribution serves as
regularization to guarantee the plausibility of synthesized grasps during the
subsequent refinement optimization. Our comparison experiments validate the
effectiveness of our key designs for generalization and demonstrate remarkable
performance compared with existing approaches. Project page:
https://g-dexgrasp.github.io/
Authors' comments: 11 pages, 5 figures
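The abstract uses the retrieved grasp distribution as a regularizer during refinement optimization. A minimal sketch of that idea, under stated assumptions: a hand configuration is a flat parameter vector, the retrieved grasps are summarized by a diagonal Gaussian, and a toy quadratic pull toward a hypothetical `target` configuration stands in for the real contact/task loss. The weight `lam` and the step-size rule are illustrative choices, not G-DexGrasp's.

```python
def fit_diag_gaussian(samples):
    """Per-dimension mean and variance of retrieved grasp parameter vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    var = [max(sum((s[j] - mean[j]) ** 2 for s in samples) / n, 1e-6) for j in range(d)]
    return mean, var

def refine(init, target, mean, var, lam=1.0, steps=100):
    """Gradient descent on ||q - target||^2 + lam * sum((q - mean)^2 / var).
    The first term is a toy stand-in for a task loss; the second is the prior
    that keeps q near the plausible retrieved grasps."""
    q = list(init)
    for _ in range(steps):
        for j in range(len(q)):
            grad = 2 * (q[j] - target[j]) + 2 * lam * (q[j] - mean[j]) / var[j]
            # Half a Newton step per dimension: stable even for tiny variances.
            step = 0.25 / (1 + lam / var[j])
            q[j] -= step * grad
    return q
```

The refined configuration ends up between the task target and the retrieved mean, pulled harder toward the mean along dimensions where the retrieved grasps agree (small variance).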
Monan Zhou, Shenyang Xu, Zhaorui Liu, Zhaowen Wang, Feng Yu, Wei Li, Baoqiang Han
Data are crucial in various computer-related fields, including music
information retrieval (MIR), an interdisciplinary area bridging computer
science and music. This paper introduces CCMusic, an open and diverse database
comprising multiple datasets specifically designed for tasks related to Chinese
music, highlighting our focus on this culturally rich domain. The database
integrates both published and unpublished datasets, with steps taken such as
data cleaning, label refinement, and data structure unification to ensure data
consistency and create ready-to-use versions. We conduct benchmark evaluations
for all datasets using a unified evaluation framework developed specifically
for this purpose. This publicly available framework supports both
classification and detection tasks, ensuring standardized and reproducible
results across all datasets. The database is hosted on HuggingFace and
ModelScope, two open and multifunctional data and model hosting platforms,
ensuring ease of accessibility and usability.
Authors' comments: 17 pages, 18 figures
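A unified evaluation framework means the same metric code runs over every dataset. A minimal sketch for the classification case, assuming accuracy and macro-averaged F1 as the metrics (the abstract does not name CCMusic's metrics) and a hypothetical `predict` callable; the instrument labels in the usage are toy data.

```python
def evaluate(y_true, y_pred):
    """Accuracy and macro-averaged F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"accuracy": acc, "macro_f1": sum(f1s) / len(f1s)}

def benchmark(datasets, predict):
    """Apply one predictor and one metric routine to every dataset.
    datasets: name -> (inputs, labels)."""
    return {
        name: evaluate(labels, [predict(x) for x in inputs])
        for name, (inputs, labels) in datasets.items()
    }
```

For example, `evaluate(["erhu", "pipa", "erhu", "guzheng"], ["erhu", "erhu", "erhu", "guzheng"])` gives accuracy 0.75 and macro-F1 0.6 (per-class F1 of 0.8, 1.0, and 0.0).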
Tilahun Yeshambel, Moncef Garouani, Serge Molina, Josiane Mothe
This paper reports difficulties encountered and results obtained when using dense
retrievers on Amharic, a low-resource language spoken by 120 million
people. The efforts made and difficulties faced by Addis Ababa University
toward Amharic Information Retrieval will be discussed in detail during the presentation.
Authors' comments: 4 pages, 2 figures
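For readers new to dense retrieval, the core ranking step is simple: queries and passages are encoded as vectors and passages are ranked by similarity to the query. A minimal sketch, assuming cosine similarity and hand-made toy vectors in place of a trained (e.g. Amharic-capable) encoder; none of the names or numbers come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dense_search(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents whose embeddings are closest to the query."""
    scored = sorted(doc_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```

In practice the difficulty for a low-resource language lies mostly in producing good vectors (encoder coverage, tokenization, training data), not in this ranking step.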
Yu-An Liu, Haya Nachimovsky, Ruqing Zhang, Oren Kurland, Jiafeng Guo, Moshe Tennenholtz
With the advancement of information retrieval (IR) technologies, robustness
is increasingly attracting attention. When deploying technology into practice,
we consider not only its average performance under normal conditions but, more
importantly, its ability to maintain functionality across a variety of
exceptional situations. In recent years, research on IR robustness has covered
theory, evaluation, methodology, and application, all of which show a
growing trend. The purpose of this workshop is to systematize the latest
results of each research aspect, to foster comprehensive communication within
this niche domain while also bridging robust IR research with the broader
community, and to promote the future development of robust IR. To avoid the
one-way talks typical of mini-conferences, this workshop adopts a highly interactive
format, including round-table and panel discussion sessions, to encourage
active participation and meaningful exchange among attendees.
Authors' comments: Accepted by SIGIR 2025
Haoqiang Lin, Haokun Wen, Xuemeng Song, Meng Liu, Yupeng Hu, Liqiang Nie
Composed Image Retrieval (CIR) allows users to search for target images with a multimodal query, comprising a reference image and a modification text that describes the user's desired modification to the reference image. Nevertheless, due to the high labor cost of annotating training data, recent research has shifted to the challenging task of zero-shot CIR (ZS-CIR), which aims to perform CIR without annotated triplets. Pioneering ZS-CIR studies convert the CIR task into a standard text-to-image retrieval task by pre-training a textual inversion network that maps a given image into a single pseudo-word token. Despite their significant progress, their coarse-grained textual inversion may be insufficient to capture the full content of the image accurately. To overcome this issue, in this work, we propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. In particular, FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization. The former maps the image into a subject-oriented pseudo-word token and several attribute-oriented pseudo-word tokens to comprehensively express the image in textual form, while the latter jointly aligns the fine-grained pseudo-word tokens to the real-word token embedding space based on a BLIP-generated image caption template. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our proposed method.
Krisztian Balog, Donald Metzler, Zhen Qin
Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges' ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation.
Yuan Li, Jun Hu, Jiaxin Jiang, Zemin Liu, Bryan Hooi, Bingsheng He
Recent advances in graph learning have paved the way for innovative retrieval-augmented generation (RAG) systems that leverage the inherent relational structures in graph data. However, many existing approaches suffer from rigid, fixed settings and significant engineering overhead, limiting their adaptability and scalability. Additionally, the RAG community has largely overlooked the decades of research in the graph database community regarding the efficient retrieval of interesting substructures on large-scale graphs. In this work, we introduce the RAG-on-Graphs Library (RGL), a modular framework that integrates the complete RAG pipeline (from efficient graph indexing and dynamic node retrieval to subgraph construction, tokenization, and final generation) into a unified system. RGL addresses key challenges by supporting a variety of graph formats and integrating optimized implementations of essential components, achieving speedups of up to 143x compared to conventional methods. Moreover, its flexible utilities, such as dynamic node filtering, allow for rapid extraction of pertinent subgraphs while reducing token consumption. Our extensive evaluations demonstrate that RGL not only accelerates the prototyping process but also enhances the performance and applicability of graph-based RAG systems across a range of tasks.
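The pipeline described above goes from retrieved seed nodes to subgraph construction, with node filtering to cut token consumption. A minimal sketch of that middle step, assuming a dict-of-lists adjacency, breadth-first expansion up to `hops`, a node-filter predicate `keep`, and a node budget `max_nodes`; these names are hypothetical and this is not RGL's actual API.

```python
from collections import deque

def extract_subgraph(adj, seeds, hops=2, keep=lambda n: True, max_nodes=50):
    """Breadth-first expansion from retrieved seed nodes.
    adj: node -> list of neighbours; keep: node filter; max_nodes: budget
    that bounds how many nodes (and hence prompt tokens) are kept."""
    selected = []
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue and len(selected) < max_nodes:
        node, depth = queue.popleft()
        if not keep(node):
            continue  # filtered nodes are neither kept nor expanded
        selected.append(node)
        if depth < hops:
            for nb in adj.get(node, []):
                if nb not in seen:
                    seen.add(nb)
                    queue.append((nb, depth + 1))
    kept = set(selected)
    edges = [(u, v) for u in selected for v in adj.get(u, []) if v in kept]
    return selected, edges

def linearize(nodes, edges):
    """Serialize the subgraph into a compact text form for an LLM prompt."""
    return "nodes: " + ", ".join(nodes) + "\nedges: " + "; ".join(f"{u}->{v}" for u, v in edges)
```

Filtering before expansion is the design choice that matters for cost: a rejected node also prunes everything reachable only through it.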
Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Lutao Jiang, Haiwei Xue, Bin Ren, Danda Paudel, Nicu Sebe et al.
Retrieval-augmented generation (RAG) has emerged as a pivotal technique in
artificial intelligence (AI), particularly in enhancing the capabilities of
large language models (LLMs) by enabling access to external, reliable, and
up-to-date knowledge sources. In the context of AI-Generated Content (AIGC),
RAG has proven invaluable by augmenting model outputs with supplementary,
relevant information, thus improving their quality. Recently, the potential of
RAG has extended beyond natural language processing, with emerging methods
integrating retrieval-augmented strategies into the computer vision (CV)
domain. These approaches aim to address the limitations of relying solely on
internal model knowledge by incorporating authoritative external knowledge
bases, thereby improving both the understanding and generation capabilities of
vision models. This survey provides a comprehensive review of the current state
of retrieval-augmented techniques in CV, focusing on two main areas: (I) visual
understanding and (II) visual generation. In the realm of visual understanding,
we systematically review tasks ranging from basic image recognition to complex
applications such as medical report generation and multimodal question
answering. For visual content generation, we examine the application of RAG in
tasks related to image, video, and 3D generation. Furthermore, we explore
recent advancements in RAG for embodied AI, with a particular focus on
applications in planning, task execution, multimodal perception, interaction,
and specialized domains. Given that the integration of retrieval-augmented
techniques in CV is still in its early stages, we also highlight the key
limitations of current approaches and propose future research directions to
drive the development of this promising area.
Authors' comments: 19 pages, 10 figures
Justice Ou, Tinglin Huang, Yilun Zhao, Ziyang Yu, Peiqing Lu, Rex Ying
To improve the reliability of Large Language Models (LLMs) in clinical applications, retrieval-augmented generation (RAG) is extensively applied to provide factual medical knowledge. However, beyond general medical knowledge from open-ended datasets, clinical case-based knowledge is also critical for effective medical reasoning, as it provides context grounded in real-world patient experiences. Motivated by this, we propose the Experience Retrieval-Augmentation (ExpRAG) framework based on Electronic Health Records (EHR), aiming to offer relevant context from other patients' discharge reports. ExpRAG performs retrieval through a coarse-to-fine process, utilizing an EHR-based report ranker to efficiently identify similar patients, followed by an experience retriever to extract task-relevant content for enhanced medical reasoning. To evaluate ExpRAG, we introduce DischargeQA, a clinical QA dataset with 1,280 discharge-related questions across diagnosis, medication, and instruction tasks. Each question is generated using EHR data to ensure realistic and challenging scenarios. Experimental results demonstrate that ExpRAG consistently outperforms a text-based ranker, achieving an average relative improvement of 5.2%, highlighting the importance of case-based knowledge for medical reasoning.
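The coarse-to-fine retrieval described above can be sketched in two stages. This is a minimal illustration under loud assumptions: the coarse stage ranks patients by Jaccard similarity over hypothetical sets of structured EHR codes, and the fine stage keeps the report sentences with the most word overlap with the question. Both scoring functions are simple stand-ins, not ExpRAG's learned report ranker or experience retriever, and the patient records are toy data.

```python
def jaccard(a, b):
    """Set overlap |a & b| / |a | b| (0.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def coarse_rank(query_codes, patients, k=2):
    """Stage 1: the k patients whose EHR code sets best match the query patient."""
    return sorted(patients, key=lambda p: jaccard(query_codes, p["codes"]), reverse=True)[:k]

def fine_retrieve(question, candidates, n=2):
    """Stage 2: from the candidates' discharge reports, keep the n sentences
    that share the most words with the question."""
    q_words = set(question.lower().split())
    sentences = [s.strip() for p in candidates for s in p["report"].split(".") if s.strip()]
    sentences.sort(key=lambda s: len(q_words & set(s.lower().split())), reverse=True)
    return sentences[:n]
```

The coarse stage keeps the expensive sentence-level step confined to a handful of similar patients rather than the whole record store.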
Kaiwen Zuo, Jing Tang, Hanbing Qin, Binli Luo, Ligang He, Shiyan Tang
Recent advancements in Large Language Models (LLMs) have marked significant
progress in understanding and responding to medical inquiries. However, their
performance still falls short of the standards set by professional
consultations. This paper introduces a novel framework for medical
consultation, comprising two main modules: Terminology-Enhanced Information
Retrieval (TEIR) and Emotional In-Context Learning (EICL). TEIR ensures
implicit reasoning by utilizing inductive knowledge and key
terminology retrieval, overcoming the limitations of restricted domain
knowledge in public databases. Additionally, this module can process
long contexts. The EICL module aids in generating sentences with
high attribute relevance by memorizing semantic and attribute information from
unlabelled corpora and applying controlled retrieval for the required
information. Furthermore, a dataset comprising 803,564 consultation records was
compiled in China, significantly enhancing the model's capability for complex
dialogues and proactive inquiry initiation. Comprehensive experiments
demonstrate the proposed method's effectiveness in extending the context window
length of existing LLMs. The experimental outcomes and extensive data validate
the framework's superiority over five baseline models in terms of BLEU and
ROUGE performance metrics, with substantial leads in certain capabilities.
Notably, ablation studies confirm the significance of the TEIR and EICL
components. In addition, our new framework has the potential to significantly
improve patient satisfaction in real clinical consulting situations.
Authors' comments: The 46th European Conference on Information Retrieval Workshop
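Key terminology retrieval, as in TEIR, can be pictured as spotting domain terms in a patient's question and injecting their definitions into the prompt. A minimal sketch, assuming a hypothetical glossary dictionary and plain case-insensitive substring matching (a simplification that can also match inside longer words); longer terms win over shorter terms they contain. This is an illustration, not the authors' TEIR module.

```python
def find_terms(query, glossary):
    """Glossary terms occurring in the query, longest first, so that
    'type 2 diabetes' is preferred over the shorter 'diabetes'."""
    text = query.lower()
    found = []
    for term in sorted(glossary, key=len, reverse=True):
        if term in text and not any(term in longer for longer in found):
            found.append(term)
    return found

def build_prompt(query, glossary):
    """Prepend the definitions of the matched terms to the patient question."""
    notes = [f"- {t}: {glossary[t]}" for t in find_terms(query, glossary)]
    return "Terminology:\n" + "\n".join(notes) + f"\n\nPatient question: {query}"
```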
Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Gaopeng Gou, Qi Wu
Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a
broad range of visual content manipulation intents across domains, scenes, objects,
and attributes. The key challenge for ZS-CIR tasks is to modify a reference
image according to manipulation text to accurately retrieve a target image,
especially when the reference image is missing essential target content. In
this paper, we propose a novel prediction-based mapping network, named
PrediCIR, to adaptively predict the missing target visual content in reference
images in the latent space before mapping for accurate ZS-CIR. Specifically, a
world view generation module first constructs a source view by omitting certain
visual content of a target view, coupled with an action that includes the
manipulation intent derived from existing image-caption pairs. Then, a target
content prediction module trains a world model as a predictor to adaptively
predict, in the latent space, the missing visual information guided by the user's
intention in the manipulation text. The two modules map an image with the predicted
relevant information to a pseudo-word token without extra supervision. Our
model shows strong generalization ability on six ZS-CIR tasks. It obtains
consistent and significant performance boosts ranging from 1.73% to 4.45% over
the best methods and achieves new state-of-the-art results on ZS-CIR. Our code
is available at https://github.com/Pter61/predicir.
Authors' comments: This work has been accepted to CVPR 2025
Zahra Khalila, Arbi Haza Nasution, Winda Monika, Aytug Onan, Yohei Murakami, Yasir Bin Ismail Radi, Noor Mohammad Osmani
Accurate and contextually faithful responses are critical when applying large
language models (LLMs) to sensitive and domain-specific tasks, such as
answering queries related to Quranic studies. General-purpose LLMs often
struggle with hallucinations, where generated responses deviate from
authoritative sources, raising concerns about their reliability in religious
contexts. This challenge highlights the need for systems that can integrate
domain-specific knowledge while maintaining response accuracy, relevance, and
faithfulness. In this study, we investigate 13 open-source LLMs categorized
into large (e.g., Llama3:70b, Gemma2:27b, QwQ:32b), medium (e.g., Gemma2:9b,
Llama3:8b), and small (e.g., Llama3.2:3b, Phi3:3.8b) models. Retrieval-Augmented
Generation (RAG) is used to compensate for the limitations of using the
models on their own. This research utilizes a descriptive dataset of Quranic surahs
including the meanings, historical context, and qualities of the 114 surahs,
allowing the model to gather relevant knowledge before responding. The models
are evaluated by human evaluators on three key metrics: context
relevance, answer faithfulness, and answer relevance. The findings reveal that
large models consistently outperform smaller models in capturing query
semantics and producing accurate, contextually grounded responses. The
Llama3.2:3b model, despite being categorized as small, performs very well on
faithfulness (4.619) and relevance (4.857), showing the promise of well-optimized
smaller architectures. This article examines the
trade-offs between model size, computational efficiency, and response quality
while using LLMs in domain-specific applications.
Authors' comments: 11 pages, keywords: Large-language-models; retrieval-augmented
generation; question answering; Quranic studies; Islamic teachings
Qiang Zou, Shuli Cheng, Jiayi Chen
Cross-modal hashing is a promising approach for efficient data retrieval and
storage optimization. However, contemporary methods exhibit significant
limitations in semantic preservation, contextual integrity, and information
redundancy, which constrains retrieval efficacy. We present PromptHash, an
innovative framework leveraging affinity prompt-aware collaborative learning
for adaptive cross-modal hashing. We propose an end-to-end framework for
affinity-prompted collaborative hashing, with the following fundamental
technical contributions: (i) a text affinity prompt learning mechanism that
preserves contextual information while maintaining parameter efficiency, (ii)
an adaptive gated selection fusion architecture that combines a State Space
Model with a Transformer network for precise cross-modal feature integration, and
(iii) a prompt affinity alignment strategy that bridges modal heterogeneity
through hierarchical contrastive learning. To the best of our knowledge, this
study presents the first investigation into affinity prompt awareness within
collaborative cross-modal adaptive hash learning, establishing a paradigm for
enhanced semantic consistency across modalities. Through comprehensive
evaluation on three benchmark multi-label datasets, PromptHash demonstrates
substantial performance improvements over existing approaches. Notably, on the
NUS-WIDE dataset, our method achieves significant gains of 18.22% and 18.65% in
image-to-text and text-to-image retrieval tasks, respectively. The code is
publicly available at https://github.com/ShiShuMo/PromptHash.
Authors' comments: Accepted by CVPR 2025
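Two generic ingredients behind the description above can be sketched: a gate that mixes two feature streams per dimension (as in a gated fusion of SSM and Transformer features), and the standard cross-modal hashing step of binarizing features with the sign function and ranking the database by Hamming distance. The gate weights here are fixed inputs rather than learned parameters, and the function names and toy codes are hypothetical; this is not PromptHash's architecture.

```python
import math

def gated_fuse(a, b, w_a, w_b, bias=0.0):
    """Per-dimension gate g = sigmoid(w_a*a + w_b*b + bias); output g*a + (1-g)*b.
    In a trained model the gate weights would be learned."""
    out = []
    for x, y, wa, wb in zip(a, b, w_a, w_b):
        g = 1.0 / (1.0 + math.exp(-(wa * x + wb * y + bias)))
        out.append(g * x + (1.0 - g) * y)
    return out

def to_hash(features):
    """Binarize a real-valued feature vector with the sign function."""
    return tuple(1 if f >= 0 else 0 for f in features)

def hamming(c1, c2):
    """Number of differing bits between two equal-length codes."""
    return sum(b1 != b2 for b1, b2 in zip(c1, c2))

def rank_by_hamming(query_code, db_codes):
    """Database ids (e.g. images for a text query) sorted by Hamming distance."""
    return sorted(db_codes, key=lambda k: hamming(query_code, db_codes[k]))
```

Binary codes make cross-modal search cheap: a text query's code is compared against image codes bit by bit instead of through floating-point similarity.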
Hisashi Johno, Yuki Johno, Akitomo Amakawa, Junichi Sato, Ryota Tozuka, Atsushi Komaba, Hiroaki Watanabe, Hiroki Watanabe et al.
Purpose: Retrieval-augmented generation (RAG) is a technology to enhance the
functionality and reliability of large language models (LLMs) by retrieving
relevant information from reliable external knowledge (REK). RAG has gained
interest in radiology, and we previously reported the utility of NotebookLM, an
LLM with RAG (RAG-LLM), for lung cancer staging. However, since the comparator
LLM differed from NotebookLM's internal model, it remained unclear whether its
advantage stemmed from RAG or inherent model differences. To better isolate
RAG's impact and assess its utility across different cancers, we compared
NotebookLM with its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer
staging experiment.
Materials and Methods: A summary of Japan's pancreatic cancer staging
guidelines was used as REK. We compared three groups in staging 100 fictional
pancreatic cancer cases based on CT findings: REK+/RAG+ (NotebookLM with REK),
REK+/RAG- (Gemini 2.0 Flash with REK), and REK-/RAG- (Gemini 2.0 Flash without
REK). Staging criteria included TNM classification, local invasion
factors, and resectability classification. In REK+/RAG+, retrieval accuracy was
quantified based on the sufficiency of retrieved REK excerpts.
Results: REK+/RAG+ achieved a staging accuracy of 70%, outperforming
REK+/RAG- (38%) and REK-/RAG- (35%). For TNM classification, REK+/RAG+ attained
80% accuracy, exceeding REK+/RAG- (55%) and REK-/RAG- (50%). Additionally,
REK+/RAG+ explicitly presented retrieved REK excerpts, achieving a retrieval
accuracy of 92%.
Conclusion: NotebookLM, a RAG-LLM, outperformed its internal LLM, Gemini 2.0
Flash, in a pancreatic cancer staging experiment, suggesting that RAG may
improve LLMs' staging accuracy. Furthermore, its ability to retrieve and
present REK excerpts provides transparency for physicians, highlighting its
applicability for clinical diagnosis and classification.
Authors' comments: 11 pages, 6 figures, 2 tables, 6 supplementary files
Yang Tan, Chen Liu, Jingyuan Gao, Banghao Wu, Mingchen Li, Ruilin Wang, Lingrong Zhang, Huiqun Yu et al.
Natural language processing (NLP) has significantly influenced scientific
domains beyond human language, including protein engineering, where pre-trained
protein language models (PLMs) have demonstrated remarkable success. However,
interdisciplinary adoption remains limited due to challenges in data
collection, task benchmarking, and application. This work presents
VenusFactory, a versatile engine that integrates biological data retrieval,
standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory
supports both the computer science and biology communities with a choice of
command-line execution or a Gradio-based no-code interface, and integrates 40+
protein-related datasets and 40+ popular PLMs. All implementations are
open-sourced at https://github.com/tyang816/VenusFactory.
Authors' comments: 12 pages, 1 figure, 8 tables
Jingyi Chen, Songqiang Chen, Jialun Cao, Jiasi Shen, Shing-Chi Cheung
Retrieval-augmented generation (RAG) has increasingly shown its power in extending large language models' (LLMs') capability beyond their pre-trained knowledge. Existing works have shown that RAG can help with software development tasks such as code generation, code update, and test generation. Yet, the effectiveness of adapting LLMs to fast-evolving or less common API libraries using RAG remains unknown. To bridge this gap, we take an initial step to study this unexplored yet practical setting: when developers code with a less common library, they often refer to its API documentation; likewise, when LLMs are allowed to look up API documentation via RAG, to what extent can they be improved? To mimic such a setting, we select four less common open-source Python libraries with a total of 1017 eligible APIs. We study the factors that affect the effectiveness of using the documentation of less common API libraries as additional knowledge for retrieval and generation. Our extensive study yields interesting findings: (1) RAG improves LLMs' performance by 83%-220%. (2) Example code contributes the most to improving LLMs, more than the descriptive texts and parameter lists in the API documentation. (3) LLMs can sometimes tolerate mild noise (typos in descriptions or incorrect parameters) by referencing their pre-trained knowledge or the document context. Finally, we suggest that developers pay more attention to the quality and diversity of the code examples in API documentation. The study sheds light on future low-code software development workflows.
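Finding (2) suggests a practical ordering when building the retrieved context: put each API's example code first so it survives truncation. A minimal sketch under assumptions: the `ApiDoc` record layout, the word-overlap retriever, and the character budget are hypothetical design choices motivated by that finding, not the study's pipeline, and `geo` is a made-up library.

```python
from dataclasses import dataclass

@dataclass
class ApiDoc:
    name: str
    description: str
    params: str
    example: str

def retrieve(task, docs, k=1):
    """Rank API docs by word overlap between the task and the doc's name and description."""
    words = set(task.lower().split())
    def overlap(d):
        doc_words = (d.name.replace(".", " ") + " " + d.description).lower().split()
        return len(words & set(doc_words))
    return sorted(docs, key=overlap, reverse=True)[:k]

def build_context(task, docs, budget=400):
    """Concatenate retrieved docs with example code first, cut to a character budget."""
    parts = [
        f"# {d.name}\n{d.example}\n# {d.description}\n# params: {d.params}"
        for d in retrieve(task, docs)
    ]
    return "\n\n".join(parts)[:budget]
```

With a tight budget, only the header and the example survive, which is the part the study found most useful.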
Dehui Yang, Feng Xi
This paper is concerned with the fundamental problem of estimating chirp parameters from a mixture of linear chirp signals. Unlike most previous methods, which solve the problem by discretizing the parameter space and then estimating the chirp parameters, we propose a gridless approach by reformulating the inverse problem as a constrained two-dimensional atomic norm minimization from structured measurements. This reformulation enables the direct estimation of continuous-valued parameters without discretization, thereby resolving the issue of basis mismatch. An approximate semidefinite program (SDP) is employed to solve the proposed convex program. Additionally, a dual polynomial is constructed to certify the optimality of the atomic decomposition. Numerical simulations demonstrate that exact recovery of chirp parameters is achievable using the proposed atomic norm minimization.
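For orientation, the generic ingredients of gridless chirp estimation can be written down as follows, under stated assumptions: a linear chirp is parameterized by a frequency f and a chirp rate r on integer samples t (both taken modulo 1, since t and t^2 are integers), and Φ and y are placeholders for the paper's structured measurement operator and data, which the abstract does not specify. The paper's constrained two-dimensional formulation may differ from this one-dimensional-atom form.

```latex
x(t) = \sum_{k=1}^{K} c_k \, e^{\, i 2\pi \left( f_k t + r_k t^2 \right)}, \qquad t = 0, 1, \dots, N-1,

[\mathbf{a}(f, r)]_t = e^{\, i 2\pi \left( f t + r t^2 \right)}, \qquad
\mathcal{A} = \left\{ \mathbf{a}(f, r) : f \in [0, 1),\ r \in [0, 1) \right\},

\| \mathbf{x} \|_{\mathcal{A}}
  = \inf \Big\{ \sum_{k} |c_k| \;:\; \mathbf{x} = \sum_{k} c_k \, \mathbf{a}(f_k, r_k) \Big\},

\min_{\mathbf{z}} \ \| \mathbf{z} \|_{\mathcal{A}} \quad \text{s.t.} \quad \Phi(\mathbf{z}) = \mathbf{y}.
```

Because f and r range over continuous intervals rather than a grid, no basis mismatch arises; the SDP mentioned in the abstract is what makes this infimum computable in practice.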
Firoj Alam, Julia Maria Struß, Tanmoy Chakraborty, Stefan Dietze, Salim Hafid, Katerina Korre, Arianna Muti, Preslav Nakov et al.
The CheckThat! lab aims to advance the development of innovative technologies
designed to identify and counteract online disinformation and manipulation
efforts across various languages and platforms. The first five editions focused
on key tasks in the information verification pipeline, including
check-worthiness, evidence retrieval and pairing, and verification. Since the
2023 edition, the lab has expanded its scope to address auxiliary tasks that
support research and decision-making in verification. In the 2025 edition, the
lab revisits core verification tasks while also considering auxiliary
challenges. Task 1 focuses on the identification of subjectivity (a follow-up
from CheckThat! 2024), Task 2 addresses claim normalization, Task 3 targets
fact-checking numerical claims, and Task 4 explores scientific web discourse
processing. These tasks present challenging classification and retrieval
problems at both the document and span levels, including multilingual settings.
Authors' comments: misinformation, factuality, fact-checking, fact-checkers,
check-worthiness, Social Media Platforms
Yuelyu Ji, Hang Zhang, Yanshan Wang
Medical Question Answering systems based on Retrieval-Augmented Generation are promising for clinical decision support because they can integrate external knowledge, thus reducing inaccuracies inherent in standalone large language models (LLMs). However, these systems may unintentionally propagate or amplify biases associated with sensitive demographic attributes like race, gender, and socioeconomic factors. This study systematically evaluates demographic biases within medical RAG pipelines across multiple QA benchmarks, including MedQA, MedMCQA, MMLU, and EquityMedQA. We quantify disparities in retrieval consistency and answer correctness by generating and analyzing queries sensitive to demographic variations. We further implement and compare several bias mitigation strategies to address identified biases, including Chain of Thought reasoning, Counterfactual filtering, Adversarial prompt refinement, and Majority Vote aggregation. Experimental results reveal significant demographic disparities, highlighting that Majority Vote aggregation notably improves accuracy and fairness metrics. Our findings underscore the critical need for explicitly fairness-aware retrieval methods and prompt engineering strategies to develop truly equitable medical QA systems.
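Majority Vote aggregation and a disparity measure can both be sketched in a few lines. This assumes that each question is asked in several demographic variants and the answers are pooled by vote, and it uses the gap between the best and worst group accuracy as one simple disparity measure; the study's exact fairness metrics may differ, and the answer data below are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Most frequent answer; ties go to the answer that appears first."""
    counts = Counter(answers)
    best = max(counts.values())
    return next(a for a in answers if counts[a] == best)

def accuracy_gap(results):
    """results: group -> list of (predicted, gold) pairs.
    Returns max minus min accuracy across groups (0.0 means parity)."""
    acc = {g: sum(p == y for p, y in pairs) / len(pairs) for g, pairs in results.items()}
    return max(acc.values()) - min(acc.values())
```

Voting over demographic variants of the same question makes the final answer less sensitive to any single variant, which is one intuition for why this strategy can improve both accuracy and fairness.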