Tianlu Zheng, Yifan Zhang, Xiang An, Ziyong Feng, Kaicheng Yang, Qichuan Ding
Although Contrastive Language-Image Pre-training (CLIP) exhibits strong
performance across diverse vision tasks, its application to person
representation learning faces two critical challenges: (i) the scarcity of
large-scale annotated vision-language data focused on person-centric images,
and (ii) the inherent limitations of global contrastive learning, which
struggles to maintain discriminative local features crucial for fine-grained
matching while remaining vulnerable to noisy text tokens. This work advances
CLIP for person representation learning through synergistic improvements in
data curation and model architecture. First, we develop a noise-resistant data
construction pipeline that leverages the in-context learning capabilities of
multimodal large language models (MLLMs) to automatically filter and caption web-sourced images. This yields
WebPerson, a large-scale dataset of 5M high-quality person-centric image-text
pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking
Synergetic) framework, which improves cross-modal alignment by adaptively
masking noisy textual tokens based on the gradient-attention similarity score.
Additionally, we incorporate masked token prediction objectives that compel the
model to predict informative text tokens, enhancing fine-grained semantic
representation learning. Extensive experiments show that GA-DMS achieves
state-of-the-art performance across multiple benchmarks.
Authors' comments: Accepted by EMNLP 2025 Main
Nana Han, Dong Liu, Tomas Norton
Large language models (LLMs) are increasingly being recognised as valuable knowledge communication tools in many industries. However, their application in livestock farming remains limited, constrained by several factors, not least the availability, diversity and complexity of knowledge sources. This study introduces an intelligent knowledge assistant system designed to support health management in farmed goats. Leveraging Retrieval-Augmented Generation (RAG), two structured knowledge processing methods, table textualization and decision-tree textualization, were proposed to enhance LLMs' understanding of heterogeneous data formats. Based on these methods, a domain-specific goat farming knowledge base was established to improve the LLM's capacity for cross-scenario generalization. The knowledge base spans five key domains: Disease Prevention and Treatment, Nutrition Management, Rearing Management, Goat Milk Management, and Basic Farming Knowledge. Additionally, an online search module is integrated to enable real-time retrieval of up-to-date information. To evaluate system performance, six ablation experiments were conducted to examine the contribution of each component. The results demonstrated that the heterogeneous knowledge fusion method achieved the best results, with mean accuracies of 87.90% on the validation set and 84.22% on the test set. Across the text-based, table-based, and decision-tree-based Q&A tasks, accuracy consistently exceeded 85%, validating the effectiveness of structured knowledge fusion within a modular design. Error analysis identified omission as the predominant error category, highlighting opportunities to further improve retrieval coverage and context integration. In conclusion, the results highlight the robustness and reliability of the proposed system for practical applications in goat farming.
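As a rough illustration of what table textualization could look like, the sketch below turns each table row into one self-contained sentence so that a retriever never sees a cell without its column header. The function name `textualize_table`, the sentence template, and the placeholder rows are assumptions made for illustration; the abstract does not describe the authors' actual procedure.

```python
from typing import Dict, List


def textualize_table(title: str, rows: List[Dict[str, str]]) -> List[str]:
    """Turn each table row into one self-contained sentence for retrieval."""
    sentences = []
    for row in rows:
        # Join "column is value" pairs so no cell loses its header context.
        facts = "; ".join(f"{column} is {value}" for column, value in row.items())
        sentences.append(f"In the table '{title}', {facts}.")
    return sentences


# Hypothetical placeholder rows; these are not data from the paper.
example_rows = [
    {"item": "A", "category": "X"},
    {"item": "B", "category": "Y"},
]
for sentence in textualize_table("Example", example_rows):
    print(sentence)
```

Each resulting sentence can then be chunked and indexed like ordinary text, which is one plausible reason such a step would help a text-oriented retriever.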
Tejas Pawar, Sarika Patil, Om Tilekar, Rushikesh Janwade, Vaibhav Helambe
Conversational AI systems often struggle with maintaining coherent,
contextual memory across extended interactions, limiting their ability to
provide personalized and contextually relevant responses. This paper presents
IMDMR (Intelligent Multi-Dimensional Memory Retrieval), a novel system that
addresses these limitations through a multi-dimensional search architecture.
Unlike existing memory systems that rely on single-dimensional approaches,
IMDMR leverages six distinct memory dimensions (semantic, entity, category,
intent, context, and temporal) to provide comprehensive memory retrieval
capabilities. Our system incorporates intelligent query processing with dynamic
strategy selection, cross-memory entity resolution, and advanced memory
integration techniques. Through comprehensive evaluation against five baseline
systems including LangChain RAG, LlamaIndex, MemGPT, and spaCy + RAG, IMDMR
achieves a 3.8x improvement in overall performance (0.792 vs 0.207 for the best
baseline). We present both simulated (0.314) and production (0.792)
implementations, demonstrating the importance of real technology integration
while maintaining superiority over all baseline systems. Ablation studies
demonstrate the effectiveness of multi-dimensional search, with the full system
outperforming individual dimension approaches by 23.3%. Query-type analysis
reveals superior performance across all categories, particularly for
preferences/interests (0.630) and goals/aspirations (0.630) queries.
Comprehensive visualizations and statistical analysis confirm the significance
of these improvements with p < 0.001 across all metrics. The results establish
IMDMR as a significant advancement in conversational AI memory systems,
providing a robust foundation for enhanced user interactions and personalized
experiences.
Authors' comments: 28 pages, 8 figures, submitted to arXiv for open access publication
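The headline ratio in the IMDMR abstract can be checked directly from the scores it reports. The short calculation below uses only the three numbers quoted in the abstract; the variable names are illustrative.

```python
full_system = 0.792    # production implementation score, from the abstract
best_baseline = 0.207  # best baseline score, from the abstract
simulated = 0.314      # simulated implementation score, from the abstract

# Ratio behind the "3.8x improvement" claim.
improvement = full_system / best_baseline
# Gap between the production and simulated implementations.
production_vs_simulated = full_system / simulated

print(f"improvement over best baseline: {improvement:.2f}x")
print(f"production vs simulated: {production_vs_simulated:.2f}x")
```

The first ratio rounds to the 3.8x quoted in the abstract; the second shows that the production implementation scores roughly two and a half times the simulated one.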
Mohammad Saqib Hasan, Sayontan Ghosh, Dhruv Verma, Geoff Kuenning, Erez Zadok, Scott A. Smolka, Niranjan Balasubramanian
We study the problem of Open-Vocabulary Constructs (OVCs) -- ones not known
beforehand -- in the context of converting natural language (NL) specifications
into formal languages (e.g., temporal logic or code). Models fare poorly on
OVCs due to a lack of necessary knowledge a priori. In such situations, a
domain expert can provide correct constructs at inference time based on their
preferences or domain knowledge. Our goal is to effectively reuse this
inference-time, expert-provided knowledge for future parses without retraining
the model. We present dynamic knowledge-augmented parsing (DKAP), where in
addition to the input sentence, the model receives (dynamically growing) expert
knowledge as a key-value lexicon that associates NL phrases with correct OVC
constructs. We propose ROLex, a retrieval-augmented parsing approach that uses
this lexicon. A retriever and a generator are trained to find and use the
key-value store to produce the correct parse. A key challenge lies in curating
data for this retrieval-augmented parser. We utilize synthetic data generation
and data augmentation techniques on annotated (NL sentence, FL statement)
pairs to train the augmented parser. To improve training effectiveness, we
propose multiple strategies to teach models to focus on the relevant subset of
retrieved knowledge. Finally, we introduce a new evaluation paradigm modeled
after the DKAP problem and simulate the scenario across three formalization
tasks (NL2LTL, NL2Code, and NL2CMD). Our evaluations show that DKAP is a
difficult challenge, and ROLex helps improve the performance of baseline models
by using dynamic expert knowledge effectively.
Authors' comments: Accepted to COLM 2024
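The DKAP setting described above, a dynamically growing key-value lexicon whose relevant entries are passed to the parser alongside the input sentence, can be sketched as follows. This is only an illustration: ROLex trains its retriever, whereas this sketch substitutes plain token overlap, and the class `ExpertLexicon`, the function `build_parser_input`, and the prompt layout are all hypothetical.

```python
import re
from typing import Dict, List, Tuple


class ExpertLexicon:
    """Growing key-value store mapping NL phrases to expert-provided constructs."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}

    def add(self, phrase: str, construct: str) -> None:
        # A later expert correction overwrites an earlier one for the same phrase.
        self.entries[phrase.lower()] = construct

    def retrieve(self, sentence: str, k: int = 3) -> List[Tuple[str, str]]:
        # Score each entry by how many of its tokens occur in the sentence.
        tokens = set(re.findall(r"\w+", sentence.lower()))
        scored = []
        for phrase, construct in self.entries.items():
            overlap = len(tokens & set(re.findall(r"\w+", phrase)))
            if overlap > 0:
                scored.append((overlap, phrase, construct))
        # Highest overlap first; ties broken alphabetically for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [(phrase, construct) for _, phrase, construct in scored[:k]]


def build_parser_input(sentence: str, lexicon: ExpertLexicon) -> str:
    """Prepend the retrieved lexicon entries to the sentence (hypothetical layout)."""
    hints = "; ".join(f"{p} => {c}" for p, c in lexicon.retrieve(sentence))
    return f"knowledge: {hints} | sentence: {sentence}"


lexicon = ExpertLexicon()
lexicon.add("pump is running", "pump_on")  # hypothetical expert-provided entry
print(build_parser_input("Eventually the pump is running", lexicon))
```

Because the lexicon is consulted at inference time, new expert entries affect later parses without retraining the model, which is the property the abstract emphasizes.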
Amay Jain, Liu Cui, Si Chen
Large language models like ChatGPT are increasingly used in classrooms, but
they often provide outdated or fabricated information that can mislead
students. Retrieval Augmented Generation (RAG) improves reliability of LLMs by
grounding responses in external resources. We investigate two accessible RAG
paradigms, vector-based retrieval and graph-based retrieval, to identify best
practices for classroom question answering (QA). Existing comparative studies
fail to account for pedagogical factors such as educational disciplines,
question types, and practical deployment costs. Using a novel dataset,
EduScopeQA, of 3,176 questions across academic subjects, we measure performance
on various educational query types, from specific facts to broad thematic
discussions. We also evaluate system alignment with a dataset of systematically
altered textbooks that contradict the LLM's latent knowledge. We find that
OpenAI Vector Search RAG (representing vector-based RAG) performs well as a
low-cost generalist, especially for quick fact retrieval. On the other hand,
GraphRAG Global excels at providing pedagogically rich answers to thematic
queries, and GraphRAG Local achieves the highest accuracy with the dense,
altered textbooks when corpus integrity is critical. Accounting for the 10-20x
higher resource usage of GraphRAG (representing graph-based RAG), we show that
a dynamic branching framework that routes queries to the optimal retrieval
method boosts fidelity and efficiency. These insights provide actionable
guidelines for educators and system designers to integrate RAG-augmented LLMs
into learning environments effectively.
Authors' comments: This work has been submitted to the IEEE for possible publication
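The dynamic branching idea in the abstract above, sending each query to the retrieval method that suits it, can be sketched as a small router. The routing rules mirror the trade-offs the abstract reports, but the cue-word heuristic, the `route_query` function, and the `corpus_conflicts_with_llm` flag are assumptions for illustration; the abstract does not say how the authors' framework decides.

```python
from enum import Enum


class Retriever(Enum):
    VECTOR = "vector"
    GRAPH_LOCAL = "graphrag_local"
    GRAPH_GLOBAL = "graphrag_global"


# Hypothetical cue words; a real router would likely need a tuned classifier.
THEMATIC_CUES = ("theme", "overall", "summarize", "compare", "significance")


def route_query(question: str, corpus_conflicts_with_llm: bool = False) -> Retriever:
    """Pick a retrieval method using the trade-offs reported in the abstract."""
    if corpus_conflicts_with_llm:
        # Corpus integrity is critical: the abstract reports GraphRAG Local
        # as the most accurate option on the altered textbooks.
        return Retriever.GRAPH_LOCAL
    lowered = question.lower()
    if any(cue in lowered for cue in THEMATIC_CUES):
        # Broad thematic questions: the abstract favors GraphRAG Global.
        return Retriever.GRAPH_GLOBAL
    # Default to the low-cost generalist for quick fact retrieval.
    return Retriever.VECTOR


print(route_query("Summarize the overall theme of chapter 3"))
```

Defaulting to the vector path keeps the 10-20x more expensive graph-based methods for the queries where the abstract says they pay off.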
Minghan Li, Miyang Luo, Tianrui Lv, Yishuai Zhang, Siqi Zhao, Ercong Nie, Guodong Zhou
The proliferation of long-form documents presents a fundamental challenge to
information retrieval (IR), as their length, dispersed evidence, and complex
structures demand specialized methods beyond standard passage-level techniques.
This survey provides the first comprehensive treatment of long-document
retrieval (LDR), consolidating methods, challenges, and applications across
three major eras. We systematize the evolution from classical lexical and early
neural models to modern pre-trained language models (PLMs) and large language models (LLMs),
covering key paradigms like passage aggregation, hierarchical encoding,
efficient attention, and the latest LLM-driven re-ranking and retrieval
techniques. Beyond the models, we review domain-specific applications,
specialized evaluation resources, and outline critical open challenges such as
efficiency trade-offs, multimodal alignment, and faithfulness. This survey aims
to provide both a consolidated reference and a forward-looking agenda for
advancing long-document retrieval in the era of foundation models.
Authors' comments: 33 pages, 6 figures
Yi Xie, Ziyuan Yang, Yongqiang Huang, Yinyu Chen, Lei Zhang, Liang Liu, Yi Zhang
Android malware detection continues to face persistent challenges stemming
from long-term concept drift and class imbalance, as evolving malicious
behaviors and shifting usage patterns dynamically reshape feature
distributions. Although continual learning (CL) mitigates drift, existing
replay-based methods suffer from inherent bias. Specifically, their reliance on
classifier uncertainty for sample selection disproportionately prioritizes the
dominant benign class, causing overfitting and reduced generalization to
evolving malware. To address these limitations, we propose a novel
uncertainty-guided CL framework. First, we introduce a hierarchical balanced
sampler that employs a dual-phase uncertainty strategy to dynamically balance
benign and malicious samples while simultaneously selecting high-information,
high-uncertainty instances within each class. This mechanism ensures class
equilibrium across both replay and incremental data, thereby enhancing
adaptability to emerging threats. Second, we augment the framework with a
vector retrieval mechanism that exploits historical malware embeddings to
identify evolved variants via similarity-based retrieval, thereby complementing
classifier updates. Extensive experiments demonstrate that our framework
significantly outperforms state-of-the-art methods under strict low-label
conditions (50 labels per phase). It achieves a true positive rate (TPR) of
92.95% and a mean accuracy (mACC) of 94.26%, which validates its efficacy for
sustainable Android malware detection.
Authors' comments: 10 pages, 6 figures
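One way to read the class-balanced, uncertainty-driven selection described in the abstract above is sketched below: split the labeling budget evenly between classes, then take the most uncertain samples within each class. The abstract does not specify the dual-phase uncertainty strategy, so the uncertainty measure (distance of the predicted probability from 0.5), the even split, and the assumption that each sample already carries a class assignment (for example a pseudo-label) are all illustrative choices.

```python
from typing import List, Tuple

# (sample_id, class assignment, predicted malware probability); hypothetical layout.
Sample = Tuple[str, int, float]


def uncertainty(prob: float) -> float:
    """Highest (1.0) at probability 0.5, lowest (0.0) at probability 0 or 1."""
    return 1.0 - abs(prob - 0.5) * 2.0


def balanced_uncertain_selection(samples: List[Sample], budget: int) -> List[str]:
    """Split the budget evenly across classes, then take the most uncertain in each."""
    per_class = budget // 2
    selected: List[str] = []
    for label in (0, 1):  # 0 = benign, 1 = malicious
        pool = [s for s in samples if s[1] == label]
        pool.sort(key=lambda s: uncertainty(s[2]), reverse=True)
        selected.extend(s[0] for s in pool[:per_class])
    return selected


samples = [
    ("a", 0, 0.45), ("b", 0, 0.05), ("c", 0, 0.30),
    ("d", 1, 0.55), ("e", 1, 0.99),
]
print(balanced_uncertain_selection(samples, budget=2))
```

Ranking by uncertainty alone over the whole pool would let the majority class dominate the selection, which is the bias the abstract attributes to existing replay-based methods.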
Zihan Chen, Lei Shi, Weize Wu, Qiji Zhou, Yue Zhang
Many contemporary data-driven research efforts in the natural sciences, such as chemistry and materials science, require large-scale, high-performance entity recognition from scientific datasets. Large language models (LLMs) have increasingly been adopted to solve the entity recognition task, a trend also observed across the full spectrum of NLP tasks. The prevailing entity recognition LLMs rely on fine-tuning, yet the fine-tuning process often incurs significant cost. To achieve the best performance-cost trade-off, we propose ALLabel, a three-stage framework designed to select the most informative and representative samples in preparing the demonstrations for LLM modeling. The annotated examples are used to construct a ground-truth retrieval corpus for LLM in-context learning. By sequentially employing three distinct active learning strategies, ALLabel consistently outperforms all baselines under the same annotation budget across three specialized domain datasets. Experimental results also demonstrate that selectively annotating only 5%-10% of the dataset with ALLabel can achieve performance comparable to the method annotating the entire dataset. Further analyses and ablation studies verify the effectiveness and generalizability of our proposal.
Jiancheng Pan, Muyuan Ma, Qing Ma, Cong Bai, Shengyong Chen
Remote sensing image-text retrieval plays a crucial role in remote sensing interpretation, yet remains challenging under both closed-domain and open-domain scenarios due to semantic noise and domain shifts. To address these issues, we propose a visual prior-guided vision-language model, PriorCLIP, which leverages visual priors for unbiased representation learning and adaptive vision-language alignment. In the closed-domain setting, PriorCLIP introduces two Progressive Attention Encoder (PAE) structures: Spatial-PAE constructs a belief matrix with instruction embeddings to filter key features and mitigate semantic bias, while Temporal-PAE exploits cyclic activation across time steps to enhance text representation. For the open-domain setting, we design a two-stage prior representation learning strategy, consisting of large-scale pre-training on coarse-grained image-text pairs, followed by fine-tuning on fine-grained pairs using vision-instruction, which enables robust retrieval across long-tail concepts and vocabulary shifts. Furthermore, a cluster-based symmetric contrastive Attribution Loss is proposed to constrain inter-class relations and alleviate semantic confusion in the shared embedding space. Extensive experiments on RSICD and RSITMD benchmarks demonstrate that PriorCLIP achieves substantial improvements, outperforming existing methods by 4.9% and 4.0% in closed-domain retrieval, and by 7.3% and 9.4% in open-domain retrieval, respectively.
Authors' comments: 14 pages, 7 figures
James McGreivy, Blaise Delaney, Anja Beck, Mike Williams
Generative Large Language Models (LLMs) are a promising approach to
structuring knowledge contained within the corpora of research literature
produced by large-scale and long-running scientific collaborations. Within
experimental particle physics, such structured knowledge bases could expedite
methodological and editorial review. Complementarily, within the broader
scientific community, generative LLM systems grounded in published work could
make for reliable companions allowing non-experts to analyze open-access data.
Techniques such as Retrieval Augmented Generation (RAG) rely on semantically
matching localized text chunks, but struggle to maintain coherent context when
relevant information spans multiple segments, leading to a fragmented
representation devoid of global cross-document information. Here, we utilize
the hierarchical organization of experimental physics articles to build a tree
representation of the corpus, and present the SciTreeRAG system that uses this
structure to create contexts that are more focused and contextually rich than
standard RAG. Additionally, we develop methods for using LLMs to transform the
unstructured corpus into a structured knowledge graph representation. We then
implement SciGraphRAG, a retrieval system that leverages this knowledge graph
to access global cross-document relationships eluding standard RAG, thereby
encapsulating domain-specific connections and expertise. We demonstrate
proof-of-concept implementations using the corpus of the LHCb experiment at
CERN.
Authors' comments: 17 pages, 3 figures
Hao Lin, Peitong Xie, Jingxue Chen, Jie Lin, Qingkun Tang, Qianchun Lu
Retrieval-Augmented Generation (RAG) systems rely heavily on the retrieval stage, particularly the coarse-ranking process. Existing coarse-ranking optimization approaches often struggle to balance domain-specific knowledge learning with query enhancement, resulting in suboptimal retrieval performance. To address this challenge, we propose MoLER, a domain-aware RAG method that uses MoL-Enhanced Reinforcement Learning to optimize retrieval. MoLER has a two-stage pipeline: a continual pre-training (CPT) phase using a Mixture of Losses (MoL) to balance domain-specific knowledge with general language capabilities, and a reinforcement learning (RL) phase leveraging Group Relative Policy Optimization (GRPO) to optimize query and passage generation for maximizing document recall. A key innovation is our Multi-query Single-passage Late Fusion (MSLF) strategy, which reduces computational overhead during RL training while maintaining scalable inference via Multi-query Multi-passage Late Fusion (MMLF). Extensive experiments on benchmark datasets show that MoLER achieves state-of-the-art performance, significantly outperforming baseline methods. MoLER bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains.
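To give a feel for what late fusion over multiple queries means, the sketch below merges the ranked lists returned for several query rewrites into one list. It uses reciprocal rank fusion, a common generic fusion rule swapped in here for illustration; the abstract does not state which fusion rule MSLF or MMLF uses, and the function name and the smoothing constant `k = 60` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranked list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly by many queries accumulate the most score.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# One ranked list per generated query (hypothetical document ids).
per_query_rankings = [["d1", "d2", "d3"], ["d2", "d1"], ["d2", "d4"]]
print(reciprocal_rank_fusion(per_query_rankings))
```

Fusing after retrieval means each query rewrite can be run independently and in parallel, with only the short ranked lists combined at the end.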
Emil Demić, Luka Čehovin Zajc
The goal of Scene-level Sketch-Based Image Retrieval is to retrieve natural
images matching the overall semantics and spatial layout of a free-hand sketch.
Unlike prior work focused on architectural augmentations of retrieval models,
we emphasize the inherent ambiguity and noise present in real-world sketches.
This insight motivates a training objective that is explicitly designed to be
robust to sketch variability. We show that with an appropriate combination of
pre-training, encoder architecture, and loss formulation, it is possible to
achieve state-of-the-art performance without the introduction of additional
complexity. Extensive experiments on the challenging FS-COCO and widely used
SketchyCOCO datasets confirm the effectiveness of our approach and underline
the critical role of training design in cross-modal retrieval tasks, as well as
the need to improve the evaluation scenarios of scene-level SBIR.
Authors' comments: Accepted to BMVC 2025
Enrico Palumbo, Gustavo Penha, Alva Liu, Marcus Eltscheminov, Jefferson Carvalho dos Santos, Alice Wang, Hugues Bouchard, Humberto Jesús Corona Pampin et al.
Spotify has recently introduced audiobooks as part of its catalog,
complementing its music and podcast offering. Search is often the first entry
point for users to access new items, and an important goal for Spotify is to
support users in the exploration of the audiobook catalog. More specifically,
we would like to enable users without a specific item in mind to broadly search
by topic, genre, story tropes, decade, and discover audiobooks, authors and
publishers they may like. To do this, we need to 1) inspire users to type more
exploratory queries for audiobooks and 2) augment our retrieval systems to
better deal with exploratory audiobook queries. This is challenging in a
cold-start scenario, where we have a retrievability bias due to the small
number of user interactions with audiobooks compared to previously available
items such as music and podcast content. To address this, we propose
AudioBoost, a system to boost audiobook retrievability in Spotify's Search via
synthetic query generation. AudioBoost leverages Large Language Models (LLMs)
to generate synthetic queries conditioned on audiobook metadata. The synthetic
queries are indexed both in the Query AutoComplete (QAC) and in the Search
Retrieval engine to improve query formulation and retrieval at the same time.
We show through offline evaluation that synthetic queries increase
retrievability and are of high quality. Moreover, results from an online A/B
test show that AudioBoost leads to +0.7% in audiobook impressions, +1.22% in
audiobook clicks, and +1.82% in audiobook exploratory query completions.
Authors' comments: EARL Workshop @ RecSys25
Cheng Qian, Hainan Zhang, Yongxin Tong, Hong-Wei Zheng, Zhiming Zheng
Centralized RAG pipelines struggle with heterogeneous and privacy-sensitive
data, especially in distributed healthcare settings where patient data spans
SQL, knowledge graphs, and clinical notes. Clinicians face difficulties
retrieving rare disease cases due to privacy constraints and the limitations of
traditional cloud-based RAG systems in handling diverse formats and edge
devices. To address this, we introduce HyFedRAG, a unified and efficient
Federated RAG framework tailored for Hybrid data modalities. By leveraging an
edge-cloud collaborative mechanism, HyFedRAG enables RAG to operate across
diverse data sources while preserving data privacy. Our key contributions are:
(1) We design an edge-cloud collaborative RAG framework built on Flower, which
supports querying structured SQL data, semi-structured knowledge graphs, and
unstructured documents. The edge-side LLMs convert diverse data into
standardized privacy-preserving representations, and the server-side LLMs
integrate them for global reasoning and generation. (2) We integrate
lightweight local retrievers with privacy-aware LLMs and provide three
anonymization tools that enable each client to produce semantically rich,
de-identified summaries for global inference across devices. (3) To optimize
response latency and reduce redundant computation, we design a three-tier
caching strategy consisting of local cache, intermediate representation cache,
and cloud inference cache. Experimental results on PMC-Patients demonstrate
that HyFedRAG outperforms existing baselines in terms of retrieval quality,
generation consistency, and system efficiency. Our framework offers a scalable
and privacy-compliant solution for RAG over structural-heterogeneous data,
unlocking the potential of LLMs in sensitive and diverse data environments.
Authors' comments: 9 pages, 7 figures
Jeffrey Huang, Yichao Zhang, Sang hyun Bae, Ballal Ahammed, Elif Ertekin, Pinshane Y. Huang
Defects and reconstructions in 2D moiré materials cause out-of-plane deformations which strongly modify their electronic properties but are difficult to experimentally access. Here, we solve the 3D atomic coordinates of twisted bilayer WSe$_2$ with picometer-scale accuracy using multislice electron ptychography (MEP) acquired from a single orientation. The resulting atomic models individually visualize each of the six atomic planes, revealing the curvature of each WSe$_2$ layer, variations in the interlayer spacing, and the 3D locations of individual vacancies -- which lie exclusively in the outer Se planes. We also observe a new, unexpected type of structural disorder consisting of mixed bending- and breathing-type moiré-induced corrugations that should strongly impact the emergent electronic properties. Broadly, our methods generate 3D atom-by-atom models of a 2D heterointerface from data acquired in about 30 seconds, methods that should unlock routine access to 3D atomic information in 2D systems and catalyze design methods to control out-of-plane deformations.
Authors' comments: 40 pages, 16 figures
Duolin Sun, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan, Jie Feng, Lianzhen Zhong, Jian Wang et al.
The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved some success. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches rely too heavily on iterative retrieval, wasting retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it accumulates across retrieval steps. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system's adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks.
Nobin Sarwar
Visual Question Answering requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources like Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.
Authors' comments: 12 pages, 6 figures and 2 tables; Accepted at ICCV 2025 Workshop on Building Foundation Models You Can Trust (T2FM)
Jinrui Yang, Fan Jiang, Timothy Baldwin
Language fairness in multilingual information retrieval (MLIR) systems is
crucial for ensuring equitable access to information across diverse languages.
This paper sheds light on the issue, based on the assumption that queries in
different languages, but with identical semantics, should yield equivalent
ranking lists when retrieving on the same multilingual documents. We evaluate
the degree of fairness using both traditional retrieval methods, and a DPR
neural ranker based on mBERT and XLM-R. Additionally, we introduce 'LaKDA', a
novel loss designed to mitigate language biases in neural MLIR approaches. Our
analysis exposes intrinsic language biases in current MLIR technologies, with
notable disparities across the retrieval methods, and the effectiveness of
LaKDA in enhancing language fairness.
Authors' comments: Accepted at EMNLP MRL 2024
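The fairness assumption in the abstract above, that semantically identical queries in different languages should yield equivalent ranking lists, suggests a simple agreement check. The sketch below averages the top-k Jaccard overlap over all language pairs. This metric and the function names are assumptions chosen for illustration; the abstract does not say which fairness measure the authors use, and Jaccard overlap ignores the order of documents within the top k.

```python
from itertools import combinations
from typing import Dict, List


def top_k_jaccard(ranking_a: List[str], ranking_b: List[str], k: int) -> float:
    """Overlap of the two top-k document sets, from 0.0 (disjoint) to 1.0 (identical)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


def mean_pairwise_agreement(rankings_by_language: Dict[str, List[str]], k: int = 10) -> float:
    """Average top-k overlap across every pair of query languages."""
    pairs = list(combinations(sorted(rankings_by_language), 2))
    if not pairs:
        return 1.0
    total = sum(
        top_k_jaccard(rankings_by_language[a], rankings_by_language[b], k)
        for a, b in pairs
    )
    return total / len(pairs)


# Hypothetical rankings for one query issued in three languages.
rankings = {
    "en": ["d1", "d2", "d3"],
    "de": ["d1", "d2", "d3"],
    "zh": ["d1", "d4", "d5"],
}
print(mean_pairwise_agreement(rankings, k=3))
```

A value near 1.0 would indicate that the system returns nearly the same documents regardless of query language; lower values point to the kind of language bias the paper examines.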
Xinyu Gao, Xiangtao Meng, Yingkai Dong, Zheng Li, Shanqing Guo
While Retrieval-Augmented Generation (RAG) effectively reduces hallucinations by integrating external knowledge bases, it introduces vulnerabilities to membership inference attacks (MIAs), particularly in systems handling sensitive data. Existing MIAs targeting RAG's external databases often rely on model responses but ignore the interference of non-member-retrieved documents on RAG outputs, limiting their effectiveness. To address this, we propose DCMI, a differential calibration MIA that mitigates the negative impact of non-member-retrieved documents. Specifically, DCMI leverages the sensitivity gap between member and non-member retrieved documents under query perturbation. It generates perturbed queries for calibration to isolate the contribution of member-retrieved documents while minimizing the interference from non-member-retrieved documents. Experiments under progressively relaxed assumptions show that DCMI consistently outperforms baselines; for example, it achieves 97.42% AUC and 94.35% Accuracy against the RAG system with Flan-T5, exceeding the MBA baseline by over 40%. Furthermore, on real-world RAG platforms such as Dify and MaxKB, DCMI maintains a 10%-20% advantage over the baseline. These results highlight significant privacy risks in RAG systems and emphasize the need for stronger protection mechanisms. We call on the community to pursue deeper investigations, like ours, into the data leakage risks in rapidly evolving RAG systems. Our code is available at https://github.com/Xinyu140203/RAG_MIA.
ZiXuan Zhang, Bowen Hao, Yingjie Li, Hongzhi Yin
Traditional Chinese Medicine (TCM) formulas play a significant role in treating epidemics and complex diseases. Existing models for TCM utilize traditional algorithms or deep learning techniques to analyze formula relationships, yet lack comprehensive results, such as complete formula compositions and detailed explanations. Although recent efforts have used TCM instruction datasets to fine-tune Large Language Models (LLMs) for explainable formula generation, existing datasets lack sufficient details, such as the roles of the formula's sovereign, minister, assistant, and courier; efficacy; contraindications; and tongue and pulse diagnosis, which limits the depth of model outputs. To address these challenges, we propose ZhiFangDanTai, a framework combining Graph-based Retrieval-Augmented Generation (GraphRAG) with LLM fine-tuning. ZhiFangDanTai uses GraphRAG to retrieve and synthesize structured TCM knowledge into concise summaries, while also constructing an enhanced instruction dataset to improve LLMs' ability to integrate retrieved information. Furthermore, we provide novel theoretical proofs demonstrating that integrating GraphRAG with fine-tuning techniques can reduce generalization error and hallucination rates in the TCM formula task. Experimental results on both collected and clinical datasets demonstrate that ZhiFangDanTai achieves significant improvements over state-of-the-art models. Our model is open-sourced at https://huggingface.co/tczzx6/ZhiFangDanTai1.0.