Zabir Al Nazi, Vagelis Hristidis, Aaron Lawson McLean, Jannat Ara Meem, Md Taukir Azam Chowdhury
Effective Question Answering (QA) on large biomedical document collections requires effective document retrieval techniques. The latter remains a challenging task due to the domain-specific vocabulary and semantic ambiguity in user queries. We propose BMQExpander, a novel ontology-aware query expansion pipeline that combines medical knowledge - definitions and relationships - from the UMLS Metathesaurus with the generative capabilities of large language models (LLMs) to enhance retrieval effectiveness. We implemented several state-of-the-art baselines, including sparse and dense retrievers, query expansion methods, and biomedical-specific solutions. We show that BMQExpander has superior retrieval performance on three popular biomedical Information Retrieval (IR) benchmarks: NFCorpus, TREC-COVID, and SciFact - with improvements of up to 22.1% in NDCG@10 over sparse baselines and up to 6.5% over the strongest baseline. Further, BMQExpander generalizes robustly under query perturbation settings, in contrast to supervised baselines, achieving up to 15.7% improvement over the strongest baseline. As a side contribution, we publish our paraphrased benchmarks. Finally, our qualitative analysis shows that BMQExpander has fewer hallucinations compared to other LLM-based query expansion baselines.
Changjian Wang, Weihong Deng, Weili Guan, Quan Lu, Ning Jiang
Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by leveraging knowledge graphs (KGs) to capture structural associations, but they tend to overly rely on structural information and fine-grained word- or phrase-level retrieval, resulting in an underutilization of textual semantics. In this paper, we propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs. Structurally, we construct an entity hypergraph where fine-grained entities serve as nodes and coarse-grained passages as hyperedges, and establish knowledge association through shared entities. Semantically, we design a hypergraph retrieval method that integrates fine-grained entity similarity and coarse-grained passage similarity via hypergraph diffusion. Finally, we employ a retrieval enhancement module, which further refines the retrieved results both semantically and structurally, to obtain the most relevant passages as context for answer generation with the LLM. Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in QA performance, and achieves a 6$\times$ speedup in retrieval efficiency.
Xingyu Deng, Xi Wang, Mark Stevenson
Identification of appropriate supporting evidence is critical to the success
of scientific fact checking. However, existing approaches rely on off-the-shelf
Information Retrieval algorithms that rank documents based on relevance rather
than the evidence they provide to support or refute the claim being checked.
This paper proposes +VeriRel which includes verification success in the
document ranking. Experimental results on three scientific fact checking
datasets (SciFact, SciFact-Open and Check-Covid) demonstrate consistently
leading performance by +VeriRel for document evidence retrieval and a positive
impact on downstream verification. This study highlights the potential of
integrating verification feedback to document relevance assessment for
effective scientific fact checking systems. It shows promising future work to
evaluate fine-grained relevance when examining complex documents for advanced
scientific fact checking.
Authors' comments: Accpeted for the 34th ACM International Conference on Information and
Knowledge Management (CIKM'25)
Yaoze Zhang, Rong Wu, Pinlong Cai, Xiaoman Wang, Guohang Yan, Song Mao, Ding Wang, Botian Shi
Retrieval-Augmented Generation (RAG) plays a crucial role in grounding Large Language Models by leveraging external knowledge, whereas the effectiveness is often compromised by the retrieval of contextually flawed or incomplete information. To address this, knowledge graph-based RAG methods have evolved towards hierarchical structures, organizing knowledge into multi-level summaries. However, these approaches still suffer from two critical, unaddressed challenges: high-level conceptual summaries exist as disconnected ``semantic islands'', lacking the explicit relations needed for cross-community reasoning; and the retrieval process itself remains structurally unaware, often degenerating into an inefficient flat search that fails to exploit the graph's rich topology. To overcome these limitations, we introduce LeanRAG, a framework that features a deeply collaborative design combining knowledge aggregation and retrieval strategies. LeanRAG first employs a novel semantic aggregation algorithm that forms entity clusters and constructs new explicit relations among aggregation-level summaries, creating a fully navigable semantic network. Then, a bottom-up, structure-guided retrieval strategy anchors queries to the most relevant fine-grained entities and then systematically traverses the graph's semantic pathways to gather concise yet contextually comprehensive evidence sets. The LeanRAG can mitigate the substantial overhead associated with path retrieval on graphs and minimizes redundant information retrieval. Extensive experiments on four challenging QA benchmarks with different domains demonstrate that LeanRAG significantly outperforming existing methods in response quality while reducing 46\% retrieval redundancy. Code is available at: https://github.com/RaZzzyz/LeanRAG
Lorenz Brehme, Benedikt Dornauer, Thomas Ströhle, Maximilian Ehrhart, Ruth Breu
Retrieval-Augmented Generation (RAG) is a well-established and rapidly evolving field within AI that enhances the outputs of large language models by integrating relevant information retrieved from external knowledge sources. While industry adoption of RAG is now beginning, there is a significant lack of research on its practical application in industrial contexts. To address this gap, we conducted a semistructured interview study with 13 industry practitioners to explore the current state of RAG adoption in real-world settings. Our study investigates how companies apply RAG in practice, providing (1) an overview of industry use cases, (2) a consolidated list of system requirements, (3) key challenges and lessons learned from practical experiences, and (4) an analysis of current industry evaluation methods. Our main findings show that current RAG applications are mostly limited to domain-specific QA tasks, with systems still in prototype stages; industry requirements focus primarily on data protection, security, and quality, while issues such as ethics, bias, and scalability receive less attention; data preprocessing remains a key challenge, and system evaluation is predominantly conducted by humans rather than automated methods.
Authors' comments: This preprint was accepted for presentation at the 17th International Conference on Knowledge Discovery and Information Retrieval (KDIR25)
Hsuan-Kung Yang, Tsu-Ching Hsiao, Ryoichiro Oka, Ryuya Nishino, Satoko Tofukuji, Norimasa Kobori
Delivering intelligent and adaptive navigation assistance in augmented reality (AR) requires more than visual cues, as it demands systems capable of interpreting flexible user intent and reasoning over both spatial and semantic context. Prior AR navigation systems often rely on rigid input schemes or predefined commands, which limit the utility of rich building data and hinder natural interaction. In this work, we propose an embodied AR navigation system that integrates Building Information Modeling (BIM) with a multi-agent retrieval-augmented generation (RAG) framework to support flexible, language-driven goal retrieval and route planning. The system orchestrates three language agents, Triage, Search, and Response, built on large language models (LLMs), which enables robust interpretation of open-ended queries and spatial reasoning using BIM data. Navigation guidance is delivered through an embodied AR agent, equipped with voice interaction and locomotion, to enhance user experience. A real-world user study yields a System Usability Scale (SUS) score of 80.5, indicating excellent usability, and comparative evaluations show that the embodied interface can significantly improves users' perception of system intelligence. These results underscore the importance and potential of language-grounded reasoning and embodiment in the design of user-centered AR navigation systems.
Authors' comments: 11 pages, 9 figures, accepted to IEEE ISMAR 2025
Roy T. Forestano, Konstantin T. Matchev, Katia Matcheva, Eyup B. Unlu
Standard Bayesian retrievals for exoplanet atmospheric parameters from
transmission spectroscopy, while well understood and widely used, are generally
computationally expensive. In the era of the JWST and other upcoming
observatories, machine learning approaches have emerged as viable alternatives
that are both efficient and robust. In this paper we present a systematic study
of several existing machine learning regression techniques and compare their
performance for retrieving exoplanet atmospheric parameters from transmission
spectra. We benchmark the performance of the different algorithms on the
accuracy, precision, and speed. The regression methods tested here include
partial least squares (PLS), support vector machines (SVM), k nearest neighbors
(KNN), decision trees (DT), random forests (RF), voting (VOTE), stacking
(STACK), and extreme gradient boosting (XGB). We also investigate the impact of
different preprocessing methods of the training data on the model performance.
We quantify the model uncertainties across the entire dynamical range of
planetary parameters. The best performing combination of ML model and
preprocessing scheme is validated on a the case study of JWST observation of
WASP-39b.
Authors' comments: 51 pages, 26 figures, Submitted to AAS Journals
Chanyeol Choi, Jihoon Kwon, Alejandro Lopez-Lira, Chaewoon Kim, Minjae Kim, Juneha Hwang, Jaeseon Ha, Hojun Choi et al.
Accurate information retrieval (IR) is critical in the financial domain, where investors must identify relevant information from large collections of documents. Traditional IR methods-whether sparse or dense-often fall short in retrieval accuracy, as it requires not only capturing semantic similarity but also performing fine-grained reasoning over document structure and domain-specific knowledge. Recent advances in large language models (LLMs) have opened up new opportunities for retrieval with multi-step reasoning, where the model ranks passages through iterative reasoning about which information is most relevant to a given query. However, there exists no benchmark to evaluate such capabilities in the financial domain. To address this gap, we introduce FinAgentBench, the first large-scale benchmark for evaluating retrieval with multi-step reasoning in finance -- a setting we term agentic retrieval. The benchmark consists of 3,429 expert-annotated examples on S&P-100 listed firms and assesses whether LLM agents can (1) identify the most relevant document type among candidates, and (2) pinpoint the key passage within the selected document. Our evaluation framework explicitly separates these two reasoning steps to address context limitations. This design enables to provide a quantitative basis for understanding retrieval-centric LLM behavior in finance. We evaluate a suite of state-of-the-art models and further demonstrated how targeted fine-tuning can significantly improve agentic retrieval performance. Our benchmark provides a foundation for studying retrieval-centric LLM behavior in complex, domain-specific tasks for finance. We will release the dataset publicly upon acceptance of the paper and plan to expand and share dataset for the full S&P 500 and beyond.
Authors' comments: 6 pages
Wenxuan Shen, Mingjia Wang, Yaochen Wang, Dongping Chen, Junjie Yang, Yao Wan, Weiwei Lin
Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language
Models (MLLMs) show great promise for complex document understanding, yet their
development is critically hampered by inadequate evaluation. Current benchmarks
often focus on specific part of document RAG system and use synthetic data with
incomplete ground truth and evidence labels, therefore failing to reflect
real-world bottlenecks and challenges. To overcome these limitations, we
introduce Double-Bench: a new large-scale, multilingual, and multimodal
evaluation system that is able to produce fine-grained assessment to each
component within document RAG systems. It comprises 3,276 documents (72,880
pages) and 5,168 single- and multi-hop queries across 6 languages and 4
document types with streamlined dynamic update support for potential data
contamination issues. Queries are grounded in exhaustively scanned evidence
pages and verified by human experts to ensure maximum quality and completeness.
Our comprehensive experiments across 9 state-of-the-art embedding models, 4
MLLMs and 4 end-to-end document RAG frameworks demonstrate the gap between text
and visual embedding models is narrowing, highlighting the need in building
stronger document retrieval models. Our findings also reveal the
over-confidence dilemma within current document RAG frameworks that tend to
provide answer even without evidence support. We hope our fully open-source
Double-Bench provide a rigorous foundation for future research in advanced
document RAG systems. We plan to retrieve timely corpus and release new
benchmarks on an annual basis.
Authors' comments: In submission. Project website: https://double-bench.github.io/
Wenlong Wu, Haofen Wang, Bohan Li, Peixuan Huang, Xinzhe Zhao, Lei Liang
Retrieval Augmented Generation (RAG) has emerged as a promising solution to
address hallucination issues in Large Language Models (LLMs). However, the
integration of multiple retrieval sources, while potentially more informative,
introduces new challenges that can paradoxically exacerbate hallucination
problems. These challenges manifest primarily in two aspects: the sparse
distribution of multi-source data that hinders the capture of logical
relationships and the inherent inconsistencies among different sources that
lead to information conflicts. To address these challenges, we propose
MultiRAG, a novel framework designed to mitigate hallucination in multi-source
retrieval-augmented generation through knowledge-guided approaches. Our
framework introduces two key innovations: (1) a knowledge construction module
that employs multi-source line graphs to efficiently aggregate logical
relationships across different knowledge sources, effectively addressing the
sparse data distribution issue; and (2) a sophisticated retrieval module that
implements a multi-level confidence calculation mechanism, performing both
graph-level and node-level assessments to identify and eliminate unreliable
information nodes, thereby reducing hallucinations caused by inter-source
inconsistencies. Extensive experiments on four multi-domain query datasets and
two multi-hop QA datasets demonstrate that MultiRAG significantly enhances the
reliability and efficiency of knowledge retrieval in complex multi-source
scenarios. \textcolor{blue}{Our code is available in
https://github.com/wuwenlong123/MultiRAG.
Authors' comments: Accepted by ICDE 2025 Research Paper
Shreyank N Gowda, Xiaobo Jin, Christian Wagner
In cross-modal retrieval tasks, such as image-to-report and report-to-image retrieval, accurately aligning medical images with relevant text reports is essential but challenging due to the inherent ambiguity and variability in medical data. Existing models often struggle to capture the nuanced, multi-level semantic relationships in radiology data, leading to unreliable retrieval results. To address these issues, we propose the Prototype-Enhanced Confidence Modeling (PECM) framework, which introduces multi-level prototypes for each modality to better capture semantic variability and enhance retrieval robustness. PECM employs a dual-stream confidence estimation that leverages prototype similarity distributions and an adaptive weighting mechanism to control the impact of high-uncertainty data on retrieval rankings. Applied to radiology image-report datasets, our method achieves significant improvements in retrieval precision and consistency, effectively handling data ambiguity and advancing reliability in complex clinical scenarios. We report results on multiple different datasets and tasks including fully supervised and zero-shot retrieval obtaining performance gains of up to 10.17%, establishing in new state-of-the-art.
Kaiwen Zhao, Bharathan Balaji, Stephen Lee
Product sustainability reports provide valuable insights into the environmental impacts of a product and are often distributed in PDF format. These reports often include a combination of tables and text, which complicates their analysis. The lack of standardization and the variability in reporting formats further exacerbate the difficulty of extracting and interpreting relevant information from large volumes of documents. In this paper, we tackle the challenge of answering questions related to carbon footprints within sustainability reports available in PDF format. Unlike previous approaches, our focus is on addressing the difficulties posed by the unstructured and inconsistent nature of text extracted from PDF parsing. To facilitate this analysis, we introduce CarbonPDF-QA, an open-source dataset containing question-answer pairs for 1735 product report documents, along with human-annotated answers. Our analysis shows that GPT-4o struggles to answer questions with data inconsistencies. To address this limitation, we propose CarbonPDF, an LLM-based technique specifically designed to answer carbon footprint questions on such datasets. We develop CarbonPDF by fine-tuning Llama 3 with our training data. Our results show that our technique outperforms current state-of-the-art techniques, including question-answering (QA) systems finetuned on table and text data.
Pranshu Rastogi
SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim
Retrieval is approached as a Learning-to-Rank task using a bi-encoder model
fine-tuned from a pre-trained transformer optimized for sentence similarity.
Training used both the source languages and their English translations for
multilingual retrieval and only English translations for cross-lingual
retrieval. Using lightweight models with fewer than 500M parameters and
training on Kaggle T4 GPUs, the method achieved 92% Success@10 in multilingual
and 80% Success@10 in 5th in crosslingual and 10th in multilingual tracks.
Authors' comments: 7 pages, 6 tables. Code available at
https://github.com/pranshurastogi29/SemEval-2025-ACL-Multi-and-Crosslingual-Retrieval-using-Bi-encoders
Junlong Ren, Gangjian Zhang, Honghao Fu, Pengcheng Wu, Hao Wang
Text-Motion Retrieval (TMR) aims to retrieve 3D motion sequences semantically relevant to text descriptions. However, matching 3D motions with text remains highly challenging, primarily due to the intricate structure of human body and its spatial-temporal dynamics. Existing approaches often overlook these complexities, relying on general encoding methods that fail to distinguish different body parts and their dynamics, limiting precise semantic alignment. To address this, we propose WaMo, a novel wavelet-based multi-frequency feature extraction framework. It fully captures part-specific and time-varying motion details across multiple resolutions on body joints, extracting discriminative motion features to achieve fine-grained alignment with texts. WaMo has three key components: (1) Trajectory Wavelet Decomposition decomposes motion signals into frequency components that preserve both local kinematic details and global motion semantics. (2) Trajectory Wavelet Reconstruction uses learnable inverse wavelet transforms to reconstruct original joint trajectories from extracted features, ensuring the preservation of essential spatial-temporal information. (3) Disordered Motion Sequence Prediction reorders shuffled motion sequences to improve the learning of inherent temporal coherence, enhancing motion-text alignment. Extensive experiments demonstrate WaMo's superiority, achieving 17.0\% and 18.2\% improvements in $Rsum$ on HumanML3D and KIT-ML datasets, respectively, outperforming existing state-of-the-art (SOTA) methods.
Kristian Kolthoff, Felix Kretzer, Christian Bartelt, Alexander Maedche, Simone Paolo Ponzetto
GUI prototyping is a fundamental component in the development of modern interactive systems, which are now ubiquitous across diverse application domains. GUI prototypes play a critical role in requirements elicitation by enabling stakeholders to visualize, assess, and refine system concepts collaboratively. Moreover, prototypes serve as effective tools for early testing, iterative evaluation, and validation of design ideas with both end users and development teams. Despite these advantages, the process of constructing GUI prototypes remains resource-intensive and time-consuming, frequently demanding substantial effort and expertise. Recent research has sought to alleviate this burden through NL-based GUI retrieval approaches, which typically rely on embedding-based retrieval or tailored ranking models for specific GUI repositories. However, these methods often suffer from limited retrieval performance and struggle to generalize across arbitrary GUI datasets. In this work, we present GUI-ReRank, a novel framework that integrates rapid embedding-based constrained retrieval models with highly effective MLLM-based reranking techniques. GUI-ReRank further introduces a fully customizable GUI repository annotation and embedding pipeline, enabling users to effortlessly make their own GUI repositories searchable, which allows for rapid discovery of relevant GUIs for inspiration or seamless integration into customized LLM-based RAG workflows. We evaluated our approach on an established NL-based GUI retrieval benchmark, demonstrating that GUI-ReRank significantly outperforms SOTA tailored LTR models in both retrieval accuracy and generalizability. Additionally, we conducted a comprehensive cost and efficiency analysis of employing MLLMs for reranking, providing valuable insights regarding the trade-offs between retrieval effectiveness and computational resources. Video: https://youtu.be/_7x9UCh82ug
Haoran Wang, Xiongxiao Xu, Baixiang Huang, Kai Shu
Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A \renyi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response $(\varepsilon, \delta)$-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available: https://github.com/wang2226/PAD.
Zhengxin Pan, Haishuai Wang, Fangyu Wu, Peng Zhang, Jiajun Bu
The past decade has witnessed rapid advancements in cross-modal retrieval,
with significant progress made in accurately measuring the similarity between
cross-modal pairs. However, the persistent hubness problem, a phenomenon where
a small number of targets frequently appear as nearest neighbors to numerous
queries, continues to hinder the precision of similarity measurements. Despite
several proposed methods to reduce hubness, their underlying mechanisms remain
poorly understood. To bridge this gap, we analyze the widely-adopted Inverted
Softmax approach and demonstrate its effectiveness in balancing target
probabilities during retrieval. Building on these insights, we propose a
probability-balancing framework for more effective hubness reduction. We
contend that balancing target probabilities alone is inadequate and, therefore,
extend the framework to balance both query and target probabilities by
introducing Sinkhorn Normalization (SN). Notably, we extend SN to scenarios
where the true query distribution is unknown, showing that current methods,
which rely solely on a query bank to estimate target hubness, produce
suboptimal results due to a significant distributional gap between the query
bank and targets. To mitigate this issue, we introduce Dual Bank Sinkhorn
Normalization (DBSN), incorporating a corresponding target bank alongside the
query bank to narrow this distributional gap. Our comprehensive evaluation
across various cross-modal retrieval tasks, including image-text retrieval,
video-text retrieval, and audio-text retrieval, demonstrates consistent
performance improvements, validating the effectiveness of both SN and DBSN. All
codes are publicly available at https://github.com/ppanzx/DBSN.
Authors' comments: ACMMM 2025
Ostonya Thomas, Muhaimin Bin Munir, Jean-Michel Tine, Mizanur Rahman, Yuchen Cai, Khandakar Ashrafi Akbar, Md Nahiyan Uddin, Latifur Khan et al.
Technological advancements have revolutionized numerous industries, including
transportation. While digitalization, automation, and connectivity have
enhanced safety and efficiency, they have also introduced new vulnerabilities.
With 95% of data breaches attributed to human error, promoting cybersecurity
awareness in transportation is increasingly critical. Despite numerous
cyberattacks on transportation systems worldwide, comprehensive and centralized
records of these incidents remain scarce. To address this gap and enhance cyber
awareness, this paper presents a large language model (LLM) based approach to
extract and organize transportation related cyber incidents from publicly
available datasets. A key contribution of this work is the use of generative AI
to transform unstructured, heterogeneous cyber incident data into structured
formats. Incidents were sourced from the Center for Strategic & International
Studies (CSIS) List of Significant Cyber Incidents, the University of Maryland
Cyber Events Database (UMCED), the European Repository of Cyber Incidents
(EuRepoC), the Maritime Cyber Attack Database (MCAD), and the U.S. DOT
Transportation Cybersecurity and Resiliency (TraCR) Examples of Cyber Attacks
in Transportation (2018 to 2022). These were classified by a fine tuned LLM
into five transportation modes: aviation, maritime, rail, road, and multimodal,
forming a transportation specific cyber incident database. Another key
contribution of this work is the development of a Retrieval Augmented
Generation question answering system, designed to enhance accessibility and
practical use by enabling users to query the curated database for specific
details on transportation related cyber incidents. By leveraging LLMs for both
data extraction and user interaction, this study contributes a novel,
accessible tool for improving cybersecurity awareness in the transportation
sector.
Authors' comments: This paper has been submitted to the Transportation Research Board
(TRB) for consideration for presentation at the 2026 Annual Meeting
Shengbo Gong, Xianfeng Tang, Carl Yang, Wei jin
Retrieval-augmented generation (RAG) is critical for reducing hallucinations
and incorporating external knowledge into Large Language Models (LLMs).
However, advanced RAG systems face a trade-off between performance and
efficiency. Multi-round RAG approaches achieve strong reasoning but incur
excessive LLM calls and token costs, while Graph RAG methods suffer from
computationally expensive, error-prone graph construction and retrieval
redundancy. To address these challenges, we propose T$^2$RAG, a novel framework
that operates on a simple, graph-free knowledge base of atomic triplets.
T$^2$RAG leverages an LLM to decompose questions into searchable triplets with
placeholders, which it then iteratively resolves by retrieving evidence from
the triplet database. Empirical results show that T$^2$RAG significantly
outperforms state-of-the-art multi-round and Graph RAG methods, achieving an
average performance gain of up to 11\% across six datasets while reducing
retrieval costs by up to 45\%. Our code is available at
https://github.com/rockcor/T2RAG
Authors' comments: 19 pages
Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang
Recent advances in large language models (LLMs) have significantly boosted long-context processing. However, the increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency. Most KV cache compression methods rely on heuristic token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs. This method ignores the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrades the performance of LLMs. To address the issue above, instead of using all the attention heads in GQA-based LLMs to determine important tokens as in the previous work, we first identify the attention heads in each layer that are not only capable of retrieving the initial and final tokens of a prompt, but also capable of retrieving important tokens within the text and attending to their surrounding semantic context. Afterwards, we exploit such heads to determine the important tokens and retain their corresponding KV cache pairs. Furthermore, we analyze the cache eviction error of each layer individually and introduce a layer-adaptive KV cache allocation strategy. Experimental results demonstrate the proposed CompressKV consistently outperforms state-of-the-art approaches under various memory budgets on LongBench and Needle-in-a-Haystack benchmarks. Our code is publicly available at: https://github.com/TUDa-HWAI/CompressKV.git.