Taher Yacoub, Camille Depenveiller, Atsushi Tatsuma, Tin Barisin, Eugen Rusakov, Udo Gobel, Yuxu Peng, Shiqiang Deng et al.
This SHREC 2025 track on protein surface shape retrieval involved 9
participating teams. We evaluated the retrieval performance of 15 proposed
methods on a large dataset of 11,555 protein surfaces with calculated
electrostatic potential (a key molecular surface descriptor). Performance
was measured with several metrics (Accuracy, Balanced accuracy, F1 score,
Precision and Recall). The best retrieval performance was achieved by the
methods that used the electrostatic potential as a complement to molecular
surface shape. This observation also held for classes with limited data,
which highlights the importance of taking into account additional molecular
surface descriptors.
Authors' comments: Published in Computers & Graphics, Elsevier. 59 pages, 12 figures
Hanqing Li, Kiran Sheena Jyothi, Henry Liang, Sharika Mahadevan, Diego Klabjan
We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information. This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency. Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes. Imperfect but still very high performance is observed on subgraph matching. Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes.
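The error feedback loop with a time-out described above can be sketched as follows. This is only an illustration of the general pattern, not GRRAF's implementation: the function name `answer_with_feedback`, the two callables `generate_query` (standing in for the LLM) and `run_query` (standing in for the graph database), and the time and attempt limits are all hypothetical.

```python
import time


def answer_with_feedback(generate_query, run_query, question,
                         time_limit_s=30.0, max_attempts=5):
    """Retry query generation until a query runs, feeding errors back.

    generate_query(question, feedback) -> query string (hypothetical LLM call)
    run_query(query) -> result, raising an exception on failure
                        (hypothetical graph-database call)
    """
    deadline = time.monotonic() + time_limit_s
    feedback = None  # no error message on the first attempt
    for _ in range(max_attempts):
        if time.monotonic() > deadline:
            break  # overall time budget used up
        query = generate_query(question, feedback)
        try:
            return run_query(query)
        except Exception as exc:
            # Pass the failure back so the next generation can correct it.
            feedback = f"Query {query!r} failed: {exc}"
    raise RuntimeError("no successful query within the time or attempt limit")
```

In this sketch the deadline is checked between attempts only; bounding the run time of a single query would need support from the database client.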
Zequn Xie, Chuxin Wang, Sihang Cai, Yeqiang Wang, Shulei Wang, Tao Jin
Text-based person search (TBPS) enables the retrieval of person images from
large-scale databases using natural language descriptions, offering critical
value in surveillance applications. However, a major challenge lies in the
labor-intensive process of obtaining high-quality textual annotations, which
limits scalability and practical deployment. To address this, we introduce two
complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text
Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues
with MLLMs, producing fine-grained and diverse visual descriptions without
manual supervision. MTI refines user queries at inference time through dynamic,
dialogue-based reasoning, enabling the system to interpret and resolve vague,
incomplete, or ambiguous descriptions - characteristics often seen in
real-world search scenarios. Together, MTG and MTI form a unified and
annotation-free framework that significantly improves retrieval accuracy,
robustness, and usability. Extensive evaluations demonstrate that our method
achieves competitive or superior results while eliminating the need for manual
captions, paving the way for scalable and practical deployment of TBPS systems.
Authors' comments: Accepted by EMNLP 2025. 13 pages, 3 figures
Wonbin Kweon, SeongKu Kang, Runchu Tian, Pengcheng Jiang, Jiawei Han, Hwanjo Yu
The effectiveness of in-context learning relies heavily on selecting
demonstrations that provide all the necessary information for a given test
input. To achieve this, it is crucial to identify and cover fine-grained
knowledge requirements. However, prior methods often retrieve demonstrations
based solely on embedding similarity or generation probability, resulting in
irrelevant or redundant examples. In this paper, we propose TopicK, a topic
coverage-based retrieval framework that selects demonstrations to
comprehensively cover topic-level knowledge relevant to both the test input and
the model. Specifically, TopicK estimates the topics required by the input and
assesses the model's knowledge on those topics. TopicK then iteratively selects
demonstrations that introduce previously uncovered required topics, in which
the model exhibits low topical knowledge. We validate the effectiveness of
TopicK through extensive experiments across various datasets and both open- and
closed-source LLMs. Our source code is available at
https://github.com/WonbinKweon/TopicK_EMNLP2025.
Authors' comments: EMNLP 2025 Main
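The iterative selection step described in this abstract can be read as a greedy coverage loop. The sketch below is an illustration under stated assumptions, not the TopicK code: the function name `select_demonstrations`, the dictionary inputs, and the gain formula (required weight times one minus model knowledge, summed over newly covered topics) are all hypothetical.

```python
def select_demonstrations(required, knowledge, candidates, k):
    """Greedily pick up to k demonstrations that cover needed topics.

    required:   dict topic -> how strongly the test input needs it (0..1)
    knowledge:  dict topic -> estimated model knowledge of the topic (0..1)
    candidates: dict demo_id -> set of topics the demonstration covers
    """
    covered = set()
    selected = []
    remaining = dict(candidates)
    for _ in range(k):
        best_id, best_gain = None, 0.0
        for demo_id, topics in remaining.items():
            # Only topics not yet covered contribute; topics the model
            # already knows well contribute little.
            gain = sum(
                required.get(t, 0.0) * (1.0 - knowledge.get(t, 0.0))
                for t in topics - covered
            )
            if gain > best_gain:
                best_id, best_gain = demo_id, gain
        if best_id is None:
            break  # no remaining candidate adds anything useful
        selected.append(best_id)
        covered |= remaining.pop(best_id)
    return selected
```

Because already covered topics contribute nothing to the gain, a second demonstration on the same topic is never preferred over one that introduces a new required topic, which is the redundancy the abstract argues against.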
Ilya Tyagin, Saeideh Valipour, Aliaksandra Sikirzhytskaya, Michael Shtutman, Ilya Safro
We introduce an explainability method for biomedical hypothesis generation
systems, built on top of the novel Hypothesis Generation Context Retriever
framework. Our approach combines semantic graph-based retrieval and relevant
data-restrictive training to simulate real-world discovery constraints.
Integrated with large language models (LLMs) via retrieval-augmented
generation, the system explains hypotheses with contextual evidence using
published scientific literature. We also propose a novel feedback loop
approach, which iteratively identifies and corrects flawed parts of
LLM-generated explanations, refining both the evidence paths and supporting
context. We demonstrate the performance of our method with multiple large
language models and evaluate the explanation and context retrieval quality
through both expert-curated assessment and large-scale automated analysis. Our
code is available at: https://github.com/IlyaTyagin/HGCR.
Authors' comments: 30 pages, 10 figures
Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun
Retrieval-Augmented Generation (RAG) enhances the response capabilities of
language models by integrating external knowledge sources. However, document
chunking, an important part of RAG systems, often lacks effective evaluation
tools. This paper first analyzes why existing RAG evaluation benchmarks are
inadequate for assessing document chunking quality, specifically due to
evidence sparsity. Based on this conclusion, we propose HiCBench, which
includes manually annotated multi-level document chunking points, synthesized
evidence-dense question-answer (QA) pairs, and their corresponding evidence
sources. Additionally, we introduce the HiChunk framework, a multi-level
document structuring framework based on fine-tuned LLMs, combined with the
Auto-Merge retrieval algorithm to improve retrieval quality. Experiments
demonstrate that HiCBench effectively evaluates the impact of different
chunking methods across the entire RAG pipeline. Moreover, HiChunk achieves
better chunking quality within reasonable time consumption, thereby enhancing
the overall performance of RAG systems.
Authors' comments: 17 pages, 5 figures, 6 tables
Mustapha Adamu, Qi Zhang, Huitong Pan, Longin Jan Latecki, Eduard C. Dragut
The growing complexity and volume of climate science literature make it
increasingly difficult for researchers to find relevant information across
models, datasets, regions, and variables. This paper introduces a
domain-specific Knowledge Graph (KG) built from climate publications and
broader scientific texts, aimed at improving how climate knowledge is accessed
and used. Unlike keyword-based search, our KG supports structured, semantic
queries that help researchers discover precise connections such as which models
have been validated in specific regions or which datasets are commonly used
with certain teleconnection patterns. We demonstrate how the KG answers such
questions using Cypher queries, and outline its integration with large language
models in RAG systems to improve transparency and reliability in
climate-related question answering. This work moves beyond KG construction to
show its real-world value for climate researchers, model developers, and others
who rely on accurate, contextual scientific information.
Authors' comments: ACM SIGIR 2025 Workshop MANILA
Rodrigo Braz Teixeira, Izaak Neri, Pablo Sartori
Liquid mixtures can separate into phases with distinct composition. This phenomenon has recently come back to prominence due to its role in complex biological liquids, such as the cytoplasm, which contain thousands of components. For simple two-component mixtures phase-separated states are global free energy minima. However, local free energy minima, i.e. metastable states, are known to play a dominant role in complex systems with many components. For example, Hopfield neural networks can retrieve information from partial cues via relaxation to metastable states. Under what conditions can phase separated states be metastable, and what are the implications for information processing in multicomponent liquids? In this work we develop the general thermodynamic formalism of metastable phase separation. We then apply this formalism to an illustrative toy example inspired by recent experiments, binary mixtures with high-order interactions. Finally, as core application of the formalism, we study metastability in Hopfield liquids, a class of multicomponent mixtures capable of storing information on the composition of phases. We show that these phases can be retrieved from partial cues via metastable phase separation. Spatial simulations of liquids with a large number of components match our analytical solution. Our work suggests that complex biological mixtures can perform information retrieval through metastable phase separation.
Authors' comments: 26 pages, 8 figures, 16 pages of supplement
Iman Barati, Mostafa Amiri, Heshaam Faili
Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high-quality instruction datasets for SFT. Our approach begins with a limited set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Subsequently, domain-relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: https://github.com/mostafaamiri/SearchInstruct.
Guohang Yan, Yue Zhang, Pinlong Cai, Ding Wang, Song Mao, Hongwei Zhang, Yaoze Zhang, Hairong Zhang et al.
Retrieval-augmented generation (RAG) has become a dominant paradigm for mitigating knowledge hallucination and staleness in large language models (LLMs) while preserving data security. By retrieving relevant evidence from private, domain-specific corpora and injecting it into carefully engineered prompts, RAG delivers trustworthy responses without the prohibitive cost of fine-tuning. Traditional RAG systems are text-only and often rely on a single storage backend, most commonly a vector database. In practice, this monolithic design suffers from unavoidable trade-offs: vector search captures semantic similarity yet loses global context; knowledge graphs excel at relational precision but struggle with recall; full-text indexes are fast and exact yet semantically blind; and relational engines such as MySQL provide strong transactional guarantees but no semantic understanding. We argue that these heterogeneous retrieval paradigms are complementary, and propose a principled fusion scheme to orchestrate them synergistically, mitigating the weaknesses of any single modality. In this work we introduce HetaRAG, a hybrid, deep-retrieval augmented generation framework that orchestrates cross-modal evidence from heterogeneous data stores. We plan to design a system that unifies vector indices, knowledge graphs, full-text engines, and structured databases into a single retrieval plane, dynamically routing and fusing evidence to maximize recall, precision, and contextual fidelity. To achieve this design goal, we carried out preliminary explorations and constructed an initial RAG pipeline; this technical report provides a brief overview. The partial code is available at https://github.com/KnowledgeXLab/HetaRAG.
Authors' comments: 15 pages, 4 figures
Zakaria El Kassimi, Fares Fourati, Mohamed-Slim Alouini
We study question answering in the domain of radio regulations, a legally sensitive and high-stakes area. We propose a telecom-specific Retrieval-Augmented Generation (RAG) pipeline and introduce, to our knowledge, the first multiple-choice evaluation set for this domain, constructed from authoritative sources using automated filtering and human validation. To assess retrieval quality, we define a domain-specific retrieval metric, under which our retriever achieves approximately 97% accuracy. Beyond retrieval, our approach consistently improves generation accuracy across all tested models. In particular, while naively inserting documents without structured retrieval yields only marginal gains for GPT-4o (less than 1%), applying our pipeline results in nearly a 12% relative improvement. These findings demonstrate that carefully targeted grounding provides a simple yet strong baseline and an effective domain-specific solution for regulatory question answering. All code and evaluation scripts, along with our derived question-answer dataset, are available at https://github.com/Zakaria010/Radio-RAG.
Hassan Gharoun, Mohammad Sadegh Khorshidi, Kasra Ranjbarigderi, Fang Chen, Amir H. Gandomi
This work proposes an evidence-retrieval mechanism for uncertainty-aware
decision-making that replaces a single global cutoff with an
evidence-conditioned, instance-adaptive criterion. For each test instance,
proximal exemplars are retrieved in an embedding space; their predictive
distributions are fused via Dempster-Shafer theory. The resulting fused belief
acts as a per-instance thresholding mechanism. Because the supporting evidence
is explicit, decisions are transparent and auditable. Experiments on
CIFAR-10/100 with BiT and ViT backbones show higher or comparable
uncertainty-aware performance with materially fewer confidently incorrect
outcomes and a sustainable review load compared with applying a threshold on
prediction entropy. Notably, only a few pieces of evidence are sufficient to realize
these gains; increasing the evidence set yields only modest changes. These
results indicate that evidence-conditioned tagging provides a more reliable and
interpretable alternative to fixed prediction entropy thresholds for
operational uncertainty-aware decision-making.
Authors' comments: 15 pages, 4 figures, 3 tables
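For readers unfamiliar with the fusion step named in this abstract, Dempster's rule of combination can be sketched in a few lines. This shows only the standard rule; how the paper converts predictive distributions into mass functions, and how the fused belief is turned into a per-instance criterion, is not stated in the abstract and is not modelled here. The function name `dempster_combine` and the toy class labels are illustrative assumptions.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of class labels (a focal element)
    to a mass; the masses of each function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    norm = 1.0 - conflict
    return {s: w / norm for s, w in combined.items()}


# Toy example: two retrieved exemplars voting over the frame {cat, dog}.
CAT, DOG = frozenset({"cat"}), frozenset({"dog"})
FRAME = CAT | DOG  # mass on the whole frame expresses "don't know"
fused = dempster_combine(
    {CAT: 0.6, FRAME: 0.4},
    {CAT: 0.5, DOG: 0.3, FRAME: 0.2},
)
```

More than two sources can be fused by applying the function repeatedly, since the rule is associative and commutative.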
Haichao Zhang, Chong Zhang, Peiyu Hu, Shi Qiu, Jia Wang
Modern recommender systems face a critical challenge in complying with
privacy regulations like the 'right to be forgotten': removing a user's data
without disrupting recommendations for others. Traditional unlearning methods
address this by partial model updates, but introduce propagation bias--where
unlearning one user's data distorts recommendations for behaviorally similar
users, degrading system accuracy. While retraining eliminates bias, it is
computationally prohibitive for large-scale systems. To address this challenge,
we propose CRAGRU, a novel framework leveraging Retrieval-Augmented Generation
(RAG) for efficient, user-specific unlearning that mitigates bias while
preserving recommendation quality. CRAGRU decouples unlearning into distinct
retrieval and generation stages. In retrieval, we employ three tailored
strategies designed to precisely isolate the target user's data influence,
minimizing collateral impact on unrelated users and enhancing unlearning
efficiency. Subsequently, the generation stage utilizes an LLM, augmented with
user profiles integrated into prompts, to reconstruct accurate and personalized
recommendations without needing to retrain the entire base model. Experiments
on three public datasets demonstrate that CRAGRU effectively unlearns targeted
user data, significantly mitigating unlearning bias by preventing adverse
impacts on non-target users, while maintaining recommendation performance
comparable to fully trained original models. Our work highlights the promise of
RAG-based architectures for building robust and privacy-preserving recommender
systems. The source code is available at:
https://github.com/zhanghaichao520/LLM_rec_unlearning.
Authors' comments: 10 pages, 4 figures. Accepted ICDM 2025 (IEEE International
Conference on Data Mining)
Minjong Yoo, Jinwoo Jang, Wei-jin Park, Honguk Woo
This study presents an Exploratory Retrieval-Augmented Planning (ExRAP)
framework, designed to tackle continual instruction following tasks of embodied
agents in dynamic, non-stationary environments. The framework enhances Large
Language Models' (LLMs) embodied reasoning capabilities by efficiently
exploring the physical environment and establishing the environmental context
memory, thereby effectively grounding the task planning process in time-varying
environment contexts. In ExRAP, given multiple continual instruction following
tasks, each instruction is decomposed into queries on the environmental context
memory and task executions conditioned on the query results. To efficiently
handle these multiple tasks that are performed continuously and simultaneously,
we implement an exploration-integrated task planning scheme by incorporating
information-based exploration into the LLM-based planning process.
Combined with memory-augmented query evaluation, this integrated scheme not
only allows for a better balance between the validity of the environmental
context memory and the load of environment exploration, but also improves
overall task performance. Furthermore, we devise a temporal consistency
refinement scheme for query evaluation to address the inherent decay of
knowledge in the memory. Through experiments with VirtualHome, ALFRED, and
CARLA, our approach demonstrates robustness against a variety of embodied
instruction following scenarios involving different instruction scales and
types, and non-stationarity degrees, and it consistently outperforms other
state-of-the-art LLM-based task planning approaches in terms of both goal
success rate and execution efficiency.
Authors' comments: 21 pages. NeurIPS 2024
Kai Ye, Liangcai Su, Chenxiong Qian
Code generation has emerged as a pivotal capability of Large Language
Models (LLMs), revolutionizing development efficiency for programmers of all
skill levels. However, the complexity of data structures and algorithmic logic
often results in functional deficiencies and security vulnerabilities in
generated code, reducing it to a prototype requiring extensive manual
debugging. While Retrieval-Augmented Generation (RAG) can enhance correctness
and security by leveraging external code manuals, it simultaneously introduces
new attack surfaces.
In this paper, we pioneer the exploration of attack surfaces in
Retrieval-Augmented Code Generation (RACG), focusing on malicious dependency
hijacking. We demonstrate how poisoned documentation containing hidden
malicious dependencies (e.g., matplotlib_safe) can subvert RACG, exploiting
dual trust chains: LLM reliance on RAG and developers' blind trust in LLM
suggestions. To construct poisoned documents, we propose ImportSnare, a novel
attack framework employing two synergistic strategies: 1) Position-aware beam
search optimizes hidden ranking sequences to elevate poisoned documents in
retrieval results, and 2) Multilingual inductive suggestions generate
jailbreaking sequences to manipulate LLMs into recommending malicious
dependencies. Through extensive experiments across Python, Rust, and
JavaScript, ImportSnare achieves significant attack success rates (over 50% for
popular libraries such as matplotlib and seaborn) in general, and is also able
to succeed even when the poisoning ratio is as low as 0.01%, targeting both
custom and real-world malicious packages. Our findings reveal critical supply
chain risks in LLM-powered development, highlighting inadequate security
alignment for code generation tasks. To support future research, we will
release the multilingual benchmark suite and datasets. The project homepage is
https://importsnare.github.io.
Authors' comments: This paper has been accepted by the ACM Conference on Computer and
Communications Security (CCS) 2025
Przemysław Stokłosa, Janusz A. Starzyk, Paweł Raif, Adrian Horzyk, Marcin Kowalik
The paper addresses challenges in storing and retrieving sequences in contexts like anomaly detection, behavior prediction, and genetic information analysis. Associative Knowledge Graphs (AKGs) offer a promising approach by leveraging sparse graph structures to encode sequences. The objective was to develop a method for sequence storage and retrieval using AKGs that maintain high memory capacity and context-based retrieval accuracy while introducing algorithms for efficient element ordering. The study utilized Sequential Structural Associative Knowledge Graphs (SSAKGs). These graphs encode sequences as transitive tournaments with nodes representing objects and edges defining the order. Four ordering algorithms were developed and tested: Simple Sort, Node Ordering, Enhanced Node Ordering, and Weighted Edges Node Ordering. The evaluation was conducted on synthetic datasets consisting of random sequences of varying lengths and distributions, and real-world datasets, including sentence-based sequences from the NLTK library and miRNA sequences mapped symbolically with a window-based approach. Metrics such as precision, sensitivity, and specificity were employed to assess performance. SSAKGs exhibited quadratic growth in memory capacity relative to graph size. This study introduces a novel structural approach for sequence storage and retrieval. Key advantages include no training requirements, flexible context-based reconstruction, and high efficiency in sparse memory graphs. With broad applications in computational neuroscience and bioinformatics, the approach offers scalable solutions for sequence-based memory tasks.
Authors' comments: 13 pages, 6 figures
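The encoding of a sequence as a transitive tournament, mentioned in this abstract, can be illustrated for a single stored sequence. The sketch below is an assumption-laden toy, not one of the paper's four ordering algorithms: the function names are hypothetical, items are assumed distinct, and the order is recovered by sorting on out-degree, which works because in a transitive tournament on n nodes the out-degrees are exactly n-1, n-2, ..., 0.

```python
from itertools import combinations


def encode_sequence(seq):
    """Return the edges of the transitive tournament for a sequence.

    Every earlier item points to every later item, so a sequence of
    n distinct items yields n * (n - 1) / 2 directed edges.
    """
    return {(u, v) for u, v in combinations(seq, 2)}


def recover_order(nodes, edges):
    """Recover the sequence order of `nodes` from tournament edges."""
    out_degree = {n: 0 for n in nodes}
    for u, v in edges:
        # Ignore edges that touch nodes outside the requested context.
        if u in out_degree and v in out_degree:
            out_degree[u] += 1
    # The first item beats all others, the last item beats none.
    return sorted(nodes, key=lambda n: out_degree[n], reverse=True)


sequence = ["G", "A", "T", "C"]
edges = encode_sequence(sequence)
restored = recover_order({"A", "C", "G", "T"}, edges)
```

When many sequences share one graph, edges from different sequences overlap and out-degree alone can become ambiguous, which is presumably where the more elaborate ordering algorithms named in the abstract come in.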
Wooseong Yang, Weizhi Zhang, Yuqing Liu, Yuwei Han, Yu Wang, Junhyun Lee, Philip S. Yu
Cold-start items remain a persistent challenge in recommender systems due to their lack of historical user interactions, which collaborative models rely on. While recent zero-shot methods leverage large language models (LLMs) to address this, they often struggle with sparse metadata and hallucinated or incomplete knowledge. We propose ColdRAG, a retrieval-augmented generation approach that builds a domain-specific knowledge graph dynamically to enhance LLM-based recommendation in cold-start scenarios, without requiring task-specific fine-tuning. ColdRAG begins by converting structured item attributes into rich natural-language profiles, from which it extracts entities and relationships to construct a unified knowledge graph capturing item semantics. Given a user's interaction history, it scores edges in the graph using an LLM, retrieves candidate items with supporting evidence, and prompts the LLM to rank them. By enabling multi-hop reasoning over this graph, ColdRAG grounds recommendations in verifiable evidence, reducing hallucinations and strengthening semantic connections. Experiments on three public benchmarks demonstrate that ColdRAG surpasses existing zero-shot baselines in both Recall and NDCG. This framework offers a practical solution to cold-start recommendation by combining knowledge-graph reasoning with retrieval-augmented LLM generation.
Authors' comments: 10 pages
Özgür Uğur, Musa Yılmaz, Esra Şavirdi, Özay Ezerceli, Mahmut El Huseyni, Selva Taş, Reyhan Bayraktar
The integration of Large Language Models (LLMs) into various applications has driven the need for structured and reliable responses. A key challenge in Retrieval-Augmented Generation (RAG) systems is ensuring that outputs align with expected formats while minimizing hallucinations. This study examines the role of guided decoding in RAG systems, comparing three methods, Outlines, XGrammar, and LM Format Enforcer, across different multi-turn prompting setups (0-turn, 1-turn, and 2-turn). By evaluating success rates, hallucination rates, and output quality, we provide insights into their performance and applicability. Our findings reveal how multi-turn interactions influence guided decoding, uncovering unexpected performance variations that can inform method selection for specific use cases. This work advances the understanding of structured output generation in RAG systems, offering both theoretical insights and practical guidance for LLM deployment.
Haike Xu, Tong Chen
The widely used retrieve-and-rerank pipeline faces two critical limitations: it is constrained by the initial retrieval quality of the top-k documents, and the growing computational demands of LLM-based rerankers restrict the number of documents that can be effectively processed. We introduce Reranker-Guided-Search (RGS), a novel approach that bypasses these limitations by directly retrieving documents according to reranker preferences rather than following the traditional sequential reranking method. Our method uses a greedy search on proximity graphs generated by approximate nearest neighbor algorithms, strategically prioritizing promising documents for reranking based on document similarity. Experimental results demonstrate substantial performance improvements across multiple benchmarks: 3.5 points on BRIGHT, 2.9 on FollowIR, and 5.1 on M-BEIR, all within a constrained reranker budget of 100 documents. Our analysis suggests that, given a fixed pair of embedding and reranker models, strategically selecting documents to rerank can significantly improve retrieval accuracy under a limited reranker budget.
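A greedy, budget-limited search over a proximity graph, in the spirit of the approach described above, might look like the following. This is a sketch under assumptions, not the RGS algorithm itself: the function name, the adjacency-list graph, the seed set, and the expansion policy (always expand the best-scored document seen so far) are hypothetical, and `rerank_score` stands in for an expensive reranker call.

```python
import heapq


def reranker_guided_search(graph, seeds, rerank_score, budget, top_k):
    """Greedy best-first search that spends at most `budget` reranker calls.

    graph:        dict doc_id -> list of neighbouring doc_ids
    seeds:        starting doc_ids (e.g. from first-stage retrieval)
    rerank_score: callable doc_id -> float, higher is better
    """
    scores = {}    # doc_id -> reranker score; one entry per reranker call
    frontier = []  # max-heap of scored, unexpanded docs (scores negated)

    def score(doc):
        if doc in scores or len(scores) >= budget:
            return  # already scored, or budget exhausted
        scores[doc] = rerank_score(doc)
        heapq.heappush(frontier, (-scores[doc], doc))

    for doc in seeds:
        score(doc)
    while frontier and len(scores) < budget:
        # Expand the most promising document and score its neighbours.
        _, doc = heapq.heappop(frontier)
        for neighbour in graph.get(doc, []):
            score(neighbour)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```

The point of the design is that the budget is spent on neighbours of documents the reranker already likes, so a relevant document missed by the first-stage top-k can still be reached through the graph.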
Xixi Wu, Yanchao Tan, Nan Hou, Ruiyang Zhang, Hong Cheng
Document Understanding is a foundational AI capability with broad
applications, and Document Question Answering (DocQA) is a key evaluation task.
Traditional methods convert the document into text for processing by Large
Language Models (LLMs), but this process strips away critical multi-modal
information like figures. While Large Vision-Language Models (LVLMs) address
this limitation, their constrained input size makes multi-page document
comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate
this by selecting relevant pages, but they rely solely on semantic relevance,
ignoring logical connections between pages and the query, which are essential
for reasoning.
To this end, we propose MoLoRAG, a logic-aware retrieval framework for
multi-modal, multi-page document understanding. By constructing a page graph
that captures contextual relationships between pages, a lightweight VLM
performs graph traversal to retrieve relevant pages, including those with
logical connections often overlooked. This approach combines semantic and
logical relevance to deliver more accurate retrieval. After retrieval, the
top-K pages are fed into arbitrary LVLMs for question answering. To enhance
flexibility, MoLoRAG offers two variants: a training-free solution for easy
deployment and a fine-tuned version to improve logical relevance checking.
Experiments on four DocQA datasets demonstrate average improvements of 9.68% in
accuracy over LVLM direct inference and 7.44% in retrieval precision over
baselines. Codes and datasets are released at
https://github.com/WxxShirley/MoLoRAG.
Authors' comments: EMNLP Main 2025