Junnan Dong, Siyu An, Yifei Yu, Qian-Wen Zhang, Linhao Luo, Xiao Huang, Yunsheng Wu, Di Yin et al.
Graph retrieval-augmented generation (GraphRAG) has effectively enhanced
large language models in complex reasoning by organizing fragmented knowledge
into explicitly structured graphs. Prior efforts have been made to improve
either graph construction or graph retrieval in isolation, yielding suboptimal
performance, especially when domain shifts occur. In this paper, we propose a
vertically unified agentic paradigm, Youtu-GraphRAG, to jointly connect the
entire framework as an intricate integration. Specifically, (i) a seed graph
schema is introduced to bound the automatic extraction agent with targeted
entity types, relations and attribute types, also continuously expanded for
scalability over unseen domains; (ii) To obtain higher-level knowledge upon the
schema, we develop novel dually-perceived community detection, fusing
structural topology with subgraph semantics for comprehensive knowledge
organization. This naturally yields a hierarchical knowledge tree that supports
both top-down filtering and bottom-up reasoning with community summaries; (iii)
An agentic retriever is designed to interpret the same graph schema to
transform complex queries into tractable and parallel sub-queries. It
iteratively performs reflection for more advanced reasoning; (iv) To alleviate
the knowledge leakage problem in pre-trained LLMs, we propose a tailored
anonymous dataset and a novel 'Anonymity Reversion' task that deeply measures
the real performance of the GraphRAG frameworks. Extensive experiments across
six challenging benchmarks demonstrate the robustness of Youtu-GraphRAG,
remarkably moving the Pareto frontier with up to 90.71% saving of token costs
and 16.62% higher accuracy over state-of-the-art baselines. The results
indicate our adaptability, allowing seamless domain transfer with minimal
intervention on schema.
Authors' comments: 19 pages, 7 figures, 6 tables
Yixuan Tang, Yuanyuan Shi, Yiqun Sun, Anthony Kum Hoe Tung
Access to diverse perspectives is essential for understanding real-world
events, yet most news retrieval systems prioritize textual relevance, leading
to redundant results and limited viewpoint exposure. We propose NEWSCOPE, a
two-stage framework for diverse news retrieval that enhances event coverage by
explicitly modeling semantic variation at the sentence level. The first stage
retrieves topically relevant content using dense retrieval, while the second
stage applies sentence-level clustering and diversity-aware re-ranking to
surface complementary information. To evaluate retrieval diversity, we
introduce three interpretable metrics, namely Average Pairwise Distance,
Positive Cluster Coverage, and Information Density Ratio, and construct two
paragraph-level benchmarks: LocalNews and DSGlobal. Experiments show that
NEWSCOPE consistently outperforms strong baselines, achieving significantly
higher diversity without compromising relevance. Our results demonstrate the
effectiveness of fine-grained, interpretable modeling in mitigating redundancy
and promoting comprehensive event understanding. The data and code are
available at https://github.com/tangyixuan/NEWSCOPE.
Authors' comments: Accepted by EMNLP 2025
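
As an illustration of the first diversity metric named above, here is a minimal sketch of an Average Pairwise Distance computation. The abstract does not define the metric, so this assumes it is the mean cosine distance over all unordered pairs of retrieved-item embeddings. The function names and the choice of cosine distance are assumptions for illustration, not the authors' implementation.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero vector has no direction")
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(embeddings):
    """Mean cosine distance over all unordered pairs (assumed definition)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        # Fewer than two items: no pair to compare, so report zero diversity.
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

Under this assumed definition, a result list of near-duplicates scores close to 0, while mutually orthogonal embeddings score 1.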
Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji
Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV caches during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient due to the orthogonal distribution of queries and keys within two narrow cones in LLMs. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also developed a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16GB memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the length of the hash code at least 5$\times$ compared to traditional linear hashing. Finally, we exploit the computational advantages of bitwise operations by implementing specialized CUDA kernels, achieving hashing retrieval for 512K tokens in under 100$\mu$s on a single A100 GPU, with end-to-end throughput up to 3$\times$ higher than vanilla decoding.
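
To make the hashing-based selection concrete, the following is a minimal sketch of the random linear (sign) hashing baseline that the abstract contrasts against, with keys ranked by Hamming distance using XOR and a bit count. It is not Spotlight Attention's non-linear hashing or its CUDA kernels. All function names, the 16-bit code length in the usage note, and the Gaussian hyperplanes are assumptions for illustration.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian hyperplane per hash bit (assumed setup)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(vector, hyperplanes):
    """Pack the sign bits of the random projections into one Python int."""
    code = 0
    for plane in hyperplanes:
        projection = sum(p * x for p, x in zip(plane, vector))
        code = (code << 1) | (1 if projection >= 0.0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits, computed with XOR and a bit count."""
    return bin(a ^ b).count("1")


def top_k_keys(query, keys, hyperplanes, k):
    """Return indices of the k keys whose codes are closest to the query code."""
    q_code = hash_code(query, hyperplanes)
    key_codes = [hash_code(key, hyperplanes) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: hamming(q_code, key_codes[i]))
    return ranked[:k]
```

For example, with `planes = make_hyperplanes(16, 3)`, calling `top_k_keys(query, keys, planes, k)` selects the cache entries to attend to. In a real system the key codes would be computed once and cached rather than rehashed per query.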
Yang Sun, Lixin Zou, Dan Luo, Zhiyong Xie, Long Zhang, Liming Dong, Yunwei Zhao, Xixun Lin et al.
Retrieval-augmented generation (RAG) incorporates external knowledge into large language models (LLMs), improving their adaptability to downstream tasks and enabling information updates. Surprisingly, recent empirical evidence demonstrates that injecting noise into retrieved relevant documents paradoxically facilitates exploitation of external knowledge and improves generation quality. Although counterintuitive and challenging to apply in practice, this phenomenon enables granular control and rigorous analysis of how LLMs integrate external knowledge. Therefore, in this paper, we intervene on noise injection and establish a layer-specific functional demarcation within the LLM: shallow layers specialize in local context modeling, intermediate layers focus on integrating long-range external factual knowledge, and deeper layers primarily rely on parametric internal knowledge. Building on this insight, we propose Layer Fused Decoding (LFD), a simple decoding strategy that directly combines representations from an intermediate layer with final-layer decoding outputs to fully exploit the external factual knowledge. To identify the optimal intermediate layer, we introduce an internal knowledge score (IKS) criterion that selects the layer with the lowest IKS value in the latter half of layers. Experimental results across multiple benchmarks demonstrate that LFD helps RAG systems more effectively surface retrieved context knowledge with minimal cost.
Francesco Ghiandoni, Massimo Giulietti, Enrico Mezzano, Marco Timpanella
Private information retrieval (PIR) addresses the problem of retrieving a desired message from distributed databases without revealing which message is being requested. Recent works have shown that cross-subspace alignment (CSA) codes constructed from algebraic geometry (AG) codes on high-genus curves can improve PIR rates over classical constructions. In this paper, we propose a new PIR scheme based on AG codes from the Hermitian curve, a well-known example of an $F_\ell$-maximal curve, that is, a curve defined over the finite field with $\ell$ elements which attains the Hasse-Weil upper bound on the number of its $F_\ell$-rational points. The large number of rational points enables longer code constructions, leading to higher retrieval rates than schemes based on genus 0, genus 1, and hyperelliptic curves of arbitrary genus. Our results highlight the potential of maximal curves as a natural source of efficient PIR constructions.
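
For reference, the maximality condition mentioned above can be written out. The symbols $g$, $q$, and $\mathcal{H}$ are notation introduced here, and taking $\ell = q^2$ with the affine model below is the usual textbook convention for the Hermitian curve, not something stated in the abstract.

```latex
% Hasse-Weil upper bound for a curve X of genus g over F_\ell
\#X(\mathbb{F}_\ell) \le \ell + 1 + 2g\sqrt{\ell}

% Hermitian curve over F_{q^2} (so \ell = q^2) and its genus
\mathcal{H}:\; y^{q} + y = x^{q+1}, \qquad g(\mathcal{H}) = \frac{q(q-1)}{2}

% The bound is attained, so H is F_{q^2}-maximal
\#\mathcal{H}(\mathbb{F}_{q^2}) = q^2 + 1 + 2\cdot\frac{q(q-1)}{2}\cdot q = q^3 + 1
```

For instance, $q = 2$ gives a genus-1 curve with $9$ rational points over $\mathbb{F}_4$.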
Peiran Zhou, Junnan Zhu, Yichen Shen, Ruoxi Yu
Large Language Models (LLMs) excel in language tasks but are prone to hallucinations and outdated knowledge. Retrieval-Augmented Generation (RAG) mitigates these by grounding LLMs in external knowledge. However, in complex domains involving multiple, lengthy, or conflicting documents, traditional RAG suffers from information overload and inefficient synthesis, leading to inaccurate and untrustworthy answers. To address this, we propose CASC (Context-Adaptive Synthesis and Compression), a novel framework that intelligently processes retrieved contexts. CASC introduces a Context Analyzer & Synthesizer (CAS) module, powered by a fine-tuned smaller LLM, which performs key information extraction, cross-document consistency checking and conflict resolution, and question-oriented structured synthesis. This process transforms raw, scattered information into a highly condensed, structured, and semantically rich context, significantly reducing the token count and cognitive load for the final Reader LLM. We evaluate CASC on SciDocs-QA, a new challenging multi-document question answering dataset designed for complex scientific domains with inherent redundancies and conflicts. Our extensive experiments demonstrate that CASC consistently outperforms strong baselines.
Mathew Henrickson
This research presents a Retrieval-Augmented Generation (RAG) framework for art provenance studies, focusing on the Getty Provenance Index. Provenance research establishes the ownership history of artworks, which is essential for verifying authenticity, supporting restitution and legal claims, and understanding the cultural and historical context of art objects. The process is complicated by fragmented, multilingual archival data that hinders efficient retrieval. Current search portals require precise metadata, limiting exploratory searches. Our method enables natural-language and multilingual searches through semantic retrieval and contextual summarization, reducing dependence on metadata structures. We assess RAG's capability to retrieve and summarize auction records using a 10,000-record sample from the Getty Provenance Index - German Sales. The results show this approach provides a scalable solution for navigating art market archives, offering a practical tool for historians and cultural heritage professionals conducting historically sensitive research.
Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task
that involves retrieving videos based on queries relevant to only specific
segments. While existing works follow the paradigm of developing models to
process unimodal features, powerful pretrained vision-language models like CLIP
remain underexplored in this field. To bridge this gap, we propose ProPy, a
model with systematic architectural adaption of CLIP specifically designed for
PRVR. Drawing insights from the semantic relevance of multi-granularity events,
ProPy introduces two key innovations: (1) A Prompt Pyramid structure that
organizes event prompts to capture semantics at multiple granularity levels,
and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that
enables dynamic semantic interaction among events. With these designs, ProPy
achieves SOTA performance on three public datasets, outperforming previous
models by significant margins. Code is available at
https://github.com/BUAAPY/ProPy.
Authors' comments: Accepted by EMNLP 2025 Findings
Karanbir Singh, Deepak Muppiri, William Ngu
Large Language Models (LLMs) have transformed the field of artificial
intelligence by unlocking the era of generative applications. Built on top of
generative AI capabilities, Agentic AI represents a major shift toward
autonomous, goal-driven systems that can reason, retrieve, and act. However,
they also inherit the bias present in both internal and external information
sources. This significantly affects the fairness and balance of retrieved
information, and hence reduces user trust. To address this critical challenge,
we introduce a novel Bias Mitigation Agent, a multi-agent system designed to
orchestrate the workflow of bias mitigation through specialized agents that
optimize the selection of sources to ensure that the retrieved content is both
highly relevant and minimally biased to promote fair and balanced knowledge
dissemination. The experimental results demonstrate an 81.82% reduction in
bias compared to a baseline naive retrieval strategy.
Authors' comments: Accepted at KDD'2025 Agent4IR workshop
Yue Jiang, Chenxi Liu, Yile Chen, Qin Chao, Shuai Liu, Gao Cong
Urban forecasting models often face a severe data imbalance problem: only a few cities have dense, long-span records, while many others expose short or incomplete histories. Direct transfer from data-rich to data-scarce cities is unreliable because only a limited subset of source patterns truly benefits the target domain, whereas indiscriminate transfer risks introducing noise and negative transfer. We present STRATA-TS (Selective TRAnsfer via TArget-aware retrieval for Time Series), a framework that combines domain-adapted retrieval with reasoning-capable large models to improve forecasting in scarce data regimes. STRATA-TS employs a patch-based temporal encoder to identify source subsequences that are semantically and dynamically aligned with the target query. These retrieved exemplars are then injected into a retrieval-guided reasoning stage, where an LLM performs structured inference over target inputs and retrieved support. To enable efficient deployment, we distill the reasoning process into a compact open model via supervised fine-tuning. Extensive experiments on three parking availability datasets across Singapore, Nottingham, and Glasgow demonstrate that STRATA-TS consistently outperforms strong forecasting and transfer baselines, while providing interpretable knowledge transfer pathways.
Pardis Moradbeiki, Nasser Ghadiri, Sayed Jalal Zahabi, Uffe Kock Wiil, Kristoffer Kittelmann Brockhattingen, Ali Ebrahimi
Accurate sarcopenia diagnosis via ultrasound remains challenging due to subtle imaging cues, limited labeled data, and the absence of clinical context in most models. We propose MedVQA-TREE, a multimodal framework that integrates a hierarchical image interpretation module, a gated feature-level fusion mechanism, and a novel multi-hop, multi-query retrieval strategy. The vision module includes anatomical classification, region segmentation, and graph-based spatial reasoning to capture coarse, mid-level, and fine-grained structures. A gated fusion mechanism selectively integrates visual features with textual queries, while clinical knowledge is retrieved through a UMLS-guided pipeline accessing PubMed and a sarcopenia-specific external knowledge base. MedVQA-TREE was trained and evaluated on two public MedVQA datasets (VQA-RAD and PathVQA) and a custom sarcopenia ultrasound dataset. The model achieved up to 99% diagnostic accuracy and outperformed previous state-of-the-art methods by over 10%. These results underscore the benefit of combining structured visual understanding with guided knowledge retrieval for effective AI-assisted diagnosis in sarcopenia.
Hung-Chun Hsu, Yuan-Ching Kuo, Chao-Han Huck Yang, Szu-Wei Fu, Hanrong Ye, Hongxu Yin, Yu-Chiang Frank Wang, Ming-Feng Tsai et al.
The rapid evolution of e-commerce has exposed the limitations of traditional product retrieval systems in managing complex, multi-turn user interactions. Recent advances in multimodal generative retrieval -- particularly those leveraging multimodal large language models (MLLMs) as retrievers -- have shown promise. However, most existing methods are tailored to single-turn scenarios and struggle to model the evolving intent and iterative nature of multi-turn dialogues when applied naively. Concurrently, test-time scaling has emerged as a powerful paradigm for improving large language model (LLM) performance through iterative inference-time refinement. Yet, its effectiveness typically relies on two conditions: (1) a well-defined problem space (e.g., mathematical reasoning), and (2) the model's ability to self-correct -- conditions that are rarely met in conversational product search. In this setting, user queries are often ambiguous and evolving, and MLLMs alone have difficulty grounding responses in a fixed product corpus. Motivated by these challenges, we propose a novel framework that introduces test-time scaling into conversational multimodal product retrieval. Our approach builds on a generative retriever, further augmented with a test-time reranking (TTR) mechanism that improves retrieval accuracy and better aligns results with evolving user intent throughout the dialogue. Experiments across multiple benchmarks show consistent improvements, with average gains of 14.5 points in MRR and 10.6 points in nDCG@1.
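
The two reported metrics can be sketched as follows. This is a minimal illustration assuming binary relevance labels and the standard definitions of MRR and nDCG@k. The function names and input layout are hypothetical, and the abstract does not specify the authors' evaluation code.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked_ids[:k], start=1)
        if item in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

With binary relevance, nDCG@1 reduces to whether the top-ranked product is relevant, so a gain in nDCG@1 reflects more dialogue turns where the first result is correct.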
Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li, Haobo Yang, Yaohan He, Jinlong Li
Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and full key cache in that specified layer. In particular, we propose a multi-pooling kernels allocating strategy in the token recalling process to maintain the completeness of semantics. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$, but also achieves performance comparable to or better than the full context in the long context scenarios. Without additional post training or operator development, ILRe can process a single $1M$ tokens request in less than half a minute (speedup $\approx 180\times$) and scores RULER-$1M$ benchmark of $\approx 79.8$ with model Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.
Caroline Piaulet-Ghorayeb
Transmission spectroscopy is a key avenue for the near-term study of small-planet atmospheres and the most promising method when it comes to searching for atmospheres on temperate rocky worlds, which are often too cold for planetary emission to be detectable. At the same time, the small planets that are most amenable for such atmospheric probes orbit small and cool M dwarf stars. As the field becomes increasingly ambitious in the search for signs of even thin atmospheres on small exoplanets, the transit light source effect (TLSE), caused by unocculted stellar surface heterogeneities, is becoming a limiting factor: it is imperative to develop robust inference methods to disentangle planetary and stellar contributions to the observed spectra. Here, I present STCTM, the STellar ConTamination Modeling framework, a flexible Bayesian retrieval framework to model the impact of the TLSE on any exoplanet transmission spectrum, and infer the range of stellar surface parameters that are compatible with the observations in the absence of any planetary contribution. With the "exotune" sub-module, users can also perform retrievals directly on out-of-transit stellar spectra in order to place data-driven priors on the extent to which the TLSE can impact any planet's transmission spectrum. The input data formats, stellar models, and fitted parameters are easily tunable using human-readable files and the code is fully parallelized to enable fast inferences. [shortened for arxiv; see full summary in the PDF]
Authors' comments: 5 pages, re-submitted to JOSS following pre-review comments
Nafis Tanveer Islam, Zhiming Zhao
Classical search engines using indexing methods in data infrastructures primarily allow keyword-based queries to retrieve content. While these indexing-based methods are highly scalable and efficient, due to a lack of an appropriate evaluation dataset and a limited understanding of semantics, they often fail to capture the user's intent and generate incomplete responses during evaluation. This problem also extends to domain-specific search systems that utilize a Knowledge Base (KB) to access data from various research infrastructures. Research infrastructures (RIs) from the environmental and earth science domain, which encompass the study of ecosystems, biodiversity, oceanography, and climate change, generate, share, and reuse large volumes of data. While there are attempts to provide a centralized search service using Elasticsearch as a knowledge base, they also face similar challenges in understanding queries with multiple intents. To address these challenges, we propose an automated method to curate a domain-specific evaluation dataset to analyze the capability of a search system. Furthermore, we incorporate Retrieval-Augmented Generation (RAG), powered by Large Language Models (LLMs), for high-quality retrieval of environmental domain data using natural language queries. Our quantitative and qualitative analysis of the evaluation dataset shows that LLM-based systems for information retrieval return results with higher precision when understanding queries with multiple intents, compared to Elasticsearch-based systems.
Authors' comments: Accepted at FAIEMA Conference 2025. DOI will be provided once the conference publishes the paper
Yiming Xu, Junfeng Jiao
Accurately predicting travel mode choice is essential for effective transportation planning, yet traditional statistical and machine learning models are constrained by rigid assumptions, limited contextual reasoning, and reduced generalizability. This study explores the potential of Large Language Models (LLMs) as a more flexible and context-aware approach to travel mode choice prediction, enhanced by Retrieval-Augmented Generation (RAG) to ground predictions in empirical data. We develop a modular framework for integrating RAG into LLM-based travel mode choice prediction and evaluate four retrieval strategies: basic RAG, RAG with balanced retrieval, RAG with a cross-encoder for re-ranking, and RAG with balanced retrieval and cross-encoder for re-ranking. These strategies are tested across three LLM architectures (OpenAI GPT-4o, o4-mini, and o3) to examine the interaction between model reasoning capabilities and retrieval methods. Using the 2023 Puget Sound Regional Household Travel Survey data, we conduct a series of experiments to evaluate model performance. The results demonstrate that RAG substantially enhances predictive accuracy across a range of models. Notably, the GPT-4o model combined with balanced retrieval and cross-encoder re-ranking achieves the highest accuracy of 80.8%, exceeding that of conventional statistical and machine learning baselines. Furthermore, LLM-based models exhibit superior generalization abilities relative to these baselines. Findings highlight the critical interplay between LLM reasoning capabilities and retrieval strategies, demonstrating the importance of aligning retrieval strategies with model capabilities to maximize the potential of LLM-based travel behavior modeling.
Aleksandar Pramov, Jiangqin Ma, Bina Patel
Claim normalization is an integral part of any automatic fact-check
verification system. It parses typically noisy claim data, such as social
media posts, into normalized claims, which are then fed into downstream veracity
classification tasks. The CheckThat! 2025 Task 2 focuses specifically on claim
normalization and spans 20 languages under monolingual and zero-shot
conditions. Our proposed solution consists of a lightweight
retrieval-first, LLM-backed pipeline, in which we either dynamically
prompt GPT-4o-mini with in-context examples, or retrieve the closest
normalization from the train dataset directly. On the official test set, the
system ranks near the top for most monolingual tracks, achieving first place in
7 out of the 13 languages. In contrast, the system underperforms in the
zero-shot setting, highlighting the limitation of the proposed solution.
Authors' comments: CLEF 2025 Working Notes, Madrid, Spain
Nir Mazor, Tom Hope
Clinical decision-making often involves interpreting images (e.g., radiology) for making diagnoses. Retrieving relevant visual information from medical literature and hospital records could enhance diagnostic accuracy. In this paper, we develop a model in which a multimodal retriever is jointly optimized with an LVLM for medical diagnosis, unlike standard RAG where LVLM error signal is not propagated down to the retriever. We show that using only general-purpose backbones, with only lightweight fine-tuning, our model is able to achieve competitive results with medically-pretrained models across clinical multi-label classification and visual question answering tasks. In a novel analysis, we additionally find that in many cases different top retrieved images each lead to different predictions for a given target, and that these cases are empirically challenging for all models, even for non-retrieval models. Our joint retrieval optimization significantly improves these challenging cases over standard RAG. However, oracle analysis reveals that while the correct diagnosis is frequently achievable using one of the top retrieved images, in practice there is a large performance gap from the oracle, and rerankers using frontier LVLMs do not close this gap -- leaving ample room for improvement by future methods. Code will be made publicly available.
Zhihao Ding, Yongkang Sun, Jieming Shi
Tables are a prevalent format for structured data, yet their metadata, such
as semantic types and column relationships, is often incomplete or ambiguous.
Column annotation tasks, including Column Type Annotation (CTA) and Column
Property Annotation (CPA), are critical for data management and address this
by leveraging table context. Existing methods typically serialize all columns
in a table into pretrained language models to incorporate context, but this
coarse-grained approach often degrades performance in wide tables with many
irrelevant or misleading columns. To address this, we propose a novel
retrieve-and-verify context selection framework for accurate column annotation,
introducing two methods: REVEAL and REVEAL+. In REVEAL, we design an efficient
unsupervised retrieval technique to select compact, informative column contexts
by balancing semantic relevance and diversity, and develop context-aware
encoding techniques with role embeddings and target-context pair training to
effectively differentiate target and context columns. To further improve
performance, in REVEAL+, we design a verification model that refines the
selected context by directly estimating its quality for specific annotation
tasks. To achieve this, we formulate a novel column context verification
problem as a classification task and then develop the verification model.
Moreover, in REVEAL+, we develop a top-down verification inference technique to
ensure efficiency by reducing the search space for high-quality context subsets
from exponential to quadratic. Extensive experiments on six benchmark datasets
demonstrate that our methods consistently outperform state-of-the-art
baselines.
Authors' comments: Accepted at SIGMOD 2026
Gunjan Jalori, Preetika Verma, Sercan Ö Arık
Time series forecasting with large language models (LLMs) requires bridging numerical patterns and natural language. Effective forecasting on LLM often relies on extensive preprocessing and fine-tuning. Recent studies show that a frozen LLM can rival specialized forecasters when supplied with a carefully engineered natural-language prompt, but crafting such a prompt for each task is itself onerous and ad-hoc. We introduce FLAIRR-TS, a test-time prompt optimization framework that utilizes an agentic system: a Forecaster-agent generates forecasts using an initial prompt, which is then refined by a refiner agent, informed by past outputs and retrieved analogs. This adaptive prompting generalizes across domains using creative prompt templates and generates high-quality forecasts without intermediate code generation. Experiments on benchmark datasets show improved accuracy over static prompting and retrieval-augmented baselines, approaching the performance of specialized prompts. FLAIRR-TS provides a practical alternative to tuning, achieving strong performance via its agentic approach to adaptive prompt refinement and retrieval.
Authors' comments: EMNLP