Fred Mutisya, Shikoh Gitau, Christine Syovata, Diana Oigara, Ibrahim Matende, Muna Aden, Munira Ali, Ryan Nyotu et al.
Large Language Models(LLMs) hold promise for improving healthcare access in
low-resource settings, but their effectiveness in African primary care remains
underexplored. We present a methodology for creating a benchmark dataset and
evaluation framework focused on Kenyan Level 2 and 3 clinical care. Our
approach uses retrieval augmented generation (RAG) to ground clinical questions
in Kenya's national guidelines, ensuring alignment with local standards. These
guidelines were digitized, chunked, and indexed for semantic retrieval. Gemini
Flash 2.0 Lite was then prompted with guideline excerpts to generate realistic
clinical scenarios, multiple-choice questions, and rationale based answers in
English and Swahili. Kenyan physicians co-created and refined the dataset, and
a blinded expert review process ensured clinical accuracy, clarity, and
cultural appropriateness. The resulting Alama Health QA dataset includes
thousands of regulator-aligned question answer pairs across common outpatient
conditions. Beyond accuracy, we introduce evaluation metrics that test clinical
reasoning, safety, and adaptability such as rare case detection (Needle in the
Haystack), stepwise logic (Decision Points), and contextual adaptability.
Initial results reveal significant performance gaps when LLMs are applied to
localized scenarios, consistent with findings that LLM accuracy is lower on
African medical content than on US-based benchmarks. This work offers a
replicable model for guideline-driven, dynamic benchmarking to support safe AI
deployment in African health systems.
Authors' comments: 29 pages, 6 figs, 6 tables. Companion methods paper forthcoming
Yuexuan Kong, Vincent Lostanlen, Romain Hennequin, Mathieu Lagrange, Gabriel Meseguer-Brocal
Contrastive learning and equivariant learning are effective methods for self-supervised learning (SSL) for audio content analysis. Yet, their application to music information retrieval (MIR) faces a dilemma: the former is more effective on tagging (e.g., instrument recognition) but less effective on structured prediction (e.g., tonality estimation); The latter can match supervised methods on the specific task it is designed for, but it does not generalize well to other tasks. In this article, we adopt a best-of-both-worlds approach by training a deep neural network on both kinds of pretext tasks at once. The proposed new architecture is a Vision Transformer with 1-D spectrogram patches (ViT-1D), equipped with two class tokens, which are specialized to different self-supervised pretext tasks but optimized through the same model: hence the qualification of self-supervised multi-class-token multitask (MT2). The former class token optimizes cross-power spectral density (CPSD) for equivariant learning over the circle of fifths, while the latter optimizes normalized temperature-scaled cross-entropy (NT-Xent) for contrastive learning. MT2 combines the strengths of both pretext tasks and outperforms consistently both single-class-token ViT-1D models trained with either contrastive or equivariant learning. Averaging the two class tokens further improves performance on several tasks, highlighting the complementary nature of the representations learned by each class token. Furthermore, using the same single-linear-layer probing method on the features of last layer, MT2 outperforms MERT on all tasks except for beat tracking; achieving this with 18x fewer parameters thanks to its multitasking capabilities. Our SSL benchmark demonstrates the versatility of our multi-class-token multitask learning approach for MIR applications.
Muhammad Javed, Sedigh Khademi Habibabadi, Christopher Palmer, Hazel Clothier, Jim Buttery, Gerardo Luis Dimaguila
Vaccine hesitancy threatens public health, leading to delayed or rejected vaccines. Social media is a vital source for understanding public concerns, and traditional methods like topic modelling often struggle to capture nuanced opinions. Though trained for query answering, large Language Models (LLMs) often miss current events and community concerns. Additionally, hallucinations in LLMs can compromise public health communication. To address these limitations, we developed a tool (VaxPulse Query Corner) using the Retrieval Augmented Generation technique. It addresses complex queries about public vaccine concerns on various online platforms, aiding public health administrators and stakeholders in understanding public concerns and implementing targeted interventions to boost vaccine confidence. Analysing 35,103 Shingrix social media posts, it achieved answer faithfulness (0.96) and relevance (0.94).
Hongxu Ma, Guanshuo Wang, Fufu Yu, Qiong Jia, Shouhong Ding
Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint
specific moments and assess clip-wise relevance based on the text query. While
DETR-based joint frameworks have made significant strides, there remains
untapped potential in harnessing the intricate relationships between temporal
motion and spatial semantics within video content. In this paper, we propose
the Motion-Semantics DETR (MS-DETR), a framework that captures rich
motion-semantics features through unified learning for MR/HD tasks. The encoder
first explicitly models disentangled intra-modal correlations within motion and
semantics dimensions, guided by the given text queries. Subsequently, the
decoder utilizes the task-wise correlation across temporal motion and spatial
semantics dimensions to enable precise query-guided localization for MR and
refined highlight boundary delineation for HD. Furthermore, we observe the
inherent sparsity dilemma within the motion and semantics dimensions of MR/HD
datasets. To address this issue, we enrich the corpus from both dimensions by
generation strategies and propose contrastive denoising learning to ensure the
above components learn robustly and effectively. Extensive experiments on four
MR/HD benchmarks demonstrate that our method outperforms existing
state-of-the-art models by a margin. Our code is available at
https://github.com/snailma0229/MS-DETR.git.
Authors' comments: Accepted by ACM MM'25
Qingyun Sun, Jiaqi Yuan, Shan He, Xiao Guan, Haonan Yuan, Xingcheng Fu, Jianxin Li, Philip S. Yu
Graph Retrieval-Augmented Generation has emerged as a powerful paradigm for grounding large language models with external structured knowledge. However, existing Graph RAG methods struggle with temporal reasoning, due to their inability to model the evolving structure and order of real-world events. In this work, we introduce DyG-RAG, a novel event-centric dynamic graph retrieval-augmented generation framework designed to capture and reason over temporal knowledge embedded in unstructured text. To eliminate temporal ambiguity in traditional retrieval units, DyG-RAG proposes Dynamic Event Units (DEUs) that explicitly encode both semantic content and precise temporal anchors, enabling accurate and interpretable time-aware retrieval. To capture temporal and causal dependencies across events, DyG-RAG constructs an event graph by linking DEUs that share entities and occur close in time, supporting efficient and meaningful multi-hop reasoning. To ensure temporally consistent generation, DyG-RAG introduces an event timeline retrieval pipeline that retrieves event sequences via time-aware traversal, and proposes a Time Chain-of-Thought strategy for temporally grounded answer generation. This unified pipeline enables DyG-RAG to retrieve coherent, temporally ordered event sequences and to answer complex, time-sensitive queries that standard RAG systems cannot resolve. Extensive experiments on temporal QA benchmarks demonstrate that DyG-RAG significantly improves the accuracy and recall of three typical types of temporal reasoning questions, paving the way for more faithful and temporal-aware generation. DyG-RAG is available at https://github.com/RingBDStack/DyG-RAG.
Tien P. T. Le, Anh M. T. Bui, Huy N. D. Pham, Alessio Bucaioni, Phuong T. Nguyen
Automatically generating concise, informative comments for source code can lighten documentation effort and accelerate program comprehension. Retrieval-augmented approaches first fetch code snippets with existing comments and then synthesize a new comment, yet retrieval and generation are typically optimized in isolation, allowing irrelevant neighbors topropagate noise downstream. To tackle the issue, we propose a novel approach named RAGSum with the aim of both effectiveness and efficiency in recommendations. RAGSum is built on top offuse retrieval and generation using a single CodeT5 backbone. We report preliminary results on a unified retrieval-generation framework built on CodeT5. A contrastive pre-training phase shapes code embeddings for nearest-neighbor search; these weights then seed end-to-end training with a composite loss that (i) rewards accurate top-k retrieval; and (ii) minimizes comment-generation error. More importantly, a lightweight self-refinement loop is deployed to polish the final output. We evaluated theframework on three cross-language benchmarks (Java, Python, C), and compared it with three well-established baselines. The results show that our approach substantially outperforms thebaselines with respect to BLEU, METEOR, and ROUTE-L. These findings indicate that tightly coupling retrieval and generationcan raise the ceiling for comment automation and motivateforthcoming replications and qualitative developer studies.
Authors' comments: The paper has been peer-reviewed and accepted for publication in the proceedings of the 19th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2025)
Adam Yang, Gustavo Penha, Enrico Palumbo, Hugues Bouchard
With the breakthroughs in large language models (LLMs), query generation techniques that expand documents and queries with related terms are becoming increasingly popular in the information retrieval field. Such techniques have been shown to improve the effectiveness of traditional lexical retrieval methods by dealing with the vocabulary mismatch problem. Recent work has found that generating queries with a greedy decoding strategy can produce sub-optimal queries, including hallucinations, and proposed to filter out queries before expansion. This `generate-then-filter' approach is costly, as it requires generating multiple queries and applying a relevance model to all of them and does not teach the LLM which of the generated queries is more effective for expansion. To overcome such limitations, we propose Aligned Query Expansion (AQE), a novel approach to enhance query expansion for passage retrieval in open-domain question answering. AQE leverages recent techniques in LLM alignment to fine-tune models for generating query expansions that directly optimize the effectiveness of the retrieval task, eliminating the need for additional filtering steps. This alignment ensures that queries are more relevant, reducing computational costs while improving retrieval effectiveness. Empirical evaluations show that AQE outperforms baseline models for query expansion in both in-domain and out-of-domain settings, demonstrating significant improvements in retrieval effectiveness.
Sangwoo Park, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang
Scientific paper retrieval, particularly framed as document-to-document retrieval, aims to identify relevant papers in response to a long-form query paper, rather than a short query string. Previous approaches to this task have focused on abstracts, embedding them into dense vectors as surrogates for full documents and calculating similarity across them, although abstracts provide only sparse and high-level summaries. To address this, we propose PRISM, a novel document-to-document retrieval method that introduces multiple, fine-grained representations for both the query and candidate papers. In particular, each query paper is decomposed into multiple aspect-specific views and individually embedded, which are then matched against candidate papers similarity segmented to consider their multifaceted dimensions. Moreover, we present SciFullBench, a novel benchmark in which the complete and segmented context of full papers for both queries and candidates is available. Then, experimental results show that PRISM improves performance by an average of 4.3% over existing retrieval baselines.
Tuan-Luc Huynh, Thuy-Trang Vu, Weiqing Wang, Trung Le, Dragan Gašević, Yuan-Fang Li, Thanh-Toan Do
Continually updating model-based indexes in generative retrieval with new documents remains challenging, as full retraining is computationally expensive and impractical under resource constraints. We propose MixLoRA-DSI, a novel framework that combines an expandable mixture of Low-Rank Adaptation experts with a layer-wise out-of-distribution (OOD)-driven expansion strategy. Instead of allocating new experts for each new corpus, our proposed expansion strategy enables sublinear parameter growth by selectively introducing new experts only when significant number of OOD documents are detected. Experiments on NQ320k and MS MARCO Passage demonstrate that MixLoRA-DSI outperforms full-model update baselines, with minimal parameter overhead and substantially lower training costs.
Mehmet Onurcan Kaya, Figen S. Oktem
Phase retrieval involves recovering a signal from intensity-only measurements, crucial in many fields such as imaging, holography, optical computing, crystallography, and microscopy. Although there are several well-known phase retrieval algorithms, including classical iterative solvers, the reconstruction performance often remains sensitive to initialization and measurement noise. Recently, image-to-image diffusion models have gained traction in various image reconstruction tasks, yielding significant theoretical insights and practical breakthroughs. In this work, we introduce a novel phase retrieval approach based on an image-to-image diffusion framework called Inversion by Direct Iteration. Our method begins with an enhanced initialization stage that leverages a hybrid iterative technique, combining the Hybrid Input-Output and Error Reduction methods and incorporating a novel acceleration mechanism to obtain a robust crude estimate. Then, it iteratively refines this initial crude estimate using the learned image-to-image pipeline. Our method achieves substantial improvements in both training efficiency and reconstruction quality. Furthermore, our approach utilizes aggregation techniques to refine quality metrics and demonstrates superior results compared to both classical and contemporary techniques. This highlights its potential for effective and efficient phase retrieval across various applications.
Kirill Khrylchenko, Vladimir Baikalov, Sergei Makeev, Artem Matveev, Sergei Liamaev
Two-tower neural networks are a popular architecture for the retrieval stage
in recommender systems. These models are typically trained with a softmax loss
over the item catalog. However, in web-scale settings, the item catalog is
often prohibitively large, making full softmax infeasible. A common solution is
sampled softmax, which approximates the full softmax using a small number of
sampled negatives.
One practical and widely adopted approach is to use in-batch negatives, where
negatives are drawn from items in the current mini-batch. However, this
introduces a bias: items that appear more frequently in the batch (i.e.,
popular items) are penalized more heavily.
To mitigate this issue, a popular industry technique known as logQ correction
adjusts the logits during training by subtracting the log-probability of an
item appearing in the batch. This correction is derived by analyzing the bias
in the gradient and applying importance sampling, effectively twice, using the
in-batch distribution as a proposal distribution. While this approach improves
model quality, it does not fully eliminate the bias.
In this work, we revisit the derivation of logQ correction and show that it
overlooks a subtle but important detail: the positive item in the denominator
is not Monte Carlo-sampled - it is always present with probability 1. We
propose a refined correction formula that accounts for this. Notably, our loss
introduces an interpretable sample weight that reflects the model's uncertainty
- the probability of misclassification under the current parameters. We
evaluate our method on both public and proprietary datasets, demonstrating
consistent improvements over the standard logQ correction.
Authors' comments: Accepted at ACM RecSys 2025. Author's version. To appear in the
Proceedings of the 18th ACM Conference on Recommender Systems
Huihui Huang, Ratnadira Widyasari, Ting Zhang, Ivana Clairine Irsan, Jieke Shi, Han Wei Ang, Frank Liauw, Eng Lieh Ouh et al.
Issue-commit linking, which connects issues with commits that fix them, is crucial for software maintenance. Existing approaches have shown promise in automatically recovering these links. Evaluations of these techniques assess their ability to identify genuine links from plausible but false links. However, these evaluations overlook the fact that, in reality, when a repository has more commits, the presence of more plausible yet unrelated commits may interfere with the tool in differentiating the correct fix commits. To address this, we propose the Realistic Distribution Setting (RDS) and use it to construct a more realistic evaluation dataset that includes 20 open-source projects. By evaluating tools on this dataset, we observe that the performance of the state-of-the-art deep learning-based approach drops by more than half, while the traditional Information Retrieval method, VSM, outperforms it. Inspired by these observations, we propose EasyLink, which utilizes a vector database as a modern Information Retrieval technique. To address the long-standing problem of the semantic gap between issues and commits, EasyLink leverages a large language model to rerank the commits retrieved from the database. Under our evaluation, EasyLink achieves an average Precision@1 of 75.91%, improving over the state-of-the-art by over four times. Additionally, this paper provides practical guidelines for advancing research in issue-commit link recovery.
Shuo Yang, Zijian Yu, Zhenzhe Ying, Yuqin Dai, Guoqing Wang, Jun Lan, Jinfeng Xu, Jinze Li et al.
The rapid proliferation of multimodal misinformation presents significant challenges for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web-based evidence and multi-agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact-checking solutions. RAMA will be publicly available at https://github.com/kalendsyang/RAMA.git.
Zhengding Hu, Vibha Murthy, Zaifeng Pan, Wanlu Li, Xiaoyi Fang, Yufei Ding, Yuke Wang
This paper addresses emerging system-level challenges in heterogeneous
retrieval-augmented generation (RAG) serving, where complex multi-stage
workflows and diverse request patterns complicate efficient execution. We
present HedraRAG, a runtime system built on a graph-based abstraction that
exposes optimization opportunities across stage-level parallelism,
intra-request similarity, and inter-request skewness. These opportunities are
realized through dynamic graph transformations, such as node splitting,
reordering, edge addition, and dependency rewiring, applied to wavefronts of
subgraphs spanning concurrent requests. The resulting execution plans are
mapped onto hybrid CPU-GPU pipelines to improve resource utilization and reduce
latency. Evaluations across a wide range of RAG workflows demonstrate speedups
exceeding 1.5x and reaching up to 5x over existing frameworks, showcasing the
effectiveness of coordinated generation and retrieval in serving environments.
Authors' comments: Accepted by SOSP 2025
Anthony Miyaguchi, Conor Johnston, Aaryan Potdar
Large Language Models (LLMs) demonstrate strong conversational abilities. In this Working Paper, we study them in the context of debating in two ways: their ability to perform in a structured debate along with a dataset of arguments to use and their ability to evaluate utterances throughout the debate. We deploy six leading publicly available models from three providers for the Retrieval-Augmented Debate and Evaluation. The evaluation is performed by measuring four key metrics: Quality, Quantity, Manner, and Relation. Throughout this task, we found that although LLMs perform well in debates when given related arguments, they tend to be verbose in responses yet consistent in evaluation. The accompanying source code for this paper is located at https://github.com/dsgt-arc/touche-2025-rad.
Savini Kashmira, Jayanaka L. Dantanarayana, Krisztián Flautner, Lingjia Tang, Jason Mars
Conventional Retrieval Augmented Generation (RAG) approaches are common in text-based applications. However, they struggle with structured, interconnected datasets like knowledge graphs, where understanding underlying relationships is crucial for accurate retrieval. A common direction in graph-based retrieval employs iterative, rule-based traversal guided by Large Language Models (LLMs). Such existing iterative methods typically combine reasoning with single hop traversal at each step, making them vulnerable to LLM reasoning errors and hallucinations that ultimately hinder the retrieval of relevant information. To address these limitations, we propose GraphRunner, a novel graph-based retrieval framework that operates in three distinct stages: planning, verification, and execution. This introduces high-level traversal actions that enable multi-hop exploration in a single step. It also generates a holistic traversal plan, which is verified against the graph structure and pre-defined traversal actions, reducing reasoning errors and detecting hallucinations before execution. GraphRunner significantly reduces LLM reasoning errors and detects hallucinations through validation. Our evaluation using the GRBench dataset shows that GraphRunner consistently outperforms existing approaches, achieving 10-50% performance improvements over the strongest baseline while reducing inference cost by 3.0-12.9x and response generation time by 2.5-7.1x, making it significantly more robust and efficient for graph-based retrieval tasks.
Inye Na, Nejung Rue, Jiwon Chung, Hyunjin Park
Medical image retrieval is a valuable field for supporting clinical
decision-making, yet current methods primarily support 2D images and require
fully annotated queries, limiting clinical flexibility. To address this, we
propose RadiomicsRetrieval, a 3D content-based retrieval framework bridging
handcrafted radiomics descriptors with deep learning-based embeddings at the
tumor level. Unlike existing 2D approaches, RadiomicsRetrieval fully exploits
volumetric data to leverage richer spatial context in medical images. We employ
a promptable segmentation model (e.g., SAM) to derive tumor-specific image
embeddings, which are aligned with radiomics features extracted from the same
tumor via contrastive learning. These representations are further enriched by
anatomical positional embedding (APE). As a result, RadiomicsRetrieval enables
flexible querying based on shape, location, or partial feature sets. Extensive
experiments on both lung CT and brain MRI public datasets demonstrate that
radiomics features significantly enhance retrieval specificity, while APE
provides global anatomical context essential for location-based searches.
Notably, our framework requires only minimal user prompts (e.g., a single
point), minimizing segmentation overhead and supporting diverse clinical
scenarios. The capability to query using either image embeddings or selected
radiomics attributes highlights its adaptability, potentially benefiting
diagnosis, treatment planning, and research on large-scale medical imaging
repositories. Our code is available at
https://github.com/nainye/RadiomicsRetrieval.
Authors' comments: Accepted at MICCAI 2025
Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Heuiseok Lim
With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on both CLIR and Mono-Lingual Information Retrieval (IR) performance remains under-explored. To systematically investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: CLIR performance improves with specific language pairs, while Mono-Lingual IR performance declines. Our work demonstrates that Model Merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-Lingual IR capabilities. Our findings underscore the effects of linguistic configuration of training data on both CLIR and Mono-Lingual IR, and present Model Merging as a viable strategy to optimize performance across these tasks.
Georgios Balanos, Evangelos Chasanis, Konstantinos Skianis, Evaggelia Pitoura
Retrieval-Augmented Generation (RAG) enhances language models by grounding responses in external information, yet explainability remains a critical challenge, particularly when retrieval relies on unstructured text. Knowledge graphs (KGs) offer a solution by introducing structured, semantically rich representations of entities and their relationships, enabling transparent retrieval paths and interpretable reasoning. In this work, we present KGRAG-Ex, a RAG system that improves both factual grounding and explainability by leveraging a domain-specific KG constructed via prompt-based information extraction. Given a user query, KGRAG-Ex identifies relevant entities and semantic paths in the graph, which are then transformed into pseudo-paragraphs: natural language representations of graph substructures that guide corpus retrieval. To improve interpretability and support reasoning transparency, we incorporate perturbation-based explanation methods that assess the influence of specific KG-derived components on the generated answers. We conduct a series of experiments to analyze the sensitivity of the system to different perturbation methods, the relationship between graph component importance and their structural positions, the influence of semantic node types, and how graph metrics correspond to the influence of components within the explanations process.
Gustavo Correa Publio, José Emilio Labra Gayo
Shapes Constraint Language (SHACL) is a powerful language for validating RDF
data. Given the recent industry attention to Knowledge Graphs (KGs), more users
need to validate linked data properly. However, traditional SHACL validation
engines often provide terse reports in English that are difficult for
non-technical users to interpret and act upon. This paper presents xpSHACL, an
explainable SHACL validation system that addresses this issue by combining
rule-based justification trees with retrieval-augmented generation (RAG) and
large language models (LLMs) to produce detailed, multilanguage, human-readable
explanations for constraint violations. A key feature of xpSHACL is its usage
of a Violation KG to cache and reuse explanations, improving efficiency and
consistency.
Authors' comments: Accepted for publication in the 2nd LLM+Graph Workshop, colocated at
VLDB'25