Vipula Rawte, Rajarshi Roy, Gurpreet Singh, Danush Khanna, Yaswanth Narsupalli, Basab Ghosh, Abhay Gupta, Argha Kamal Samanta et al.
As Large Language Models (LLMs) continue to advance, Retrieval-Augmented Generation (RAG) has emerged as a vital technique to enhance factual accuracy by integrating external knowledge into the generation process. However, LLMs often fail to faithfully integrate retrieved evidence into their generated responses, leading to factual inconsistencies. To quantify this gap, we introduce Entity-Context Divergence (ECD), a metric that measures the extent to which retrieved information is accurately reflected in model outputs. We systematically evaluate contemporary LLMs on their ability to preserve factual consistency in retrieval-augmented settings, a capability we define as RAG-ability. Our empirical analysis reveals that RAG-ability remains low across most LLMs, highlighting significant challenges in entity retention and context fidelity. This paper introduces Radiant (Retrieval AugmenteD entIty-context AligNmenT), a novel framework that merges RAG with alignment designed to optimize the interplay between retrieved evidence and generated content. Radiant extends Direct Preference Optimization (DPO) to teach LLMs how to integrate provided additional information into subsequent generations. As a behavior correction mechanism, Radiant boosts RAG performance across varied retrieval scenarios, such as noisy web contexts, knowledge conflicts, and hallucination reduction. This enables more reliable, contextually grounded, and factually coherent content generation.
Chi Minh Bui, Ngoc Mai Thieu, Van Vinh Nguyen, Jason J. Jung, Khac-Hoai Nam Bui
The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to improve the retrieval phase of retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching the contextual representation of complex input queries using a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR's superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that, by incorporating KG-CQR, the performance consistently outperforms the existing baseline in terms of retrieval effectiveness
Authors' comments: Accepted at Main EMNLP 2025
Tian Xie, Huanfeng Shen, Menghui Jiang, Juan-Carlos Jiménez-Muñoz, José A. Sobrino, Huifang Li, Chao Zeng
Land surface temperature (LST) is vital for land-atmosphere interactions and climate processes. Accurate LST retrieval remains challenging under heterogeneous land cover and extreme atmospheric conditions. Traditional split window (SW) algorithms show biases in humid environments; purely machine learning (ML) methods lack interpretability and generalize poorly with limited data. We propose a coupled mechanism model-ML (MM-ML) framework integrating physical constraints with data-driven learning for robust LST retrieval. Our approach fuses radiative transfer modeling with data components, uses MODTRAN simulations with global atmospheric profiles, and employs physics-constrained optimization. Validation against 4,450 observations from 29 global sites shows MM-ML achieves MAE=1.84K, RMSE=2.55K, and R-squared=0.966, outperforming conventional methods. Under extreme conditions, MM-ML reduces errors by over 50%. Sensitivity analysis indicates LST estimates are most sensitive to sensor radiance, then water vapor, and less to emissivity, with MM-ML showing superior stability. These results demonstrate the effectiveness of our coupled modeling strategy for retrieving geophysical parameters. The MM-ML framework combines physical interpretability with nonlinear modeling capacity, enabling reliable LST retrieval in complex environments and supporting climate monitoring and ecosystem studies.
Bangxiang Lan, Ruobing Xie, Ruixiang Zhao, Xingwu Sun, Zhanhui Kang, Gang Yang, Xirong Li
The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos by
textual queries with the same semantic meanings. Recent CLIP-based approaches
have explored two frameworks: Two-Tower versus Single-Tower framework, yet the
former suffers from low effectiveness, while the latter suffers from low
efficiency. In this study, we explore a new Hybrid-Tower framework that can
hybridize the advantages of the Two-Tower and Single-Tower framework, achieving
high effectiveness and efficiency simultaneously. We propose a novel hybrid
method, Fine-grained Pseudo-query Interaction and Generation for T2VR, ie, PIG,
which includes a new pseudo-query generator designed to generate a pseudo-query
for each video. This enables the video feature and the textual features of
pseudo-query to interact in a fine-grained manner, similar to the Single-Tower
approaches to hold high effectiveness, even before the real textual query is
received. Simultaneously, our method introduces no additional storage or
computational overhead compared to the Two-Tower framework during the inference
stage, thus maintaining high efficiency. Extensive experiments on five commonly
used text-video retrieval benchmarks demonstrate that our method achieves a
significant improvement over the baseline, with an increase of $1.6\% \sim
3.9\%$ in R@1. Furthermore, our method matches the efficiency of Two-Tower
models while achieving near state-of-the-art performance, highlighting the
advantages of the Hybrid-Tower framework.
Authors' comments: Accepted to ICCV2025
Mansi Garg, Lee-Chi Wang, Bhavesh Ghanchi, Sanjana Dumpala, Shreyash Kakde, Yen Chih Chen
This work presents a Biomedical Literature Question Answering (Q&A) system based on a Retrieval-Augmented Generation (RAG) architecture, designed to improve access to accurate, evidence-based medical information. Addressing the shortcomings of conventional health search engines and the lag in public access to biomedical research, the system integrates diverse sources, including PubMed articles, curated Q&A datasets, and medical encyclopedias ,to retrieve relevant information and generate concise, context-aware responses. The retrieval pipeline uses MiniLM-based semantic embeddings and FAISS vector search, while answer generation is performed by a fine-tuned Mistral-7B-v0.3 language model optimized using QLoRA for efficient, low-resource training. The system supports both general medical queries and domain-specific tasks, with a focused evaluation on breast cancer literature demonstrating the value of domain-aligned retrieval. Empirical results, measured using BERTScore (F1), show substantial improvements in factual consistency and semantic relevance compared to baseline models. The findings underscore the potential of RAG-enhanced language models to bridge the gap between complex biomedical literature and accessible public health knowledge, paving the way for future work on multilingual adaptation, privacy-preserving inference, and personalized medical AI systems.
Authors' comments: 10 pages, 6 figures, 3 tables
Hao Ju, Hu Zhang, Zhedong Zheng
With growing public safety demands, text-based person anomaly search has emerged as a critical task, aiming to retrieve individuals with abnormal behaviors via natural language descriptions. Unlike conventional person search, this task presents two unique challenges: (1) fine-grained cross-modal alignment between textual anomalies and visual behaviors, and (2) anomaly recognition under sparse real-world samples. While Large Multi-modal Models (LMMs) excel in multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored, hindered by: (1) a domain gap between generative knowledge and discriminative retrieval, and (2) the absence of efficient adaptation strategies for deployment. In this work, we propose AnomalyLMM, the first framework that harnesses LMMs for text-based person anomaly search. Our key contributions are: (1) A novel coarse-to-fine pipeline integrating LMMs to bridge generative world knowledge with retrieval-centric anomaly detection; (2) A training-free adaptation cookbook featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, enabling zero-shot focus on subtle anomaly cues. As the first study to explore LMMs for this task, we conduct a rigorous evaluation on the PAB dataset, the only publicly available benchmark for text-based person anomaly search, with its curated real-world anomalies covering diverse scenarios (e.g., falling, collision, and being hit). Experiments show the effectiveness of the proposed method, surpassing the competitive baseline by +0.96% Recall@1 accuracy. Notably, our method reveals interpretable alignment between textual anomalies and visual behaviors, validated via qualitative analysis. Our code and models will be released for future research.
Yuqing Huang, Rongyang Zhang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Xuyang Zhi et al.
Recent advancements in large language models (LLMs) have revolutionized natural language processing through their remarkable capabilities in understanding and executing diverse tasks. While supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively enhances task-specific performance, it often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Existing solutions either require access to general instruction data or face limitations in preserving the model's original distribution. To overcome these limitations, we propose SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model's semantic distribution, thereby mitigating catastrophic forgetting and improving downstream performance. Extensive experiments demonstrate that SelfAug achieves a superior balance between downstream learning and general capability retention. Our comprehensive empirical analysis reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios, highlighting how the absence of RAG capabilities in general instruction tuning leads to significant distribution shifts during fine-tuning. Our findings not only advance the understanding of catastrophic forgetting in RAG contexts but also provide a practical solution applicable across diverse fine-tuning scenarios. Our code is publicly available at https://github.com/USTC-StarTeam/SelfAug.
Samantha J. Alloo, Ying Ying How, Jannis N. Ahlers, David M. Paganin, Michelle K. Croughan, Kaye S. Morgan
Sensitive to scattering from unresolved sample structures, the dark-field channel in full-field X-ray imaging provides complementary information to that offered by conventional attenuation and phase-contrast methods. A range of experimental dark-field techniques and retrieval algorithms have been recently developed to extract this signal by directly resolving dark-field-associated local image blurring with a high-resolution camera. While the underlying physical mechanism that generates dark-field contrast is generally defined similarly across the methods, no comparison of these dark-field techniques using identical samples has been conducted. In this paper, dark-field imaging data from two samples were acquired using three emerging dark-field setups at a synchrotron: propagation-based, single-grid, and speckle-based X-ray imaging. Dark-field images were then reconstructed using a variety of retrieval algorithms--some requiring only a single sample exposure, others multiple; some performing local, pixel-wise analysis, and others operating globally on entire images. We find that the dominant contribution to dark-field contrast--arising from diffuse scattering from unresolved microstructures or multiple refractions through larger, potentially resolved structures--is consistently recovered across all approaches, demonstrating mutual agreement. However, some differences emerge for structures that are spatially varying. We attribute these differences to the idea that each technique has a different internal ruler, a sensitivity scale for dark-field retrieval influenced by both the experimental design and algorithmic assumptions of the technique. This study is intended to guide dark-field imaging users in selecting the most appropriate technique for their imaging goals and to motivate future research into dark-field sensitivity and sources of dark-field contrast across different methods.
Shakiba Amirshahi, Amin Bigdeli, Charles L. A. Clarke, Amira Ghenai
Retrieval augmented generation (RAG) systems provide a method for factually grounding the responses of a Large Language Model (LLM) by providing retrieved evidence, or context, as support. Guided by this context, RAG systems can reduce hallucinations and expand the ability of LLMs to accurately answer questions outside the scope of their training data. Unfortunately, this design introduces a critical vulnerability: LLMs may absorb and reproduce misinformation present in retrieved evidence. This problem is magnified if retrieved evidence contains adversarial material explicitly intended to promulgate misinformation. This paper presents a systematic evaluation of RAG robustness in the health domain and examines alignment between model outputs and ground-truth answers. We focus on the health domain due to the potential for harm caused by incorrect responses, as well as the availability of evidence-based ground truth for many common health-related questions. We conduct controlled experiments using common health questions, varying both the type and composition of the retrieved documents (helpful, harmful, and adversarial) as well as the framing of the question by the user (consistent, neutral, and inconsistent). Our findings reveal that adversarial documents substantially degrade alignment, but robustness can be preserved when helpful evidence is also present in the retrieval pool. These findings offer actionable insights for designing safer RAG systems in high-stakes domains by highlighting the need for retrieval safeguards. To enable reproducibility and facilitate future research, all experimental results are publicly available in our github repository. https://github.com/shakibaam/RAG_ROBUSTNESS_EVAL
MinJu Jeon, Si-Woo Kim, Ye-Chan Kim, HyunGee Kim, Dong-Jin Kim
Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose Sali4Vid, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning
Authors' comments: Accepted in EMNLP 2025
Connor Walker, Koorosh Aslansefat, Mohammad Naveed Akram, Yiannis Papadopoulos
Accuracy and safety are paramount in Offshore Wind (OSW) maintenance, yet conventional Large Language Models (LLMs) often fail when confronted with highly specialised or unexpected scenarios. We introduce RAGuard, an enhanced Retrieval-Augmented Generation (RAG) framework that explicitly integrates safety-critical documents alongside technical manuals.By issuing parallel queries to two indices and allocating separate retrieval budgets for knowledge and safety, RAGuard guarantees both technical depth and safety coverage. We further develop a SafetyClamp extension that fetches a larger candidate pool, "hard-clamping" exact slot guarantees to safety. We evaluate across sparse (BM25), dense (Dense Passage Retrieval) and hybrid retrieval paradigms, measuring Technical Recall@K and Safety Recall@K. Both proposed extensions of RAG show an increase in Safety Recall@K from almost 0\% in RAG to more than 50\% in RAGuard, while maintaining Technical Recall above 60\%. These results demonstrate that RAGuard and SafetyClamp have the potential to establish a new standard for integrating safety assurance into LLM-powered decision support in critical maintenance contexts.
Zahra Zehtabi Sabeti Moghaddam, Zeinab Dehghani, Maneeha Rani, Koorosh Aslansefat, Bhupesh Kumar Mishra, Rameez Raja Kureshi, Dhavalkumar Thakker
Generative AI, such as Large Language Models (LLMs), has achieved impressive progress but still produces hallucinations and unverifiable claims, limiting reliability in sensitive domains. Retrieval-Augmented Generation (RAG) improves accuracy by grounding outputs in external knowledge, especially in domains like healthcare, where precision is vital. However, RAG remains opaque and essentially a black box, heavily dependent on data quality. We developed a method-agnostic, perturbation-based framework that provides token and component-level interoperability for Graph RAG using SMILE and named it as Knowledge-Graph (KG)-SMILE. By applying controlled perturbations, computing similarities, and training weighted linear surrogates, KG-SMILE identifies the graph entities and relations most influential to generated outputs, thereby making RAG more transparent. We evaluate KG-SMILE using comprehensive attribution metrics, including fidelity, faithfulness, consistency, stability, and accuracy. Our findings show that KG-SMILE produces stable, human-aligned explanations, demonstrating its capacity to balance model effectiveness with interpretability and thereby fostering greater transparency and trust in machine learning technologies.
Yifan Xu, Qianwei Wang, Vineet Kamat, Carol Menassa
Indoor built environments like homes and offices often present complex and
cluttered layouts that pose significant challenges for individuals who are
blind or visually impaired, especially when performing tasks that involve
locating and gathering multiple objects. While many existing assistive
technologies focus on basic navigation or obstacle avoidance, few systems
provide scalable and efficient multi-object search capabilities in real-world,
partially observable settings. To address this gap, we introduce OpenGuide, an
assistive mobile robot system that combines natural language understanding with
vision-language foundation models (VLM), frontier-based exploration, and a
Partially Observable Markov Decision Process (POMDP) planner. OpenGuide
interprets open-vocabulary requests, reasons about object-scene relationships,
and adaptively navigates and localizes multiple target items in novel
environments. Our approach enables robust recovery from missed detections
through value decay and belief-space reasoning, resulting in more effective
exploration and object localization. We validate OpenGuide in simulated and
real-world experiments, demonstrating substantial improvements in task success
rate and search efficiency over prior methods. This work establishes a
foundation for scalable, human-centered robotic assistance in assisted living
environments.
Authors' comments: 32 pages, 6 figures
Yicong Zhao, Shisong Chen, Jiacheng Zhang, Zhixu Li
Recent advances in large language models (LLMs) have demonstrated impressive
capabilities in code-related tasks, such as code generation and automated
program repair. Despite their promising performance, most existing approaches
for code repair suffer from high training costs or computationally expensive
inference. Retrieval-augmented generation (RAG), with its efficient in-context
learning paradigm, offers a more scalable alternative. However, conventional
retrieval strategies, which are often based on holistic code-text embeddings,
fail to capture the structural intricacies of code, resulting in suboptimal
retrieval quality. To address the above limitations, we propose ReCode, a
fine-grained retrieval-augmented in-context learning framework designed for
accurate and efficient code repair. Specifically, ReCode introduces two key
innovations: (1) an algorithm-aware retrieval strategy that narrows the search
space using preliminary algorithm type predictions; and (2) a modular
dual-encoder architecture that separately processes code and textual inputs,
enabling fine-grained semantic matching between input and retrieved contexts.
Furthermore, we propose RACodeBench, a new benchmark constructed from
real-world user-submitted buggy code, which addresses the limitations of
synthetic benchmarks and supports realistic evaluation. Experimental results on
RACodeBench and competitive programming datasets demonstrate that ReCode
achieves higher repair accuracy with significantly reduced inference cost,
highlighting its practical value for real-world code repair scenarios.
Authors' comments: Accepted by CIKM 2025
Juhyeon Lee, Wonduk Seo, Hyunjin An, Seunghyun Lee, Yi Bu
Automatic prompt optimization has recently emerged as a strategy for
improving the quality of prompts used in Large Language Models (LLMs), with the
goal of generating more accurate and useful responses. However, most prior work
focuses on direct prompt refinement or model fine-tuning, overlooking the
potential of leveraging LLMs' inherent reasoning capability to learn from
contrasting examples. In this paper, we present Contrastive Reasoning Prompt
Optimization (CRPO), a novel framework that formulates prompt optimization as a
retrieval augmented reasoning process. Our approach retrieves top k reference
prompts from the HelpSteer2 dataset, an open-source collection annotated for
helpfulness, correctness, coherence, complexity, and verbosity, and constructs
two complementary optimization paradigms: (1) tiered contrastive reasoning,
where the LLM compares high, medium, and low quality prompts to refine its own
generation through reflective reasoning, and (2) multi-metric contrastive
reasoning, where the LLM analyzes the best prompts along each evaluation
dimension and integrates their strengths into an optimized prompt. By
explicitly contrasting high and low quality exemplars, CRPO enables the model
to deduce why certain prompts succeed while others fail, thereby achieving more
robust and interpretable optimization. Experimental results on the HelpSteer2
benchmark demonstrate that CRPO significantly outperforms baselines. Our
findings highlight the promise of contrastive, retrieval-augmented reasoning
for advancing automatic prompt optimization.
Authors' comments: Preprint
Che Liu, Zheng Jiang, Chengyu Fang, Heng Guo, Yan-Jie Zhou, Jiaqi Qu, Le Lu, Minfeng Xu
Medical image retrieval is essential for clinical decision-making and
translational research, relying on discriminative visual representations. Yet,
current methods remain fragmented, relying on separate architectures and
training strategies for 2D, 3D, and video-based medical data. This
modality-specific design hampers scalability and inhibits the development of
unified representations. To enable unified learning, we curate a large-scale
hybrid-modality dataset comprising 867,653 medical imaging samples, including
2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging
this dataset, we train M3Ret, a unified visual encoder without any
modality-specific customization. It successfully learns transferable
representations using both generative (MAE) and contrastive (SimDINO)
self-supervised learning (SSL) paradigms. Our approach sets a new
state-of-the-art in zero-shot image-to-image retrieval across all individual
modalities, surpassing strong baselines such as DINOv3 and the text-supervised
BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired
data, and the model generalizes to unseen MRI tasks, despite never observing
MRI during pretraining, demonstrating the generalizability of purely visual
self-supervision to unseen modalities. Comprehensive analyses further validate
the scalability of our framework across model and data sizes. These findings
deliver a promising signal to the medical imaging community, positioning M3Ret
as a step toward foundation models for visual SSL in multimodal medical image
understanding.
Authors' comments: Technical Report
Yunqing Liu, Nan Zhang, Zhiming Tan
Effective specification-aware part retrieval within complex CAD assemblies is essential for automated design verification and downstream engineering tasks. However, directly using LLMs/VLMs to this task presents some challenges: the input sequences may exceed model token limits, and even after processing, performance remains unsatisfactory. Moreover, fine-tuning LLMs/VLMs requires significant computational resources, and for many high-performing general-use proprietary models (e.g., GPT or Gemini), fine-tuning access is not available. In this paper, we propose a novel part retrieval framework that requires no extra training, but using Error Notebooks + RAG for refined prompt engineering to help improve the existing general model's retrieval performance. The construction of Error Notebooks consists of two steps: (1) collecting historical erroneous CoTs and their incorrect answers, and (2) connecting these CoTs through reflective corrections until the correct solutions are obtained. As a result, the Error Notebooks serve as a repository of tasks along with their corrected CoTs and final answers. RAG is then employed to retrieve specification-relevant records from the Error Notebooks and incorporate them into the inference process. Another major contribution of our work is a human-in-the-loop CAD dataset, which is used to evaluate our method. In addition, the engineering value of our novel framework lies in its ability to effectively handle 3D models with lengthy, non-natural language metadata. Experiments with proprietary models, including GPT-4o and the Gemini series, show substantial gains, with GPT-4o (Omni) achieving up to a 23.4% absolute accuracy improvement on the human preference dataset. Moreover, ablation studies confirm that CoT reasoning provides benefits especially in challenging cases with higher part counts (>10).
Yunus Serhat Bicakci, Joseph Shingleton, Anahid Basiri
Street-level geolocalization from images is crucial for a wide range of essential applications and services, such as navigation, location-based recommendations, and urban planning. With the growing popularity of social media data and cameras embedded in smartphones, applying traditional computer vision techniques to localize images has become increasingly challenging, yet highly valuable. This paper introduces a novel approach that integrates open-weight and publicly accessible multimodal large language models with retrieval-augmented generation. The method constructs a vector database using the SigLIP encoder on two large-scale datasets (EMP-16 and OSV-5M). Query images are augmented with prompts containing both similar and dissimilar geolocation information retrieved from this database before being processed by the multimodal large language models. Our approach has demonstrated state-of-the-art performance, achieving higher accuracy compared against three widely used benchmark datasets (IM2GPS, IM2GPS3k, and YFCC4k). Importantly, our solution eliminates the need for expensive fine-tuning or retraining and scales seamlessly to incorporate new data sources. The effectiveness of retrieval-augmented generation-based multimodal large language models in geolocation estimation demonstrated by this paper suggests an alternative path to the traditional methods which rely on the training models from scratch, opening new possibilities for more accessible and scalable solutions in GeoAI.
Jiawei Cao, Jie Ouyang, Zhaomeng Zhou, Mingyue Cheng, Yupeng Li, Jiaxian Yan, Qi Liu
Temporal Information Retrieval (TIR) is a critical yet unresolved task for modern search systems, retrieving documents that not only satisfy a query's information need but also adhere to its temporal constraints. This task is shaped by two challenges: Relevance, ensuring alignment with the query's explicit temporal requirements, and Recency, selecting the freshest document among multiple versions. Existing methods often address the two challenges in isolation, relying on brittle heuristics that fail in scenarios where temporal requirements and staleness resistance are intertwined. To address this gap, we introduce Re2Bench, a benchmark specifically designed to disentangle and evaluate Relevance, Recency, and their hybrid combination. Building on this foundation, we propose Re3, a unified and lightweight framework that dynamically balances semantic and temporal information through a query-aware gating mechanism. On Re2Bench, Re3 achieves state-of-the-art results, leading in R@1 across all three subsets. Ablation studies with backbone sensitivity tests confirm robustness, showing strong generalization across diverse encoders and real-world settings. This work provides both a generalizable solution and a principled evaluation suite, advancing the development of temporally aware retrieval systems. Re3 and Re2Bench are available online: https://anonymous.4open.science/r/Re3-0C5A
Thinh-Phuc Nguyen, Thanh-Hai Nguyen, Gia-Huy Dinh, Lam-Huy Nguyen, Minh-Triet Tran, Trung-Nghia Le
Image captioning systems often produce generic descriptions that fail to
capture event-level semantics which are crucial for applications like news
reporting and digital archiving. We present ReCap, a novel pipeline for
event-enriched image retrieval and captioning that incorporates broader
contextual information from relevant articles to generate narrative-rich,
factually grounded captions. Our approach addresses the limitations of standard
vision-language models that typically focus on visible content while missing
temporal, social, and historical contexts. ReCap comprises three integrated
components: (1) a robust two-stage article retrieval system using DINOv2
embeddings with global feature similarity for initial candidate selection
followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a
context extraction framework that synthesizes information from article
summaries, generic captions, and original source metadata; and (3) a large
language model-based caption generation system with Semantic Gaussian
Normalization to enhance fluency and relevance. Evaluated on the OpenEvents V1
dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a
strong overall score of 0.54666, ranking 2nd on the private test set. These
results highlight ReCap's effectiveness in bridging visual perception with
real-world knowledge, offering a practical solution for context-aware image
understanding in high-stakes domains. The code is available at
https://github.com/Noridom1/EVENTA2025-Event-Enriched-Image-Captioning.
Authors' comments: ACM Multimedia 2025