Tinh-Anh Nguyen-Nhu, Huu-Loc Tran, Nguyen-Khang Le, Minh-Nhat Nguyen, Tien-Huy Nguyen, Hoang-Long Nguyen-Huu, Huu-Phong Phan-Nguyen, Huy-Thach Pham et al.
The exponential growth of digital video content has posed critical challenges in moment-level video retrieval, where existing methodologies struggle to efficiently localize specific segments within an expansive video corpus. Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content. In this paper, we address these limitations through a novel Interactive Video Corpus Moment Retrieval framework that integrates a SuperGlobal Reranking mechanism and Adaptive Bidirectional Temporal Search (ABTS), strategically optimizing query similarity, temporal stability, and computational resources. By preprocessing a large corpus of videos using a keyframe extraction model and deduplication technique through image hashing, our approach provides a scalable solution that significantly reduces storage requirements while maintaining high localization precision across diverse video repositories.
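The abstract does not specify which hashing scheme is used for keyframe deduplication. As a rough illustration only, the sketch below uses a simple average hash and a Hamming-distance threshold; the function names, the `max_distance` parameter, and the assumption that frames arrive as small grayscale thumbnails are all hypothetical and not taken from the paper.

```python
from typing import List, Sequence

# Assumption: each keyframe is already downsampled to a small grayscale grid (values 0-255).
Frame = Sequence[Sequence[int]]


def average_hash(frame: Frame) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(frames: List[Frame], max_distance: int = 1) -> List[int]:
    """Return indices of frames whose hash is not within max_distance of a kept frame."""
    kept_hashes: List[int] = []
    kept_indices: List[int] = []
    for i, frame in enumerate(frames):
        h = average_hash(frame)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept_hashes.append(h)
            kept_indices.append(i)
    return kept_indices


# Toy example: frame_b is a slightly noisy copy of frame_a, frame_c is different.
frame_a = [[0, 0], [255, 255]]
frame_b = [[10, 5], [250, 240]]
frame_c = [[255, 0], [255, 0]]
kept = deduplicate([frame_a, frame_b, frame_c])
```

Only the first frame of each near-duplicate group is kept, which is the usual way hashing reduces storage before indexing.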
Muhammad Imam Luthfi Balaka, David Alexander, Qiming Wang, Yue Gong, Adila Krisnadhi, Raul Castro Fernandez
Finding relevant tables among databases, lakes, and repositories is the first
step in extracting value from data. Such a task remains difficult because
assessing whether a table is relevant to a problem does not always depend only
on its content but also on the context, which is usually tribal knowledge known
to the individual or team. While tools like data catalogs and academic data
discovery systems target this problem, they rely on keyword search or more
complex interfaces, limiting non-technical users' ability to find relevant
data. The advent of large language models (LLMs) offers a unique opportunity
for users to ask questions directly in natural language, making dataset
discovery more intuitive, accessible, and efficient.
In this paper, we introduce Pneuma, a retrieval-augmented generation (RAG)
system designed to efficiently and effectively discover tabular data. Pneuma
leverages large language models (LLMs) for both table representation and table
retrieval. For table representation, Pneuma preserves schema and row-level
information to ensure comprehensive data understanding. For table retrieval,
Pneuma augments LLMs with traditional information retrieval techniques, such as
full-text and vector search, harnessing the strengths of both to improve
retrieval performance. To evaluate Pneuma, we generate comprehensive benchmarks
that simulate table discovery workloads on six real-world datasets including
enterprise data, scientific databases, warehousing data, and open data. Our
results demonstrate that Pneuma outperforms widely used table search systems
(such as full-text search and state-of-the-art RAG systems) in accuracy and
resource efficiency.
Authors' comments: SIGMOD 2025 Paper
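The abstract says that full-text and vector search are combined but does not say how. One common way to merge two ranked lists is reciprocal rank fusion (RRF); the sketch below shows that technique as an assumption, not as Pneuma's actual method. The table identifiers and the constant `k=60` are illustrative.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by identifier for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical results from a full-text index and a vector index.
full_text_hits = ["t1", "t2", "t3"]
vector_hits = ["t2", "t3", "t4"]
fused = reciprocal_rank_fusion([full_text_hits, vector_hits])
```

Tables returned by both retrievers rise to the top, while tables found by only one retriever remain in the fused list.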
Peiru Yang, Xintian Li, Zhiyang Hu, Jiapeng Wang, Jinhua Yin, Huili Wang, Lizhi He, Shuai Yang et al.
Retrieval-augmented generation (RAG) methods can enhance the performance of
LLMs by incorporating retrieved knowledge chunks into the generation process.
In general, the retrieval and generation steps usually have different
requirements for these knowledge chunks. The retrieval step benefits from
comprehensive information to improve retrieval accuracy, whereas excessively
long chunks may introduce redundant contextual information, thereby diminishing
both the effectiveness and efficiency of the generation process. However,
existing RAG methods typically employ identical representations of knowledge
chunks for both retrieval and generation, resulting in suboptimal performance.
In this paper, we propose a heterogeneous RAG framework (\myname) that
decouples the representations of knowledge chunks for retrieval and generation,
thereby enhancing the LLMs in both effectiveness and efficiency. Specifically,
we use short chunks to represent knowledge for the generation step and
use each chunk together with its contextual information from
multi-granular views to enhance retrieval accuracy. We further introduce an
adaptive prompt tuning method that adapts the retrieval model to the
heterogeneous retrieval-augmented generation process. Extensive experiments
demonstrate that \myname achieves significant improvements compared to
baselines.
Authors' comments: 10 pages, 5 figures
Mandeep Rathee, V Venktesh, Sean MacAvaney, Avishek Anand
Advanced relevance models, such as those that use large language models
(LLMs), provide highly accurate relevance estimations. However, their
computational costs make them infeasible for processing large document corpora.
To address this, retrieval systems often employ a telescoping approach, where
computationally efficient but less precise lexical and semantic retrievers
filter potential candidates for further ranking. However, this approach heavily
depends on the quality of early-stage retrieval, which can potentially exclude
relevant documents early in the process. In this work, we propose a novel
paradigm for re-ranking called online relevance estimation that continuously
updates relevance estimates for a query throughout the ranking process. Instead
of re-ranking a fixed set of top-k documents in a single step, online relevance
estimation iteratively re-scores smaller subsets of the most promising
documents while adjusting relevance scores for the remaining pool based on the
estimations from the final model using an online bandit-based algorithm. This
dynamic process mitigates the recall limitations of telescoping systems by
re-prioritizing documents initially deemed less relevant by earlier stages --
including those completely excluded by earlier-stage retrievers. We validate
our approach on TREC benchmarks under two scenarios: hybrid retrieval and
adaptive retrieval. Experimental results demonstrate that our method is
sample-efficient and significantly improves recall, highlighting the
effectiveness of our online relevance estimation framework for modern search
systems.
Authors' comments: Accepted for publication at SIGIR'25. 11 pages, 5 figures, 4 tables
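The abstract describes a bandit-based procedure but gives no algorithmic detail. To make the general idea concrete, the sketch below applies the standard UCB1 rule to groups of candidate documents, spending a fixed budget of expensive scorer calls where observed relevance is highest. This is a stand-in under stated assumptions, not the authors' algorithm; the grouping into arms, the `expensive_score` callback, and the toy relevance values are all hypothetical.

```python
import math
from typing import Callable, Dict, List


def bandit_rerank(
    arms: Dict[str, List[str]],
    expensive_score: Callable[[str], float],
    budget: int,
) -> List[str]:
    """Spend `budget` expensive calls using UCB1 over arms (groups of documents)."""
    pulls = {a: 0 for a in arms}
    total = {a: 0.0 for a in arms}
    cursor = {a: 0 for a in arms}
    scored: Dict[str, float] = {}

    for t in range(1, budget + 1):
        available = [a for a in arms if cursor[a] < len(arms[a])]
        if not available:
            break

        def ucb(arm: str) -> float:
            if pulls[arm] == 0:
                return float("inf")  # try every arm at least once
            mean = total[arm] / pulls[arm]
            return mean + math.sqrt(2.0 * math.log(t) / pulls[arm])

        arm = max(available, key=ucb)
        doc = arms[arm][cursor[arm]]
        cursor[arm] += 1
        reward = expensive_score(doc)  # assumed to lie in [0, 1]
        scored[doc] = reward
        pulls[arm] += 1
        total[arm] += reward

    return sorted(scored, key=lambda d: (-scored[d], d))


# Toy setup: the second arm holds documents the first-stage retriever missed.
toy_relevance = {"a": 0.1, "b": 0.0, "c": 0.1, "x": 0.9, "y": 1.0, "z": 0.8}
toy_arms = {"first_stage": ["a", "b", "c"], "excluded": ["x", "y", "z"]}
ranking = bandit_rerank(toy_arms, toy_relevance.__getitem__, budget=4)
```

In this toy run, most of the budget shifts to the arm whose documents turn out to be relevant, which mirrors the recall argument made in the abstract.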
Hang Ni, Fan Liu, Xinyu Ma, Lixin Su, Shuaiqiang Wang, Dawei Yin, Hui Xiong, Hao Liu
Large language models (LLMs) have shown promise in automating travel planning, yet they often fall short in addressing nuanced spatiotemporal rationality. While existing benchmarks focus on basic plan validity, they neglect critical aspects such as route efficiency, POI appeal, and real-time adaptability. This paper introduces TP-RAG, the first benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning. Our dataset includes 2,348 real-world travel queries, 85,575 fine-grained annotated POIs, and 18,784 high-quality travel trajectory references sourced from online tourist documents, enabling dynamic and context-aware planning. Through extensive experiments, we reveal that integrating reference trajectories significantly improves the spatial efficiency and POI rationality of travel plans, while challenges persist in universality and robustness due to conflicting references and noisy data. To address these issues, we propose EvoRAG, an evolutionary framework that effectively combines diverse retrieved trajectories with LLMs' intrinsic reasoning. EvoRAG achieves state-of-the-art performance, improving spatiotemporal compliance and reducing commonsense violations compared to ground-up and retrieval-augmented baselines. Our work underscores the potential of hybridizing Web knowledge with LLM-driven optimization, paving the way for more reliable and adaptive travel planning agents.
Shiyi Ding, Ying Chen
Recent advances in large language models (LLMs) provide new opportunities for
context understanding in virtual reality (VR). However, VR contexts are often
highly localized and personalized, limiting the effectiveness of
general-purpose LLMs. To address this challenge, we present RAG-VR, the first
3D question-answering system for VR that incorporates retrieval-augmented
generation (RAG), which augments an LLM with external knowledge retrieved from
a localized knowledge database to improve the answer quality. RAG-VR includes a
pipeline for extracting comprehensive knowledge about virtual environments and
user conditions for accurate answer generation. To ensure efficient retrieval,
RAG-VR offloads the retrieval process to a nearby edge server and uses only
essential information during retrieval. Moreover, we train the retriever to
effectively distinguish among relevant, irrelevant, and hard-to-differentiate
information in relation to questions. RAG-VR improves answer accuracy by
17.9%-41.8% and reduces end-to-end latency by 34.5%-47.3% compared with two
baseline systems.
Authors' comments: Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D
User Interfaces (VR), March 2025
Zheng Zhang, Ning Li, Qi Liu, Rui Li, Weibo Gao, Qingyang Mao, Zhenya Huang, Baosheng Yu et al.
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by
retrieving relevant documents from external knowledge sources. By referencing
this external knowledge, RAG effectively reduces the generation of factually
incorrect content and addresses hallucination issues within LLMs. Recently,
there has been growing attention to improving the performance and efficiency of
RAG systems from various perspectives. While these advancements have yielded
significant results, the application of RAG in domains with considerable
societal implications raises a critical question about fairness: What impact
does the introduction of the RAG paradigm have on the fairness of LLMs? To
address this question, we conduct extensive experiments by varying the LLMs,
retrievers, and retrieval sources. Our experimental analysis reveals that the
scale of the LLMs plays a significant role in influencing fairness outcomes
within the RAG framework. When the model scale is smaller than 8B, the
integration of retrieval mechanisms often exacerbates unfairness in small-scale
LLMs (e.g., LLaMA3.2-1B, Mistral-7B, and LLaMA3-8B). To mitigate the fairness
issues introduced by RAG for small-scale LLMs, we propose two approaches,
FairFT and FairFilter. Specifically, in FairFT, we align the retriever with the
LLM in terms of fairness, enabling it to retrieve documents that facilitate
fairer model outputs. In FairFilter, we propose a fairness filtering mechanism
to filter out biased content after retrieval. Finally, we validate our proposed
approaches on real-world datasets, demonstrating their effectiveness in
improving fairness while maintaining performance.
Authors' comments: 12 pages
Yixiang Chen, Penglei Sun, Xiang Li, Xiaowen Chu
In recent years, the accurate and rapid deployment of medical large language models (LLMs) has become a significant trend. In this context, retrieval-augmented generation (RAG) has garnered significant attention because it supports rapid deployment and privacy protection. However, existing medical RAG frameworks still have shortcomings. Most are designed for single-round question answering tasks and are not suitable for multi-round diagnostic dialogue. Moreover, existing medical multi-round RAG frameworks do not consider the interconnections between potential diseases and therefore cannot inquire as precisely as a doctor would. To address these issues, we propose a Multi-Round Diagnostic RAG (MRD-RAG) framework that mimics the doctor's diagnostic process. This RAG framework can analyze diagnosis information of potential diseases and accurately conduct multi-round diagnosis like a doctor. To evaluate the effectiveness of our proposed framework, we conduct experiments on two modern medical datasets and two traditional Chinese medicine datasets, with evaluations by GPT and human doctors on different methods. The results indicate that our RAG framework can significantly enhance the diagnostic performance of LLMs, highlighting the potential of our approach in medical diagnosis. The code and data can be found on our project website https://github.com/YixiangCh/MRD-RAG/tree/master.
Zehong Ma, Hao Chen, Wei Zeng, Limin Su, Shiliang Zhang
Fine-grained text-to-image retrieval aims to retrieve a fine-grained target
image with a given text query. Existing methods typically assume that each
training image is accurately depicted by its textual descriptions. However,
textual descriptions can be ambiguous and fail to depict discriminative visual
details in images, leading to inaccurate representation learning. To alleviate
the effects of text ambiguity, we propose a Multi-Modal Reference learning
framework to learn robust representations. We first propose a multi-modal
reference construction module to aggregate all visual and textual details of
the same object into a comprehensive multi-modal reference. The multi-modal
reference hence facilitates the subsequent representation learning and
retrieval similarity computation. Specifically, a reference-guided
representation learning module is proposed to use multi-modal references to
learn more accurate visual and textual representations. Additionally, we
introduce a reference-based refinement method that employs the object
references to compute a reference-based similarity that refines the initial
retrieval results. Extensive experiments are conducted on five fine-grained
text-to-image retrieval datasets for different text-to-image retrieval tasks.
The proposed method has achieved superior performance over state-of-the-art
methods. For instance, on the text-to-person image retrieval dataset RSTPReid,
our method achieves a Rank1 accuracy of 56.2%, surpassing the recent CFine
by 5.6%.
Authors' comments: TMM25
Kyoyun Choi, Byungmu Yoon, Soobum Kim, Jonggwon Park
Automated radiology report generation (RRG) holds potential to reduce radiologists' workload, especially as recent advancements in large language models (LLMs) enable the development of multimodal models for chest X-ray (CXR) report generation. However, multimodal LLMs (MLLMs) are resource-intensive, requiring vast datasets and substantial computational cost for training. To address these challenges, we propose a retrieval-augmented generation approach that leverages multimodal retrieval and LLMs to generate radiology reports while mitigating hallucinations and reducing computational demands. Our method uses LLMs to extract key phrases from radiology reports, effectively focusing on essential diagnostic information. Through exploring effective training strategies, including image encoder structure search, adding noise to text embeddings, and additional training objectives, we combine complementary pre-trained image encoders and adopt contrastive learning between text and semantic image embeddings. We evaluate our approach on the MIMIC-CXR dataset, achieving state-of-the-art results on CheXbert metrics and a competitive RadGraph F1 score relative to MLLMs, without requiring LLM fine-tuning. Our method demonstrates robust generalization for multi-view RRG, making it suitable for comprehensive clinical applications.
Tian Xie, Menghui Jiang, Huanfeng Shen, Huifang Li, Chao Zeng, Jun Ma, Guanhao Zhang, Liangpei Zhang
Land surface temperature (LST) retrieval from remote sensing data is pivotal for analyzing climate processes and surface energy budgets. However, LST retrieval is an ill-posed inverse problem, which becomes particularly severe when only a single band is available. In this paper, we propose a deeply coupled framework integrating mechanistic modeling and machine learning to enhance the accuracy and generalizability of single-channel LST retrieval. Training samples are generated using a physically-based radiative transfer model and a global collection of 5810 atmospheric profiles. A physics-informed machine learning framework is proposed to systematically incorporate the first principles from classical physical inversion models into the learning workflow, with optimization constrained by radiative transfer equations. Global validation demonstrated a 30% reduction in root-mean-square error versus standalone methods. Under extreme humidity, the mean absolute error decreased from 4.87 K to 2.29 K (53% improvement). Continental-scale tests across five continents confirmed the superior generalizability of this model.
Chad Melton, Alex Sorokine, Steve Peterson
Applications of generative Large Language Models (LLMs) are rapidly expanding
across various domains, promising significant improvements in workflow
efficiency and information retrieval. However, their implementation in
specialized, high-stakes domains such as hazardous materials transportation is
challenging due to accuracy and reliability concerns. This study evaluates the
performance of three fine-tuned generative models, namely ChatGPT, Google's Vertex AI,
and ORNL Retrieval-Augmented Generation (RAG) augmented LLaMA 2 and LLaMA, in
retrieving regulatory information essential for hazardous material
transportation compliance in the United States. Utilizing approximately 40
publicly available federal and state regulatory documents, we developed 100
realistic queries relevant to route planning and permitting requirements.
Responses were qualitatively rated based on accuracy, detail, and relevance,
complemented by quantitative assessments of semantic similarity between model
outputs. Results demonstrated that the RAG-augmented LLaMA models significantly
outperformed Vertex AI and ChatGPT, providing more detailed and generally
accurate information, despite occasional inconsistencies. This research
introduces the first known application of RAG in transportation safety,
emphasizing the need for domain-specific fine-tuning and rigorous evaluation
methodologies to ensure reliability and minimize the risk of inaccuracies in
high-stakes environments.
Authors' comments: 14 pages, 3 Figures, 3 tables
Anirudhan Badrinath, Prabhat Agarwal, Laksh Bhasin, Jaewon Yang, Jiajing Xu, Charles Rosenberg
Generative retrieval methods utilize generative sequential modeling
techniques, such as transformers, to generate candidate items for recommender
systems. These methods have demonstrated promising results in academic
benchmarks, surpassing traditional retrieval models like two-tower
architectures. However, current generative retrieval methods lack the
scalability required for industrial recommender systems, and they are
insufficiently flexible to satisfy the multiple metric requirements of modern
systems. This paper introduces PinRec, a novel generative retrieval model
developed for applications at Pinterest. PinRec utilizes outcome-conditioned
generation, enabling modelers to specify how to balance various outcome
metrics, such as the number of saves and clicks, to effectively align with
business goals and user exploration. Additionally, PinRec incorporates
multi-token generation to enhance output diversity while optimizing generation.
Our experiments demonstrate that PinRec can successfully balance performance,
diversity, and efficiency, delivering a significant positive impact to users
using generative models. This paper marks a significant milestone in generative
retrieval, as it presents, to our knowledge, the first rigorous study on
implementing generative retrieval at the scale of Pinterest.
Authors' comments: Submitted to KDD ADS 2025
Yuehan Qin, Shawn Li, Yi Nian, Xinyan Velocity Yu, Yue Zhao, Xuezhe Ma
Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses. However, they can produce hallucinated outputs, especially when a user query includes one or more false premises, that is, claims that contradict established facts. Such premises can mislead LLMs into offering fabricated or misleading details. Existing approaches include pretraining, fine-tuning, and inference-time techniques that often rely on access to logits or address hallucinations after they occur. These methods tend to be computationally expensive, require extensive training data, or lack proactive mechanisms to prevent hallucination before generation, limiting their efficiency in real-time applications. We propose a retrieval-based framework that identifies and addresses false premises before generation. Our method first transforms a user's query into a logical representation, then applies retrieval-augmented generation (RAG) to assess the validity of each premise using factual sources. Finally, we incorporate the verification results into the LLM's prompt to maintain factual consistency in the final output. Experiments show that this approach effectively reduces hallucinations, improves factual accuracy, and does not require access to model logits or large-scale fine-tuning.
Zulun Zhu, Tiancheng Huang, Kai Wang, Junda Ye, Xinghe Chen, Siqiang Luo
Large language models (LLMs) struggle with factual errors during inference due to insufficient training data and a lack of up-to-date knowledge, leading to the hallucination problem. Retrieval-Augmented Generation (RAG) has gained attention as a promising solution to this limitation of LLMs: it retrieves relevant information from external sources to generate more accurate answers to questions. Given the pervasive presence of structured knowledge in external sources, considerable strides in RAG have been made by employing graph-related techniques to achieve more complex reasoning based on the topological information between knowledge entities. However, there is currently neither a unified review examining the diverse roles of graphs in RAG nor a comprehensive resource to help researchers navigate and contribute to this evolving field. This survey offers a novel perspective on the functionality of graphs within RAG and their impact on enhancing performance across a wide range of graph-structured data. It provides a detailed breakdown of the roles that graphs play in RAG, covering database construction, algorithms, pipelines, and tasks. Finally, it identifies current challenges and outlines future research directions, aiming to inspire further developments in this field. Our graph-centered analysis highlights the commonalities and differences in existing methods, setting the stage for future researchers in areas such as graph learning, database systems, and natural language processing.
Yan Zhang, Zhong Ji, Changxu Meng, Yanwei Pang, Jungong Han
Recent studies focus on Remote Sensing Image-Text Retrieval (RSITR), which aims at searching for the corresponding targets based on a given query. Among these efforts, the application of Foundation Models (FMs), such as CLIP, to the domain of remote sensing has yielded encouraging outcomes. However, existing FM-based methodologies neglect the negative impact of weakly correlated sample pairs and fail to account for the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose an approach named iEBAKER (an Improved Eliminate Before Align strategy with Keyword Explicit Reasoning framework) for RSITR. Specifically, we propose an innovative Eliminate Before Align (EBA) strategy to filter out the weakly correlated sample pairs, thereby mitigating their deviations from the optimal embedding space during alignment. Further, two specific schemes are introduced from the perspective of whether local similarity and global similarity affect each other. On this basis, we introduce an alternative Sort After Reversed Retrieval (SAR) strategy, which aims at optimizing the similarity matrix via reverse retrieval. Additionally, we incorporate a Keyword Explicit Reasoning (KER) module to facilitate the beneficial impact of subtle key concept distinctions. Without bells and whistles, our approach enables a direct transition from FM to the RSITR task, eliminating the need for additional pretraining on remote sensing data. Extensive experiments conducted on three popular benchmark datasets demonstrate that our proposed iEBAKER method surpasses the state-of-the-art models while requiring less training data. Our source code will be released at https://github.com/zhangy0822/iEBAKER.
Alfred Clemedtson, Borun Shi
Large language models have shown remarkable language processing and reasoning ability but are prone to hallucinate when asked about private data. Retrieval-augmented generation (RAG) retrieves relevant data that fit into an LLM's context window and prompts the LLM for an answer. GraphRAG extends this approach to structured Knowledge Graphs (KGs) and questions regarding entities multiple hops away. The majority of recent GraphRAG methods either overlook the retrieval step or have ad hoc retrieval processes that are abstract or inefficient. This prevents them from being adopted when the KGs are stored in graph databases supporting graph query languages. In this work, we present GraphRAFT, a retrieve-and-reason framework that finetunes LLMs to generate provably correct Cypher queries to retrieve high-quality subgraph contexts and produce accurate answers. Our method is the first such solution that can be taken off-the-shelf and used on KGs stored in native graph DBs. Benchmarks suggest that our method is sample-efficient and scales with the availability of training data. Our method achieves significantly better results than all state-of-the-art models across all four standard metrics on two challenging Q&As on large text-attributed KGs.
Jianling Lu, Mingqi Lv, Tieming Chen
The performance of large language models (LLMs) in Q&A tasks has increased
substantially through Retrieval-Augmented Generation (RAG), which brings in
external knowledge. However, the main difficulty lies in balancing the inherent
self-knowledge of LLMs with external information retrieval (IR). Current
threshold-based methods apply one-dimensional static mechanisms with a single
criterion. As a result, their IR decisions might be irrelevant to the LLMs'
response under difficult queries. To alleviate this problem, we propose
Cognitive Convection of Self-Knowledge (CCSK). Unlike traditional
methods that maintain a single fixed IR activation criterion, CCSK implements a
dynamic joint decision process via a Siamese Network module and a Response
Quality Model. The Siamese Network calculates the cosine similarity between the
current query and the historical queries. The Response Quality Model evaluates
the responses of LLMs through LightGBM. The final decision of the CCSK is
derived from the outputs of the two modules, as well as text features fused
using a multi-head attention mechanism. Extensive experiments on real-world
datasets show that CCSK significantly enhances the model's effectiveness in
information retrieval.
Authors' comments: All authors of this paper have unanimously decided to withdraw its
preprint from arXiv. As one of the authors, I cannot unilaterally decide its
retention. In accordance with the collective decision, we formally request
the complete deletion of the paper from arXiv
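The abstract states that the Siamese Network compares the current query with historical queries via cosine similarity. The sketch below shows only that similarity step on pre-computed embedding vectors; the vectors, the helper names, and the idea of returning the single closest historical query are illustrative assumptions, and the learned encoder itself is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_history(
    query_vec: Sequence[float],
    history: List[Tuple[str, Sequence[float]]],
) -> Tuple[str, float]:
    """Return the historical query closest to the current one, with its similarity."""
    text, vec = max(history, key=lambda item: cosine_similarity(query_vec, item[1]))
    return text, cosine_similarity(query_vec, vec)


# Hypothetical embeddings: "q2" points in the same direction as the current query.
history = [("q1", [0.0, 1.0]), ("q2", [2.0, 0.0])]
closest, similarity = most_similar_history([1.0, 0.0], history)
```

A score like this would be one input to the joint decision described in the abstract, alongside the Response Quality Model output.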
Dongzhuoran Zhou, Yuqicheng Zhu, Yuan He, Jiaoyan Chen, Evgeny Kharlamov, Steffen Staab
Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) is a technique
that enhances Large Language Model (LLM) inference in tasks like Question
Answering (QA) by retrieving relevant information from knowledge graphs (KGs).
However, real-world KGs are often incomplete, meaning that essential
information for answering questions may be missing. Existing benchmarks do not
adequately capture the impact of KG incompleteness on KG-RAG performance. In
this paper, we systematically evaluate KG-RAG methods under incomplete KGs by
removing triples using different methods and analyzing the resulting effects.
We demonstrate that KG-RAG methods are sensitive to KG incompleteness,
highlighting the need for more robust approaches in realistic settings.
Authors' comments: Under Review
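The abstract mentions removing triples "using different methods" without listing them. As one plausible example, the sketch below drops a uniformly random fraction of triples with a fixed seed so that the degraded KG is reproducible; the function name, the `fraction` and `seed` parameters, and the toy triples are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def remove_random_triples(
    triples: List[Triple], fraction: float, seed: int = 0
) -> Tuple[List[Triple], List[Triple]]:
    """Split triples into (kept, removed) by dropping a random fraction of them."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # local generator keeps the split reproducible
    n_remove = round(len(triples) * fraction)
    removed_idx = set(rng.sample(range(len(triples)), n_remove))
    kept = [t for i, t in enumerate(triples) if i not in removed_idx]
    removed = [t for i, t in enumerate(triples) if i in removed_idx]
    return kept, removed


# Toy knowledge graph for illustration only.
toy_kg = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("France", "part_of", "Europe"),
    ("Germany", "part_of", "Europe"),
    ("Seine", "flows_through", "Paris"),
]
kept, removed = remove_random_triples(toy_kg, fraction=0.4, seed=42)
```

Returning the removed triples as well makes it possible to check whether a question's supporting facts were among those deleted.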
Saeid Ario Vaghefi, Aymane Hachcham, Veronica Grasso, Jiska Manicus, Nakiete Msemo, Chiara Colesanti Senni, Markus Leippold
Tracking financial investments in climate adaptation is a complex and expertise-intensive task, particularly for Early Warning Systems (EWS), which lack standardized financial reporting across multilateral development banks (MDBs) and funds. To address this challenge, we introduce an LLM-based agentic AI system that integrates contextual retrieval, fine-tuning, and multi-step reasoning to extract relevant financial data, classify investments, and ensure compliance with funding guidelines. Our study focuses on a real-world application: tracking EWS investments in the Climate Risk and Early Warning Systems (CREWS) Fund. We analyze 25 MDB project documents and evaluate multiple AI-driven classification methods, including zero-shot and few-shot learning, fine-tuned transformer-based classifiers, chain-of-thought (CoT) prompting, and an agent-based retrieval-augmented generation (RAG) approach. Our results show that the agent-based RAG approach significantly outperforms other methods, achieving 87% accuracy, 89% precision, and 83% recall. Additionally, we contribute a benchmark dataset and expert-annotated corpus, providing a valuable resource for future research in AI-driven financial tracking and climate finance transparency.
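For readers less familiar with the three reported metrics, the sketch below shows how accuracy, precision, and recall are computed for a binary classifier. The labels are made-up toy data, not the paper's results, and the binary setting is an assumption; the abstract does not say how the metrics are aggregated across investment classes.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, and recall for binary labels (1 = positive, 0 = negative)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(toy_true, toy_pred)
```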