Hervé Déjean
In this paper, we investigate how efficiently large language models (LLMs) can be trained to check whether an answer is already stored in their parametric memory. We distill an LLM-as-a-judge to compute the IK (I Know) score. We found that this method is particularly beneficial in the context of retrieval-augmented generation (RAG), with a respectable accuracy of 80%. It enables a significant reduction (more than 50%) in the number of search and reranking steps required for certain datasets. We have also introduced the IK score, which serves as a useful tool for characterising datasets by facilitating the classification task. Interestingly, through the inclusion of response tokens as input, our results suggest that only about 20,000 training samples are required to achieve good performance. The central element of this work is the use of a teacher model, the LLM-as-a-judge, to generate training data. We also assess the robustness of the IK classifier by evaluating it with various types of teachers, including both string-based methods and LLMs, with the latter providing better results.
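One way to picture how an IK score could cut search and reranking steps is a simple gate in front of the retriever. The sketch below is only an illustration under stated assumptions: the function names, the toy scorer, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
def answer_with_ik_gate(question, ik_score, generate, retrieve, threshold=0.5):
    """Return (answer, used_retrieval).

    ik_score, generate and retrieve are caller-supplied callables;
    the threshold value is an assumption made for illustration.
    """
    if ik_score(question) >= threshold:
        # The model is predicted to know the answer: skip search and reranking.
        return generate(question, []), False
    passages = retrieve(question)
    return generate(question, passages), True


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_ik_score(question):
    return 0.9 if "capital of France" in question else 0.1


def toy_retrieve(question):
    return ["(retrieved passage placeholder)"]


def toy_generate(question, passages):
    return f"answer to {question!r} using {len(passages)} passage(s)"


known = answer_with_ik_gate(
    "What is the capital of France?", toy_ik_score, toy_generate, toy_retrieve
)
unknown = answer_with_ik_gate(
    "Who won the local chess open?", toy_ik_score, toy_generate, toy_retrieve
)
```

In this toy run the first question bypasses retrieval and the second does not; in practice the threshold would trade off saved retrieval calls against answer quality.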
Mohamed Basem, Islam Oshallah, Baraa Hikal, Ali Hamdi, Ammar Mohamed
Understanding the deep meanings of the Qur'an and bridging the language gap between modern standard Arabic and classical Arabic is essential to improve the question-and-answer system for the Holy Qur'an. The Qur'an QA 2023 shared task dataset had a limited number of questions with weak model retrieval. To address this challenge, this work updated the original dataset and improved the model accuracy. The original dataset, which contains 251 questions, was reviewed and expanded to 629 questions with question diversification and reformulation, leading to a comprehensive set of 1895 categorized into single-answer, multi-answer, and zero-answer types. Extensive experiments fine-tuned transformer models, including AraBERT, RoBERTa, CAMeLBERT, AraELECTRA, and BERT. The best model, AraBERT-base, achieved a MAP@10 of 0.36 and MRR of 0.59, representing improvements of 63% and 59%, respectively, compared to the baseline scores (MAP@10: 0.22, MRR: 0.37). Additionally, the dataset expansion led to improvements in handling "no answer" cases, with the proposed approach achieving a 75% success rate for such instances, compared to the baseline's 25%. These results demonstrate the effect of dataset improvement and model architecture optimization in increasing the performance of QA systems for the Holy Qur'an, with higher accuracy, recall, and precision.
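The relative gains quoted above follow directly from the baseline and best scores. The sketch below recomputes them and shows one common way to compute reciprocal rank and average precision at a cutoff. The function names and the toy ranking are illustrative assumptions, and the normalisation used for AP@k is one convention among several, not necessarily the one used by the shared task scorer.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k, normalised by min(k, number of relevant items)."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(k, len(relevant_ids))


map_gain = relative_gain(0.22, 0.36)  # MAP@10: baseline vs. best model
mrr_gain = relative_gain(0.37, 0.59)  # MRR: baseline vs. best model

# Toy ranking with hypothetical passage ids.
rr = reciprocal_rank(["a", "b", "c"], {"b"})
ap = average_precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=10)
```

MAP@10 and MRR are then the means of these per-question values over the whole question set.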
Zelong Sun, Dong Jing, Guoxing Yang, Nanyi Fei, Zhiwu Lu
Composed Image Retrieval (CIR) aims to retrieve target images from a candidate
set using a hybrid-modality query consisting of a reference image and a
relative caption that describes the user intent. Recent studies attempt to
utilize Vision-Language Pre-training Models (VLPMs) with various fusion
strategies for addressing the task. However, these methods typically fail to
simultaneously meet two key requirements of CIR: comprehensively extracting
visual information and faithfully following the user intent. In this work, we
propose CIR-LVLM, a novel framework that leverages the large vision-language
model (LVLM) as the powerful user intent-aware encoder to better meet these
requirements. Our motivation is to explore the advanced reasoning and
instruction-following capabilities of LVLM for accurately understanding and
responding to the user intent. Furthermore, we design a novel hybrid intent
instruction module to provide explicit intent guidance at two levels: (1) The
task prompt clarifies the task requirement and assists the model in discerning
user intent at the task level. (2) The instance-specific soft prompt, which is
adaptively selected from the learnable prompt pool, enables the model to better
comprehend the user intent at the instance level compared to a universal prompt
for all instances. CIR-LVLM achieves state-of-the-art performance across three
prominent benchmarks with acceptable inference efficiency. We believe this
study provides fundamental insights into CIR-related fields.
Authors' comments: Accepted by AAAI 2025
Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha
Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. Extensive results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12-20%.
Haoyu Jiang, Zhi-Qi Cheng, Gabriel Moreira, Jiawen Zhu, Jingdong Sun, Bukun Ren, Jun-Yan He, Qi Dai et al.
Universal Cross-Domain Retrieval (UCDR) retrieves relevant images from unseen
domains and classes without semantic labels, ensuring robust generalization.
Existing methods commonly employ prompt tuning with pre-trained vision-language
models but are inherently limited by static prompts, reducing adaptability. We
propose UCDR-Adapter, which enhances pre-trained models with adapters and
dynamic prompt generation through a two-phase training strategy. First, Source
Adapter Learning integrates class semantics with domain-specific visual
knowledge using a Learnable Textual Semantic Template and optimizes Class and
Domain Prompts via momentum updates and dual loss functions for robust
alignment. Second, Target Prompt Generation creates dynamic prompts by
attending to masked source prompts, enabling seamless adaptation to unseen
domains and classes. Unlike prior approaches, UCDR-Adapter dynamically adapts
to evolving data distributions, enhancing both flexibility and generalization.
During inference, only the image branch and generated prompts are used,
eliminating reliance on textual inputs for highly efficient retrieval.
Extensive benchmark experiments show that UCDR-Adapter consistently outperforms
ProS in most cases and other state-of-the-art methods on UCDR, U(c)CDR, and
U(d)CDR settings.
Authors' comments: Accepted to WACV 2025. Project link:
https://github.com/fine68/UCDR2024
Yujin Wang, Quanfeng Liu, Jiaqi Fan, Jinlong Hong, Hongqing Chu, Mengjian Tian, Bingzhao Gao, Hong Chen
Understanding and addressing corner cases is essential for ensuring the
safety and reliability of autonomous driving systems. Vision-language models
(VLMs) play a crucial role in enhancing scenario comprehension, yet they face
significant challenges, such as hallucination and insufficient real-world
grounding, which compromise their performance in critical driving scenarios. In
this work, RAC3, a novel framework designed to enhance the performance of VLMs
in corner case comprehension, is proposed. RAC3 integrates a frequency-spatial
fusion (FSF) image encoder, a cross-modal alignment training method for
embedding models with hard and semi-hard negative mining, and a fast querying
and retrieval pipeline based on K-Means clustering and hierarchical navigable
small world (HNSW) indexing. A multimodal chain-of-thought (CoT) prompting
strategy to guide analogical reasoning and reduce hallucinations during
inference is introduced. Moreover, an update mechanism is integrated into RAC3
to ensure continual learning within the framework. Extensive experiments on the
CODA and nuScenes datasets demonstrate that RAC3 significantly improves corner
case comprehension across multiple downstream tasks. Compared to prior
state-of-the-art methods, RAC3 achieves the highest final score of 74.46 on the
CODA-LM benchmark and shows consistent performance gains when integrated with
end-to-end frameworks like DriveLM. These results demonstrate the effectiveness
of retrieval-augmented strategies and cross-modal alignment for safer and more
interpretable autonomous driving.
Authors' comments: 14 pages, 7 figures
Yash Malviya, Karan Dhingra, Maneesh Singh
Regulatory documents are rich in nuanced terminology and specialized semantics. Frozen retrieval-augmented generators (FRAG systems), which rely on pre-trained (i.e., frozen) components, consequently face challenges with both retriever and answering performance. We present a system that adapts the retriever performance to the target domain using a multi-stage tuning (MST) strategy. Our retrieval approach, called MST-R, (a) first fine-tunes the encoders used in vector stores using hard negative mining, (b) then uses a hybrid retriever that combines sparse and dense retrievers using reciprocal rank fusion, and (c) finally adapts the cross-attention encoder by fine-tuning it on only the top-k retrieved results. We benchmark the system performance on the dataset released for the RIRAG challenge (as part of the RegNLP workshop at COLING 2025). We achieve significant performance gains, obtaining a top rank on the RegNLP challenge leaderboard. We also show that a trivial answering approach games the RePASs metric, outscoring all baselines and a pre-trained Llama model. Analyzing this anomaly, we present important takeaways for future research.
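The hybrid retrieval step can be illustrated with a minimal sketch of reciprocal rank fusion, in which each document receives the sum of 1 / (k + rank) over the input rankings. The constant k = 60 is a widely used default and is an assumption here, since the abstract does not state the value used; the document ids and the two toy rankings are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k = 60 is an assumed default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse_ranking = ["d1", "d2", "d3"]  # e.g. from a sparse (lexical) retriever
dense_ranking = ["d3", "d1", "d4"]   # e.g. from a dense (embedding) retriever
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Because the fusion uses ranks only, the sparse and dense scores never need to be calibrated against each other.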
Hyeonseok Lim, Dongjae Shin, Seohyun Song, Inho Won, Minjun Kim, Junghun Yuk, Haneol Jang, KyungTae Lim
We propose VLR-Bench, a visual question answering (VQA) benchmark for
evaluating vision language models (VLMs) based on retrieval augmented
generation (RAG). Unlike existing evaluation datasets for external
knowledge-based VQA, the proposed VLR-Bench includes five input passages. This
allows testing of the ability to determine which passage is useful for
answering a given query, a capability lacking in previous research. In this
context, we constructed a dataset of 32,000 automatically generated
instruction-following examples, which we denote as VLR-IF. This dataset is
specifically designed to enhance the RAG capabilities of VLMs by enabling them
to learn how to generate appropriate answers based on input passages. We
evaluated the validity of the proposed benchmark and training data and verified
its performance using the state-of-the-art Llama3-based VLM, the Llava-Llama-3
model. The proposed VLR-Bench and VLR-IF datasets are publicly available
online.
Authors' comments: The 31st International Conference on Computational Linguistics
(COLING 2025), 19 pages
Xiao Zhang, Qianru Meng, Johan Bos
Open-domain semantic parsing remains a challenging task, as models often rely
on heuristics and struggle to handle unseen concepts. In this paper, we
investigate the potential of large language models (LLMs) for this task and
introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective
approach that integrates external lexical knowledge into the parsing process.
Our experiments not only show that LLMs outperform previous encoder-decoder
baselines for semantic parsing, but that RASP further enhances their ability to
predict unseen concepts, nearly doubling the performance of previous models on
out-of-distribution concepts. These findings highlight the promise of
leveraging large language models and retrieval mechanisms for robust and
open-domain semantic parsing.
Authors' comments: Submitted to ARR
Iman Munire Bilal, Zheng Fang, Miguel Arana-Catania, Felix-Anselm van Lier, Juliana Outes Velarde, Harry Bregazzi, Eleanor Carter, Mara Airoldi et al.
As academic literature proliferates, traditional review methods are increasingly challenged by the sheer volume and diversity of available research. This article presents a study that aims to address these challenges by enhancing the efficiency and scope of systematic reviews in the social sciences through advanced machine learning (ML) and natural language processing (NLP) tools. In particular, we focus on automating stages within the systematic reviewing process that are time-intensive and repetitive for human annotators and which lend themselves to immediate scalability through tools such as information retrieval and summarisation guided by expert advice. The article concludes with a summary of lessons learnt regarding the integrated approach towards systematic reviews and future directions for improvement, including explainability.
Nikolay Banar, Ehsan Lotfi, Walter Daelemans
Zero-shot evaluation of information retrieval (IR) models is often performed
using BEIR, a large and heterogeneous benchmark composed of multiple datasets,
covering different retrieval tasks across various domains. Although BEIR has
become a standard benchmark for the zero-shot setup, its exclusively English
content reduces its utility for underrepresented languages in IR, including
Dutch. To address this limitation and encourage the development of Dutch IR
models, we introduce BEIR-NL by automatically translating the publicly
accessible BEIR datasets into Dutch. Using BEIR-NL, we evaluated a wide range
of multilingual dense ranking and reranking models, as well as the lexical BM25
method. Our experiments show that BM25 remains a competitive baseline, and is
only outperformed by the larger dense models trained for retrieval. When
combined with reranking models, BM25 achieves performance on par with the best
dense ranking models. In addition, we explored the impact of translation on the
data by back-translating a selection of datasets to English, and observed a
performance drop for both dense and lexical methods, indicating the limitations
of translation for creating benchmarks. BEIR-NL is publicly available on the
Hugging Face hub.
Authors' comments: To be presented at BUCC 2025 (COLING)
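For readers unfamiliar with the lexical baseline, the following is a minimal sketch of Okapi BM25 scoring over pre-tokenised documents. The parameter values (k1 = 1.5, b = 0.75), the smoothed IDF variant, and the toy Dutch sentences are assumptions made for illustration; they are not the configuration used in the BEIR-NL experiments.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    "de kat zit op de mat".split(),
    "de hond rent in het park".split(),
    "een kat en een hond spelen".split(),
]
scores = bm25_scores("kat mat".split(), docs)
```

In this toy example the first document matches both query terms and scores highest, the third matches one term, and the second scores zero. A reranking model would then reorder the top BM25 candidates.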
Yuchen Hui, Fengran Mo, Milan Mao, Jian-Yun Nie
The Recherche Appliquée en Linguistique Informatique (RALI) team participated
in the 2024 TREC Interactive Knowledge Assistance (iKAT) Track. In personalized
conversational search, effectively capturing a user's complex search intent
requires incorporating both contextual information and key elements from the
user profile into query reformulation. The user profile often contains many
relevant pieces, and each could potentially complement the user's information
needs. It is difficult to disregard any of them, whereas introducing an
excessive number of these pieces risks drifting from the original query and
hinders search performance. This is a challenge we denote as
over-personalization. To address this, we propose different strategies by
fusing ranking lists generated from the queries with different levels of
personalization.
Authors' comments: Work presented at NIST Text Retrieval Conference 2024.
https://www.nist.gov/news-events/events/2024/11/trec2024
Kartik Sharma, Peeyush Kumar, Yunqing Li
This paper presents OG-RAG, an Ontology-Grounded Retrieval Augmented Generation method designed to enhance LLM-generated responses by anchoring retrieval processes in domain-specific ontologies. While LLMs are widely used for tasks like question answering and search, they struggle to adapt to specialized knowledge, such as industrial workflows or knowledge work, without expensive fine-tuning or suboptimal retrieval methods. Existing retrieval-augmented models, such as RAG, offer improvements but fail to account for structured domain knowledge, leading to suboptimal context generation. Ontologies, which conceptually organize domain knowledge by defining entities and their interrelationships, offer a structured representation to address this gap. OG-RAG constructs a hypergraph representation of domain documents, where each hyperedge encapsulates clusters of factual knowledge grounded using a domain-specific ontology. An optimization algorithm then retrieves the minimal set of hyperedges that constructs a precise, conceptually grounded context for the LLM. This method enables efficient retrieval while preserving the complex relationships between entities. OG-RAG applies to domains where fact-based reasoning is essential, particularly in tasks that require workflows or decision-making steps to follow predefined rules and procedures. These include industrial workflows in healthcare, legal, and agricultural sectors, as well as knowledge-driven tasks such as news journalism, investigative research, consulting and more. Our evaluations demonstrate that OG-RAG increases the recall of accurate facts by 55% and improves response correctness by 40% across four different LLMs. Additionally, OG-RAG enables 30% faster attribution of responses to context and boosts fact-based reasoning accuracy by 27% compared to baseline methods.
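The abstract does not specify the optimization algorithm, so the sketch below swaps in a standard greedy set-cover heuristic to convey the idea of selecting a small set of hyperedges that covers the query-relevant nodes. The hyperedge names, node labels, and function name are hypothetical, and greedy selection gives an approximation rather than a guaranteed minimum.

```python
def greedy_hyperedge_cover(relevant_nodes, hyperedges):
    """Pick hyperedges until all relevant nodes are covered (greedy heuristic).

    hyperedges maps a hyperedge name to the set of nodes it contains.
    Returns (chosen hyperedge names, nodes left uncovered).
    """
    uncovered = set(relevant_nodes)
    chosen = []
    while uncovered:
        # Take the hyperedge that covers the most still-uncovered nodes;
        # sorting the names makes tie-breaking deterministic.
        best = max(sorted(hyperedges), key=lambda name: len(hyperedges[name] & uncovered))
        gained = hyperedges[best] & uncovered
        if not gained:
            break  # remaining nodes appear in no hyperedge
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


hyperedges = {
    "e1": {"crop:wheat", "soil:loam"},
    "e2": {"crop:wheat", "pest:aphid", "treatment:neem"},
    "e3": {"soil:loam", "irrigation:drip"},
}
relevant = {"crop:wheat", "pest:aphid", "soil:loam"}
chosen, leftover = greedy_hyperedge_cover(relevant, hyperedges)
print(chosen, leftover)  # ['e1', 'e2'] set()
```

The facts attached to the chosen hyperedges would then form the context passed to the LLM.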
Wanwen Chen, Adam Schmidt, Eitan Prisman, Septimiu E. Salcudean
Purpose: Intraoperative ultrasound (US) can enhance real-time visualization
in transoral robotic surgery. The surgeon creates a mental map with a
pre-operative scan. Then, a surgical assistant performs freehand US scanning
during the surgery while the surgeon operates at the remote surgical console.
Communicating the target scanning plane in the surgeon's mental map is
difficult. Automatic image retrieval can help match intraoperative images to
preoperative scans, guiding the assistant to adjust the US probe toward the
target plane. Methods: We propose a self-supervised contrastive learning
approach to match intraoperative US views to a preoperative image database. We
introduce a novel contrastive learning strategy that leverages intra-sweep
similarity and US probe location to improve feature encoding. Additionally, our
model incorporates a flexible threshold to reject unsatisfactory matches.
Results: Our method achieves 92.30% retrieval accuracy on simulated data and
outperforms state-of-the-art temporal-based contrastive learning approaches.
Our ablation study demonstrates that using probe location in the optimization
goal improves image representation, suggesting that semantic information can be
extracted from probe location. We also present our approach on real patient
data to show the feasibility of the proposed US probe localization system
despite tissue deformation from tongue retraction. Conclusion: Our contrastive
learning method, which utilizes intra-sweep similarity and US probe location,
enhances US image representation learning. We also demonstrate the feasibility
of using our image retrieval method to provide neck US localization on real
patient US after tongue retraction.
Authors' comments: 12 pages, 5 figures
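The retrieval-with-rejection step described in the Methods can be pictured as a nearest-neighbour search over embeddings that returns no match when the best similarity is too low. The sketch below is an illustration only: the cosine similarity measure, the fixed threshold of 0.8, and the two-dimensional toy embeddings are assumptions, and the paper's flexible threshold may be chosen differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_with_rejection(query, database, threshold=0.8):
    """Return (best key, similarity), or (None, similarity) if rejected."""
    best_key, best_sim = None, -1.0
    for key, embedding in database.items():
        sim = cosine(query, embedding)
        if sim > best_sim:
            best_key, best_sim = key, sim
    if best_sim < threshold:
        return None, best_sim  # no preoperative view is close enough
    return best_key, best_sim


# Hypothetical embeddings of two preoperative scan planes.
database = {"plane_a": [1.0, 0.0], "plane_b": [0.0, 1.0]}
accepted = retrieve_with_rejection([0.9, 0.1], database)
rejected = retrieve_with_rejection([1.0, 1.0], database)
```

Here the first query is matched to `plane_a`, while the second is equally far from both planes and is rejected.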
Xiaqiang Tang, Jian Li, Nan Du, Sihong Xie
Despite the superior performance of large language models (LLMs) on many NLP tasks,
they still face significant limitations in memorizing extensive world
knowledge. Recent studies have demonstrated that leveraging the
Retrieval-Augmented Generation (RAG) framework, combined with Knowledge Graphs
that encapsulate extensive factual data in a structured format, robustly
enhances the reasoning capabilities of LLMs. However, deploying such systems in
real-world scenarios presents challenges: the continuous evolution of
non-stationary environments may lead to performance degradation and user
satisfaction requires a careful balance of performance and responsiveness. To
address these challenges, we introduce a Multi-objective Multi-Armed Bandit
enhanced RAG framework, supported by multiple retrieval methods with diverse
capabilities under rich and evolving retrieval contexts in practice. Within
this framework, each retrieval method is treated as a distinct "arm". The
system utilizes real-time user feedback to adapt to dynamic environments, by
selecting the appropriate retrieval method based on input queries and the
historical multi-objective performance of each arm. Extensive experiments
conducted on two benchmark KGQA datasets demonstrate that our method
significantly outperforms baseline methods in non-stationary settings while
achieving state-of-the-art performance in stationary environments. Code and
data are available at https://github.com/FUTUREEEEEE/Dynamic-RAG.git
Authors' comments: AAAI 2025
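To make the "retrieval method as an arm" idea concrete, the sketch below uses a generic UCB1 bandit with a weighted sum over two objectives. It is not the authors' algorithm: the scalarisation weights, the arm names, the fixed feedback values, and the stationary setting are all assumptions made for illustration, and query-dependent selection is omitted.

```python
import math


class ScalarizedUCB:
    """UCB1 over retrieval methods with a weighted sum of objectives."""

    def __init__(self, arms, weights):
        self.arms = list(arms)
        self.weights = weights  # one weight per objective (assumed)
        self.counts = {arm: 0 for arm in self.arms}
        self.means = {arm: 0.0 for arm in self.arms}
        self.total = 0

    def select(self):
        # Try every arm once before using confidence bounds.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm

        def ucb(arm):
            bonus = math.sqrt(2.0 * math.log(self.total) / self.counts[arm])
            return self.means[arm] + bonus

        return max(self.arms, key=ucb)

    def update(self, arm, objectives):
        # Collapse the objective vector (e.g. accuracy, responsiveness) to a scalar.
        reward = sum(w * x for w, x in zip(self.weights, objectives))
        self.counts[arm] += 1
        self.total += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical per-arm feedback: (accuracy, responsiveness), both in [0, 1].
feedback = {"dense": (0.9, 0.2), "sparse": (0.6, 0.9)}
bandit = ScalarizedUCB(feedback.keys(), weights=(0.5, 0.5))
for _ in range(500):
    arm = bandit.select()
    bandit.update(arm, feedback[arm])
```

With equal weights the faster `sparse` arm earns the higher scalar reward and ends up selected more often. Handling non-stationary environments would require additional machinery, such as discounting old feedback.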
Yang Xiong, Ruichen Zhang, Yinqiu Liu, Dusit Niyato, Zehui Xiong, Ying-Chang Liang, Shiwen Mao
The rapid development of next-generation networking technologies underscores
their transformative role in revolutionizing modern communication systems,
enabling faster, more reliable, and highly interconnected solutions. However,
such development has also brought challenges to network optimizations. Thanks
to the emergence of Large Language Models (LLMs) in recent years, tools
including Retrieval Augmented Generation (RAG) have been developed and applied
in various fields including networking, and have shown their effectiveness.
Taking one step further, the integration of knowledge graphs into RAG
frameworks has further enhanced the performance of RAG in networking applications
such as Intent-Driven Networks (IDNs) and spectrum knowledge maps by providing
more contextually relevant responses through more accurate retrieval of related
network information. This paper introduces a RAG framework that integrates
knowledge graphs in its database and explores this framework's application in
networking. We begin by exploring RAG's applications in networking and the
limitations of conventional RAG and present the advantages that knowledge
graphs' structured knowledge representation brings to the retrieval and
generation processes. Next, we propose a detailed GraphRAG-based framework for
networking, including a step-by-step tutorial on its construction. Our
evaluation through a case study on channel gain prediction demonstrates
GraphRAG's enhanced capability in generating accurate, contextually rich
responses, surpassing traditional RAG models. Finally, we discuss key future
directions for applying knowledge-graph-empowered RAG frameworks in
networking, including robust updates, mitigation of hallucination, and enhanced
security measures for networking applications.
Authors' comments: 9 pages, 4 figures
M. Hamza Mughal, Rishabh Dabral, Merel C. J. Scholman, Vera Demberg, Christian Theobalt
Non-verbal communication often comprises semantically rich gestures that
help convey the meaning of an utterance. Producing such semantic co-speech
gestures has been a major challenge for the existing neural systems that can
generate rhythmic beat gestures, but struggle to produce semantically
meaningful gestures. Therefore, we present RAG-Gesture, a diffusion-based
gesture generation approach that leverages Retrieval Augmented Generation (RAG)
to produce natural-looking and semantically rich gestures. Our neuro-explicit
gesture generation approach is designed to produce semantic gestures grounded
in interpretable linguistic knowledge. We achieve this by using explicit domain
knowledge to retrieve exemplar motions from a database of co-speech gestures.
Once retrieved, we then inject these semantic exemplar gestures into our
diffusion-based gesture generation pipeline using DDIM inversion and retrieval
guidance at inference time without any need for training. Further, we
propose a control paradigm for guidance that allows users to modulate the
amount of influence each retrieval insertion has over the generated sequence.
Our comparative evaluations demonstrate the validity of our approach against
recent gesture generation approaches. The reader is urged to explore the
results on our project page.
Authors' comments: Preprint. Project page:
https://vcai.mpi-inf.mpg.de/projects/RAG-Gesture/
Jebish Purbey, Drishti Sharma, Siddhant Gupta, Khawaja Murad, Siddartha Pullakhandam, Ram Mohan Rao Kadiyala
This paper presents the system description of our entry for the COLING 2025
RegNLP RIRAG (Regulatory Information Retrieval and Answer Generation)
challenge, focusing on leveraging advanced information retrieval and answer
generation techniques in regulatory domains. We experimented with a combination
of embedding models, including Stella, BGE, CDE, and Mpnet, and leveraged
fine-tuning and reranking for retrieving relevant documents in top ranks. We
utilized a novel approach, LeSeR, which achieved competitive results with a
recall@10 of 0.8201 and map@10 of 0.6655 for retrievals. This work highlights
the transformative potential of natural language processing techniques in
regulatory applications, offering insights into their capabilities for
implementing a retrieval augmented generation system while identifying areas
for future improvement in robustness and domain adaptation.
Authors' comments: 5 pages, Accepted to RegNLP @ COLING 2025
Aniruddha Salve, Saba Attar, Mahesh Deshmukh, Sayali Shivpuje, Arnab Mitra Utsab
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by
incorporating external, domain-specific data into the generative process. While
LLMs are highly capable, they often rely on static, pre-trained datasets,
limiting their ability to integrate dynamic or private data. Traditional RAG
systems typically use a single-agent architecture to handle query generation,
data retrieval, and response synthesis. However, this approach becomes
inefficient when dealing with diverse data sources, such as relational
databases, document stores, and graph databases, often leading to performance
bottlenecks and reduced accuracy. This paper proposes a multi-agent RAG system
to address these limitations. Specialized agents, each optimized for a specific
data source, handle query generation for relational, NoSQL, and document-based
systems. These agents collaborate within a modular framework, with query
execution delegated to an environment designed for compatibility across various
database types. This distributed approach enhances query efficiency, reduces
token overhead, and improves response accuracy by ensuring that each agent
focuses on its specialized task. The proposed system is scalable and adaptable,
making it ideal for generative AI workflows that require integration with
diverse, dynamic, or private data sources. By leveraging specialized agents and
a modular execution environment, the system provides an efficient and robust
solution for handling complex, heterogeneous data environments in generative AI
applications.
Authors' comments: 16 pages, 3 figures. This preprint introduces a multi-agent framework
for Retrieval-Augmented Generation (RAG), enhancing Large Language Models
(LLMs) for efficient integration of diverse data sources. Relevant for
researchers in AI, ML, generative AI, and database systems
Kaustubh D. Dhole, Kai Shu, Eugene Agichtein
Computational argumentation, which involves generating answers or summaries for controversial topics like abortion bans and vaccination, has become increasingly important in today's polarized environment. Sophisticated LLM capabilities offer the potential to provide nuanced, evidence-based answers to such questions through Retrieval-Augmented Argumentation (RAArg), leveraging real-world evidence for high-quality, grounded arguments. However, evaluating RAArg remains challenging, as human evaluation is costly and difficult for complex, lengthy answers on complicated topics. At the same time, re-using existing argumentation datasets is no longer sufficient, as they lack long, complex arguments and realistic evidence from potentially misleading sources, limiting holistic evaluation of retrieval effectiveness and argument quality. To address these gaps, we investigate automated evaluation methods using multiple fine-grained LLM judges, providing better and more interpretable assessments than traditional single-score metrics and even previously reported human crowdsourcing. To validate the proposed techniques, we introduce ConQRet, a new benchmark featuring long and complex human-authored arguments on debated topics, grounded in real-world websites, allowing an exhaustive evaluation across retrieval effectiveness, argument quality, and groundedness. We validate our LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed LLM Judges and the ConQRet benchmark can enable rapid progress in computational argumentation and can be naturally extended to other complex retrieval-augmented generation tasks.