Elvis Nunez, Luca Zancato, Benjamin Bowman, Aditya Golatkar, Wei Xia, Stefano Soatto
The "state" of State Space Models (SSMs) represents their memory, which fades exponentially over an unbounded span. By contrast, Attention-based models have "eidetic" (i.e., verbatim, or photographic) memory over a finite span (context size). Hybrid architectures combine State Space layers with Attention, but still cannot recall the distant past and can access only the most recent tokens eidetically. Unlike current methods of combining SSM and Attention layers, we allow the state to be allocated based on relevancy rather than recency. In this way, for every new set of query tokens, our models can "eidetically" access tokens from beyond the Attention span of current Hybrid SSMs without requiring extra hardware resources. We introduce a method to expand the memory span of the hybrid state by "reserving" a fraction of the Attention context for tokens retrieved from arbitrarily distant in the past, thus expanding the eidetic memory span of the overall state. We call this reserved fraction of tokens the "expansion span," and the mechanism to retrieve and aggregate it "Span-Expanded Attention" (SE-Attn). To adapt Hybrid models to using SE-Attn, we propose a novel fine-tuning method that extends LoRA to Hybrid models (HyLoRA) and allows efficient adaptation on long spans of tokens. We show that SE-Attn enables us to efficiently adapt pre-trained Hybrid models on sequences of tokens up to 8 times longer than the ones used for pre-training. We show that HyLoRA with SE-Attn is cheaper and more performant than alternatives like LongLoRA when applied to Hybrid models on natural language benchmarks with long-range dependencies, such as PG-19, RULER, and other common natural language downstream tasks.
Ioannis Papadimitriou, Ilias Gialampoukidis, Stefanos Vrochidis, Ioannis Kompatsiaris
We present RAG Playground, an open-source framework for systematic evaluation
of Retrieval-Augmented Generation (RAG) systems. The framework implements and
compares three retrieval approaches: naive vector search, reranking, and hybrid
vector-keyword search, combined with ReAct agents using different prompting
strategies. We introduce a comprehensive evaluation framework with novel
metrics and provide empirical results comparing different language models
(Llama 3.1 and Qwen 2.5) across various retrieval configurations. Our
experiments demonstrate significant performance improvements through hybrid
search methods and structured self-evaluation prompting, achieving up to 72.7%
pass rate on our multi-metric evaluation framework. The results also highlight
the importance of prompt engineering in RAG systems, with our custom-prompted
agents showing consistent improvements in retrieval accuracy and response
quality.
Authors' comments: Work In Progress
Rajat Khanda
Technical troubleshooting in enterprise environments often involves navigating diverse, heterogeneous data sources to resolve complex issues effectively. This paper presents a novel agentic AI solution built on a Weighted Retrieval-Augmented Generation (RAG) Framework tailored for enterprise technical troubleshooting. By dynamically weighting retrieval sources such as product manuals, internal knowledge bases, FAQs, and troubleshooting guides based on query context, the framework prioritizes the most relevant data. For instance, it gives precedence to product manuals for SKU-specific queries while incorporating general FAQs for broader issues. The system employs FAISS for efficient dense vector search, coupled with a dynamic aggregation mechanism to seamlessly integrate results from multiple sources. A Llama-based self-evaluator ensures the contextual accuracy and confidence of the generated responses before delivering them. This iterative cycle of retrieval and validation enhances precision, diversity, and reliability in response generation. Preliminary evaluations on large enterprise datasets demonstrate the framework's efficacy in improving troubleshooting accuracy, reducing resolution times, and adapting to varied technical challenges. Future research aims to enhance the framework by integrating advanced conversational AI capabilities, enabling more interactive and intuitive troubleshooting experiences. Efforts will also focus on refining the dynamic weighting mechanism through reinforcement learning to further optimize the relevance and precision of retrieved information. By incorporating these advancements, the proposed framework is poised to evolve into a comprehensive, autonomous AI solution, redefining technical service workflows across enterprise settings.
Michael Shen, Muhammad Umar, Kiwan Maeng, G. Edward Suh, Udit Gupta
The rapid increase in the number of parameters in large language models (LLMs) has significantly increased the cost involved in fine-tuning and retraining LLMs, a necessity for keeping models up to date and improving accuracy. Retrieval-Augmented Generation (RAG) offers a promising approach to improving the capabilities and accuracy of LLMs without the necessity of retraining. Although RAG eliminates the need for continuous retraining to update model data, it incurs a trade-off in the form of slower model inference times. As a result, the use of RAG in enhancing the accuracy and capabilities of LLMs often involves diverse performance implications and trade-offs based on its design. In an effort to begin tackling and mitigating the performance penalties associated with RAG from a systems perspective, this paper introduces a detailed taxonomy and characterization of the different elements within the RAG ecosystem for LLMs that explore trade-offs in latency, throughput, and memory. Our study reveals underlying inefficiencies in RAG for systems deployment that can result in time-to-first-token (TTFT) latencies that are twice as long and unoptimized datastores that consume terabytes of storage.
Wenchao Gu, Ensheng Shi, Yanlin Wang, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Michael R. Lyu
Code retrieval, which retrieves code snippets based on users' natural language descriptions, is widely used by developers and plays a pivotal role in real-world software development. The advent of deep learning has shifted the retrieval paradigm from lexical-based matching towards leveraging deep learning models to encode source code and queries into vector representations, facilitating code retrieval according to vector similarity. Despite the effectiveness of these models, managing a large-scale code database presents significant challenges. Previous research proposes deep hashing-based methods, which generate hash codes for queries and code snippets and use Hamming distance for rapid recall of code candidates. However, this approach's reliance on linear scanning of the entire code base limits its scalability. To further improve the efficiency of large-scale code retrieval, we propose a novel approach, SECRET (Scalable and Efficient Code Retrieval via SegmEnTed deep hashing). SECRET converts long hash codes calculated by existing deep hashing approaches into several short hash code segments through an iterative training strategy. After training, SECRET recalls code candidates by looking up the hash tables for each segment, so the time complexity of recall can be greatly reduced. Extensive experimental results demonstrate that SECRET can drastically reduce the retrieval time by at least 95% while achieving comparable or even higher performance than existing deep hashing approaches. SECRET also exhibits superior performance and efficiency compared to the classical hash table-based approach known as LSH under the same number of hash tables.
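The per-segment hash table lookup described in the SECRET abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the iterative training that produces the segments is not shown, the hash codes are assumed to be plain bit strings split into equal-length pieces, and the function names (split_segments, build_tables, recall) as well as the rule "recall any snippet that matches on at least one segment" are assumptions made for illustration.

```python
from collections import defaultdict


def split_segments(code: str, num_segments: int) -> list[str]:
    """Split a binary hash code string into equal-length segments.

    Assumes len(code) is divisible by num_segments.
    """
    seg_len = len(code) // num_segments
    return [code[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]


def build_tables(codes: dict[str, str], num_segments: int) -> list[dict]:
    """Build one hash table per segment position: segment value -> snippet ids."""
    tables = [defaultdict(set) for _ in range(num_segments)]
    for snippet_id, code in codes.items():
        for table, segment in zip(tables, split_segments(code, num_segments)):
            table[segment].add(snippet_id)
    return tables


def recall(query_code: str, tables: list[dict]) -> set[str]:
    """Return snippets whose code matches the query exactly on any segment.

    Each lookup is a dictionary access, so no linear scan of the code base
    is needed (assumed recall rule, for illustration only).
    """
    candidates: set[str] = set()
    segments = split_segments(query_code, len(tables))
    for table, segment in zip(tables, segments):
        candidates |= table.get(segment, set())
    return candidates


# Hypothetical 8-bit hash codes for three code snippets.
codes = {"a": "00001111", "b": "00000000", "c": "11110000"}
tables = build_tables(codes, num_segments=2)
```

With these toy codes, a query of "00001010" recalls snippets "a" and "b" through the first segment alone. The recalled candidates would then typically be re-ranked by a more precise similarity measure.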
Hervé Déjean
In this paper, we investigate how efficiently large language models (LLMs) can be trained to check whether an answer is already stored in their parametric memory. We distill an LLM-as-a-judge to compute the IK (I Know) score. We found that this method is particularly beneficial in the context of retrieval-assisted augmented generation (RAG), with a respectable accuracy of 80%. It enables a significant reduction (more than 50%) in the number of search and reranking steps required for certain datasets. We have also introduced the IK score, which serves as a useful tool for characterising datasets by facilitating the classification task. Interestingly, through the inclusion of response tokens as input, our results suggest that only about 20,000 training samples are required to achieve good performance. The central element of this work is the use of a teacher model - the LLM as a judge - to generate training data. We also assess the robustness of the IK classifier by evaluating it with various types of teachers, including both string-based methods and LLMs, with the latter providing better results.
Mohamed Basem, Islam Oshallah, Baraa Hikal, Ali Hamdi, Ammar Mohamed
Understanding the deep meanings of the Qur'an and bridging the language gap between modern standard Arabic and classical Arabic is essential to improve the question-and-answer system for the Holy Qur'an. The Qur'an QA 2023 shared task dataset had a limited number of questions with weak model retrieval. To address this challenge, this work updated the original dataset and improved the model accuracy. The original dataset, which contains 251 questions, was reviewed and expanded to 629 questions with question diversification and reformulation, leading to a comprehensive set of 1895 questions categorized into single-answer, multi-answer, and zero-answer types. Extensive experiments fine-tuned transformer models, including AraBERT, RoBERTa, CAMeLBERT, AraELECTRA, and BERT. The best model, AraBERT-base, achieved a MAP@10 of 0.36 and MRR of 0.59, representing improvements of 63% and 59%, respectively, compared to the baseline scores (MAP@10: 0.22, MRR: 0.37). Additionally, the dataset expansion led to improvements in handling "no answer" cases, with the proposed approach achieving a 75% success rate for such instances, compared to the baseline's 25%. These results demonstrate the effect of dataset improvement and model architecture optimization in increasing the performance of QA systems for the Holy Qur'an, with higher accuracy, recall, and precision.
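The metrics quoted above can be made concrete with a small sketch. It shows one common way to compute MRR and average precision at k, and checks that the reported relative improvements follow from the stated scores. The function names are hypothetical, and the normalisation of AP@k by min(number of relevant items, k) is an assumption; the shared task's official scorer may differ, in particular for zero-answer questions.

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked list, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def average_precision_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Average precision over the top-k results (assumed normalisation)."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of new over baseline, e.g. 0.5 means +50%."""
    return (new - baseline) / baseline


# Scores taken from the abstract above.
map_gain = relative_improvement(0.36, 0.22)  # about 0.636
mrr_gain = relative_improvement(0.59, 0.37)  # about 0.595
```

Both gains are relative to the baseline score, which is why a 0.14 absolute gain in MAP@10 corresponds to roughly 63%.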
Zelong Sun, Dong Jing, Guoxing Yang, Nanyi Fei, Zhiwu Lu
Composed Image Retrieval (CIR) aims to retrieve target images from a candidate
set using a hybrid-modality query consisting of a reference image and a
relative caption that describes the user intent. Recent studies attempt to
utilize Vision-Language Pre-training Models (VLPMs) with various fusion
strategies for addressing the task. However, these methods typically fail to
simultaneously meet two key requirements of CIR: comprehensively extracting
visual information and faithfully following the user intent. In this work, we
propose CIR-LVLM, a novel framework that leverages the large vision-language
model (LVLM) as the powerful user intent-aware encoder to better meet these
requirements. Our motivation is to explore the advanced reasoning and
instruction-following capabilities of LVLM for accurately understanding and
responding to the user intent. Furthermore, we design a novel hybrid intent
instruction module to provide explicit intent guidance at two levels: (1) The
task prompt clarifies the task requirement and assists the model in discerning
user intent at the task level. (2) The instance-specific soft prompt, which is
adaptively selected from the learnable prompt pool, enables the model to better
comprehend the user intent at the instance level compared to a universal prompt
for all instances. CIR-LVLM achieves state-of-the-art performance across three
prominent benchmarks with acceptable inference efficiency. We believe this
study provides fundamental insights into CIR-related fields.
Authors' comments: Accepted by AAAI 2025
Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha
Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. Extensive results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12-20%.
Haoyu Jiang, Zhi-Qi Cheng, Gabriel Moreira, Jiawen Zhu, Jingdong Sun, Bukun Ren, Jun-Yan He, Qi Dai et al.
Universal Cross-Domain Retrieval (UCDR) retrieves relevant images from unseen
domains and classes without semantic labels, ensuring robust generalization.
Existing methods commonly employ prompt tuning with pre-trained vision-language
models but are inherently limited by static prompts, reducing adaptability. We
propose UCDR-Adapter, which enhances pre-trained models with adapters and
dynamic prompt generation through a two-phase training strategy. First, Source
Adapter Learning integrates class semantics with domain-specific visual
knowledge using a Learnable Textual Semantic Template and optimizes Class and
Domain Prompts via momentum updates and dual loss functions for robust
alignment. Second, Target Prompt Generation creates dynamic prompts by
attending to masked source prompts, enabling seamless adaptation to unseen
domains and classes. Unlike prior approaches, UCDR-Adapter dynamically adapts
to evolving data distributions, enhancing both flexibility and generalization.
During inference, only the image branch and generated prompts are used,
eliminating reliance on textual inputs for highly efficient retrieval.
Extensive benchmark experiments show that UCDR-Adapter consistently outperforms
ProS in most cases and other state-of-the-art methods on UCDR, U(c)CDR, and
U(d)CDR settings.
Authors' comments: Accepted to WACV 2025. Project link:
https://github.com/fine68/UCDR2024
Yujin Wang, Quanfeng Liu, Jiaqi Fan, Jinlong Hong, Hongqing Chu, Mengjian Tian, Bingzhao Gao, Hong Chen
Understanding and addressing corner cases is essential for ensuring the
safety and reliability of autonomous driving systems. Vision-language models
(VLMs) play a crucial role in enhancing scenario comprehension, yet they face
significant challenges, such as hallucination and insufficient real-world
grounding, which compromise their performance in critical driving scenarios. In
this work, RAC3, a novel framework designed to enhance the performance of VLMs
in corner case comprehension, is proposed. RAC3 integrates a frequency-spatial
fusion (FSF) image encoder, a cross-modal alignment training method for
embedding models with hard and semi-hard negative mining, and a fast querying
and retrieval pipeline based on K-Means clustering and hierarchical navigable
small world (HNSW) indexing. A multimodal chain-of-thought (CoT) prompting
strategy to guide analogical reasoning and reduce hallucinations during
inference is introduced. Moreover, an update mechanism is integrated into RAC3
to ensure continual learning within the framework. Extensive experiments on the
CODA and nuScenes datasets demonstrate that RAC3 significantly improves corner
case comprehension across multiple downstream tasks. Compared to prior
state-of-the-art methods, RAC3 achieves the highest final score of 74.46 on the
CODA-LM benchmark and shows consistent performance gains when integrated with
end-to-end frameworks like DriveLM. These results demonstrate the effectiveness
of retrieval-augmented strategies and cross-modal alignment for safer and more
interpretable autonomous driving.
Authors' comments: 14 pages, 7 figures
Yash Malviya, Karan Dhingra, Maneesh Singh
Regulatory documents are rich in nuanced terminology and specialized semantics. FRAG systems, i.e., frozen retrieval-augmented generators that use pre-trained (frozen) components, consequently face challenges with both retriever and answering performance. We present a system that adapts the retriever performance to the target domain using a multi-stage tuning (MST) strategy. Our retrieval approach, called MST-R, (a) first fine-tunes encoders used in vector stores using hard negative mining, (b) then uses a hybrid retriever, combining sparse and dense retrievers using reciprocal rank fusion, and then (c) adapts the cross-attention encoder by fine-tuning on only the top-k retrieved results. We benchmark the system performance on the dataset released for the RIRAG challenge (as part of the RegNLP workshop at COLING 2025). We achieve significant performance gains, obtaining a top rank on the RegNLP challenge leaderboard. We also show that a trivial answering approach games the RePASs metric, outscoring all baselines and a pre-trained Llama model. Analyzing this anomaly, we present important takeaways for future research.
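Step (b) of MST-R combines sparse and dense retrievers with reciprocal rank fusion (RRF). A minimal sketch of standard RRF follows, in which each document scores the sum of 1 / (k + rank) over the rankings that contain it. The constant k = 60 is the value commonly used in the RRF literature and is an assumption here, as are the function name and the document ids; the abstract does not state the settings used in MST-R.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one list.

    Each document receives sum(1 / (k + rank)) over the rankings in which it
    appears (rank starts at 1). Ties are broken by document id so the output
    is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a sparse (e.g. BM25) and a dense retriever.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Because RRF uses only ranks, it needs no calibration between the score scales of the sparse and dense retrievers, which is one reason it is a popular choice for hybrid retrieval.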
Hyeonseok Lim, Dongjae Shin, Seohyun Song, Inho Won, Minjun Kim, Junghun Yuk, Haneol Jang, KyungTae Lim
We propose VLR-Bench, a visual question answering (VQA) benchmark for
evaluating vision language models (VLMs) based on retrieval augmented
generation (RAG). Unlike existing evaluation datasets for external
knowledge-based VQA, the proposed VLR-Bench includes five input passages. This
allows testing of the ability to determine which passage is useful for
answering a given query, a capability lacking in previous research. In this
context, we constructed a dataset of 32,000 automatically generated
instruction-following examples, which we denote as VLR-IF. This dataset is
specifically designed to enhance the RAG capabilities of VLMs by enabling them
to learn how to generate appropriate answers based on input passages. We
evaluated the validity of the proposed benchmark and training data and verified
its performance using the state-of-the-art Llama3-based VLM, the Llava-Llama-3
model. The proposed VLR-Bench and VLR-IF datasets are publicly available
online.
Authors' comments: The 31st International Conference on Computational Linguistics
(COLING 2025), 19 pages
Xiao Zhang, Qianru Meng, Johan Bos
Open-domain semantic parsing remains a challenging task, as models often rely
on heuristics and struggle to handle unseen concepts. In this paper, we
investigate the potential of large language models (LLMs) for this task and
introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective
approach that integrates external lexical knowledge into the parsing process.
Our experiments not only show that LLMs outperform previous encoder-decoder
baselines for semantic parsing, but that RASP further enhances their ability to
predict unseen concepts, nearly doubling the performance of previous models on
out-of-distribution concepts. These findings highlight the promise of
leveraging large language models and retrieval mechanisms for robust and
open-domain semantic parsing.
Authors' comments: Submitted to ARR
Iman Munire Bilal, Zheng Fang, Miguel Arana-Catania, Felix-Anselm van Lier, Juliana Outes Velarde, Harry Bregazzi, Eleanor Carter, Mara Airoldi et al.
As academic literature proliferates, traditional review methods are increasingly challenged by the sheer volume and diversity of available research. This article presents a study that aims to address these challenges by enhancing the efficiency and scope of systematic reviews in the social sciences through advanced machine learning (ML) and natural language processing (NLP) tools. In particular, we focus on automating stages within the systematic reviewing process that are time-intensive and repetitive for human annotators and which lend themselves to immediate scalability through tools such as information retrieval and summarisation guided by expert advice. The article concludes with a summary of lessons learnt regarding the integrated approach towards systematic reviews and future directions for improvement, including explainability.
Nikolay Banar, Ehsan Lotfi, Walter Daelemans
Zero-shot evaluation of information retrieval (IR) models is often performed
using BEIR, a large and heterogeneous benchmark composed of multiple datasets,
covering different retrieval tasks across various domains. Although BEIR has
become a standard benchmark for the zero-shot setup, its exclusively English
content reduces its utility for underrepresented languages in IR, including
Dutch. To address this limitation and encourage the development of Dutch IR
models, we introduce BEIR-NL by automatically translating the publicly
accessible BEIR datasets into Dutch. Using BEIR-NL, we evaluated a wide range
of multilingual dense ranking and reranking models, as well as the lexical BM25
method. Our experiments show that BM25 remains a competitive baseline, and is
only outperformed by the larger dense models trained for retrieval. When
combined with reranking models, BM25 achieves performance on par with the best
dense ranking models. In addition, we explored the impact of translation on the
data by back-translating a selection of datasets to English, and observed a
performance drop for both dense and lexical methods, indicating the limitations
of translation for creating benchmarks. BEIR-NL is publicly available on the
Hugging Face hub.
Authors' comments: To be presented at BUCC 2025 (COLING)
Yuchen Hui, Fengran Mo, Milan Mao, Jian-Yun Nie
The Recherche Appliquee en Linguistique Informatique (RALI) team participated
in the 2024 TREC Interactive Knowledge Assistance (iKAT) Track. In personalized
conversational search, effectively capturing a user's complex search intent
requires incorporating both contextual information and key elements from the
user profile into query reformulation. The user profile often contains many
relevant pieces, and each could potentially complement the user's information
needs. It is difficult to disregard any of them, whereas introducing an
excessive number of these pieces risks drifting from the original query and
hinders search performance. This is a challenge we denote as
over-personalization. To address this, we propose different strategies by
fusing ranking lists generated from the queries with different levels of
personalization.
Authors' comments: Work presented at NIST Text Retrieval Conference 2024.
https://www.nist.gov/news-events/events/2024/11/trec2024
Kartik Sharma, Peeyush Kumar, Yunqing Li
This paper presents OG-RAG, an Ontology-Grounded Retrieval Augmented Generation method designed to enhance LLM-generated responses by anchoring retrieval processes in domain-specific ontologies. While LLMs are widely used for tasks like question answering and search, they struggle to adapt to specialized knowledge, such as industrial workflows or knowledge work, without expensive fine-tuning or sub-optimal retrieval methods. Existing retrieval-augmented models, such as RAG, offer improvements but fail to account for structured domain knowledge, leading to suboptimal context generation. Ontologies, which conceptually organize domain knowledge by defining entities and their interrelationships, offer a structured representation to address this gap. OG-RAG constructs a hypergraph representation of domain documents, where each hyperedge encapsulates clusters of factual knowledge grounded using domain-specific ontology. An optimization algorithm then retrieves the minimal set of hyperedges that constructs a precise, conceptually grounded context for the LLM. This method enables efficient retrieval while preserving the complex relationships between entities. OG-RAG applies to domains where fact-based reasoning is essential, particularly in tasks that require workflows or decision-making steps to follow predefined rules and procedures. These include industrial workflows in healthcare, legal, and agricultural sectors, as well as knowledge-driven tasks such as news journalism, investigative research, consulting and more. Our evaluations demonstrate that OG-RAG increases the recall of accurate facts by 55% and improves response correctness by 40% across four different LLMs. Additionally, OG-RAG enables 30% faster attribution of responses to context and boosts fact-based reasoning accuracy by 27% compared to baseline methods.
Wanwen Chen, Adam Schmidt, Eitan Prisman, Septimiu E. Salcudean
Purpose: Intraoperative ultrasound (US) can enhance real-time visualization
in transoral robotic surgery. The surgeon creates a mental map with a
pre-operative scan. Then, a surgical assistant performs freehand US scanning
during the surgery while the surgeon operates at the remote surgical console.
Communicating the target scanning plane in the surgeon's mental map is
difficult. Automatic image retrieval can help match intraoperative images to
preoperative scans, guiding the assistant to adjust the US probe toward the
target plane. Methods: We propose a self-supervised contrastive learning
approach to match intraoperative US views to a preoperative image database. We
introduce a novel contrastive learning strategy that leverages intra-sweep
similarity and US probe location to improve feature encoding. Additionally, our
model incorporates a flexible threshold to reject unsatisfactory matches.
Results: Our method achieves 92.30% retrieval accuracy on simulated data and
outperforms state-of-the-art temporal-based contrastive learning approaches.
Our ablation study demonstrates that using probe location in the optimization
goal improves image representation, suggesting that semantic information can be
extracted from probe location. We also present our approach on real patient
data to show the feasibility of the proposed US probe localization system
despite tissue deformation from tongue retraction. Conclusion: Our contrastive
learning method, which utilizes intra-sweep similarity and US probe location,
enhances US image representation learning. We also demonstrate the feasibility
of using our image retrieval method to provide neck US localization on real
patient US after tongue retraction.
Authors' comments: 12 pages, 5 figures
Xiaqiang Tang, Jian Li, Nan Du, Sihong Xie
Despite the superior performance of large language models (LLMs) on many NLP tasks,
they still face significant limitations in memorizing extensive world
knowledge. Recent studies have demonstrated that leveraging the
Retrieval-Augmented Generation (RAG) framework, combined with Knowledge Graphs
that encapsulate extensive factual data in a structured format, robustly
enhances the reasoning capabilities of LLMs. However, deploying such systems in
real-world scenarios presents challenges: the continuous evolution of
non-stationary environments may lead to performance degradation and user
satisfaction requires a careful balance of performance and responsiveness. To
address these challenges, we introduce a Multi-objective Multi-Armed Bandit
enhanced RAG framework, supported by multiple retrieval methods with diverse
capabilities under rich and evolving retrieval contexts in practice. Within
this framework, each retrieval method is treated as a distinct "arm". The
system utilizes real-time user feedback to adapt to dynamic environments, by
selecting the appropriate retrieval method based on input queries and the
historical multi-objective performance of each arm. Extensive experiments
conducted on two benchmark KGQA datasets demonstrate that our method
significantly outperforms baseline methods in non-stationary settings while
achieving state-of-the-art performance in stationary environments. Code and
data are available at https://github.com/FUTUREEEEEE/Dynamic-RAG.git
Authors' comments: AAAI 2025
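The idea of treating each retrieval method as a bandit arm can be sketched as follows. This is a deliberately simplified illustration and not the paper's algorithm: it uses a plain epsilon-greedy policy, ignores the input query, and collapses the two objectives (answer quality and responsiveness) into a weighted sum. The class name, the weights, and the optional constant step size for tracking non-stationary rewards are all assumptions.

```python
import random


class RetrievalBandit:
    """Epsilon-greedy selection among retrieval methods (hypothetical sketch)."""

    def __init__(self, arms, weights=(0.7, 0.3), epsilon=0.1, step=None, seed=0):
        self.arms = list(arms)
        self.weights = weights    # (quality weight, responsiveness weight)
        self.epsilon = epsilon    # probability of exploring a random arm
        self.step = step          # constant step size; None means sample average
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}

    def select(self) -> str:
        """Try every arm once, then explore with probability epsilon."""
        untried = [arm for arm in self.arms if self.counts[arm] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm: str, quality: float, responsiveness: float) -> None:
        """Fold user feedback (both objectives in [0, 1]) into the arm's estimate."""
        reward = self.weights[0] * quality + self.weights[1] * responsiveness
        self.counts[arm] += 1
        rate = self.step if self.step is not None else 1.0 / self.counts[arm]
        self.values[arm] += rate * (reward - self.values[arm])


# Hypothetical arms and feedback; epsilon=0.0 makes the example deterministic.
bandit = RetrievalBandit(["dense", "kg"], epsilon=0.0)
first = bandit.select()
bandit.update(first, quality=0.2, responsiveness=1.0)
second = bandit.select()
bandit.update(second, quality=0.9, responsiveness=0.5)
best = bandit.select()
```

Passing a constant step (for example 0.1) weights recent feedback more heavily, which is a standard way to follow rewards that drift over time. A faithful multi-objective treatment would keep the objectives separate rather than scalarizing them.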