Hessa Alawwad, Usman Naseem, Areej Alhothali, Ali Alkhathlan, Amani Jamal
Textbook question answering (TQA) is a challenging task that requires
interpreting complex multimodal context. Although recent advances have
improved overall performance, they often encounter difficulties in educational
settings where accurate semantic alignment and task-specific document retrieval
are essential. In this paper, we propose a novel approach to multimodal
textbook question answering by introducing a mechanism for enhancing semantic
representations through multi-objective joint training. Our model, Joint
Embedding Training With Ranking Supervision for Textbook Question Answering
(JETRTQA), is a multimodal learning framework built on a retriever--generator
architecture that uses a retrieval-augmented generation setup, in which a
multimodal large language model generates answers. JETRTQA is designed to
improve the relevance of retrieved documents in complex educational contexts.
Unlike traditional direct scoring approaches, JETRTQA learns to refine the
semantic representations of questions and documents through a supervised signal
that combines pairwise ranking and implicit supervision derived from answers.
We evaluate our method on the CK12-QA dataset and demonstrate that it
significantly improves the discrimination between informative and irrelevant
documents, even when they are long, complex, and multimodal. JETRTQA
outperforms the previous state of the art, achieving a 2.4% gain in accuracy
on the validation set and 11.1% on the test set.
Authors' comments: 14 pages, 16 figures
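The pairwise ranking signal mentioned in the JETRTQA abstract can be illustrated with a margin-based hinge loss over cosine similarities. This is a minimal sketch, not the JETRTQA training objective: the function names, the margin value, and the toy vectors are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pairwise_ranking_loss(question, positive, negative, margin=0.2):
    """Hinge loss: zero once the positive document outscores the negative by `margin`."""
    return max(0.0, margin - cosine(question, positive) + cosine(question, negative))

# Toy embeddings (hypothetical): the relevant document points the same way as the question.
q = [1.0, 0.0]
relevant = [1.0, 0.0]
irrelevant = [0.0, 1.0]

loss_good = pairwise_ranking_loss(q, relevant, irrelevant)  # correctly ordered pair
loss_bad = pairwise_ranking_loss(q, irrelevant, relevant)   # mis-ordered pair
```

A correctly ordered pair contributes no loss, while a mis-ordered pair is penalized in proportion to how far the ordering is violated.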
Ahmed Lekssays, Utsav Shukla, Husrev Taha Sencar, Md Rizwan Parvez
Accurately identifying adversarial techniques in security texts is critical
for effective cyber defense. However, existing methods face a fundamental
trade-off: they either rely on generic models with limited domain precision or
require resource-intensive pipelines that depend on large labeled datasets and
task-specific optimizations, such as custom hard-negative mining and denoising,
resources rarely available in specialized domains.
We propose TechniqueRAG, a domain-specific retrieval-augmented generation
(RAG) framework that bridges this gap by integrating off-the-shelf retrievers,
instruction-tuned LLMs, and minimal text-technique pairs. Our approach
addresses data scarcity by fine-tuning only the generation component on limited
in-domain examples, circumventing the need for resource-intensive retrieval
training. While conventional RAG mitigates hallucination by coupling retrieval
and generation, its reliance on generic retrievers often introduces noisy
candidates, limiting domain-specific precision. To address this, we enhance
retrieval quality and domain specificity through zero-shot LLM re-ranking,
which explicitly aligns retrieved candidates with adversarial techniques.
Experiments on multiple security benchmarks demonstrate that TechniqueRAG
achieves state-of-the-art performance without extensive task-specific
optimizations or labeled data, while comprehensive analysis provides further
insights.
Authors' comments: Accepted at ACL (Findings) 2025
David Osei Opoku, Ming Sheng, Yong Zhang
Domain-specific QA systems require not just generative fluency but high
factual accuracy grounded in structured expert knowledge. While recent
Retrieval-Augmented Generation (RAG) frameworks improve context recall, they
struggle with integrating heterogeneous data and maintaining reasoning
consistency. To address these challenges, we propose DO-RAG, a scalable and
customizable hybrid QA framework that integrates multi-level knowledge graph
construction with semantic vector retrieval. Our system employs a novel agentic
chain-of-thought architecture to extract structured relationships from
unstructured, multimodal documents, constructing dynamic knowledge graphs that
enhance retrieval precision. At query time, DO-RAG fuses graph and vector
retrieval results to generate context-aware responses, followed by
hallucination mitigation via grounded refinement. Experimental evaluations in
the database and electrical domains show near-perfect recall and over 94%
answer relevancy, with DO-RAG outperforming baseline frameworks by up to
33.38%. By combining traceability, adaptability, and performance efficiency,
DO-RAG offers a reliable foundation for multi-domain, high-precision QA at
scale.
Authors' comments: 6 pages, 5 figures
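The DO-RAG abstract does not say how graph and vector retrieval results are fused. As one plausible illustration, the sketch below uses reciprocal rank fusion (RRF), a generic rank-merging technique swapped in here as a stand-in; the document IDs and the constant `k=60` are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a graph retriever and a vector retriever.
graph_hits = ["d3", "d1", "d7"]
vector_hits = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([graph_hits, vector_hits])
```

Documents returned by both retrievers (`d1`, `d3`) rise above those returned by only one.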
Aaron Wilhelm, Nils Napp
Ground texture localization using a downward-facing camera offers a low-cost,
high-precision localization solution that is robust to dynamic environments and
requires no environmental modification. We present a significantly improved
bag-of-words (BoW) image retrieval system for ground texture localization,
achieving substantially higher accuracy for global localization and higher
precision and recall for loop closure detection in SLAM. Our approach leverages
an approximate $k$-means (AKM) vocabulary with soft assignment, and exploits
the consistent orientation and constant scale constraints inherent to ground
texture localization. Identifying the different needs of global localization
vs. loop closure detection for SLAM, we present both high-accuracy and
high-speed versions of our algorithm. We test the effect of each of our
proposed improvements through an ablation study and demonstrate our method's
effectiveness for both global localization and loop closure detection. With
numerous ground texture localization systems already using BoW, our method can
readily replace other generic BoW systems in their pipeline and immediately
improve their results.
Authors' comments: Accepted to ICRA 2025
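Soft assignment in a bag-of-words vocabulary can be sketched as follows: each descriptor contributes to several nearby visual words instead of only the closest one. This is an illustrative sketch only. It uses exact brute-force nearest-word search rather than the approximate search an AKM vocabulary would use, and the Gaussian weighting, `n_nearest`, and `sigma` are assumptions rather than the paper's settings.

```python
import math

def soft_assign(descriptor, vocabulary, n_nearest=2, sigma=1.0):
    """Spread one descriptor over its n_nearest visual words with Gaussian weights."""
    nearest = sorted(
        (math.dist(descriptor, word), idx) for idx, word in enumerate(vocabulary)
    )[:n_nearest]
    weights = {idx: math.exp(-(d * d) / (2.0 * sigma * sigma)) for d, idx in nearest}
    total = sum(weights.values())
    if total == 0.0:  # all weights underflowed: fall back to a uniform split
        return {idx: 1.0 / len(weights) for idx in weights}
    return {idx: w / total for idx, w in weights.items()}

def bow_histogram(descriptors, vocabulary, **kwargs):
    """Accumulate soft-assignment weights into one histogram per image."""
    hist = [0.0] * len(vocabulary)
    for desc in descriptors:
        for idx, weight in soft_assign(desc, vocabulary, **kwargs).items():
            hist[idx] += weight
    return hist

# Toy 2-D vocabulary and a descriptor equidistant from the first two words.
vocabulary = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
hist = bow_histogram([(1.0, 0.0)], vocabulary)
```

With hard assignment the descriptor above would fall arbitrarily into one of two equally close words; soft assignment splits its vote between them.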
Petr Kasalický, Martin Spišák, Vojtěch Vančura, Daniel Bohuněk, Rodrigo Alves, Pavel Kordík
Industry-scale recommender systems face a core challenge: representing entities with high cardinality, such as users or items, using dense embeddings that must be accessible during both training and inference. However, as embedding sizes grow, memory constraints make storage and access increasingly difficult. We describe a lightweight, learnable embedding compression technique that projects dense embeddings into a high-dimensional, sparsely activated space. Designed for retrieval tasks, our method reduces memory requirements while preserving retrieval performance, enabling scalable deployment under strict resource constraints. Our results demonstrate that leveraging sparsity is a promising approach for improving the efficiency of large-scale recommenders. We release our code at https://github.com/recombee/CompresSAE.
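The idea of projecting a dense embedding into a higher-dimensional, sparsely activated space can be sketched as a linear projection followed by keeping only the largest activations. This is not the released CompresSAE code: the random matrix stands in for a learned encoder, and the top-k-by-magnitude rule and all dimensions are assumptions.

```python
import random

def project(vec, matrix):
    """Multiply a (sparse_dim x dense_dim) matrix by a dense vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def top_k_sparsify(vec, k):
    """Keep the k entries with the largest magnitude as an {index: value} map."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return {i: vec[i] for i in sorted(keep)}

rng = random.Random(0)
dense_dim, sparse_dim, k = 4, 16, 3

# Hypothetical encoder weights; a real system would learn these.
encoder = [[rng.gauss(0.0, 1.0) for _ in range(dense_dim)] for _ in range(sparse_dim)]

embedding = [0.5, -1.0, 0.25, 2.0]
sparse_code = top_k_sparsify(project(embedding, encoder), k)
```

Only `k` index-value pairs need to be stored per entity, which is where the memory saving would come from when `k` is much smaller than the dense dimension.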
Chuan Xu, Qiaosheng Chen, Yutong Feng, Gong Cheng
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models. However, existing RAG evaluation predominantly focuses on text retrieval and relies on opaque, end-to-end assessments of generated outputs. To address these limitations, we introduce mmRAG, a modular benchmark designed for evaluating multi-modal RAG systems. Our benchmark integrates queries from six diverse question-answering datasets spanning text, tables, and knowledge graphs, which we uniformly convert into retrievable documents. To enable direct, granular evaluation of individual RAG components -- such as the accuracy of retrieval and query routing -- beyond end-to-end generation quality, we follow standard information retrieval procedures to annotate document relevance and derive dataset relevance. We establish baseline performance by evaluating a wide range of RAG implementations on mmRAG.
Shaohan Wang, Licheng Zhang, Zheren Fu, Zhendong Mao
Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods focus on optimizing the retriever or generator in the RAG system by directly utilizing the top-k retrieved documents. However, the effectiveness of documents varies significantly across user queries, i.e., some documents provide valuable knowledge while others lack critical information entirely. This hinders the adaptation of the retriever and generator during training. Inspired by human cognitive learning, curriculum learning trains models on samples that progress from easy to difficult, thus enhancing their generalization ability; we integrate this paradigm into the training of the RAG system. In this paper, we propose a multi-stage curriculum-learning-based training framework for RAG systems, named CL-RAG. We first construct training data with multiple difficulty levels for the retriever and generator separately through sample evolution. Then, we train the model in stages based on the curriculum learning approach, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our CL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.
Esther van Dijk, Yamila Miguel
Understanding exoplanet interiors is crucial for interpreting atmospheric
observations and constraining their evolution and formation. However, due to
limited observational constraints, interior structures remain poorly
understood. In this work, we investigate how new observational constraints,
such as the Love number and atmospheric metallicity, improve our ability to
characterize the interiors of hot Jupiters, planets for which Love number
measurements are most feasible. We assess the precision required in Love number
measurements to derive interior properties using both a simple two-layer
homogeneous model and a more complex dilute core model. To account for
observational uncertainties, we implement a retrieval framework. Our results
show that accurately constraining core mass and bulk metallicity requires a
high-precision Love number measurement, better than 40% for a homogeneous model
and 15% for a dilute core model, along with an atmospheric metallicity
measurement. We apply our retrieval framework to five planets with observed
Love numbers, of which only WASP-19Ab has both an atmospheric metallicity
constraint and a highly precise Love number measurement, with a precision of
12%. For this flagship planet, both models confirm the presence of a core,
although we cannot yet distinguish between a compact core and a dilute core. With
the homogeneous model, we find a core mass fraction of $0.21^{+0.05}_{-0.04}$,
corresponding to $79^{+21}_{-18}$ $M_\mathrm{earth}$. Upcoming JWST
observations are expected to provide high-precision Love number measurements
and precise atmospheric data, offering new insights into the structure and
composition of gas giant interiors.
Authors' comments: 17 pages, 14 figures, accepted for publication in MNRAS
Zhiyuan Chang, Xiaojun Jia, Mingyang Li, Junjie Wang, Yuekai Huang, Qing Wang, Ziyou Jiang, Yang Liu
Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation
(RAG) have shown improved performance in generating accurate responses.
However, the dependence on external knowledge bases introduces potential
security vulnerabilities, particularly when these knowledge bases are publicly
accessible and modifiable. Poisoning attacks on knowledge bases for RAG systems
face two fundamental challenges: the injected malicious content must compete
with multiple authentic documents retrieved by the retriever, and LLMs tend to
trust retrieved information that aligns with their internal memorized
knowledge. Previous works attempt to address these challenges by injecting
multiple malicious documents, but such saturation attacks are easily detectable
and impractical in real-world scenarios. To enable effective single-document
poisoning attacks, we propose AuthChain, a novel knowledge poisoning
attack method that leverages Chain-of-Evidence theory and the authority effect to
craft more convincing poisoned documents. AuthChain generates poisoned content
that establishes strong evidence chains and incorporates authoritative
statements, effectively overcoming the interference from both authentic
documents and LLMs' internal knowledge. Extensive experiments across six
popular LLMs demonstrate that AuthChain achieves significantly higher attack
success rates while maintaining superior stealthiness against RAG defense
mechanisms compared to state-of-the-art baselines.
Authors' comments: 15 pages, 4 figures
Han Peng, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Lei Fang
Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods use prompting strategies and retrieval heads to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, which limits their effectiveness in answering questions. To address this, we introduce CAFE, a two-stage coarse-to-fine method to enhance multi-document question-answering capabilities. By gradually eliminating the negative impacts of background and distracting documents, CAFE makes the responses more reliant on the evidence documents. First, a coarse-grained filtering method leverages retrieval heads to identify and rank relevant documents. Then, a fine-grained steering method guides attention to the most relevant content. Experiments across benchmarks show CAFE outperforms baselines, achieving up to 22.1% and 13.7% SubEM improvement over SFT and RAG methods on the Mistral model, respectively.
Yifan Wu, Lutao Yan, Yizhang Zhu, Yinan Mei, Jiannan Wang, Nan Tang, Yuyu Luo
Charts are crucial for data analysis and decision-making. Text-to-chart retrieval systems have become increasingly important for Business Intelligence (BI), where users need to find relevant charts that match their analytical needs. These needs can be categorized into precise queries that are well-specified and fuzzy queries that are more exploratory; both require understanding the semantics and context of the charts. However, existing text-to-chart retrieval solutions often fail to capture the semantic content and contextual information of charts, primarily due to the lack of comprehensive metadata (or semantic insights). To address this limitation, we propose a training data development pipeline that automatically synthesizes hierarchical semantic insights for charts, covering visual patterns (visual-oriented), statistical properties (statistics-oriented), and practical applications (task-oriented), which produces 207,498 semantic insights for 69,166 charts. Based on these, we train a CLIP-based model named ChartFinder to learn better representations of charts for text-to-chart retrieval. Our method leverages rich semantic insights during the training phase to develop a model that understands both visual and semantic aspects of charts. To evaluate text-to-chart retrieval performance, we curate the first benchmark, CRBench, for this task with 21,862 charts and 326 text queries from real-world BI applications, with ground-truth labels verified by crowd workers. Experiments show that ChartFinder significantly outperforms existing methods in text-to-chart retrieval tasks across various settings. For precise queries, ChartFinder achieves up to 66.9% NDCG@10, which is 11.58% higher than state-of-the-art models. In fuzzy query tasks, our method also demonstrates consistent improvements, with an average increase of 5% across nearly all metrics.
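For readers unfamiliar with the NDCG@10 metric reported above, the following is a minimal sketch of how it is commonly computed. It uses the linear-gain variant and derives the ideal ordering from the relevance labels of the returned list only; whether CRBench uses exactly this variant is an assumption.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of returned charts, in ranked order (hypothetical data).
perfect = ndcg_at_k([1, 0, 0])   # the relevant chart is ranked first
demoted = ndcg_at_k([0, 1])      # the relevant chart is ranked second
```

A score of 1.0 means the ranking matches the ideal ordering; pushing a relevant chart down the list lowers the score logarithmically.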
Deeksha Prahlad, Chanhee Lee, Dongha Kim, Hokeun Kim
The advent of large language models (LLMs) has allowed numerous applications,
including the generation of queried responses, to be leveraged in chatbots and
other conversational assistants. Being trained on a plethora of data, LLMs
often undergo high levels of over-fitting, resulting in the generation of extra
and incorrect data, thus causing hallucinations in output generation. One of
the root causes of such problems is the lack of timely, factual, and
personalized information fed to the LLM. In this paper, we propose an approach
to address these problems by introducing retrieval augmented generation (RAG)
using knowledge graphs (KGs) to assist the LLM in personalized response
generation tailored to the users. KGs have the advantage of storing
continuously updated factual information in a structured way. While our KGs can
be used for a variety of frequently updated personal data, such as calendar,
contact, and location data, we focus on calendar data in this paper. Our
experimental results show that our approach works significantly better in
understanding personal information and generating accurate responses compared
to the baseline LLMs using personal data as text inputs, with a moderate
reduction in response time.
Authors' comments: To appear in the Companion Proceedings of the ACM Web Conference 2025
(WWW Companion '25)
Qiwei Peng, Robert Moro, Michal Gregor, Ivan Srba, Simon Ostermann, Marian Simko, Juraj Podroužek, Matúš Mesarčík et al.
The rapid spread of online disinformation presents a global challenge, and machine learning has been widely explored as a potential solution. However, multilingual settings and low-resource languages are often neglected in this field. To address this gap, we conducted a shared task on multilingual claim retrieval at SemEval 2025, aimed at identifying fact-checked claims that match newly encountered claims expressed in social media posts across different languages. The task includes two subtracks: (1) a monolingual track, where social posts and claims are in the same language, and (2) a crosslingual track, where social posts and claims might be in different languages. A total of 179 participants registered for the task, contributing to 52 test submissions. Of the 31 teams, 23 submitted system papers. In this paper, we report the best-performing systems as well as the most common and the most effective approaches across both subtracks. This shared task, along with its dataset and participating systems, provides valuable insights into multilingual claim retrieval and automated fact-checking, supporting future research in this field.
Mohamed Abdelmagied, Mohamed Amine Chatti, Shoeb Joarder, Qurat Ul Ain, Rawaa Alatrash
Massive Open Online Courses (MOOCs) lack direct interaction between learners
and instructors, making it challenging for learners to understand new knowledge
concepts. Recently, learners have increasingly used Large Language Models
(LLMs) to support them in acquiring new knowledge. However, LLMs are prone to
hallucinations, which limits their reliability. Retrieval-Augmented Generation
(RAG) addresses this issue by retrieving relevant documents before generating a
response. However, the application of RAG across different MOOCs is limited by
unstructured learning material. Furthermore, current RAG systems do not
actively guide learners toward their learning needs. To address these
challenges, we propose a Graph RAG pipeline that leverages Educational
Knowledge Graphs (EduKGs) and Personal Knowledge Graphs (PKGs) to guide
learners to understand knowledge concepts in the MOOC platform CourseMapper.
Specifically, we implement (1) a PKG-based Question Generation method to
recommend personalized questions for learners in context, and (2) an
EduKG-based Question Answering method that leverages the relationships between
knowledge concepts in the EduKG to answer learner-selected questions. To
evaluate both methods, we conducted a study with 3 expert instructors on 3
different MOOCs in the MOOC platform CourseMapper. The results of the
evaluation show the potential of Graph RAG to empower learners to understand
new knowledge concepts in a personalized learning experience.
Authors' comments: Accepted at EMOOCs 2025
Kazuki Hayashi, Hidetaka Kamigaito, Shinya Kouda, Taro Watanabe
Retrieval-Augmented Generation (RAG) has emerged as a way to complement the in-context knowledge of Large Language Models (LLMs) by integrating external documents. However, real-world applications demand not only accuracy but also interpretability. While dense retrieval methods provide high accuracy, they lack interpretability; conversely, sparse retrieval methods offer transparency but often fail to capture the full intent of queries due to their reliance on keyword matching. To address these issues, we introduce IterKey, an LLM-driven iterative keyword generation framework that enhances RAG via sparse retrieval. IterKey consists of three LLM-driven stages: generating keywords for retrieval, generating answers based on retrieved documents, and validating the answers. If validation fails, the process iteratively repeats with refined keywords. Across four QA tasks, experimental results show that IterKey achieves 5% to 20% accuracy improvements over BM25-based RAG and simple baselines. Its performance is comparable to dense retrieval-based RAG and prior iterative query refinement methods using dense models. In summary, IterKey is a novel BM25-based approach leveraging LLMs to iteratively refine RAG, effectively balancing accuracy with interpretability.
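The three-stage loop described above (generate keywords, answer from retrieved documents, validate, and retry with refined keywords) can be sketched as plain control flow. This is an illustration of the loop structure only: the four callables are hypothetical stand-ins for the LLM stages and the BM25 retriever, and the toy corpus and keyword filter are assumptions.

```python
def iterkey_style_loop(question, generate_keywords, retrieve, answer, validate,
                       max_iters=3):
    """Repeat keyword generation, retrieval, and answering until validation passes."""
    keywords = None
    for attempt in range(1, max_iters + 1):
        keywords = generate_keywords(question, keywords)  # refine on later rounds
        docs = retrieve(keywords)
        candidate = answer(question, docs)
        if validate(question, docs, candidate):
            return candidate, attempt
    return None, max_iters

# Toy stand-ins (hypothetical) so the loop can run without an LLM or BM25 index.
corpus = ["Paris is the capital of France.", "Berlin is the capital of Germany."]

def generate_keywords(question, previous):
    # First round: one generic keyword; later rounds: add a more specific one.
    return ["capital"] if previous is None else previous + ["france"]

def retrieve(keywords):
    # Keep documents containing every keyword (a crude stand-in for BM25 ranking).
    return [d for d in corpus if all(k in d.lower() for k in keywords)]

def answer(question, docs):
    # Answer only when retrieval is unambiguous.
    return docs[0].split()[0] if len(docs) == 1 else ""

def validate(question, docs, candidate):
    return candidate != ""

result = iterkey_style_loop("What is the capital of France?",
                            generate_keywords, retrieve, answer, validate)
```

In this toy run the first keyword set is too generic, validation fails, and the refined keywords succeed on the second attempt. Because the keywords are explicit strings, each retrieval step remains inspectable.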
Adel Ammar, Anis Koubaa, Omer Nacar, Wadii Boulila
Large language models achieve high task performance yet often hallucinate or rely on outdated knowledge. Retrieval-augmented generation (RAG) addresses these gaps by coupling generation with external search. We analyse how hyperparameters influence speed and quality in RAG systems, covering Chroma and Faiss vector stores, chunking policies, cross-encoder re-ranking, and temperature, and we evaluate six metrics: faithfulness, answer correctness, answer relevancy, context precision, context recall, and answer similarity. Chroma processes queries 13% faster, whereas Faiss yields higher retrieval precision, revealing a clear speed-accuracy trade-off. Naive fixed-length chunking with small windows and minimal overlap outperforms semantic segmentation while remaining the quickest option. Re-ranking provides modest gains in retrieval quality yet increases runtime by roughly a factor of 5, so its usefulness depends on latency constraints. These results help practitioners balance computational cost and accuracy when tuning RAG systems for transparent, up-to-date responses. Finally, we re-evaluate the top configurations with a corrective RAG workflow and show that their advantages persist when the model can iteratively request additional evidence. We obtain a near-perfect context precision (99%), which demonstrates that RAG systems can achieve extremely high retrieval accuracy with the right combination of hyperparameters, with significant implications for applications where retrieval quality directly impacts downstream task performance, such as clinical decision support in healthcare.
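Fixed-length chunking with a small window and minimal overlap, the policy the abstract reports as strongest, can be sketched in a few lines. The function name and the window and overlap sizes below are illustrative assumptions, not the settings used in the study.

```python
def chunk_fixed(tokens, size, overlap=0):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not tokens:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    stop = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, stop, step)]

chunks = chunk_fixed(list(range(10)), size=4, overlap=1)
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary is less likely to lose its context entirely.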
Jingfen Qiao, Jia-Huei Ju, Xinyu Ma, Evangelos Kanoulas, Andrew Yates
Visual Document Retrieval (VDR) is an emerging research area that focuses on encoding and retrieving document images directly, bypassing the dependence on Optical Character Recognition (OCR) for document search. A recent advance in VDR was introduced by ColPali, which significantly improved retrieval effectiveness through a late interaction mechanism. ColPali's approach demonstrated substantial performance gains over existing baselines that do not use late interaction on an established benchmark. In this study, we investigate the reproducibility and replicability of VDR methods with and without late interaction mechanisms by systematically evaluating their performance across multiple pre-trained vision-language models. Our findings confirm that late interaction yields considerable improvements in retrieval effectiveness; however, it also introduces computational inefficiencies during inference. Additionally, we examine the adaptability of VDR models to textual inputs and assess their robustness across text-intensive datasets within the proposed benchmark, particularly when scaling the indexing mechanism. Furthermore, our research investigates the specific contributions of late interaction by looking into query-patch matching in the context of visual document retrieval. We find that although query tokens cannot explicitly match image patches as in the text retrieval scenario, they tend to match patches that contain visually similar tokens, or the patches surrounding them.
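Late interaction scoring of the kind discussed above is usually a "MaxSim" operation: each query token is matched to its best-scoring image patch, and the per-token maxima are summed. The sketch below shows that computation on toy vectors; the embeddings are hypothetical, and real systems operate on batched tensors.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, patch_vecs):
    """Sum, over query tokens, of the best similarity to any document patch."""
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)

# Two query-token embeddings and two patch embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query, patches)
```

The cost grows with the product of query tokens and patches per document, which is one source of the inference-time inefficiency the abstract mentions.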
Lei Wang
Retrieval-Augmented Generation (RAG) models frequently encounter hallucination phenomena when integrating external information with internal parametric knowledge. Empirical studies demonstrate that the disequilibrium between external contextual information and internal parametric knowledge constitutes a primary factor in hallucination generation. Existing hallucination detection methodologies predominantly emphasize either the external or internal mechanism in isolation, thereby overlooking their synergistic effects. The recently proposed ReDeEP framework decouples these dual mechanisms, identifying two critical contributors to hallucinations: excessive reliance on parametric knowledge encoded in feed-forward networks (FFN) and insufficient utilization of external information by attention mechanisms (particularly copy heads). ReDeEP quantitatively assesses these factors to detect hallucinations and dynamically modulates the contributions of FFNs and copy heads to attenuate their occurrence. Nevertheless, ReDeEP and numerous other hallucination detection approaches operate at the level of logit-based uncertainty estimation or language-level self-consistency evaluation and inadequately address the semantic dimensions of model responses, resulting in inconsistent hallucination assessments in RAG implementations. Building upon ReDeEP's foundation, this paper introduces SEReDeEP, which enhances computational processes through semantic entropy captured via trained linear probes, thereby achieving hallucination assessments that more accurately reflect ground truth evaluations.
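SEReDeEP estimates semantic entropy with trained linear probes. As background, the quantity itself can be sketched directly: sample several responses, group them by meaning, and take the entropy of the resulting cluster distribution. The sketch below assumes the meaning clusters are already given as labels; how responses are clustered, and how the probes approximate this value, is outside its scope.

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels):
    """Shannon entropy (in nats) of the distribution over meaning clusters."""
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

# Hypothetical cluster labels for sampled responses to one question.
consistent = semantic_entropy(["A", "A", "A", "A"])   # all samples agree in meaning
split = semantic_entropy(["A", "B", "A", "B"])        # samples disagree evenly
```

Low entropy indicates the model gives semantically consistent answers; high entropy is a signal that the response may be hallucinated.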
Zheng Yao, Shuai Wang, Guido Zuccon
Dense retrievers utilize pre-trained backbone language models (e.g., BERT,
LLaMA) that are fine-tuned via contrastive learning to perform the task of
encoding text into dense representations that can then be compared via a
shallow similarity operation, e.g., inner product. Recent research has
questioned the role of fine-tuning vs. that of pre-training within dense
retrievers, specifically arguing that retrieval knowledge is primarily gained
during pre-training, meaning knowledge not acquired during pre-training cannot
be subsequently acquired via fine-tuning. We revisit this idea here as the
claim was only studied in the context of a BERT-based encoder using DPR as a
representative dense retriever. We extend the previous analysis by testing
other representation approaches (comparing the use of CLS tokens with that of
mean pooling), backbone architectures (encoder-only BERT vs. decoder-only
LLaMA), and additional datasets (MSMARCO in addition to Natural Questions). Our
study confirms that in DPR tuning, pre-trained knowledge underpins retrieval
performance, with fine-tuning primarily adjusting neuron activation rather than
reorganizing knowledge. However, this pattern does not hold universally; for
example, it fails for mean-pooled (Contriever) and decoder-based (LLaMA) models. We ensure full
reproducibility and make our implementation publicly available at
https://github.com/ielab/DenseRetriever-Knowledge-Acquisition.
Authors' comments: Accepted at SIGIR 2025
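The two representation approaches compared in the study, CLS-token pooling and mean pooling, differ only in how per-token vectors are reduced to one text embedding. The sketch below shows both on toy data; it assumes the CLS token sits at position 0 and that padding is marked by a 0/1 mask, which is the common convention but not something the abstract specifies.

```python
def cls_pool(token_vecs):
    """Use the first token's vector (the CLS position) as the text embedding."""
    return list(token_vecs[0])

def mean_pool(token_vecs, mask):
    """Average the vectors of non-padding tokens (mask value 1)."""
    kept = [vec for vec, m in zip(token_vecs, mask) if m]
    return [sum(column) / len(kept) for column in zip(*kept)]

# Three token vectors; the last one is padding (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]

cls_embedding = cls_pool(tokens)
mean_embedding = mean_pool(tokens, mask)
```

Either embedding is then compared with a query embedding via inner product, as described in the abstract.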
Jiashuo Sun, Xianrui Zhong, Sizhe Zhou, Jiawei Han
Retrieval-augmented generation (RAG) systems combine large language models
(LLMs) with external knowledge retrieval, making them highly effective for
knowledge-intensive tasks. A crucial but often under-explored component of
these systems is the reranker. Since irrelevant documents in RAG systems can
mislead the generator, the reranker plays a vital role in refining retrieved
documents to enhance generation quality and explainability. However, it is
challenging to determine the appropriate number of documents ($k$) that the
reranker should select: too few may result in missing critical information,
while too many introduce noise and inefficiencies. Although recent studies have
explored LLM-based rerankers, they primarily leverage internal model knowledge
and overlook the rich supervisory signals that LLMs can provide, such as using
response quality as feedback for optimizing reranking decisions. In this paper,
we propose DynamicRAG, a novel RAG framework where the reranker dynamically
adjusts both the order and number of retrieved documents based on the query. We
model the reranker as an agent optimized through reinforcement learning (RL),
using rewards derived from LLM output quality. Across seven knowledge-intensive
datasets, DynamicRAG demonstrates superior performance, achieving
state-of-the-art results among models of the same parameter size. The model,
data, and code are available at https://github.com/GasolSun36/DynamicRAG.
Authors' comments: 24 pages, 7 figures, 15 tables