Zehan Qi, Rongwu Xu, Zhijiang Guo, Cunxiang Wang, Hao Zhang, Wei Xu
Retrieval-augmented generation (RAG) is a promising approach to address the
limitations of fixed knowledge in large language models (LLMs). However,
current benchmarks for evaluating RAG systems suffer from two key deficiencies:
(1) they fail to adequately measure LLMs' capability in handling long-context
retrieval due to a lack of datasets that reflect the characteristics of
retrieved documents, and (2) they lack a comprehensive evaluation method for
assessing LLMs' ability to generate long-form responses that effectively
exploit retrieved information. To address these shortcomings, we introduce the
Long$^2$RAG benchmark and the Key Point Recall (KPR) metric. Long$^2$RAG
comprises 280 questions spanning 10 domains and 8 question categories,
each associated with 5 retrieved documents with an average length of 2,444
words. KPR evaluates the extent to which LLMs incorporate key points extracted
from the retrieved documents into their generated responses, providing a more
nuanced assessment of their ability to exploit retrieved information.
Authors' comments: Accepted to EMNLP'24 (Findings). Camera-ready version
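The abstract above does not give a formula for KPR, so the following is only a minimal sketch under the assumption that KPR is the fraction of extracted key points that a response covers. The function names and the `is_covered` judge are hypothetical; a realistic judge would likely be an entailment model or an LLM, and the substring check here is just a toy stand-in.

```python
from typing import Callable, Sequence


def key_point_recall(
    response: str,
    key_points: Sequence[str],
    is_covered: Callable[[str, str], bool],
) -> float:
    """Fraction of key points that the response covers (assumed definition)."""
    if not key_points:
        return 0.0
    covered = sum(1 for kp in key_points if is_covered(kp, response))
    return covered / len(key_points)


def naive_is_covered(key_point: str, response: str) -> bool:
    # Hypothetical stand-in judge: case-insensitive substring match.
    return key_point.lower() in response.lower()


response = "Solar panels convert sunlight into electricity and lower energy bills."
key_points = [
    "convert sunlight into electricity",
    "lower energy bills",
    "require little maintenance",
]
# Two of the three key points appear in the response.
score = key_point_recall(response, key_points, naive_is_covered)
```

Passing the judge as a parameter keeps the recall computation independent of how coverage is decided.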
Shuyang Yu, Runxue Bao, Parminder Bhatia, Taha Kass-Hout, Jiayu Zhou, Cao Xiao
Large language models (LLMs) can learn vast amounts of knowledge from diverse
domains during pre-training. However, long-tail knowledge from specialized
domains is often scarce and underrepresented, rarely appearing in the models'
memorization. Prior work has shown that in-context learning (ICL) with
retriever augmentation can help LLMs better capture long-tail knowledge,
reducing their reliance on pre-trained data. Despite these advances, we observe
that LLM predictions for long-tail questions remain uncertain to variations in
retrieved samples. To take advantage of the uncertainty in ICL for guiding LLM
predictions toward correct answers on long-tail samples, we propose a
reinforcement learning-based dynamic uncertainty ranking method for ICL that
accounts for the varying impact of each retrieved sample on LLM predictions.
Our approach prioritizes more informative and stable samples while demoting
misleading ones, updating rankings based on the feedback from the LLM w.r.t.
each retrieved sample. To enhance training efficiency and reduce query costs,
we introduce a learnable dynamic ranking threshold, adjusted when the model
encounters negative prediction shifts. Experimental results on various
question-answering datasets from different domains show that our method
outperforms the best baseline by $2.76\%$, with a notable $5.96\%$ boost in
accuracy on long-tail questions that elude zero-shot inference.
Authors' comments: Accepted by NAACL 2025
Ryoya Ogura, Tomoya Nishida, Yohei Kawaguchi
This paper proposes a method for unsupervised anomalous sound detection (UASD) and captioning the reason for detection. While there is a method that captions the difference between given normal and anomalous sound pairs, it is assumed to be trained and used separately from the UASD model. Therefore, the obtained caption can be irrelevant to the differences that the UASD model captured. In addition, it requires many caption labels representing differences between anomalous and normal sounds for model training. The proposed method employs a retrieval-augmented approach for captioning of anomalous sounds. Difference captioning in the embedding space output by the pre-trained CLAP (contrastive language-audio pre-training) model makes the anomalous sound detection results consistent with the captions and does not require training. Experiments based on subjective evaluation and a sample-wise analysis of the output captions demonstrate the effectiveness of the proposed method.
Suhang Wu, Jialong Tang, Baosong Yang, Ante Wang, Kaidi Jia, Jiawei Yu, Junfeng Yao, Jinsong Su
RALMs (Retrieval-Augmented Language Models) broaden their knowledge scope by incorporating external textual resources. However, the multilingual nature of global knowledge requires RALMs to handle diverse languages, a topic that has received limited research focus. In this work, we propose Futurepedia, a carefully crafted benchmark containing parallel texts across eight representative languages. We evaluate six multilingual RALMs using our benchmark to explore the challenges of multilingual RALMs. Experimental results reveal linguistic inequalities: 1) high-resource languages stand out in Monolingual Knowledge Extraction; 2) Indo-European languages lead RALMs to provide answers directly from documents, alleviating the challenge of expressing answers across languages; 3) English benefits from RALMs' selection bias and speaks louder in multilingual knowledge selection. Based on these findings, we offer advice for improving multilingual Retrieval Augmented Generation. For monolingual knowledge extraction, careful attention must be paid to cascading errors from translating low-resource languages into high-resource ones. In cross-lingual knowledge transfer, encouraging RALMs to provide answers within documents in different languages can improve transfer performance. For multilingual knowledge selection, incorporating more non-English documents and repositioning English documents can help mitigate RALMs' selection bias. Through comprehensive experiments, we underscore the complexities inherent in multilingual RALMs and offer valuable insights for future research.
Kemal Altwlkany, Sead Delalić, Adis Alihodžić, Elmedin Selmanović, Damir Hasić
Audio fingerprinting techniques have seen great advances in recent years, enabling accurate and fast audio retrieval even in conditions when the queried audio sample has been highly deteriorated or recorded in noisy conditions. Expectedly, most of the existing work is centered around music, with popular music identification services such as Apple's Shazam or Google's Now Playing designed for individual audio recognition on mobile devices. However, the spectral content of speech differs from that of music, necessitating modifications to current audio fingerprinting approaches. This paper offers fresh insights into adapting existing techniques to address the specialized challenge of speech retrieval in telecommunications and cloud communications platforms. The focus is on achieving rapid and accurate audio retrieval in batch processing instead of facilitating single requests, typically on a centralized server. Moreover, the paper demonstrates how this approach can be utilized to support audio clustering based on speech transcripts without undergoing actual speech-to-text conversion. This optimization enables significantly faster processing without the need for GPU computing, a requirement for real-time operation that is typically associated with state-of-the-art speech-to-text tools.
Ashutosh Chaubey, Anoubhav Agarwaal, Sartaki Sinha Roy, Aayush Agrawal, Susmita Ghose
Contextual advertising serves ads that are aligned to the content that the
user is viewing. The rapid growth of video content on social platforms and
streaming services, along with privacy concerns, has increased the need for
contextual advertising. Placing the right ad in the right context creates a
seamless and pleasant ad viewing experience, resulting in higher audience
engagement and, ultimately, better ad monetization. From a technology
standpoint, effective contextual advertising requires a video retrieval system
capable of understanding complex video content at a very granular level.
Current text-to-video retrieval models based on joint multimodal training
demand large datasets and computational resources, limiting their practicality
and lacking the key functionalities required for ad ecosystem integration. We
introduce ContextIQ, a multimodal expert-based video retrieval system designed
specifically for contextual advertising. ContextIQ utilizes modality-specific
experts for video, audio, transcript (captions), and metadata such as objects,
actions, and emotion to create semantically rich video representations. We
show that our system, without joint training, achieves better or comparable
results to state-of-the-art models and commercial solutions on multiple
text-to-video retrieval benchmarks. Our ablation studies highlight the benefits
of leveraging multiple modalities for enhanced video retrieval accuracy instead
of using a vision-language model alone. Furthermore, we show how video
retrieval systems such as ContextIQ can be used for contextual advertising in
an ad ecosystem while also addressing concerns related to brand safety and
filtering inappropriate content.
Authors' comments: Published at WACV 2025
Isidora Chara Tourni, Sayontan Ghosh, Brenda Miao, Constantijn van der Poel
This paper explores the problems of Question Answering (QA) and Named Entity
Recognition (NER) in five diverse languages. We tested five Large Language
Models with various prompting methods, including zero-shot, chain-of-thought
reasoning, and translation techniques. Our results show that while some models
consistently outperform others, their effectiveness varies significantly across
tasks and languages. We saw that advanced prompting techniques generally
improved QA performance but had mixed results for NER; and we observed that
language difficulty patterns differed between tasks. Our findings highlight the
need for task-specific approaches in multilingual NLP and suggest that current
models may develop different linguistic competencies for different tasks.
Authors' comments: MRL 2024 Shared Task on Multi-lingual Multi-task Information
Retrieval; 4th Multilingual Representation Learning (MRL) Workshop; EMNLP
2024
Timothy D. Gebhard, Jonas Wildberger, Maximilian Dax, Annalena Kofler, Daniel Angerhausen, Sascha P. Quanz, Bernhard Schölkopf
Inferring atmospheric properties of exoplanets from observed spectra is key
to understanding their formation, evolution, and habitability. Since
traditional Bayesian approaches to atmospheric retrieval (e.g., nested
sampling) are computationally expensive, a growing number of machine learning
(ML) methods such as neural posterior estimation (NPE) have been proposed. We
seek to make ML-based atmospheric retrieval (1) more reliable and accurate with
verified results, and (2) more flexible with respect to the underlying neural
networks and the choice of the assumed noise models. First, we adopt flow
matching posterior estimation (FMPE) as a new ML approach to atmospheric
retrieval. FMPE maintains many advantages of NPE, but provides greater
architectural flexibility and scalability. Second, we use importance sampling
(IS) to verify and correct ML results, and to compute an estimate of the
Bayesian evidence. Third, we condition our ML models on the assumed noise level
of a spectrum (i.e., error bars), thus making them adaptable to different noise
models. Both our noise level-conditional FMPE and NPE models perform on par
with nested sampling across a range of noise levels when tested on simulated
data. FMPE trains about 3 times faster than NPE and yields higher IS
efficiencies. IS successfully corrects inaccurate ML results, identifies model
failures via low efficiencies, and provides accurate estimates of the Bayesian
evidence. FMPE is a powerful alternative to NPE for fast, amortized, and
parallelizable atmospheric retrieval. IS can verify results, thus helping to
build confidence in ML-based approaches, while also facilitating model
comparison via the evidence ratio. Noise level conditioning allows design
studies for future instruments to be scaled up, for example, in terms of the
range of signal-to-noise ratios.
Authors' comments: Accepted for publication in Astronomy & Astrophysics
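The importance sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the standard definitions: weights w_i = p(theta_i) p(x | theta_i) / q(theta_i | x), sampling efficiency equal to the Kish effective sample size divided by the number of samples, and an evidence estimate equal to the mean weight. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_importance_weights(
    log_prior: Sequence[float],
    log_likelihood: Sequence[float],
    log_proposal: Sequence[float],
) -> List[float]:
    # log w_i = log p(theta_i) + log p(x | theta_i) - log q(theta_i | x)
    return [lp + ll - lq for lp, ll, lq in zip(log_prior, log_likelihood, log_proposal)]


def sampling_efficiency(log_w: Sequence[float]) -> float:
    # Kish effective sample size divided by n: (sum w)^2 / (n * sum w^2).
    # Subtracting the maximum log-weight avoids overflow and leaves the ratio unchanged.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    return sum(w) ** 2 / (len(w) * sum(x * x for x in w))


def log_evidence(log_w: Sequence[float]) -> float:
    # log Z is approximated by log of the mean weight, computed via log-sum-exp.
    m = max(log_w)
    return m + math.log(sum(math.exp(lw - m) for lw in log_w)) - math.log(len(log_w))


# A proposal that matches the posterior gives equal weights and efficiency 1.0.
perfect = sampling_efficiency([0.0, 0.0, 0.0, 0.0])
# One dominant weight means a single sample carries the estimate: efficiency 1/n.
degenerate = sampling_efficiency([0.0, -1000.0, -1000.0, -1000.0])
```

A low efficiency signals that the learned proposal is a poor match to the true posterior, which is how the abstract describes model failures being identified.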
Yang Tan, Ruilin Wang, Banghao Wu, Liang Hong, Bingxin Zhou
Enzyme engineering enables the modification of wild-type proteins to meet
industrial and research demands by enhancing catalytic activity, stability,
binding affinities, and other properties. The emergence of deep learning
methods for protein modeling has demonstrated superior results at lower costs
compared to traditional approaches such as directed evolution and rational
design. In mutation effect prediction, the key to pre-training deep learning
models lies in accurately interpreting the complex relationships among protein
sequence, structure, and function. This study introduces a retrieval-enhanced
protein language model for comprehensive analysis of native properties from
sequence and local structural interactions, as well as evolutionary properties
from retrieved homologous sequences. The state-of-the-art performance of the
proposed ProtREM is validated on over 2 million mutants across 217 assays from
an open benchmark (ProteinGym). We also conducted post-hoc analyses of the
model's ability to improve the stability and binding affinity of a VHH
antibody. Additionally, we designed 10 new mutants on a DNA polymerase and
conducted wet-lab experiments to evaluate their enhanced activity at higher
temperatures. Both in silico and experimental evaluations confirmed that our
method provides reliable predictions of mutation effects, offering an auxiliary
tool for biologists aiming to evolve existing enzymes. The implementation is
publicly available at https://github.com/tyang816/ProtREM.
Authors' comments: 25 pages, 10 figures, 8 tables
Jinlin Wang, Suyuchen Wang, Ziwen Xia, Sirui Hong, Yun Zhu, Bang Liu, Chenglin Wu
Large Language Models (LLMs) are proficient at retrieving single facts from
extended contexts, yet they struggle with tasks requiring the simultaneous
retrieval of multiple facts, especially during generation. This paper
identifies a novel "lost-in-the-middle" phenomenon, where LLMs progressively
lose track of critical information throughout the generation process, resulting
in incomplete or inaccurate retrieval. To address this challenge, we introduce
Find All Crucial Texts (FACT), an iterative retrieval method that refines
context through successive rounds of rewriting. This approach enables models to
capture essential facts incrementally, which are often overlooked in
single-pass retrieval. Experiments demonstrate that FACT substantially enhances
multi-fact retrieval performance across various tasks, though improvements are
less notable in general-purpose QA scenarios. Our findings shed light on the
limitations of LLMs in multi-fact retrieval and underscore the need for more
resilient long-context retrieval strategies.
Authors' comments: Work in Progress
Yen-Shan Chen, Jing Jin, Peng-Ting Kuo, Chao-Wei Huang, Yun-Nung Chen
Recent studies have demonstrated that large language models (LLMs) exhibit
significant biases in evaluation tasks, particularly in preferentially rating
and favoring self-generated content. However, the extent to which this bias
manifests in fact-oriented tasks, especially within retrieval-augmented
generation (RAG) frameworks, where keyword extraction and factual accuracy take
precedence over stylistic elements, remains unclear. Our study addresses this
knowledge gap by simulating two critical phases of the RAG framework. In the
first phase, we assess the suitability of human-authored versus model-generated
passages, emulating the pointwise reranking process. The second phase involves
conducting pairwise reading comprehension tests to simulate the generation
process. Contrary to previous findings indicating a self-preference in rating
tasks, our results reveal no significant self-preference effect in RAG
frameworks. Instead, we observe that factual accuracy significantly influences
LLMs' output, even in the absence of prior knowledge. Our research contributes
to the ongoing discourse on LLM biases and their implications for RAG-based
systems, offering insights that may inform the development of more robust and
unbiased LLM systems.
Authors' comments: 15 pages, 14 tables, 5 figures
Haoyu Zhang, Jun Liu, Zhenhua Zhu, Shulin Zeng, Maojia Sheng, Tao Yang, Guohao Dai, Yu Wang
Approximate nearest neighbor search (ANNS) for embedded vector representations of texts is commonly used in
information retrieval, with two important information representations being
sparse and dense vectors. While it has been shown that combining these
representations improves accuracy, the current method of conducting sparse and
dense vector searches separately suffers from low scalability and high system
complexity. Alternatively, building a unified index faces challenges with
accuracy and efficiency. To address these issues, we propose a graph-based ANNS
algorithm for dense-sparse hybrid vectors. Firstly, we propose a distribution
alignment method to improve accuracy, which pre-samples dense and sparse
vectors to analyze their distance distribution statistic, resulting in a
1%$\sim$9% increase in accuracy. Secondly, to improve efficiency, we design an
adaptive two-stage computation strategy that initially computes dense distances
only and later computes hybrid distances. Further, we prune the sparse vectors
to speed up the calculation. Compared to naive implementation, we achieve
$\sim2.1\times$ acceleration. Thorough experiments show that our algorithm
achieves 8.9x$\sim$11.7x throughput at equal accuracy compared to existing
hybrid vector search algorithms.
Authors' comments: 8 pages
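The two-stage idea above (dense distances first, hybrid distances later) can be illustrated with a brute-force sketch. This is not the graph-based algorithm from the abstract: it swaps in an exhaustive scan, and it assumes the hybrid score is the sum of a dense and a sparse inner product. The names `two_stage_search` and `shortlist` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

SparseVec = Dict[int, float]          # dimension index -> weight
Doc = Tuple[Sequence[float], SparseVec]


def dense_dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a: SparseVec, b: SparseVec) -> float:
    # Iterate over the smaller dictionary to keep the cost low.
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(i, 0.0) for i, v in a.items())


def two_stage_search(
    q_dense: Sequence[float],
    q_sparse: SparseVec,
    corpus: Sequence[Doc],
    shortlist: int,
    k: int,
) -> List[int]:
    # Stage 1: rank every document by the cheap dense score only.
    by_dense = sorted(
        range(len(corpus)),
        key=lambda i: dense_dot(q_dense, corpus[i][0]),
        reverse=True,
    )[:shortlist]
    # Stage 2: compute the full hybrid score only for the shortlist.
    scored = [
        (dense_dot(q_dense, corpus[i][0]) + sparse_dot(q_sparse, corpus[i][1]), i)
        for i in by_dense
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


corpus = [
    ([1.0, 0.0], {0: 1.0}),
    ([0.9, 0.1], {1: 2.0}),
    ([0.0, 1.0], {1: 5.0}),
]
top = two_stage_search([1.0, 0.0], {1: 1.0}, corpus, shortlist=2, k=1)
```

In this toy corpus the third document has the highest hybrid score but is dropped in stage 1, which shows the accuracy risk that a fixed shortlist carries and why an adaptive switch between stages may be preferable.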
Zihan Wang, Xuri Ge, Joemon M. Jose, Haitao Yu, Weizhi Ma, Zhaochun Ren, Xin Xin
Retrieval-augmented generation (RAG) has gained wide attention as the key
component to improve generative models with external knowledge augmentation
from information retrieval. It has shown great prominence in enhancing the
functionality and performance of large language model (LLM)-based applications.
However, with the comprehensive application of RAG, more and more problems and
limitations have been identified, thus urgently requiring further fundamental
exploration to improve current RAG frameworks. This workshop aims to explore in
depth how to conduct refined and reliable RAG for downstream AI tasks.
To this end, we propose to organize the first R3AG workshop at SIGIR-AP 2024
to call for participants to re-examine and formulate the basic principles and
practical implementation of refined and reliable RAG. The workshop serves as a
platform for both academia and industry researchers to conduct discussions,
share insights, and foster research to build the next generation of RAG
systems. Participants will engage in discussions and presentations focusing on
fundamental challenges, cutting-edge research, and potential pathways to
improve RAG. At the end of the workshop, we aim to have a clearer understanding
of how to improve the reliability and applicability of RAG with more robust
information retrieval and language generation.
Authors' comments: R^3AG workshop overview at SIGIR-AP 2024
Bin Kang, Bin Chen, Junjie Wang, Yong Xu
Text-based person retrieval aims to identify specific persons using textual descriptions as queries. Existing advanced methods typically depend on vision-language pre-trained (VLP) models to facilitate effective cross-modal alignment. However, the inherent constraints of VLP models, which include global alignment biases and insufficient self-feedback regulation, impede optimal retrieval performance. In this paper, we propose MeFa, a Multi-Pathway Exploration, Feedback, and Adjustment framework, which deeply explores intrinsic intra-modal and inter-modal feedback to make targeted adjustments, thereby achieving more precise person-text associations. Specifically, we first design an intra-modal reasoning pathway that generates hard negative samples for cross-modal data, leveraging feedback from these samples to refine intra-modal reasoning, thereby enhancing sensitivity to subtle discrepancies. Subsequently, we introduce a cross-modal refinement pathway that utilizes both global information and inter-modal feedback to refine local information, thus enhancing its global semantic representation. Finally, the discriminative clue correction pathway incorporates fine-grained features of secondary similarity as discriminative clues to further mitigate retrieval failures caused by disparities in these features. Experimental results on three public benchmarks demonstrate that MeFa achieves superior person retrieval performance without necessitating additional data or complex structures.
Cathal Maguire, Elyar Sedaghati, Neale P. Gibson, Alain Smette, Lorenzo Pino
Recent advancements in ultra-stable ground-based high-resolution
spectrographs have propelled ground-based astronomy to the forefront of
exoplanet detection and characterisation. Retrieving accurate atmospheric
parameters depends on accurate modelling and removal of the telluric
contamination while preserving the faint underlying exoplanet signal. There
exist many methods to model telluric contamination, whether directly modelling
the Earth's transmission spectrum via radiative transfer modelling, or using a
principal component analysis (PCA)-like reconstruction to fit the
time-invariant features of a spectrum. We aimed to assess the efficacy of these
various telluric removal methods in preserving the underlying exoplanetary
spectra. We compared two of the most common telluric modelling and removal
methods, molecfit and the PCA-like algorithm SysRem, using planetary
transmission spectra injected into three high-resolution optical observations
taken with ESPRESSO. These planetary signals were injected at orbital periods
of P = 2 days and 12 days, resulting in differing changes in radial velocity
during transit. We then retrieved various injected atmospheric model parameters
in order to determine the efficacy of the telluric removal methods. For the
close-in, high velocity injected signal, we found that SysRem performed better
for species that are also present in the Earth's atmosphere across each of the
datasets. As we moved to slower moving signals at larger orbital separations,
for one of the three datasets, SysRem dampened the planetary H$_2$O signal. In
contrast, the H$_2$O signal was preserved for the telluric modelling method,
molecfit. However, this behaviour was not ubiquitous across all three of the
injected datasets, with another dataset showing a more precise H$_2$O/Fe ratio
when preprocessed with SysRem.
Authors' comments: 25 pages, 23 figures, appendices included. Accepted for publication
in A&A
Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao
Key-Value (KV) caching is a common technique to enhance the computational
efficiency of Large Language Models (LLMs), but its memory overhead grows
rapidly with input length. Prior work has shown that not all tokens are equally
important for text generation, proposing layer-level KV cache compression to
selectively retain key information. Recognizing the distinct roles of attention
heads in generation, we propose HeadKV, a head-level KV cache compression
method, and HeadKV-R2, which leverages a novel contextual reasoning ability
estimation for compression. Our approach operates at the level of individual
heads, estimating their importance for contextual QA tasks that require both
retrieval and reasoning capabilities. Extensive experiments across diverse
benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct,
Mistral-7B-Instruct), and long-context abilities tests demonstrate that our
head-level KV cache compression significantly outperforms strong baselines,
particularly in low-resource settings (KV size = 64 & 128). Notably, our method
retains just 1.5% of the KV cache while achieving 97% of the performance of the
full KV cache on the contextual question answering benchmark.
Authors' comments: 18 pages, submitted to ICLR 2025
Fengchen Liu, Jordan Jung, Wei Feinstein, Jeff DAmbrogia, Gary Jung
This paper introduces a novel approach to enhancing closed-domain Question Answering (QA) systems, focusing on the specific needs of the Lawrence Berkeley National Laboratory (LBL) Science Information Technology (ScienceIT) domain. Utilizing a rich dataset derived from the ScienceIT documentation, our study embarks on a detailed comparison of two fine-tuned large language models and five retrieval-augmented generation (RAG) models. Through data processing techniques, we transform the documentation into structured context-question-answer triples, leveraging the latest Large Language Models (AWS Bedrock, GCP PaLM2, Meta LLaMA2, OpenAI GPT-4, Google Gemini-Pro) for data-driven insights. Additionally, we introduce the Aggregated Knowledge Model (AKM), which synthesizes responses from the seven models mentioned above using K-means clustering to select the most representative answers. The evaluation of these models across multiple metrics offers a comprehensive look into their effectiveness and suitability for the LBL ScienceIT environment. The results demonstrate the potential benefits of integrating fine-tuning and retrieval-augmented strategies, highlighting significant performance improvements achieved with the AKM. The insights gained from this study can be applied to develop specialized QA systems tailored to specific domains.
Zhihao Liu, Simon Filhol, Désirée Treichler
Estimating the variability of seasonal snow cover, in particular snow depth in remote areas, poses significant challenges due to limited spatial and temporal data availability. This study uses snow depth measurements from the ICESat-2 satellite laser altimeter, which are sparse in both space and time, and incorporates them with climate reanalysis data into a downscaling-calibration scheme to produce monthly gridded snow depth maps at microscale (10 m). Snow surface elevation measurements from ICESat-2 along profiles are compared to a digital elevation model to determine snow depth at each point. To efficiently turn sparse measurements into snow depth maps, a regression model is fitted to establish a relationship between the retrieved snow depth and the corresponding ERA5 Land snow depth. This relationship, referred to as subgrid variability, is then applied to downscale the monthly ERA5 Land snow depth data. The method can provide timeseries of monthly snow depth maps for the entire ERA5 time range (since 1950). The validation of downscaled snow depth data was performed at an intermediate scale (100 m x 500 m) using datasets from airborne laser scanning (ALS) in the Hardangervidda region of southern Norway. Results show that snow depth prediction achieved R2 values ranging from 0.74 to 0.88 (post-calibration). The method relies on globally available data and is applicable to other snow regions above the treeline. Though requiring area-specific calibration, our approach has the potential to provide snow depth maps in areas where no such data exist and can be used to extrapolate existing snow surveys in time and over larger areas. With this, it can offer valuable input data for hydrological, ecological or permafrost modeling tasks.
Salman Rakin, Md. A. R. Shibly, Zahin M. Hossain, Zeeshan Khan, Md. Mostofa Akbar
While ongoing advancements in Large Language Models have demonstrated
remarkable success across various NLP tasks, the Retrieval-Augmented Generation
(RAG) model stands out as highly effective in downstream applications like
Question Answering. Recently, the RAG-end2end model further optimized the
architecture and achieved notable performance improvements on domain
adaptation. However, the effectiveness of these RAG-based architectures remains
relatively unexplored when fine-tuned on specialized domains such as customer
service for building a reliable conversational AI system. Furthermore, a
critical challenge persists in reducing the occurrence of hallucinations while
maintaining high domain-specific accuracy. In this paper, we investigated the
performance of diverse RAG and RAG-like architectures through domain adaptation
and evaluated their ability to generate accurate and relevant responses grounded
in the contextual knowledge base. To facilitate the evaluation of the models,
we constructed a novel dataset, HotelConvQA, sourced from a wide range of
hotel-related conversations, and fine-tuned all the models on our
domain-specific dataset. We also addressed a critical research gap on determining the
impact of domain adaptation on reducing hallucinations across different RAG
architectures, an aspect that was not properly measured in prior work. Our
evaluation shows positive results in all metrics by employing domain
adaptation, demonstrating strong performance on QA tasks and providing insights
into their efficacy in reducing hallucinations. Our findings clearly indicate
that domain adaptation not only enhances the models' performance on QA tasks
but also significantly reduces hallucination across all evaluated RAG
architectures.
Authors' comments: Initial Version fine-tuned on HotelConvQA
Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li et al.
Retrieval-augmented generation (RAG) enhances the question-answering (QA)
abilities of large language models (LLMs) by integrating external knowledge.
However, adapting general-purpose RAG systems to specialized fields such as
science and medicine poses unique challenges due to distribution shifts and
limited access to domain-specific data. To tackle this, we propose SimRAG, a
self-training approach that equips the LLM with joint capabilities of question
answering and question generation for domain adaptation. Our method first
fine-tunes the LLM on instruction-following, question-answering, and
search-related data. Then, it prompts the same LLM to generate diverse
domain-relevant questions from unlabeled corpora, with an additional filtering
strategy to retain high-quality synthetic examples. By leveraging these
self-generated synthetic examples, the LLM can improve its performance on
domain-specific RAG tasks. Experiments on 11 datasets, spanning two backbone
sizes and three domains, demonstrate that SimRAG outperforms baselines by
1.2\%--8.6\%.
Authors' comments: Accepted to NAACL 2025 main conference