Oliver Savolainen, Dur e Najaf Amjad, Roxana Petcu
This reproducibility study analyzes and extends the paper "Axiomatic Causal
Interventions for Reverse Engineering Relevance Computation in Neural Retrieval
Models," which investigates how neural retrieval models encode task-relevant
properties such as term frequency. We reproduce key experiments from the
original paper, confirming that information on query terms is captured in the
model encoding. We extend this work by applying activation patching to Spanish
and Chinese datasets and by exploring whether document-length information is
encoded in the model as well. Our results confirm that the designed activation
patching method can isolate the behavior to specific components and tokens in
neural retrieval models. Moreover, our findings indicate that the location of
term frequency generalizes across languages and that in later layers, the
information for sequence-level tasks is represented in the CLS token. The
results highlight the need for further research into interpretability in
information retrieval and reproducibility in machine learning research. Our
code is available at
https://github.com/OliverSavolainen/axiomatic-ir-reproduce.
Authors' comments: 10 pages, SIGIR 2025
Zaifu Zhan, Shuang Zhou, Xiaoshan Zhou, Yongkang Xiao, Jun Wang, Jiawen Deng, He Zhu, Yu Hou et al.
Objectives: We aim to dynamically retrieve informative demonstrations,
enhancing in-context learning in multimodal large language models (MLLMs) for
disease classification.
Methods: We propose a Retrieval-Augmented In-Context Learning (RAICL)
framework, which integrates retrieval-augmented generation (RAG) and in-context
learning (ICL) to adaptively select demonstrations with similar disease
patterns, enabling more effective ICL in MLLMs. Specifically, RAICL examines
embeddings from diverse encoders, including ResNet, BERT, BioBERT, and
ClinicalBERT, to retrieve appropriate demonstrations, and constructs
conversational prompts optimized for ICL. We evaluated the framework on two
real-world multi-modal datasets (TCGA and IU Chest X-ray), assessing its
performance across multiple MLLMs (Qwen, Llava, Gemma), embedding strategies,
similarity metrics, and varying numbers of demonstrations.
Results: RAICL consistently improved classification performance. Accuracy
increased from 0.7854 to 0.8368 on TCGA and from 0.7924 to 0.8658 on IU Chest
X-ray. Multi-modal inputs outperformed single-modal ones, with text-only inputs
being stronger than images alone. The richness of information embedded in each
modality determines which embedding model yields better
results. Few-shot experiments showed that increasing the number of retrieved
examples further enhanced performance. Across different similarity metrics,
Euclidean distance achieved the highest accuracy while cosine similarity
yielded better macro-F1 scores. RAICL demonstrated consistent improvements
across various MLLMs, confirming its robustness and versatility.
Conclusions: RAICL provides an efficient and scalable approach to enhance
in-context learning in MLLMs for multimodal disease classification.
Authors' comments: 17 Pages, 1 figure, 7 tables
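The similarity-based demonstration retrieval described above can be sketched as follows. This is a minimal illustration, not the RAICL implementation: the function names, the two-dimensional toy embeddings, and the labels are all hypothetical, and a real system would use encoder outputs such as those from ResNet or ClinicalBERT. The sketch only shows why Euclidean distance and cosine similarity can select different demonstrations for the same query.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (ignores vector length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (sensitive to length)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_demonstrations(query, pool, k, metric="cosine"):
    """Return the k (embedding, label) pairs from pool closest to query."""
    if metric == "cosine":
        # Higher similarity is better, so sort in descending order.
        ranked = sorted(pool, key=lambda item: cosine_similarity(query, item[0]),
                        reverse=True)
    elif metric == "euclidean":
        # Lower distance is better, so sort in ascending order.
        ranked = sorted(pool, key=lambda item: euclidean_distance(query, item[0]))
    else:
        raise ValueError(f"unknown metric: {metric}")
    return ranked[:k]


# Hypothetical toy pool of (embedding, label) pairs.
pool = [
    ([1.0, 0.0], "A"),
    ([10.0, 1.0], "B"),
    ([0.0, 1.0], "C"),
    ([2.0, 2.0], "D"),
]
query = [3.0, 0.3]

by_cosine = [label for _, label in retrieve_demonstrations(query, pool, 2, "cosine")]
by_euclid = [label for _, label in retrieve_demonstrations(query, pool, 2, "euclidean")]
print(by_cosine, by_euclid)
```

In this toy pool, entry B points in the same direction as the query but is far away, so it ranks first under cosine similarity and last under Euclidean distance.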
Regan Bolton, Mohammadreza Sheikhfathollahi, Simon Parkinson, Vanessa Vulovic, Gary Bamford, Dan Basher, Howard Parkinson
Safety-critical software must be robustly assessed against complex regulatory frameworks, a process traditionally limited by manual evaluation. This paper presents Document Retrieval-Augmented Fine-Tuning (DRAFT), a novel approach that enhances the capabilities of a large language model (LLM) for safety-critical compliance assessment. DRAFT builds upon existing Retrieval-Augmented Generation (RAG) techniques by introducing a novel fine-tuning framework that accommodates our dual-retrieval architecture, which simultaneously accesses both software documentation and applicable reference standards. To fine-tune DRAFT, we develop a semi-automated dataset generation methodology that incorporates variable numbers of relevant documents with meaningful distractors, closely mirroring real-world assessment scenarios. Experiments with GPT-4o-mini demonstrate a 7% improvement in correctness over the baseline model, with qualitative improvements in evidence handling, response structure, and domain-specific reasoning. DRAFT represents a practical approach to improving compliance assessment systems while maintaining the transparency and evidence-based reasoning essential in regulatory domains.
Jiawei He, Boya Zhang, Hossein Rouhizadeh, Yingjian Chen, Rui Yang, Jin Lu, Xudong Chen, Nan Liu et al.
Recent advances in large language models (LLMs) have demonstrated remarkable
capabilities in natural language processing tasks. However, their application
in the biomedical domain presents unique challenges, particularly regarding
factual accuracy and up-to-date knowledge integration. Retrieval Augmented
Generation (RAG) has emerged as a promising solution to address these
challenges by combining the generative capabilities of LLMs with external
knowledge retrieval. This comprehensive survey examines the application of RAG
in the biomedical domain, focusing on its technological components, available
datasets, and clinical applications. We present a systematic analysis of
retrieval methods, ranking strategies, and generation models, while also
exploring the challenges and future directions in this rapidly evolving field.
Our work provides researchers and practitioners with a thorough understanding
of the current state of biomedical RAG systems and identifies key areas for
future research and development.
Authors' comments: 30 pages
Tasnim Ahmed, Salimur Choudhury
Linear Programming (LP) problems aim to find the optimal solution to an
objective under constraints. These problems typically require domain knowledge,
mathematical skills, and programming ability, presenting significant challenges
for non-experts. This study explores the efficiency of Large Language Models
(LLMs) in generating solver-specific LP code. We propose CHORUS, a
retrieval-augmented generation (RAG) framework for synthesizing Gurobi-based LP
code from natural language problem statements. CHORUS incorporates a
hierarchical tree-like chunking strategy for theoretical contents and generates
additional metadata based on code examples from documentation to facilitate
self-contained, semantically coherent retrieval. The two-stage retrieval approach
of CHORUS, followed by cross-encoder reranking, further ensures contextual
relevance. Finally, an expertly crafted prompt and a structured parser with
reasoning steps improve code generation performance significantly. Experiments
on the NL4Opt-Code benchmark show that CHORUS improves the performance of
open-source LLMs such as Llama3.1 (8B), Llama3.3 (70B), Phi4 (14B), Deepseek-r1
(32B), and Qwen2.5-coder (32B) by a significant margin compared to baseline and
conventional RAG. It also allows these open-source LLMs to outperform or match
the performance of much stronger baselines (GPT3.5 and GPT4) while requiring far
fewer computational resources. Ablation studies further demonstrate the
importance of expert prompting, hierarchical chunking, and structured
reasoning.
Authors' comments: This paper has been accepted for presentation at the 19th Learning
and Intelligent Optimization Conference (LION 19)
Fuma Ito, Chihiro Tsutake, Keita Takahashi, Toshiaki Fujii
To efficiently compress the sign information of images, we address a sign retrieval problem for the block-wise discrete cosine transform (DCT): reconstruction of the signs of DCT coefficients from their amplitudes. To this end, we propose a fast sign retrieval method based on binary classification with machine learning. We first introduce 3D representations of the amplitudes and signs, where we pack amplitudes/signs belonging to the same frequency band into a 2D slice, referred to as the sub-band block. We then retrieve the signs from the 3D amplitudes via binary classification, where each sign is regarded as a binary label. We implement a binary classification algorithm using convolutional neural networks, which are advantageous for efficiently extracting features in the 3D amplitudes. Experimental results demonstrate that our method achieves accurate sign retrieval with an overwhelmingly low computational cost.
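The data layout described above (one 2D slice per frequency band, with amplitudes and signs stored separately) can be sketched as follows. This covers only the packing step, not the CNN classifier that predicts signs from amplitudes. The function names, the nested-list representation, and the choice to label a zero coefficient as positive are assumptions made for illustration, and the toy input uses 2x2 blocks of made-up coefficients rather than real DCT output.

```python
def split_and_pack(blocks):
    """Split block-wise coefficients into amplitude and sign volumes.

    blocks[by][bx][u][v] is coefficient (u, v) of the block at grid
    position (by, bx). The returned volumes are indexed [u][v][by][bx],
    so volume[u][v] is the 2D slice for one frequency band.
    """
    n_rows, n_cols = len(blocks), len(blocks[0])
    size = len(blocks[0][0])  # coefficients per block side
    amplitudes = [[[[0.0] * n_cols for _ in range(n_rows)]
                   for _ in range(size)] for _ in range(size)]
    signs = [[[[0] * n_cols for _ in range(n_rows)]
              for _ in range(size)] for _ in range(size)]
    for by in range(n_rows):
        for bx in range(n_cols):
            for u in range(size):
                for v in range(size):
                    c = blocks[by][bx][u][v]
                    amplitudes[u][v][by][bx] = abs(c)
                    # Assumption: label 1 for non-negative, 0 for negative.
                    signs[u][v][by][bx] = 1 if c >= 0 else 0
    return amplitudes, signs


def unpack_and_merge(amplitudes, signs):
    """Inverse of split_and_pack: rebuild signed block-wise coefficients."""
    size = len(amplitudes)
    n_rows, n_cols = len(amplitudes[0][0]), len(amplitudes[0][0][0])
    blocks = [[[[0.0] * size for _ in range(size)]
               for _ in range(n_cols)] for _ in range(n_rows)]
    for u in range(size):
        for v in range(size):
            for by in range(n_rows):
                for bx in range(n_cols):
                    factor = 1.0 if signs[u][v][by][bx] else -1.0
                    blocks[by][bx][u][v] = factor * amplitudes[u][v][by][bx]
    return blocks


# Hypothetical input: a 1 x 2 grid of blocks, each with 2 x 2 coefficients.
blocks = [[
    [[5.0, -1.0], [0.5, -0.25]],
    [[-4.0, 2.0], [-0.5, 0.0]],
]]
amplitudes, signs = split_and_pack(blocks)
print(amplitudes[0][0], signs[0][0])  # slice for frequency band (0, 0)
```

In the setting of the abstract, only the amplitude volume would be available at the decoder, and a classifier would supply the sign volume before the merge step.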
Aleksei Dorkin, Kairit Sirts
We present our submission to Task 5 of SemEval-2025, which aims to aid
librarians in assigning subject tags to library records by producing a list
of likely relevant tags for a given document. We frame the task as an
information retrieval problem, where the document content is used to retrieve
subject tags from a large subject taxonomy. We leverage two types of encoder
models to build a two-stage information retrieval system -- a bi-encoder for
coarse-grained candidate extraction at the first stage, and a cross-encoder for
fine-grained re-ranking at the second stage. This approach proved effective,
demonstrating significant improvements in recall compared to single-stage
methods and showing competitive results according to qualitative evaluation.
Authors' comments: To appear in the Proceedings of the 19th International Workshop on
Semantic Evaluation (SemEval-2025)
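The two-stage pipeline described above (coarse candidate extraction, then fine-grained re-ranking) can be sketched as follows. This is not the authors' system: `embed` and `joint_score` are hypothetical stand-ins that use bag-of-words counts and token overlap in place of trained bi-encoder and cross-encoder models, and the taxonomy and document are invented. The sketch only shows the control flow: tags are scored independently of the document in stage one and jointly with it in stage two.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def joint_score(document, tag):
    """Stand-in for a cross-encoder: scores the (document, tag) pair jointly.

    Fraction of tag tokens found in the document, plus a bonus when the
    whole tag occurs as a contiguous phrase.
    """
    doc = document.lower()
    doc_tokens = set(doc.split())
    tag_tokens = tag.lower().split()
    overlap = sum(t in doc_tokens for t in tag_tokens) / len(tag_tokens)
    phrase_bonus = 1.0 if tag.lower() in doc else 0.0
    return overlap + phrase_bonus


def suggest_tags(document, taxonomy, n_candidates=3, n_final=2):
    """Stage 1: cheap independent scoring. Stage 2: re-rank the survivors."""
    doc_vec = embed(document)
    # Tag embeddings do not depend on the document and could be precomputed.
    candidates = sorted(taxonomy,
                        key=lambda tag: cosine(doc_vec, embed(tag)),
                        reverse=True)[:n_candidates]
    # The more expensive joint scorer only sees the short candidate list.
    return sorted(candidates,
                  key=lambda tag: joint_score(document, tag),
                  reverse=True)[:n_final]


taxonomy = ["machine learning", "learning disabilities",
            "library science", "marine biology"]
document = "a survey of machine learning methods for library science catalogues"
tags = suggest_tags(document, taxonomy)
print(tags)
```

The design point is cost: the first stage scales to a large taxonomy because tag vectors are computed once, while the second stage is applied only to the few candidates that survive.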
Xuanzhao Dong, Wenhui Zhu, Hao Wang, Xiwen Chen, Peijie Qiu, Rui Yin, Yi Su, Yalin Wang
Medical question answering (QA) is a reasoning-intensive task that remains challenging for large language models (LLMs) due to hallucinations and outdated domain knowledge. Retrieval-Augmented Generation (RAG) provides a promising post-training solution by leveraging external knowledge. However, existing medical RAG systems suffer from two key limitations: (1) a lack of modeling for human-like reasoning behaviors during information retrieval, and (2) reliance on suboptimal medical corpora, which often results in the retrieval of irrelevant or noisy snippets. To overcome these challenges, we propose Discuss-RAG, a plug-and-play module designed to enhance the medical QA RAG system through collaborative agent-based reasoning. Our method introduces a summarizer agent that orchestrates a team of medical experts to emulate multi-turn brainstorming, thereby improving the relevance of retrieved content. Additionally, a decision-making agent evaluates the retrieved snippets before their final integration. Experimental results on four benchmark medical QA datasets show that Discuss-RAG consistently outperforms MedRAG, improving answer accuracy by up to 16.67% on BioASQ and 12.20% on PubMedQA. The code is available at: https://github.com/LLM-VLM-GSL/Discuss-RAG.
Feifei Niu, Chuanyi Li, Kui Liu, Xin Xia, David Lo
Bug localization is a crucial aspect of software maintenance, running through the entire software lifecycle. Information retrieval-based bug localization (IRBL) identifies buggy code based on bug reports, expediting the bug resolution process for developers. Recent years have witnessed significant achievements in IRBL, propelled by the widespread adoption of deep learning (DL). To provide a comprehensive overview of the current state of the art and delve into key issues, we conduct a survey encompassing 61 IRBL studies leveraging DL. We summarize best practices in each phase of the IRBL workflow, undertake a meta-analysis of prior studies, and suggest future research directions. This exploration aims to guide further advancements in the field, fostering a deeper understanding and refining practices for effective bug localization. Our study suggests that the integration of DL in IRBL enhances the model's capacity to extract semantic and syntactic information from both bug reports and source code, addressing issues such as lexical gaps, neglect of code structure information, and cold-start problems. Future research avenues for IRBL encompass exploring diversity in programming languages, adopting fine-grained granularity, and focusing on real-world applications. Most importantly, although some studies have started using large language models for IRBL, there is still a need for more in-depth exploration and thorough investigation in this area.
Yanliang Li, Wenbo Li, Qian Gong, Qing Liu, Norbert Podhorszki, Scott Klasky, Xin Liang, Jieyang Chen
Scientific applications produce vast amounts of data, posing grand challenges in the underlying data management and analytic tasks. Progressive compression is a promising way to address this problem, as it allows for on-demand data retrieval with significantly reduced data movement cost. However, most existing progressive methods are designed for CPUs, leaving the power of today's heterogeneous computing systems with GPUs untapped. In this work, we propose HP-MDR, a high-performance and portable data refactoring and progressive retrieval framework for GPUs. Our contributions are four-fold: (1) We carefully optimize the bitplane encoding and lossless encoding, two key stages in progressive methods, to achieve high performance on GPUs; (2) We propose pipeline optimization and incorporate it with data refactoring and progressive retrieval workflows to further enhance the performance of large-scale data processing; (3) We leverage our framework to enable high-performance data retrieval with guaranteed error control for common Quantities of Interest; (4) We evaluate HP-MDR and compare it with state-of-the-art methods using five real-world datasets. Experimental results demonstrate that HP-MDR delivers up to 6.6x throughput in data refactoring and progressive retrieval tasks. It also leads to 10.4x throughput for recomposing required data representations under Quantity-of-Interest error control and 4.2x performance for the corresponding end-to-end data retrieval, when compared with state-of-the-art solutions.
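The idea behind bitplane encoding for progressive retrieval can be sketched as follows. This is a plain CPU illustration on small non-negative integers, not the HP-MDR GPU kernels: the function names and the 8-bit toy data are assumptions, and a real framework would work on quantized multilevel coefficients and add lossless compression of each plane. The sketch shows the key property that reading more planes tightens a known error bound.

```python
def encode_bitplanes(values, num_bits):
    """Split non-negative integers into bitplanes, most significant first.

    planes[p][i] holds bit (num_bits - 1 - p) of values[i].
    """
    planes = []
    for bit in range(num_bits - 1, -1, -1):
        planes.append([(v >> bit) & 1 for v in values])
    return planes


def decode_bitplanes(planes, num_bits, planes_to_use):
    """Reconstruct values from the first planes_to_use planes.

    Bits in the planes that were not read are treated as zero, so each
    reconstructed value undershoots the original by less than
    2 ** (num_bits - planes_to_use).
    """
    count = len(planes[0])
    values = [0] * count
    for p in range(planes_to_use):
        bit = num_bits - 1 - p
        for i in range(count):
            values[i] |= planes[p][i] << bit
    return values


data = [200, 13, 97, 255]
planes = encode_bitplanes(data, 8)

coarse = decode_bitplanes(planes, 8, 3)  # read only the top 3 planes
exact = decode_bitplanes(planes, 8, 8)   # read everything
print(coarse, exact)
```

Because the planes are stored most significant first, a reader that needs only a coarse approximation can stop early and move less data, which is the on-demand behaviour the abstract refers to.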
Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang
Retrieval-Augmented Generation (RAG) has shown substantial promise in
improving factual accuracy by grounding model responses with external knowledge
relevant to queries. However, most existing RAG approaches are limited to a
text-only corpus, and while recent efforts have extended RAG to other
modalities such as images and videos, they typically operate over a single
modality-specific corpus. In contrast, real-world queries vary widely in the
type of knowledge they require, which a single type of knowledge source cannot
address. To address this, we introduce UniversalRAG, a novel RAG framework
designed to retrieve and integrate knowledge from heterogeneous sources with
diverse modalities and granularities. Specifically, motivated by the
observation that forcing all modalities into a unified representation space
derived from a single combined corpus causes a modality gap, where the
retrieval tends to favor items from the same modality as the query, we propose
a modality-aware routing mechanism that dynamically identifies the most
appropriate modality-specific corpus and performs targeted retrieval within it.
Also, beyond modality, we organize each modality into multiple granularity
levels, enabling fine-tuned retrieval tailored to the complexity and scope of
the query. We validate UniversalRAG on 8 benchmarks spanning multiple
modalities, showing its superiority over modality-specific and unified
baselines.
Authors' comments: Project page : https://universalrag.github.io
Wing Yan Li, Zeqiang Wang, Jon Johnson, Suparna De
Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multi-disciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance with other approaches performing comparably. Additionally, the re-ranking of the probabilistic model's results with neural models only introduces modest improvements of 0.07 at most in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have a low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.
Magnus Bengtsson, Jens Wittsten, Jonas Waidringer
This paper introduces a warehouse optimization procedure aimed at enhancing
the efficiency of product storage and retrieval. By representing product
locations and order flows within a time-evolving graph structure, we employ
unsupervised clustering to define and refine compact order regions, effectively
reducing picking distances. We describe the procedure using a dynamic
mathematical model formulated using tools from random dynamical systems theory,
enabling a principled analysis of the system's behavior over time even under
random operational variations. For routing within this framework, we implement
a parallelized Bellman-Ford algorithm, utilizing GPU acceleration to evaluate
path segments efficiently. To address scalability challenges inherent in large
routing graphs, we introduce a segmentation strategy that preserves performance
while maintaining tractable memory requirements. Our results demonstrate
significant improvements in both operational efficiency and computational
feasibility for large-scale warehouse environments.
Authors' comments: 19 pages. Comments welcome
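For reference, the Bellman-Ford algorithm mentioned above can be sketched in its standard sequential form. This is not the authors' parallelized GPU implementation or their segmentation strategy: the function name, the edge-list representation, and the toy graph are assumptions for illustration.

```python
import math


def bellman_ford(num_nodes, edges, source):
    """Single-source shortest paths on a directed graph.

    edges is a list of (u, v, weight) triples. Returns a list of
    distances from source; unreachable nodes keep distance infinity.
    Raises ValueError if a negative-weight cycle is reachable.
    """
    dist = [math.inf] * num_nodes
    dist[source] = 0.0
    # At most num_nodes - 1 passes are needed for shortest paths to settle.
    for _ in range(num_nodes - 1):
        updated = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                updated = True
        if not updated:
            break  # converged early
    # One more pass: any further improvement implies a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist


# Hypothetical toy graph: nodes could stand for warehouse locations.
edges = [
    (0, 1, 4.0),
    (0, 2, 1.0),
    (2, 1, 2.0),
    (1, 3, 5.0),
    (2, 3, 8.0),
]
distances = bellman_ford(4, edges, source=0)
print(distances)
```

Each pass applies the same relaxation test to every edge, and that regular per-edge structure is what makes the algorithm a natural candidate for parallel evaluation.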
Michele Garetto, Alessandro Cornacchia, Franco Galante, Emilio Leonardi, Alessandro Nordio, Alberto Tarable
The advent of Large Language Models (LLMs) and generative AI is fundamentally
transforming information retrieval and processing on the Internet, bringing
both great potential and significant concerns regarding content authenticity
and reliability. This paper presents a novel quantitative approach to shed
light on the complex information dynamics arising from the growing use of
generative AI tools. Despite their significant impact on the digital ecosystem,
these dynamics remain largely uncharted and poorly understood. We propose a
stochastic model to characterize the generation, indexing, and dissemination of
information in response to new topics. This scenario particularly challenges
current LLMs, which often rely on real-time Retrieval-Augmented Generation
(RAG) techniques to overcome their static knowledge limitations. Our findings
suggest that the rapid pace of generative AI adoption, combined with increasing
user reliance, can outpace human verification, escalating the risk of
inaccurate information proliferation across digital resources. An in-depth
analysis of Stack Exchange data confirms that high-quality answers inevitably
require substantial time and human effort to emerge. This underscores the
considerable risks associated with generating persuasive text in response to
new questions and highlights the critical need for responsible development and
deployment of future generative AI tools.
Authors' comments: To be presented at ACM SIGIR 25
Jinglin He, Yunqi Guo, Lai Kwan Lam, Waikei Leung, Lixing He, Yuanan Jiang, Chi Chiu Wang, Guoliang Xing et al.
Traditional Chinese Medicine (TCM) represents a rich repository of ancient
medical knowledge that continues to play an important role in modern
healthcare. Due to the complexity and breadth of the TCM literature, the
integration of AI technologies is critical for its modernization and broader
accessibility. However, this integration poses considerable challenges,
including the interpretation of obscure classical Chinese texts and the
modeling of intricate semantic relationships among TCM concepts. In this paper,
we develop OpenTCM, an LLM-based system that combines a domain-specific TCM
knowledge graph and Graph-based Retrieval-Augmented Generation (GraphRAG).
First, we extract more than 3.73 million classical Chinese characters from 68
gynecological books in the Chinese Medical Classics Database, with the help of
TCM and gynecology experts. Second, we construct a comprehensive
multi-relational knowledge graph comprising more than 48,000 entities and
152,000 interrelationships, using customized prompts and Chinese-oriented LLMs
such as DeepSeek and Kimi to ensure high-fidelity semantic understanding. Last,
we integrate OpenTCM with this knowledge graph, enabling high-fidelity
ingredient knowledge retrieval and diagnostic question-answering without model
fine-tuning. Experimental evaluations demonstrate that OpenTCM achieves mean
expert scores (MES) of 4.378 in ingredient information retrieval and 4.045 in
diagnostic question-answering tasks, outperforming state-of-the-art solutions
in real-world TCM use cases.
Authors' comments: 8 pages, 6 figures, 7 tables
Junlong Ren, Gangjian Zhang, Yu Hu, Jian Shu, Hao Wang
Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant video moment features and treating them as hard negative samples, thereby encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, which enhances feature discrimination by training the model to predict the original temporal order of randomly shuffled video frames and moments. Extensive experiments on three datasets demonstrate the superiority of our approach compared to previous methods, achieving state-of-the-art results.
Pierre Dussarrat, Guillaume Deschamps
Nowadays, tens of satellites carry hyperspectral spectrometers. Such instruments decompose the light that exits the top of the atmosphere into hundreds to thousands of contiguous spectral channels. By analyzing the spectral distribution of the light, and in particular the depths of selected absorption lines, researchers and meteorological agencies can retrieve the composition and thermodynamic state of the atmosphere. To get a global view of the Earth, several instruments are generally operated synergistically; a harmonized calibration must therefore be achieved between them. To cross-calibrate two spectrometers, a common practice is to analyze an ensemble of collocated measurements, meaning acquisitions performed at the same time and under the same geometry. Nonetheless, such analysis always faces the issue of setting appropriate temporal and geometric thresholds in defining the collocations, trading off between statistics and quality. Consequently, some collocation mismatches may have a substantial impact on the cross-calibration results. This manuscript therefore describes in detail the inclusion of collocation errors in the mathematical description and presents an application designed to be robust to such errors. Furthermore, knowledge of the spectral sensitivities of each channel to the incoming light, called the spectral response functions (SRF), is key to the exploitation of the acquisitions. In that context, the authors have studied and designed a novel methodology to retrieve relative SRF between two or more spectrometers, within a single instrument or between instruments aboard different platforms. The objective of the methodology is to characterize discrepancies in response between flying spectrometers, track long-term evolutions, and harmonize their responses with post-processing when necessary.
Pengchao Feng, Ziyang Ma, Wenxi Chen, Yao Li, Sheng Wang, Kai Yu, Xie Chen
In recent years, end-to-end speech-to-speech (S2S) dialogue systems have garnered increasing research attention due to their advantages over traditional cascaded systems, including lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these end-to-end systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries, eliminating the need for intermediate speech-to-text conversion via techniques like ASR. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. We will release the code and dataset to support reproducibility and promote further research in this area.
Jacky He, Guiran Liu, Binrong Zhu, Hanlu Zhang, Hongye Zheng, Xiaokai Wang
This paper focuses on the dynamic optimization of the Retrieval-Augmented Generation (RAG) architecture. It proposes a state-aware dynamic knowledge retrieval mechanism to enhance semantic understanding and knowledge scheduling efficiency in large language models for open-domain question answering and complex generation tasks. The method introduces a multi-level perceptive retrieval vector construction strategy and a differentiable document matching path. These components enable end-to-end joint training and collaborative optimization of the retrieval and generation modules. This effectively addresses the limitations of static RAG structures in context adaptation and knowledge access. Experiments are conducted on the Natural Questions dataset. The proposed structure is thoroughly evaluated across different large models, including GPT-4, GPT-4o, and DeepSeek. Comparative and ablation experiments from multiple perspectives confirm the significant improvements in BLEU and ROUGE-L scores. The approach also demonstrates stronger robustness and generation consistency in tasks involving semantic ambiguity and multi-document fusion. These results highlight its broad application potential and practical value in building high-quality language generation systems.
Hang Yu, Jiahao Wen, Zhedong Zheng
Text-based person retrieval aims to identify specific individuals within an image database using textual descriptions. Due to the high cost of annotation and privacy protection, researchers resort to synthesized data for the paradigm of pretraining and fine-tuning. However, these generated data often exhibit domain biases in both images and textual annotations, which largely compromise the scalability of the pre-trained model. Therefore, we introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model's generalization capability during pretraining and thereby facilitate subsequent downstream tasks. In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios, and introduce a dynamic error sample memory unit to memorize the history of errors encountered across multiple tasks. To further ensure multi-task adaptation, we also adopt an adaptive dual-speed update strategy, balancing fast adaptation to new tasks and slow weight updates for historical tasks. Albeit simple, our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, but also showcases robustness and scalability in handling biased synthetic images and noisy text annotations. Our code is available at https://github.com/Jahawn-Wen/CAMeL-reID.