Hengran Zhang, Keping Bi, Jiafeng Guo, Xiaojie Sun, Shihao Liu, Daiting Shi, Dawei Yin, Xueqi Cheng
Dense retrieval is a crucial task in Information Retrieval (IR) and is the
foundation for downstream tasks such as re-ranking. Recently, large language
models (LLMs) have shown compelling semantic understanding capabilities and are
appealing to researchers studying dense retrieval. LLMs, as decoder-style
generative models, are competent at language generation while falling short on
modeling global information because they cannot attend to subsequent tokens.
Inspired by the classical word-based language modeling approach for IR, i.e.,
the query likelihood (QL) model, we seek to sufficiently utilize LLMs'
generative ability by QL maximization. However, instead of ranking documents
with QL estimation, we introduce an auxiliary task of QL maximization to yield
a better backbone for contrastively learning a discriminative retriever. We
name our model LLM-QL. To condense global document semantics to a single
vector during QL modeling, LLM-QL has two major components, Attention Stop (AS)
and Input Corruption (IC). AS stops the attention of predictive tokens to
previous tokens until the ending token of the document. IC masks a portion of
tokens in the input documents during prediction. Experiments on MSMARCO show
that LLM-QL can achieve significantly better performance than other LLM-based
retrievers, and that using QL estimated by LLM-QL for ranking outperforms word-based
QL by a large margin.
Authors' comments: 12 pages, 3 figures
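
The classical word-based query likelihood (QL) model that the abstract above uses as its inspiration and baseline can be sketched as follows. This is a minimal sketch of unigram QL with Dirichlet smoothing, not LLM-QL itself; the function name, the toy documents, and the value of `mu` are assumptions made for illustration.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_tf, collection_len, mu=2000.0):
    """Return log P(query | doc) under a Dirichlet-smoothed unigram model."""
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # term never occurs in the collection: skip it
        # Smoothed estimate: document counts backed off to collection statistics.
        p_term = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_term)
    return score

# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "dense retrieval maps queries and documents to vectors".split(),
    "d2": "cloud optical thickness is retrieved from satellite radiances".split(),
}
collection_tf = Counter(t for d in docs.values() for t in d)
collection_len = sum(collection_tf.values())

query = "dense retrieval vectors".split()
scores = {
    name: query_log_likelihood(query, d, collection_tf, collection_len, mu=10.0)
    for name, d in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Documents are ranked by how likely they are to generate the query. LLM-QL, as described above, instead uses QL maximization as an auxiliary training task for a dense retriever.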
Hengran Zhang, Minghao Tang, Keping Bi, Jiafeng Guo, Shihao Liu, Daiting Shi, Dawei Yin, Xueqi Cheng
Retrieval models typically rely on costly human-labeled query-document
relevance annotations for training and evaluation. To reduce this cost and
leverage the potential of Large Language Models (LLMs) in relevance judgments,
we aim to explore whether LLM-generated annotations can effectively replace
human annotations in training retrieval models. Retrieval usually emphasizes
relevance, which indicates "topic-relatedness" of a document to a query, while
in RAG, the value of a document (or utility) depends on how it contributes to
answer generation. Recognizing this mismatch, some researchers use LLM
performance on downstream tasks with documents as labels, but this approach
requires manual answers for specific tasks, leading to high costs and limited
generalization. In another line of work, prompting LLMs to select useful
documents as RAG references eliminates the need for human annotation and is not
task-specific. If we leverage LLMs' utility judgments to annotate retrieval
data, we may retain cross-task generalization without human annotation in
large-scale corpora. Therefore, we investigate utility-focused annotation via
LLMs for large-scale retriever training data across both in-domain and
out-of-domain settings on the retrieval and RAG tasks. To reduce the impact of
low-quality positives labeled by LLMs, we design a novel loss function, i.e.,
Disj-InfoNCE. Our experiments reveal that: (1) Retrievers trained on
utility-focused annotations significantly outperform those trained on human
annotations in the out-of-domain setting on both tasks, demonstrating superior
generalization capabilities. (2) LLM annotation does not replace human
annotation in the in-domain setting. However, incorporating just 20%
human-annotated data enables retrievers trained with utility-focused
annotations to match the performance of models trained entirely with human
annotations.
Authors' comments: 12 pages, 4 figures
Xiaolun Jing, Genke Yang, Jian Chu
Motivated by the success of coarse-grained or fine-grained contrast in text-video retrieval, multi-grained contrastive learning methods have emerged that integrate contrasts of different granularity. However, due to the wider semantic range of videos, the text-agnostic video representations might encode misleading information not described in texts, thus impeding the model from capturing precise cross-modal semantic correspondence. To this end, we propose a Text-Conditioned Multi-Grained Contrast framework, dubbed TC-MGC. Specifically, our model employs a language-video attention block to generate aggregated frame and video representations conditioned on the word's and text's attention weights over frames. To filter unnecessary similarity interactions and decrease trainable parameters in the Interactive Similarity Aggregation (ISA) module, we design a Similarity Reorganization (SR) module to identify attentive similarities and reorganize cross-modal similarity vectors and matrices. Next, we argue that the imbalance problem among multi-grained similarities may result in over- and under-representation issues. We thereby introduce an auxiliary Similarity Decorrelation Regularization (SDR) loss to facilitate cooperative relationship utilization by similarity variance minimization on matching text-video pairs. Finally, we present a Linear Softmax Aggregation (LSA) module to explicitly encourage the interactions between multiple similarities and promote the usage of multi-grained information. Empirically, TC-MGC achieves competitive results on multiple text-video retrieval benchmarks, outperforming the X-CLIP model by +2.8% (+1.3%), +2.2% (+1.0%), +1.5% (+0.9%) relative (absolute) improvements in text-to-video retrieval R@1 on MSR-VTT, DiDeMo and VATEX, respectively. Our code is publicly available at https://github.com/JingXiaolun/TC-MGC.
Jiaqi Deng, Kaize Shi, Zonghan Wu, Huan Huo, Dingxian Wang, Guandong Xu
Knowledge-based Vision Question Answering (KB-VQA) systems address complex
visual-grounded questions requiring external knowledge, such as web-sourced
encyclopedia articles. Existing methods often use sequential and separate
frameworks for the retriever and the generator with limited parametric
knowledge sharing. However, since both retrieval and generation tasks require
accurate understanding of contextual and external information, such separation
can potentially lead to suboptimal system performance. Another key challenge is
the integration of multimodal information. General-purpose multimodal
pre-trained models, while adept at multimodal representation learning, struggle
with fine-grained retrieval required for knowledge-intensive visual questions.
Recent specialized pre-trained models mitigate the issue, but are
computationally expensive. To bridge the gap, we propose a Unified
Retrieval-Augmented VQA framework (UniRVQA). UniRVQA adapts general multimodal
pre-trained models for fine-grained knowledge-intensive tasks within a unified
framework, enabling cross-task parametric knowledge sharing and the extension
of existing multimodal representation learning capability. We further introduce
a reflective-answering mechanism that allows the model to explicitly evaluate
and refine its knowledge boundary. Additionally, we integrate late interaction
into the retrieval-augmented generation joint training process to enhance
fine-grained understanding of queries and documents. Our approach achieves
competitive performance against state-of-the-art models, delivering a
significant 4.7% improvement in answering accuracy, and brings an average 7.5%
boost in base MLLMs' VQA performance.
Authors' comments: 10 pages, 5 figures
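
The late interaction mentioned in the abstract above is commonly implemented as a ColBERT-style MaxSim score: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. The sketch below assumes that formulation; the abstract does not say how UniRVQA integrates it, and the function names and toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-dimensional token embeddings (hypothetical values).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8]]  # covers both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]  # covers only the first query token well

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
print(score_a, score_b)
```

Because each query token is scored separately, a document that matches every part of the query (`doc_a`) outranks one that matches only part of it (`doc_b`), which is the fine-grained behavior a single pooled vector can miss.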
Michael J Bommarito, Daniel Martin Katz, Jillian Bommarito
We present NUPunkt and CharBoundary, two sentence boundary detection
libraries optimized for high-precision, high-throughput processing of legal
text in large-scale applications such as due diligence, e-discovery, and legal
research. These libraries address the critical challenges posed by legal
documents containing specialized citations, abbreviations, and complex sentence
structures that confound general-purpose sentence boundary detectors.
Our experimental evaluation on five diverse legal datasets comprising over
25,000 documents and 197,000 annotated sentence boundaries demonstrates that
NUPunkt achieves 91.1% precision while processing 10 million characters per
second with modest memory requirements (432 MB). CharBoundary models offer
balanced and adjustable precision-recall tradeoffs, with the large model
achieving the highest F1 score (0.782) among all tested methods.
Notably, NUPunkt provides a 29-32% precision improvement over general-purpose
tools while maintaining exceptional throughput, processing multi-million
document collections in minutes rather than hours. Both libraries run
efficiently on standard CPU hardware without requiring specialized
accelerators. NUPunkt is implemented in pure Python with zero external
dependencies, while CharBoundary relies only on scikit-learn and optional ONNX
runtime integration for optimized performance. Both libraries are available
under the MIT license, can be installed via PyPI, and can be interactively
tested at https://sentences.aleainstitute.ai/.
These libraries address critical precision issues in retrieval-augmented
generation systems by preserving coherent legal concepts across sentences,
where each percentage improvement in precision yields exponentially greater
reductions in context fragmentation, creating cascading benefits throughout
retrieval pipelines and significantly enhancing downstream reasoning quality.
Authors' comments: 12 pages, 5 figures, 6 tables
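
To illustrate why legal citations and abbreviations confound general-purpose sentence splitters, the sketch below compares a naive punctuation split with a split that consults an abbreviation list. It is not the NUPunkt or CharBoundary API; the abbreviation set, the function name, and the sample text are assumptions chosen for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; real legal text needs far more.
ABBREVIATIONS = {"v.", "u.s.", "no.", "inc.", "corp.", "cir.", "sec.", "art."}

def split_sentences(text):
    """Split on ., ! or ? unless the token ending there is a known abbreviation."""
    sentences, start = [], 0
    # Candidate boundary: end punctuation followed by whitespace and a capital letter.
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # e.g. the "v." in a case name is not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = ("The court in Smith v. Jones denied the motion. "
        "See Sec. 12 of the Act. Costs were awarded.")

naive = re.split(r"(?<=[.!?])\s+", text)  # splits after every period
sentences = split_sentences(text)
print(len(naive), len(sentences))
```

The naive split breaks the case name and the section reference apart, while the abbreviation-aware split keeps each citation inside its sentence.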
Kepu Zhang, Zhongxiang Sun, Weijie Yu, Xiaoxue Zang, Kai Zheng, Yang Song, Han Li, Jun Xu
Retriever-augmented generation (RAG) has become a widely adopted approach for enhancing the factual accuracy of large language models (LLMs). While current benchmarks evaluate the performance of RAG methods from various perspectives, they share a common assumption that user queries used for retrieval are error-free. However, in real-world interactions between users and LLMs, query entry errors such as keyboard proximity errors, visual similarity errors, and spelling errors are frequent. The impact of these errors on current RAG methods remains largely unexplored. To bridge this gap, we propose QE-RAG, the first robust RAG benchmark designed specifically to evaluate performance against query entry errors. We augment six widely used datasets by injecting three common types of query entry errors into randomly selected user queries at rates of 20% and 40%, simulating typical user behavior in real-world scenarios. We analyze the impact of these errors on LLM outputs and find that corrupted queries degrade model performance, which can be mitigated through query correction and training a robust retriever for retrieving relevant documents. Based on these insights, we propose a contrastive learning-based robust retriever training method and a retrieval-augmented query correction method. Extensive in-domain and cross-domain experiments reveal that: (1) state-of-the-art RAG methods, including sequential, branching, and iterative methods, exhibit poor robustness to query entry errors; (2) our method significantly enhances the robustness of RAG when handling query entry errors and is compatible with existing RAG methods, further improving their robustness.
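
The error-injection step described in the abstract above can be sketched for one of the three error types, keyboard proximity errors. This is a minimal sketch under stated assumptions: the adjacency map is a hypothetical partial QWERTY layout, one character is replaced per selected query, and the sampling procedure is illustrative rather than the benchmark's actual construction.

```python
import random

# Hypothetical partial QWERTY adjacency map (lowercase letters only).
NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "r": "edft", "s": "awedxz", "t": "rfgy",
}

def keyboard_typo(query, rng):
    """Replace one character with a key adjacent to it on the keyboard."""
    positions = [i for i, ch in enumerate(query) if ch in NEIGHBORS]
    if not positions:
        return query  # nothing in the query can be corrupted with this map
    i = rng.choice(positions)
    return query[:i] + rng.choice(NEIGHBORS[query[i]]) + query[i + 1:]

def corrupt_queries(queries, rate, seed=0):
    """Corrupt a randomly selected fraction `rate` of the queries."""
    rng = random.Random(seed)
    n_corrupt = round(rate * len(queries))
    chosen = set(rng.sample(range(len(queries)), n_corrupt))
    return [keyboard_typo(q, rng) if i in chosen else q
            for i, q in enumerate(queries)]

queries = [
    "who wrote hamlet",
    "capital of france",
    "boiling point of water",
    "tallest mountain on earth",
    "speed of light",
]
noisy = corrupt_queries(queries, rate=0.4)
changed = sum(a != b for a, b in zip(queries, noisy))
print(changed, "of", len(queries), "queries corrupted")
```

Fixing the seed keeps the corrupted set reproducible, so retrievers can be compared on the same noisy queries at each error rate.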
Zahid Hassan Tushar, Adeleke Ademakinwa, Jianwu Wang, Zhibo Zhang, Sanjay Purushotham
Accurate cloud property retrieval is vital for understanding cloud behavior
and its impact on climate, including applications in weather forecasting,
climate modeling, and estimating Earth's radiation balance. The Independent
Pixel Approximation (IPA), a widely used physics-based approach, simplifies
radiative transfer calculations by assuming each pixel is independent of its
neighbors. While computationally efficient, IPA has significant limitations,
such as inaccuracies from 3D radiative effects, errors at cloud edges, and
ineffectiveness for overlapping or heterogeneous cloud fields. Recent
AI/ML-based deep learning models have improved retrieval accuracy by leveraging
spatial relationships across pixels. However, these models are often
memory-intensive, retrieve only a single cloud property, or struggle with joint
property retrievals. To overcome these challenges, we introduce CloudUNet with
Attention Module (CAM), a compact UNet-based model that employs attention
mechanisms to reduce errors in thick, overlapping cloud regions and a
specialized loss function for joint retrieval of Cloud Optical Thickness (COT)
and Cloud Effective Radius (CER). Experiments on a Large Eddy Simulation (LES)
dataset show that our CAM model outperforms state-of-the-art deep learning
methods, reducing mean absolute errors (MAE) by 34% for COT and 42% for CER,
and achieving 76% and 86% lower MAE for COT and CER retrievals compared to the
IPA method.
Authors' comments: 6 pages, 4 figures, to be published in 2025 IEEE International
Geoscience and Remote Sensing Symposium (IGARSS 2025)
Shabnam Choudhury, Yash Salunkhe, Sarthak Mehrotra, Biplab Banerjee
The rapid expansion of remote sensing image archives demands the development
of strong and efficient techniques for content-based image retrieval (RS-CBIR).
This paper presents REJEPA (Retrieval with Joint-Embedding Predictive
Architecture), an innovative self-supervised framework designed for unimodal
RS-CBIR. REJEPA utilises spatially distributed context token encoding to
forecast abstract representations of target tokens, effectively capturing
high-level semantic features and eliminating unnecessary pixel-level details.
In contrast to generative methods that focus on pixel reconstruction or
contrastive techniques that depend on negative pairs, REJEPA functions within
feature space, achieving a reduction in computational complexity of 40-60% when
compared to pixel-reconstruction baselines like Masked Autoencoders (MAE). To
guarantee strong and varied representations, REJEPA incorporates
Variance-Invariance-Covariance Regularisation (VICReg), which prevents encoder
collapse by promoting feature diversity and reducing redundancy. The method
demonstrates an estimated enhancement in retrieval accuracy of 5.1% on BEN-14K
(S1), 7.4% on BEN-14K (S2), 6.0% on FMoW-RGB, and 10.1% on FMoW-Sentinel
compared to prominent SSL techniques, including CSMAE-SESD, Mask-VLM, SatMAE,
ScaleMAE, and SatMAE++, on extensive RS benchmarks BEN-14K (multispectral and
SAR data), FMoW-RGB and FMoW-Sentinel. Through effective generalisation across
sensor modalities, REJEPA establishes itself as a sensor-agnostic benchmark for
efficient, scalable, and precise RS-CBIR, addressing challenges like varying
resolutions, high object density, and complex backgrounds with computational
efficiency.
Authors' comments: 14 pages
Yuwei An, Yihua Cheng, Seo Jin Park, Junchen Jiang
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. While rerankers refine the selection of retrieved documents in RAG pipelines, they introduce computational challenges that hinder high throughput and low latency. To address this problem, we propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines by leveraging KV-cache reuse for efficient reranker inference. By reusing document-side KV-cache, HyperRAG achieves both high-quality generation and system-level efficiency. To fully realize the benefits of KV-cache reuse, HyperRAG incorporates a range of system-level optimizations designed to enhance efficiency and scalability. Experiments show that HyperRAG achieves a 2-3x throughput improvement with decoder-only rerankers while also delivering higher downstream performance compared with traditional RAG service.
Boseung Jeong, Jicheol Park, Sungyeon Kim, Suha Kwak
Video-text retrieval, the task of retrieving videos based on a textual query
or vice versa, is of paramount importance for video understanding and
multimodal information retrieval. Recent methods in this area rely primarily on
visual and textual features and often ignore audio, although it helps enhance
overall comprehension of video content. Moreover, traditional models that
incorporate audio blindly utilize the audio input regardless of whether it is
useful or not, resulting in suboptimal video representation. To address these
limitations, we propose a novel video-text retrieval framework, Audio-guided
VIdeo representation learning with GATEd attention (AVIGATE), that effectively
leverages audio cues through a gated attention mechanism that selectively
filters out uninformative audio signals. In addition, we propose an adaptive
margin-based contrastive loss to deal with the inherently unclear
positive-negative relationship between video and text, which facilitates
learning better video-text alignment. Our extensive experiments demonstrate
that AVIGATE achieves state-of-the-art performance on all the public
benchmarks.
Authors' comments: Accepted to CVPR 2025
Lena Schmidt, Oshin Sharma, Chris Marshall, Sonia Garcia Gonzalez Moral
Introduction: Horizon scanning in healthcare assesses early signals of innovation, crucial for timely adoption. Current horizon scanning faces challenges in efficient information retrieval and analysis, especially from unstructured sources like news, presenting a need for innovative tools. Methodology: The study introduces SCANAR and AIDOC, open-source Python-based tools designed to improve horizon scanning. SCANAR automates the retrieval and processing of news articles, offering functionalities such as de-duplication and unsupervised relevancy ranking. AIDOC aids filtration by leveraging AI to reorder textual data based on relevancy, employing neural networks for semantic similarity, and subsequently prioritizing likely relevant entries for human review. Results: Twelve internal datasets from horizon scans and four external benchmarking datasets were used. SCANAR improved retrieval efficiency by automating processes previously dependent on manual labour. AIDOC displayed work-saving potential, achieving around 62% reduction in manual review efforts at 95% recall. Comparative analysis with benchmarking data showed AIDOC's performance was similar to existing systematic review automation tools, though performance varied depending on dataset characteristics. A smaller case-study on our news datasets shows the potential of ensembling large language models within the active-learning process for faster detection of relevant articles across news datasets. Conclusion: The validation indicates that SCANAR and AIDOC show potential to enhance horizon scanning efficiency by streamlining data retrieval and prioritisation. These tools may alleviate methodological limitations and allow broader, swifter horizon scans. Further studies are suggested to optimize these models and to design new workflows and validation processes that integrate large language models.
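
The "reduction in manual review efforts at 95% recall" reported above can be read as the share of a relevance-ranked list that a reviewer never has to open once 95% of the relevant items have been found. The sketch below computes that quantity under this assumption; the abstract does not give the exact formula used for AIDOC, and the function name and toy labels are hypothetical.

```python
import math

def review_effort_saved(ranked_labels, target_recall=0.95):
    """Fraction of a ranked list left unread once the target recall is reached.

    ranked_labels: 1 for relevant, 0 for irrelevant, in the order the tool ranks them.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("ranked_labels contains no relevant items")
    needed = math.ceil(target_recall * total_relevant)
    found = 0
    for n_read, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return 1.0 - n_read / len(ranked_labels)
    raise AssertionError("unreachable: needed never exceeds total_relevant")

# Toy ranking of 10 articles, 4 of them relevant (hypothetical data).
ranked_labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
saved = review_effort_saved(ranked_labels)
print(f"{saved:.0%} of the list need not be reviewed")
```

A related metric from systematic review automation, work saved over sampling (WSS@95), additionally subtracts the 5% of relevant items that are allowed to be missed, so the two numbers are not interchangeable.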
Runlong Zhou, Yi Zhang
Language models often struggle with cross-mode knowledge retrieval -- the ability to access knowledge learned in one format (mode) when queried in another. We demonstrate that models trained on multiple data sources (e.g., Wikipedia and TinyStories) exhibit significantly reduced accuracy when retrieving knowledge in a format different from its original training mode. This paper quantitatively investigates this phenomenon through a controlled study of random token sequence memorization across different modes. We first explore dataset rewriting as a solution, revealing that effective cross-mode retrieval requires prohibitively extensive rewriting efforts that follow a sigmoid-like relationship. As an alternative, we propose CASCADE, a novel pretraining algorithm that uses cascading datasets with varying sequence lengths to capture knowledge at different scales. Our experiments demonstrate that CASCADE outperforms dataset rewriting approaches, even when compressed into a single model with a unified loss function. This work provides both qualitative evidence of cross-mode retrieval limitations and a practical solution to enhance language models' ability to access knowledge independently of its presentational format.
Tongtong Liu, Zhaohui Wang, Meiyue Qin, Zenghui Lu, Xudong Chen, Yuekui Yang, Peng Shu
The integration of Large Language Models (LLMs) with retrieval systems has
shown promising potential in retrieving documents (docs) or advertisements
(ads) for a given query. Existing LLM-based retrieval methods generate numeric
or content-based DocIDs to retrieve docs/ads. However, the one-to-few mapping
between numeric IDs and docs, along with the time-consuming content extraction,
leads to semantic inefficiency and limits scalability in large-scale corpora.
In this paper, we propose the Real-time Ad REtrieval (RARE) framework, which
leverages LLM-generated text called Commercial Intentions (CIs) as an
intermediate semantic representation to directly retrieve ads for queries in
real-time. These CIs are generated by a customized LLM injected with commercial
knowledge, enhancing its domain relevance. Each CI corresponds to multiple ads,
yielding a lightweight and scalable set of CIs. RARE has been implemented in a
real-world online system, handling daily search volumes in the hundreds of
millions. The online implementation has yielded significant benefits: a 5.04%
increase in consumption, a 6.37% rise in Gross Merchandise Volume (GMV), a
1.28% enhancement in click-through rate (CTR) and a 5.29% increase in shallow
conversions. Extensive offline experiments show RARE's superiority over ten
competitive baselines in four major categories.
Authors' comments: 13 pages, 5 figures
Ezzeldin Shereen, Dan Ristea, Burak Hasircioglu, Shae McFadden, Vasilios Mavroudis, Chris Hicks
Multimodal retrieval augmented generation (M-RAG) has recently emerged as a
method to inhibit hallucinations of large multimodal models (LMMs) through a
factual knowledge base (KB). However, M-RAG also introduces new attack vectors
for adversaries that aim to disrupt the system by injecting malicious entries
into the KB. In this work, we present a poisoning attack against M-RAG
targeting visual document retrieval applications, where the KB contains images
of document pages. Our objective is to craft a single image that is retrieved
for a variety of different user queries, and consistently influences the output
produced by the generative model, thus creating a universal denial-of-service
(DoS) attack against the M-RAG system. We demonstrate that while our attack is
effective against a diverse range of widely-used, state-of-the-art retrievers
(embedding models) and generators (LMMs), it can also be ineffective against
robust embedding models. Our attack not only highlights the vulnerability of
M-RAG pipelines to poisoning attacks, but also sheds light on a fundamental
weakness that potentially hinders their performance even in benign settings.
Authors' comments: 8 pages, 6 figures
Zixin Chen, Jianghui Ji, Guo Chen, Fei Yan, Xianyu Tan
Transmission spectroscopy has provided unprecedented insight into the makeup
of exoplanet atmospheres. A transmission spectrum contains contributions from a
planet's morning and evening limbs, which can differ in temperature,
composition and aerosol properties due to atmospheric circulation. While
high-resolution ground-based observations have identified limb asymmetry in
several ultra-hot/hot exoplanets, space-based studies of limb asymmetry are
still in their early stages. The prevalence of limb asymmetry across a broad
range of exoplanets remains largely unexplored. We conduct a comparative
analysis of retrievals on transmission spectra, including traditional 1D
approaches and four 2D models that account for limb asymmetry. Two of these 2D
models include our newly proposed dynamical constraints derived from
shallow-water simulations to provide physically-motivated temperature
differences between limbs. Our analysis of WASP-39 b using JWST observations
and previous combined datasets (HST, VLT, and Spitzer) strongly favors 2D
retrievals over traditional 1D approaches, confirming significant limb
asymmetry in this hot Jupiter. Within our 2D framework, unconstrained models
recover larger temperature contrasts than dynamically-constrained models, with
improved fits to specific spectral features, although Bayesian evidence cannot
definitively distinguish between these 2D approaches. Our results support the
presence of homogeneous C/O in both the morning and evening atmospheres, but
with temperature differences leading to variations in clouds and hazes. Using
this treatment, we can study a larger sample of hot Jupiters to gain insights
into atmospheric limb asymmetries on these planets.
Authors' comments: 16 pages, 6 figures, accepted for publication in AJ
Chengshuai Zhao, Riccardo De Maria, Tharindu Kumarage, Kumar Satvik Chaudhary, Garima Agrawal, Yiwen Li, Jongchan Park, Yuli Deng et al.
Advancements in large language models (LLMs) have enabled the development of intelligent educational tools that support inquiry-based learning across technical domains. In cybersecurity education, where accuracy and safety are paramount, systems must go beyond surface-level relevance to provide information that is both trustworthy and domain-appropriate. To address this challenge, we introduce CyberBOT, a question-answering chatbot that leverages a retrieval-augmented generation (RAG) pipeline to incorporate contextual information from course-specific materials and validate responses using a domain-specific cybersecurity ontology. The ontology serves as a structured reasoning layer that constrains and verifies LLM-generated answers, reducing the risk of misleading or unsafe guidance. CyberBOT has been deployed in a large graduate-level course at Arizona State University (ASU), where more than one hundred students actively engage with the system through a dedicated web-based platform. Computational evaluations in lab environments highlight the potential capacity of CyberBOT, and a forthcoming field study will evaluate its pedagogical impact. By integrating structured domain reasoning with modern generative capabilities, CyberBOT illustrates a promising direction for developing reliable and curriculum-aligned AI applications in specialized educational contexts.
Jirui Qi, Raquel Fernández, Arianna Bisazza
Retrieval-augmented generation (RAG) with large language models (LLMs) has
demonstrated strong performance in multilingual question-answering (QA) tasks
by leveraging relevant passages retrieved from corpora. In multilingual RAG
(mRAG), the retrieved passages can be written in languages other than that of
the query entered by the user, making it challenging for LLMs to effectively
utilize the provided information. Recent research suggests that retrieving
passages from multilingual corpora can improve RAG performance, particularly
for low-resource languages. However, the extent to which LLMs can leverage
different kinds of multilingual contexts to generate accurate answers,
*independently from retrieval quality*, remains understudied. In this paper, we
conduct an extensive assessment of LLMs' ability to (i) make consistent use of
a relevant passage regardless of its language, (ii) respond in the expected
language, and (iii) focus on the relevant passage even when multiple
'distracting' passages in different languages are provided in the context. Our
experiments with four LLMs across three QA datasets covering a total of 48
languages reveal a surprising ability of LLMs to extract the relevant
information from out-language passages, but a much weaker ability to formulate
a full answer in the correct language. Our analysis, based on both accuracy and
feature attribution techniques, further shows that distracting passages
negatively impact answer quality regardless of their language. However,
distractors in the query language exert a slightly stronger influence. Taken
together, our findings deepen the understanding of how LLMs utilize context in
mRAG systems, providing directions for future improvements.
Authors' comments: Under review at COLM2025. All codes and data are released at
https://github.com/Betswish/mRAG-Context-Consistency
Sakhinana Sagar Srinivas, Akash Das, Shivam Gupta, Venkataramana Runkana
We present a comprehensive framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. This approach significantly improves large language models on knowledge-intensive tasks, including open-domain question answering and complex reasoning. Our framework integrates two complementary techniques: Policy-Optimized Retrieval-Augmented Generation (PORAG), which optimizes the use of retrieved information, and Adaptive Token-Layer Attention Scoring (ATLAS), which dynamically determines retrieval timing and content based on contextual needs. Together, these techniques enhance both the utilization and relevance of retrieved content, improving factual accuracy and response quality. Designed as a lightweight solution compatible with any Transformer-based LLM without requiring additional training, our framework excels in knowledge-intensive tasks, boosting output accuracy in RAG settings. We further propose CRITIC, a novel method to selectively compress key-value caches by token importance, mitigating memory bottlenecks in long-context applications. The framework also incorporates test-time scaling techniques to dynamically balance reasoning depth and computational resources, alongside optimized decoding strategies for faster inference. Experiments on benchmark datasets show that our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems. This integrated approach advances the development of robust, efficient, and scalable RAG systems across diverse applications.
Xiaofan Zhou, Liangjie Huang, Pinyang Cheng, Wenpen Yin, Rui Zhang, Wenrui Hao, Lu Cheng
The causal relationships between biomarkers are essential for disease
diagnosis and medical treatment planning. One notable application is
Alzheimer's disease (AD) diagnosis, where certain biomarkers may influence the
presence of others, enabling early detection, precise disease staging, targeted
treatments, and improved monitoring of disease progression. However,
understanding these causal relationships is complex and requires extensive
research. Constructing a comprehensive causal network of biomarkers demands
significant effort from human experts, who must analyze a vast number of
research papers and may be biased in their understanding of disease biomarkers
and their relations. This raises an important question: Can advanced large language models
(LLMs), such as those utilizing retrieval-augmented generation (RAG), assist in
building causal networks of biomarkers for further medical analysis? To explore
this, we collected 200 AD-related research papers published over the past 25
years and then integrated scientific literature with RAG to extract AD
biomarkers and generate causal relations among them. Given the high-risk nature
of the medical diagnosis, we applied uncertainty estimation to assess the
reliability of the generated causal edges and examined the faithfulness and
scientificness of LLM reasoning using both automatic and human evaluation. We
find that RAG enhances the ability of LLMs to generate more accurate causal
networks from scientific papers. However, the overall performance of LLMs in
identifying causal relations of AD biomarkers is still limited. We hope this
study will inspire further foundational research on AI-driven causal network
discovery for AD biomarkers.
Authors' comments: 9 pages, under review
Yiqun Duan, Sameera Ramasinghe, Stephen Gould, Ajanthan Thalaiyasingam
Composed Image Retrieval (CIR) is the task of retrieving images matching a reference image augmented with a text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have relied on triplet data containing a reference image, reformulation text, and a target image. However, curating such triplet data often necessitates human intervention, leading to prohibitive costs. This challenge has hindered the scalability of CIR model training even with the availability of abundant unlabeled data. With the recent advances in foundational models, we advocate a shift in the CIR training paradigm where human annotations can be efficiently replaced by large language models (LLMs). Specifically, we demonstrate the capability of large captioning and language models in efficiently generating data for CIR only relying on unannotated image collections. Additionally, we introduce an embedding reformulation architecture that effectively combines image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on CIRR and FashionIQ datasets. Furthermore, we demonstrate that by increasing the amount of generated data, our zero-shot model gets closer to the performance of supervised baselines.