Teerapol Saengsukhiran, Peerawat Chomphooyod, Narabodee Rodjananant, Chompakorn Chaksangchaichot, Patawee Prakrankamanant, Witthawin Sripheanpol, Pak Lovichit, Sarana Nutanong et al.
Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.
Zirui Cheng, Jikai Sun, Anjun Gao, Yueyang Quan, Zhuqing Liu, Xiaohua Hu, Minghong Fang
Large language models (LLMs) have transformed natural language processing
(NLP), enabling applications from content generation to decision support.
Retrieval-Augmented Generation (RAG) improves LLMs by incorporating external
knowledge but also introduces security risks, particularly from data poisoning,
where the attacker injects poisoned texts into the knowledge database to
manipulate system outputs. While various defenses have been proposed, they
often struggle against advanced attacks. To address this, we introduce RAGuard,
a detection framework designed to identify poisoned texts. RAGuard first
expands the retrieval scope to increase the proportion of clean texts, reducing
the likelihood of retrieving poisoned content. It then applies chunk-wise
perplexity filtering to detect abnormal variations and text similarity
filtering to flag highly similar texts. This non-parametric approach enhances
RAG security, and experiments on large-scale datasets demonstrate its
effectiveness in detecting and mitigating poisoning attacks, including strong
adaptive attacks.
Authors' comments: To appear in IEEE BigData 2025
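The text-similarity filtering step described above can be illustrated with a minimal sketch. The abstract does not say which similarity measure or threshold RAGuard uses, so the word-level Jaccard similarity, the `0.8` threshold, and the function names below are assumptions chosen only for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def flag_similar(texts: list[str], threshold: float = 0.8) -> list[int]:
    """Return indices of retrieved texts that are near-duplicates of another one.

    Hypothetical stand-in for a similarity filter: poisoned passages crafted
    for the same target question often resemble each other closely.
    """
    flagged = set()
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if jaccard(texts[i], texts[j]) >= threshold:
                flagged.update((i, j))
    return sorted(flagged)


retrieved = [
    "the capital of france is paris",
    "the capital of france is paris indeed",
    "bananas are yellow",
]
print(flag_similar(retrieved))  # indices of the two near-duplicate passages
```

A real filter would more likely compare embedding vectors than word sets; the pairwise loop is the part that carries over.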
Yang Tian, Fan Liu, Jingyuan Zhang, Wei Bi, Yupeng Hu, Liqiang Nie
Large Multimodal Models (LMMs) have achieved remarkable progress in
generating photorealistic and prompt-aligned images, but they often produce
outputs that contradict verifiable knowledge, especially when prompts involve
fine-grained attributes or time-sensitive events. Conventional
retrieval-augmented approaches attempt to address this issue by introducing
external information, yet they are fundamentally incapable of grounding
generation in accurate and evolving knowledge due to their reliance on static
sources and shallow evidence integration. To bridge this gap, we introduce
ORIG, an agentic open multimodal retrieval-augmented framework for Factual
Image Generation (FIG), a new task that requires both visual realism and
factual grounding. ORIG iteratively retrieves and filters multimodal evidence
from the web and incrementally integrates the refined knowledge into enriched
prompts to guide generation. To support systematic evaluation, we build
FIG-Eval, a benchmark spanning ten categories across perceptual, compositional,
and temporal dimensions. Experiments demonstrate that ORIG substantially
improves factual consistency and overall image quality over strong baselines,
highlighting the potential of open multimodal retrieval for factual image
generation.
Authors' comments: Preprint
Yang Zhong, Zhiming Wang, Zhaoyang Li, Jinyu Ma, Xiang Li
This paper introduces the 3rd place solution to the ICCV LargeFineFoodAI Retrieval Competition on Kaggle. Four basic models are independently trained with the weighted sum of ArcFace and Circle loss, then TTA and Ensemble are successively applied to improve feature representation ability. In addition, a new reranking method for retrieval is proposed based on diffusion and k-reciprocal reranking. Finally, our method scored 0.81219 and 0.81191 mAP@100 on the public and private leaderboard, respectively.
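As a rough illustration of the k-reciprocal idea that the reranking step builds on, the sketch below computes k-reciprocal neighbor sets from a precomputed distance matrix. It does not reproduce the authors' combined diffusion and k-reciprocal method; the function names and the toy data are assumptions.

```python
def knn(dist, i, k):
    """Indices of the k nearest items to item i, excluding i itself."""
    others = (j for j in range(len(dist)) if j != i)
    return sorted(others, key=lambda j: dist[i][j])[:k]


def k_reciprocal(dist, i, k):
    """Items j such that j is among i's k nearest neighbors and vice versa."""
    return [j for j in knn(dist, i, k) if i in knn(dist, j, k)]


# Toy 1-D "features"; the distance is the absolute difference.
points = [0, 1, 3, 10]
dist = [[abs(a - b) for b in points] for a in points]

print(k_reciprocal(dist, 0, 1))  # item 1: 0 and 1 are mutual nearest neighbors
print(k_reciprocal(dist, 3, 1))  # empty: item 3's nearest neighbor does not reciprocate
```

Reranking methods of this family typically trust reciprocal neighbors more than one-directional ones when refining the initial ranking.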
Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang, Suhang Wang, Zhe Feng
Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM's context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TDs. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TDs generated using an LLM, i.e., descriptions of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TDs. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
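The core idea of retrieving with a hypothetical tool description instead of the raw query can be sketched as follows. This is not the ToolDreamer implementation: the bag-of-words cosine retriever, the two toy tools, and `fake_dream` (a hard-coded stand-in for the LLM that would generate the hypothetical TD) are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, tools, dream, top_k=1):
    """Rank tools by similarity between a hypothetical TD and each real TD."""
    hypothetical_td = dream(query)  # in practice, an LLM call
    ranked = sorted(
        tools, key=lambda name: cosine(hypothetical_td, tools[name]), reverse=True
    )
    return ranked[:top_k]


# Hypothetical tool set: name -> tool description (TD).
tools = {
    "weather_api": "returns the current weather forecast for a city",
    "currency_api": "converts an amount between two currencies",
}


def fake_dream(query: str) -> str:
    """Stand-in for the LLM; returns a fixed synthetic TD for the demo query."""
    return "returns the weather forecast for a given city"


query = "do I need an umbrella in Oslo tomorrow?"
print(retrieve(query, tools, fake_dream))
```

In this toy setup the raw query shares no words with the weather tool's description, which is the query-TD mismatch the abstract describes; the hypothetical TD is written in the same register as the real TDs and matches directly.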
Yang Hu, William Kuszmaul, Jingxun Liang, Huacheng Yu, Junkai Zhang, Renfei Zhou
In the static retrieval problem, a data structure must answer retrieval
queries mapping a set of $n$ keys in a universe $[U]$ to $v$-bit values.
Information-theoretically, retrieval data structures can use as little as $nv$
bits of space. For small value sizes $v$, it is possible to achieve $O(1)$
query time while using space $nv + o(n)$ bits -- whether or not such a result
is possible for larger values of $v$ (e.g., $v = \Theta(\log n)$) has remained
open.
In this paper, we obtain a tight lower bound (as well as matching upper
bounds) for the static retrieval problem. In the case where values are large,
we show that there is actually a significant tension between time and space. It
is not possible, for example, to get $O(1)$ query time using $nv + o(n)$ bits
of space, when $v = \Theta(\log n)$ (and assuming the word RAM model with
$O(\log n)$-bit words).
At first glance, our lower bound would seem to render retrieval unusable in
many settings that aim to achieve very low redundancy. However, our second
result offers a way around this: We show that, whenever a retrieval data
structure $D_1$ is stored along with another data structure $D_2$ (whose size
is similar to or larger than the size of $D_1$), it is possible to implement
the combined data structure $D_1 \cup D_2$ so that queries to $D_1$ take $O(1)$
time, operations on $D_2$ take the same asymptotic time as if $D_2$ were stored
on its own, and the total space is $nv + \mathrm{Space}(D_2) + n^{0.67}$ bits.
Authors' comments: 28 pages, in FOCS 2025
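The information-theoretic bound quoted above follows from a short counting argument, sketched here with the abstract's symbols; $S$, $f$, and $R$ are notation introduced only for this note.

```latex
% Fix a key set S of n keys. Every assignment f : S -> {0,1}^v must be
% representable, and there are (2^v)^n = 2^{nv} such assignments, so
\[
  \text{space} \;\ge\; \log_2\!\left( (2^{v})^{n} \right) \;=\; nv \ \text{bits}.
\]
% Writing the space as nv + R, the redundancy R is what the results constrain:
% R = o(n) with O(1) query time is impossible when v = \Theta(\log n),
% whereas the combined structure pays only
\[
  R \;=\; n^{0.67} \;=\; o(n)
\]
% beyond nv + \mathrm{Space}(D_2), since 0.67 < 1.
```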
Jiahao Shi, Tianyi Zhang
Despite recent advances, Large Language Models (LLMs) still generate vulnerable code. Retrieval-Augmented Generation (RAG) has the potential to enhance LLMs for secure code generation by incorporating external security knowledge. However, the conventional RAG design struggles with the noise of raw security-related documents, and existing retrieval methods overlook the significant security semantics implicitly embedded in task descriptions. To address these issues, we propose RESCUE, a new RAG framework for secure code generation with two key innovations. First, we propose a hybrid knowledge base construction method that combines LLM-assisted cluster-then-summarize distillation with program slicing, producing both high-level security guidelines and concise, security-focused code examples. Second, we design a hierarchical multi-faceted retrieval that traverses the constructed knowledge base from top to bottom and integrates multiple security-critical facts at each hierarchical level, ensuring comprehensive and accurate retrieval. We evaluated RESCUE on four benchmarks and compared it with five state-of-the-art secure code generation methods on six LLMs. The results demonstrate that RESCUE improves the SecurePass@1 metric by an average of 4.8 points, establishing a new state-of-the-art performance for security. Furthermore, we performed in-depth analysis and ablation studies to rigorously validate the effectiveness of individual components in RESCUE.
Iman Deznabi, Peeyush Kumar, Madalina Fiterau
Zero-shot forecasting aims to predict outcomes for previously unseen conditions without direct historical data, posing a significant challenge for traditional forecasting methods. We introduce a Resolution-Aware Retrieval-Augmented Forecasting model that enhances predictive accuracy by leveraging spatial correlations and temporal frequency characteristics. By decomposing signals into different frequency components, our model employs resolution-aware retrieval, where lower-frequency components rely on broader spatial context, while higher-frequency components focus on local influences. This allows the model to dynamically retrieve relevant data and adapt to new locations with minimal historical context. Applied to microclimate forecasting, our model significantly outperforms traditional forecasting methods, numerical weather prediction models, and modern foundation time series models, achieving 71% lower MSE than HRRR and 34% lower MSE than Chronos on the ERA5 dataset. Our results highlight the effectiveness of retrieval-augmented and resolution-aware strategies, offering a scalable and data-efficient solution for zero-shot forecasting in microclimate modeling and beyond.
Huyen N. Nguyen, Nils Gehlenborg
Effective visualization retrieval necessitates a clear definition of
similarity. Despite the growing body of work in specialized visualization
retrieval systems, a systematic approach to understanding visualization
similarity remains absent. We introduce the Similarity Framework for
Visualization Retrieval (Safire), a conceptual model that frames visualization
similarity along two dimensions: comparison criteria and representation
modalities. Comparison criteria identify the aspects that make visualizations
similar, which we divide into primary facets (data, visual encoding,
interaction, style, metadata) and derived properties (data-centric and
human-centric measures). Safire connects what to compare with how comparisons
are executed through representation modalities. We categorize existing
representation approaches into four groups based on their levels of information
content and visualization determinism: raster image, vector image,
specification, and natural language description, together guiding what is
computable and comparable. We analyze several visualization retrieval systems
using Safire to demonstrate its practical value in clarifying similarity
considerations. Our findings reveal how particular criteria and modalities
align across different use cases. Notably, the choice of representation
modality is not only an implementation detail but also an important decision
that shapes retrieval capabilities and limitations. Based on our analysis, we
provide recommendations and discuss broader implications for multimodal
learning, AI applications, and visualization reproducibility.
Authors' comments: To appear in IEEE VIS 2025
Zhiyuan Hu, Fakhriyya Mammadova, Julián Tachella, Michael Unser, Jonathan Dong
Phase retrieval is a nonlinear inverse problem that arises in a wide range of imaging modalities, from electron microscopy to optical Fourier ptychography. Among various modalities, random phase retrieval stands out thanks to its strong theoretical guarantees and efficient reconstruction algorithms, although its applicability is hindered by prohibitive computational costs. In this paper, we propose structured random models for phase retrieval, where we emulate a dense random matrix by a cascade of structured transforms and random diagonal matrices. We demonstrate that structured random models can achieve the same reconstruction performance as dense random models, with complexity reduced from quadratic to log-linear. Using a spectral method initialization followed by gradient descent, robust reconstruction is obtained at an oversampling ratio as low as 2.8. Moreover, we observe that the reconstruction performance is solely determined by the singular value distribution of the forward matrix. This class of models can directly be implemented with basic optical elements such as lenses and diffusers, paving the way for large-scale phase imaging with robust reconstruction guarantees.
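A cascade of structured transforms and random diagonal matrices can be sketched in a few lines. The abstract does not specify the transform or the diagonal distribution, so the FFT, the uniform random phases, the two-layer depth, and all function names below are assumptions; oversampling and the reconstruction algorithm are omitted.

```python
import cmath
import random


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two. Cost O(n log n)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def random_phases(n, rng):
    """Diagonal of a random unit-modulus diagonal matrix."""
    return [cmath.exp(2j * cmath.pi * rng.random()) for _ in range(n)]


def structured_forward(x, diagonals):
    """Apply A = (F D_L) ... (F D_1) to x, with F the unitary (normalized) FFT."""
    n = len(x)
    z = list(x)
    for d in diagonals:
        z = [di * zi for di, zi in zip(d, z)]   # random diagonal: O(n)
        z = [v / n ** 0.5 for v in fft(z)]      # structured transform: O(n log n)
    return z


def measure(x, diagonals):
    """Phaseless intensity measurements y = |A x|^2."""
    return [abs(v) ** 2 for v in structured_forward(x, diagonals)]


rng = random.Random(0)
n = 8
diagonals = [random_phases(n, rng) for _ in range(2)]
x = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
y = measure(x, diagonals)
```

Each layer costs O(n log n), against O(n^2) for multiplying by a dense random matrix, which is the quadratic-to-log-linear reduction the abstract refers to. The measurements are unchanged when `x` is multiplied by a global phase, the ambiguity inherent to phase retrieval.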
Malte Fliedner, Julian Golak, Yağmur Gül, Simone Neumann
Growing demand for sustainable logistics and higher space utilization, driven by e-commerce and urbanization, increases the need for storage systems that are both energy- and space-efficient. Compact storage systems aim to maximize space utilization in limited storage areas and are therefore particularly suited in densely-populated urban areas where space is scarce. In this paper, we examine a recently introduced compact storage system in which uniformly shaped bins are stacked directly on top of each other, eliminating the need for aisles used to handle materials. Target bins are retrieved in a fully automated process by first lifting all other bins that block access and then accessing the target bin from the side of the system by a dedicated robot. Consequently, retrieving a bin can require substantial lifting effort, and thus energy. However, this energy can be reduced through smart retrieval strategies. From an operational perspective, we investigate how retrievals can be optimized with respect to energy consumption. We model the retrieval problem within a mathematical framework. We show that the problem is strongly NP-complete and derive structural insights. Building on these insights, we propose two exact methods: a mixed-integer programming (MIP) formulation and a dynamic programming algorithm, along with a simple, practitioner-oriented greedy algorithm that yields near-instant solutions. Numerical experiments reveal that dynamic programming consistently outperforms state-of-the-art MIP solvers in small to medium sized instances, while the greedy algorithm delivers satisfactory performance, especially when exact methods become computationally impractical.
Heydar Soudani, Hamed Zamani, Faegheh Hasibi
Retrieval-augmented reasoning (RAR) is a recent evolution of retrieval-augmented generation (RAG) that employs multiple reasoning steps for retrieval and generation. While effective for some complex queries, RAR remains vulnerable to errors and misleading outputs. Uncertainty quantification (UQ) offers methods to estimate the confidence of systems' outputs. These methods, however, often handle simple queries with no retrieval or single-step retrieval and do not properly handle the RAR setup. Accurate UQ for RAR requires accounting for all sources of uncertainty, including those arising from retrieval and generation. In this paper, we account for all these sources and introduce Retrieval-Augmented Reasoning Consistency (R2C)--a novel UQ method for RAR. The core idea of R2C is to perturb the multi-step reasoning process by applying various actions to reasoning steps. These perturbations alter the retriever's input, which shifts its output and consequently modifies the generator's input at the next step. Through this iterative feedback loop, the retriever and generator continuously reshape one another's inputs, enabling us to capture uncertainty arising from both components. Experiments on five popular RAR systems across diverse QA datasets show that R2C improves AUROC by over 5% on average compared to the state-of-the-art UQ baselines. Extrinsic evaluations using R2C as an external signal further confirm its effectiveness for two downstream tasks: in Abstention, it achieves ~5% gains in both F1Abstain and AccAbstain; in Model Selection, it improves the exact match by ~7% over single models and ~3% over selection methods.
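One generic way to turn a set of perturbed reasoning runs into a confidence score is majority agreement among their final answers. The abstract does not state how R2C aggregates its perturbed runs, so the scoring rule and the function name below are assumptions, shown only to make the consistency idea concrete.

```python
from collections import Counter


def consistency_confidence(answers):
    """Return (majority answer, fraction of runs agreeing with it).

    `answers` holds the final answers of several perturbed reasoning runs;
    low agreement is read as high uncertainty.
    """
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count / len(normalized)


# Final answers from four hypothetical perturbed runs of the same question.
runs = ["Paris", "paris ", "Lyon", "Paris"]
answer, confidence = consistency_confidence(runs)
print(answer, confidence)  # paris 0.75
```

A score like this can feed an abstention rule, for example declining to answer when the agreement falls below a chosen threshold.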
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, Heinz Koeppl
Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR's unsupervised SoTA with ~1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
Meiru Zhang, Philipp Borchert, Milan Gritta, Gerasimos Lampouras
Automating the formalization of mathematical statements for theorem proving remains a major challenge for Large Language Models (LLMs). LLMs struggle to identify and utilize the prerequisite mathematical knowledge and its corresponding formal representation in languages like Lean. Current retrieval-augmented autoformalization methods query external libraries using the informal statement directly, but overlook a fundamental limitation: informal mathematical statements are often complex and offer limited context on the underlying math concepts. To address this, we introduce DRIFT, a novel framework that enables LLMs to decompose informal mathematical statements into smaller, more tractable "sub-components". This facilitates targeted retrieval of premises from mathematical libraries such as Mathlib. Additionally, DRIFT retrieves illustrative theorems to help models use premises more effectively in formalization tasks. We evaluate DRIFT across diverse benchmarks (ProofNet, ConNF, and MiniF2F-test) and find that it consistently improves premise retrieval, nearly doubling the F1 score compared to the DPR baseline on ProofNet. Notably, DRIFT demonstrates strong performance on the out-of-distribution ConNF benchmark, with BEq+@10 improvements of 37.14% and 42.25% using GPT-4.1 and DeepSeek-V3.1, respectively. Our analysis shows that retrieval effectiveness in mathematical autoformalization depends heavily on model-specific knowledge boundaries, highlighting the need for adaptive retrieval strategies aligned with each model's capabilities.
Maoliang Li, Ke Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Xiang Chen
To effectively leverage user-specific data, retrieval augmented generation
(RAG) is employed in multimodal large language model (MLLM) applications.
However, conventional retrieval approaches often suffer from limited retrieval
accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by
decomposing queries and matching against segmented images. However, they still
suffer from sub-optimal accuracy and efficiency because they overlook both the
alignment between the query and varying image objects and the redundancy of
fine-grained image segments. In this work, we present HiMIR, an efficient
scheduling framework for image retrieval. First, we introduce a novel
hierarchical paradigm that employs multiple intermediate granularities for
varying image objects to enhance alignment. Second, we reduce redundancy in
retrieval by leveraging cross-hierarchy similarity consistency and hierarchy
sparsity to avoid unnecessary matching computation. Furthermore, we configure
parameters for each dataset automatically for practicality across diverse
scenarios. Our empirical study shows that HiMIR not only achieves substantial
accuracy improvements but also reduces computation by up to 3.5 times over the
existing MVR system.
Authors' comments: Under Review
Arkadeep Acharya, Akash Ghosh, Pradeepika Verma, Kitsuchart Pasupa, Sriparna Saha, Priti Singh
With the increasing use of Retrieval-Augmented Generation (RAG), strong
retrieval models have become more important than ever. In healthcare,
multimodal retrieval models that combine information from both text and images
offer major advantages for many downstream tasks such as question answering,
cross-modal retrieval, and multimodal summarization, since medical data often
includes both formats. However, there is currently no standard benchmark to
evaluate how well these models perform in medical settings. To address this
gap, we introduce M3Retrieve, a Multimodal Medical Retrieval Benchmark.
M3Retrieve spans 5 domains, 16 medical fields, and 4 distinct tasks, with over
1.2 million text documents and 164K multimodal queries, all collected under
approved licenses. We evaluate leading multimodal retrieval models on this
benchmark to explore the challenges specific to different medical specialities
and to understand their impact on retrieval performance. By releasing
M3Retrieve, we aim to enable systematic evaluation, foster model innovation,
and accelerate research toward building more capable and reliable multimodal
retrieval systems for medical applications. The dataset and the baseline code
are available on GitHub at https://github.com/AkashGhosh/M3Retrieve.
Authors' comments: EMNLP Mains 2025
Yu-Fei Shih, An-Zi Yen, Hen-Hsen Huang, Hsin-Hsi Chen
People often struggle to remember specific details of past experiences, which can lead to the need to revisit these memories. Consequently, lifelog retrieval has emerged as a crucial application. Various studies have explored methods to facilitate rapid access to personal lifelogs for memory recall assistance. In this paper, we propose a Captioning-Integrated Visual Lifelog (CIVIL) Retrieval System for extracting specific images from a user's visual lifelog based on textual queries. Unlike traditional embedding-based methods, our system first generates captions for visual lifelogs and then utilizes a text embedding model to project both the captions and user queries into a shared vector space. Visual lifelogs, captured through wearable cameras, provide a first-person viewpoint, necessitating the interpretation of the activities of the individual behind the camera rather than merely describing the scene. To address this, we introduce three distinct approaches: the single caption method, the collective caption method, and the merged caption method, each designed to interpret the life experiences of lifeloggers. Experimental results show that our method effectively describes first-person visual images, enhancing the outcomes of lifelog retrieval. Furthermore, we construct a textual dataset that converts visual lifelogs into captions, thereby reconstructing personal life experiences.
Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, Manuel Faysse
Multimodal embedding models are gaining prevalence, notably for document retrieval as efficient alternatives to text-only pipelines. These models are typically built by finetuning large vision-language decoders (VLMs) with contrastive losses on text-image pairs. In this work, we show that, while cost-efficient, this repurposing approach often bottlenecks retrieval performance. Through controlled experiments, we establish a principled recipe for improving visual document retrieval models. We notably measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. Models and code are made available at https://huggingface.co/ModernVBERT.
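Late interaction is commonly scored with the ColBERT-style MaxSim operator: each query token embedding is matched to its best document token (or image patch) embedding, and the maxima are summed. The sketch below shows only that scoring rule on toy vectors; whether ModernVBERT uses exactly this form is not stated in the abstract, and the names and data are assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best match in the doc."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-D token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]   # covers only the first query token

print(maxsim(query, doc_a))  # 2.0
print(maxsim(query, doc_b))  # 1.0
```

Because documents keep one vector per token or patch rather than a single pooled vector, a document is rewarded for covering each part of the query separately.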
Leopold Müller, Joshua Holstein, Sarah Bause, Gerhard Satzger, Niklas Kühl
Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to
enhance Large Language Models with enterprise-specific knowledge. However,
current data quality (DQ) frameworks have been primarily developed for static
datasets, and only inadequately address the dynamic, multi-stage nature of RAG
systems. This study aims to develop DQ dimensions for this new type of AI-based
systems. We conduct 16 semi-structured interviews with practitioners of leading
IT service companies. Through a qualitative content analysis, we inductively
derive 15 distinct DQ dimensions across the four processing stages of RAG
systems: data extraction, data transformation, prompt & search, and generation.
Our findings reveal that (1) new dimensions have to be added to traditional DQ
frameworks to also cover RAG contexts; (2) these new dimensions are
concentrated in early RAG steps, suggesting the need for front-loaded quality
management strategies, and (3) DQ issues transform and propagate through the
RAG pipeline, necessitating a dynamic, step-aware approach to quality
management.
Authors' comments: Preprint version. Accepted for presentation at the International
Conference on Information Systems (ICIS 2025). Please cite the published
version when available
Xiaoyu Song, William Han, Tony Chen, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Interest in generative Electrocardiogram-Language Models (ELMs) is growing,
as they can produce textual responses conditioned on ECG signals and textual
queries. Unlike traditional classifiers that output label probabilities, ELMs
are more versatile, supporting domain-specific tasks (e.g., waveform analysis,
diagnosis, prognosis) as well as general tasks (e.g., open-ended questions,
dialogue). Retrieval-Augmented Generation (RAG), widely used in Large Language
Models (LLMs) to ground LLM outputs in retrieved knowledge, helps reduce
hallucinations and improve natural language generation (NLG). However, despite
its promise, no open-source implementation or systematic study of RAG pipeline
design for ELMs currently exists. To address this gap, we present the first
open-source RAG pipeline for ELMs, along with baselines and ablation studies
for NLG. Experiments on three public datasets show that ELMs with RAG
consistently improve performance over non-RAG baselines and highlight key ELM
design considerations. Our code is available at:
https://github.com/willxxy/ECG-Bench.
Authors' comments: 5 pages, 2 figures; Submitted to ICASSP 2026