Shaoqi Wang, Lu Yu, Chunjie Yang
The convergence of deep learning and formal mathematics has spurred research in formal verification. Statement autoformalization, a crucial first step in this process, aims to translate informal descriptions into machine-verifiable representations but remains a significant challenge. The core difficulty lies in the fact that existing methods often suffer from a lack of contextual awareness, leading to hallucination of formal definitions and theorems. Furthermore, current retrieval-augmented approaches exhibit poor precision and recall for formal library dependency retrieval, and lack the scalability to effectively leverage ever-growing public datasets. To bridge this gap, we propose a novel retrieval-augmented framework based on DDR (Direct Dependency Retrieval) for statement autoformalization. Our DDR method directly generates candidate library dependencies from natural language mathematical descriptions and subsequently verifies their existence within the formal library via an efficient suffix array check. Leveraging this efficient search mechanism, we constructed a dependency retrieval dataset of over 500,000 samples and fine-tuned a high-precision DDR model. Experimental results demonstrate that our DDR model significantly outperforms SOTA methods in both retrieval precision and recall. Consequently, an autoformalizer equipped with DDR shows consistent performance advantages in both single-attempt accuracy and multi-attempt stability compared to models using traditional selection-based RAG methods.
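The existence check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the library names, the newline-delimited index layout, and the function names (`build_index`, `contains`, `verify_candidates`) are assumptions made for the example, and the suffix array is built naively rather than with a linear-time algorithm.

```python
def build_index(library_names):
    """Join names with newline delimiters and build a naive suffix array.

    Hypothetical layout: delimiters on both sides let us test for whole
    names rather than arbitrary substrings.
    """
    text = "\n" + "\n".join(library_names) + "\n"
    # Naive construction (sorting all suffixes); fine for a small sketch.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array


def contains(text, suffix_array, pattern):
    """Binary-search the suffix array for a suffix that starts with pattern."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo])


def verify_candidates(index, candidates):
    """Keep only generated candidates that exist as whole names in the library."""
    text, suffix_array = index
    return [c for c in candidates if contains(text, suffix_array, "\n" + c + "\n")]


# Illustrative names only; a real index would cover the full formal library.
index = build_index(["Nat.add_comm", "Nat.mul_comm", "List.length_append"])
kept = verify_candidates(index, ["Nat.add_comm", "Nat.add", "Nat.add_commm"])
print(kept)
```

Each lookup costs a binary search over the index, so a generated dependency that does not exist in the library can be discarded cheaply before it reaches the autoformalizer.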
Paolo Astrino
Organizations handling sensitive documents face a critical dilemma: adopt cloud-based AI systems that offer powerful question-answering capabilities but compromise data privacy, or maintain local processing that ensures security but delivers poor accuracy. We present a question-answering system that resolves this trade-off by combining semantic understanding with keyword precision, operating entirely on local infrastructure without internet access. Our approach demonstrates that organizations can achieve competitive accuracy on complex queries across legal, scientific, and conversational documents while keeping all data on their machines. By balancing two complementary retrieval strategies and using consumer-grade hardware acceleration, the system delivers reliable answers with minimal errors, letting banks, hospitals, and law firms adopt conversational document AI without transmitting proprietary information to external providers. This work establishes that privacy and performance need not be mutually exclusive in enterprise AI deployment.
Authors' comments: 10 pages, 5 figures, 3 tables; conference-style (ACL format); fully local RAG system
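One common way to balance a semantic retriever against a keyword retriever is a weighted sum of normalized scores. The sketch below assumes that scheme; the abstract does not specify the fusion rule, so the `alpha` weight, the min-max normalization, the term-overlap keyword scorer, and the precomputed semantic scores are all illustrative assumptions.

```python
def minmax(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def keyword_scores(query, docs):
    """Toy keyword scorer: number of distinct query terms present in each doc."""
    terms = set(query.lower().split())
    return [float(len(terms & set(d.lower().split()))) for d in docs]


def hybrid_rank(query, docs, semantic_scores, alpha=0.5):
    """Rank docs by alpha * semantic + (1 - alpha) * keyword (assumed rule)."""
    sem = minmax(semantic_scores)
    kw = minmax(keyword_scores(query, docs))
    fused = [alpha * s + (1 - alpha) * k for s, k in zip(sem, kw)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)


docs = [
    "termination clause of the lease",
    "lease renewal terms",
    "weather report",
]
# Hypothetical embedding similarities, e.g. from a local encoder.
semantic = [0.7, 0.8, 0.1]
print(hybrid_rank("lease termination clause", docs, semantic))
```

With `alpha=1.0` the ranking follows the semantic scores alone, and with `alpha=0.0` it follows keyword overlap alone, which makes the trade-off easy to tune on a validation set.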
Xinyu Zhou, Yu Wu, Jiayao Ma, Wenhao Wang, Min Cao, Mang Ye
This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR introduces greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions, enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture of experts module to learn view-specific and view-agnostic features and a viewpoint decoupling strategy to decouple view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES dataset and existing T-PR benchmarks. The dataset and code are available at https://github.com/Flame-Chasers/TAG-PR.
Özay Ezerceli, Gizem Gümüşçekiçci, Tuğba Erkoç, Berke Özenç
In this work, we introduce TurkEmbed4Retrieval, a retrieval-specialized variant of the TurkEmbed model originally designed for Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. By fine-tuning the base model on the MS MARCO TR dataset using advanced training techniques, including Matryoshka representation learning and a tailored multiple negatives ranking loss, we achieve SOTA performance for Turkish retrieval tasks. Extensive experiments demonstrate that our model outperforms Turkish colBERT by 19.26% on key retrieval metrics for the Scifact TR dataset, thereby establishing a new benchmark for Turkish information retrieval.
Authors' comments: 4 pages, in Turkish language, 1 figure, conference
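The two training techniques named above can be illustrated together. The sketch below shows the generic form of a multiple negatives ranking loss (cross-entropy over in-batch cosine similarities, with the other positives in the batch acting as negatives) summed over truncated embedding prefixes in the Matryoshka style. The scale of 20, the prefix sizes, and the toy embeddings are assumptions; the paper's tailored variant is not specified in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mnr_loss(queries, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def matryoshka_mnr_loss(queries, positives, dims=(2, 4)):
    """Sum the loss over embedding prefixes of each size in dims (assumed sizes)."""
    return sum(
        mnr_loss([q[:d] for q in queries], [p[:d] for p in positives])
        for d in dims
    )


# Toy 4-dimensional embeddings: aligned pairs versus swapped pairs.
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aligned = matryoshka_mnr_loss(queries, queries)
swapped = matryoshka_mnr_loss(queries, queries[::-1])
print(aligned < swapped)
```

Training on several prefix lengths at once is what lets a Matryoshka-style model be truncated to a shorter embedding at inference time with limited loss in retrieval quality.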
Dazhao Du, Tao Han, Song Guo
Deep learning models such as MLP, Transformer, and TCN have achieved
remarkable success in univariate time series forecasting, typically relying on
sliding window samples from historical data for training. However, while these
models implicitly compress historical information into their parameters during
training, they are unable to explicitly and dynamically access this global
knowledge during inference, relying only on the local context within the
lookback window. This results in an underutilization of rich patterns from the
global history. To bridge this gap, we propose Predicting the Future by
Retrieving the Past (PFRP), a novel approach that explicitly integrates global
historical data to enhance forecasting accuracy. Specifically, we construct a
Global Memory Bank (GMB) to effectively store and manage global historical
patterns. A retrieval mechanism is then employed to extract similar patterns
from the GMB, enabling the generation of global predictions. By adaptively
combining these global predictions with the outputs of any local prediction
model, PFRP produces more accurate and interpretable forecasts. Extensive
experiments conducted on seven real-world datasets demonstrate that PFRP
significantly enhances the average performance of advanced univariate
forecasting models by 8.4%. The code is available at
https://github.com/ddz16/PFRP.
Authors' comments: Accepted by AAAI 2026
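The retrieve-then-combine idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the PFRP implementation: the memory bank stores raw (window, future) pairs, retrieval uses squared Euclidean distance with a top-k average, and the combination uses a fixed `weight` where the paper describes an adaptive one.

```python
def build_memory_bank(series, lookback, horizon):
    """Store every (lookback window, following horizon) pair from the history."""
    bank = []
    for s in range(len(series) - lookback - horizon + 1):
        window = series[s:s + lookback]
        future = series[s + lookback:s + lookback + horizon]
        bank.append((window, future))
    return bank


def global_prediction(bank, window, k=3):
    """Average the futures of the k stored windows closest to the query window."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(bank, key=lambda entry: sq_dist(entry[0], window))[:k]
    horizon = len(nearest[0][1])
    return [sum(e[1][h] for e in nearest) / len(nearest) for h in range(horizon)]


def combine(local_pred, global_pred, weight=0.5):
    """Blend local and global forecasts; the fixed weight is an assumption."""
    return [weight * g + (1 - weight) * l for l, g in zip(local_pred, global_pred)]


history = [0, 1, 2, 3] * 5              # toy periodic series
bank = build_memory_bank(history, lookback=4, horizon=2)
retrieved = global_prediction(bank, [0, 1, 2, 3], k=3)
print(combine([1.0, 1.0], retrieved))   # local forecast from any base model
```

Because the retrieved neighbors are concrete segments of the history, they can be inspected directly, which is one way a retrieval-based forecast can be easier to interpret than a purely parametric one.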
Qianru Meng, Xiao Zhang, Zhaochen Ren, Joost Visser
Code review is essential for maintaining software quality but is labor-intensive. Automated code review generation offers a promising solution to this challenge. Both deep learning-based generative techniques and retrieval-based methods have demonstrated strong performance in this task. However, generated reviews can still be off-point or overly general. To address these issues, we introduce Retrieval-Augmented Reviewer (RARe), which leverages Retrieval-Augmented Generation (RAG) to combine retrieval-based and generative methods, explicitly incorporating external domain knowledge into the code review process. RARe uses a dense retriever to select the most relevant reviews from the codebase; these reviews then enrich the input to a neural generator, which uses the contextual learning capacity of large language models (LLMs) to produce the final review. RARe outperforms state-of-the-art methods on two benchmark datasets, achieving BLEU-4 scores of 12.32 and 12.96, respectively. Its effectiveness is further validated through a detailed human evaluation and a case study using an interpretability tool, demonstrating its practical utility and reliability.
Arthur Satouf, Yuxuan Zong, Habiboulaye Amadou-Boubacar, Pablo Piantanida, Benjamin Piwowarski
Generative Retrieval (GR) differs from the traditional index-then-retrieve pipeline by storing relevance in model parameters and directly generating document identifiers. However, GR often struggles to generalize and is costly to scale. We introduce QUESTER (QUEry SpecificaTion gEnerative Retrieval), which reframes GR as query specification generation (in this work, a simple keyword query handled by BM25) using a small LLM. The policy is trained using reinforcement learning techniques (GRPO). Across in- and out-of-domain evaluations, we show that our model is more effective than BM25 and competitive with neural IR models, while maintaining good efficiency.
Reza Esfandiarpoor, Max Zuo, Stephen H. Bach
We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, we introduce efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly, with just a few lines of code. This gives users the flexibility to easily experiment with different dataset configurations without the need to compute and store multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code and unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove's data management features reduce memory consumption by a factor of 2.6. Moreover, Trove's easy-to-use inference pipeline incurs no overhead, and inference times decrease linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.
Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo
Although Large Language Models (LLMs) demonstrate significant capabilities, their reliance on parametric knowledge often leads to inaccuracies. Retrieval Augmented Generation (RAG) mitigates this by incorporating external knowledge, but these methods may introduce irrelevant retrieved documents, leading to inaccurate responses. Integration methods, in contrast, filter out incorrect answers from multiple responses but lack the external knowledge that RAG methods provide, and their high costs require balancing overhead against performance gains. To address these issues, we propose an Efficient Test-Time Retrieval-Augmented Generation Framework named ET2RAG to improve the performance of LLMs while maintaining efficiency. Specifically, ET2RAG is a training-free method that first retrieves the most relevant documents and augments the LLMs to efficiently generate diverse candidate responses by managing response length. We then compute the similarity of the candidate responses and employ a majority voting mechanism to select the most suitable response as the final output. In particular, we discover that partial generation is sufficient to capture the key information necessary for consensus calculation, allowing us to perform majority voting effectively without fully generated responses. Thus, we can balance computational cost and performance by managing the response length and the number of retrieved documents used for majority voting. Experimental results demonstrate that ET2RAG significantly enhances performance across three tasks: open-domain question answering, recipe generation, and image captioning.
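The consensus step can be sketched as picking the candidate that agrees most with the others, computed on truncated responses only. This is an illustration under assumptions rather than the ET2RAG procedure: Jaccard overlap of token sets stands in for the similarity measure, `max_tokens` stands in for the managed response length, and the candidate strings are made up.

```python
def jaccard(a, b):
    """Jaccard similarity between two token lists, treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def consensus_pick(candidates, max_tokens=8):
    """Return the index of the candidate most similar to all the others.

    Only the first max_tokens tokens of each candidate are compared,
    mimicking voting on partially generated responses.
    """
    truncated = [c.lower().split()[:max_tokens] for c in candidates]
    totals = [
        sum(jaccard(t, u) for j, u in enumerate(truncated) if j != i)
        for i, t in enumerate(truncated)
    ]
    return max(range(len(candidates)), key=lambda i: totals[i])


# Hypothetical candidate answers, one per retrieved document.
candidates = [
    "Paris is the capital of France",
    "The capital of France is Paris",
    "Lyon is the capital of France",
]
print(candidates[consensus_pick(candidates)])
```

Lowering `max_tokens` reduces generation cost per candidate, at the risk of truncating the part of the response that distinguishes a correct answer from an incorrect one.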
Teerapol Saengsukhiran, Peerawat Chomphooyod, Narabodee Rodjananant, Chompakorn Chaksangchaichot, Patawee Prakrankamanant, Witthawin Sripheanpol, Pak Lovichit, SarChaksaana Nutanong et al.
Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.
Zirui Cheng, Jikai Sun, Anjun Gao, Yueyang Quan, Zhuqing Liu, Xiaohua Hu, Minghong Fang
Large language models (LLMs) have transformed natural language processing
(NLP), enabling applications from content generation to decision support.
Retrieval-Augmented Generation (RAG) improves LLMs by incorporating external
knowledge but also introduces security risks, particularly from data poisoning,
where the attacker injects poisoned texts into the knowledge database to
manipulate system outputs. While various defenses have been proposed, they
often struggle against advanced attacks. To address this, we introduce RAGuard,
a detection framework designed to identify poisoned texts. RAGuard first
expands the retrieval scope to increase the proportion of clean texts, reducing
the likelihood of retrieving poisoned content. It then applies chunk-wise
perplexity filtering to detect abnormal variations and text similarity
filtering to flag highly similar texts. This non-parametric approach enhances
RAG security, and experiments on large-scale datasets demonstrate its
effectiveness in detecting and mitigating poisoning attacks, including strong
adaptive attacks.
Authors' comments: To appear in IEEE BigData 2025
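The chunk-wise perplexity idea can be illustrated with a small sketch. It assumes per-token log-probabilities have already been obtained from some language model (the values below are made up), and it flags a text when the ratio between its highest and lowest chunk perplexity exceeds a threshold. The ratio rule, `chunk_size`, and `max_ratio` are assumptions for illustration; the abstract does not state the detection criterion RAGuard uses.

```python
import math


def chunk_perplexities(token_logprobs, chunk_size):
    """Perplexity of each consecutive chunk: exp of the mean negative log-prob."""
    ppls = []
    for start in range(0, len(token_logprobs), chunk_size):
        chunk = token_logprobs[start:start + chunk_size]
        ppls.append(math.exp(-sum(chunk) / len(chunk)))
    return ppls


def is_suspicious(token_logprobs, chunk_size=4, max_ratio=3.0):
    """Flag texts whose chunk perplexities vary abnormally (assumed rule)."""
    ppls = chunk_perplexities(token_logprobs, chunk_size)
    return max(ppls) / min(ppls) > max_ratio


# Hypothetical natural-log probabilities per token.
clean_text = [-1.0] * 8                   # uniform fluency across chunks
poisoned_text = [-1.0] * 4 + [-4.0] * 4   # second chunk is far less fluent
print(is_suspicious(clean_text), is_suspicious(poisoned_text))
```

Scoring chunks separately matters because a poisoned passage often splices an injected segment into otherwise fluent text, which a single document-level perplexity can average away.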
Yang Tian, Fan Liu, Jingyuan Zhang, Wei Bi, Yupeng Hu, Liqiang Nie
Large Multimodal Models (LMMs) have achieved remarkable progress in
generating photorealistic and prompt-aligned images, but they often produce
outputs that contradict verifiable knowledge, especially when prompts involve
fine-grained attributes or time-sensitive events. Conventional
retrieval-augmented approaches attempt to address this issue by introducing
external information, yet they are fundamentally incapable of grounding
generation in accurate and evolving knowledge due to their reliance on static
sources and shallow evidence integration. To bridge this gap, we introduce
ORIG, an agentic open multimodal retrieval-augmented framework for Factual
Image Generation (FIG), a new task that requires both visual realism and
factual grounding. ORIG iteratively retrieves and filters multimodal evidence
from the web and incrementally integrates the refined knowledge into enriched
prompts to guide generation. To support systematic evaluation, we build
FIG-Eval, a benchmark spanning ten categories across perceptual, compositional,
and temporal dimensions. Experiments demonstrate that ORIG substantially
improves factual consistency and overall image quality over strong baselines,
highlighting the potential of open multimodal retrieval for factual image
generation.
Authors' comments: Preprint
Yang Zhong, Zhiming Wang, Zhaoyang Li, Jinyu Ma, Xiang Li
This paper introduces the 3rd place solution to the ICCV LargeFineFoodAI Retrieval Competition on Kaggle. Four base models are trained independently with a weighted sum of ArcFace and Circle losses; test-time augmentation (TTA) and ensembling are then applied in turn to improve the feature representations. In addition, a new reranking method for retrieval is proposed based on diffusion and k-reciprocal reranking. Our method scored 0.81219 and 0.81191 mAP@100 on the public and private leaderboards, respectively.
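The k-reciprocal part of the reranking step rests on one definition: item j is a k-reciprocal neighbor of item i when each appears in the other's k-nearest-neighbor list. The sketch below implements only that definition on a toy distance matrix; the diffusion step and the way the competition solution combines the two are not reproduced here, and the one-dimensional "features" are made up.

```python
def knn(dist, i, k):
    """Indices of the k nearest neighbors of item i, excluding i itself."""
    others = [j for j in range(len(dist)) if j != i]
    return sorted(others, key=lambda j: dist[i][j])[:k]


def k_reciprocal(dist, i, k):
    """Neighbors j of i such that i is also among the k nearest neighbors of j."""
    return [j for j in knn(dist, i, k) if i in knn(dist, j, k)]


# Toy 1-D features; distance is the absolute difference.
points = [0.0, 1.0, 3.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]

print(k_reciprocal(dist, 0, k=1))  # mutual nearest neighbors of item 0
print(k_reciprocal(dist, 3, k=1))  # the outlier has no reciprocal neighbor
```

The reciprocity test filters out one-sided matches such as the outlier above, whose nearest neighbor does not consider it a neighbor in return, which is why it tends to be a more reliable signal than plain nearest-neighbor rank.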
Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang, Suhang Wang, Zhe Feng
Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM's context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval, as user requests are often poorly aligned with the language of TDs. To remedy this issue, we propose ToolDreamer, a framework that conditions retriever models to fetch tools based on hypothetical (synthetic) TDs generated using an LLM, i.e., descriptions of tools that the LLM expects to be useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TDs. We apply ToolDreamer to the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. With the proposed framework, we aim to offload a portion of the reasoning burden to the retriever so that the LLM can effectively handle a large collection of tools without inundating its context window.
Yang Hu, William Kuszmaul, Jingxun Liang, Huacheng Yu, Junkai Zhang, Renfei Zhou
In the static retrieval problem, a data structure must answer retrieval
queries mapping a set of $n$ keys in a universe $[U]$ to $v$-bit values.
Information-theoretically, retrieval data structures can use as little as $nv$
bits of space. For small value sizes $v$, it is possible to achieve $O(1)$
query time while using space $nv + o(n)$ bits -- whether or not such a result
is possible for larger values of $v$ (e.g., $v = \Theta(\log n)$) has remained
open.
In this paper, we obtain a tight lower bound (as well as matching upper
bounds) for the static retrieval problem. In the case where values are large,
we show that there is actually a significant tension between time and space. It
is not possible, for example, to get $O(1)$ query time using $nv + o(n)$ bits
of space, when $v = \Theta(\log n)$ (and assuming the word RAM model with
$O(\log n)$-bit words).
At first glance, our lower bound would seem to render retrieval unusable in
many settings that aim to achieve very low redundancy. However, our second
result offers a way around this: We show that, whenever a retrieval data
structure $D_1$ is stored along with another data structure $D_2$ (whose size
is similar to or larger than the size of $D_1$), it is possible to implement
the combined data structure $D_1 \cup D_2$ so that queries to $D_1$ take $O(1)$
time, operations on $D_2$ take the same asymptotic time as if $D_2$ were stored
on its own, and the total space is $nv + \mathrm{Space}(D_2) + n^{0.67}$ bits.
Authors' comments: 28 pages, in FOCS 2025
Jiahao Shi, Tianyi Zhang
Despite recent advances, Large Language Models (LLMs) still generate vulnerable code. Retrieval-Augmented Generation (RAG) has the potential to enhance LLMs for secure code generation by incorporating external security knowledge. However, the conventional RAG design struggles with the noise of raw security-related documents, and existing retrieval methods overlook the significant security semantics implicitly embedded in task descriptions. To address these issues, we propose RESCUE, a new RAG framework for secure code generation with two key innovations. First, we propose a hybrid knowledge base construction method that combines LLM-assisted cluster-then-summarize distillation with program slicing, producing both high-level security guidelines and concise, security-focused code examples. Second, we design a hierarchical multi-faceted retrieval method that traverses the constructed knowledge base from top to bottom and integrates multiple security-critical facts at each hierarchical level, ensuring comprehensive and accurate retrieval. We evaluated RESCUE on four benchmarks and compared it with five state-of-the-art secure code generation methods on six LLMs. The results demonstrate that RESCUE improves the SecurePass@1 metric by an average of 4.8 points, establishing new state-of-the-art security performance. Furthermore, we performed in-depth analysis and ablation studies to rigorously validate the effectiveness of individual components in RESCUE.
Iman Deznabi, Peeyush Kumar, Madalina Fiterau
Zero-shot forecasting aims to predict outcomes for previously unseen conditions without direct historical data, posing a significant challenge for traditional forecasting methods. We introduce a Resolution-Aware Retrieval-Augmented Forecasting model that enhances predictive accuracy by leveraging spatial correlations and temporal frequency characteristics. By decomposing signals into different frequency components, our model employs resolution-aware retrieval, where lower-frequency components rely on broader spatial context, while higher-frequency components focus on local influences. This allows the model to dynamically retrieve relevant data and adapt to new locations with minimal historical context. Applied to microclimate forecasting, our model significantly outperforms traditional forecasting methods, numerical weather prediction models, and modern foundation time series models, achieving 71% lower MSE than HRRR and 34% lower MSE than Chronos on the ERA5 dataset. Our results highlight the effectiveness of retrieval-augmented and resolution-aware strategies, offering a scalable and data-efficient solution for zero-shot forecasting in microclimate modeling and beyond.
Huyen N. Nguyen, Nils Gehlenborg
Effective visualization retrieval necessitates a clear definition of
similarity. Despite the growing body of work in specialized visualization
retrieval systems, a systematic approach to understanding visualization
similarity remains absent. We introduce the Similarity Framework for
Visualization Retrieval (Safire), a conceptual model that frames visualization
similarity along two dimensions: comparison criteria and representation
modalities. Comparison criteria identify the aspects that make visualizations
similar, which we divide into primary facets (data, visual encoding,
interaction, style, metadata) and derived properties (data-centric and
human-centric measures). Safire connects what to compare with how comparisons
are executed through representation modalities. We categorize existing
representation approaches into four groups based on their levels of information
content and visualization determinism: raster image, vector image,
specification, and natural language description, together guiding what is
computable and comparable. We analyze several visualization retrieval systems
using Safire to demonstrate its practical value in clarifying similarity
considerations. Our findings reveal how particular criteria and modalities
align across different use cases. Notably, the choice of representation
modality is not only an implementation detail but also an important decision
that shapes retrieval capabilities and limitations. Based on our analysis, we
provide recommendations and discuss broader implications for multimodal
learning, AI applications, and visualization reproducibility.
Authors' comments: To appear in IEEE VIS 2025
Zhiyuan Hu, Fakhriyya Mammadova, Julián Tachella, Michael Unser, Jonathan Dong
Phase retrieval is a nonlinear inverse problem that arises in a wide range of imaging modalities, from electron microscopy to optical Fourier ptychography. Among various modalities, random phase retrieval stands out thanks to its strong theoretical guarantees and efficient reconstruction algorithms, although its applicability is hindered by prohibitive computational costs. In this paper, we propose structured random models for phase retrieval, in which we emulate a dense random matrix by a cascade of structured transforms and random diagonal matrices. We demonstrate that structured random models can achieve the same reconstruction performance as dense random models, with complexity reduced from quadratic to log-linear. Using a spectral method initialization followed by gradient descent, robust reconstruction is obtained at an oversampling ratio as low as 2.8. Moreover, we observe that the reconstruction performance is solely determined by the singular value distribution of the forward matrix. This class of models can be implemented directly with basic optical elements such as lenses and diffusers, paving the way for large-scale phase imaging with robust reconstruction guarantees.
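A structured random forward model of this kind can be sketched as alternating random diagonal matrices with fast transforms. The example below assumes unit-modulus random phases for the diagonals and a radix-2 FFT as the structured transform, and it uses a square (not oversampled) operator for brevity; these choices, along with the function names, are illustrative assumptions and not the paper's exact construction. Reconstruction (spectral initialization and gradient descent) is not shown.

```python
import cmath
import random


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def random_phases(n, rng):
    """Diagonal of a random unit-modulus matrix (assumed diagonal model)."""
    return [cmath.exp(2j * cmath.pi * rng.random()) for _ in range(n)]


def structured_forward(x, diagonals):
    """Apply (normalized FFT) * diag(d) once per diagonal: O(n log n) each."""
    z = list(x)
    n = len(z)
    for d in diagonals:
        z = fft([di * zi for di, zi in zip(d, z)])
        z = [v / n ** 0.5 for v in z]  # keep each stage unitary
    return z


def measure(x, diagonals):
    """Phaseless intensity measurements |A x|^2."""
    return [abs(v) ** 2 for v in structured_forward(x, diagonals)]


rng = random.Random(0)
n = 8
diagonals = [random_phases(n, rng) for _ in range(2)]
signal = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
intensities = measure(signal, diagonals)
print(len(intensities))
```

Each stage costs one elementwise product and one FFT, so applying the operator is log-linear in the signal size, whereas multiplying by a dense random matrix of the same size is quadratic.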
Malte Fliedner, Julian Golak, Yağmur Gül, Simone Neumann
Growing demand for sustainable logistics and higher space utilization, driven by e-commerce and urbanization, increases the need for storage systems that are both energy- and space-efficient. Compact storage systems aim to maximize space utilization in limited storage areas and are therefore particularly suited to densely populated urban areas where space is scarce. In this paper, we examine a recently introduced compact storage system in which uniformly shaped bins are stacked directly on top of each other, eliminating the need for aisles used to handle materials. Target bins are retrieved in a fully automated process by first lifting all other bins that block access and then accessing the target bin from the side of the system by a dedicated robot. Consequently, retrieving a bin can require substantial lifting effort, and thus energy. However, this energy can be reduced through smart retrieval strategies. From an operational perspective, we investigate how retrievals can be optimized with respect to energy consumption. We model the retrieval problem within a mathematical framework. We show that the problem is strongly NP-complete and derive structural insights. Building on these insights, we propose two exact methods: a mixed-integer programming (MIP) formulation and a dynamic programming algorithm, along with a simple, practitioner-oriented greedy algorithm that yields near-instant solutions. Numerical experiments reveal that dynamic programming consistently outperforms state-of-the-art MIP solvers on small- to medium-sized instances, while the greedy algorithm delivers satisfactory performance, especially when exact methods become computationally impractical.