Jiahao Zhang, Wenzhe Yin, Shujian Yu
Effective cross-modal retrieval requires robust alignment of heterogeneous data types. Most existing methods focus on bi-modal retrieval tasks and rely on distributional alignment techniques such as Kullback-Leibler divergence, Maximum Mean Discrepancy, and correlation alignment. However, these methods often suffer from critical limitations, including numerical instability, sensitivity to hyperparameters, and their inability to capture the full structure of the underlying distributions. In this paper, we introduce the Cauchy-Schwarz (CS) divergence, a hyperparameter-free measure that improves both training stability and retrieval performance. We further propose a novel Generalized CS (GCS) divergence inspired by Hölder's inequality. This extension enables direct alignment of three or more modalities within a unified mathematical framework through a bidirectional circular comparison scheme, eliminating the need for exhaustive pairwise comparisons. Extensive experiments on six benchmark datasets demonstrate the effectiveness of our method in both bi-modal and tri-modal retrieval tasks. The code of our CS/GCS divergence is publicly available at https://github.com/JiahaoZhang666/CSD.
Authors' comments: Accepted by ACMMM-25
Chao Huang, Fengran Mo, Yufeng Chen, Changhao Guan, Zhenrui Yue, Xinyu Wang, Jinan Xu, Kaiyu Huang
Multilingual dense retrieval aims to retrieve relevant documents across
different languages based on a unified retriever model. The challenge lies in
aligning representations of different languages in a shared vector space. The
common practice is to fine-tune the dense retriever via contrastive learning,
whose effectiveness highly relies on the quality of the negative sample and the
efficacy of mini-batch data. Different from the existing studies that focus on
developing sophisticated model architecture, we propose a method to boost data
utilization for multilingual dense retrieval by obtaining high-quality hard
negative samples and effective mini-batch data. The extensive experimental
results on a multilingual retrieval benchmark, MIRACL, with 16 languages
demonstrate the effectiveness of our method by outperforming several existing
strong baselines.
Authors' comments: Accepted by EMNLP 2025 (main)
Jihyun Moon, Charmgil Hong
Accurate and early diagnosis of malignant melanoma is critical for improving
patient outcomes. While convolutional neural networks (CNNs) have shown promise
in dermoscopic image analysis, they often neglect clinical metadata and require
extensive preprocessing. Vision-language models (VLMs) offer a multimodal
alternative but struggle to capture clinical specificity when trained on
general-domain data. To address this, we propose a retrieval-augmented VLM
framework that incorporates semantically similar patient cases into the
diagnostic prompt. Our method enables informed predictions without fine-tuning
and significantly improves classification accuracy and error correction over
conventional baselines. These results demonstrate that retrieval-augmented
prompting provides a robust strategy for clinical decision support.
Authors' comments: Medical Image Computing and Computer-Assisted Intervention (MICCAI)
ISIC Skin Image Analysis Workshop (MICCAI ISIC) 2025; 10 pages
Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
Mpoki Mwaisela, Peterson Yuhala, Pascal Felber, Valerio Schiavoni
Private information retrieval (PIR) is a cryptographic primitive that allows a client to securely query one or multiple servers without revealing their specific interests. In spite of their strong security guarantees, current PIR constructions are computationally costly. Specifically, most PIR implementations are memory-bound due to the need to scan extensive databases (in the order of GB), making them inherently constrained by the limited memory bandwidth in traditional processor-centric computing architectures.Processing-in-memory (PIM) is an emerging computing paradigm that augments memory with compute capabilities, addressing the memory bandwidth bottleneck while simultaneously providing extensive parallelism.Recent research has demonstrated PIM's potential to significantly improve performance across a range of data-intensive workloads, including graph processing, genome analysis, and machine learning. In this work, we propose the first PIM-based architecture for multi-server PIR. We discuss the algorithmic foundations of the latter and show how its operations align with the core strengths of PIM architectures: extensive parallelism and high memory bandwidth. Based on this observation, we design and implement IM-PIR, a PIM-based multi-server PIR approach on top of UPMEM PIM, the first openly commercialized PIM architecture. Our evaluation demonstrates that a PIM-based multi-server PIR implementation significantly improves query throughput by more than 3.7x when compared to a standard CPU-based PIR approach.
Tingwei Zhang, Fnu Suya, Rishi Jha, Collin Zhang, Vitaly Shmatikov
Hubness is a phenomenon in high-dimensional vector spaces where a point from the natural distribution is unusually close to many other points. This is a well-known problem in information retrieval that causes some items to accidentally (and incorrectly) appear relevant to many queries. In this paper, we investigate how attackers can exploit hubness to turn any image or audio input in a multi-modal retrieval system into an adversarial hub. Adversarial hubs can be used to inject universal adversarial content (e.g., spam) that will be retrieved in response to thousands of different queries, and also for targeted attacks on queries related to specific, attacker-chosen concepts. We present a method for creating adversarial hubs and evaluate the resulting hubs on benchmark multi-modal retrieval datasets and an image-to-image retrieval system implemented by Pinecone, a popular vector database. For example, in text-caption-to-image retrieval, a single adversarial hub, generated using 100 random queries, is retrieved as the top-1 most relevant image for more than 21,000 out of 25,000 test queries (by contrast, the most common natural hub is the top-1 response to only 102 queries), demonstrating the strong generalization capabilities of adversarial hubs. We also investigate whether techniques for mitigating natural hubness can also mitigate adversarial hubs, and show that they are not effective against hubs that target queries related to specific concepts.
Songjiang Lai, Tsun-Hin Cheung, Ka-Chun Fung, Kaiwen Xue, Kwan-Ho Lin, Yan-Ming Choi, Vincent Ng, Kin-Man Lam
In this paper, we introduce Technical-Embeddings, a novel framework designed to optimize semantic retrieval in technical documentation, with applications in both hardware and software development. Our approach addresses the challenges of understanding and retrieving complex technical content by leveraging the capabilities of Large Language Models (LLMs). First, we enhance user queries by generating expanded representations that better capture user intent and improve dataset diversity, thereby enriching the fine-tuning process for embedding models. Second, we apply summary extraction techniques to encode essential contextual information, refining the representation of technical documents. To further enhance retrieval performance, we fine-tune a bi-encoder BERT model using soft prompting, incorporating separate learning parameters for queries and document context to capture fine-grained semantic nuances. We evaluate our approach on two public datasets, RAG-EDA and Rust-Docs-QA, demonstrating that Technical-Embeddings significantly outperforms baseline models in both precision and recall. Our findings highlight the effectiveness of integrating query expansion and contextual summarization to enhance information access and comprehension in technical domains. This work advances the state of Retrieval-Augmented Generation (RAG) systems, offering new avenues for efficient and accurate technical document retrieval in engineering and product development workflows.
Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Shiji Song, Gao Huang
Vision-language retrieval is an important multi-modal learning topic, where the goal is to retrieve the most relevant visual candidate for a given text query. Recently, pre-trained models, e.g., CLIP, show great potential on retrieval tasks. However, as pre-trained models are scaling up, fully fine-tuning them on donwstream retrieval datasets has a high risk of overfitting. Moreover, in practice, it would be costly to train and store a large model for each task. To overcome the above issues, we present a novel Cross-Modal Adapter for parameter-efficient transfer learning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Secondly, it allows encoder-level implicit cross-modal interactions between vision and language encoders. Although surprisingly simple, our approach has three notable benefits: (1) reduces the vast majority of fine-tuned parameters, (2) saves training time, and (3) allows all the pre-trained parameters to be fixed, enabling the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, our approach outperforms adapter-based methods on image-text retrieval datasets (MSCOCO, Flickr30K) and video-text retrieval datasets (MSR-VTT, DiDeMo, and ActivityNet).
Authors' comments: The accepted manuscript by Pattern Recognition 25 Journal. The published journal article is available at: https://doi.org/10.1016/j.patcog.2024.111144
Rohan Phanse, Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan
Retrieval-augmented systems are typically evaluated in settings where
information required to answer the query can be found within a single source or
the answer is short-form or factoid-based. However, many real-world
applications demand the ability to integrate and summarize information
scattered across multiple sources, where no single source is sufficient to
respond to the user's question. In such settings, the retrieval component of a
RAG pipeline must recognize a variety of relevance signals, and the generation
component must connect and synthesize information across multiple sources. We
present a scalable framework for constructing evaluation benchmarks that
challenge RAG systems to integrate information across distinct sources and
generate long-form responses. Using our framework, we build two new benchmarks
on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing
narrative synthesis and summarization tasks, respectively, that require
retrieval from large collections. Our extensive experiments with various RAG
pipelines -- including sparse and dense retrievers combined with frontier LLMs
-- reveal that generation quality is highly dependent on retrieval
effectiveness, which varies greatly by task. While multi-source synthesis
proves challenging even in an oracle retrieval setting, we find that reasoning
models significantly outperform standard LLMs at this distinct step.
Authors' comments: COLM 2025; this article supersedes the preprint: arXiv:2309.08960
Daniil Zelezetsky, Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov
Offline reinforcement learning (RL) often deals with suboptimal data when
collecting large expert datasets is unavailable or impractical. This limitation
makes it difficult for agents to generalize and achieve high performance, as
they must learn primarily from imperfect or inconsistent trajectories. A
central challenge is therefore how to best leverage scarce expert
demonstrations alongside abundant but lower-quality data. We demonstrate that
incorporating even a tiny amount of expert experience can substantially improve
RL agent performance. We introduce Re:Frame (Retrieving Experience From
Associative Memory), a plug-in module that augments a standard offline RL
policy (e.g., Decision Transformer) with a small external Associative Memory
Buffer (AMB) populated by expert trajectories drawn from a separate dataset.
During training on low-quality data, the policy learns to retrieve expert data
from the Associative Memory Buffer (AMB) via content-based associations and
integrate them into decision-making; the same AMB is queried at evaluation.
This requires no environment interaction and no modifications to the backbone
architecture. On D4RL MuJoCo tasks, using as few as 60 expert trajectories
(0.1% of a 6000-trajectory dataset), Re:Frame consistently improves over a
strong Decision Transformer baseline in three of four settings, with gains up
to +10.7 normalized points. These results show that Re:Frame offers a simple
and data-efficient way to inject scarce expert knowledge and substantially
improve offline RL from low-quality datasets.
Authors' comments: 11 pages, 3 figures
Yosef Dayani, Omer Benishu, Sagie Benaim
Text-to-3D generation approaches have advanced significantly by leveraging
pretrained 2D diffusion priors, producing high-quality and 3D-consistent
outputs. However, they often fail to produce out-of-domain (OOD) or rare
concepts, yielding inconsistent or inaccurate results. To this end, we propose
MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images
from a large in-the-wild 2D database and then conditions a multiview diffusion
model on these images to synthesize consistent and accurate multiview outputs.
Training such a retrieval-conditioned model is achieved via a novel hybrid
strategy bridging structured multiview data and diverse 2D image collections.
This involves training on multiview data using augmented conditioning views
that simulate retrieval variance for view-specific reconstruction, alongside
training on sets of retrieved real-world 2D images using a distinctive held-out
view prediction objective: the model predicts the held-out view from the other
views to infer 3D consistency from 2D data. To facilitate a rigorous OOD
evaluation, we introduce a new collection of challenging OOD prompts.
Experiments against state-of-the-art text-to-3D, image-to-3D, and
personalization baselines show that our approach significantly improves 3D
consistency, photorealism, and text adherence for OOD/rare concepts, while
maintaining competitive performance on standard benchmarks.
Authors' comments: Project page: https://yosefdayani.github.io/MV-RAG
Minbiao Han, Seyed A. Esmaeili, Michael Albert, Haifeng Xu
We study the problem of data selling for Retrieval Augmented Generation (RAG) tasks in Generative AI applications. We model each buyer's valuation of a dataset with a natural coverage-based valuation function that increases with the inclusion of more relevant data points that would enhance responses to anticipated queries. Motivated by issues such as data control and prior-free revenue maximization, we focus on the scenario where each data point can be allocated to only one buyer. We show that the problem of welfare maximization in this setting is NP-hard even with two bidders, but design a polynomial-time $(1-1/e)$ approximation algorithm for any number of bidders. Unfortunately, however, this efficient allocation algorithm fails to be incentive compatible. The crux of our approach is a carefully tailored post-processing step called \emph{data burning} which retains the $(1-1/e)$ approximation factor but achieves incentive compatibility. Our thorough experiments on synthetic and real-world image and text datasets demonstrate the practical effectiveness of our algorithm compared to popular baseline algorithms for combinatorial auctions.
William Zhang, Yiwen Zhu, Yunlei Lu, Mathieu Demarne, Wenjing Wang, Kai Deng, Nutan Sahoo, Katherine Lin et al.
Recent advances in Large Language Models (LLMs) have driven the adoption of
copilots in complex technical scenarios, underscoring the growing need for
specialized information retrieval solutions. In this paper, we introduce FLAIR,
a lightweight, feedback learning framework that adapts copilot systems'
retrieval strategies by integrating domain-specific expert feedback. FLAIR
operates in two stages: an offline phase obtains indicators from (1) user
feedback and (2) questions synthesized from documentation, storing these
indicators in a decentralized manner. An online phase then employs a two-track
ranking mechanism to combine raw similarity scores with the collected
indicators. This iterative setup refines retrieval performance for any query.
Extensive real-world evaluations of FLAIR demonstrate significant performance
gains on both previously seen and unseen queries, surpassing state-of-the-art
approaches. The system has been successfully integrated into Copilot DECO,
serving thousands of users at Microsoft, demonstrating its scalability and
effectiveness in operational environments.
Authors' comments: Accepted to CIKM2025
Kemal Altwlkany, Elmedin Selmanovic, Sead Delalic
Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost completely immune to temporal misalignments and achieve state-of-the-art results in cases of other audio distortions such as noise, reverb or extreme temporal stretching. Code and models are made publicly available and the results are easy to reproduce as we train and test using popular and freely available datasets of different sizes.
Ryan Sze-Yin Chan, Federico Nanni, Tomas Lazauskas, Rosie Wood, Penelope Yong, Lionel Tarassenko, Mark Girolami, James Geddes et al.
This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.
Ruisong Han, Zongbo Han, Jiahao Zhang, Mingyue Cheng, Changqing Zhang
Out-of-Distribution (OOD) detection is crucial for the reliable deployment of machine learning models in-the-wild, enabling accurate identification of test samples that differ from the training data distribution. Existing methods rely on auxiliary outlier samples or in-distribution (ID) data to generate outlier information for training, but due to limited outliers and their mismatch with real test OOD samples, they often fail to provide sufficient semantic supervision, leading to suboptimal performance. To address this, we propose a novel OOD detection method called Retrieval-Augmented Prompt (RAP). RAP augments a pre-trained vision-language model's prompts by retrieving external knowledge, offering enhanced semantic supervision for OOD detection. During training, RAP retrieves descriptive words for outliers based on joint similarity with external textual knowledge and uses them to augment the model's OOD prompts. During testing, RAP dynamically updates OOD prompts in real-time based on the encountered OOD samples, enabling the model to rapidly adapt to the test environment. Our extensive experiments demonstrate that RAP achieves state-of-the-art performance on large-scale OOD detection benchmarks. For example, in 1-shot OOD detection on the ImageNet-1k dataset, RAP reduces the average FPR95 by 7.05% and improves the AUROC by 1.71% compared to previous methods. Additionally, comprehensive ablation studies validate the effectiveness of each module and the underlying motivations of our approach.
Shreya Meel, Mohamed Nomeir, Pasan Dissanayake, Sanghamitra Dutta, Sennur Ulukus
Transparency and explainability are two important aspects to be considered
when employing black-box machine learning models in high-stake applications.
Providing counterfactual explanations is one way of catering this requirement.
However, this also poses a threat to the privacy of the institution that is
providing the explanation, as well as the user who is requesting it. In this
work, we are primarily concerned with the user's privacy who wants to retrieve
a counterfactual instance, without revealing their feature vector to the
institution. Our framework retrieves the exact nearest neighbor counterfactual
explanation from a database of accepted points while achieving perfect,
information-theoretic, privacy for the user. First, we introduce the problem of
private counterfactual retrieval (PCR) and propose a baseline PCR scheme that
keeps the user's feature vector information-theoretically private from the
institution. Building on this, we propose two other schemes that reduce the
amount of information leaked about the institution database to the user,
compared to the baseline scheme. Second, we relax the assumption of mutability
of all features, and consider the setting of immutable PCR (I-PCR). Here, the
user retrieves the nearest counterfactual without altering a private subset of
their features, which constitutes the immutable set, while keeping their
feature vector and immutable set private from the institution. For this, we
propose two schemes that preserve the user's privacy information-theoretically,
but ensure varying degrees of database privacy. Third, we extend our PCR and
I-PCR schemes to incorporate user's preference on transforming their
attributes, so that a more actionable explanation can be received. Finally, we
present numerical results to support our theoretical findings, and compare the
database leakage of the proposed schemes.
Authors' comments: arXiv admin note: text overlap with arXiv:2410.13812,
arXiv:2411.10429
Kisu Yang, Yoonna Jang, Hwanseok Jang, Kenneth Choi, Isabelle Augenstein, Heuiseok Lim
Lowering the numerical precision of model parameters and computations is
widely adopted to improve the efficiency of retrieval systems. However, when
computing relevance scores between the query and documents in low-precision, we
observe spurious ties due to the reduced granularity. This introduces high
variability in the results based on tie resolution, making the evaluation less
reliable. To address this, we propose a more robust retrieval evaluation
protocol designed to reduce score variation. It consists of: (1) High-Precision
Scoring (HPS), which upcasts the final scoring step to higher precision to
resolve tied candidates with minimal computational cost; and (2) Tie-aware
Retrieval Metrics (TRM), which report expected scores, range, and bias to
quantify order uncertainty of tied candidates. Our experiments test multiple
models with three scoring functions on two retrieval datasets to demonstrate
that HPS dramatically reduces tie-induced instability, and TRM accurately
recovers expected metric values. This combination enables a more consistent and
reliable evaluation system for lower-precision retrievals.
Authors' comments: 11 pages, 5 figures, submitted to ARR
Yunming Xiao, Peizhi Liu, Ruijie Yu, Chenkai Weng, Matteo Varvello, Aleksandar Kuzmanovic
There has been a growing interest in Internet user privacy, demonstrated by the popularity of privacy-preserving products such as Telegram and Brave, and the widespread adoption of HTTPS. The Domain Name System (DNS) is a key component of Internet-based communication and its privacy has been neglected for years. Recently, DNS over HTTPS (DoH) has improved the situation by fixing the issue of in-path middleboxes. Further progress has been made with proxy-based solutions such as Oblivious DoH (ODoH), which separate a user's identity from their DNS queries. However, these solutions rely on non-collusion assumptions between DNS resolvers and proxies -- an assumption difficult to guarantee in practice. To address this, we explore integrating single-server Private Information Retrieval (PIR) into DNS to enable encrypted query processing without relying on trust assumptions. However, applying PIR to DNS is challenging due to its hierarchical nature -- particularly, interactions with recursive resolvers can still leak information. Navigating performance and privacy trade-offs, we propose PDNS, a DNS extension leveraging single-server PIR to strengthen privacy guarantees. We have implemented a prototype of PDNS and compared its performance against state-of-the-art solutions via trace-driven experiments. The results show that PDNS achieves acceptable performance (2x faster than DoH over Tor with similar privacy guarantees) and strong privacy guarantees today, mainly at the cost of its scalability, which specialized hardware for PIR can address in the near future.
Damith Premasiri, Tharindu Ranasinghe, Ruslan Mitkov
In common law systems, legal professionals such as lawyers and judges rely on
precedents to build their arguments. As the volume of cases has grown massively
over time, effectively retrieving prior cases has become essential. Prior case
retrieval (PCR) is an information retrieval (IR) task that aims to
automatically identify the most relevant court cases for a specific query from
a large pool of potential candidates. While IR methods have seen several
paradigm shifts over the last few years, the vast majority of PCR methods
continue to rely on traditional IR methods, such as BM25. The state-of-the-art
deep learning IR methods have not been successful in PCR due to two key
challenges: i. Lengthy legal text limitation; when using the powerful
BERT-based transformer models, there is a limit of input text lengths, which
inevitably requires to shorten the input via truncation or division with a loss
of legal context information. ii. Lack of legal training data; due to data
privacy concerns, available PCR datasets are often limited in size, making it
difficult to train deep learning-based models effectively. In this research, we
address these challenges by leveraging LLM-based text embedders in PCR.
LLM-based embedders support longer input lengths, and since we use them in an
unsupervised manner, they do not require training data, addressing both
challenges simultaneously. In this paper, we evaluate state-of-the-art
LLM-based text embedders in four PCR benchmark datasets and show that they
outperform BM25 and supervised transformer-based models.
Authors' comments: Accepted in Recent Advancements in Natural Language Processing (RANLP
2025) conference