Victor Sechaud, Patrice Abry, Laurent Jacques, Julián Tachella
In recent years, deep neural networks have emerged as a solution for inverse
imaging problems. These networks are generally trained using pairs of images:
one degraded and the other of high quality, the latter being called 'ground
truth'. However, in medical and scientific imaging, the lack of fully sampled
data limits supervised learning. Recent advances have made it possible to
reconstruct images from measurement data alone, eliminating the need for
references. However, these methods remain limited to linear problems, excluding
non-linear problems such as phase retrieval. We propose a self-supervised
method that overcomes this limitation in the case of phase retrieval by using
the natural invariance of images to translations.
Authors' comments: in French language. GRETSI, Aug 2025, Strasboug, France
Jungsoo Lee, Janghoon Cho, Hyojin Park, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi
Despite their consistent performance improvements, cross-modal retrieval
models (e.g., CLIP) show degraded performances with retrieving keys composed of
fused image-text modality (e.g., Wikipedia pages with both images and text). To
address this critical challenge, multimodal retrieval has been recently
explored to develop a unified single retrieval model capable of retrieving keys
across diverse modality combinations. A common approach involves constructing
new composed sets of image-text triplets (e.g., retrieving a pair of image and
text given a query image). However, such an approach requires careful curation
to ensure the dataset quality and fails to generalize to unseen modality
combinations. To overcome these limitations, this paper proposes Generalized
Contrastive Learning (GCL), a novel loss formulation that improves multimodal
retrieval performance without the burdensome need for new dataset curation.
Specifically, GCL operates by enforcing contrastive learning across all
modalities within a mini-batch, utilizing existing image-caption paired
datasets to learn a unified representation space. We demonstrate the
effectiveness of GCL by showing consistent performance improvements on
off-the-shelf multimodal retrieval models (e.g., VISTA, CLIP, and TinyCLIP)
using the M-BEIR, MMEB, and CoVR benchmarks.
Authors' comments: Accepted to NeurIPS 2025
Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C. Ho, Carl Yang, Dong Yu
GUI agents powered by vision-language models (VLMs) show promise in
automating complex digital tasks. However, their effectiveness in real-world
applications is often limited by scarce training data and the inherent
complexity of these tasks, which frequently require long-tailed knowledge
covering rare, unseen scenarios. We propose RAG-GUI , a lightweight VLM that
leverages web tutorials at inference time. RAG-GUI is first warm-started via
supervised finetuning (SFT) and further refined through self-guided rejection
sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as
a generic plug-in that enhances any VLM-based agent. Evaluated across three
distinct tasks, it consistently outperforms baseline agents and surpasses other
inference baselines by 2.6% to 13.3% across two model sizes, demonstrating
strong generalization and practical plug-and-play capabilities in real-world
scenarios.
Authors' comments: Accepted to EMNLP 2025 (Main Conference)
Yaxiong Wu, Jianyuan Bo, Yongyue Zhang, Sheng Liang, Yong Liu
Graph-based retrieval-augmented generation (RAG) enriches large language
models (LLMs) with external knowledge for long-context understanding and
multi-hop reasoning, but existing methods face a granularity dilemma:
fine-grained entity-level graphs incur high token costs and lose context, while
coarse document-level graphs fail to capture nuanced relations. We introduce
QCG-RAG, a query-centric graph RAG framework that enables query-granular
indexing and multi-hop chunk retrieval. Our query-centric approach leverages
Doc2Query and Doc2Query{-}{-} to construct query-centric graphs with
controllable granularity, improving graph quality and interpretability. A
tailored multi-hop retrieval mechanism then selects relevant chunks via the
generated queries. Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG
consistently outperforms prior chunk-based and graph-based RAG methods in
question answering accuracy, establishing a new paradigm for multi-hop
reasoning.
Authors' comments: 25 pages, 6 figures, 1 table
Renxiang Wang, Li Zhang
Certain strong LLMs have shown promise for zero-shot formal planning by
generating planning languages like PDDL. Yet, performance of most open-source
models under 50B parameters has been reported to be close to zero due to the
low-resource nature of these languages. We significantly improve their
performance via a series of lightweight pipelines that integrates documentation
retrieval with modular code generation and error refinement. With models like
Llama-4-Maverick, our best pipeline improves plan correctness from 0\% to over
80\% on the common BlocksWorld domain. However, while syntactic errors are
substantially reduced, semantic errors persist in more challenging domains,
revealing fundamental limitations in current models' reasoning
capabilities.\footnote{Our code and data can be found at
https://github.com/Nangxxxxx/PDDL-RAG
Authors' comments: 12 pages, 14 figures, 1 table
Seunghan Yang, Juntae Lee, Jihwan Bang, Kyuhong Shim, Minsoo Kim, Simyung Chang
Effective conversational search demands a deep understanding of user intent
across multiple dialogue turns. Users frequently use abbreviations and shift
topics in the middle of conversations, posing challenges for conventional
retrievers. While query rewriting techniques improve clarity, they often incur
significant computational cost due to additional autoregressive steps.
Moreover, although LLM-based retrievers demonstrate strong performance, they
are not explicitly optimized to track user intent in multi-turn settings, often
failing under topic drift or contextual ambiguity. To address these
limitations, we propose ContextualRetriever, a novel LLM-based retriever that
directly incorporates conversational context into the retrieval process. Our
approach introduces: (1) a context-aware embedding mechanism that highlights
the current query within the dialogue history; (2) intent-guided supervision
based on high-quality rewritten queries; and (3) a training strategy that
preserves the generative capabilities of the base LLM. Extensive evaluations
across multiple conversational search benchmarks demonstrate that
ContextualRetriever significantly outperforms existing methods while incurring
no additional inference overhead.
Authors' comments: EMNLP 2025 main conference
Zeinab Sadat Taghavi, Ali Modarressi, Yunpu Ma, Hinrich Schütze
Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques -- like prompting or multi-hop retrieval -- that can help resolve complexity. In contrast, we present Impliret, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving "two days ago"), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 14.91%. We also test whether long-context models can overcome this limitation. But even with a short context of only thirty documents, including the positive document, GPT-o4-mini scores only 55.54%, showing that document-side reasoning remains a challenge. Our codes are available at github.com/ZeinabTaghavi/IMPLIRET.
Changjiang Zhou, Ruqing Zhang, Jiafeng Guo, Yu-An Liu, Fan Zhang, Ganyuan Luo, Xueqi Cheng
Formulating information retrieval as a variant of generative modeling,
specifically using autoregressive models to generate relevant identifiers for a
given query, has recently attracted considerable attention. However, its
application to personalized sticker retrieval remains largely unexplored and
presents unique challenges: existing relevance-based generative retrieval
methods typically lack personalization, leading to a mismatch between diverse
user expectations and the retrieved results. To address this gap, we propose
PEARL, a novel generative framework for personalized sticker retrieval, and
make two key contributions: (i) To encode user-specific sticker preferences, we
design a representation learning model to learn discriminative user
representations. It is trained on three prediction tasks that leverage personal
information and click history; and (ii) To generate stickers aligned with a
user's query intent, we propose a novel intent-aware learning objective that
prioritizes stickers associated with higher-ranked intents. Empirical results
from both offline evaluations and online tests demonstrate that PEARL
significantly outperforms state-of-the-art methods.
Authors' comments: Findings of EMNLP2025
Stanislas Ducotterd, Zhiyuan Hu, Michael Unser, Jonathan Dong
Phase retrieval seeks to recover a complex signal from amplitude-only measurements, a challenging nonlinear inverse problem. Current theory and algorithms often ignore signal priors. By contrast, we evaluate here a variety of image priors in the context of severe undersampling with structured random Fourier measurements. Our results show that those priors significantly improve reconstruction, allowing accurate reconstruction even below the weak recovery threshold.
Letian Zhang, Guanghao Meng, Xudong Ren, Yiming Wang, Shu-Tao Xia
Retrieval-augmented Generation (RAG) is a prevalent approach for domain-specific LLMs, yet it is often plagued by "Retrieval Hallucinations"--a phenomenon where fine-tuned models fail to recognize and act upon poor-quality retrieved documents, thus undermining performance. To address this, we propose the Adversarial Collaboration RAG (AC-RAG) framework. AC-RAG employs two heterogeneous agents: a generalist Detector that identifies knowledge gaps, and a domain-specialized Resolver that provides precise solutions. Guided by a moderator, these agents engage in an adversarial collaboration, where the Detector's persistent questioning challenges the Resolver's expertise. This dynamic process allows for iterative problem dissection and refined knowledge retrieval. Extensive experiments show that AC-RAG significantly improves retrieval accuracy and outperforms state-of-the-art RAG methods across various vertical domains.
Ying Li, Mengyu Wang, Miguel de Carvalho, Sotirios Sabanis, Tiejun Ma
Financial disclosures such as 10-K filings present challenging retrieval problems due to their length, regulatory section hierarchy, and domain-specific language, which standard retrieval-augmented generation (RAG) models underuse. We introduce FinGEAR (Financial Mapping-Guided Enhanced Answer Retrieval), a retrieval framework tailored to financial documents. FinGEAR combines a finance lexicon for Item-level guidance (FLAM), dual hierarchical indices for within-Item search (Summary Tree and Question Tree), and a two-stage cross-encoder reranker. This design aligns retrieval with disclosure structure and terminology, enabling fine-grained, query-aware context selection. Evaluated on full 10-Ks with queries aligned to the FinQA dataset, FinGEAR delivers consistent gains in precision, recall, F1, and relevancy, improving F1 by up to 56.7% over flat RAG, 12.5% over graph-based RAGs, and 217.6% over prior tree-based systems, while also increasing downstream answer accuracy with a fixed reader. By jointly modeling section hierarchy and domain lexicon signals, FinGEAR improves retrieval fidelity and provides a practical foundation for high-stakes financial analysis.
Jiahao Zhang, Wenzhe Yin, Shujian Yu
Effective cross-modal retrieval requires robust alignment of heterogeneous data types. Most existing methods focus on bi-modal retrieval tasks and rely on distributional alignment techniques such as Kullback-Leibler divergence, Maximum Mean Discrepancy, and correlation alignment. However, these methods often suffer from critical limitations, including numerical instability, sensitivity to hyperparameters, and their inability to capture the full structure of the underlying distributions. In this paper, we introduce the Cauchy-Schwarz (CS) divergence, a hyperparameter-free measure that improves both training stability and retrieval performance. We further propose a novel Generalized CS (GCS) divergence inspired by Hölder's inequality. This extension enables direct alignment of three or more modalities within a unified mathematical framework through a bidirectional circular comparison scheme, eliminating the need for exhaustive pairwise comparisons. Extensive experiments on six benchmark datasets demonstrate the effectiveness of our method in both bi-modal and tri-modal retrieval tasks. The code of our CS/GCS divergence is publicly available at https://github.com/JiahaoZhang666/CSD.
Authors' comments: Accepted by ACMMM-25
Chao Huang, Fengran Mo, Yufeng Chen, Changhao Guan, Zhenrui Yue, Xinyu Wang, Jinan Xu, Kaiyu Huang
Multilingual dense retrieval aims to retrieve relevant documents across
different languages based on a unified retriever model. The challenge lies in
aligning representations of different languages in a shared vector space. The
common practice is to fine-tune the dense retriever via contrastive learning,
whose effectiveness highly relies on the quality of the negative sample and the
efficacy of mini-batch data. Different from the existing studies that focus on
developing sophisticated model architecture, we propose a method to boost data
utilization for multilingual dense retrieval by obtaining high-quality hard
negative samples and effective mini-batch data. The extensive experimental
results on a multilingual retrieval benchmark, MIRACL, with 16 languages
demonstrate the effectiveness of our method by outperforming several existing
strong baselines.
Authors' comments: Accepted by EMNLP 2025 (main)
Jihyun Moon, Charmgil Hong
Accurate and early diagnosis of malignant melanoma is critical for improving
patient outcomes. While convolutional neural networks (CNNs) have shown promise
in dermoscopic image analysis, they often neglect clinical metadata and require
extensive preprocessing. Vision-language models (VLMs) offer a multimodal
alternative but struggle to capture clinical specificity when trained on
general-domain data. To address this, we propose a retrieval-augmented VLM
framework that incorporates semantically similar patient cases into the
diagnostic prompt. Our method enables informed predictions without fine-tuning
and significantly improves classification accuracy and error correction over
conventional baselines. These results demonstrate that retrieval-augmented
prompting provides a robust strategy for clinical decision support.
Authors' comments: Medical Image Computing and Computer-Assisted Intervention (MICCAI)
ISIC Skin Image Analysis Workshop (MICCAI ISIC) 2025; 10 pages
Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
Mpoki Mwaisela, Peterson Yuhala, Pascal Felber, Valerio Schiavoni
Private information retrieval (PIR) is a cryptographic primitive that allows a client to securely query one or multiple servers without revealing their specific interests. In spite of their strong security guarantees, current PIR constructions are computationally costly. Specifically, most PIR implementations are memory-bound due to the need to scan extensive databases (in the order of GB), making them inherently constrained by the limited memory bandwidth in traditional processor-centric computing architectures.Processing-in-memory (PIM) is an emerging computing paradigm that augments memory with compute capabilities, addressing the memory bandwidth bottleneck while simultaneously providing extensive parallelism.Recent research has demonstrated PIM's potential to significantly improve performance across a range of data-intensive workloads, including graph processing, genome analysis, and machine learning. In this work, we propose the first PIM-based architecture for multi-server PIR. We discuss the algorithmic foundations of the latter and show how its operations align with the core strengths of PIM architectures: extensive parallelism and high memory bandwidth. Based on this observation, we design and implement IM-PIR, a PIM-based multi-server PIR approach on top of UPMEM PIM, the first openly commercialized PIM architecture. Our evaluation demonstrates that a PIM-based multi-server PIR implementation significantly improves query throughput by more than 3.7x when compared to a standard CPU-based PIR approach.
Tingwei Zhang, Fnu Suya, Rishi Jha, Collin Zhang, Vitaly Shmatikov
Hubness is a phenomenon in high-dimensional vector spaces where a point from the natural distribution is unusually close to many other points. This is a well-known problem in information retrieval that causes some items to accidentally (and incorrectly) appear relevant to many queries. In this paper, we investigate how attackers can exploit hubness to turn any image or audio input in a multi-modal retrieval system into an adversarial hub. Adversarial hubs can be used to inject universal adversarial content (e.g., spam) that will be retrieved in response to thousands of different queries, and also for targeted attacks on queries related to specific, attacker-chosen concepts. We present a method for creating adversarial hubs and evaluate the resulting hubs on benchmark multi-modal retrieval datasets and an image-to-image retrieval system implemented by Pinecone, a popular vector database. For example, in text-caption-to-image retrieval, a single adversarial hub, generated using 100 random queries, is retrieved as the top-1 most relevant image for more than 21,000 out of 25,000 test queries (by contrast, the most common natural hub is the top-1 response to only 102 queries), demonstrating the strong generalization capabilities of adversarial hubs. We also investigate whether techniques for mitigating natural hubness can also mitigate adversarial hubs, and show that they are not effective against hubs that target queries related to specific concepts.
Songjiang Lai, Tsun-Hin Cheung, Ka-Chun Fung, Kaiwen Xue, Kwan-Ho Lin, Yan-Ming Choi, Vincent Ng, Kin-Man Lam
In this paper, we introduce Technical-Embeddings, a novel framework designed to optimize semantic retrieval in technical documentation, with applications in both hardware and software development. Our approach addresses the challenges of understanding and retrieving complex technical content by leveraging the capabilities of Large Language Models (LLMs). First, we enhance user queries by generating expanded representations that better capture user intent and improve dataset diversity, thereby enriching the fine-tuning process for embedding models. Second, we apply summary extraction techniques to encode essential contextual information, refining the representation of technical documents. To further enhance retrieval performance, we fine-tune a bi-encoder BERT model using soft prompting, incorporating separate learning parameters for queries and document context to capture fine-grained semantic nuances. We evaluate our approach on two public datasets, RAG-EDA and Rust-Docs-QA, demonstrating that Technical-Embeddings significantly outperforms baseline models in both precision and recall. Our findings highlight the effectiveness of integrating query expansion and contextual summarization to enhance information access and comprehension in technical domains. This work advances the state of Retrieval-Augmented Generation (RAG) systems, offering new avenues for efficient and accurate technical document retrieval in engineering and product development workflows.
Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Shiji Song, Gao Huang
Vision-language retrieval is an important multi-modal learning topic, where the goal is to retrieve the most relevant visual candidate for a given text query. Recently, pre-trained models, e.g., CLIP, show great potential on retrieval tasks. However, as pre-trained models are scaling up, fully fine-tuning them on donwstream retrieval datasets has a high risk of overfitting. Moreover, in practice, it would be costly to train and store a large model for each task. To overcome the above issues, we present a novel Cross-Modal Adapter for parameter-efficient transfer learning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Secondly, it allows encoder-level implicit cross-modal interactions between vision and language encoders. Although surprisingly simple, our approach has three notable benefits: (1) reduces the vast majority of fine-tuned parameters, (2) saves training time, and (3) allows all the pre-trained parameters to be fixed, enabling the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, our approach outperforms adapter-based methods on image-text retrieval datasets (MSCOCO, Flickr30K) and video-text retrieval datasets (MSR-VTT, DiDeMo, and ActivityNet).
Authors' comments: The accepted manuscript by Pattern Recognition 25 Journal. The published journal article is available at: https://doi.org/10.1016/j.patcog.2024.111144
Rohan Phanse, Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan
Retrieval-augmented systems are typically evaluated in settings where
information required to answer the query can be found within a single source or
the answer is short-form or factoid-based. However, many real-world
applications demand the ability to integrate and summarize information
scattered across multiple sources, where no single source is sufficient to
respond to the user's question. In such settings, the retrieval component of a
RAG pipeline must recognize a variety of relevance signals, and the generation
component must connect and synthesize information across multiple sources. We
present a scalable framework for constructing evaluation benchmarks that
challenge RAG systems to integrate information across distinct sources and
generate long-form responses. Using our framework, we build two new benchmarks
on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing
narrative synthesis and summarization tasks, respectively, that require
retrieval from large collections. Our extensive experiments with various RAG
pipelines -- including sparse and dense retrievers combined with frontier LLMs
-- reveal that generation quality is highly dependent on retrieval
effectiveness, which varies greatly by task. While multi-source synthesis
proves challenging even in an oracle retrieval setting, we find that reasoning
models significantly outperform standard LLMs at this distinct step.
Authors' comments: COLM 2025; this article supersedes the preprint: arXiv:2309.08960