Daniil Zelezetsky, Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov
Offline reinforcement learning (RL) often deals with suboptimal data when
collecting large expert datasets is unavailable or impractical. This limitation
makes it difficult for agents to generalize and achieve high performance, as
they must learn primarily from imperfect or inconsistent trajectories. A
central challenge is therefore how to best leverage scarce expert
demonstrations alongside abundant but lower-quality data. We demonstrate that
incorporating even a tiny amount of expert experience can substantially improve
RL agent performance. We introduce Re:Frame (Retrieving Experience From
Associative Memory), a plug-in module that augments a standard offline RL
policy (e.g., Decision Transformer) with a small external Associative Memory
Buffer (AMB) populated by expert trajectories drawn from a separate dataset.
During training on low-quality data, the policy learns to retrieve expert data
from the Associative Memory Buffer (AMB) via content-based associations and
integrate them into decision-making; the same AMB is queried at evaluation.
This requires no environment interaction and no modifications to the backbone
architecture. On D4RL MuJoCo tasks, using as few as 60 expert trajectories
(0.1% of a 6000-trajectory dataset), Re:Frame consistently improves over a
strong Decision Transformer baseline in three of four settings, with gains up
to +10.7 normalized points. These results show that Re:Frame offers a simple
and data-efficient way to inject scarce expert knowledge and substantially
improve offline RL from low-quality datasets.
Authors' comments: 11 pages, 3 figures
Yosef Dayani, Omer Benishu, Sagie Benaim
Text-to-3D generation approaches have advanced significantly by leveraging
pretrained 2D diffusion priors, producing high-quality and 3D-consistent
outputs. However, they often fail to produce out-of-domain (OOD) or rare
concepts, yielding inconsistent or inaccurate results. To this end, we propose
MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images
from a large in-the-wild 2D database and then conditions a multiview diffusion
model on these images to synthesize consistent and accurate multiview outputs.
Training such a retrieval-conditioned model is achieved via a novel hybrid
strategy bridging structured multiview data and diverse 2D image collections.
This involves training on multiview data using augmented conditioning views
that simulate retrieval variance for view-specific reconstruction, alongside
training on sets of retrieved real-world 2D images using a distinctive held-out
view prediction objective: the model predicts the held-out view from the other
views to infer 3D consistency from 2D data. To facilitate a rigorous OOD
evaluation, we introduce a new collection of challenging OOD prompts.
Experiments against state-of-the-art text-to-3D, image-to-3D, and
personalization baselines show that our approach significantly improves 3D
consistency, photorealism, and text adherence for OOD/rare concepts, while
maintaining competitive performance on standard benchmarks.
Authors' comments: Project page: https://yosefdayani.github.io/MV-RAG
Minbiao Han, Seyed A. Esmaeili, Michael Albert, Haifeng Xu
We study the problem of data selling for Retrieval Augmented Generation (RAG) tasks in Generative AI applications. We model each buyer's valuation of a dataset with a natural coverage-based valuation function that increases with the inclusion of more relevant data points that would enhance responses to anticipated queries. Motivated by issues such as data control and prior-free revenue maximization, we focus on the scenario where each data point can be allocated to only one buyer. We show that the problem of welfare maximization in this setting is NP-hard even with two bidders, but design a polynomial-time $(1-1/e)$ approximation algorithm for any number of bidders. Unfortunately, however, this efficient allocation algorithm fails to be incentive compatible. The crux of our approach is a carefully tailored post-processing step called \emph{data burning} which retains the $(1-1/e)$ approximation factor but achieves incentive compatibility. Our thorough experiments on synthetic and real-world image and text datasets demonstrate the practical effectiveness of our algorithm compared to popular baseline algorithms for combinatorial auctions.
William Zhang, Yiwen Zhu, Yunlei Lu, Mathieu Demarne, Wenjing Wang, Kai Deng, Nutan Sahoo, Katherine Lin et al.
Recent advances in Large Language Models (LLMs) have driven the adoption of
copilots in complex technical scenarios, underscoring the growing need for
specialized information retrieval solutions. In this paper, we introduce FLAIR,
a lightweight, feedback learning framework that adapts copilot systems'
retrieval strategies by integrating domain-specific expert feedback. FLAIR
operates in two stages: an offline phase obtains indicators from (1) user
feedback and (2) questions synthesized from documentation, storing these
indicators in a decentralized manner. An online phase then employs a two-track
ranking mechanism to combine raw similarity scores with the collected
indicators. This iterative setup refines retrieval performance for any query.
Extensive real-world evaluations of FLAIR demonstrate significant performance
gains on both previously seen and unseen queries, surpassing state-of-the-art
approaches. The system has been successfully integrated into Copilot DECO,
serving thousands of users at Microsoft, demonstrating its scalability and
effectiveness in operational environments.
Authors' comments: Accepted to CIKM2025
Kemal Altwlkany, Elmedin Selmanovic, Sead Delalic
Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost completely immune to temporal misalignments and achieve state-of-the-art results in cases of other audio distortions such as noise, reverb or extreme temporal stretching. Code and models are made publicly available and the results are easy to reproduce as we train and test using popular and freely available datasets of different sizes.
Ryan Sze-Yin Chan, Federico Nanni, Tomas Lazauskas, Rosie Wood, Penelope Yong, Lionel Tarassenko, Mark Girolami, James Geddes et al.
This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.
Ruisong Han, Zongbo Han, Jiahao Zhang, Mingyue Cheng, Changqing Zhang
Out-of-Distribution (OOD) detection is crucial for the reliable deployment of machine learning models in-the-wild, enabling accurate identification of test samples that differ from the training data distribution. Existing methods rely on auxiliary outlier samples or in-distribution (ID) data to generate outlier information for training, but due to limited outliers and their mismatch with real test OOD samples, they often fail to provide sufficient semantic supervision, leading to suboptimal performance. To address this, we propose a novel OOD detection method called Retrieval-Augmented Prompt (RAP). RAP augments a pre-trained vision-language model's prompts by retrieving external knowledge, offering enhanced semantic supervision for OOD detection. During training, RAP retrieves descriptive words for outliers based on joint similarity with external textual knowledge and uses them to augment the model's OOD prompts. During testing, RAP dynamically updates OOD prompts in real-time based on the encountered OOD samples, enabling the model to rapidly adapt to the test environment. Our extensive experiments demonstrate that RAP achieves state-of-the-art performance on large-scale OOD detection benchmarks. For example, in 1-shot OOD detection on the ImageNet-1k dataset, RAP reduces the average FPR95 by 7.05% and improves the AUROC by 1.71% compared to previous methods. Additionally, comprehensive ablation studies validate the effectiveness of each module and the underlying motivations of our approach.
Shreya Meel, Mohamed Nomeir, Pasan Dissanayake, Sanghamitra Dutta, Sennur Ulukus
Transparency and explainability are two important aspects to be considered
when employing black-box machine learning models in high-stake applications.
Providing counterfactual explanations is one way of catering this requirement.
However, this also poses a threat to the privacy of the institution that is
providing the explanation, as well as the user who is requesting it. In this
work, we are primarily concerned with the user's privacy who wants to retrieve
a counterfactual instance, without revealing their feature vector to the
institution. Our framework retrieves the exact nearest neighbor counterfactual
explanation from a database of accepted points while achieving perfect,
information-theoretic, privacy for the user. First, we introduce the problem of
private counterfactual retrieval (PCR) and propose a baseline PCR scheme that
keeps the user's feature vector information-theoretically private from the
institution. Building on this, we propose two other schemes that reduce the
amount of information leaked about the institution database to the user,
compared to the baseline scheme. Second, we relax the assumption of mutability
of all features, and consider the setting of immutable PCR (I-PCR). Here, the
user retrieves the nearest counterfactual without altering a private subset of
their features, which constitutes the immutable set, while keeping their
feature vector and immutable set private from the institution. For this, we
propose two schemes that preserve the user's privacy information-theoretically,
but ensure varying degrees of database privacy. Third, we extend our PCR and
I-PCR schemes to incorporate user's preference on transforming their
attributes, so that a more actionable explanation can be received. Finally, we
present numerical results to support our theoretical findings, and compare the
database leakage of the proposed schemes.
Authors' comments: arXiv admin note: text overlap with arXiv:2410.13812,
arXiv:2411.10429
Kisu Yang, Yoonna Jang, Hwanseok Jang, Kenneth Choi, Isabelle Augenstein, Heuiseok Lim
Lowering the numerical precision of model parameters and computations is
widely adopted to improve the efficiency of retrieval systems. However, when
computing relevance scores between the query and documents in low-precision, we
observe spurious ties due to the reduced granularity. This introduces high
variability in the results based on tie resolution, making the evaluation less
reliable. To address this, we propose a more robust retrieval evaluation
protocol designed to reduce score variation. It consists of: (1) High-Precision
Scoring (HPS), which upcasts the final scoring step to higher precision to
resolve tied candidates with minimal computational cost; and (2) Tie-aware
Retrieval Metrics (TRM), which report expected scores, range, and bias to
quantify order uncertainty of tied candidates. Our experiments test multiple
models with three scoring functions on two retrieval datasets to demonstrate
that HPS dramatically reduces tie-induced instability, and TRM accurately
recovers expected metric values. This combination enables a more consistent and
reliable evaluation system for lower-precision retrievals.
Authors' comments: 11 pages, 5 figures, submitted to ARR
Yunming Xiao, Peizhi Liu, Ruijie Yu, Chenkai Weng, Matteo Varvello, Aleksandar Kuzmanovic
There has been a growing interest in Internet user privacy, demonstrated by the popularity of privacy-preserving products such as Telegram and Brave, and the widespread adoption of HTTPS. The Domain Name System (DNS) is a key component of Internet-based communication and its privacy has been neglected for years. Recently, DNS over HTTPS (DoH) has improved the situation by fixing the issue of in-path middleboxes. Further progress has been made with proxy-based solutions such as Oblivious DoH (ODoH), which separate a user's identity from their DNS queries. However, these solutions rely on non-collusion assumptions between DNS resolvers and proxies -- an assumption difficult to guarantee in practice. To address this, we explore integrating single-server Private Information Retrieval (PIR) into DNS to enable encrypted query processing without relying on trust assumptions. However, applying PIR to DNS is challenging due to its hierarchical nature -- particularly, interactions with recursive resolvers can still leak information. Navigating performance and privacy trade-offs, we propose PDNS, a DNS extension leveraging single-server PIR to strengthen privacy guarantees. We have implemented a prototype of PDNS and compared its performance against state-of-the-art solutions via trace-driven experiments. The results show that PDNS achieves acceptable performance (2x faster than DoH over Tor with similar privacy guarantees) and strong privacy guarantees today, mainly at the cost of its scalability, which specialized hardware for PIR can address in the near future.
Damith Premasiri, Tharindu Ranasinghe, Ruslan Mitkov
In common law systems, legal professionals such as lawyers and judges rely on
precedents to build their arguments. As the volume of cases has grown massively
over time, effectively retrieving prior cases has become essential. Prior case
retrieval (PCR) is an information retrieval (IR) task that aims to
automatically identify the most relevant court cases for a specific query from
a large pool of potential candidates. While IR methods have seen several
paradigm shifts over the last few years, the vast majority of PCR methods
continue to rely on traditional IR methods, such as BM25. The state-of-the-art
deep learning IR methods have not been successful in PCR due to two key
challenges: i. Lengthy legal text limitation; when using the powerful
BERT-based transformer models, there is a limit of input text lengths, which
inevitably requires to shorten the input via truncation or division with a loss
of legal context information. ii. Lack of legal training data; due to data
privacy concerns, available PCR datasets are often limited in size, making it
difficult to train deep learning-based models effectively. In this research, we
address these challenges by leveraging LLM-based text embedders in PCR.
LLM-based embedders support longer input lengths, and since we use them in an
unsupervised manner, they do not require training data, addressing both
challenges simultaneously. In this paper, we evaluate state-of-the-art
LLM-based text embedders in four PCR benchmark datasets and show that they
outperform BM25 and supervised transformer-based models.
Authors' comments: Accepted in Recent Advancements in Natural Language Processing (RANLP
2025) conference
Eleni Kamateri, Renukswamy Chikkamath, Michail Salampasis, Linda Andersson, Markus Endres
Effective query formulation is a key challenge in long-document Information
Retrieval (IR). This challenge is particularly acute in domain-specific
contexts like patent retrieval, where documents are lengthy, linguistically
complex, and encompass multiple interrelated technical topics. In this work, we
present the application of recent extractive and abstractive summarization
methods for generating concise, purpose-specific summaries of patent documents.
We further assess the utility of these automatically generated summaries as
surrogate queries across three benchmark patent datasets and compare their
retrieval performance against conventional approaches that use entire patent
sections. Experimental results show that summarization-based queries
significantly improve prior-art retrieval effectiveness, highlighting their
potential as an efficient alternative to traditional query formulation
techniques.
Authors' comments: This version was submitted and accepted for publication at the 6th
Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech
2025), held in conjunction with SIGIR 2025. A revised and polished version,
incorporating reviewers' feedback, will follow
Pranav Kasela, Gabriella Pasi, Raffaele Perego
Academic Search is a search task aimed to manage and retrieve scientific
documents like journal articles and conference papers. Personalization in this
context meets individual researchers' needs by leveraging, through user
profiles, the user related information (e.g. documents authored by a
researcher), to improve search effectiveness and to reduce the information
overload. While citation graphs are a valuable means to support the outcome of
recommender systems, their use in personalized academic search (with, e.g.
nodes as papers and edges as citations) is still under-explored.
Existing personalized models for academic search often struggle to fully
capture users' academic interests. To address this, we propose a two-step
approach: first, training a neural language model for retrieval, then
converting the academic graph into a knowledge graph and embedding it into a
shared semantic space with the language model using translational embedding
techniques. This allows user models to capture both explicit relationships and
hidden structures in citation graphs and paper content. We evaluate our
approach in four academic search domains, outperforming traditional graph-based
and personalized models in three out of four, with up to a 10\% improvement in
MAP@100 over the second-best model. This highlights the potential of knowledge
graph-based user models to enhance retrieval effectiveness.
Authors' comments: Accepted in Information Systems. [17 May 2025]
https://doi.org/10.1016/j.is.2025.102574
Isaac Shi, Zeyuan Li, Fan Liu, Wenli Wang, Lewei He, Yang Yang, Tianyu Shi
We introduce the THOR (Transformer Heuristics for On-Demand Retrieval) Module, designed and implemented by eSapiens, a secure, scalable engine that transforms natural-language questions into verified, read-only SQL analytics for enterprise databases. The Text-to-SQL module follows a decoupled orchestration/execution architecture: a Supervisor Agent routes queries, Schema Retrieval dynamically injects table and column metadata, and a SQL Generation Agent emits single-statement SELECT queries protected by a read-only guardrail. An integrated Self-Correction & Rating loop captures empty results, execution errors, or low-quality outputs and triggers up to five LLM-driven regeneration attempts. Finally, a Result Interpretation Agent produces concise, human-readable insights and hands raw rows to the Insight & Intelligence engine for visualization or forecasting. Smoke tests across finance, sales, and operations scenarios demonstrate reliable ad-hoc querying and automated periodic reporting. By embedding schema awareness, fault-tolerant execution, and compliance guardrails, the THOR Module empowers non-technical users to access live data with zero-SQL simplicity and enterprise-grade safety.
Alexander Frummet, Emanuel Slany, Jonas Amling, Moritz Lang, Stephan Scheele
Conversational agents such as Microsoft Copilot and Google Gemini assist
users with complex search tasks but often generate misleading or fabricated
references. This undermines trust, particularly in high-stakes domains such as
medicine and finance. Explainable information retrieval (XIR) aims to address
this by making search results more transparent and interpretable. While most
XIR research is domain-agnostic, this paper focuses on auditing -- a critical
yet underexplored area. We argue that XIR systems can support auditors in
completing their complex task. We outline key challenges and future research
directions to advance XIR in this domain.
Authors' comments: Extended abstract accepted at the Workshop on Explainability in
Information Retrieval (WExIR), co-located with SIGIR 2025
Paul J. L. Ammann, Jonas Golde, Alan Akbik
Grounding large language models (LLMs) in verifiable external sources is a
well-established strategy for generating reliable answers. Retrieval-augmented
generation (RAG) is one such approach, particularly effective for tasks like
question answering: it retrieves passages that are semantically related to the
question and then conditions the model on this evidence. However, multi-hop
questions, such as "Which company among NVIDIA, Apple, and Google made the
biggest profit in 2023?," challenge RAG because relevant facts are often
distributed across multiple documents rather than co-occurring in one source,
making it difficult for standard RAG to retrieve sufficient information. To
address this, we propose a RAG pipeline that incorporates question
decomposition: (i) an LLM decomposes the original query into sub-questions,
(ii) passages are retrieved for each sub-question, and (iii) the merged
candidate pool is reranked to improve the coverage and precision of the
retrieved evidence. We show that question decomposition effectively assembles
complementary documents, while reranking reduces noise and promotes the most
relevant passages before answer generation. Although reranking itself is
standard, we show that pairing an off-the-shelf cross-encoder reranker with
LLM-driven question decomposition bridges the retrieval gap on multi-hop
questions and provides a practical, drop-in enhancement, without any extra
training or specialized indexing. We evaluate our approach on the MultiHop-RAG
and HotpotQA, showing gains in retrieval (MRR@10: +36.7%) and answer accuracy
(F1: +11.6%) over standard RAG baselines.
Authors' comments: Accepted to ACL SRW 2025. 9 Pages, 2 Figures, 4 Tables
Hanzhong Liang, Jinghao Shi, Xiang Shen, Zixuan Wang, Vera Wen, Ardalan Mehrani, Zhiqian Chen, Yifan Wu et al.
Video understanding plays a fundamental role for content moderation on short video platforms, enabling the detection of inappropriate content. While classification remains the dominant approach for content moderation, it often struggles in scenarios requiring rapid and cost-efficient responses, such as trend adaptation and urgent escalations. To address this issue, we introduce an Embedding-Based Retrieval (EBR) method designed to complement traditional classification approaches. We first leverage a Supervised Contrastive Learning (SCL) framework to train a suite of foundation embedding models, including both single-modal and multi-modal architectures. Our models demonstrate superior performance over established contrastive learning methods such as CLIP and MoCo. Building on these embedding models, we design and implement the embedding-based retrieval system that integrates embedding generation and video retrieval to enable efficient and effective trend handling. Comprehensive offline experiments on 25 diverse emerging trends show that EBR improves ROC-AUC from 0.85 to 0.99 and PR-AUC from 0.35 to 0.95. Further online experiments reveal that EBR increases action rates by 10.32% and reduces operational costs by over 80%, while also enhancing interpretability and flexibility compared to classification-based solutions.
Authors' comments: Camera ready for SIGIR 2025
Xuan Zhang, Ziyan Jiang, Rui Meng, Yifei Leng, Zhenbang Xiao, Zora Zhiruo Wang, Yanyi Shang, Dehan Kong
Trajectory data, capturing human actions and environmental states across
various modalities, holds significant potential for enhancing AI agent
capabilities, particularly in GUI environments. However, how to model the
representation of trajectory-level data presents a significant challenge that
has not been systematically addressed amid explosive trajectory data growth. In
this work, we introduce Multimodal Trajectory Retrieval, bridging the gap
between universal retrieval and agent-centric trajectory modeling. We construct
the Unified Agent Trajectory Dataset (UATD) from annotated demonstrations and
states across diverse real-world scenarios. Based on this, we present
GAE-Bench, a benchmark containing a large number of trajectory-based retrieval
pairs. In addition, we propose GAE-Retriever, a multimodal retrieval framework
that adopts vision-language models and incorporates optimized contrastive
learning through a token selection and the GradCache mechanism. Comprehensive
evaluations across multiple datasets show that GAE-Retriever consistently
outperforms strong baselines in retrieval recall, highlighting its
effectiveness in advancing multimodal trajectory retrieval.
Authors' comments: 18 pages, 3 figures, accepted by Workshop on Computer-use Agents @
ICML 2025
Zunran Wang, Zheng Shenpeng, Wang Shenglan, Minghui Zhao, Zhonghua Li
Hybrid-based retrieval methods, which unify dense-vector and lexicon-based retrieval, have garnered considerable attention in the industry due to performance enhancement. However, despite their promising results, the application of these hybrid paradigms in Chinese retrieval contexts has remained largely underexplored. In this paper, we introduce HyReC, an innovative end-to-end optimization method tailored specifically for hybrid-based retrieval in Chinese. HyReC enhances performance by integrating the semantic union of terms into the representation model. Additionally, it features the Global-Local-Aware Encoder (GLAE) to promote consistent semantic sharing between lexicon-based and dense retrieval while minimizing the interference between them. To further refine alignment, we incorporate a Normalization Module (NM) that fosters mutual benefits between the retrieval approaches. Finally, we evaluate HyReC on the C-MTEB retrieval benchmark to demonstrate its effectiveness.
Masaki Watabe, Joe Sakamoto, Hideaki Yoshimura, Tomomi Nemoto, Kazunari Kaizu
Although the transport of intensity equation (TIE) can be used to reconstruct
the spatial phase variations produced by samples such as magnetic materials and
biological cells, the impact of complex refractive indices on quantitative
phase imaging remains unexplored. To overcome this difficulty, we provide
herein a more physically generalized TIE framework that enables the
reconstruction of spatial variations in both refractive-index fluctuations and
attenuation coefficients. We then demonstrate this method using bright-field
microscopy imaging. The results reveal robust performance in retrieving
heterogeneous optical structures within measurable parameter regions. Finally,
we analyze the symmetry of the attenuation reversal in the TIE framework, thus
revealing the invariant nature of the absorptive and scattering properties in
the samples of interest.
Authors' comments: 6 pages, 3 figures in main text; 14 pages, 9 figures in the
supplemental material