Kostadin Koroutchev, Jian Shen, Elka Koroutcheva, Manuel Cebrian
In this work, we suggest a parameterized statistical model (the gamma
distribution) for the frequency of word occurrences in long strings of English
text and use this model to build a corresponding thermodynamic picture by
constructing the partition function. We then use our partition function to
compute thermodynamic quantities such as the free energy and the specific heat.
In this approach, the parameters of the word frequency model vary from word to
word so that each word has a different corresponding thermodynamics and we
suggest that differences in the specific heat reflect differences in how the
words are used in language, differentiating keywords from common and function
words. Finally, we apply our thermodynamic picture to the problem of retrieval
of texts based on keywords and suggest some advantages over traditional
information retrieval methods.
Authors' comments: 12 pages, 7 figures
Pere Constans
An approximate textual retrieval algorithm for searching sources with high levels of defects is presented. It considers splitting the words in a query into two overlapping segments and subsequently building composite regular expressions from interlacing subsets of the segments. This procedure reduces the probability of missed occurrences due to source defects, yet diminishes the retrieval of irrelevant, non-contextual occurrences.
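The segment-and-interlace idea above can be illustrated with a minimal sketch. This is one possible reading of the abstract, not the author's algorithm: the function names `split_overlapping` and `build_pattern`, the `overlap` and `max_gap` parameters, and the choice of exactly two interlaced variants are all assumptions made for illustration.

```python
import re

def split_overlapping(word, overlap=1):
    """Return (head, tail) halves of `word` that share a few middle characters."""
    mid = len(word) // 2
    head = word[: mid + overlap]
    tail = word[max(mid - overlap, 0):]
    return head, tail

def build_pattern(query, overlap=1, max_gap=20):
    """Compile an alternation of two interlaced segment sequences (hypothetical scheme)."""
    segments = [split_overlapping(w, overlap) for w in query.split()]
    gap = ".{0,%d}?" % max_gap  # bounded gap keeps matches contextual
    variants = []
    for start in (0, 1):
        # Variant 0 takes head, tail, head, ...; variant 1 takes tail, head, tail, ...
        picked = [re.escape(pair[(start + i) % 2]) for i, pair in enumerate(segments)]
        variants.append(gap.join(picked))
    return re.compile("|".join("(?:%s)" % v for v in variants), re.IGNORECASE)

pattern = build_pattern("information retrieval")
hit = pattern.search("inf0rmation retrieval") is not None  # defect in the first word
```

In this sketch a single defect outside the shared middle characters damages only one segment of a word, so one of the two variants can still match, while the bounded gap requires the segments to occur close together.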
Michael J. Kurtz, Guenther Eichhorn, Alberto Accomazzi, Carolyn Grant, Edwin Henneken, Stephen S. Murray
Since it was first announced at ADASS 2, the Smithsonian/NASA Astrophysics
System Abstract Service (ADS) has played a central role in the information
seeking behavior of astronomers. Central to the ability of the ADS to act as a
search and discovery tool is its role as a metadata aggregator. Over the past 13
years the ADS has introduced many new techniques to facilitate information
retrieval, broadly defined. We discuss some of these developments, with
particular attention to how the ADS might interact with the virtual
observatory, and to the new myADS-arXiv customized open access virtual journal.
The ADS is at http://ads.harvard.edu
Authors' comments: Invited talk, to appear in ADASS XV Proceedings
Song Tang, Guangquan Jie, Henghui Ding, Yu-Gang Jiang
Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model's knowledge but demand up-to-date external information for accurate recognition. To support the study of NEST, we construct a NEST benchmark using an automated pipeline that generates news-related data samples for comprehensive evaluation. Additionally, we propose ROSE: Retrieval-Oriented Segmentation Enhancement, a plug-and-play framework designed to augment any MLLM-based segmentation model. ROSE comprises four key components. First, an Internet Retrieval-Augmented Generation module is introduced to employ user-provided multimodal inputs to retrieve real-time web information. Then, a Textual Prompt Enhancer enriches the model with up-to-date information and rich background knowledge, improving the model's perception ability for emerging entities. Furthermore, a Visual Prompt Enhancer is proposed to compensate for MLLMs' lack of exposure to novel entities by leveraging internet-sourced images. To maintain efficiency, a WebSense module is introduced to intelligently decide when to invoke retrieval mechanisms based on user input. Experimental results demonstrate that ROSE significantly boosts performance on the NEST benchmark, outperforming a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU.
Authors' comments: CVPR 2026 Findings, Project Page: https://henghuiding.com/ROSE/
Simon Popelier, Matthieu X. B. Sarazin, Maximilien Bohm, Mathieu Gierski, Hanna Mergui, Matthieu Ospici, Adrien Bernhardt
The Sales Comparison Approach (SCA) is one of the most popular approaches to real estate appraisal. Used as a reference in real estate expertise and as one of the major types of Automatic Valuation Models (AVM), it has recently gained popularity within machine learning methods. The performance of models able to use data represented as sets and graphs made it possible to adapt this methodology efficiently, yielding substantial results. SCA relies on taking past transactions (comparables) as references, selected according to their similarity with the target property's sale. In this study, we focus on the selection of these comparables for real estate appraisal. We demonstrate that the selection of comparables used in many state-of-the-art algorithms can be significantly improved by learning a selection policy instead of imposing it. Our method relies on a hybrid vector-geographical retrieval module capable of adapting to different datasets and optimized jointly with an estimation module. We further show that the use of carefully selected comparables makes it possible to build models that require fewer comparables and fewer parameters with performance close to state-of-the-art models. All our evaluations are made on five datasets which span areas in the United States, Brazil, and France.
Authors' comments: Accepted at NFMCP 2024 workshop (New Frontiers in Mining Complex Patterns), held in conjunction with ECML 2024
Tingting Tang, James Flemings, Yongqin Wang, Murali Annavaram
Retrieval-augmented generation (RAG) is a widely used framework for reducing hallucinations in large language models (LLMs) on domain-specific tasks by retrieving relevant documents from a database to support accurate responses. However, when the database contains sensitive corpora, such as medical records or legal documents, RAG poses serious privacy risks by potentially exposing private information through its outputs. Prior work has demonstrated that one can practically craft adversarial prompts that force an LLM to regurgitate the augmented contexts. A promising direction is to integrate differential privacy (DP), a privacy notion that offers strong formal guarantees, into RAG systems. However, naively applying DP mechanisms to existing systems often leads to significant utility degradation. Particularly for RAG systems, DP can reduce the usefulness of the augmented contexts, leading to an increased risk of hallucination from the LLMs. Motivated by these challenges, we present DP-KSA, a novel privacy-preserving RAG algorithm that integrates DP using the propose-test-release paradigm. DP-KSA follows from a key observation that most question-answering (QA) queries can be sufficiently answered with a few keywords. Hence, DP-KSA first obtains an ensemble of relevant contexts, each of which will be used to generate a response from an LLM. We utilize these responses to obtain the most frequent keywords in a differentially private manner. Lastly, the keywords are augmented into the prompt for the final output. This approach effectively compresses the semantic space while preserving both utility and privacy. We prove that DP-KSA provides formal DP guarantees on the generated output with respect to the RAG database. We evaluate DP-KSA on two QA benchmarks using three instruction-tuned LLMs, and our empirical results demonstrate that DP-KSA achieves a strong privacy-utility tradeoff.
Elias Jääsaari, Ville Hyvönen, Teemu Roos
Multi-vector representations generated by late interaction models, such as ColBERT, enable superior retrieval quality compared to single-vector representations in information retrieval applications. In multi-vector retrieval systems, both queries and documents are encoded using one embedding for each token, and similarity between queries and documents is measured by the MaxSim similarity measure. However, the improved recall of multi-vector retrieval comes at the expense of significantly increased latency. This necessitates designing efficient approximate nearest neighbor search (ANNS) algorithms for multi-vector search. In this work, we introduce LEMUR, a simple-yet-efficient framework for multi-vector similarity search. LEMUR consists of two consecutive problem reductions: We first formulate multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network. Second, we reduce inference under this model to single-vector similarity search in its latent space, which enables the use of existing single-vector ANNS methods for speeding up retrieval. In addition to performance evaluation on ColBERTv2 embeddings, we evaluate LEMUR on embeddings generated by modern multi-vector text models and multi-vector visual document retrieval models. LEMUR is an order of magnitude faster than earlier multi-vector similarity search methods.
Authors' comments: 17 pages
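The MaxSim measure mentioned in the abstract above can be sketched in a few lines. This is a minimal brute-force illustration of the commonly used definition (for each query token embedding, take the maximum dot product over the document token embeddings, then sum over query tokens); it is not the LEMUR method, and the function names are chosen here for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vecs, doc_vecs):
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def rank(query_vecs, docs):
    """Return document indices sorted by decreasing MaxSim score (exhaustive scan)."""
    scores = [maxsim(query_vecs, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])

query = [[1.0, 0.0], [0.0, 1.0]]
docs = [
    [[0.0, 1.0]],               # matches only the second query token
    [[1.0, 0.0], [0.5, 0.5]],   # matches both query tokens to some degree
]
order = rank(query, docs)
```

The exhaustive scan costs one dot product per (query token, document token) pair for every document, which is the latency problem that approximate methods aim to reduce.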
Dominik Stammbach, Kylie Zhang, Patty Liu, Nimra Nadeem, Lucia Zheng, Peter Henderson
AI tools are increasingly suggested as solutions to assist public agencies with heavy workloads. In public defense, where a constitutional right to counsel meets the complexities of law, overwhelming caseloads and constrained resources, practitioners face especially taxing conditions. Yet, there is little evidence of how AI could meaningfully support defenders' day-to-day work. In partnership with the New Jersey Office of the Public Defender, we develop the NJ BriefBank, a retrieval tool which surfaces relevant appellate briefs to streamline legal research and writing. We show that existing legal retrieval benchmarks fail to transfer to public defense search; however, adding domain knowledge improves retrieval quality. This includes query expansion with legal reasoning, domain-specific data and curated synthetic examples. To facilitate further research, we provide a taxonomy of realistic defender search queries and release a manually annotated public defense retrieval dataset. Together, our work offers starting points towards building practical, reliable retrieval AI tools for public defense, and towards more realistic legal retrieval benchmarks.
Max McKinnon
The ability of large language models (LLMs) to recall and retrieve
information from long contexts is critical for many real-world applications.
Prior work (Liu et al., 2023) reported that LLMs suffer significant drops in
retrieval accuracy for facts placed in the middle of large contexts, an effect
known as "Lost in the Middle" (LITM). We find the model Gemini 2.5 Flash can
answer needle-in-a-haystack questions with great accuracy regardless of
document position including when the document is nearly at the input context
limit. Our results suggest that the "Lost in the Middle" effect is not present
for simple factoid Q&A in Gemini 2.5 Flash, indicating substantial
improvements in long-context retrieval.
Authors' comments: 3 pages, 0 figures
Weijian Jian, Yajun Zhang, Dawei Liang, Chunyu Xie, Yixiao He, Dawei Leng, Yuhui Yin
The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. Firstly, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Secondly, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model's discriminative power but also improves its instruction-following capabilities. We further boost performance with a learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available at https://huggingface.co/qihoo360/RzenEmbed.
Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, Giorgos Tolias
The progress of composed image retrieval (CIR), a popular research direction
in image retrieval, where a combined visual and textual query is used, is held
back by the absence of high-quality training and evaluation data. We introduce
a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an
instance-level class definition. The goal is to retrieve images that contain
the same particular object as the visual query, presented under a variety of
modifications defined by textual queries. Its design and curation process keep
the dataset compact to facilitate future research, while maintaining its
challenge (comparable to retrieval among more than 40M random
distractors) through a semi-automated selection of hard negatives.
To overcome the challenge of obtaining clean, diverse, and suitable training
data, we leverage pre-trained vision-and-language models (VLMs) in a
training-free approach called BASIC. The method separately estimates
query-image-to-image and query-text-to-image similarities, performing late
fusion to upweight images that satisfy both queries, while down-weighting those
that exhibit high similarity with only one of the two. Each individual
similarity is further improved by a set of components that are simple and
intuitive. BASIC sets a new state of the art not only on i-CIR but also on existing CIR
datasets that follow a semantic-level class definition. Project page:
https://vrg.fel.cvut.cz/icir/.
Authors' comments: NeurIPS 2025
Ruibo Hou, Shiyu Teng, Jiaqing Liu, Shurong Chai, Yinhao Li, Lanfen Lin, Yen-Wei Chen
Multimodal deep learning has shown promise in depression detection by
integrating text, audio, and video signals. Recent work leverages sentiment
analysis to enhance emotional understanding, yet suffers from high
computational cost, domain mismatch, and static knowledge limitations. To
address these issues, we propose a novel Retrieval-Augmented Generation (RAG)
framework. Given a depression-related text, our method retrieves semantically
relevant emotional content from a sentiment dataset and uses a Large Language
Model (LLM) to generate an Emotion Prompt as an auxiliary modality. This prompt
enriches emotional representation and improves interpretability. Experiments on
the AVEC 2019 dataset show our approach achieves state-of-the-art performance
with CCC of 0.593 and MAE of 3.95, surpassing previous transfer learning and
multi-task learning baselines.
Authors' comments: Accepted in IEEE EMBC 2025
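The two metrics reported above, the concordance correlation coefficient (CCC) and the mean absolute error (MAE), can be computed as in the following minimal sketch. It uses the standard definitions with population (biased) variances; the function names and the toy values are illustrative only and are not taken from the paper.

```python
def mean(xs):
    return sum(xs) / len(xs)

def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])

def ccc(y_true, y_pred):
    """Concordance correlation coefficient: 2*cov / (var_t + var_p + (mu_t - mu_p)^2)."""
    mu_t, mu_p = mean(y_true), mean(y_pred)
    var_t = mean([(t - mu_t) ** 2 for t in y_true])
    var_p = mean([(p - mu_p) ** 2 for p in y_pred])
    cov = mean([(t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred)])
    denom = var_t + var_p + (mu_t - mu_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both series are constant and equal")
    return 2 * cov / denom

targets = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # perfectly correlated but offset by one
score, error = ccc(targets, shifted), mae(targets, shifted)
```

Unlike Pearson correlation, CCC penalizes the constant offset in the toy example, which is why it falls below 1 even though the two series are perfectly correlated.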
Yu Xia, Zhiqiang Xu
Compressed sensing has demonstrated that a general signal $\boldsymbol{x} \in
\mathbb{F}^n$ ($\mathbb{F}\in \{\mathbb{R},\mathbb{C}\}$) can be estimated from
few linear measurements with an error proportional to the best $k$-term
approximation error, a property known as instance optimality. In this paper, we
investigate instance optimality in the context of phaseless measurements using
the $\ell_p$-minimization decoder, where $p \in (0, 1]$, for both real and
complex cases. More specifically, we prove that $(2,1)$ and $(1,1)$-instance
optimality of order $k$ can be achieved with $m =O(k \log(n/k))$ phaseless
measurements, paralleling results from linear measurements. These results imply
that one can stably recover approximately $k$-sparse signals from $m = O(k
\log(n/k))$ phaseless measurements. Our approach leverages the phaseless
bi-Lipschitz condition. Additionally, we present a non-uniform version of the
$(2,2)$-instance optimality result, which holds in probability for any fixed
vector $\boldsymbol{x} \in \mathbb{F}^n$. These findings reveal striking
parallels between compressive phase retrieval and classical compressed sensing,
enhancing our understanding of both phase retrieval and instance optimality.
Authors' comments: 18 pages
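For reference, the notion of $(p,q)$-instance optimality used above can be written out as follows. This is a sketch based on the usual mixed-norm definition from compressed sensing, adapted to phaseless measurements by measuring error up to a global phase; the symbols $A$ (measurement matrix), $\Delta$ (decoder), and $C$ (constant) are notation assumed here, and the paper's exact statement may differ.

```latex
% Best k-term approximation error of x in the \ell_q (quasi-)norm
\sigma_k(\boldsymbol{x})_q = \min_{\|\boldsymbol{z}\|_0 \le k} \|\boldsymbol{x} - \boldsymbol{z}\|_q

% (p,q)-instance optimality of order k for phaseless measurements |A x|:
% the error is measured up to a unimodular factor c, since |A(cx)| = |Ax|
\min_{c \in \mathbb{F},\ |c| = 1}
  \bigl\| \Delta(|A\boldsymbol{x}|) - c\,\boldsymbol{x} \bigr\|_p
  \le C\, k^{1/p - 1/q}\, \sigma_k(\boldsymbol{x})_q
  \quad \text{for all } \boldsymbol{x} \in \mathbb{F}^n
```

Under this convention the $(2,1)$ case carries the factor $k^{-1/2}$, while the $(1,1)$ and $(2,2)$ cases carry no power of $k$.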
Ori Nizan, Oren Shrout, Ayellet Tal
A concept may reflect either a concrete or abstract idea. Given an input image, this paper seeks to retrieve other images that share its central concepts, capturing aspects of the underlying narrative. This goes beyond conventional retrieval or clustering methods, which emphasize visual or semantic similarity. We formally define the problem, outline key requirements, and introduce appropriate evaluation metrics. We propose a novel approach grounded in two key observations: (1) While each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another. (2) Modeling this neighborhood with a bimodal Gaussian distribution uncovers meaningful structure that facilitates concept identification. Qualitative, quantitative, and human evaluations confirm the effectiveness of our approach. See the package on PyPI: https://pypi.org/project/coret/
Anant Gupta, Karthik Singaravadivelan, Zekun Wang
Neural document retrieval often treats a corpus as a flat cloud of vectors scored at a single granularity, leaving corpus structure underused and explanations opaque. We use Cobweb--a hierarchy-aware framework--to organize sentence embeddings into a prototype tree and rank documents via coarse-to-fine traversal. Internal nodes act as concept prototypes, providing multi-granular relevance signals and a transparent rationale through retrieval paths. We instantiate two inference approaches: a generalized best-first search and a lightweight path-sum ranker. We evaluate our approaches on MS MARCO and QQP with encoder (e.g., BERT/T5) and decoder (GPT-2) representations. Our results show that our retrieval approaches match the dot product search on strong encoder embeddings while remaining robust when kNN degrades: with GPT-2 vectors, dot product performance collapses whereas our approaches still retrieve relevant results. Overall, our experiments suggest that Cobweb provides competitive effectiveness, improved robustness to embedding quality, scalability, and interpretable retrieval via hierarchical prototypes.
Authors' comments: 20 pages, 7 tables, 4 figures
Gennian Ge, Hao Wang, Zixiang Xu, Yijun Zhang
The problem of PIR in graph-based replication systems has received
significant attention in recent years. A systematic study was conducted by
Sadeh, Gu, and Tamo, where each file is replicated across two servers and the
storage topology is modeled by a graph. The PIR capacity of a graph $G$,
denoted by $\mathcal{C}(G)$, is defined as the supremum of retrieval rates
achievable by schemes that preserve user privacy, with the rate measured as the
ratio between the file size and the total number of bits downloaded. This paper
makes the following key contributions.
(1) The complete graph $K_N$ has emerged as a central benchmark in the study
of PIR over graphs. The asymptotic gap between the upper and lower bounds for
$\mathcal{C}(K_N)$ was previously 2 and was only recently reduced to $5/3$. We
shrink this gap to $1.0444$, bringing it close to resolution. More precisely,
(i) Sadeh, Gu, and Tamo proved that $\mathcal{C}(K_N)\le 2/(N+1)$ and
conjectured this bound to be tight. We refute this conjecture by establishing
the strictly stronger bound $\mathcal{C}(K_N) \le \frac{1.3922}{N}.$ We also
improve the upper bound for the balanced complete bipartite graph
$\mathcal{C}(K_{N/2,N/2})$. (ii) The first lower bound on $\mathcal{C}(K_N)$
was $(1+o(1))/N$, which was recently sharpened to $(6/5+o(1))/N$. We provide
explicit, systematic constructions that further improve this bound, proving
$\mathcal{C}(K_N)\ge(4/3-o(1))/N,$ which in particular implies $\mathcal{C}(G)
\ge (4/3-o(1))/|G|$ for every graph $G$.
(2) We establish a conceptual bridge between deterministic and probabilistic
PIR schemes on graphs. This connection has significant implications for
reducing the required subpacketization in practical implementations and is of
independent interest. We also design a general probabilistic PIR scheme that
performs particularly well on sparse graphs.
Authors' comments: 72 pages
Kayla Farivar
Information retrieval systems have progressed notably from lexical techniques such as BM25 and TF-IDF to modern semantic retrievers. This survey provides a brief overview of the BM25 baseline, then discusses the architecture of modern state-of-the-art semantic retrievers. Advancing from BERT, we introduce dense bi-encoders (DPR), late-interaction models (ColBERT), and neural sparse retrieval (SPLADE). Finally, we examine MonoT5, a cross-encoder model. We conclude with common evaluation tactics, pressing challenges, and propositions for future directions.
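As an illustration of the lexical baseline the survey above starts from, here is a minimal BM25 scoring sketch. It uses a common formulation (the smoothed, always-positive IDF variant) with whitespace tokenization; the parameter defaults `k1=1.5` and `b=0.75` are conventional choices assumed here, not values taken from the survey.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs chase cats", "sparse retrieval with bm25"]
scores = bm25_scores("bm25 retrieval", docs)
```

Because matching is purely lexical, "cats" in the second document would not match a query for "cat", which is the kind of vocabulary mismatch that semantic retrievers are meant to address.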
Mohsen Dehghankar, Raghav Mittal, Suraj Shetiya, Abolfazl Asudeh, Gautam Das
We study the problem of Direct-Access Ranked Retrieval (DAR) for interactive data tooling, where evolving data exploration practices, combined with large-scale and high-dimensional datasets, create new challenges. DAR concerns the problem of enabling efficient access to arbitrary rank positions according to a ranking function, without enumerating all preceding tuples. To address this need, we formalize the DAR problem and propose a theoretically efficient algorithm based on geometric arrangements, achieving logarithmic query time. However, this method suffers from exponential space complexity in high dimensions. Therefore, we develop a second class of algorithms based on $\varepsilon$-sampling, which consume linear space. Since exactly locating the tuple at a specific rank is challenging due to its connection to the range counting problem, we introduce a relaxed variant called Conformal Set Ranked Retrieval (CSR), which returns a small subset guaranteed to contain the target tuple. To solve the CSR problem efficiently, we define an intermediate problem, Stripe Range Retrieval (SRR), and design a hierarchical sampling data structure tailored for narrow-range queries. Our method achieves practical scalability in both data size and dimensionality. We prove near-optimal bounds on the efficiency of our algorithms and validate their performance through extensive experiments on real and synthetic datasets, demonstrating scalability to millions of tuples and hundreds of dimensions.
Pengcheng Zhou, Yinglun Feng, Zhongliang Yang
Although Retrieval-Augmented Generation (RAG) systems have been widely applied, the privacy and security risks they face, such as data leakage and data poisoning, have not been systematically addressed yet. Existing defense strategies primarily rely on heuristic filtering or enhancing retriever robustness, which suffer from limited interpretability, lack of formal security guarantees, and vulnerability to adaptive attacks. To address these challenges, this paper proposes the first provably secure framework for RAG systems (SAG). Our framework employs a pre-storage full-encryption scheme to ensure dual protection of both retrieved content and vector embeddings, guaranteeing that only authorized entities can access the data. Through formal security proofs, we rigorously verify the scheme's confidentiality and integrity under a computational security model. Extensive experiments across multiple benchmark datasets demonstrate that our framework effectively resists a range of state-of-the-art attacks. This work establishes a theoretical foundation and practical paradigm for verifiably secure RAG systems, advancing AI-powered services toward formally guaranteed security.
Thi Thu Uyen Hoang, Viet Anh Nguyen
This paper presents an advancement in Question-Answering (QA) systems using a Retrieval Augmented Generation (RAG) framework to enhance information extraction from PDF files. The richness and diversity of data within PDFs--including text, images, vector diagrams, graphs, and tables--pose unique challenges for existing QA systems primarily designed for textual content. We seek to develop a comprehensive RAG-based QA system that will effectively address complex multimodal questions, where several data types are combined in the query. This is mainly achieved by refining approaches to processing and integrating non-textual elements in PDFs into the RAG framework to derive precise and relevant answers, as well as fine-tuning large language models to better adapt to our system. We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs. This work not only pushes the boundaries of retrieval-augmented QA systems but also lays a foundation for further research in multimodal data integration and processing.