Richard Zhu
Exact nearest neighbor search is a computationally intensive process, and
even its simpler sibling -- vector retrieval -- can be computationally complex.
This is exacerbated when retrieving vectors which have high-dimension $d$
relative to the number of vectors, $N$, in the database. Exact nearest neighbor
retrieval has been generally acknowledged to be a $O(Nd)$ problem with no
sub-linear solutions. Attention has instead shifted towards Approximate
Nearest-Neighbor (ANN) retrieval techniques, many of which have sub-linear or
even logarithmic time complexities. However, if our intuition from binary
search problems (e.g. $d=1$ vector retrieval) carries, there ought to be a way
to retrieve an organized representation of vectors without brute-forcing our
way to a solution. For low dimension (e.g. $d=2$ or $d=3$ cases),
\texttt{kd-trees} provide a $O(d\log N)$ algorithm for retrieval. Unfortunately
the algorithm deteriorates rapidly to a $O(dN)$ solution at high dimensions
(e.g. $k=128$), in practice. We propose a novel algorithm for logarithmic Fast
Exact Retrieval for Nearest-neighbor lookup (FERN), inspired by
\texttt{kd-trees}. The algorithm achieves $O(d\log N)$ look-up with 100\%
recall on 10 million $d=128$ uniformly randomly generated
vectors.\footnote{Code available at https://github.com/RichardZhu123/ferns}
Authors' comments: NAACL 2024 SRW
Nina Willis, Abraham Bernstein, Luca Rossetto
Interactive video retrieval is a cooperative process between humans and retrieval systems. Large-scale evaluation campaigns, however, often overlook human factors, such as the effects of perception, attention, and memory, when assessing media retrieval systems. Consequently, their setups fall short of emulating realistic retrieval scenarios. In this paper, we design novel task presentation modes based on concepts in media memorability, implement the pipelines necessary for processing target video segments, and build a custom experimental platform for the final evaluation. In order to study the effects of different task representation schemes, we conduct a large crowdsourced experiment. Our findings demonstrate that the way in which the target of a video retrieval task is presented has a substantial influence on the difficulty of the retrieval task and that individuals can successfully retrieve a target video segment despite reducing or even altering the provided hints, opening up a discussion around future evaluation protocols in the domain of interactive media retrieval.
Junting Zhao, Yang Zhou, Zhihao Chen, Huazhu Fu, Liang Wan
Automated radiology reporting holds immense clinical potential in alleviating the burdensome workload of radiologists and mitigating diagnostic bias. Recently, retrieval-based report generation methods have garnered increasing attention due to their inherent advantages in terms of the quality and consistency of generated reports. However, due to the long-tail distribution of the training data, these models tend to learn frequently occurring sentences and topics, overlooking the rare topics. Regrettably, in many cases, the descriptions of rare topics often indicate critical findings that should be mentioned in the report. To address this problem, we introduce a Topicwise Separable Sentence Retrieval (Teaser) for medical report generation. To ensure comprehensive learning of both common and rare topics, we categorize queries into common and rare types to learn differentiated topics, and then propose Topic Contrastive Loss to effectively align topics and queries in the latent space. Moreover, we integrate an Abstractor module following the extraction of visual features, which aids the topic decoder in gaining a deeper understanding of the visual observational intent. Experiments on the MIMIC-CXR and IU X-ray datasets demonstrate that Teaser surpasses state-of-the-art models, while also validating its capability to effectively represent rare topics and establish more dependable correspondences between queries and topics.
Junren Chen, Ming Yuan
In this paper, we study the sample complexity and develop efficient optimal
algorithms for 1-bit phase retrieval: recovering a signal
$\mathbf{x}\in\mathbb{R}^n$ from $m$ phaseless bits
$\{\mathrm{sign}(|\mathbf{a}_i^\top\mathbf{x}|-\tau)\}_{i=1}^m$ generated by
standard Gaussian $\mathbf{a}_i$s. By investigating a phaseless version of
random hyperplane tessellation, we show that (constrained) hamming distance
minimization uniformly recovers all unstructured signals with Euclidean norm
bounded away from zero and infinity to the error $\mathcal{O}((n/m)\log(m/n))$,
and $\mathcal{O}((k/m)\log(mn/k^2))$ when restricting to $k$-sparse signals.
Both error rates are shown to be information-theoretically optimal, up to a
logarithmic factor. Intriguingly, the optimal rate for sparse recovery matches
that of 1-bit compressed sensing, suggesting that the phase information is
non-essential for 1-bit compressed sensing. We also develop efficient
algorithms for 1-bit (sparse) phase retrieval that can achieve these error
rates. Specifically, we prove that (thresholded) gradient descent with respect
to the one-sided $\ell_1$-loss, when initialized via spectral methods,
converges linearly and attains the near optimal reconstruction error, with
sample complexity $\mathcal{O}(n)$ for unstructured signals and
$\mathcal{O}(k^2\log(n)\log^2(m/k))$ for $k$-sparse signals. Our proof is based
upon the observation that a certain local (restricted) approximate
invertibility condition is respected by Gaussian measurements.
Authors' comments: substantially cleaner proofs provided
Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu
Retrieval-Augmented Language Modeling (RALM) by integrating large language
models (LLM) with relevant documents from an external corpus is a proven method
for enabling the LLM to generate information beyond the scope of its
pre-training corpus. Previous work utilizing retrieved content by simply
prepending it to the input poses a high runtime issue, which degrades the
inference efficiency of the LLMs because they fail to use the Key-Value (KV)
cache efficiently. In this paper, we propose FlashBack, a modular RALM designed
to improve the inference efficiency of RALM with appending context pattern
while maintaining decent performance after fine-tuning by Low-Rank Adaption.
FlashBack appends retrieved documents at the end of the context for efficiently
utilizing the KV cache instead of prepending them. And we introduce Marking
Token as two special prompt tokens for marking the boundary of the appending
context during fine-tuning. Our experiments on testing generation quality show
that FlashBack can remain decent generation quality in perplexity. And the
inference speed of FlashBack is up to $4\times$ faster than the prepending
counterpart on a 7B LLM (Llama 2) in the runtime test. Via bypassing
unnecessary re-computation, it demonstrates an advancement by achieving
significantly faster inference speed, and this heightened efficiency will
substantially reduce inferential cost.
Authors' comments: ACL 2025 Findings, 14 pages
Jiacheng Cheng, Hijung Valentina Shin, Nuno Vasconcelos, Bryan Russell, Fabian Caba Heilbron
In the recent years, the dual-encoder vision-language models (\eg CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually results in very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder model is due to their text towers which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
Li Mi, Xianjie Dai, Javiera Castillo-Navarro, Devis Tuia
Image-based retrieval in large Earth observation archives is challenging
because one needs to navigate across thousands of candidate matches only with
the query image as a guide. By using text as information supporting the visual
query, the retrieval system gains in usability, but at the same time faces
difficulties due to the diversity of visual signals that cannot be summarized
by a short caption only. For this reason, as a matching-based task, cross-modal
text-image retrieval often suffers from information asymmetry between texts and
images. To address this challenge, we propose a Knowledge-aware Text-Image
Retrieval (KTIR) method for remote sensing images. By mining relevant
information from an external knowledge graph, KTIR enriches the text scope
available in the search query and alleviates the information gaps between texts
and images for better matching. Moreover, by integrating domain-specific
knowledge, KTIR also enhances the adaptation of pre-trained vision-language
models to remote sensing applications. Experimental results on three commonly
used remote sensing text-image retrieval benchmarks show that the proposed
knowledge-aware method leads to varied and consistent retrievals, outperforming
state-of-the-art retrieval methods.
Authors' comments: Accepted by IEEE TGRS
Lorenzo Agnolucci, Alberto Baldrati, Marco Bertini, Alberto Del Bimbo
Given a query consisting of a reference image and a relative caption,
Composed Image Retrieval (CIR) aims to retrieve target images visually similar
to the reference one while incorporating the changes specified in the relative
caption. The reliance of supervised methods on labor-intensive manually labeled
datasets hinders their broad applicability. In this work, we introduce a new
task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled
training dataset. We propose an approach named iSEARLE (improved zero-Shot
composEd imAge Retrieval with textuaL invErsion) that involves mapping the
visual information of the reference image into a pseudo-word token in CLIP
token embedding space and combining it with the relative caption. To foster
research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO
(Composed Image Retrieval on Common Objects in context), the first CIR dataset
where each query is labeled with multiple ground truths and a semantic
categorization. The experimental results illustrate that iSEARLE obtains
state-of-the-art performance on three different CIR datasets -- FashionIQ,
CIRR, and the proposed CIRCO -- and two additional evaluation settings, namely
domain conversion and object composition. The dataset, the code, and the model
are publicly available at https://github.com/miccunifi/SEARLE.
Authors' comments: Extended version of the ICCV2023 paper arXiv:2303.15247
Xinran Zhao, Tong Chen, Sihao Chen, Hongming Zhang, Tongshuang Wu
The task of Information Retrieval (IR) requires a system to identify relevant documents based on users' information needs. In real-world scenarios, retrievers are expected to not only rely on the semantic relevance between the documents and the queries but also recognize the nuanced intents or perspectives behind a user query. For example, when asked to verify a claim, a retrieval system is expected to identify evidence from both supporting vs. contradicting perspectives, for the downstream system to make a fair judgment call. In this work, we study whether retrievers can recognize and respond to different perspectives of the queries -- beyond finding relevant documents for a claim, can retrievers distinguish supporting vs. opposing documents? We reform and extend six existing tasks to create a benchmark for retrieval, where we have diverse perspectives described in free-form text, besides root, neutral queries. We show that current retrievers covered in our experiments have limited awareness of subtly different perspectives in queries and can also be biased toward certain perspectives. Motivated by the observation, we further explore the potential to leverage geometric features of retriever representation space to improve the perspective awareness of retrievers in a zero-shot manner. We demonstrate the efficiency and effectiveness of our projection-based methods on the same set of tasks. Further analysis also shows how perspective awareness improves performance on various downstream tasks, with 4.2% higher accuracy on AmbigQA and 29.9% more correlation with designated viewpoints on essay writing, compared to non-perspective-aware baselines.
Ra'Fat Al-Msie'deen
A Bug Tracking System (BTS), such as Bugzilla, is generally utilized to track
submitted Bug Reports (BRs) for a particular software system. Duplicate Bug
Report (DBR) retrieval is the process of obtaining a DBR in the BTS. This
process is important to avoid needless work from engineers on DBRs. To prevent
wasting engineer resources, such as effort and time, on previously submitted
(or duplicate) BRs, it is essential to find and retrieve DBRs as soon as they
are submitted by software users. Thus, this paper proposes an automatic
approach (called BushraDBR) that aims to assist an engineer (called a triager)
to retrieve DBRs and stop the duplicates before they start. Where BushraDBR
stands for Bushra Duplicate Bug Reports retrieval process. Therefore, when a
new BR is sent to the Bug Repository (BRE), an engineer checks whether it is a
duplicate of an existing BR in BRE or not via BushraDBR approach. If it is, the
engineer marks it as DBR, and the BR is excluded from consideration for any
additional work; otherwise, the BR is added to the BRE. BushraDBR approach
relies on Textual Similarity (TS) between the newly submitted BR and the rest
of the BRs in BRE to retrieve DBRs. BushraDBR exploits unstructured data from
BRs to apply Information Retrieval (IR) methods in an efficient way. BushraDBR
approach uses two techniques to retrieve DBRs: Latent Semantic Indexing (LSI)
and Formal Concept Analysis (FCA). The originality of BushraDBR is to stop DBRs
before they occur by comparing the newly reported BR with the rest of the BRs
in the BTS, thus saving time and effort during the Software Maintenance (SM)
process. BushraDBR also uniquely retrieves DBR through the use of LSI and FCA
techniques. BushraDBR approach had been validated and evaluated on several
publicly available data sets from Bugzilla. Experiments show the ability of
BushraDBR approach to retrieve DBRs in an efficient manner.
Authors' comments: 18 pages, 12 figures, 16 tables, 92 references
Dmytro Mozolevskyi, Waseem AlShikh
This research paper presents a comprehensive analysis of integrating advanced language models with search and retrieval systems in the fields of information retrieval and natural language processing. The objective is to evaluate and compare various state-of-the-art methods based on their performance in terms of accuracy and efficiency. The analysis explores different combinations of technologies, including Azure Cognitive Search Retriever with GPT-4, Pinecone's Canopy framework, Langchain with Pinecone and different language models (OpenAI, Cohere), LlamaIndex with Weaviate Vector Store's hybrid search, Google's RAG implementation on Cloud VertexAI-Search, Amazon SageMaker's RAG, and a novel approach called KG-FID Retrieval. The motivation for this analysis arises from the increasing demand for robust and responsive question-answering systems in various domains. The RobustQA metric is used to evaluate the performance of these systems under diverse paraphrasing of questions. The report aims to provide insights into the strengths and weaknesses of each method, facilitating informed decisions in the deployment and development of AI-driven search and retrieval systems.
Jiawei Zhou, Li Dong, Furu Wei, Lei Chen
Information retrieval has transitioned from standalone systems into essential components across broader applications, with indexing efficiency, cost-effectiveness, and freshness becoming increasingly critical yet often overlooked. In this paper, we introduce SemI-parametric Disentangled Retrieval (SiDR), a bi-encoder retrieval framework that decouples retrieval index from neural parameters to enable efficient, low-cost, and parameter-agnostic indexing for emerging use cases. Specifically, in addition to using embeddings as indexes like existing neural retrieval methods, SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness. Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i) When using an embedding-based index, SiDR exceeds the performance of conventional neural retrievers while maintaining similar training complexity; (ii) When using a tokenization-based index, SiDR drastically reduces indexing cost and time, matching the complexity of traditional term-based retrieval, while consistently outperforming BM25 on all in-domain datasets; (iii) Additionally, we introduce a late parametric mechanism that matches BM25 index preparation time while outperforming other neural retrieval baselines in effectiveness.
Yifei Ming, Yixuan Li
Pre-trained contrastive vision-language models have demonstrated remarkable
performance across a wide range of tasks. However, they often struggle on
fine-trained datasets with categories not adequately represented during
pre-training, which makes adaptation necessary. Recent works have shown
promising results by utilizing samples from web-scale databases for
retrieval-augmented adaptation, especially in low-data regimes. Despite the
empirical success, understanding how retrieval impacts the adaptation of
vision-language models remains an open research question. In this work, we
adopt a reflective perspective by presenting a systematic study to understand
the roles of key components in retrieval-augmented adaptation. We unveil new
insights on uni-modal and cross-modal retrieval and highlight the critical role
of logit ensemble for effective adaptation. We further present theoretical
underpinnings that directly support our empirical observations.
Authors' comments: The paper is accepted at ICML 2024
Antonio Mallia, Torten Suel, Nicola Tonellotto
Learned sparse retrieval systems aim to combine the effectiveness of
contextualized language models with the scalability of conventional data
structures such as inverted indexes. Nevertheless, the indexes generated by
these systems exhibit significant deviations from the ones that use traditional
retrieval models, leading to a discrepancy in the performance of existing query
optimizations that were specifically developed for traditional structures.
These disparities arise from structural variations in query and document
statistics, including sub-word tokenization, leading to longer queries, smaller
vocabularies, and different score distributions within posting lists. This
paper introduces Block-Max Pruning (BMP), an innovative dynamic pruning
strategy tailored for indexes arising in learned sparse retrieval environments.
BMP employs a block filtering mechanism to divide the document space into
small, consecutive document ranges, which are then aggregated and sorted on the
fly, and fully processed only as necessary, guided by a defined safe early
termination criterion or based on approximate retrieval requirements. Through
rigorous experimentation, we show that BMP substantially outperforms existing
dynamic pruning strategies, offering unparalleled efficiency in safe retrieval
contexts and improved tradeoffs between precision and efficiency in approximate
retrieval tasks.
Authors' comments: SIGIR 2024 (short paper track)
Dawn Lawrie, Efsun Kayi, Eugene Yang, James Mayfield, Douglas W. Oard
PLAID, an efficient implementation of the ColBERT late interaction bi-encoder
using pretrained language models for ranking, consistently achieves
state-of-the-art performance in monolingual, cross-language, and multilingual
retrieval. PLAID differs from ColBERT by assigning terms to clusters and
representing those terms as cluster centroids plus compressed residual vectors.
While PLAID is effective in batch experiments, its performance degrades in
streaming settings where documents arrive over time because representations of
new tokens may be poorly modeled by the earlier tokens used to select cluster
centroids. PLAID Streaming Hierarchical Indexing that Runs on Terabytes of
Temporal Text (PLAID SHIRTTT) addresses this concern using multi-phase
incremental indexing based on hierarchical sharding. Experiments on ClueWeb09
and the multilingual NeuCLIR collection demonstrate the effectiveness of this
approach both for the largest collection indexed to date by the ColBERT
architecture and in the multilingual setting, respectively.
Authors' comments: 5 pages, 1 figure, accepted at SIGIR 2024 as short paper
Mingchen Li, Halil Kilicoglu, Hua Xu, Rui Zhang
Large Language Models (LLMs) have swiftly emerged as vital resources for different applications in the biomedical and healthcare domains; however, these models encounter issues such as generating inaccurate information or hallucinations. Retrieval-augmented generation provided a solution for these models to update knowledge and enhance their performance. In contrast to previous retrieval-augmented LMs, which utilize specialized cross-attention mechanisms to help LLM encode retrieved text, BiomedRAG adopts a simpler approach by directly inputting the retrieved chunk-based documents into the LLM. This straightforward design is easily applicable to existing retrieval and language models, effectively bypassing noise information in retrieved documents, particularly in noise-intensive tasks. Moreover, we demonstrate the potential for utilizing the LLM to supervise the retrieval model in the biomedical domain, enabling it to retrieve the document that assists the LM in improving its predictions. Our experiments reveal that with the tuned scorer,\textsc{ BiomedRAG} attains superior performance across 5 biomedical NLP tasks, encompassing information extraction (triple extraction, relation extraction), text classification, link prediction, and question-answering, leveraging over 9 datasets. For instance, in the triple extraction task, \textsc{BiomedRAG} outperforms other triple extraction systems with micro-F1 scores of 81.42 and 88.83 on GIT and ChemProt corpora, respectively.
Zihao Li, Yuyi Ao, Jingrui He
Knowledge graphs (KGs), which store an extensive number of relational facts
(head, relation, tail), serve various applications. While many downstream tasks
highly rely on the expressive modeling and predictive embedding of KGs, most of
the current KG representation learning methods, where each entity is embedded
as a vector in the Euclidean space and each relation is embedded as a
transformation, follow an entity ranking protocol. On one hand, such an
embedding design cannot capture many-to-many relations. On the other hand, in
many retrieval cases, the users wish to get an exact set of answers without any
ranking, especially when the results are expected to be precise, e.g., which
genes cause an illness. Such scenarios are commonly referred to as "set
retrieval". This work presents a pioneering study on the KG set retrieval
problem. We show that the set retrieval highly depends on expressive modeling
of many-to-many relations, and propose a new KG embedding model SpherE to
address this problem. SpherE is based on rotational embedding methods, but each
entity is embedded as a sphere instead of a vector. While inheriting the high
interpretability of rotational-based models, our SpherE can more expressively
model one-to-many, many-to-one, and many-to-many relations. Through extensive
experiments, we show that our SpherE can well address the set retrieval problem
while still having a good predictive ability to infer missing facts. The code
is available at https://github.com/Violet24K/SpherE.
Authors' comments: Accepted by SIGIR 2024, Camera Ready Version
Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
Learned sparse representations form an attractive class of contextual embeddings for text retrieval. That is so because they are effective models of relevance and are interpretable by design. Despite their apparent compatibility with inverted indexes, however, retrieval over sparse embeddings remains challenging. That is due to the distributional differences between learned embeddings and term frequency-based lexical models of relevance such as BM25. Recognizing this challenge, a great deal of research has gone into, among other things, designing retrieval algorithms tailored to the properties of learned sparse representations, including approximate retrieval systems. In fact, this task featured prominently in the latest BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on a large benchmark dataset by throughput and recall. In this work, we propose a novel organization of the inverted index that enables fast yet effective approximate retrieval over learned sparse embeddings. Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector. During query processing, we quickly determine if a block must be evaluated using the summaries. As we show experimentally, single-threaded query processing using our method, Seismic, reaches sub-millisecond per-query latency on various sparse embeddings of the MS MARCO dataset while maintaining high recall. Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and further outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin.
Liying Gao, Bingliang Jiao, Peng Wang, Shizhou Zhang, Hanwang Zhang, Yanning Zhang
Sketch-based image retrieval (SBIR) associates hand-drawn sketches with their corresponding realistic images. In this study, we aim to tackle two major challenges of this task simultaneously: i) zero-shot, dealing with unseen categories, and ii) fine-grained, referring to intra-category instance-level retrieval. Our key innovation lies in the realization that solely addressing this cross-category and fine-grained recognition task from the generalization perspective may be inadequate since the knowledge accumulated from limited seen categories might not be fully valuable or transferable to unseen target categories. Inspired by this, in this work, we propose a dual-modal prompting CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed. Specifically, to facilitate the adaptation of our DP-CLIP toward unpredictable target categories, we employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales. By integrating the generated guidance, DP-CLIP could gain valuable category-centric insights, efficiently adapting to novel categories and capturing unique discriminative clues for effective retrieval within each target category. With these designs, our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot SBIR method by 7.3% in Acc.@1 on the Sketchy dataset. Meanwhile, in the other two category-level zero-shot SBIR benchmarks, our method also achieves promising performance.
Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May D. Wang, Joyce C. Ho, Chao Zhang et al.
Developing effective biomedical retrieval models is important for excelling
at knowledge-intensive biomedical tasks but still challenging due to the
deficiency of sufficient publicly annotated biomedical data and computational
resources. We present BMRetriever, a series of dense retrievers for enhancing
biomedical retrieval via unsupervised pre-training on large biomedical corpora,
followed by instruction fine-tuning on a combination of labeled datasets and
synthetic pairs. Experiments on 5 biomedical tasks across 11 datasets verify
BMRetriever's efficacy on various biomedical applications. BMRetriever also
exhibits strong parameter efficiency, with the 410M variant outperforming
baselines up to 11.7 times larger, and the 2B variant matching the performance
of models with over 5B parameters. The training data and model checkpoints are
released at \url{https://huggingface.co/BMRetriever} to ensure transparency,
reproducibility, and application to new domains.
Authors' comments: Accepted to EMNLP 2024. The model and data are uploaded to
\url{https://github.com/ritaranx/BMRetriever}