Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Generative retrieval uses differentiable search indexes to directly generate
relevant document identifiers in response to a query. Recent studies have
highlighted the potential of a strong generative retrieval model, trained with
carefully crafted pre-training tasks, to enhance downstream retrieval tasks via
fine-tuning. However, the full power of pre-training for generative retrieval
remains underexploited due to its reliance on pre-defined static document
identifiers, which may not align with evolving model parameters. In this work,
we introduce BootRet, a bootstrapped pre-training method for generative
retrieval that dynamically adjusts document identifiers during pre-training to
accommodate the continuing memorization of the corpus. BootRet involves three
key training phases: (i) initial identifier generation, (ii) pre-training via
corpus indexing and relevance prediction tasks, and (iii) bootstrapping for
identifier updates. To facilitate the pre-training phase, we further introduce
noisy documents and pseudo-queries, generated by large language models, to
resemble semantic connections in both indexing and retrieval tasks.
Experimental results demonstrate that BootRet significantly outperforms
existing pre-training generative retrieval baselines and performs well even in
zero-shot settings.
Authors' comments: Accepted by ACL Findings 2024
Ruijie Yang, Yan Zhu, Peiyao Fu, Yizhe Zhang, Zhihua Wang, Quanlin Li, Pinghong Zhou, Xian Yang et al.
Determining the necessity of resecting malignant polyps during colonoscopy
screening is crucial for patient outcomes, yet challenging due to the
time-consuming and costly nature of histopathology examination. While deep
learning-based classification models have shown promise in achieving optical
biopsy with endoscopic images, they often suffer from a lack of explainability.
To overcome this limitation, we introduce EndoFinder, a content-based image
retrieval framework to find the 'digital twin' polyp in the reference database
given a newly detected polyp. The clinical semantics of the new polyp can be
inferred by referring to the matched ones. EndoFinder pioneers a polyp-aware image
encoder that is pre-trained on a large polyp dataset in a self-supervised way,
merging masked image modeling with contrastive learning. This results in a
generic embedding space ready for different downstream clinical tasks based on
image retrieval. We validate the framework on polyp re-identification and
optical biopsy tasks, with extensive experiments demonstrating that EndoFinder
not only achieves explainable diagnostics but also matches the performance of
supervised classification models. EndoFinder's reliance on image retrieval has
the potential to support diverse downstream decision-making tasks during
real-time colonoscopy procedures.
Authors' comments: MICCAI 2024
Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi et al.
Existing retrieval benchmarks primarily consist of information-seeking
queries (e.g., aggregated questions from search engines) where keyword or
semantic-based retrieval is usually sufficient. However, many complex
real-world queries require in-depth reasoning to identify relevant documents
that go beyond surface form matching. For example, finding documentation for a
coding question requires understanding the logic and syntax of the functions
involved. To better benchmark retrieval on such challenging queries, we
introduce BRIGHT, the first text retrieval benchmark that requires intensive
reasoning to retrieve relevant documents. Our dataset consists of 1,384
real-world queries spanning diverse domains, such as economics, psychology,
mathematics, and coding. These queries are drawn from naturally occurring and
carefully curated human data. Extensive evaluation reveals that even
state-of-the-art retrieval models perform poorly on BRIGHT. The leading model
on the MTEB leaderboard (Muennighoff et al., 2023), SFR-Embedding-Mistral (Meng
et al., 2024), which achieves a score of 59.0 nDCG@10, produces an nDCG@10 of
18.3 on BRIGHT. We show that incorporating explicit reasoning about
the query improves retrieval performance by up to 12.2 points. Moreover,
incorporating retrieved documents from the top-performing retriever boosts
question-answering performance. We believe that BRIGHT paves the way for future
research on retrieval systems in more realistic and challenging settings.
Authors' comments: 51 pages
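For readers unfamiliar with the nDCG@10 metric quoted above, the following is a minimal sketch of one common formulation (linear gain, log2 rank discount). The function names and the `qrels` dictionary layout are illustrative assumptions, not the BRIGHT evaluation code.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), ranks start at 1."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_ids: document ids in the order the retriever returned them.
    qrels: hypothetical dict mapping document id -> graded relevance (0 if absent).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    # The ideal ranking places the most relevant judged documents first.
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    if ideal == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg(gains) / ideal
```

A benchmark score is then usually the mean of this value over all queries.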
Satya Almasian, Milena Bruseva, Michael Gertz
Quantitative information plays a crucial role in understanding and interpreting the content of documents. Many user queries contain quantities and cannot be resolved without understanding their semantics, e.g., "car that costs less than $10k". Yet, modern search engines apply the same ranking mechanisms for both words and quantities, overlooking magnitude and unit information. In this paper, we introduce two quantity-aware ranking techniques designed to rank both the quantity and textual content either jointly or independently. These techniques incorporate quantity information in available retrieval systems and can address queries with the numerical conditions equal to, greater than, and less than. To evaluate the effectiveness of our proposed models, we introduce two novel quantity-aware benchmark datasets in the domains of finance and medicine and compare our method against various lexical and neural models. The code and data are available under https://github.com/satya77/QuantityAwareRankers.
Hongsong Wang, Jianhua Zhao, Jie Gui
Human action understanding is a fundamental and challenging task in computer vision. Although there is extensive research in this area, most works focus on action recognition, while action retrieval has received less attention. In this paper, we focus on the neglected but important task of image-based action retrieval, which aims to find images that depict the same action as a query image. We establish benchmarks for this task and set up important baseline methods for fair comparison. We present an end-to-end model that learns rich action representations from three aspects: the anchored person, contextual regions, and the global image. A novel fusion transformer module is designed to model the relationships among different features and effectively fuse them into an action representation. Experiments on the Stanford-40 and PASCAL VOC 2012 Action datasets show that the proposed method significantly outperforms previous approaches for image-based action retrieval.
Saber Zerhoudi, Michael Granitzer
Large Language Models (LLMs) struggle with generating reliable outputs due to outdated knowledge and hallucinations. Retrieval-Augmented Generation (RAG) models address this by enhancing LLMs with external knowledge, but often fail to personalize the retrieval process. This paper introduces PersonaRAG, a novel framework incorporating user-centric agents to adapt retrieval and generation based on real-time user data and interactions. Evaluated across various question answering datasets, PersonaRAG demonstrates superiority over baseline models, providing tailored answers to user needs. The results suggest promising directions for user-adapted information retrieval systems.
Jack J. Davey, Kai Hou Yip, Ahmed F. Al-Refaie, Ingo P. Waldmann
With the JWST offering higher resolution data in space-based transmission
spectroscopy, understanding the capabilities of our current atmospheric
retrieval pipelines is essential. These new data cover wider wavelength ranges
and at much higher spectral resolution than previous instruments have been able
to offer. Therefore, it is often appealing to bin spectra to fewer points,
better constrained in their transit depth, before using them as inputs for
atmospheric retrievals. As such, we produce a simulation replicating the
observations of WASP-39b by the Near Infrared Spectrograph (NIRSpec) instrument
on board JWST using the PRISM dispersion element. Then, we assess the accuracy
and consistency of retrievals while varying both the resolution and the average
photometric error of this simulated spectrum. We repeat this analysis on three
different simulation setups where each includes an opaque cloud layer at a
different height in the atmosphere. In agreement with previous studies, we find
that a much greater resolution is needed in the case of a high cloud deck since
features are already heavily muted by the presence of the clouds. In the other
two cases, there are large 'safe zones' in the parameter space where accurate
estimations are made. If these maps can be generalized, they could be used to
inform future observations on how long to observe a given target in order to
achieve the most accurate retrieval results. We also find that the resolution
required to fully resolve the degeneracies between the parameters contributing
to the spectra is much greater than that needed to constrain the marginalized
posterior distributions for each parameter individually.
Authors' comments: Published in Monthly Notices of the Royal Astronomical Society, 27
pages, 30 figures
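The binning step mentioned above can be illustrated with a short sketch. It assumes inverse-variance weighting of adjacent points and independent Gaussian errors; the abstract does not state which binning scheme the authors used, so the function name and the weighting choice are assumptions made for illustration only.

```python
def bin_spectrum(wavelengths, depths, errors, factor):
    """Combine every `factor` adjacent spectral points into one binned point.

    Assumption: points are combined with inverse-variance weights, so the
    binned transit depth has a smaller error than any single input point.
    Trailing points that do not fill a complete bin are dropped.
    Returns a list of (wavelength, depth, error) tuples.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    binned = []
    for start in range(0, len(depths) - factor + 1, factor):
        stop = start + factor
        weights = [1.0 / e ** 2 for e in errors[start:stop]]
        weight_sum = sum(weights)
        centre = sum(wavelengths[start:stop]) / factor
        depth = sum(w * d for w, d in zip(weights, depths[start:stop])) / weight_sum
        error = (1.0 / weight_sum) ** 0.5
        binned.append((centre, depth, error))
    return binned
```

With equal per-point errors, binning by a factor n lowers the resolution by n and shrinks the error by a factor of sqrt(n), which is the trade-off between resolution and photometric error that the study varies.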
Chanwoong Yoon, Taewhoo Lee, Hyeon Hwang, Minbyul Jeong, Jaewoo Kang
Retrieval-augmented generation helps language models strengthen their
factual grounding by providing external contexts. However, language models
often face challenges when given extensive information, diminishing their
effectiveness in solving questions. Context compression tackles this issue by
filtering out irrelevant information, but current methods still struggle in
realistic scenarios where crucial information cannot be captured with a
single-step approach. To overcome this limitation, we introduce CompAct, a
novel framework that employs an active strategy to condense extensive documents
without losing key information. Our experiments demonstrate that CompAct brings
significant improvements in both performance and compression rate on multi-hop
question-answering benchmarks. CompAct flexibly operates as a cost-efficient
plug-in module with various off-the-shelf retrievers or readers, achieving
exceptionally high compression rates (47x).
Authors' comments: Accepted to the main conference at EMNLP 2024
Saurav K. Shastri, Philip Schniter
Accurately recovering images from phaseless measurements is a challenging and long-standing problem. In this work, we present "deepECpr," which combines expectation-consistent (EC) approximation with deep denoising networks to surpass state-of-the-art phase-retrieval methods in both speed and accuracy. In addition to applying EC in a non-traditional manner, deepECpr includes a novel stochastic damping scheme that is inspired by recent diffusion methods. Like existing phase-retrieval methods based on plug-and-play priors, regularization by denoising, or diffusion, deepECpr iterates a denoising stage with a measurement-exploitation stage. But unlike existing methods, deepECpr requires far fewer denoiser calls. We compare deepECpr to the state-of-the-art prDeep (Metzler et al., 2018), Deep-ITA (Wang et al., 2020), DOLPH (Shoushtari et al., 2023), and Diffusion Posterior Sampling (Chung et al., 2023) methods for noisy phase-retrieval of color, natural, and unnatural grayscale images on oversampled-Fourier and coded-diffraction-pattern measurements and find improvements in both PSNR and SSIM with significantly fewer denoiser calls.
Junjie Huang, Jizheng Chen, Jianghao Lin, Jiarui Qin, Ziming Feng, Weinan Zhang, Yong Yu
In an era dominated by information overload, effective recommender systems
are essential for managing the deluge of data across digital platforms.
Multi-stage cascade ranking systems are widely used in the industry, with
retrieval and ranking being two typical stages. Retrieval methods sift through
vast candidates to filter out irrelevant items, while ranking methods
prioritize these candidates to present the most relevant items to users. Unlike
studies focusing on the ranking stage, this survey explores the critical yet
often overlooked retrieval stage of recommender systems. To achieve precise and
efficient personalized retrieval, we summarize existing work in three key
areas: improving similarity computation between user and item, enhancing
indexing mechanisms for efficient retrieval, and optimizing training methods of
retrieval. We also provide a comprehensive set of benchmarking experiments on
three public datasets. Furthermore, we highlight current industrial
applications through a case study on retrieval practices at a specific company,
covering the entire retrieval process and online serving, along with practical
implications and challenges. By detailing the retrieval stage, which is
fundamental for effective recommendation, this survey aims to bridge the
existing knowledge gap and serve as a cornerstone for researchers interested in
optimizing this critical component of cascade recommender systems.
Authors' comments: 38 pages
Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli et al.
Retrieval augmented generation (RAG) combines the generative abilities of
large language models (LLMs) with external knowledge sources to provide more
accurate and up-to-date responses. Recent RAG advancements focus on improving
retrieval outcomes through iterative LLM refinement or self-critique
capabilities acquired through additional instruction tuning of LLMs. In this
work, we introduce Speculative RAG - a framework that leverages a larger
generalist LM to efficiently verify multiple RAG drafts produced in parallel by
a smaller, distilled specialist LM. Each draft is generated from a distinct
subset of retrieved documents, offering diverse perspectives on the evidence
while reducing input token counts per draft. This approach enhances
comprehension of each subset and mitigates potential position bias over long
context. Our method accelerates RAG by delegating drafting to the smaller
specialist LM, with the larger generalist LM performing a single verification
pass over the drafts. Extensive experiments demonstrate that Speculative RAG
achieves state-of-the-art performance with reduced latency on TriviaQA,
MuSiQue, PopQA, PubHealth, and ARC-Challenge benchmarks. It notably enhances
accuracy by up to 12.97% while reducing latency by 50.83% compared to
conventional RAG systems on PubHealth.
Authors' comments: Accepted to ICLR 2025
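The draft-then-verify control flow described above can be sketched as follows. This is a rough illustration only: `drafter` and `verifier` are hypothetical callables standing in for the specialist and generalist LMs, and the round-robin split is an assumed placeholder for however the paper actually forms its document subsets.

```python
def speculative_rag(question, documents, drafter, verifier, num_drafts=3):
    """Return the draft answer that the verifier scores highest.

    drafter(question, docs) -> str      hypothetical small specialist LM
    verifier(question, draft) -> float  hypothetical large generalist LM score
    """
    # Assumption: a round-robin split gives each draft a distinct subset.
    subsets = [documents[i::num_drafts] for i in range(num_drafts)]
    # Drafts are independent of each other, so they could be produced in parallel.
    drafts = [drafter(question, subset) for subset in subsets if subset]
    if not drafts:
        return None
    # One verification score per draft; keep the best-scoring draft.
    return max(drafts, key=lambda draft: verifier(question, draft))
```

Because each draft sees only a subset of the retrieved documents, its prompt is shorter than one containing every document, which is where the abstract's reduction in input tokens per draft comes from.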
Marco Peer, Robert Sablatnig, Olga Serbaeva, Isabelle Marthot-Santaniello
This paper presents a character-based approach for enhancing writer retrieval
performance in the context of Greek papyri. Our contribution lies in
introducing character-level annotations for frequently used characters, in our
case the trigram kai and four additional letters (epsilon, kappa, mu, omega),
in Greek texts. We use a state-of-the-art writer retrieval approach based on
NetVLAD and compare a character-level-based feature aggregation method against
the current default baseline of using small patches located at SIFT keypoint
locations for building the page descriptors. We demonstrate that by using only
about 15 characters per page, we are able to boost the performance by up to 4% mAP
(a relative improvement of 11%) on the GRK-120 dataset. Additionally, our
qualitative analysis offers insights into the similarity scores of SIFT patches
and specific characters. We publish the dataset with character-level
annotations, including a quality label and our binarized images for further
research.
Authors' comments: Submitted to ICPR 2024
Shiqi Li, Jihua Zhu, Yifan Xie, Mingchen Zhu
Multiview point cloud registration serves as a cornerstone of various computer vision tasks. Previous approaches typically adhere to a global paradigm, where a pose graph is initially constructed followed by motion synchronization to determine the absolute pose. However, this separated approach may not fully leverage the characteristics of multiview registration and might struggle with low-overlap scenarios. In this paper, we propose an incremental multiview point cloud registration method that progressively registers all scans to a growing meta-shape. To determine the incremental ordering, we employ a two-stage coarse-to-fine strategy for point cloud candidate retrieval. The first stage involves the coarse selection of scans based on neighbor fusion-enhanced global aggregation features, while the second stage further reranks candidates through geometric-based matching. Additionally, we apply a transformation averaging technique to mitigate accumulated errors during the registration process. Finally, we utilize a Reservoir sampling-based technique to address density variance issues while reducing computational load. Comprehensive experimental results across various benchmarks validate the effectiveness and generalization of our approach.
Rama Akkiraju, Anbang Xu, Deepak Bora, Tan Yu, Lu An, Vishal Seth, Aaditya Shukla, Pritam Gundecha et al.
Enterprise chatbots, powered by generative AI, are emerging as key
applications to enhance employee productivity. Retrieval Augmented Generation
(RAG), Large Language Models (LLMs), and orchestration frameworks like
Langchain and Llamaindex are crucial for building these chatbots. However,
creating effective enterprise chatbots is challenging and requires meticulous
RAG pipeline engineering. This includes fine-tuning embeddings and LLMs,
extracting documents from vector databases, rephrasing queries, reranking
results, designing prompts, honoring document access controls, providing
concise responses, including references, safeguarding personal information, and
building orchestration agents. We present a framework for building RAG-based
chatbots based on our experience with three NVIDIA chatbots: for IT/HR
benefits, financial earnings, and general content. Our contributions are
three-fold: introducing the FACTS framework (Freshness, Architectures, Cost,
Testing, Security), presenting fifteen RAG pipeline control points, and
providing empirical results on accuracy-latency tradeoffs between large and
small LLMs. To the best of our knowledge, this is the first paper of its kind
that provides a holistic view of the factors as well as solutions for building
secure enterprise-grade chatbots.
Authors' comments: 8 pages, 6 figures, 2 tables, Preprint submission to ACM CIKM 2024
Xinyu Zhu, Zhiguo Jiang, Kun Wu, Jun Shi, Yushan Zheng
Content-based histopathological image retrieval (CBHIR) has gained attention in recent years, offering the capability to return histopathology images that are content-wise similar to the query one from an established database. However, in clinical practice, the continuously expanding size of whole slide image (WSI) databases limits the practical application of current CBHIR methods. In this paper, we propose a Lifelong Whole Slide Retrieval (LWSR) framework to address the challenge of catastrophic forgetting through progressive model updating on a continuously growing retrieval database. Our framework aims to balance stability and plasticity during continuous learning. To preserve system plasticity, we utilize a local memory bank with a reservoir sampling method to save instances, which can comprehensively encompass the feature spaces of both old and new tasks. Furthermore, a distance consistency rehearsal (DCR) module is designed to ensure the retrieval queue's consistency for previous tasks, which is regarded as stability within a lifelong CBHIR system. We evaluated the proposed method on four public WSI datasets from TCGA projects. The experimental results demonstrate that the proposed method is effective and superior to state-of-the-art methods.
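Reservoir sampling, which the memory bank above relies on, keeps a fixed-size uniform sample of a stream whose length is not known in advance. Below is a minimal sketch of the classic Algorithm R; how the paper applies it to WSI instances is not described in the abstract, so the names here are illustrative.

```python
import random


def reservoir_sample(stream, capacity, rng=None):
    """Keep a uniform random sample of at most `capacity` items from a stream.

    After n items have been seen, every item has probability capacity / n
    of being in the memory (Algorithm R).
    """
    rng = rng or random.Random()
    memory = []
    for seen, item in enumerate(stream, start=1):
        if len(memory) < capacity:
            memory.append(item)  # fill phase: keep everything
        else:
            slot = rng.randrange(seen)  # uniform integer in [0, seen)
            if slot < capacity:
                memory[slot] = item  # replace a stored item at random
    return memory
```

For example, `reservoir_sample(range(1000), 10, random.Random(0))` returns ten distinct integers drawn from the stream while storing only ten items at any time.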
Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, Tianyu Gao
Literature search questions, such as "where can I find research on the
evaluation of consistency in generated summaries?" pose significant challenges
for modern search engines and retrieval systems. These questions often require
a deep understanding of research concepts and the ability to reason over entire
articles. In this work, we introduce LitSearch, a retrieval benchmark
comprising 597 realistic literature search queries about recent ML and NLP
papers. LitSearch is constructed using a combination of (1) questions generated
by GPT-4 based on paragraphs containing inline citations from research papers
and (2) questions about recently published papers, manually written by their
authors. All LitSearch questions were manually examined or edited by experts to
ensure high quality. We extensively benchmark state-of-the-art retrieval models
and also evaluate two LLM-based reranking pipelines. We find a significant
performance gap between BM25 and state-of-the-art dense retrievers, with a
24.8% difference in absolute recall@5. The LLM-based reranking strategies
further improve the best-performing dense retriever by 4.4%. Additionally,
commercial search engines and research tools like Google Search perform poorly
on LitSearch, lagging behind the best dense retriever by 32 points. Taken
together, these results show that LitSearch is an informative new testbed for
retrieval systems while catering to a real-world use case.
Authors' comments: Dataset and code available at
https://github.com/princeton-nlp/LitSearch
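The recall@5 figures quoted above measure the fraction of a query's relevant papers that appear among the top five results. A minimal sketch, with illustrative names that are not taken from the LitSearch code base:

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # undefined without relevant documents; report 0 by convention here
    hits = len(relevant.intersection(ranked_ids[:k]))
    return hits / len(relevant)
```

A gap such as the 24.8% one reported above is then a difference between two such values averaged over all queries.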
Jingxuan Yang, Juan Alday, Patrick Irwin
NEMESISPY is a Python package developed to perform parametric atmospheric
modelling and radiative transfer calculation for the retrievals of exoplanetary
spectra. It is a recent development of the well-established Fortran NEMESIS
library (P. G. J. Irwin et al., 2008), which has been applied to the
atmospheric retrievals of both solar system planets and exoplanets employing
numerous different observing geometries. NEMESISPY can be easily interfaced
with Bayesian inference algorithms to retrieve atmospheric properties from
spectroscopic observations. Recently, NEMESISPY has been applied to the
retrievals of Hubble and Spitzer data of a hot Jupiter (Yang et al., 2023), as
well as to JWST/Mid-Infrared Instrument (JWST/MIRI) data of a hot Jupiter (Yang
et al., 2024).
Authors' comments: Under review at the Journal of Open Source Software. Project website:
https://jingxuan97.github.io/nemesispy/
Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh
Scaling laws with respect to the amount of training data and the number of parameters allow us to predict the cost-benefit trade-offs of pretraining language models (LMs) in different configurations. In this paper, we consider another dimension of scaling: the amount of data available at inference time. Specifically, we find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation, such that a smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget. We carry out our study by constructing a 1.4 trillion-token datastore named MassiveDS, which is the largest and the most diverse open-sourced datastore for retrieval-based LMs to date, and designing an efficient pipeline for studying datastore scaling in a computationally accessible manner. Finally, we analyze the effect of improving the retriever, datastore quality filtering, and other design choices on our observed scaling trends. Overall, our results show that datastore size should be considered as an integral part of LM efficiency and performance trade-offs. To facilitate future research, we open-source our datastore and code at https://github.com/RulinShao/retrieval-scaling.
Silvio Kalaj, Clarissa Lauditi, Gabriele Perugini, Carlo Lucibello, Enrico M. Malatesta, Matteo Negri
It has been recently shown that a learning transition happens when a Hopfield Network stores examples generated as superpositions of random features, where new attractors corresponding to such features appear in the model. In this work we reveal that the network also develops attractors corresponding to previously unseen examples generated with the same set of features. We explain this surprising behaviour in terms of spurious states of the learned features: we argue that, increasing the number of stored examples beyond the learning transition, the model also learns to mix the features to represent both stored and previously unseen examples. We support this claim with the computation of the phase diagram of the model.
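As background for the notion of an attractor used above, here is a minimal sketch of a standard Hopfield network with Hebbian weights and asynchronous sign updates. It stores two fixed patterns and is not the random-features model analysed in the paper; all names are illustrative.

```python
def sign(x):
    """Sign function with ties broken towards +1."""
    return 1 if x >= 0 else -1


def hebbian_weights(patterns):
    """Hebbian coupling matrix J_ij = (1/N) * sum over patterns of p_i * p_j, zero diagonal."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights


def retrieve(weights, state, max_sweeps=10):
    """Update spins one at a time until a full sweep changes nothing (a fixed point)."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(w * s for w, s in zip(weights[i], state))
            new_spin = sign(field)
            if new_spin != state[i]:
                state[i] = new_spin
                changed = True
        if not changed:
            break
    return state
```

Starting from a stored pattern with one spin flipped, `retrieve` falls back onto that pattern. The paper's claim is that, past the learning transition, similar fixed points also exist for examples that were never stored.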
Bohan Hou, Haoqiang Lin, Haokun Wen, Meng Liu, Mingzhu Xu, Xuemeng Song
Composed Image Retrieval (CIR) is a challenging task that aims to retrieve
the target image with a multimodal query, i.e., a reference image and its
complementary modification text. As previous supervised or zero-shot learning
paradigms all fail to strike a good trade-off between the model's
generalization ability and retrieval performance, recent researchers have
introduced the task of few-shot CIR (FS-CIR) and proposed a textual
inversion-based network based on a pretrained CLIP model to realize it. Despite
its promising performance, the approach encounters two key limitations: simply
relying on the few annotated samples for CIR model training and
indiscriminately selecting training triplets for CIR model fine-tuning. To
address these two limitations, we propose a novel two-stage pseudo triplet
guided few-shot CIR scheme, dubbed PTG-FSCIR. In the first stage, we propose an
attentive masking and captioning-based pseudo triplet generation method, to
construct pseudo triplets from pure image data and use them to fulfill the
CIR-task specific pretraining. In the second stage, we propose a challenging
triplet-based CIR fine-tuning method, where we design a pseudo modification
text-based sample challenging score estimation strategy and a robust top
range-based random sampling strategy for sampling robust challenging triplets
to promote the model fine-tuning. Notably, our scheme is plug-and-play and
compatible with any existing supervised CIR models. We test our scheme across
two backbones on three public datasets (i.e., FashionIQ, CIRR, and
Birds-to-Words), achieving maximum improvements of 13.3%, 22.2%, and 17.4%
respectively, demonstrating our scheme's efficacy.
Authors' comments: 10 pages