Shuai Wang, Shengyao Zhuang, Bevan Koopman, Guido Zuccon
2D Matryoshka Training is an advanced embedding representation training approach designed to train an encoder model simultaneously across various layer-dimension setups. This method has demonstrated higher effectiveness in Semantic Text Similarity (STS) tasks over traditional training approaches when using sub-layers for embeddings. Despite its success, discrepancies exist between two published implementations, leading to varied comparative results with baseline models. In this reproducibility study, we implement and evaluate both versions of 2D Matryoshka Training on STS tasks and extend our analysis to retrieval tasks. Our findings indicate that while both versions achieve higher effectiveness than traditional Matryoshka training on sub-dimensions, and traditional full-sized model training approaches, they do not outperform models trained separately on specific sub-layer and sub-dimension setups. Moreover, these results generalize well to retrieval tasks, both in supervised (MSMARCO) and zero-shot (BEIR) settings. Further explorations of different loss computations reveal more suitable implementations for retrieval tasks, such as incorporating full-dimension loss and training on a broader range of target dimensions. Conversely, some intuitive approaches, such as fixing document encoders to full model outputs, do not yield improvements. Our reproduction code is available at https://github.com/ielab/2DMSE-Reproduce.
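To make the idea of a "layer-dimension setup" concrete, the following is a minimal sketch of how a sub-layer, sub-dimension embedding might be read out at inference time. It assumes that each layer's output has already been pooled into one vector and that a sub-dimension is simply the first `dim` values of that vector; the function names and the toy numbers are hypothetical and are not taken from the reproduction code.

```python
import math


def sub_embedding(pooled_by_layer, layer, dim):
    """Return the unit-length embedding of `layer` (1-indexed), cut to its first `dim` values.

    `pooled_by_layer` is assumed to hold one pooled vector per encoder layer.
    """
    if not 1 <= layer <= len(pooled_by_layer):
        raise ValueError("layer out of range")
    vector = pooled_by_layer[layer - 1]
    if not 1 <= dim <= len(vector):
        raise ValueError("dim out of range")
    prefix = vector[:dim]  # Matryoshka-style prefix truncation (assumption)
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)
    return [x / norm for x in prefix]


def cosine(a, b):
    """Dot product of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


# Toy pooled outputs for a 3-layer encoder with 4-dimensional states (made-up values).
doc_layers = [[0.1, 0.2, 0.3, 0.4], [3.0, 4.0, 0.0, 1.0], [1.0, 0.0, 2.0, 2.0]]
query_layers = [[0.4, 0.3, 0.2, 0.1], [4.0, 3.0, 1.0, 0.0], [1.0, 1.0, 2.0, 2.0]]

# Compare the pair at a small setup (layer 2, 2 dims) and at the full setup.
small_setup_score = cosine(sub_embedding(doc_layers, 2, 2), sub_embedding(query_layers, 2, 2))
full_setup_score = cosine(sub_embedding(doc_layers, 3, 4), sub_embedding(query_layers, 3, 4))
```

Training across setups would then amount to computing a loss on several such (layer, dim) read-outs at once; which read-outs are included is exactly where the two published implementations may differ.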
Mohammad Hassan Heydari, Arshia Hemmat, Erfan Naman, Afsaneh Fatemi
Retrieval Augmented Generation (RAG) has emerged as a widely adopted approach to mitigate the limitations of large language models (LLMs) in answering domain-specific questions. Previous research has predominantly focused on improving the accuracy and quality of retrieved data chunks to enhance the overall performance of the generation pipeline. However, despite ongoing advancements, the critical issue of retrieving irrelevant information -- which can impair the ability of the model to utilize its internal knowledge effectively -- has received minimal attention. In this work, we investigate the impact of retrieving irrelevant information in open-domain question answering, highlighting its significant detrimental effect on the quality of LLM outputs. To address this challenge, we propose the Context Awareness Gate (CAG) architecture, a novel mechanism that dynamically adjusts the LLMs' input prompt based on whether the user query necessitates external context retrieval. Additionally, we introduce the Vector Candidates method, a core mathematical component of CAG that is statistical, LLM-independent, and highly scalable. We further examine the distributions of relationships between contexts and questions, presenting a statistical analysis of these distributions. This analysis can be leveraged to enhance the context retrieval process in Retrieval Augmented Generation (RAG) systems.
Yunli Wang, Zixuan Yang, Zhen Zhang, Zhiqiang Wang, Jian Yang, Shiyang Wen, Peng Jiang, Kun Gai
The scaling law is a notable property of neural network models and has
significantly propelled the development of large language models. Scaling laws
hold great promise in guiding model design and resource allocation. Recent
research increasingly shows that scaling laws are not limited to NLP tasks or
Transformer architectures; they also apply to domains such as recommendation.
However, there is still a lack of literature on scaling law research in online
advertisement retrieval systems. This may be because 1) identifying the scaling
law for resource cost and online revenue is often expensive in both time and
training resources for large-scale industrial applications, and 2) varying
settings for different systems prevent the scaling law from being applied
across various scenarios. To address these issues, we propose a lightweight
paradigm to identify the scaling law of online revenue and machine cost for a
certain online advertisement retrieval scenario with a low experimental cost.
Specifically, we focus on a sole factor (FLOPs) and propose an offline metric
named R/R* that exhibits a high linear correlation with online revenue for
retrieval models. We estimate the machine cost offline via a simulation
algorithm. Thus, we can transform most online experiments into low-cost offline
experiments. We conduct comprehensive experiments to verify the effectiveness
of our proposed metric R/R* and to identify the scaling law in the online
advertisement retrieval system of Kuaishou. With the scaling law, we
demonstrate practical applications for ROI-constrained model designing and
multi-scenario resource allocation in the Kuaishou advertising system. To the best
of our knowledge, this is the first work to study the scaling laws for online
advertisement retrieval of real-world systems, showing great potential for
scaling law in advertising system optimization.
Authors' comments: 10 pages, 8 figures
Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Saee Paliwal, Arash Vahdat, Weili Nie
Fragment-based drug discovery, in which molecular fragments are assembled
into new molecules with desirable biochemical properties, has achieved great
success. However, many fragment-based molecule generation methods show limited
exploration beyond the existing fragments in the database as they only
reassemble or slightly modify the given ones. To tackle this problem, we
propose a new fragment-based molecule generation framework with retrieval
augmentation, namely Fragment Retrieval-Augmented Generation (f-RAG). f-RAG is
based on a pre-trained molecular generative model that proposes additional
fragments from input fragments to complete and generate a new molecule. Given a
fragment vocabulary, f-RAG retrieves two types of fragments: (1) hard
fragments, which serve as building blocks that will be explicitly included in
the newly generated molecule, and (2) soft fragments, which serve as reference
to guide the generation of new fragments through a trainable fragment injection
module. To extrapolate beyond the existing fragments, f-RAG updates the
fragment vocabulary with generated fragments via an iterative refinement
process which is further enhanced with post-hoc genetic fragment modification.
f-RAG can achieve an improved exploration-exploitation trade-off by maintaining
a pool of fragments and expanding it with novel and high-quality fragments
through a strong generative prior.
Authors' comments: NeurIPS 2024
Po-han Li, Yunhao Yang, Mohammad Omama, Sandeep Chinchali, Ufuk Topcu
Autonomous agents perceive and interpret their surroundings by integrating multimodal inputs, such as vision, audio, and LiDAR. These perceptual modalities support retrieval tasks, such as place recognition in robotics. However, current multimodal retrieval systems encounter difficulties when parts of the data are missing due to sensor failures or inaccessibility, such as silent videos or LiDAR scans lacking RGB information. We propose Any2Any, a novel retrieval framework that addresses scenarios where both query and reference instances have incomplete modalities. Unlike previous methods limited to the imputation of two modalities, Any2Any handles any number of modalities without training generative models. It calculates pairwise similarities with cross-modal encoders and employs a two-stage calibration process with conformal prediction to align the similarities. Any2Any enables effective retrieval across multimodal datasets, e.g., text-LiDAR and text-time series. It achieves a Recall@5 of 35% on the KITTI dataset, which is on par with baseline models with complete modalities.
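The calibration idea can be illustrated with a minimal sketch. It is not the Any2Any procedure itself: it assumes that each cross-modal encoder has a held-out set of calibration similarities, that a raw score is mapped to the fraction of calibration scores at or below it, and that the calibrated values of the available modality pairs are then averaged. The names `make_calibrator` and `combined_score` and all numbers are hypothetical.

```python
from bisect import bisect_right


def make_calibrator(calibration_scores):
    """Build a function mapping a raw similarity to its empirical rank in [0, 1]."""
    ordered = sorted(calibration_scores)
    if not ordered:
        raise ValueError("calibration set must not be empty")
    n = len(ordered)

    def calibrate(score):
        # Fraction of calibration scores that are <= the given score.
        return bisect_right(ordered, score) / n

    return calibrate


def combined_score(raw_scores, calibrators):
    """Average the calibrated scores over modality pairs that are present.

    `raw_scores` maps a pair name to a raw similarity, or to None when one of
    the two modalities is missing for this query/reference pair.
    """
    values = [calibrators[name](score)
              for name, score in raw_scores.items()
              if score is not None]
    return sum(values) / len(values) if values else None


# Two encoders whose raw similarities live on very different scales (made-up values).
calibrators = {
    "text-lidar": make_calibrator([0.1, 0.2, 0.3, 0.4]),
    "text-image": make_calibrator([10, 20, 30, 40]),
}
both_present = combined_score({"text-lidar": 0.25, "text-image": 40}, calibrators)
image_missing = combined_score({"text-lidar": 0.25, "text-image": None}, calibrators)
```

Mapping every encoder's output onto the same rank scale is what makes scores from different modality pairs comparable, so a missing pair can simply be left out of the average.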
Shreya Meel, Pasan Dissanayake, Mohamed Nomeir, Sanghamitra Dutta, Sennur Ulukus
In a classification task, counterfactual explanations provide the minimum change needed for an input to be classified into a favorable class. We consider the problem of privately retrieving the exact closest counterfactual from a database of accepted samples while enforcing that certain features of the input sample cannot be changed, i.e., they are immutable. An applicant (user) whose feature vector is rejected by a machine learning model wants to retrieve the sample closest to them in the database without altering a private subset of their features, which constitutes the immutable set. While doing this, the user should keep their feature vector, immutable set, and the resulting counterfactual index information-theoretically private from the institution. We refer to this as the immutable private counterfactual retrieval (I-PCR) problem, which generalizes PCR to a more practical setting. In this paper, we propose two I-PCR schemes by leveraging techniques from private information retrieval (PIR) and characterize their communication costs. Further, we quantify the information that the user learns about the database and compare it for the proposed schemes.
Chi Liu, Jiangxia Cao, Rui Huang, Kai Zheng, Qiang Luo, Kun Gai, Guorui Zhou
In large-scale content recommendation systems, retrieval serves as the initial stage in the pipeline, responsible for selecting thousands of candidate items from billions of options to pass on to ranking modules. Traditionally, the dominant retrieval method has been Embedding-Based Retrieval (EBR) using a Deep Neural Network (DNN) dual-tower structure. Applying transformers to retrieval tasks has been the focus of recent research, but real-world industrial deployment still presents significant challenges. In this paper, we introduce KuaiFormer, a novel transformer-based retrieval framework deployed in a large-scale content recommendation system. KuaiFormer fundamentally redefines the retrieval process by shifting from conventional score estimation tasks (such as click-through rate estimation) to a transformer-driven Next Action Prediction paradigm. This shift enables more effective real-time interest acquisition and multi-interest extraction, significantly enhancing retrieval performance. KuaiFormer has been successfully integrated into Kuaishou App's short-video recommendation system since May 2024, serving over 400 million daily active users and resulting in a marked increase in average daily usage time of Kuaishou users. We provide insights into both the technical and business aspects of deploying transformers in large-scale recommendation systems, addressing practical challenges encountered during industrial implementation. Our findings offer valuable guidance for engineers and researchers aiming to leverage transformer models to optimize large-scale content recommendation systems.
Alexandria Leto, Cecilia Aguerrebere, Ishwar Bhati, Ted Willke, Mariano Tepper, Vy Ai Vo
Retrieval-augmented generation (RAG) is a promising method for addressing
some of the memory-related challenges associated with Large Language Models
(LLMs). Two separate systems form the RAG pipeline, the retriever and the
reader, and the impact of each on downstream task performance is not
well-understood. Here, we work towards the goal of understanding how retrievers
can be optimized for RAG pipelines for common tasks such as Question Answering
(QA). We conduct experiments focused on the relationship between retrieval and
RAG performance on QA and attributed QA and unveil a number of insights useful
to practitioners developing high-performance RAG pipelines. For example,
lowering search accuracy has minor implications for RAG performance while
potentially increasing retrieval speed and memory efficiency.
Authors' comments: Accepted to NeurIPS 2024 Workshop ATTRIB
Tevin Wang, Jingyuan He, Chenyan Xiong
Retrieval-augmented generation (RAG) combines knowledge from domain-specific sources into large language models to ground answer generation. Current RAG systems lack customizable visibility on the context documents and the model's attentiveness towards such documents. We propose RAGViz, a RAG diagnosis tool that visualizes the attentiveness of the generated tokens in retrieved documents. With a built-in user interface, retrieval index, and Large Language Model (LLM) backbone, RAGViz provides two main functionalities: (1) token and document-level attention visualization, and (2) generation comparison upon context document addition and removal. As an open-source toolkit, RAGViz can be easily hosted with a custom embedding model and HuggingFace-supported LLM backbone. Using a hybrid ANN (Approximate Nearest Neighbor) index, memory-efficient LLM inference tool, and custom context snippet method, RAGViz operates efficiently with a median query time of about 5 seconds on a moderate GPU node. Our code is available at https://github.com/cxcscmu/RAGViz. A demo video of RAGViz can be found at https://youtu.be/cTAbuTu6ur4.
Neil Chowdhury, Franklin Wang, Sumedh Shenoy, Douwe Kiela, Sarah Schwettmann, Tristan Thrush
Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
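A training-free correction of this kind can be sketched as follows. The abstract does not spell out the formula, so the specific form used here is an assumption: each retrieval candidate gets a bias equal to `alpha` times the mean of its `k` highest similarities to a reference set of queries, and that bias is subtracted from the candidate's score at query time. The function names and toy values are hypothetical.

```python
def nnn_bias(candidate_to_reference_scores, k, alpha):
    """Bias for one candidate: alpha times the mean of its top-k reference similarities."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = sorted(candidate_to_reference_scores, reverse=True)[:k]
    if not top_k:
        raise ValueError("reference scores must not be empty")
    return alpha * sum(top_k) / len(top_k)


def corrected_scores(query_scores, biases):
    """Subtract each candidate's precomputed bias from its raw query score."""
    return [score - bias for score, bias in zip(query_scores, biases)]


# Similarities of two candidates to four reference queries (made-up values).
# Candidate 0 scores highly against many unrelated queries, candidate 1 does not.
reference_scores = [[0.9, 0.1, 0.8, 0.2], [0.2, 0.1, 0.3, 0.2]]
biases = [nnn_bias(scores, k=2, alpha=0.5) for scores in reference_scores]

raw = [0.7, 0.6]                      # raw scores of one test query
fixed = corrected_scores(raw, biases)
best_before = max(range(len(raw)), key=raw.__getitem__)
best_after = max(range(len(fixed)), key=fixed.__getitem__)
```

In this toy case the raw ranking prefers candidate 0, while the corrected ranking prefers candidate 1, because candidate 0 is penalised for matching many reference queries. The biases depend only on the reference database, so they can be computed once and reused for every query.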
Yifan Du, Yong Meng Sua, Santosh Kumar, Jiuyi Zhang, Xiangzhi Li, Yongxiang Hu, Parminder Ghuman, Yuping Huang
We demonstrate a chip-integrated emission spectroscope capable of retrieving
the temperature of the light sources. It consists of a single photon detector
with low dark counts and a sweeping on-chip filter with 2 pm spectral
resolution in the visible and near-infrared regimes. With wildfire sensing
applications in mind, we test our system with a hollow cathode lamp to simulate
the K-line emission, and show how the models of Doppler and collision
broadening in the plasma can be used for temperature retrieval. With favorable
device parameters, high spectral resolution, and a novel temperature retrieval
capability, our technique may find broad applications in environmental
monitoring, astrophysics, plasma physics, and so on.
Authors' comments: 12 pages, 13 figures
Xinyu Zhao, Fangcong Yin, Greg Durrett
Long-context LLMs are increasingly in demand for applications such as
retrieval-augmented generation. To defray the cost of pretraining LLMs over
long contexts, recent work takes an approach of synthetic context extension:
fine-tuning LLMs with synthetically generated long-context data in a
post-training stage. However, it remains unclear how and why this synthetic
context extension imparts abilities for downstream long-context tasks. In this
paper, we investigate fine-tuning on synthetic data for three long-context
tasks that require retrieval and reasoning. We vary the realism of "needle"
concepts to be retrieved and diversity of the surrounding "haystack" context,
from using LLMs to construct synthetic documents to using templated relations
and creating symbolic datasets. We find that models trained on synthetic data
fall short of the real data, but surprisingly, the mismatch can be interpreted
and even predicted in terms of a special set of attention heads that are
responsible for retrieval over long context, retrieval heads (Wu et al., 2024).
The retrieval heads learned on synthetic data have high overlap with retrieval
heads learned on real data, and there is a strong correlation between the
recall of heads learned and the downstream performance of a model. Furthermore,
with attention knockout and activation patching, we mechanistically show that
retrieval heads are necessary and explain model performance, although they are
not totally sufficient. Our results shed light on how to interpret synthetic
data fine-tuning performance and how to approach creating better data for
learning real-world capabilities over long contexts.
Authors' comments: Published at ICML 2025
Chaeyun Jang, Hyungi Lee, Seanie Lee, Juho Lee
Recently, large language models (LLMs) have been increasingly used to support various decision-making tasks, assisting humans in making informed decisions. However, when LLMs confidently provide incorrect information, it can lead humans to make suboptimal decisions. To prevent LLMs from generating incorrect information on topics they are unsure of and to improve the accuracy of generated content, prior works have proposed Retrieval Augmented Generation (RAG), where external documents are referenced to generate responses. However, traditional RAG methods focus only on retrieving documents most relevant to the input query, without specifically aiming to ensure that the human user's decisions are well-calibrated. To address this limitation, we propose a novel retrieval method called Calibrated Retrieval-Augmented Generation (CalibRAG), which ensures that decisions informed by the retrieved documents are well-calibrated. Then we empirically validate that CalibRAG improves calibration performance as well as accuracy, compared to other baselines across various datasets.
Mandeep Rathee, Sean MacAvaney, Avishek Anand
Building relevance models to rank documents based on user information needs
is a central task in the information retrieval and NLP communities. Beyond the
direct ad-hoc search setting, many knowledge-intensive tasks are powered by a
first-stage retrieval step for context selection, followed by a more involved
task-specific model. However, most such pipelines are inherently
limited by the recall of the initially ranked documents. Recently, adaptive
re-ranking techniques have been proposed to overcome this issue by continually
selecting documents from the whole corpus, rather than only considering an
initial pool of documents. However, so far these approaches have been limited
to heuristic design choices, particularly in terms of the criteria for document
selection. In this work, we propose a unifying view of the nascent area of
adaptive retrieval by introducing Quam, a query-affinity model that
exploits the relevance-aware document similarity graph to improve recall,
especially for low re-ranking budgets. Our extensive experimental evidence
shows that our proposed approach, Quam, improves the recall performance by up to
26% over the standard re-ranking baselines. Further, the query affinity
modelling and relevance-aware document graph modules can be injected into any
adaptive retrieval approach. The experimental results show that, with these
modules, the existing adaptive retrieval approach improves recall by up to 12%.
The code of our work is available at https://github.com/Mandeep-Rathee/quam.
Authors' comments: 15 pages, 10 figures
Dehai Min, Zhiyang Xu, Guilin Qi, Lifu Huang, Chenyu You
Existing information retrieval (IR) models often assume a homogeneous structure for knowledge sources and user queries, limiting their applicability in real-world settings where retrieval is inherently heterogeneous and diverse. In this paper, we introduce UniHGKR, a unified instruction-aware heterogeneous knowledge retriever that (1) builds a unified retrieval space for heterogeneous knowledge and (2) follows diverse user instructions to retrieve knowledge of specified types. UniHGKR consists of three principal stages: heterogeneous self-supervised pretraining, text-anchored embedding alignment, and instruction-aware retriever fine-tuning, enabling it to generalize across varied retrieval contexts. This framework is highly scalable, with a BERT-based version and a UniHGKR-7B version trained on large language models. Also, we introduce CompMix-IR, the first native heterogeneous knowledge retrieval benchmark. It includes two retrieval scenarios with various instructions, over 9,400 question-answer (QA) pairs, and a corpus of 10 million entries, covering four different types of data. Extensive experiments show that UniHGKR consistently outperforms state-of-the-art methods on CompMix-IR, achieving up to 6.36% and 54.23% relative improvements in two scenarios, respectively. Finally, by equipping our retriever for open-domain heterogeneous QA systems, we achieve a new state-of-the-art result on the popular ConvMix task, with an absolute improvement of up to 4.80 points.
Dong-han Yeom
In this article, we review the information loss paradox in the spirit of the
Euclidean path integral approach. First, we note that there has been a long debate
about the information loss paradox, and argue that the non-perturbative quantum
gravitational wave function must include the clue to the paradox. The Euclidean
path integral approach provides the best way to describe the wave function.
From this wave function, we can notice that there are not only semi-classical
but also non-perturbative contributions, which are highly suppressed but
preserve information. Information retrieval will be sufficiently explained if
such non-perturbative contributions dominate at late times. We will
show that there is sufficient evidence that this scenario can be realized in
generic circumstances. Finally, we compare this scenario with alternative
approaches. Also, we comment on some unresolved issues that need to be
clarified.
Authors' comments: 28 pages, 10 figures; Invited chapter for the edited book "The Black
Hole Information Paradox" (Eds. Ali Akil and Cosimo Bambi, Springer
Singapore, expected in 2025)
Hadeel Saadany, Swapnil Bhosale, Samarth Agrawal, Diptesh Kanojia, Constantin Orasan, Zhe Wu
This paper addresses the challenge of improving user experience on e-commerce
platforms by enhancing product ranking relevant to users' search queries.
Ambiguity and complexity of user queries often lead to a mismatch between the
user's intent and retrieved product titles or documents. Recent approaches have
proposed the use of Transformer-based models, which need millions of annotated
query-title pairs during the pre-training stage, and this data often does not
take user intent into account. To tackle this, we curate samples from existing
datasets at eBay, manually annotated with buyer-centric relevance scores and
centrality scores, which reflect how well the product title matches the users'
intent. We introduce a User-intent Centrality Optimization (UCO) approach for
existing models, which optimises for the user intent in semantic product
search. To that end, we propose a dual-loss based optimisation to handle hard
negatives, i.e., product titles that are semantically relevant but do not
reflect the user's intent. Our contributions include curating challenging
evaluation sets and implementing UCO, resulting in significant product ranking
efficiency improvements observed for different evaluation metrics. Our work
aims to ensure that the most buyer-centric titles for a query are ranked
higher, thereby enhancing the user experience on e-commerce platforms.
Authors' comments: EMNLP 2024: Industry track
Lu Dai, Hao Liu, Hui Xiong
A retrieval module can be plugged into many downstream NLP tasks to improve
their performance, such as open-domain question answering and
retrieval-augmented generation. The key to a retrieval system is to calculate
relevance scores for query-passage pairs. However, the definition of
relevance is often ambiguous. We observed that a major class of relevance
aligns with the concept of entailment in NLI tasks. Based on this observation,
we designed a method called entailment tuning to improve the embedding of dense
retrievers. Specifically, we unify the form of retrieval data and NLI data
using existence claims as a bridge. Then, we train retrievers to predict the
claims entailed in a passage with a variant task of masked prediction. Our
method can be efficiently plugged into current dense retrieval methods, and
experiments show the effectiveness of our method.
Authors' comments: EMNLP 2024 Main
Paul Youssef, Jörg Schlötterer, Christin Seifert
Pre-trained Language Models (PLMs) encode various facts about the world during their pre-training phase as they are trained to predict the next or missing word in a sentence. There has been an interest in quantifying and improving the amount of facts that can be extracted from PLMs, as they have been envisioned to act as soft knowledge bases, which can be queried in natural language. Different approaches exist to enhance fact retrieval from PLMs. Recent work shows that the hidden states of PLMs can be leveraged to determine the truthfulness of the PLMs' inputs. Leveraging this finding to improve factual knowledge retrieval remains unexplored. In this work, we investigate the use of a helper model to improve fact retrieval. The helper model assesses the truthfulness of an input based on the corresponding hidden-state representations from the PLMs. We evaluate this approach on several masked PLMs and show that it enhances fact retrieval by up to 33%. Our findings highlight the potential of hidden-state representations from PLMs in improving their factual knowledge retrieval.
Wenjia Zhai
Traditional Retrieval-Augmented Generation (RAG) methods are limited by their reliance on a fixed number of retrieved documents, often resulting in incomplete or noisy information that undermines task performance. Although recent adaptive approaches have alleviated these problems, their application in intricate, real-world multimodal tasks remains limited. To address these limitations, we propose a new approach called Self-adaptive Multimodal Retrieval-Augmented Generation (SAM-RAG), tailored specifically for multimodal contexts. SAM-RAG not only dynamically filters relevant documents based on the input query, including image captions when needed, but also verifies the quality of both the retrieved documents and the output. Extensive experimental results show that SAM-RAG surpasses existing state-of-the-art methods in both retrieval accuracy and response generation. Further ablation experiments and effectiveness analysis show that SAM-RAG maintains high recall quality while improving overall task performance in multimodal RAG tasks. Our code is available at https://github.com/SAM-RAG/SAM_RAG.