Fabian Paischer, Liu Yang, Linfeng Liu, Shuai Shao, Kaveh Hassani, Jiacheng Li, Ricky Chen, Zhang Gabriel Li et al.
Sequential recommendation systems aim to provide personalized recommendations
for users based on their interaction history. To achieve this, they often
incorporate auxiliary information, such as textual descriptions of items and
auxiliary tasks, like predicting user preferences and intent. Despite numerous
efforts to enhance these models, they still suffer from limited
personalization. To address this issue, we propose a new paradigm, which we
term preference discerning. In preference discerning, we explicitly condition a
generative sequential recommendation system on user preferences within its
context. To this end, we generate user preferences using Large Language Models
(LLMs) based on user reviews and item-specific data. To evaluate preference
discerning capabilities of sequential recommendation systems, we introduce a
novel benchmark that provides a holistic evaluation across various scenarios,
including preference steering and sentiment following. We assess current
state-of-the-art methods using our benchmark and show that they struggle to
accurately discern user preferences. Therefore, we propose a new method named
Mender ($\textbf{M}$ultimodal Prefer$\textbf{en}$ce
$\textbf{d}$iscern$\textbf{er}$), which improves upon existing methods and
achieves state-of-the-art performance on our benchmark. Our results show that
Mender can be effectively guided by human preferences even though they have not
been observed during training, paving the way toward more personalized
sequential recommendation systems. We will open-source the code and benchmarks
upon publication.
Authors' comments: 11 pages + references and appendix
Haiyang Peng, Deren Han, Linbin Li, Meng Huang
This paper aims to address the phase retrieval problem from subgaussian measurements with arbitrary noise, with a focus on devising robust and efficient algorithms for solving non-convex problems. To ensure uniqueness of solutions in the subgaussian setting, we explore two commonly used assumptions: either the subgaussian measurements satisfy a fourth-moment condition or the target signals exhibit non-peakiness. For each scenario, we introduce a novel spectral initialization method that yields robust initial estimates. Building on this, we employ leave-one-out arguments to show that the classical Wirtinger flow algorithm achieves a linear rate of convergence for both real-valued and complex-valued cases, provided the sampling complexity satisfies $m\ge O(n \log^3 m)$, where $n$ is the dimension of the underlying signals. In contrast to existing work, our algorithms are regularization-free, requiring no truncation, trimming, or additional penalty terms, and they permit step sizes as large as $O(1)$, compared to $O(1/n)$ in previous literature. Furthermore, our results accommodate arbitrary noise vectors that meet certain statistical conditions, covering a wide range of noise scenarios, with sub-exponential noise as a notable special case. The effectiveness of our algorithms is validated through various numerical experiments. We emphasize that our findings provide the first theoretical guarantees for recovering non-peaky signals using non-convex methods from Bernoulli measurements, which is of independent interest.
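As a rough illustration of the pipeline the abstract describes (spectral initialization followed by Wirtinger flow with a constant step size), here is a minimal sketch for the real-valued, noiseless case. It uses Gaussian measurements and the textbook spectral initializer, not the paper's new initialization for subgaussian measurements; the function names, the step size of 0.1, and the problem sizes are illustrative assumptions.

```python
import numpy as np

def spectral_init(A, y):
    """Textbook spectral initializer (an assumption, not the paper's variant):
    leading eigenvector of (1/m) * sum_i y_i a_i a_i^T, scaled by sqrt(mean(y))."""
    m = A.shape[0]
    Y = (A.T * y) @ A / m               # weighted sample covariance, shape (n, n)
    _, eigvecs = np.linalg.eigh(Y)      # eigh returns eigenvalues in ascending order
    return np.sqrt(y.mean()) * eigvecs[:, -1]

def wirtinger_flow(A, y, step=0.1, iters=1000):
    """Plain gradient descent on f(z) = (1/4m) * sum_i ((a_i^T z)^2 - y_i)^2,
    with no truncation, trimming, or penalty terms."""
    m = A.shape[0]
    z = spectral_init(A, y)
    for _ in range(iters):
        Az = A @ z
        grad = A.T @ ((Az ** 2 - y) * Az) / m
        z = z - step * grad
    return z

def demo(n=10, m=400, seed=0):
    """Recover a unit-norm signal from m noiseless quadratic measurements."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    A = rng.standard_normal((m, n))     # Gaussian measurement vectors (assumption)
    y = (A @ x) ** 2                    # phaseless measurements
    z = wirtinger_flow(A, y)
    # The signal is identifiable only up to a global sign in the real case.
    return min(np.linalg.norm(z - x), np.linalg.norm(z + x))
```

Calling `demo()` returns the recovery error up to the sign ambiguity. The step size is fixed on the assumption that the signal has roughly unit norm.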
Nadia Sheikh, Anne-Laure Jousse, Daniel Buades Marcos, Akintunde Oladipo, Olivier Rousseau, Jimmy Lin
Given the dominance of dense retrievers that do not generalize well beyond their training dataset distributions, domain-specific test sets are essential in evaluating retrieval. There are few test datasets for retrieval systems intended for use by healthcare providers in a point-of-care setting. To fill this gap we have collaborated with medical professionals to create CURE, an ad-hoc retrieval test dataset for passage ranking with 2000 queries spanning 10 medical domains with a monolingual (English) and two cross-lingual (French/Spanish -> English) conditions. In this paper, we describe how CURE was constructed and provide baseline results to showcase its effectiveness as an evaluation tool. CURE is published with a Creative Commons Attribution Non Commercial 4.0 license and can be accessed on Hugging Face.
Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou
Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs). In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance models' MT ability. However, a large amount of world knowledge is organized in unstructured documents, and might not be fully paired across different languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark to train and evaluate LLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples collected via GPT-4o and human translators. Besides, documents from different languages are also provided to supply the knowledge to these samples. Based on RAGtrans, we further propose a multi-task training method to teach LLMs how to use information from multilingual documents during their translation. The method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements. Extensive experiments show that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores.
Puxuan Yu, Luke Merrick, Gaurav Nuti, Daniel Campos
This paper presents the training methodology of Arctic-Embed 2.0, a set of
open-source text embedding models built for accurate and efficient multilingual
retrieval. While prior works have suffered from degraded English retrieval
quality, Arctic-Embed 2.0 delivers competitive retrieval quality on
multilingual and English-only benchmarks, and supports Matryoshka
Representation Learning (MRL) for efficient embedding storage with
significantly lower compressed quality degradation compared to alternatives. We
detail the design and implementation, presenting several important open
research questions that arose during model development. We conduct experiments
exploring these research questions and include an extensive analysis aimed at
fostering further discussion in this field.
Authors' comments: 10 pages, 5 figures, 3 tables
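To make the storage benefit of Matryoshka Representation Learning concrete, the sketch below shows the usual way an MRL-trained embedding is shortened at inference time: keep the leading coordinates and re-normalize. It is a generic illustration, not Arctic-Embed 2.0 code; the function name and the dimensions are assumptions.

```python
import numpy as np

def truncate_embeddings(embeddings, dim):
    """Keep the first `dim` coordinates of each row and re-normalize to unit length.

    With an MRL-trained model the leading coordinates are trained to carry most
    of the signal, so the shortened vectors remain usable for cosine similarity.
    """
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)   # guard against zero vectors

# Hypothetical usage: shrink 8-dimensional embeddings to 4 dimensions.
rng = np.random.default_rng(0)
full = rng.standard_normal((4, 8))
small = truncate_embeddings(full, 4)
```

Storage per vector shrinks in proportion to `dim`. How much retrieval quality is lost depends on the model and has to be measured.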
Joel Suro
Retrieval-Augmented Generation (RAG) architectures have recently garnered significant attention for their ability to improve truth grounding and coherence in natural language processing tasks. However, the reliability of RAG systems in producing accurate answers diminishes as the volume of data they access increases. Even with smaller datasets, these systems occasionally fail to address simple queries. This issue arises from their dependence on state-of-the-art large language models (LLMs), which can introduce uncertainty into the system's outputs. In this work, I propose a novel Comparative RAG system that introduces an evaluator module to bridge the gap between probabilistic RAG systems and deterministically verifiable responses. The evaluator compares external recommendations with the retrieved document chunks, adding a decision-making layer that enhances the system's reliability. This approach ensures that the chunks retrieved are both semantically relevant and logically consistent with deterministic insights, thereby improving the accuracy and overall efficiency of RAG systems. This framework paves the way for more reliable and scalable question-answering applications in domains requiring high precision and verifiability.
Hongji Yang, Yiru Li, Yingying Zhu
Information retrieval techniques have demonstrated exceptional capabilities in identifying semantic similarities across diverse domains through robust feature representations. However, their potential in guiding synthesis tasks, particularly cross-view image synthesis, remains underexplored. Cross-view image synthesis presents significant challenges in establishing reliable correspondences between drastically different viewpoints. To address this, we propose a novel retrieval-guided framework that reimagines how retrieval techniques can facilitate effective cross-view image synthesis. Unlike existing methods that rely on auxiliary information, such as semantic segmentation maps or preprocessing modules, our retrieval-guided framework captures semantic similarities across different viewpoints, trained through contrastive learning to create a smooth embedding space. Furthermore, a novel fusion mechanism leverages these embeddings to guide image synthesis while learning and encoding both view-invariant and view-specific features. To further advance this area, we introduce VIGOR-GEN, a new urban-focused dataset with complex viewpoint variations in real-world scenarios. Extensive experiments demonstrate that our retrieval-guided approach significantly outperforms existing methods on the CVUSA, CVACT and VIGOR-GEN datasets, particularly in retrieval accuracy (R@1) and synthesis quality (FID). Our work bridges information retrieval and synthesis tasks, offering insights into how retrieval techniques can address complex cross-domain synthesis challenges.
Batuhan Sariturk, Rabia Bayraktar, Merve Elmas Erdem
With the rise of online education platforms, there is a growing abundance of educational content across various domains. It can be difficult to navigate the numerous available resources to find the most suitable training, especially in domains that include many interconnected areas, such as ICT. In this study, we propose a domain-specific chatbot application that requires limited resources, utilizing versions of the Phi language model to help learners with educational content. In the proposed method, Phi-2 and Phi-3 models were fine-tuned using QLoRA. The data required for fine-tuning was obtained from the Huawei Talent Platform, where courses are available at different levels of expertise in the field of computer science. A RAG system was used to support the model, which was fine-tuned on 500 Q&A pairs. Additionally, a total of 420 Q&A pairs of content were extracted from different formats such as JSON, PPT, and DOC to create a vector database to be used in the RAG system. By using the fine-tuned model and the RAG approach together, chatbots with different competencies were obtained. The questions posed to the resulting chatbots and their answers were saved separately and evaluated using ROUGE, BERTScore, METEOR, and BLEU metrics. The precision value of the Phi-2 model supported by RAG was 0.84 and the F1 score was 0.82. In addition to a total of 13 different evaluation metrics in 4 different categories, the answers of each model were compared with the created content and the most appropriate method was selected for real-life applications.
Shuai Wang, Shengyao Zhuang, Bevan Koopman, Guido Zuccon
2D Matryoshka Training is an advanced embedding representation training approach designed to train an encoder model simultaneously across various layer-dimension setups. This method has demonstrated higher effectiveness in Semantic Text Similarity (STS) tasks over traditional training approaches when using sub-layers for embeddings. Despite its success, discrepancies exist between two published implementations, leading to varied comparative results with baseline models. In this reproducibility study, we implement and evaluate both versions of 2D Matryoshka Training on STS tasks and extend our analysis to retrieval tasks. Our findings indicate that while both versions achieve higher effectiveness than traditional Matryoshka training on sub-dimensions, and traditional full-sized model training approaches, they do not outperform models trained separately on specific sub-layer and sub-dimension setups. Moreover, these results generalize well to retrieval tasks, both in supervised (MSMARCO) and zero-shot (BEIR) settings. Further explorations of different loss computations reveal more suitable implementations for retrieval tasks, such as incorporating full-dimension loss and training on a broader range of target dimensions. Conversely, some intuitive approaches, such as fixing document encoders to full model outputs, do not yield improvements. Our reproduction code is available at https://github.com/ielab/2DMSE-Reproduce.
Mohammad Hassan Heydari, Arshia Hemmat, Erfan Naman, Afsaneh Fatemi
Retrieval Augmented Generation (RAG) has emerged as a widely adopted approach to mitigate the limitations of large language models (LLMs) in answering domain-specific questions. Previous research has predominantly focused on improving the accuracy and quality of retrieved data chunks to enhance the overall performance of the generation pipeline. However, despite ongoing advancements, the critical issue of retrieving irrelevant information -- which can impair the ability of the model to utilize its internal knowledge effectively -- has received minimal attention. In this work, we investigate the impact of retrieving irrelevant information in open-domain question answering, highlighting its significant detrimental effect on the quality of LLM outputs. To address this challenge, we propose the Context Awareness Gate (CAG) architecture, a novel mechanism that dynamically adjusts the LLM's input prompt based on whether the user query necessitates external context retrieval. Additionally, we introduce the Vector Candidates method, a core mathematical component of CAG that is statistical, LLM-independent, and highly scalable. We further examine the distributions of relationships between contexts and questions, presenting a statistical analysis of these distributions. This analysis can be leveraged to enhance the context retrieval process in RAG systems.
Yunli Wang, Zixuan Yang, Zhen Zhang, Zhiqiang Wang, Jian Yang, Shiyang Wen, Peng Jiang, Kun Gai
The scaling law is a notable property of neural network models and has
significantly propelled the development of large language models. Scaling laws
hold great promise in guiding model design and resource allocation. Recent
research increasingly shows that scaling laws are not limited to NLP tasks or
Transformer architectures; they also apply to domains such as recommendation.
However, there is still a lack of literature on scaling law research in online
advertisement retrieval systems. This may be because 1) identifying the scaling
law for resource cost and online revenue is often expensive in both time and
training resources for large-scale industrial applications, and 2) varying
settings for different systems prevent the scaling law from being applied
across various scenarios. To address these issues, we propose a lightweight
paradigm to identify the scaling law of online revenue and machine cost for a
certain online advertisement retrieval scenario with a low experimental cost.
Specifically, we focus on a sole factor (FLOPs) and propose an offline metric
named R/R* that exhibits a high linear correlation with online revenue for
retrieval models. We estimate the machine cost offline via a simulation
algorithm. Thus, we can transform most online experiments into low-cost offline
experiments. We conduct comprehensive experiments to verify the effectiveness
of our proposed metric R/R* and to identify the scaling law in the online
advertisement retrieval system of Kuaishou. With the scaling law, we
demonstrate practical applications for ROI-constrained model designing and
multi-scenario resource allocation in the Kuaishou advertising system. To the best
of our knowledge, this is the first work to study the scaling laws for online
advertisement retrieval of real-world systems, showing great potential for
scaling law in advertising system optimization.
Authors' comments: 10 pages, 8 figures
Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Saee Paliwal, Arash Vahdat, Weili Nie
Fragment-based drug discovery, in which molecular fragments are assembled
into new molecules with desirable biochemical properties, has achieved great
success. However, many fragment-based molecule generation methods show limited
exploration beyond the existing fragments in the database as they only
reassemble or slightly modify the given ones. To tackle this problem, we
propose a new fragment-based molecule generation framework with retrieval
augmentation, namely Fragment Retrieval-Augmented Generation (f-RAG). f-RAG is
based on a pre-trained molecular generative model that proposes additional
fragments from input fragments to complete and generate a new molecule. Given a
fragment vocabulary, f-RAG retrieves two types of fragments: (1) hard
fragments, which serve as building blocks that will be explicitly included in
the newly generated molecule, and (2) soft fragments, which serve as reference
to guide the generation of new fragments through a trainable fragment injection
module. To extrapolate beyond the existing fragments, f-RAG updates the
fragment vocabulary with generated fragments via an iterative refinement
process which is further enhanced with post-hoc genetic fragment modification.
f-RAG can achieve an improved exploration-exploitation trade-off by maintaining
a pool of fragments and expanding it with novel and high-quality fragments
through a strong generative prior.
Authors' comments: NeurIPS 2024
Po-han Li, Yunhao Yang, Mohammad Omama, Sandeep Chinchali, Ufuk Topcu
Autonomous agents perceive and interpret their surroundings by integrating multimodal inputs, such as vision, audio, and LiDAR. These perceptual modalities support retrieval tasks, such as place recognition in robotics. However, current multimodal retrieval systems encounter difficulties when parts of the data are missing due to sensor failures or inaccessibility, such as silent videos or LiDAR scans lacking RGB information. We propose Any2Any, a novel retrieval framework that addresses scenarios where both query and reference instances have incomplete modalities. Unlike previous methods limited to the imputation of two modalities, Any2Any handles any number of modalities without training generative models. It calculates pairwise similarities with cross-modal encoders and employs a two-stage calibration process with conformal prediction to align the similarities. Any2Any enables effective retrieval across multimodal datasets, e.g., text-LiDAR and text-time series. It achieves a Recall@5 of 35% on the KITTI dataset, which is on par with baseline models with complete modalities.
Shreya Meel, Pasan Dissanayake, Mohamed Nomeir, Sanghamitra Dutta, Sennur Ulukus
In a classification task, counterfactual explanations provide the minimum change needed for an input to be classified into a favorable class. We consider the problem of privately retrieving the exact closest counterfactual from a database of accepted samples while enforcing that certain features of the input sample cannot be changed, i.e., they are \emph{immutable}. An applicant (user) whose feature vector is rejected by a machine learning model wants to retrieve the sample closest to them in the database without altering a private subset of their features, which constitutes the immutable set. While doing this, the user should keep their feature vector, immutable set, and the resulting counterfactual index information-theoretically private from the institution. We refer to this as the immutable private counterfactual retrieval (I-PCR) problem, which generalizes PCR to a more practical setting. In this paper, we propose two I-PCR schemes by leveraging techniques from private information retrieval (PIR) and characterize their communication costs. Further, we quantify the information that the user learns about the database and compare it for the proposed schemes.
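For readers unfamiliar with the PIR primitive the abstract builds on, the following is a minimal sketch of the classical two-server XOR scheme, in which a user fetches one record without either server learning which one. It is not one of the paper's I-PCR schemes: it ignores distances and immutable features, and it assumes two non-colluding servers that hold identical copies of a database of integer records. All names are illustrative.

```python
import secrets
from functools import reduce

def server_answer(database, indices):
    """What a server computes: the XOR of the requested records.
    Each server sees only an index set that is uniformly random on its own."""
    return reduce(lambda acc, i: acc ^ database[i], indices, 0)

def make_queries(n, target):
    """Build two index sets that differ only in whether they contain `target`."""
    s1 = {i for i in range(n) if secrets.randbits(1)}   # uniformly random subset
    s2 = s1 ^ {target}                                  # toggle membership of target
    return s1, s2

def retrieve(database, target):
    """User side: XOR the two answers; every record except `target` cancels out."""
    s1, s2 = make_queries(len(database), target)
    a1 = server_answer(database, s1)   # sent to server 1
    a2 = server_answer(database, s2)   # sent to server 2
    return a1 ^ a2
```

In this sketch the user uploads about n bits to each server, which shows why the communication cost of such schemes is worth characterizing.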
Chi Liu, Jiangxia Cao, Rui Huang, Kai Zheng, Qiang Luo, Kun Gai, Guorui Zhou
In large-scale content recommendation systems, retrieval serves as the initial stage in the pipeline, responsible for selecting thousands of candidate items from billions of options to pass on to ranking modules. Traditionally, the dominant retrieval method has been Embedding-Based Retrieval (EBR) using a Deep Neural Network (DNN) dual-tower structure. Applying transformers to retrieval tasks has been the focus of recent research, but real-world industrial deployment still presents significant challenges. In this paper, we introduce KuaiFormer, a novel transformer-based retrieval framework deployed in a large-scale content recommendation system. KuaiFormer fundamentally redefines the retrieval process by shifting from conventional score estimation tasks (such as click-through rate estimation) to a transformer-driven Next Action Prediction paradigm. This shift enables more effective real-time interest acquisition and multi-interest extraction, significantly enhancing retrieval performance. KuaiFormer has been successfully integrated into Kuaishou App's short-video recommendation system since May 2024, serving over 400 million daily active users and resulting in a marked increase in average daily usage time of Kuaishou users. We provide insights into both the technical and business aspects of deploying transformers in large-scale recommendation systems, addressing practical challenges encountered during industrial implementation. Our findings offer valuable guidance for engineers and researchers aiming to leverage transformer models to optimize large-scale content recommendation systems.
Alexandria Leto, Cecilia Aguerrebere, Ishwar Bhati, Ted Willke, Mariano Tepper, Vy Ai Vo
Retrieval-augmented generation (RAG) is a promising method for addressing
some of the memory-related challenges associated with Large Language Models
(LLMs). Two separate systems form the RAG pipeline, the retriever and the
reader, and the impact of each on downstream task performance is not
well-understood. Here, we work towards the goal of understanding how retrievers
can be optimized for RAG pipelines for common tasks such as Question Answering
(QA). We conduct experiments focused on the relationship between retrieval and
RAG performance on QA and attributed QA and unveil a number of insights useful
to practitioners developing high-performance RAG pipelines. For example,
lowering search accuracy has minor implications for RAG performance while
potentially increasing retrieval speed and memory efficiency.
Authors' comments: Accepted to NeurIPS 2024 Workshop ATTRIB
Tevin Wang, Jingyuan He, Chenyan Xiong
Retrieval-augmented generation (RAG) combines knowledge from domain-specific sources into large language models to ground answer generation. Current RAG systems lack customizable visibility on the context documents and the model's attentiveness towards such documents. We propose RAGViz, a RAG diagnosis tool that visualizes the attentiveness of the generated tokens in retrieved documents. With a built-in user interface, retrieval index, and Large Language Model (LLM) backbone, RAGViz provides two main functionalities: (1) token and document-level attention visualization, and (2) generation comparison upon context document addition and removal. As an open-source toolkit, RAGViz can be easily hosted with a custom embedding model and HuggingFace-supported LLM backbone. Using a hybrid ANN (Approximate Nearest Neighbor) index, memory-efficient LLM inference tool, and custom context snippet method, RAGViz operates efficiently with a median query time of about 5 seconds on a moderate GPU node. Our code is available at https://github.com/cxcscmu/RAGViz. A demo video of RAGViz can be found at https://youtu.be/cTAbuTu6ur4.
Neil Chowdhury, Franklin Wang, Sumedh Shenoy, Douwe Kiela, Sarah Schwettmann, Tristan Thrush
Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
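The abstract does not spell out the correction, so the sketch below shows one training-free form it could plausibly take: each candidate gets a bias equal to a scaled mean of its k highest similarity scores against a reference set of queries, and that bias is subtracted at retrieval time. The function name, the parameters `k` and `alpha`, and this exact bias formula are assumptions made for illustration.

```python
import numpy as np

def nnn_correct(scores, ref_scores, k=2, alpha=1.0):
    """Subtract a per-candidate bias estimated from a reference query set.

    scores:     (num_queries, num_candidates) similarities for the test queries
    ref_scores: (num_ref_queries, num_candidates) similarities for reference queries
    The bias of candidate j is alpha times the mean of its k largest reference
    scores, so candidates that score highly against everything are demoted.
    """
    top_k = np.sort(ref_scores, axis=0)[-k:, :]   # k largest scores per candidate
    bias = alpha * top_k.mean(axis=0)             # shape (num_candidates,)
    return scores - bias                          # broadcast over queries

# Hypothetical example: candidate 0 is a "hub" that matches every reference query.
scores = np.array([[0.9, 0.8]])
ref_scores = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.7, 0.0]])
corrected = nnn_correct(scores, ref_scores, k=2, alpha=1.0)
```

In this toy example the raw scores prefer candidate 0, while the corrected scores prefer candidate 1. The biases depend only on the reference set, so they can be computed once ahead of time.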
Yifan Du, Yong Meng Sua, Santosh Kumar, Jiuyi Zhang, Xiangzhi Li, Yongxiang Hu, Parminder Ghuman, Yuping Huang
We demonstrate a chip-integrated emission spectroscope capable of retrieving
the temperature of the light sources. It consists of a single photon detector
with low dark counts and a sweeping on-chip filter with 2 pm spectral
resolution in the visible and near-infrared regimes. With wildfire sensing
applications in mind, we test our system with a hollow cathode lamp to simulate
the K-line emission, and show how the models of Doppler and collision
broadening in the plasma can be used for temperature retrieval. With favorable
device parameters, high spectral resolution, and a novel temperature retrieval
capability, our technique may find broad applications in environmental
monitoring, astrophysics, plasma physics, and so on.
Authors' comments: 12 pages, 13 figures
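As a back-of-the-envelope illustration of temperature retrieval from line broadening, the sketch below inverts the standard Doppler full-width-at-half-maximum formula, FWHM = lambda0 * sqrt(8 ln2 * k_B * T / (m * c^2)). It ignores collision broadening, which the abstract also models, and it assumes the emitter is potassium (about 39.1 u) with a line near 770 nm; those values and the function names are illustrative assumptions, not parameters taken from the paper.

```python
import math

K_B = 1.380649e-23        # Boltzmann constant, J/K
C = 299_792_458.0         # speed of light, m/s
AMU = 1.66053906660e-27   # atomic mass unit, kg

def doppler_fwhm(wavelength_m, temperature_k, mass_amu):
    """Doppler-broadened FWHM (in metres) of a line emitted by a thermal gas."""
    mass = mass_amu * AMU
    return wavelength_m * math.sqrt(8 * math.log(2) * K_B * temperature_k / (mass * C ** 2))

def temperature_from_fwhm(fwhm_m, wavelength_m, mass_amu):
    """Invert the Doppler formula to estimate the gas temperature in kelvin."""
    mass = mass_amu * AMU
    return mass * C ** 2 / (8 * math.log(2) * K_B) * (fwhm_m / wavelength_m) ** 2

# Assumed example values: potassium emitter, line near 770 nm, gas at 1500 K.
width = doppler_fwhm(770e-9, 1500.0, 39.0983)
recovered = temperature_from_fwhm(width, 770e-9, 39.0983)
```

Under these assumptions the Doppler width is a few picometres, comparable to the 2 pm filter resolution quoted in the abstract.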
Xinyu Zhao, Fangcong Yin, Greg Durrett
Long-context LLMs are increasingly in demand for applications such as
retrieval-augmented generation. To defray the cost of pretraining LLMs over
long contexts, recent work takes an approach of synthetic context extension:
fine-tuning LLMs with synthetically generated long-context data in a
post-training stage. However, it remains unclear how and why this synthetic
context extension imparts abilities for downstream long-context tasks. In this
paper, we investigate fine-tuning on synthetic data for three long-context
tasks that require retrieval and reasoning. We vary the realism of "needle"
concepts to be retrieved and diversity of the surrounding "haystack" context,
from using LLMs to construct synthetic documents to using templated relations
and creating symbolic datasets. We find that models trained on synthetic data
fall short of the real data, but surprisingly, the mismatch can be interpreted
and even predicted in terms of a special set of attention heads that are
responsible for retrieval over long context, retrieval heads (Wu et al., 2024).
The retrieval heads learned on synthetic data have high overlap with retrieval
heads learned on real data, and there is a strong correlation between the
recall of heads learned and the downstream performance of a model. Furthermore,
with attention knockout and activation patching, we mechanistically show that
retrieval heads are necessary and explain model performance, although they are
not totally sufficient. Our results shed light on how to interpret synthetic
data fine-tuning performance and how to approach creating better data for
learning real-world capabilities over long contexts.
Authors' comments: Published at ICML 2025