Chen Xu, Jun Xu, Yiming Ding, Xiao Zhang, Qi Qi
In pursuit of fairness and balanced development, recommender systems (RS)
often prioritize group fairness, ensuring that specific groups maintain a
minimum level of exposure over a given period. For example, RS platforms aim to
ensure adequate exposure for new providers or specific categories of items
according to their needs. Modern industry RS usually adopts a two-stage
pipeline: stage-1 (retrieval stage) retrieves hundreds of candidates from
millions of items distributed across various servers, and stage-2 (ranking
stage) focuses on presenting a small-size but accurate selection from items
chosen in stage-1. Existing efforts for ensuring amortized group exposures
focus on stage-2; however, stage-1 is also critical for the task. Without a
high-quality set of candidates, the stage-2 ranker cannot ensure the required
exposure of groups. Previous fairness-aware works designed for stage-2
typically require accessing and traversing all items. In stage-1, however,
millions of items are distributively stored in servers, making it infeasible to
traverse all of them. How to ensure group exposures in the distributed
retrieval process is a challenging question. To address this issue, we
introduce a model named FairSync, which transforms the problem into a
constrained distributed optimization problem. Specifically, FairSync resolves
the issue by moving it to the dual space, where a central node aggregates
historical fairness data into a vector and distributes it to all servers. To
trade off efficiency and accuracy, gradient descent is used to periodically
update the parameters of the dual vector. Experimental results on two public
recommender retrieval datasets show that FairSync outperforms all the
baselines, achieving the desired minimum level of exposures while maintaining
a high level of retrieval accuracy.
Authors' comments: Accepted in WWW'24
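The dual-space idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-round exposure share, the synthetic relevance scores, and the step size are all assumptions made for illustration. Each group gets one non-negative dual value that is added to the score of its items, and a central update raises that value whenever the group falls short of its minimum exposure.

```python
import numpy as np

def retrieve_top_k(relevance, item_group, dual, k):
    """Score items as relevance plus the dual bonus of their group, keep the top k."""
    adjusted = relevance + dual[item_group]
    return np.argsort(-adjusted)[:k]

def update_dual(dual, exposure, min_exposure, lr):
    """One projected gradient descent step on the dual vector (kept non-negative)."""
    grad = exposure - min_exposure  # negative when a group is under-exposed
    return np.maximum(0.0, dual - lr * grad)

def simulate(num_rounds=200, k=10, lr=0.05, seed=0):
    """Toy simulation (hypothetical data): group 1 is systematically less relevant."""
    rng = np.random.default_rng(seed)
    num_items, num_groups = 500, 2
    item_group = rng.integers(0, num_groups, size=num_items)
    base = np.where(item_group == 1, -1.0, 0.0)
    min_exposure = np.array([0.0, 0.3])  # assumed required share of the k slots
    dual = np.zeros(num_groups)
    counts = np.zeros(num_groups)
    for _ in range(num_rounds):
        relevance = base + rng.normal(size=num_items)
        top = retrieve_top_k(relevance, item_group, dual, k)
        round_counts = np.bincount(item_group[top], minlength=num_groups)
        counts += round_counts
        dual = update_dual(dual, round_counts / k, min_exposure, lr)
    return counts / counts.sum(), dual

if __name__ == "__main__":
    share, dual = simulate()
    print("exposure share per group:", share)
    print("final dual vector:", dual)
```

In a distributed setting, only the small dual vector would need to be broadcast to the servers, which is the property the abstract emphasizes; here a single process stands in for all servers.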
Hongjin Qian, Zheng Liu, Kelong Mao, Yujia Zhou, Zhicheng Dou
This paper presents a novel Chunking-Free In-Context (CFIC) retrieval approach, specifically tailored for Retrieval-Augmented Generation (RAG) systems. Traditional RAG systems often struggle with grounding responses using precise evidence text due to the challenges of processing lengthy documents and filtering out irrelevant content. Commonly employed solutions, such as document chunking and adapting language models to handle longer contexts, have their limitations. These methods either disrupt the semantic coherence of the text or fail to effectively address the issues of noise and inaccuracy in evidence retrieval. CFIC addresses these challenges by circumventing the conventional chunking process. It utilizes the encoded hidden states of documents for in-context retrieval, employing auto-regressive decoding to accurately identify the specific evidence text required for user queries, eliminating the need for chunking. CFIC is further enhanced by incorporating two decoding strategies, namely Constrained Sentence Prefix Decoding and Skip Decoding. These strategies not only improve the efficiency of the retrieval process but also ensure that the fidelity of the generated grounding text evidence is maintained. Our evaluations of CFIC on a range of open QA datasets demonstrate its superiority in retrieving relevant and accurate evidence, offering a significant improvement over traditional methods. By doing away with the need for document chunking, CFIC presents a more streamlined, effective, and efficient retrieval solution, making it a valuable advancement in the field of RAG systems.
Yannis Kalantidis, Mert Bülent Sarıyıldız, Rafael S. Rezende, Philippe Weinzaepfel, Diane Larlus, Gabriela Csurka
State-of-the-art visual localization approaches generally rely on a first
image retrieval step whose role is crucial. Yet, retrieval often struggles when
facing varying conditions, due to e.g. weather or time of day, with dramatic
consequences on the visual localization accuracy. In this paper, we improve
this retrieval step and tailor it to the final localization task. Among the
several changes we advocate for, we propose to synthesize variants of the
training set images, obtained from generative text-to-image models, in order to
automatically expand the training set towards a number of nameable variations
that particularly hurt visual localization. After expanding the training set,
we propose a training approach that leverages the specificities and the
underlying geometry of this mix of real and synthetic images. We experimentally
show that those changes translate into large improvements for the most
challenging visual localization datasets. Project page:
https://europe.naverlabs.com/ret4loc
Authors' comments: Accepted at ICLR 2024. Project Page:
https://europe.naverlabs.com/ret4loc
Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu
Large Language Models (LLMs) have gained immense popularity and are being
increasingly applied in various domains. Consequently, ensuring the security of
these models is of paramount importance. Jailbreak attacks, which manipulate
LLMs to generate malicious content, are recognized as a significant
vulnerability. While existing research has predominantly focused on direct
jailbreak attacks on LLMs, there has been limited exploration of indirect
methods. The integration of various plugins into LLMs, notably Retrieval
Augmented Generation (RAG), which enables LLMs such as GPTs to incorporate
external knowledge bases into their response generation, introduces new
avenues for indirect jailbreak attacks.
To fill this gap, we investigate indirect jailbreak attacks on LLMs,
particularly GPTs, introducing a novel attack vector named Retrieval Augmented
Generation Poisoning. This method, Pandora, exploits the synergy between LLMs
and RAG through prompt manipulation to generate unexpected responses. Pandora
uses maliciously crafted content to influence the RAG process, effectively
initiating jailbreak attacks. Our preliminary tests show that Pandora
successfully conducts jailbreak attacks in four different scenarios, achieving
higher success rates than direct attacks, with 64.3% for GPT-3.5 and 34.8%
for GPT-4.
Authors' comments: 6 pages
Weizhe Lin, Jingbiao Mei, Jinghong Chen, Bill Byrne
Large Multimodal Models (LMMs) excel in natural language and visual
understanding but are challenged by exacting tasks such as Knowledge-based
Visual Question Answering (KB-VQA) which involve the retrieval of relevant
information from document collections to use in shaping answers to questions.
We present an extensive training and evaluation framework, M2KR, for KB-VQA.
M2KR contains a collection of vision and language tasks which we have
incorporated into a single suite of benchmark tasks for training and evaluating
general-purpose multi-modal retrievers. We use M2KR to develop PreFLMR, a
pre-trained version of the recently developed Fine-grained Late-interaction
Multi-modal Retriever (FLMR) approach to KB-VQA, and we report new
state-of-the-art results across a range of tasks. We also present
investigations into the scaling behaviors of PreFLMR intended to be useful in
future developments in general-purpose multi-modal retrievers.
Authors' comments: 8 pages
Thomas Pouplin, Hao Sun, Samuel Holt, Mihaela van der Schaar
Large Language Models (LLMs) have demonstrated the strong potential to assist
both clinicians and the general public with their extensive medical knowledge.
However, their application in healthcare is constrained due to concerns about
the privacy of data used in training, which prevents the integration of private
and personal information because of security and ethical issues. Moreover,
although their capabilities can be enhanced with information retrieval to
access up-to-date knowledge, the current integration of LLMs with information
retrieval lacks robustness to imperfect retrieval, which can hinder their
effectiveness and even reduce overall performance. In this work, we address
this challenge by introducing the Retrieval-Augmented Thought Process (RATP).
Given access to external knowledge, RATP formulates the thought generation of
LLMs as a multiple-step decision process. To optimise such a thought process,
RATP leverages Monte-Carlo Tree Search and learns a proxy reward function that
permits cost-efficient inference. On a private dataset of electronic medical
records, deliberately excluded from any LLM training set, RATP achieves 35%
additional accuracy compared to in-context retrieval-augmented generation for
the question-answering task.
Authors' comments: 17 pages, 18 figures
Haonan Chen, Zhicheng Dou, Kelong Mao, Jiongnan Liu, Ziliang Zhao
Conversational search utilizes multi-turn natural language contexts to retrieve relevant passages. Existing conversational dense retrieval models mostly view a conversation as a fixed sequence of questions and responses, overlooking the severe data sparsity problem -- that is, users can perform a conversation in various ways, and these alternate conversations are unrecorded. Consequently, they often struggle to generalize to diverse conversations in real-world scenarios. In this work, we propose a framework for generalizing Conversational dense retrieval via LLM-cognition data Augmentation (ConvAug). ConvAug first generates multi-level augmented conversations to capture the diverse nature of conversational contexts. Inspired by human cognition, we devise a cognition-aware process to mitigate the generation of false positives, false negatives, and hallucinations. Moreover, we develop a difficulty-adaptive sample filter that selects challenging samples for complex conversations, thereby giving the model a larger learning space. A contrastive learning objective is then employed to train a better conversational context encoder. Extensive experiments conducted on four public datasets, under both normal and zero-shot settings, demonstrate the effectiveness, generalizability, and applicability of ConvAug.
Zhibo Hu, Chen Wang, Yanfeng Shu, Helen Paik, Liming Zhu
The robustness of large language models (LLMs) becomes increasingly important
as their use rapidly grows in a wide range of domains. Retrieval-Augmented
Generation (RAG) is considered as a means to improve the trustworthiness of
text generation from LLMs. However, how the outputs from RAG-based LLMs are
affected by slightly different inputs is not well studied. In this work, we
find that the insertion of even a short prefix to the prompt leads to the
generation of outputs far away from factually correct answers. We
systematically evaluate the effect of such prefixes on RAG by introducing a
novel optimization technique called Gradient Guided Prompt Perturbation (GGPP).
GGPP achieves a high success rate in steering outputs of RAG-based LLMs to
targeted wrong answers. It can also cope with instructions in the prompts
requesting to ignore irrelevant context. We also exploit the difference in
LLMs' neuron activations between prompts with and without GGPP perturbations
to improve the robustness of RAG-based LLMs through a highly effective
detector trained on neuron activations triggered by GGPP-generated prompts.
Our evaluation on open-sourced LLMs demonstrates the effectiveness of
our methods.
Authors' comments: 12 pages, 9 figures
João Daniel Silva, João Magalhães, Devis Tuia, Bruno Martins
Image captioning and cross-modal retrieval are examples of tasks that involve the joint analysis of visual and linguistic information. In connection to remote sensing imagery, these tasks can help non-expert users in extracting relevant Earth observation information for a variety of applications. Still, despite some previous efforts, the development and application of vision and language models to the remote sensing domain have been hindered by the relatively small size of the available datasets and models used in previous studies. In this work, we propose RS-CapRet, a Vision and Language method for remote sensing tasks, in particular image captioning and text-image retrieval. We specifically propose to use a highly capable large decoder language model together with image encoders adapted to remote sensing imagery through contrastive language-image pre-training. To bridge together the image encoder and language decoder, we propose training simple linear layers with examples from combining different remote sensing image captioning datasets, keeping the other parameters frozen. RS-CapRet can then generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving SOTA or competitive performance with existing methods. Qualitative results illustrate that RS-CapRet can effectively leverage the pre-trained large language model to describe remote sensing images, retrieve them based on different types of queries, and also show the ability to process interleaved sequences of images and text in a dialogue manner.
Riccardo Cappuzzo, Aimee Coelho, Felix Lefebvre, Paolo Papotti, Gael Varoquaux
Machine-learning from a disparate set of tables, a data lake, requires
assembling features by merging and aggregating tables. Data discovery can
extend autoML to data tables by automating these steps. We present an in-depth
analysis of such automated table augmentation for machine learning tasks,
analyzing different methods for the three main steps: retrieving joinable
tables, merging information, and predicting with the resultant table. We use
two data lakes: Open Data US, a well-referenced real data lake, and a novel
semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a
tool for benchmarking this data discovery task. Systematic exploration on both
lakes outlines 1) the importance of accurately retrieving join candidates, 2)
the efficiency of simple merging methods, and 3) the resilience of tree-based
learners to noisy conditions. Our experimental environment is easily
reproducible and based on open data, to foster more research on feature
engineering, autoML, and learning in data lakes.
Authors' comments: 12 pages + references, 6 figures in main body. 15 pages + 11 figures
in appendix
Austin Xu, Will Monroe, Klinton Bicknell
We study the problem of zero-shot exercise retrieval in the context of online
language learning, to give learners the ability to explicitly request
personalized exercises via natural language. Using real-world data collected
from language learners, we observe that vector similarity approaches poorly
capture the relationship between exercise content and the language that
learners use to express what they want to learn. This semantic gap between
queries and content dramatically reduces the effectiveness of general-purpose
retrieval models pretrained on large scale information retrieval datasets like
MS MARCO. We leverage the generative capabilities of large language models to
bridge the gap by synthesizing hypothetical exercises based on the learner's
input, which are then used to search for relevant exercises. Our approach,
which we call mHyER, overcomes three challenges: (1) lack of relevance labels
for training, (2) unrestricted learner input content, and (3) low semantic
similarity between input and retrieval candidates. mHyER outperforms several
strong baselines on two novel benchmarks created from crowdsourced data and
publicly available data.
Authors' comments: Presented at Learning Analytics and Knowledge 2024. 11 pages, 4
figures, 5 tables
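The retrieval idea in the abstract above (search with synthesized hypothetical exercises rather than with the learner's raw request) can be sketched as follows. This is only an illustration under stated assumptions, not the mHyER implementation: `embed` is a toy bag-of-words vector standing in for a real encoder, `generate` is a hypothetical callable standing in for an LLM call, and taking the maximum similarity over the hypothetical exercises is an arbitrary design choice.

```python
import math
from collections import Counter
from typing import Callable, List

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption; a real system would use an encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str,
             exercises: List[str],
             generate: Callable[[str], List[str]],
             top_k: int = 1) -> List[str]:
    """Rank exercises by similarity to hypothetical exercises generated from the query."""
    hypothetical_vectors = [embed(h) for h in generate(query)]

    def score(exercise: str) -> float:
        vec = embed(exercise)
        return max((cosine(h, vec) for h in hypothetical_vectors), default=0.0)

    return sorted(exercises, key=score, reverse=True)[:top_k]

def fake_generate(query: str) -> List[str]:
    """Hypothetical stand-in for an LLM that writes example exercises."""
    return ["he walked to school yesterday"]

if __name__ == "__main__":
    pool = ["she walked to the park yesterday", "i want coffee and a sandwich"]
    print(retrieve("i want to practice past tense", pool, fake_generate))
```

The example pool is chosen so that the raw request shares words with the wrong exercise, while the hypothetical exercise shares words with the right one, which mirrors the semantic gap the abstract describes.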
Dipankar Sarkar
Information retrieval is a rapidly evolving field, characterized by a continuous refinement of techniques and technologies, from basic hyperlink-based navigation to sophisticated algorithm-driven search engines. This paper aims to provide a comprehensive overview of the evolution of Information Retrieval Technology, with a particular focus on the role of Large Language Models (LLMs) in bridging the gap between traditional search methods and the emerging paradigm of answer retrieval. The integration of LLMs in the realms of response retrieval and indexing signifies a paradigm shift in how users interact with information systems. This shift is driven by LLMs like GPT-4, which are capable of understanding and generating human-like text, thus enabling them to provide more direct and contextually relevant answers to user queries. Through this exploration, we seek to illuminate the technological milestones that have shaped this journey and the potential future directions in this rapidly changing field.
Julien Pierre Edmond Ghali, Kosuke Shima, Koichi Moriyama, Atsuko Mutoh, Nobuhiro Inuzuka
In the rapidly changing world of smart technology, searching for documents
has become more challenging due to the rise of advanced language models. These
models sometimes face difficulties, like providing inaccurate information,
commonly known as "hallucination." This research focuses on addressing this
issue through Retrieval-Augmented Generation (RAG), a technique that guides
models to give accurate responses based on real facts. To overcome scalability
issues, the study explores connecting user queries with sophisticated language
models such as BERT and Orca2, using an innovative query optimization process.
The study unfolds in three scenarios: first, without RAG; second, without
additional assistance; and finally, with extra help. Choosing the compact yet
efficient Orca2 7B model demonstrates a smart use of computing resources. The
empirical results indicate a significant improvement in the initial language
model's performance under RAG, particularly when assisted with prompt
augmenters. Consistency in document retrieval across different encodings
highlights the effectiveness of using language model-generated queries. The
introduction of UMAP for BERT further simplifies document retrieval while
maintaining strong results.
Authors' comments: 28 pages, 10 annexes, 2 figures
Harshit Mehrotra, Jamie Callan, Zhen Fan
The ClueWeb22 dataset containing nearly 10 billion documents was released in 2022 to support academic and industry research. The goal of this project was to build retrieval baselines for the English section of the "super head" part (category B) of this dataset. These baselines can then be used by the research community to compare their systems and also to generate data to train/evaluate new retrieval and ranking algorithms. The report covers sparse and dense first stage retrievals as well as neural rerankers that were implemented for this dataset. These systems are available as a service on a Carnegie Mellon University cluster.
Christopher Liao, Christian So, Theodoros Tsiligkaridis, Brian Kulis
Domain generalization (DG) is an important problem that learns a model which generalizes to unseen test domains leveraging one or more source domains, under the assumption of shared label spaces. However, most DG methods assume access to abundant source data in the target label space, a requirement that proves overly stringent for numerous real-world applications, where acquiring the same label space as the target task is prohibitively expensive. For this setting, we tackle the multimodal version of the unsupervised domain generalization (MUDG) problem, which uses a large task-agnostic unlabeled source dataset during finetuning. Our framework does not explicitly assume any relationship between the source dataset and target task. Instead, it relies only on the premise that the source dataset can be accurately and efficiently searched in a joint vision-language space. We make three contributions in the MUDG setting. Firstly, we show theoretically that cross-modal approximate nearest neighbor search suffers from low recall due to the large distance between text queries and the image centroids used for coarse quantization. Accordingly, we propose paired k-means, a simple clustering algorithm that improves nearest neighbor recall by storing centroids in query space instead of image space. Secondly, we propose an adaptive text augmentation scheme for target labels designed to improve zero-shot accuracy and diversify retrieved image data. Lastly, we present two simple but effective components to further improve downstream target accuracy. We compare against state-of-the-art name-only transfer, source-free DG and zero-shot (ZS) methods on their respective benchmarks and show consistent improvement in accuracy on 20 diverse datasets. Code is available: https://github.com/Chris210634/mudg
Junyoung Seo, Susung Hong, Wooseok Jang, Inès Hyeonsu Kim, Minseop Kwak, Doyup Lee, Seungryong Kim
Text-to-3D generation has achieved significant success by incorporating
powerful 2D diffusion models, but insufficient 3D prior knowledge also leads to
the inconsistency of 3D geometry. Recently, since large-scale multi-view
datasets have been released, fine-tuning the diffusion model on the multi-view
datasets has become a mainstream approach to solving the 3D inconsistency
problem. However, it is confronted with fundamental difficulties regarding the
limited quality and diversity of 3D data, compared with 2D data. To sidestep
these trade-offs,
we explore a retrieval-augmented approach tailored for score distillation,
dubbed RetDream. We postulate that both expressiveness of 2D diffusion models
and geometric consistency of 3D assets can be fully leveraged by employing the
semantically relevant assets directly within the optimization process. To this
end, we introduce a novel framework for retrieval-based quality enhancement in
text-to-3D generation. We leverage the retrieved asset to incorporate its
geometric prior in the variational objective and adapt the diffusion model's 2D
prior toward view consistency, achieving drastic improvements in both geometry
and fidelity of generated scenes. We conduct extensive experiments to
demonstrate that RetDream exhibits superior quality with increased geometric
consistency. Project page is available at https://ku-cvlab.github.io/RetDream/.
Authors' comments: Project Page: https://ku-cvlab.github.io/RetDream/
Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, Renyu Li
Chunking information is a key step in Retrieval Augmented Generation (RAG). Current research primarily centers on paragraph-level chunking. This approach treats all texts as equal and neglects the information contained in the structure of documents. We propose an expanded approach to chunking documents that moves beyond mere paragraph-level chunking to chunk primarily by the structural elements of documents. Dissecting documents into these constituent elements creates a new way to chunk documents that yields the best chunk size without tuning. We introduce a novel framework that evaluates how chunking based on element types annotated by document understanding models contributes to the overall context and accuracy of the information retrieved. We also demonstrate how this approach impacts RAG-assisted Question & Answer task performance. Our research includes a comprehensive analysis of various element types, their role in effective information retrieval, and the impact they have on the quality of RAG outputs. Findings support that element-type-based chunking largely improves RAG results on financial reporting. Through this research, we are also able to address how to achieve highly accurate RAG.
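As a rough illustration of chunking by structural element rather than by paragraph, the sketch below groups a stream of typed elements into chunks. The element type names ("title", "text", "table"), the rule that a title opens a new chunk, the rule that a table is kept as its own chunk, and the `max_chars` limit are all assumptions made for this example; the abstract does not specify the authors' exact rules.

```python
from typing import List, Tuple

# (element_type, text); the type names here are hypothetical labels
Element = Tuple[str, str]

def chunk_by_elements(elements: List[Element], max_chars: int = 2048) -> List[str]:
    """Group document elements into chunks using their structural type."""
    chunks: List[str] = []
    current: List[str] = []

    def flush() -> None:
        # Close the chunk being built, if it has any content.
        if current:
            chunks.append("\n".join(current))
            current.clear()

    for kind, text in elements:
        if kind == "table":
            flush()
            chunks.append(text)  # assumed rule: a table is never split or merged
            continue
        if kind == "title":
            flush()              # assumed rule: a title opens a new chunk
        elif sum(len(t) for t in current) + len(text) > max_chars:
            flush()              # size guard for long runs of plain text
        current.append(text)
    flush()
    return chunks

if __name__ == "__main__":
    doc = [
        ("title", "Revenue"),
        ("text", "Revenue grew."),
        ("table", "Q1|Q2"),
        ("text", "Note."),
        ("title", "Risks"),
        ("text", "Market risk."),
    ]
    for chunk in chunk_by_elements(doc):
        print(repr(chunk))
```

Because chunk boundaries follow the document's own structure, chunk sizes vary with the content instead of being fixed by a tuned parameter; `max_chars` only acts as a safety limit in this sketch.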
Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li
Despite the impressive capabilities of large language models (LLMs) across diverse applications, they still suffer from trustworthiness issues, such as hallucinations and misalignments. Retrieval-augmented language models (RAG) have been proposed to enhance the credibility of generations by grounding them in external knowledge, but the theoretical understanding of their generation risks remains unexplored. In this paper, we answer: 1) whether RAG can indeed lead to low generation risks, 2) how to provide provable guarantees on the generation risks of RAG and vanilla LLMs, and 3) what sufficient conditions enable RAG models to reduce generation risks. We propose C-RAG, the first framework to certify generation risks for RAG models. Specifically, we provide conformal risk analysis for RAG models and certify an upper confidence bound of generation risks, which we refer to as conformal generation risk. We also provide theoretical guarantees on conformal generation risks for general bounded risk functions under test distribution shifts. We prove that RAG achieves a lower conformal generation risk than that of a single LLM when the quality of the retrieval model and transformer is non-trivial. Our intensive empirical results demonstrate the soundness and tightness of our conformal generation risk guarantees across four widely-used NLP datasets on four state-of-the-art retrieval models.
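To make the notion of an upper confidence bound on a bounded generation risk concrete, here is a minimal sketch that uses the plain Hoeffding inequality on calibration samples. This is a deliberately simpler, swapped-in bound and not the conformal analysis of C-RAG; the function name, the i.i.d. assumption on the calibration risks, and the default `delta` are assumptions for illustration.

```python
import math
from typing import Sequence

def hoeffding_risk_upper_bound(risks: Sequence[float], delta: float = 0.05) -> float:
    """Upper bound on the expected risk that holds with probability >= 1 - delta.

    Assumes the calibration risks are i.i.d. and each lies in [0, 1].
    """
    if not risks:
        raise ValueError("need at least one calibration sample")
    if any(r < 0.0 or r > 1.0 for r in risks):
        raise ValueError("risks must lie in [0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    n = len(risks)
    empirical_risk = sum(risks) / n
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return min(1.0, empirical_risk + slack)

if __name__ == "__main__":
    # Hypothetical per-example risks (e.g. 1 - answer quality) on a calibration set
    calibration = [0.2] * 200
    print(hoeffding_risk_upper_bound(calibration, delta=0.05))
```

The slack term shrinks as the calibration set grows, so the same empirical risk certifies a tighter bound with more samples.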
Zifei Han, Jionghao Lin, Ashish Gurung, Danielle R. Thomas, Eason Chen, Conrad Borchers, Shivang Gupta et al.
One-on-one tutoring is an effective instructional method for enhancing
learning, yet its efficacy hinges on tutor competencies. Novice math tutors
often prioritize content-specific guidance, neglecting aspects such as
social-emotional learning. Social-emotional learning promotes equity and
inclusion and nurturing relationships with students, which is crucial for
holistic student development. Assessing the competencies of tutors accurately
and efficiently can drive the development of tailored tutor training programs.
However, evaluating novice tutor ability during real-time tutoring remains
challenging as it typically requires experts-in-the-loop. To address this
challenge, this preliminary study aims to harness Generative Pre-trained
Transformers (GPT), such as GPT-3.5 and GPT-4 models, to automatically assess
tutors' ability to use social-emotional tutoring strategies. Moreover, this
study also reports on the financial dimensions and considerations of employing
these models in real-time and at scale for automated assessment. The current
study examined four prompting strategies: two basic Zero-shot prompt
strategies, Tree of Thought prompt, and Retrieval-Augmented Generator (RAG)
based prompt. The results indicate that the RAG prompt demonstrated more
accurate performance (assessed by the level of hallucination and correctness in
the generated assessment texts) and lower financial costs than the other
strategies evaluated. These findings inform the development of personalized
tutor training interventions to enhance the educational effectiveness of
tutored learning.
Authors' comments: 11 page Workshop paper, AAAI2024 Workshop on AI for Education -
Bridging Innovation and Responsibility, Large Language Model, Personalized
Tutor Training, Automatic Assessment
Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey
We recently developed SLM, a joint speech and language model, which fuses a pretrained foundational speech model and a large language model (LLM), while preserving the in-context learning capability intrinsic to the pretrained LLM. In this paper, we apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal. Task-oriented dialogs often contain domain-specific entities, e.g., restaurants, hotels, train stations, and city names, which are difficult to recognize yet critical for the downstream applications. Inspired by the RAG (retrieval-augmented generation) paradigm, we propose a retrieval augmented SLM (ReSLM) that overcomes this weakness. We first train a speech retriever to retrieve text entities mentioned in the audio. The retrieved entities are then added as text inputs to the underlying SLM to bias model predictions. We evaluated ReSLM on the speech MultiWoz task (DSTC-11 challenge) and found that this retrieval augmentation boosts model performance, improving joint goal accuracy (38.6% vs 32.7%), slot error rate (20.6% vs 24.8%), and ASR word error rate (5.5% vs 6.7%). While demonstrated on dialog state tracking, our approach is broadly applicable to other speech tasks requiring contextual information or domain-specific entities, such as contextual ASR with biasing capability.