Björn Engelmann, Timo Breuer, Philipp Schaer
Considering the multimodal signals of search items is beneficial for
retrieval effectiveness. Especially in web table retrieval (WTR) experiments,
accounting for multimodal properties of tables boosts effectiveness. However,
it still remains an open question how the single modalities affect user
experience in particular. Previous work analyzed WTR performance in ad-hoc
retrieval benchmarks, which neglects interactive search behavior and limits the
conclusion about the implications for real-world user environments.
To this end, this work presents an in-depth evaluation of simulated
interactive WTR search sessions as a more cost-efficient and reproducible
alternative to real user studies. As a first of its kind, we introduce
interactive query reformulation strategies based on Doc2Query, incorporating
cognitive states of simulated user knowledge. Our evaluations include two
perspectives on user effectiveness by considering different cost paradigms,
namely query-wise and time-oriented measures of effort. Our multi-perspective
evaluation scheme reveals new insights about query strategies, the impact of
modalities, and different user types in simulated WTR search sessions.
Authors' comments: 4 pages + references; accepted at CIKM'23
Philippe Jaming, Martin Rathmair
We consider the problem of reconstructing a function $f\in L^2(\mathbb{R})$ given phase-less samples of its Gabor transform, which is defined by $$\mathcal{G} f(x,y) := 2^{\frac14} \int_{\mathbb{R}} f(t) e^{-\pi (t-x)^2} e^{-2\pi i y t}\,\mbox{d}t,\quad (x,y)\in\mathbb{R}^2.$$ More precisely, given sampling positions $\Omega\subseteq \mathbb{R}^2$ the task is to reconstruct $f$ (up to global phase) from measurements $\{|\mathcal{G} f(\omega)|: \,\omega\in\Omega\}$. This non-linear inverse problem is known to suffer from severe ill-posedness. As for any other phase retrieval problem, constructive recovery is a notoriously delicate affair due to the lack of convexity. One of the fundamental insights in this line of research is that the connectivity of the measurements is both necessary and sufficient for reconstruction of phase information to be theoretically possible. In this article we propose a reconstruction algorithm which is based on solving two convex problems and, as such, amenable to numerical analysis. We show, empirically as well as analytically, that the scheme accurately reconstructs from noisy data within the connected regime. Moreover, to emphasize the practicability of the algorithm we argue that both convex problems can actually be reformulated as semi-definite programs for which efficient solvers are readily available. The approach is based on ideas from complex analysis, Gabor frame theory as well as matrix completion.
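To make the measurement model concrete, the following is a minimal numerical sketch of the Gabor transform defined above, not the reconstruction algorithm of the paper. It approximates the integral with a midpoint Riemann sum; the function names, the integration interval, and the step count are illustrative assumptions. It also shows why recovery is only possible up to global phase: multiplying $f$ by a unimodular constant leaves every magnitude $|\mathcal{G} f(x,y)|$ unchanged.

```python
import cmath
import math


def gabor_transform(f, x, y, t_min=-8.0, t_max=8.0, n=4000):
    """Approximate G f(x, y) = 2^(1/4) * int f(t) exp(-pi (t-x)^2) exp(-2 pi i y t) dt.

    Uses a midpoint Riemann sum on [t_min, t_max]; the interval and the
    number of steps are assumptions chosen for rapidly decaying f.
    """
    dt = (t_max - t_min) / n
    total = 0j
    for k in range(n):
        t = t_min + (k + 0.5) * dt
        window = math.exp(-math.pi * (t - x) ** 2)
        total += f(t) * window * cmath.exp(-2j * math.pi * y * t)
    return 2 ** 0.25 * total * dt


def gaussian(t):
    """L^2-normalized Gaussian, used here only as a test signal."""
    return 2 ** 0.25 * math.exp(-math.pi * t * t)


def rotated(t):
    """The same signal multiplied by a global phase factor exp(1.3 i)."""
    return cmath.exp(1.3j) * gaussian(t)


# Phase-less measurements cannot distinguish the two signals.
magnitude_original = abs(gabor_transform(gaussian, 0.5, -0.25))
magnitude_rotated = abs(gabor_transform(rotated, 0.5, -0.25))
```

For this particular Gaussian test signal the magnitude has the closed form $e^{-\pi(x^2+y^2)/2}$, which gives a convenient check on the quadrature.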
Carlos Dominguez, Jon Ander Campos, Eneko Agirre, Gorka Azkune
Neural information retrieval requires costly annotated data for each target domain to be competitive. Synthetic annotation by query generation using Large Language Models or rule-based string manipulation has been proposed as an alternative, but their relative merits have not been analysed. In this paper, we compare both methods head-to-head using the same neural IR architecture. We focus on the BEIR benchmark, which includes test datasets from several domains with no training data, and explore two scenarios: zero-shot, where the supervised system is trained on a large out-of-domain dataset (MS-MARCO); and unsupervised domain adaptation, where, in addition to MS-MARCO, the system is fine-tuned on synthetic data from the target domain. Our results indicate that Large Language Models outperform rule-based methods in all scenarios by a large margin, and, more importantly, that unsupervised domain adaptation is effective compared to applying a supervised IR system in a zero-shot fashion. In addition, we explore several sizes of open Large Language Models to generate synthetic data and find that a medium-sized model suffices. Code and models are publicly available for reproducibility.
Jinsung Yoon, Sercan O Arik, Yanfei Chen, Tomas Pfister
Embeddings extracted by pre-trained Large Language Models (LLMs) have
significant potential to improve information retrieval and search. Beyond the
zero-shot setup in which they are being conventionally used, being able to take
advantage of the information from the relevant query-corpus paired data can
further boost the LLM capabilities. In this paper, we propose a novel method,
Search-Adaptor, for customizing LLMs for information retrieval in an efficient
and robust way. Search-Adaptor modifies the embeddings generated by pre-trained
LLMs, and can be integrated with any LLM, including those only available via
prediction APIs. On multiple English, multilingual, and multimodal retrieval
datasets, we show consistent and significant performance benefits for
Search-Adaptor -- e.g., more than 5% improvements for Google Embedding APIs in
nDCG@10 averaged over 14 BEIR datasets.
Authors' comments: Published in 2024 ACL Main Conference
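The nDCG@10 figure quoted above is a rank-based metric. As a minimal sketch of how it is commonly computed for a single query, assuming linear gain (some evaluations use the exponential gain 2^rel - 1 instead) and with hypothetical function names:

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """nDCG@k for one query.

    ranked_relevances: grades of the returned documents, in ranked order.
    judged_relevances: grades of all judged documents for the query, used
    to build the ideal ranking (an assumption about how judgments are stored).
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# A system that places the only relevant document at rank 2.
score = ndcg_at_k([0, 1, 0], [1, 0, 0])
```

Benchmark scores such as the BEIR average are then obtained by averaging the per-query values within each dataset and across datasets.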
Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, Jian-Yun Nie
Large language models (LLMs) face significant challenges stemming from the inherent limitations in knowledge, memory, alignment, and action. These challenges cannot be addressed by LLMs alone, but should rely on assistance from the external world, such as knowledge base, memory store, demonstration examples, and tools. Retrieval augmentation stands as a vital mechanism for bridging the gap between LLMs and the external assistance. However, conventional methods encounter two pressing issues. On one hand, the general-purpose retrievers are not properly optimized for the retrieval augmentation of LLMs. On the other hand, the task-specific retrievers lack the required versatility, hindering their performance across the diverse retrieval augmentation scenarios. In this work, we present a novel approach, the LLM-Embedder, which comprehensively supports the diverse needs of LLMs' retrieval augmentation with one unified embedding model. Training such a unified model is non-trivial, as various retrieval tasks aim to capture distinct semantic relationships, often subject to mutual interference. To address this challenge, we systematically optimize our training methodology. This includes reward formulation based on LLMs' feedback, the stabilization of knowledge distillation, multi-task fine-tuning with explicit instructions, and the use of homogeneous in-batch negative sampling. These optimization strategies contribute to the outstanding empirical performance of the LLM-Embedder. Notably, it yields remarkable enhancements in retrieval augmentation for LLMs, surpassing both general-purpose and task-specific retrievers in various evaluation scenarios. This project is made publicly available at https://github.com/FlagOpen/FlagEmbedding.
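In-batch negative sampling, mentioned above, can be illustrated with a generic contrastive (InfoNCE-style) loss. The sketch below is an assumption-laden illustration and not the LLM-Embedder training code: the function name, the temperature value, and the plain dot-product scoring are all hypothetical, and the "homogeneous" aspect (how batches are composed) is not modeled.

```python
import math


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    doc_vecs[i] is treated as the positive for query_vecs[i]; every other
    document in the same batch serves as a negative for that query.
    """
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [
            sum(a * b for a, b in zip(query, doc)) / temperature
            for doc in doc_vecs
        ]
        # Numerically stable log-sum-exp over all documents in the batch.
        peak = max(logits)
        log_partition = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_partition - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each query scores its own document highest and large when the pairing is swapped, which is the signal a retriever is trained on.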
Mingcheng Chen, Haoran Zhao, Yuxiang Zhao, Hulei Fan, Hongqiao Gao, Yong Yu, Zheng Tian
Data-driven black-box model-based optimization (MBO) problems arise in a
great number of practical application scenarios, where the goal is to find a
design over the whole space maximizing a black-box target function based on a
static offline dataset. In this work, we consider a more general but
challenging MBO setting, named constrained MBO (CoMBO), where only part of the
design space can be optimized while the rest is constrained by the environment.
A new challenge arising from CoMBO is that most observed designs that satisfy
the constraints are mediocre in evaluation. Therefore, we focus on optimizing
these mediocre designs in the offline dataset while maintaining the given
constraints rather than further boosting the best observed design in the
traditional MBO setting. We propose retrieval-enhanced offline model-based
optimization (ROMO), a new derivable forward approach that retrieves the
offline dataset and aggregates relevant samples to provide a trusted
prediction, and use it for gradient-based optimization. ROMO is simple to
implement and outperforms state-of-the-art approaches in the CoMBO setting.
Empirically, we conduct experiments on a synthetic Hartmann (3D) function
dataset, an industrial CIO dataset, and a suite of modified tasks in the
Design-Bench benchmark. Results show that ROMO performs well in a wide range of
constrained optimization tasks.
Authors' comments: 15 pages, 9 figures
Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro
Pretraining auto-regressive large language models (LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLMs is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval. Specifically, we continue to pretrain a 43B GPT model on an additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. Notably, the obtained foundation model, Retro 48B, largely outperforms the counterpart GPT 43B trained on 1.2T tokens in terms of perplexity with only 2.58% additional GPU hours, demonstrating the significant scaling potential of the method. After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction tuned GPT on a wide range of zero-shot tasks. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA and reading comprehension tasks, 10% over GPT across 4 challenging long-form QA tasks, and 16% over GPT across 3 summarization tasks. Surprisingly, we find that one can ablate the encoder from InstructRetro architecture and directly use its decoder backbone, while achieving comparable results. Our results highlight the promising direction to obtain a better GPT decoder through continued pretraining with retrieval before instruction tuning. Our code and checkpoints are publicly available at: https://github.com/NVIDIA/Megatron-LM/tree/InstructRetro/tools/retro.
Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick Siow Mong Goh, Chun-Mei Feng
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption. Most existing CIR models adopt the late-fusion strategy to combine visual and language features. Besides, several approaches have also been suggested to generate a pseudo-word token from the reference image, which is further integrated into the relative caption for CIR. However, these pseudo-word-based prompting methods have limitations when the target image involves complex changes to the reference image, e.g., object removal and attribute modification. In this work, we demonstrate that learning an appropriate sentence-level prompt for the relative caption (SPRC) is sufficient for achieving effective composed image retrieval. Instead of relying on pseudo-word-based prompts, we propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts. By concatenating the learned sentence-level prompt with the relative caption, one can readily use existing text-based image retrieval models to enhance CIR performance. Furthermore, we introduce both image-text contrastive loss and text prompt alignment loss to enforce the learning of suitable sentence-level prompts. Experiments show that our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets. The source code and pretrained model are publicly available at https://github.com/chunmeifeng/SPRC
Zhangyin Feng, Xiaocheng Feng, Dezhi Zhao, Maojin Yang, Bing Qin
Large language models augmented with task-relevant documents have demonstrated impressive performance on knowledge-intensive tasks. However, regarding how to obtain effective documents, the existing methods are mainly divided into two categories. One is to retrieve from an external knowledge base, and the other is to utilize large language models to generate documents. We propose an iterative retrieval-generation collaborative framework. It is not only able to leverage both parametric and non-parametric knowledge, but also helps to find the correct reasoning path through retrieval-generation interactions, which is very important for tasks that require multi-step reasoning. We conduct experiments on four question answering datasets, including single-hop QA and multi-hop QA tasks. Empirical results show that our method significantly improves the reasoning ability of large language models and outperforms previous baselines.
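The general shape of an iterative retrieval-generation loop can be sketched as follows. This is only a toy illustration under stated assumptions, not the framework from the paper: `retrieve` is a hypothetical word-overlap ranker standing in for a real retriever, `generate` is a caller-supplied stand-in for an LLM call, and feeding the previous draft back into the query is one plausible way to let generation guide retrieval.

```python
def retrieve(query, corpus, k=2):
    """Rank passages by word overlap with the query (toy retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def iterate(question, corpus, generate, rounds=2, k=2):
    """Alternate retrieval and generation for a fixed number of rounds.

    Each round retrieves with the question plus the previous draft, then
    asks `generate` (hypothetical LLM stand-in) for a new draft.
    """
    draft = ""
    for _ in range(rounds):
        query = f"{question} {draft}".strip()
        passages = retrieve(query, corpus, k)
        draft = generate(question, passages)
    return draft


corpus = [
    "alice is from the town of lyon",
    "the town of lyon belongs to france",
    "bob visited another country",
]
seen = []  # passages handed to the generator in each round


def toy_generate(question, passages):
    """Stand-in generator: records its inputs and echoes the top passage."""
    seen.append(list(passages))
    return passages[0]


answer = iterate("which country is alice from", corpus, toy_generate)
```

In this toy run the second hop (the passage linking Lyon to France) has no word overlap with the question, so it is only retrieved once the first draft has introduced the intermediate entity.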
Stephen Choi, William Gazeley, Siu Ho Wong, Tingting Li
This paper introduces the Conversational Factor Information Retrieval Method
(ConFIRM), a novel approach to fine-tuning large language models (LLMs) for
domain-specific retrieval tasks. ConFIRM leverages the Five-Factor Model of
personality to generate synthetic datasets that accurately reflect target
population characteristics, addressing data scarcity in specialized domains. We
demonstrate ConFIRM's effectiveness through a case study in the finance sector,
fine-tuning a Llama-2-7b model using personality-aligned data from the
PolyU-Asklora Fintech Adoption Index. The resulting model achieved 91% accuracy
in classifying financial queries, with an average inference time of 0.61
seconds on an NVIDIA A100 GPU. ConFIRM shows promise for creating more accurate
and personalized AI-driven information retrieval systems across various
domains, potentially mitigating issues of hallucinations and outdated
information in LLMs deployed
Authors' comments: 8 pages, 2 figures, 2 tables, 2 appendices
Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina et al.
Extending the context window of large language models (LLMs) has recently
become popular, while the solution of augmenting LLMs with retrieval has
existed for years. The natural questions are: i) Retrieval-augmentation versus
long context window, which one is better for downstream tasks? ii) Can both
methods be combined to get the best of both worlds? In this work, we answer
these questions by studying both solutions using two state-of-the-art
pretrained LLMs, i.e., a proprietary 43B GPT and Llama2-70B. Perhaps
surprisingly, we find that an LLM with a 4K context window using simple
retrieval-augmentation at generation can achieve comparable performance to
a finetuned LLM with a 16K context window via positional interpolation on long
context tasks, while taking much less computation. More importantly, we
demonstrate that retrieval can significantly improve the performance of LLMs
regardless of their extended context window sizes. Our best model,
retrieval-augmented Llama2-70B with 32K context window, outperforms
GPT-3.5-turbo-16k and Davinci003 in terms of average score on nine long context
tasks including question answering, query-based summarization, and in-context
few-shot learning tasks. It also outperforms its non-retrieval Llama2-70B-32k
baseline by a margin, while being much faster at generation. Our study provides
general insights on the choice of retrieval-augmentation versus long context
extension of LLM for practitioners.
Authors' comments: Published at ICLR 2024
Brian Stern, Haoshuo Chen, Kwangwoong Kim, Hanzi Huang, Jie Zhao, Mohamad Hossein Idjadi
We demonstrate a direct-detection phase retrieval receiver based on silicon
photonics. The receiver implements strong dispersion and delay lines on a
compact chip. We retrieve the full field of a 30-GBd QPSK signal without a
carrier or local oscillator.
Authors' comments: 4 pages, 6 figures, at European Conference on Optical Communications
(ECOC) 2023
Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn et al.
Retrieval-augmented language models (RALMs) improve performance by accessing
long-tail and up-to-date knowledge from external data stores, but are
challenging to build. Existing approaches require either expensive
retrieval-specific modifications to LM pre-training or use post-hoc integration
of the data store that leads to suboptimal performance. We introduce
Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning
methodology that provides a third option by retrofitting any LLM with retrieval
capabilities. Our approach operates in two distinct fine-tuning steps: (1) one
updates a pre-trained LM to better use retrieved information, while (2) the
other updates the retriever to return more relevant results, as preferred by
the LM. By fine-tuning over tasks that require both knowledge utilization and
contextual awareness, we demonstrate that each stage yields significant
performance improvements, and using both leads to additional gains. Our best
model, RA-DIT 65B, achieves state-of-the-art performance across a range of
knowledge-intensive zero- and few-shot learning benchmarks, significantly
outperforming existing in-context RALM approaches by up to +8.9% in 0-shot
setting and +1.4% in 5-shot setting on average.
Authors' comments: 24 pages
Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert
We introduce Ragas (Retrieval Augmented Generation Assessment), a framework
for reference-free evaluation of Retrieval Augmented Generation (RAG)
pipelines. RAG systems are composed of a retrieval and an LLM based generation
module, and provide LLMs with knowledge from a reference textual database,
which enables them to act as a natural language layer between a user and
textual databases, reducing the risk of hallucinations. Evaluating RAG
architectures is, however, challenging because there are several dimensions to
consider: the ability of the retrieval system to identify relevant and focused
context passages, the ability of the LLM to exploit such passages in a faithful
way, or the quality of the generation itself. With Ragas, we put forward a
suite of metrics which can be used to evaluate these different dimensions
\textit{without having to rely on ground truth human annotations}. We posit
that such a framework can crucially contribute to faster evaluation cycles of
RAG architectures, which is especially important given the fast adoption of
LLMs.
Authors' comments: Reference-free (not tied to having ground truth available) evaluation
framework for retrieval augmented generation
Siqing Huo, Negar Arabzadeh, Charles L. A. Clarke
Current large language models (LLMs) can exhibit near-human levels of
performance on many natural language-based tasks, including open-domain
question answering. Unfortunately, at this time, they also convincingly
hallucinate incorrect answers, so that responses to questions must be verified
against external sources before they can be accepted at face value. In this
paper, we report two simple experiments to automatically validate generated
answers against a corpus. We base our experiments on questions and passages
from the MS MARCO (V1) test collection, and a retrieval pipeline consisting of
sparse retrieval, dense retrieval and neural rerankers. In the first
experiment, we validate the generated answer in its entirety. After presenting
a question to an LLM and receiving a generated answer, we query the corpus with
the combination of the question + generated answer. We then present the LLM
with the combination of the question + generated answer + retrieved answer,
prompting it to indicate if the generated answer can be supported by the
retrieved answer. In the second experiment, we consider the generated answer at
a more granular level, prompting the LLM to extract a list of factual
statements from the answer and verifying each statement separately. We query
the corpus with each factual statement and then present the LLM with the
statement and the corresponding retrieved evidence. The LLM is prompted to
indicate if the statement can be supported and make necessary edits using the
retrieved material. With an accuracy of over 80%, we find that an LLM is
capable of verifying its generated answer when a corpus of supporting material
is provided. However, manual assessment of a random sample of questions reveals
that incorrect generated answers are missed by this verification process. While
this verification process can reduce hallucinations, it cannot entirely
eliminate them.
Authors' comments: arXiv admin note: text overlap with arXiv:2306.13781
Xingyu Yang, Daqing Liu, Heng Zhang, Yong Luo, Chaoyue Wang, Jing Zhang
Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image. However, most existing methods focus on the composition learning of text and reference images and oversimplify the text as a description, neglecting the inherent structure and the user's shifting intention of the texts. As a result, these methods typically take shortcuts that disregard the visual cue of the reference images. To address this issue, we reconsider the text as instructions and propose a Semantic Shift network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image. Specifically, SSN explicitly decomposes the instructions into two components: degradation and upgradation, where the degradation is used to picture the visual prototype from the reference image, while the upgradation is used to enrich the visual prototype into the final representations to retrieve the desired target image. The experimental results show that the proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes a new state-of-the-art performance. Codes will be publicly available.
Elias Ramzi, Nicolas Audebert, Clément Rambour, André Araujo, Xavier Bitot, Nicolas Thome
In image retrieval, standard evaluation metrics rely on score ranking, e.g.,
average precision (AP), recall at k (R@k), normalized discounted cumulative
gain (NDCG). In this work we introduce a general framework for robust and
decomposable rank losses optimization. It addresses two major challenges for
end-to-end training of deep neural networks with rank losses:
non-differentiability and non-decomposability. First, we propose a general
surrogate for the ranking operator, SupRank, that is amenable to stochastic
gradient descent. It provides an upper bound for rank losses and ensures robust
training. Secondly, we use a simple yet effective loss function to reduce the
decomposability gap between the averaged batch approximation of ranking losses
and their values on the whole training set. We apply our framework to two
standard metrics for image retrieval: AP and R@k. Additionally we apply our
framework to hierarchical image retrieval. We introduce an extension of AP, the
hierarchical average precision $\mathcal{H}$-AP, and optimize it as well as the
NDCG. Finally we create the first hierarchical landmarks retrieval dataset. We
use a semi-automatic pipeline to create hierarchical labels, extending the
large scale Google Landmarks v2 dataset. The hierarchical dataset is publicly
available at https://github.com/cvdfoundation/google-landmark. Code will be
released at https://github.com/elias-ramzi/SupRank.
Authors' comments: arXiv admin note: text overlap with arXiv:2207.04873
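For reference, the two standard metrics named above can be computed for a single query as in the following sketch. It shows the exact, non-differentiable metrics rather than the SupRank surrogate; the function names are hypothetical. Two assumptions are made: the ranked list covers the whole gallery, so AP is normalized by the number of relevant items in the list, and R@k follows the common image-retrieval convention of scoring 1 when at least one relevant item appears in the top k.

```python
def average_precision(ranked_labels):
    """AP for one query; ranked_labels[i] is True if result i is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def recall_at_k(ranked_labels, k):
    """R@k for one query: 1.0 if any of the top-k results is relevant."""
    return 1.0 if any(ranked_labels[:k]) else 0.0


labels = [True, False, True]  # relevance of results at ranks 1, 2, 3
ap = average_precision(labels)
r_at_1 = recall_at_k([False, True], 1)
r_at_2 = recall_at_k([False, True], 2)
```

Both functions depend on the scores only through the induced ordering, which is why small score changes usually leave them constant and gradients vanish almost everywhere.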
Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
Despite recent progress in text-to-audio (TTA) generation, we show that the
state-of-the-art models, such as AudioLDM, trained on datasets with an
imbalanced class distribution, such as AudioCaps, are biased in their
generation performance. Specifically, they excel in generating common audio
classes while underperforming in the rare ones, thus degrading the overall
generation performance. We refer to this problem as long-tailed text-to-audio
generation. To address this issue, we propose a simple retrieval-augmented
approach for TTA models. Specifically, given an input text prompt, we first
leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve
relevant text-audio pairs. The features of the retrieved audio-text data are
then used as additional conditions to guide the learning of TTA models. We
enhance AudioLDM with our proposed approach and denote the resulting augmented
system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a
state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the
existing approaches by a large margin. Furthermore, we show that Re-AudioLDM
can generate realistic audio for complex scenes, rare audio classes, and even
unseen audio types, indicating its potential in TTA tasks.
Authors' comments: Accepted by ICASSP 2024
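The retrieval step described above amounts to a nearest-neighbor search in a shared embedding space. The sketch below is a minimal illustration with made-up two-dimensional vectors; in the setting of the abstract the embeddings would come from a CLAP model, which is not called here, and the item names and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=3):
    """Return the ids of the k stored items most similar to the query.

    store: list of (item_id, embedding) pairs, e.g. text-audio pairs
    indexed by a precomputed embedding (toy values in this sketch).
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]


store = [
    ("dog_bark", [1.0, 0.0]),
    ("rain", [0.0, 1.0]),
    ("dog_growl", [0.9, 0.1]),
]
neighbors = top_k([1.0, 0.05], store, k=2)
```

The retrieved items would then be passed to the generator as extra conditioning; that part is model-specific and is not sketched here.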
Xintong Jiang, Yaxiong Wang, Yujiao Wu, Meng Wang, Xueming Qian
Composed image retrieval, a task involving the search for a target image
using a reference image and a complementary text as the query, has witnessed
significant advancements owing to the progress made in cross-modal modeling.
Unlike the general image-text retrieval problem with only one alignment
relation, i.e., image-text, we argue for the existence of two types of
relations in composed image retrieval. The explicit relation pertains to the
reference image & complementary text-target image, which is commonly exploited
by existing methods. Besides this intuitive relation, the observations during
our practice have uncovered another implicit yet crucial relation, i.e.,
reference image & target image-complementary text, since we found that the
complementary text can be inferred by studying the relation between the target
image and the reference image. Regrettably, existing methods largely focus on
leveraging the explicit relation to learn their networks, while overlooking the
implicit relation. In response to this weakness, we propose a new framework for
composed image retrieval, termed dual relation alignment, which integrates both
explicit and implicit relations to fully exploit the correlations among the
triplets. Specifically, we design a vision compositor to first fuse the
reference image and the target image; the resulting representation then serves
two roles: (1) counterpart for semantic alignment with the complementary text and
(2) compensation for the complementary text to boost the explicit relation
modeling, thereby implanting the implicit relation into the alignment learning.
Our method is evaluated on two popular datasets, CIRR and FashionIQ, through
extensive experiments. The results confirm the effectiveness of our
dual-relation learning in substantially enhancing composed image retrieval
performance.
Authors' comments: The architecture of our model has changed, so the methodology
and experiments have changed substantially. We have significantly revised the
original manuscript, so a withdrawal of the original version is needed
Andres Ferraro, Jaehun Kim, Sergio Oramas, Andreas Ehmann, Fabien Gouyon
Music retrieval and recommendation applications often rely on content features encoded as embeddings, which provide vector representations of items in a music dataset. Numerous complementary embeddings can be derived from processing items originally represented in several modalities, e.g., audio signals, user interaction data, or editorial data. However, data of any given modality might not be available for all items in any music dataset. In this work, we propose a method based on contrastive learning to combine embeddings from multiple modalities and explore the impact of the presence or absence of embeddings from diverse modalities in an artist similarity task. Experiments on two datasets suggest that our contrastive method outperforms single-modality embeddings and baseline algorithms for combining modalities, both in terms of artist retrieval accuracy and coverage. Improvements with respect to other methods are particularly significant for less popular query artists. We demonstrate our method successfully combines complementary information from diverse modalities, and is more robust to missing modality data (i.e., it better handles the retrieval of artists with different modality embeddings than the query artist's).