Hao Li, Curise Jia, Peng Jin, Zesen Cheng, Kehan Li, Jialu Sui, Chang Liu, Li Yuan
Image Retrieval aims to retrieve corresponding images based on a given query.
In application scenarios, users intend to express their retrieval intent
through various query styles. However, current retrieval tasks predominantly
focus on text-query retrieval exploration, leading to limited retrieval query
options and potential ambiguity or bias in user intention. In this paper, we
propose the Style-Diversified Query-Based Image Retrieval task, which enables
retrieval based on various query styles. To facilitate the novel setting, we
propose the first Diverse-Style Retrieval dataset, encompassing diverse query
styles including text, sketch, low-resolution, and art. We also propose a
lightweight style-diversified retrieval framework. For various query style
inputs, we apply the Gram Matrix to extract the query's textural features and
cluster them into a style space with style-specific bases. Then we employ the
style-init prompt tuning module to enable the visual encoder to comprehend the
texture and style information of the query. Experiments demonstrate that our
model, employing the style-init prompt tuning strategy, outperforms existing
retrieval models on the style-diversified retrieval task. Moreover,
style-diversified queries (sketch+text, art+text, etc.) can be simultaneously
retrieved in our model. The auxiliary information from other queries enhances
the retrieval performance within the respective query.
Authors' comments: 16 pages, 7 figures
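The Gram-matrix step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the feature map is given as C channels, each flattened to N spatial activations, and that the matrix is normalized by N; the function name `gram_matrix`, the normalization, and the toy values are illustrative assumptions, not details taken from the paper.

```python
def gram_matrix(features):
    """Return the C x C Gram matrix of a feature map.

    `features` is a list of C channels, each a list of N activations
    (a flattened H x W map). Entry (i, j) is the mean product of
    channels i and j, a common summary of texture statistics.
    Dividing by N is an assumed normalization.
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n for row_j in features]
        for row_i in features
    ]


# Toy feature map: 2 channels, 4 flattened spatial positions.
feats = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
]
G = gram_matrix(feats)
print(G)
```

Because the Gram matrix discards spatial layout and keeps only channel co-activation statistics, it is a natural descriptor for texture and style rather than content.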
Kai Liu, Deguang Han
A phase retrievable quantum channel refers to a quantum channel $\Phi: B(H_A)\to B(H_B)$ such that there is a positive operator valued measure (POVM) $\{F_{j}\}$ in $B(H_{B})$ and $\{\Phi^*(F_j)\}$ is a phase retrievable operator valued frame. In this paper we examine the phase retrievable quantum channels in terms of their Kraus representations. For quantum channels $\Phi$ of Choi's rank-$2$, we obtain a necessary and sufficient condition under which it is phase retrievable. For the general case, we present several necessary and/or sufficient conditions. In particular, a necessary and sufficient condition is obtained in terms of the relevant matrix-valued joint spectrum of the Kraus operators. Additionally, we also examine, by examples, the problem of constructing quantum channels such that there exists a minimal number of rank-one observables $\{F_{j}\}$ such that $\{\Phi^*(F_j)\}$ does phase retrieval for $H_A$. Conversely, for a given set of rank-one observables $\{F_{j}\}_{j=1}^{N}$, we present a sufficient condition under which, for every $1\leq r\leq N$ given, a phase retrievable quantum channel $\Phi$ of Choi's rank-$r$ can be explicitly constructed.
Chakradhar Reddy Nallu
This paper develops several algorithms that generate a task tree plan for a given goal node (recipe). The knowledge representation of the dishes is called FOON. It contains the different objects and the relationships between them with respect to the motion nodes. The graphical representation of FOON is built by observing the change in the state of an object with respect to the human manipulators. We explore how FOON is created for different recipes by robots. Task planning has difficulty handling unknown problems, as its knowledge is limited to the FOON. To obtain the task tree for a given recipe, the robot retrieves information about the relevant functional units through the knowledge retrieval process over FOON. The generated subgraphs then allow the robot to cook the required dish by following the resulting sequence of instructions.
Shitong Sun, Jindong Gu, Shaogang Gong
Text-image composed retrieval aims to retrieve the target image through the
composed query, which is specified in the form of an image plus some text that
describes desired modifications to the input image. It has recently attracted
attention due to its ability to leverage both information-rich images and
concise language to precisely express the requirements for target images.
However, the robustness of these approaches against real-world corruptions or
further text understanding has never been studied. In this paper, we perform
the first robustness study and establish three new diversified benchmarks for
systematic analysis of text-image composed retrieval against natural
corruptions in both vision and text and further probe textual understanding.
For natural corruption analysis, we introduce two new large-scale benchmark
datasets, CIRR-C and FashionIQ-C for testing in open domain and fashion domain
respectively, both of which apply 15 visual corruptions and 7 textual
corruptions. For textual understanding analysis, we introduce a new diagnostic
dataset CIRR-D by expanding the original raw data with synthetic data, which
contains modified text to better probe textual understanding ability including
numerical variation, attribute variation, object removal, background variation,
and fine-grained evaluation. The code and benchmark datasets are available at
https://github.com/SunTongtongtong/Benchmark-Robustness-Text-Image-Compose-Retrieval.
Authors' comments: Accepted by R0-FoMo: Workshop on Robustness of Few-shot and Zero-shot
Learning in Foundation Models at NeurIPS 2023
Farnaz Khun Jush, Tuan Truong, Steffen Vogler, Matthias Lenga
A wide range of imaging techniques and data formats available for medical
images make accurate retrieval from image databases challenging.
Efficient retrieval systems are crucial in advancing medical research,
enabling large-scale studies and innovative diagnostic tools. Thus, addressing
the challenges of medical image retrieval is essential for the continued
enhancement of healthcare and research.
In this study, we evaluated the feasibility of employing four
state-of-the-art pretrained models for medical image retrieval at modality,
body region, and organ levels and compared the results of two similarity
indexing approaches. Since the employed networks take 2D images, we analyzed
the impacts of weighting and sampling strategies to incorporate 3D information
during retrieval of 3D volumes. We showed that medical image retrieval is
feasible using pretrained networks without any additional training or
fine-tuning steps. Using pretrained embeddings, we achieved a recall of 1 for
various tasks at modality, body region, and organ level.
Authors' comments: 8 pages, 3 figures, 4 tables
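As an illustration of the evaluation described in the abstract above, the following sketch retrieves nearest neighbours by cosine similarity over precomputed embeddings and reports recall at k. The definition of recall used here (a query counts as a hit if any of its top-k neighbours shares its label), the function names, and the toy 2-D embeddings are assumptions for illustration; the paper's exact metric and indexing approaches are not specified in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(queries, query_labels, index, index_labels, k):
    """Fraction of queries whose top-k neighbours contain a same-label item."""
    hits = 0
    for q, q_label in zip(queries, query_labels):
        # Rank all indexed items by similarity to the query (brute force).
        ranked = sorted(
            range(len(index)), key=lambda i: cosine(q, index[i]), reverse=True
        )
        if any(index_labels[i] == q_label for i in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Hypothetical embeddings labelled by modality.
index = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
index_labels = ["CT", "MR", "CT"]
queries = [[1.0, 0.05], [0.1, 1.0]]
query_labels = ["CT", "MR"]

score = recall_at_k(queries, query_labels, index, index_labels, k=1)
print(score)
```

The brute-force ranking is only suitable for small collections; a real system would typically replace it with an approximate nearest-neighbour index.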
Tong Wu, Yulei Qin, Enwei Zhang, Zihan Xu, Yuting Gao, Ke Li, Xing Sun
Retrieval augmentation has become an effective solution to empower large language models (LLMs) with external and verified knowledge sources from the database, which overcomes the limitations and hallucinations of LLMs in handling up-to-date and domain-specific information. However, existing embedding models for text retrieval usually have three non-negligible limitations. First, the number and diversity of samples in a batch are too restricted to supervise the modeling of textual nuances at scale. Second, the high proportion of noise is detrimental to the semantic correctness and consistency of embeddings. Third, the equal treatment of easy and difficult samples causes sub-optimal convergence of embeddings with poorer generalization. In this paper, we propose PEG, progressively learned embeddings for robust text retrieval. Specifically, we increase the number of in-batch negative samples used in training to 80,000, and for each query we extract five hard negatives. Concurrently, we incorporate a progressive learning mechanism, enabling the model to dynamically modulate its attention to the samples throughout the entire training process. Additionally, PEG is trained on more than 100 million data points, encompassing a wide range of domains (e.g., finance, medicine, and tourism) and covering various tasks (e.g., question-answering, machine reading comprehension, and similarity matching). Extensive experiments conducted on C-MTEB and DuReader demonstrate that PEG surpasses state-of-the-art embeddings in retrieving true positives, highlighting its significant potential for applications in LLMs. Our model is publicly available at https://huggingface.co/TownsWu/PEG.
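To make the role of in-batch negatives concrete, here is a minimal sketch of a standard contrastive (InfoNCE-style) loss in which every other document in the batch serves as a negative for a given query. The abstract does not state the exact objective used for PEG, so the loss form, the `temperature` value, and the function name are assumptions, and hard negatives and the progressive weighting are not modelled.

```python
import math


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    doc_vecs[i] is the positive for query_vecs[i]; all other documents
    in the batch act as negatives. Vectors are assumed to be normalized.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each query matches its own document exactly.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]

print(in_batch_contrastive_loss(queries, aligned_docs))
print(in_batch_contrastive_loss(queries, swapped_docs))
```

A larger batch adds more terms to the denominator, which is why increasing the number of in-batch negatives gives each query a harder discrimination task.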
Hansi Zeng, Chen Luo, Bowen Jin, Sheikh Muhammad Sarwar, Tianxin Wei, Hamed Zamani
Recent research has shown that transformer networks can be used as differentiable search indexes by representing each document as a sequence of document ID tokens. These generative retrieval models cast the retrieval problem as a document ID generation problem for each given query. Despite their elegant design, existing generative retrieval models only perform well on artificially constructed and small-scale collections. This has led to serious skepticism in the research community about their real-world impact. This paper represents an important milestone in generative retrieval research by showing, for the first time, that generative retrieval models can be trained to perform effectively on large-scale standard retrieval benchmarks. To do so, we propose RIPOR, an optimization framework for generative retrieval that can be adopted by any encoder-decoder architecture. RIPOR is designed based on two often-overlooked fundamental design considerations in generative retrieval. First, given the sequential decoding nature of document ID generation, assigning accurate relevance scores to documents based on the whole document ID sequence is not sufficient. To address this issue, RIPOR introduces a novel prefix-oriented ranking optimization algorithm. Second, initial document IDs should be constructed based on relevance associations between queries and documents, instead of the syntactic and semantic information in the documents. RIPOR addresses this issue using a relevance-based document ID construction approach that quantizes relevance-based representations learned for documents. Evaluation on MSMARCO and TREC Deep Learning Track reveals that RIPOR surpasses state-of-the-art generative retrieval models by a large margin (e.g., 30.5% MRR improvements on MS MARCO Dev Set), and performs on par with popular dense retrieval models.
Sedrick Keh, Justin T. Chiu, Daniel Fried
When a model is trying to gather information in an interactive setting, it benefits from asking informative questions. However, in the case of a grounded multi-turn image identification task, previous studies have been constrained to polar yes/no questions, limiting how much information the model can gain in a single turn. We present an approach that formulates more informative, open-ended questions. In doing so, we discover that off-the-shelf visual question answering (VQA) models often make presupposition errors, which standard information gain question selection methods fail to account for. To address this issue, we propose a method that can incorporate presupposition handling into both question selection and belief updates. Specifically, we use a two-stage process, where the model first filters out images which are irrelevant to a given question, then updates its beliefs about which image the user intends. Through self-play and human evaluations, we show that our method is successful in asking informative open-ended questions, increasing accuracy over the past state-of-the-art by 14%, while resulting in 48% more efficient games in human evaluations.
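The standard information-gain criterion that the abstract above refers to can be sketched as follows. This is the baseline selection rule only, not the authors' presupposition-aware method; the names `belief` and `answer_likelihood` and the toy distributions are assumptions for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_information_gain(belief, answer_likelihood):
    """Expected reduction in entropy over candidate images from asking a question.

    belief[i] = P(image i is the target).
    answer_likelihood[i][a] = P(answer a | image i) for this question.
    """
    n_answers = len(answer_likelihood[0])
    gain = entropy(belief)
    for a in range(n_answers):
        joint = [belief[i] * answer_likelihood[i][a] for i in range(len(belief))]
        p_a = sum(joint)
        if p_a > 0:
            posterior = [j / p_a for j in joint]
            gain -= p_a * entropy(posterior)
    return gain


belief = [0.25, 0.25, 0.25, 0.25]

# A polar question that splits the four candidates in half.
yes_no = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# An open-ended question whose answer differs for every candidate.
open_ended = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

print(expected_information_gain(belief, yes_no))
print(expected_information_gain(belief, open_ended))
```

In this toy setting the open-ended question yields twice the information of the polar one, which matches the motivation in the abstract; the sketch assumes the answer model is reliable and therefore does not account for presupposition errors.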
Xiaonan Li, Changtai Zhu, Linyang Li, Zhangyue Yin, Tianxiang Sun, Xipeng Qiu
Verifiable generation aims to let the large language model (LLM) generate
text with supporting documents, which enables the user to flexibly verify the
answer and makes the LLM's output more reliable. Retrieval plays a crucial role
in verifiable generation. Specifically, the retrieved documents not only
supplement knowledge to help the LLM generate correct answers, but also serve
as supporting evidence for the user to verify the LLM's output. However, the
widely used retrievers become the bottleneck of the entire pipeline and limit
the overall performance. Their capabilities are usually inferior to LLMs since
they often have far fewer parameters than the large language model and have
not been demonstrated to scale well to the size of LLMs. If the retriever does
not correctly find the supporting documents, the LLM cannot generate the
correct and verifiable answer, which overshadows the LLM's remarkable
abilities. To address these limitations, we propose LLatrieval (Large Language
Model Verified Retrieval), where the LLM updates the retrieval result until it
verifies that the retrieved documents can sufficiently support answering the
question. Thus, the LLM can iteratively provide feedback to retrieval and
facilitate the retrieval result to fully support verifiable generation.
Experiments show that LLatrieval significantly outperforms extensive baselines
and achieves state-of-the-art results.
Authors' comments: Accepted by NAACL 2024 (Main Conference)
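The verify-then-update loop described in the abstract above can be sketched as a small control-flow skeleton. The callables `retrieve`, `verify`, and `update_query`, the `max_rounds` cap, and the toy corpus are all hypothetical placeholders; in the actual system the verification and update steps are performed by an LLM, whose prompts and interfaces are not given in the abstract.

```python
from typing import Callable, List


def verified_retrieval(
    question: str,
    retrieve: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], bool],
    update_query: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> List[str]:
    """Retrieve documents, re-querying until `verify` accepts them.

    Returns the last retrieved documents, which may be unverified if
    `max_rounds` is exhausted (an assumed stopping rule).
    """
    query = question
    docs = retrieve(query)
    for _ in range(max_rounds):
        if verify(question, docs):
            break
        # Feedback step: produce a revised query from the current result.
        query = update_query(question, docs)
        docs = retrieve(query)
    return docs


# Toy stand-ins for the retriever and the LLM-based steps.
corpus = {
    "capital": ["Paris is a city."],
    "capital of France": ["Paris is the capital of France."],
}
result = verified_retrieval(
    "capital",
    retrieve=lambda q: corpus.get(q, []),
    verify=lambda question, docs: any("capital of France" in d for d in docs),
    update_query=lambda question, docs: "capital of France",
)
print(result)
```

Separating the three callables keeps the loop independent of any particular retriever or model, so each component can be swapped or mocked in isolation.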
Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe et al.
Recent advances in large language models have enabled the development of
viable generative information retrieval systems. A generative retrieval system
returns a grounded generated text in response to an information need instead of
the traditional document ranking. Quantifying the utility of these types of
responses is essential for evaluating generative retrieval systems. As the
established evaluation methodology for ranking-based ad hoc retrieval may seem
unsuitable for generative retrieval, new approaches for reliable, repeatable,
and reproducible experimentation are required. In this paper, we survey the
relevant information retrieval and natural language processing literature,
identify search tasks and system architectures in generative retrieval, develop
a corresponding user model, and study its operationalization. This theoretical
analysis provides a foundation and new insights for the evaluation of
generative ad hoc retrieval systems.
Authors' comments: 14 pages, 5 figures, 1 table
Hang Zhang, Yeyun Gong, Xingwei He, Dayiheng Liu, Daya Guo, Jiancheng Lv, Jian Guo
Most dense retrieval models contain an implicit assumption: the training
query-document pairs are exactly matched. Since it is expensive to annotate the
corpus manually, training pairs in real-world applications are usually
collected automatically, which inevitably introduces mismatched-pair noise. In
this paper, we explore an interesting and challenging problem in dense
retrieval, how to train an effective model with mismatched-pair noise. To solve
this problem, we propose a novel approach called Noisy Pair Corrector (NPC),
which consists of a detection module and a correction module. The detection
module estimates noise pairs by calculating the perplexity between annotated
positive and easy negative documents. The correction module utilizes an
exponential moving average (EMA) model to provide a soft supervised signal,
aiding in mitigating the effects of noise. We conduct experiments on the
text-retrieval benchmarks Natural Questions and TriviaQA and the code-search
benchmarks StaQC and SO-DS. Experimental results show that NPC achieves excellent
performance in handling both synthetic and realistic noise.
Authors' comments: Findings of EMNLP 2023
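The exponential moving average (EMA) model mentioned in the abstract above follows a simple update rule, sketched below on a flat list of parameters. The `decay` value and the flat-list parameter representation are illustrative assumptions; how NPC turns the EMA model's outputs into a soft supervised signal is not described in the abstract and is not modelled here.

```python
def ema_update(ema_params, params, decay=0.99):
    """Return updated EMA parameters: decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Toy example: the EMA copy drifts slowly toward the current parameters.
ema = [1.0, 0.0]
current = [0.0, 1.0]
for _ in range(3):
    ema = ema_update(ema, current, decay=0.9)
print(ema)
```

Because the EMA copy changes slowly, it is less affected by individual noisy updates than the model being trained, which is the usual reason for using it as a more stable source of supervision.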
Sunkyung Lee, Minjin Choi, Jongwuk Lee
Generative retrieval sheds light on a new paradigm of document retrieval,
aiming to directly generate the identifier of a relevant document for a query.
While it takes advantage of bypassing the construction of auxiliary index
structures, existing studies face two significant challenges: (i) the
discrepancy between the knowledge of pre-trained language models and
identifiers and (ii) the gap between training and inference that poses
difficulty in learning to rank. To overcome these challenges, we propose a
novel generative retrieval method, namely Generative retrieval via LExical
iNdex learning (GLEN). For training, GLEN effectively exploits a dynamic
lexical identifier using a two-phase index learning strategy, enabling it to
learn meaningful lexical identifiers and relevance signals between queries and
documents. For inference, GLEN utilizes collision-free inference, using
identifier weights to rank documents without additional overhead. Experimental
results prove that GLEN achieves state-of-the-art or competitive performance
against existing generative retrieval methods on various benchmark datasets,
e.g., NQ320k, MS MARCO, and BEIR. The code is available at
https://github.com/skleee/GLEN.
Authors' comments: In Proceedings of the 2023 Conference on Empirical Methods in Natural
Language Processing (EMNLP 2023) main conference. 12 pages, 2 figures, 8
tables
Björn Engelmann, Timo Breuer, Philipp Schaer
Considering the multimodal signals of search items is beneficial for
retrieval effectiveness. Especially in web table retrieval (WTR) experiments,
accounting for multimodal properties of tables boosts effectiveness. However,
it still remains an open question how the single modalities affect user
experience in particular. Previous work analyzed WTR performance in ad-hoc
retrieval benchmarks, which neglects interactive search behavior and limits the
conclusion about the implications for real-world user environments.
To this end, this work presents an in-depth evaluation of simulated
interactive WTR search sessions as a more cost-efficient and reproducible
alternative to real user studies. As a first of its kind, we introduce
interactive query reformulation strategies based on Doc2Query, incorporating
cognitive states of simulated user knowledge. Our evaluations include two
perspectives on user effectiveness by considering different cost paradigms,
namely query-wise and time-oriented measures of effort. Our multi-perspective
evaluation scheme reveals new insights about query strategies, the impact of
modalities, and different user types in simulated WTR search sessions.
Authors' comments: 4 pages + references; accepted at CIKM'23
Philippe Jaming, Martin Rathmair
We consider the problem of reconstructing a function $f\in L^2(\mathbb{R})$ given phase-less samples of its Gabor transform, which is defined by $$\mathcal{G} f(x,y) := 2^{\frac14} \int_{\mathbb{R}} f(t) e^{-\pi (t-x)^2} e^{-2\pi i y t}\,\mbox{d}t,\quad (x,y)\in\mathbb{R}^2.$$ More precisely, given sampling positions $\Omega\subseteq \mathbb{R}^2$ the task is to reconstruct $f$ (up to global phase) from measurements $\{|\mathcal{G} f(\omega)|: \,\omega\in\Omega\}$. This non-linear inverse problem is known to suffer from severe ill-posedness. As for any other phase retrieval problem, constructive recovery is a notoriously delicate affair due to the lack of convexity. One of the fundamental insights in this line of research is that the connectivity of the measurements is both necessary and sufficient for reconstruction of phase information to be theoretically possible. In this article we propose a reconstruction algorithm which is based on solving two convex problems and, as such, amenable to numerical analysis. We show, empirically as well as analytically, that the scheme accurately reconstructs from noisy data within the connected regime. Moreover, to emphasize the practicability of the algorithm we argue that both convex problems can actually be reformulated as semi-definite programs for which efficient solvers are readily available. The approach is based on ideas from complex analysis, Gabor frame theory as well as matrix completion.
Carlos Dominguez, Jon Ander Campos, Eneko Agirre, Gorka Azkune
Neural information retrieval requires costly annotated data for each target domain to be competitive. Synthetic annotation by query generation using Large Language Models or rule-based string manipulation has been proposed as an alternative, but their relative merits have not been analysed. In this paper, we compare both methods head-to-head using the same neural IR architecture. We focus on the BEIR benchmark, which includes test datasets from several domains with no training data, and explore two scenarios: zero-shot, where the supervised system is trained on a large out-of-domain dataset (MS-MARCO); and unsupervised domain adaptation, where, in addition to MS-MARCO, the system is fine-tuned on synthetic data from the target domain. Our results indicate that Large Language Models outperform rule-based methods in all scenarios by a large margin, and, more importantly, that unsupervised domain adaptation is effective compared to applying a supervised IR system in a zero-shot fashion. In addition, we explore several sizes of open Large Language Models to generate synthetic data and find that a medium-sized model suffices. Code and models are publicly available for reproducibility.
Jinsung Yoon, Sercan O Arik, Yanfei Chen, Tomas Pfister
Embeddings extracted by pre-trained Large Language Models (LLMs) have
significant potential to improve information retrieval and search. Beyond the
zero-shot setup in which they are being conventionally used, being able to take
advantage of the information from the relevant query-corpus paired data can
further boost the LLM capabilities. In this paper, we propose a novel method,
Search-Adaptor, for customizing LLMs for information retrieval in an efficient
and robust way. Search-Adaptor modifies the embeddings generated by pre-trained
LLMs, and can be integrated with any LLM, including those only available via
prediction APIs. On multiple English, multilingual, and multimodal retrieval
datasets, we show consistent and significant performance benefits for
Search-Adaptor -- e.g., more than 5% improvements for Google Embedding APIs in
nDCG@10 averaged over 14 BEIR datasets.
Authors' comments: Published in 2024 ACL Main Conference
Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, Jian-Yun Nie
Large language models (LLMs) face significant challenges stemming from the inherent limitations in knowledge, memory, alignment, and action. These challenges cannot be addressed by LLMs alone, but should rely on assistance from the external world, such as knowledge bases, memory stores, demonstration examples, and tools. Retrieval augmentation stands as a vital mechanism for bridging the gap between LLMs and the external assistance. However, conventional methods encounter two pressing issues. On one hand, the general-purpose retrievers are not properly optimized for the retrieval augmentation of LLMs. On the other hand, the task-specific retrievers lack the required versatility, hindering their performance across the diverse retrieval augmentation scenarios. In this work, we present a novel approach, the LLM-Embedder, which comprehensively supports the diverse needs of LLMs' retrieval augmentation with one unified embedding model. Training such a unified model is non-trivial, as various retrieval tasks aim to capture distinct semantic relationships, often subject to mutual interference. To address this challenge, we systematically optimize our training methodology. This includes reward formulation based on LLMs' feedback, the stabilization of knowledge distillation, multi-task fine-tuning with explicit instructions, and the use of homogeneous in-batch negative sampling. These optimization strategies contribute to the outstanding empirical performance of the LLM-Embedder. Notably, it yields remarkable enhancements in retrieval augmentation for LLMs, surpassing both general-purpose and task-specific retrievers in various evaluation scenarios. This project is made publicly available at https://github.com/FlagOpen/FlagEmbedding.
Mingcheng Chen, Haoran Zhao, Yuxiang Zhao, Hulei Fan, Hongqiao Gao, Yong Yu, Zheng Tian
Data-driven black-box model-based optimization (MBO) problems arise in a
great number of practical application scenarios, where the goal is to find a
design over the whole space maximizing a black-box target function based on a
static offline dataset. In this work, we consider a more general but
challenging MBO setting, named constrained MBO (CoMBO), where only part of the
design space can be optimized while the rest is constrained by the environment.
A new challenge arising from CoMBO is that most observed designs that satisfy
the constraints are mediocre in evaluation. Therefore, we focus on optimizing
these mediocre designs in the offline dataset while maintaining the given
constraints rather than further boosting the best observed design in the
traditional MBO setting. We propose retrieval-enhanced offline model-based
optimization (ROMO), a new derivable forward approach that retrieves the
offline dataset and aggregates relevant samples to provide a trusted
prediction, and uses it for gradient-based optimization. ROMO is simple to
implement and outperforms state-of-the-art approaches in the CoMBO setting.
Empirically, we conduct experiments on a synthetic Hartmann (3D) function
dataset, an industrial CIO dataset, and a suite of modified tasks in the
Design-Bench benchmark. Results show that ROMO performs well in a wide range of
constrained optimization tasks.
Authors' comments: 15 pages, 9 figures
Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro
Pretraining auto-regressive large language models (LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLMs is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval. Specifically, we continue to pretrain a 43B GPT model on an additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. Notably, the obtained foundation model, Retro 48B, largely outperforms the counterpart GPT 43B trained on 1.2T tokens in terms of perplexity with only 2.58% additional GPU hours, demonstrating the significant scaling potential of the method. After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction-tuned GPT on a wide range of zero-shot tasks. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA and reading comprehension tasks, 10% over GPT across 4 challenging long-form QA tasks, and 16% over GPT across 3 summarization tasks. Surprisingly, we find that one can ablate the encoder from the InstructRetro architecture and directly use its decoder backbone, while achieving comparable results. Our results highlight the promising direction to obtain a better GPT decoder through continued pretraining with retrieval before instruction tuning. Our code and checkpoints are publicly available at: https://github.com/NVIDIA/Megatron-LM/tree/InstructRetro/tools/retro.
Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick Siow Mong Goh, Chun-Mei Feng
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption. Most existing CIR models adopt the late-fusion strategy to combine visual and language features. In addition, several approaches have been suggested to generate a pseudo-word token from the reference image, which is further integrated into the relative caption for CIR. However, these pseudo-word-based prompting methods have limitations when the target image involves complex changes to the reference image, e.g., object removal and attribute modification. In this work, we demonstrate that learning an appropriate sentence-level prompt for the relative caption (SPRC) is sufficient for achieving effective composed image retrieval. Instead of relying on pseudo-word-based prompts, we propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts. By concatenating the learned sentence-level prompt with the relative caption, one can readily use existing text-based image retrieval models to enhance CIR performance. Furthermore, we introduce both image-text contrastive loss and text prompt alignment loss to enforce the learning of suitable sentence-level prompts. Experiments show that our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets. The source code and pretrained model are publicly available at https://github.com/chunmeifeng/SPRC