Fangzhou Song, Bin Zhu, Yanbin Hao, Shuo Wang
Learning recipe and food image representations in a common embedding space is
non-trivial but crucial for cross-modal recipe retrieval. In this paper, we
propose a new perspective for this problem by utilizing foundation models for
data augmentation. Leveraging the remarkable capabilities of foundation
models (i.e., Llama2 and SAM), we propose to augment recipe and food image by
extracting alignable information related to the counterpart. Specifically,
Llama2 is employed to generate a textual description from the recipe, aiming to
capture the visual cues of a food image, and SAM is used to produce image
segments that correspond to key ingredients in the recipe. To make full use of
the augmented data, we introduce the Data Augmented Retrieval (DAR) framework to
enhance recipe and image representation learning for cross-modal retrieval. We
first inject adapter layers into a pre-trained CLIP model, which reduces computation
cost compared with fully fine-tuning all the parameters. In addition, a multi-level
circle loss, which assigns different penalties to positive and negative pairs, is
proposed to align the original and augmented data pairs. On the Recipe1M
dataset, our DAR outperforms all existing methods by a large margin. Extensive
ablation studies validate the effectiveness of each component of DAR.
Authors' comments: ECCV2024
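The penalty weighting that the DAR abstract attributes to its circle loss can be illustrated with the standard single-level circle loss. This is only a sketch: the abstract does not define the multi-level variant, so the function below covers one level, and the `margin` and `gamma` defaults are assumed values rather than the paper's settings.

```python
import math

def _logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def circle_loss(pos_sims, neg_sims, margin=0.25, gamma=64.0):
    """Single-level circle loss over cosine similarities in [-1, 1].

    margin and gamma are hypothetical defaults, not values from the abstract.
    """
    # Positive pairs: the weight grows as the similarity falls below 1 + margin.
    pos_logits = [-gamma * max(1.0 + margin - s, 0.0) * (s - (1.0 - margin))
                  for s in pos_sims]
    # Negative pairs: the weight grows as the similarity rises above -margin.
    neg_logits = [gamma * max(s + margin, 0.0) * (s - margin)
                  for s in neg_sims]
    # log(1 + sum_j exp(neg_j) * sum_i exp(pos_i))
    return _softplus(_logsumexp(pos_logits) + _logsumexp(neg_logits))
```

Because each pair carries its own weight, a positive pair that is still far from its target is penalized more strongly than one that is already well aligned, and likewise for negatives.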
Jialong Zuo, Hanyu Zhou, Ying Nie, Feng Zhang, Tianyu Guo, Nong Sang, Yunhe Wang, Changxin Gao
Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders the model from comprehending the fine-grained semantics of query texts in real scenarios. To address this problem, we contribute a new benchmark named \textbf{UFineBench} for text-based person retrieval with ultra-fine granularity. Firstly, we construct a new \textbf{dataset} named UFine6926. We collect a large number of person images and manually annotate each image with two detailed textual descriptions, averaging 80.8 words each. The average word count is three to four times that of the previous datasets. In addition to standard in-domain evaluation, we also propose a special \textbf{evaluation paradigm} more representative of real scenarios. It contains a new evaluation set with cross domains, cross textual granularity and cross textual styles, named UFine3C, and a new evaluation metric for accurately measuring retrieval ability, named mean Similarity Distribution (mSD). Moreover, we propose CFAM, a more efficient \textbf{algorithm} especially designed for text-based person retrieval with ultra fine-grained texts. It achieves fine granularity mining by adopting a shared cross-modal granularity decoder and a hard negative match mechanism. With standard in-domain evaluation, CFAM establishes competitive performance across various datasets, especially on our ultra fine-grained UFine6926. Furthermore, by evaluating on UFine3C, we demonstrate that training on our UFine6926 significantly improves generalization to real scenarios compared with other coarse-grained datasets. The dataset and code will be made publicly available at \url{https://github.com/Zplusdragon/UFineBench}.
Keonwoo Kim, Younggun Lee
With the growing volume of diverse information, the demand for classifying arbitrary topics has become increasingly critical. To address this challenge, we introduce DRAFT, a simple framework designed to train a classifier for few-shot topic classification. DRAFT uses a few examples of a specific topic as queries to construct a Customized dataset with a dense retriever model. A multi-query retrieval (MQR) algorithm, which effectively handles multiple queries related to a specific topic, is applied to construct the Customized dataset. Subsequently, we fine-tune a classifier using the Customized dataset to identify the topic. To demonstrate the efficacy of our proposed approach, we conduct evaluations on both widely used classification benchmark datasets and manually constructed datasets with 291 diverse topics, which simulate diverse contents encountered in real-world applications. DRAFT shows competitive or superior performance compared to baselines that use in-context learning, such as GPT-3 175B and InstructGPT 175B, on few-shot topic classification tasks despite having 177 times fewer parameters, demonstrating its effectiveness.
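One way to picture the retrieval step that DRAFT uses to build its Customized dataset is to retrieve the nearest passages for each example query and merge the results. The abstract does not say how MQR combines queries, so the union rule, the cosine scoring, and the names `multi_query_retrieve` and `top_k` below are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def multi_query_retrieve(query_vecs, corpus_vecs, top_k=2):
    """Return sorted corpus indices that are in the top_k of at least one query.

    The union merge is an assumed rule, not the MQR algorithm from the paper.
    """
    selected = set()
    for query in query_vecs:
        ranked = sorted(range(len(corpus_vecs)),
                        key=lambda i: cosine(query, corpus_vecs[i]),
                        reverse=True)
        selected.update(ranked[:top_k])
    return sorted(selected)

# Toy embeddings: two example queries for one topic, five corpus passages.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
print(multi_query_retrieve(queries, corpus, top_k=1))  # [0, 2]
```

The retrieved passages would then serve as positive training examples for the topic classifier.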
Amitha Attapu
Robots can be very useful for automating tasks and reducing the human effort
required. But for a robot to know how to perform a task, we need to give it a
clear set of steps to follow. It is nearly impossible to provide a robot with
instructions for every possible task. Therefore, we use the universal functional
object-oriented network (FOON), which was created and expanded and contains a
large amount of existing recipe information [1]. But certain tasks are
complicated for robots to perform and, similarly, some tasks are complicated
for humans to perform. Therefore, weights have been added to functional units
to represent the chance of successful execution of the motion by the robot [2].
Given a set of kitchen items and a goal node, a robot using the universal FOON
must be able to determine whether the required items are present in the kitchen
and, if so, get the steps to convert the required kitchen items to the goal
node. In this paper, we use two algorithms (IDS and GBFS) to retrieve a task
tree (if possible) for a goal node and a given set of kitchen items. The paper
is organized as follows. Section II, FOON Creation, discusses the terminology
related to FOON and the visualization of FOON. Section III, Methodology,
discusses the IDS and GBFS search algorithms and the two different heuristics
implemented and used in GBFS. Section IV, Experiment/Discussion, compares the
performance of the different algorithms. The final section, Section V, lists
the references of the papers that have been cited.
Authors' comments: 3 pages, 3 figures
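The two search strategies named in the FOON abstract can be sketched on a plain directed graph. This is a generic illustration: the node names, the toy graph, and the heuristic values are made up, and the code does not model FOON's functional units, success weights, or the paper's two heuristics.

```python
import heapq

def depth_limited(graph, node, goal, limit, path):
    """Depth-first search that gives up below the given depth limit."""
    if node == goal:
        return path
    if limit == 0:
        return None
    for nxt in graph.get(node, []):
        if nxt not in path:  # avoid cycles along the current path
            found = depth_limited(graph, nxt, goal, limit - 1, path + [nxt])
            if found is not None:
                return found
    return None

def ids(graph, start, goal, max_depth=10):
    """Iterative deepening search: retry with limits 0, 1, 2, ..."""
    for limit in range(max_depth + 1):
        found = depth_limited(graph, start, goal, limit, [start])
        if found is not None:
            return found
    return None

def gbfs(graph, start, goal, heuristic):
    """Greedy best-first search: always expand the lowest-heuristic node."""
    frontier = [(heuristic(start), [start])]
    visited = set()
    while frontier:
        _, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(frontier, (heuristic(nxt), path + [nxt]))
    return None

# Hypothetical toy graph and heuristic values.
graph = {"kitchen": ["chop", "boil"], "chop": ["mix"],
         "boil": ["mix", "salad"], "mix": ["salad"]}
h = {"kitchen": 3, "chop": 1, "boil": 2, "mix": 1, "salad": 0}
print(ids(graph, "kitchen", "salad"))         # ['kitchen', 'boil', 'salad']
print(gbfs(graph, "kitchen", "salad", h.get))  # ['kitchen', 'chop', 'mix', 'salad']
```

On this toy graph IDS returns the shallowest path, while GBFS follows the heuristic and returns a longer one, which is the kind of trade-off a comparison of the two algorithms would expose.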
Junfeng Liu, Zhuocheng Mei, Kewen Peng, Ranga Raju Vatsavai
Conversational agents leveraging AI, particularly deep learning, are emerging
in both academic research and real-world applications. However, these
applications still face challenges, including disrespecting knowledge and
facts, not personalizing to user preferences, and enormous demand for
computational resources during training and inference. Recent research efforts
have been focused on addressing these challenges from various aspects,
including supplementing various types of auxiliary information to the
conversational agents. However, existing methods are still not able to
effectively and efficiently exploit relevant information from these auxiliary
supplements to further unleash the power of the conversational agents and the
language models they use. In this paper, we present a novel method, PK-NCLI,
that is able to accurately and efficiently identify relevant auxiliary
information to improve the quality of conversational responses by learning the
relevance among persona, chat history, and knowledge background through
low-level normalized contextual latent interaction. Our experimental results
indicate that PK-NCLI outperforms the state-of-the-art method, PK-FoCus, by
47.80%/30.61%/24.14% in terms of perplexity, knowledge grounding, and training
efficiency, respectively, while maintaining the same level of persona grounding
performance. We also provide a detailed analysis of how different factors,
including language model choices and trade-offs on training weights, would
affect the performance of PK-NCLI.
Authors' comments: 2023 IEEE International Conference on Data Mining Workshops (ICDMW)
Nan Yang, Yannan Zhang, Xiaoling Bai, Hualong Deng, Tianhua Zhou, Jin Ma
Information retrieval in real-time search presents unique challenges distinct from those encountered in classical web search. These challenges are particularly pronounced due to the rapid change of user search intent, which is influenced by the occurrence and evolution of breaking news events, such as earthquakes, elections, and wars. Previous dense retrieval methods, which primarily focused on static semantic representation, lack the capacity to capture immediate search intent, leading to inferior performance in retrieving the most recent event-related documents in time-sensitive scenarios. To address this issue, this paper expands the query with event information that represents real-time search intent. The event information is then integrated with the query through a cross-attention mechanism, resulting in a time-context query representation. We further enhance the model's capacity for event representation through multi-task training. Since publicly available datasets such as MS-MARCO do not contain any event information on the query side and have few time-sensitive queries, we design an automatic data collection and annotation pipeline to address this issue, which includes ModelZoo-based Coarse Annotation and LLM-driven Fine Annotation processes. In addition, we share training tricks such as two-stage training and hard negative sampling. Finally, we conduct a set of offline experiments on a million-scale production dataset to evaluate our approach and run an A/B test in a real online system to verify the performance. Extensive experimental results demonstrate that our proposed approach significantly outperforms existing state-of-the-art baseline methods.
Hamed Damirchi, Cristian Rodríguez-Opazo, Ehsan Abbasnejad, Damien Teney, Javen Qinfeng Shi, Stephen Gould, Anton van den Hengel
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. The Web likely contains the information necessary to excel on any specific application, but identifying the right data a priori is challenging. This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval. We propose to retrieve useful data from the Web at test time based on test cases that the model is uncertain about. Different from existing retrieval-augmented approaches, we then update the model to address this underlying uncertainty. We demonstrate substantial improvements in zero-shot performance, e.g. a remarkable increase of 15 percentage points in accuracy on the Stanford Cars and Flowers datasets. We also present extensive experiments that explore the impact of noisy retrieval and different learning strategies.
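The idea of retrieving only for test cases the model is uncertain about can be sketched with a simple selection rule. The abstract does not name its uncertainty measure, so predictive entropy, the threshold, and the function names here are assumptions for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_uncertain(predictions, threshold):
    """Indices of test cases whose predictive entropy exceeds the threshold.

    Entropy is an assumed stand-in for the paper's uncertainty criterion.
    """
    return [i for i, probs in enumerate(predictions)
            if entropy(probs) > threshold]

# Three hypothetical test cases with softmax outputs over three classes.
predictions = [[0.98, 0.01, 0.01], [0.40, 0.35, 0.25], [0.50, 0.50, 0.00]]
print(select_uncertain(predictions, threshold=0.5))  # [1, 2]
```

Only the selected cases would trigger a web search, after which the model is updated on the retrieved data.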
Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen
Existing information retrieval (IR) models often assume a homogeneous format,
limiting their applicability to diverse user needs, such as searching for
images with text descriptions, searching for a news article with a headline
image, or finding a similar photo with a query image. To approach such
different information-seeking demands, we introduce UniIR, a unified
instruction-guided multimodal retriever capable of handling eight distinct
retrieval tasks across modalities. UniIR, a single retrieval system jointly
trained on ten diverse multimodal-IR datasets, interprets user instructions to
execute various retrieval tasks, demonstrating robust performance across
existing datasets and zero-shot generalization to new tasks. Our experiments
highlight that multi-task training and instruction tuning are keys to UniIR's
generalization ability. Additionally, we construct M-BEIR, a multimodal
retrieval benchmark with comprehensive results, to standardize the evaluation
of universal multimodal information retrieval.
Authors' comments: Our code and dataset are available on this project page:
https://tiger-ai-lab.github.io/UniIR/
Kai Liu, Deguang Han
A twirling channel is a quantum channel induced by a continuous unitary representation $\pi = \sum_{i}^{\oplus} m_i\pi_i$, where $\pi_i$ are inequivalent irreducible representations. Motivated by a recent work \cite{Twirling} on minimal mixed unitary rank of $\Phi_{\pi}$, we explore the connections of the independence number, zero-error capacity, quantum codes, orthogonality index and phase retrievability of the quantum channel $\Phi_{\pi}$ with the irreducible representation multiplicities $m_i$ and the irreducible representation dimensions $\dim H_{\pi_i}$. In particular, we show that the independence number of $\Phi_{\pi}$ is the sum of the multiplicities, the orthogonality index of $\Phi_{\pi}$ is exactly the sum of those representation dimensions, and the zero-error capacity is equal to $\log (\sum_{i=1}^{d}m_i)$. We also present a lower bound for the phase retrievability in terms of the minimal length of phase retrievable frames for $C^n$.
Fan Jiang, Tom Drummond, Trevor Cohn
Although existing neural retrieval models show promising results when
training data is abundant and the performance keeps improving as training data
increases, collecting high-quality annotated data is prohibitively costly. To
this end, we introduce a novel noisy self-training framework combined with
synthetic queries, showing that neural retrievers can be improved in a
self-evolution manner with no reliance on any external models. Experimental
results show that our method improves consistently over existing methods on
both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval
benchmarks. Extra analysis on low-resource settings reveals that our method is
data efficient and outperforms competitive baselines, with as little as 30% of
labelled training data. Further extending the framework for reranker training
demonstrates that the proposed method is general and yields additional gains on
tasks of diverse domains.\footnote{Source code is available at
\url{https://github.com/Fantabulous-J/Self-Training-DPR}}
Authors' comments: Accepted by EMNLP 2023 Findings
Fan Jiang, Qiongkai Xu, Tom Drummond, Trevor Cohn
Neural 'dense' retrieval models are state of the art for many datasets;
however, these models often exhibit limited domain transfer ability. Existing
approaches to adaptation are unwieldy, such as requiring explicit supervision,
complex model architectures, or massive external models. We present
$\texttt{ABEL}$, a simple but effective unsupervised method to enhance passage
retrieval in zero-shot settings. Our technique follows a straightforward loop:
a dense retriever learns from supervision signals provided by a reranker, and
subsequently, the reranker is updated based on feedback from the improved
retriever. By iterating this loop, the two components mutually enhance one
another's performance. Experimental results demonstrate that our unsupervised
$\texttt{ABEL}$ model outperforms both leading supervised and unsupervised
retrievers on the BEIR benchmark. Meanwhile, it exhibits strong adaptation
abilities to tasks and domains that were unseen during training. By either
fine-tuning $\texttt{ABEL}$ on labelled data or integrating it with existing
supervised dense retrievers, we achieve state-of-the-art
results.\footnote{Source code is available at
\url{https://github.com/Fantabulous-J/BootSwitch}.}
Authors' comments: Accepted by EMNLP 2023 Findings
Sibo Dong, Justin Goldstein, Grace Hui Yang
Many early neural Information Retrieval (NeurIR) methods are re-rankers that rely on a traditional first-stage retriever due to expensive query time computations. Recently, representation-based retrievers have gained much attention. They learn query and document representations separately, making it possible to pre-compute document representations offline and reduce the workload at query time. Both dense and sparse representation-based retrievers have been explored. However, these methods focus on finding the representation that best represents a text (aka metric learning), and the actual retrieval function that is responsible for similarity matching between query and document is kept to a minimum by using a dot product. One drawback is that, unlike a traditional term-level inverted index, the index formed by these embeddings cannot be easily re-used by another retrieval method. Another drawback is that keeping the interaction to a minimum hurts retrieval effectiveness. In contrast, interaction-based retrievers are known for their better retrieval effectiveness. In this paper, we propose a novel SEgment-based Neural Indexing method, SEINE, which provides a general indexing framework that can flexibly support a variety of interaction-based neural retrieval methods. We emphasize a careful decomposition of common components in existing neural retrieval methods and propose to use a segment-level inverted index to store the atomic query-document interaction values. Experiments on LETOR MQ2007 and MQ2008 datasets show that our indexing method can accelerate multiple neural retrieval methods by up to 28 times without sacrificing much effectiveness.
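The segment-level inverted index that SEINE describes can be pictured as a map from each vocabulary term to per-segment values that are computed offline and only looked up at query time. The sketch below is a simplification under stated assumptions: a raw term count stands in for the stored interaction value, segments are fixed-length token windows, and `build_segment_index`, `score`, and `segment_len` are hypothetical names.

```python
from collections import defaultdict

def build_segment_index(docs, segment_len=4):
    """Map term -> {(doc_id, segment_id): stored value}.

    A term count is used as a placeholder for the neural interaction value.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for start in range(0, len(tokens), segment_len):
            key = (doc_id, start // segment_len)
            for tok in tokens[start:start + segment_len]:
                index[tok][key] = index[tok].get(key, 0) + 1
    return index

def score(index, query):
    """Sum the precomputed values per document for the query terms."""
    totals = defaultdict(int)
    for term in query.lower().split():
        for (doc_id, _segment), value in index.get(term, {}).items():
            totals[doc_id] += value
    return dict(totals)

docs = {"d1": "neural retrieval with neural index structures",
        "d2": "classic inverted index"}
index = build_segment_index(docs, segment_len=3)
print(score(index, "neural index"))  # {'d1': 3, 'd2': 1}
```

At query time no document text is processed; the scorer only reads stored values, which is where the speed-up of an index-based design comes from.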
N. Khorshid, M. Min, J. M. Désert
The atmospheric compositions of planets offer a unique view into their
respective formation processes. State-of-the-art observatories and techniques
are finally able to provide high-precision data on atmospheric composition that
can be used to constrain planet formation. In this context, we focus on the
formation of WASP-77Ab based on previous observations of its atmosphere, which
have provided precise C/O and metallicity measurements. We use the SimAb planet
formation simulation to model the formation of WASP-77Ab. We assume two
compositions for the disk in which WASP-77Ab formed: one of solar
composition and one that represents the composition of WASP-77A. In addition,
we consider two different scenarios regarding the migration of the planet and
study the possible planet formation paths that reproduce the composition of
WASP-77Ab. This work shows that the planet is expected to have formed in a disk
where not many planetesimals could be accreted. Moreover, we demonstrate that
the most likely migration scenario is disk-free migration, whereby the planet
initiates its Type II migration within the CO ice line and ends it beyond the
water ice line.
Authors' comments: 10 pages, 9 figures
Kaizhao Liu, Zihao Wang, Lei Wu
In this paper, we present a fine-grained analysis of the local landscape of
phase retrieval under the regime of limited samples. Specifically, we aim to
ascertain the minimal sample size required to guarantee a benign local
landscape surrounding global minima in high dimensions. Let $n$ and $d$ denote
the sample size and input dimension, respectively. We first explore the local
convexity and establish that when $n=o(d\log d)$, for almost every fixed point
in the local ball, the Hessian matrix has negative eigenvalues, provided $d$ is
sufficiently large. Consequently, the local landscape is highly non-convex.
We next consider the one-point convexity and show that, as long as
$n=\omega(d)$, with high probability, the landscape is one-point strongly
convex in the local annulus: $\{w\in\mathbb{R}^d: o_d(1)\leqslant
\|w-w^*\|\leqslant c\}$, where $w^*$ is the ground truth and $c$ is an absolute
constant. This implies that gradient descent, initialized from any point in
this domain, can converge to an $o_d(1)$-loss solution exponentially fast.
Furthermore, we show that when $n=o(d\log d)$, there is a radius of
$\widetilde\Theta\left(\sqrt{1/d}\right)$ such that one-point convexity breaks
down in the corresponding smaller local ball. This indicates an impossibility
of establishing a convergence to the exact $w^*$ for gradient descent under
limited samples by relying solely on one-point convexity.
Authors' comments: 47 pages, 5 figures. Accepted by IEEE Transactions on Information
Theory
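One-point convexity around $w^*$, as used in the phase retrieval landscape abstract above, means that the gradient at $w$ has a positive inner product with $w - w^*$, so a gradient step moves toward $w^*$. The check below is a numerical sketch only: the quartic loss, the Gaussian inputs, and the chosen $n$, $d$, and test points are assumptions, since the abstract does not state them.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_data(n, w_star, rng):
    """Draw n standard Gaussian inputs with measurements y = (x . w*)^2."""
    xs = [[rng.gauss(0.0, 1.0) for _ in w_star] for _ in range(n)]
    ys = [dot(x, w_star) ** 2 for x in xs]
    return xs, ys

def gradient(w, xs, ys):
    """Gradient of the assumed loss L(w) = 1/(4n) * sum_i ((x_i . w)^2 - y_i)^2."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = dot(x, w)
        coef = (p * p - y) * p
        for j, xj in enumerate(x):
            grad[j] += coef * xj
    return [g / len(xs) for g in grad]

def one_point_ratio(w, w_star, xs, ys):
    """<grad L(w), w - w*> / ||w - w*||^2; a positive value means the
    negative gradient at w points toward w*."""
    diff = [a - b for a, b in zip(w, w_star)]
    return dot(gradient(w, xs, ys), diff) / dot(diff, diff)

rng = random.Random(0)
w_star = [1.0, 0.0, 0.0, 0.0, 0.0]
xs, ys = make_data(400, w_star, rng)   # n = 400 samples, d = 5
print(one_point_ratio([1.0, 0.3, 0.0, 0.0, 0.0], w_star, xs, ys) > 0)
```

This sketch uses many more samples than dimensions, the regime in which the abstract reports one-point strong convexity in an annulus around $w^*$; it does not probe the small ball where the property is shown to fail.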
Timo Kats, Peter van der Putten, Jan Scholtes
In a number of information retrieval applications (e.g., patent search, literature review, due diligence, etc.), preventing false negatives is more important than preventing false positives. However, approaches designed to reduce review effort (like "technology assisted review") can create false negatives, since they are often based on active learning systems that exclude documents automatically based on user feedback. Therefore, this research proposes a more recall-oriented approach to reducing review effort: iteratively re-ranking the relevance rankings based on user feedback, which is also referred to as relevance feedback. In our proposed method, the relevance rankings are produced by a BERT-based dense-vector search and the relevance feedback is based on cumulatively summing the queried and selected embeddings. Our results show that this method can reduce review effort by between 17.85% and 59.04%, compared to a baseline approach (of no feedback), given a fixed recall target.
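The feedback step described above, cumulatively summing the query embedding with the embeddings of selected documents and then re-ranking, can be sketched as follows. The two-dimensional vectors stand in for BERT embeddings, and the function names and toy data are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_unreviewed(query_vec, doc_vecs, reviewed):
    """Rank documents not yet reviewed by similarity to the current query."""
    candidates = [d for d in doc_vecs if d not in reviewed]
    return sorted(candidates,
                  key=lambda d: cosine(query_vec, doc_vecs[d]),
                  reverse=True)

def add_feedback(query_vec, selected_vec):
    """Cumulative sum of the query and a document the user marked relevant."""
    return [q + s for q, s in zip(query_vec, selected_vec)]

# Toy embeddings (assumed values).
docs = {"p": [0.9, -0.436], "q": [0.8, 0.6], "r": [0.5, 0.866], "s": [0.7, -0.714]}
query = [1.0, 0.0]
print(rank_unreviewed(query, docs, reviewed=set()))       # ['p', 'q', 's', 'r']
query = add_feedback(query, docs["q"])                    # user marks q relevant
print(rank_unreviewed(query, docs, reviewed={"p", "q"}))  # ['r', 's']
```

No document is ever excluded; feedback only changes the order in which the remaining documents are presented, which is what keeps the approach recall-oriented.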
Tingyou Li, Zixin Xu, Yong S. Chu, Xiaojing Huang, Jizhou Li
Fourier phase retrieval is essential for high-definition imaging of nanoscale structures across diverse fields, notably coherent diffraction imaging. This study presents the Single impliCit neurAl Network (SCAN), a tool built upon coordinate neural networks meticulously designed for enhanced phase retrieval performance. Remedying the drawbacks of conventional iterative methods, which are easily trapped in local minima and sensitive to noise, SCAN adeptly connects object coordinates to their amplitude and phase within a unified network in an unsupervised manner. While many existing methods primarily use Fourier magnitude in their loss function, our approach incorporates both the predicted magnitude and phase, enhancing retrieval accuracy. Comprehensive tests validate SCAN's superiority over traditional and other deep learning models regarding accuracy and noise robustness. We also demonstrate that SCAN excels in the ptychography setting.
Daichi Horita, Naoto Inoue, Kotaro Kikuchi, Kota Yamaguchi, Kiyoharu Aizawa
Content-aware graphic layout generation aims to automatically arrange visual
elements along with a given content, such as an e-commerce product image. In
this paper, we argue that the current layout generation approaches suffer from
the limited training data for the high-dimensional layout structure. We show
that a simple retrieval augmentation can significantly improve the generation
quality. Our model, which is named Retrieval-Augmented Layout Transformer
(RALF), retrieves nearest neighbor layout examples based on an input image and
feeds these results into an autoregressive generator. Our model can apply
retrieval augmentation to various controllable generation tasks and yield
high-quality layouts within a unified architecture. Our extensive experiments
show that RALF successfully generates content-aware layouts in both constrained
and unconstrained settings and significantly outperforms the baselines.
Authors' comments: Accepted to CVPR 2024, Project website:
https://udonda.github.io/RALF/
Sai Munikoti, Anurag Acharya, Sridevi Wagle, Sameera Horawalavithana
Large language models record impressive performance on many natural language processing tasks. However, their knowledge capacity is limited to the pretraining corpus. Retrieval augmentation offers an effective solution by retrieving context from external knowledge sources to complement the language model. However, existing retrieval augmentation techniques ignore the structural relationships between these documents. Furthermore, retrieval models are not explored much in scientific tasks, especially in regard to the faithfulness of retrieved documents. In this paper, we propose a novel structure-aware retrieval augmented language model that accommodates document structure during retrieval augmentation. We create a heterogeneous document graph capturing multiple types of relationships (e.g., citation, co-authorship, etc.) that connect documents from more than 15 scientific disciplines (e.g., Physics, Medicine, Chemistry, etc.). We train a graph neural network on the curated document graph to act as a structural encoder for the corresponding passages retrieved during the model pretraining. Particularly, along with text embeddings of the retrieved passages, we obtain structural embeddings of the documents (passages) and fuse them together before feeding them to the language model. We evaluate our model extensively on various scientific benchmarks that include science question-answering and scientific document classification tasks. Experimental results demonstrate that structure-aware retrieval retrieves more coherent, faithful and contextually relevant passages, while showing comparable performance in overall accuracy.
Samira Ghodratnama, Mehrdad Zakershahrak
The advent of Large Language Models (LLMs) heralds a pivotal shift in online user interactions with information. Traditional Information Retrieval (IR) systems primarily relied on query-document matching, whereas LLMs excel in comprehending and generating human-like text, thereby enriching the IR experience significantly. While LLMs are often associated with chatbot functionalities, this paper extends the discussion to their explicit application in information retrieval. We explore methodologies to optimize the retrieval process, select optimal models, and effectively scale and orchestrate LLMs, aiming for cost-efficiency and enhanced result accuracy. A notable challenge, model hallucination (where the model yields inaccurate or misinterpreted data), is addressed alongside other model-specific hurdles. Our discourse extends to crucial considerations including user privacy, data optimization, and the necessity for system clarity and interpretability. Through a comprehensive examination, we unveil not only innovative strategies for integrating LLMs with IR systems, but also the consequential considerations that underline the need for a balanced approach aligned with user-centric principles.
Tyler Maunu, Martin Molina-Fructuoso
We study accelerated optimization methods in the Gaussian phase retrieval problem. In this setting, we prove that gradient methods with Polyak or Nesterov momentum have similar implicit regularization to gradient descent. This implicit regularization ensures that the algorithms remain in a nice region, where the cost function is strongly convex and smooth despite being nonconvex in general. This ensures that these accelerated methods achieve faster rates of convergence than gradient descent. Experimental evidence demonstrates that the accelerated methods converge faster than gradient descent in practice.
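The comparison between gradient descent and a momentum method in Gaussian phase retrieval can be reproduced on a small synthetic instance. This is a sketch under assumptions: it uses Polyak (heavy-ball) momentum only, the standard quartic loss, an initialization already close to the ground truth, and hand-picked values for `step`, `momentum`, and the problem size, none of which come from the abstract.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(w, xs, ys):
    """Assumed loss L(w) = 1/(4n) * sum_i ((x_i . w)^2 - y_i)^2."""
    return sum((dot(x, w) ** 2 - y) ** 2 for x, y in zip(xs, ys)) / (4 * len(xs))

def gradient(w, xs, ys):
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = dot(x, w)
        coef = (p * p - y) * p
        for j, xj in enumerate(x):
            grad[j] += coef * xj
    return [g / len(xs) for g in grad]

def run(xs, ys, w0, step, momentum, iters):
    """Heavy-ball iteration; momentum=0.0 reduces to plain gradient descent."""
    w, prev = list(w0), list(w0)
    for _ in range(iters):
        g = gradient(w, xs, ys)
        new = [wi - step * gi + momentum * (wi - pi)
               for wi, gi, pi in zip(w, g, prev)]
        prev, w = w, new
    return w

rng = random.Random(0)
w_star = [1.0, 0.0, 0.0, 0.0, 0.0]
xs = [[rng.gauss(0.0, 1.0) for _ in w_star] for _ in range(300)]
ys = [dot(x, w_star) ** 2 for x in xs]
w0 = [1.1, 0.1, -0.1, 0.05, 0.0]           # assumed start near w*
gd = run(xs, ys, w0, step=0.02, momentum=0.0, iters=100)
hb = run(xs, ys, w0, step=0.02, momentum=0.5, iters=100)
print(loss(gd, xs, ys), loss(hb, xs, ys))  # heavy-ball loss should be smaller
```

With the same step size and iteration budget, the momentum run is expected to reach a lower loss, in line with the faster local convergence the abstract reports for accelerated methods.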