Gabriel Laverghetta
Robotic agents often perform tasks that transform sets of input objects into
output objects through functional motions. This work describes the FOON
knowledge representation model for robotic tasks. We define the structure and
key components of FOON and describe the process we followed to create our
universal FOON dataset. The paper describes various search algorithms and
heuristic functions we used to search for objects within the FOON. We performed
multiple searches on our universal FOON using these algorithms and discussed
the effectiveness of each algorithm.
Authors' comments: 4 pages, 3 figures, 3 tables
SeungHeon Doh, Minz Won, Keunwoo Choi, Juhan Nam
This paper introduces effective design choices for text-to-music retrieval systems. An ideal text-based retrieval system would support various input queries such as pre-defined tags, unseen tags, and sentence-level descriptions. In reality, most previous works mainly focused on a single query type (tag or sentence) which may not generalize to another input type. Hence, we review recent text-based music retrieval systems using our proposed benchmark in two main aspects: input text representation and training objectives. Our findings enable a universal text-to-music retrieval system that achieves comparable retrieval performances in both tag- and sentence-level inputs. Furthermore, the proposed multimodal representation generalizes to 9 different downstream music classification tasks. We present the code and demo online.
Ziyan Chen, Heng Wu, Jing Cheng
Current ghost imaging phase reconstruction schemes require complex optical systems, Fourier transform steps, or iterative algorithms, which may increase the difficulty of system design, cause phase retrieval errors, or take too much time. To address this problem, we propose a five-step phase-shifting method in which no complex optical systems, Fourier transform steps, or iterative algorithms are needed. With five designed incoherent sources, one can obtain five corresponding ghost imaging patterns, and the phase information of the object can then be calculated from those five speckle patterns. The applicability of this theoretical proposal is demonstrated via numerical simulations with two kinds of complicated objects, and the results show that the phase information of a complicated object can be reconstructed successfully and quantitatively.
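As an illustration of how a phase can be computed from five phase-shifted measurements, here is a minimal sketch of the classical five-step phase-shifting formula from interferometry. The abstract does not state which formula the authors use for ghost imaging, so the phase shifts (-π, -π/2, 0, π/2, π), the cosine intensity model, and the function names below are all assumptions made for illustration.

```python
import math

def simulate_intensities(phase, background=2.0, modulation=0.5):
    """Hypothetical intensity model: I_k = background + modulation * cos(phase + shift_k)."""
    shifts = [-math.pi, -math.pi / 2, 0.0, math.pi / 2, math.pi]
    return [background + modulation * math.cos(phase + s) for s in shifts]

def five_step_phase(i1, i2, i3, i4, i5):
    """Classical five-step estimate: tan(phase) = 2 (I2 - I4) / (2 I3 - I1 - I5)."""
    # atan2 keeps the correct quadrant, so the result lies in (-pi, pi].
    return math.atan2(2.0 * (i2 - i4), 2.0 * i3 - i1 - i5)

recovered = five_step_phase(*simulate_intensities(0.7))
```

Under this model the background and modulation terms cancel in the ratio, which is why no iterative fitting is needed.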
Daniel Reich, Felix Putze, Tanja Schultz
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions. Systems with strong VG are considered intuitively interpretable and suggest an improved scene understanding. While VQA accuracy performances have seen impressive gains over the past few years, explicit improvements to VG performance and evaluation thereof have often taken a back seat on the road to overall accuracy improvements. A cause of this originates in the predominant choice of learning paradigm for VQA systems, which consists of training a discriminative classifier over a predetermined set of answer options. In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task. As such, the developed system directly ties VG into its core search procedure. Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question. We give a detailed analysis of our approach and discuss its distinctive properties and limitations. Our approach achieves the strongest VG performance among examined systems and exhibits exceptional generalization capabilities in a number of scenarios.
Xunjian Yin, Xinyu Hu, Jin Jiang, Xiaojun Wan
Chinese Spelling Check (CSC) aims to detect and correct error tokens in
Chinese contexts, which has a wide range of applications. However, it is
confronted with the challenges of insufficient annotated data and the issue
that previous methods may actually not fully leverage the existing datasets. In
this paper, we introduce our plug-and-play retrieval method with error-robust
information for Chinese Spelling Check (RERIC), which can be directly applied
to existing CSC models. The datastore for retrieval is built completely based
on the training data, with elaborate designs according to the characteristics
of CSC. Specifically, we employ multimodal representations that fuse phonetic,
morphologic, and contextual information in the calculation of query and key
during retrieval to enhance robustness against potential errors. Furthermore,
in order to better judge the retrieved candidates, the n-gram surrounding the
token to be checked is regarded as the value and utilized for specific
reranking. The experiment results on the SIGHAN benchmarks demonstrate that our
proposed method achieves substantial improvements over existing work.
Authors' comments: 11 pages, 3 figures
Arusarka Bose, Zili Zhou, Guandong Xu
The growing volume of COVID-19 research literature poses new challenges for effective literature screening and for COVID-19 domain-knowledge-aware information retrieval. To tackle these challenges, we present two tasks along with solutions: COVID-19 literature retrieval and question answering. The COVID-19 literature retrieval task screens matching COVID-19 documents for a textual user query, and the COVID-19 question answering task predicts proper text fragments from a text corpus as the answer to specific COVID-19-related questions. Based on transformer neural networks, we provide solutions that implement the tasks on the CORD-19 dataset, and we display some examples to show the effectiveness of our proposed solutions.
Tal Peer, Simon Welker, Timo Gerkmann
Diffusion probabilistic models have been recently used in a variety of tasks,
including speech enhancement and synthesis. As a generative approach, diffusion
models have been shown to be especially suitable for imputation problems, where
missing data is generated based on existing data. Phase retrieval is inherently
an imputation problem, where phase information has to be generated based on the
given magnitude. In this work we build upon previous work in the speech domain,
adapting a speech enhancement diffusion model specifically for STFT phase
retrieval. Evaluation using speech quality and intelligibility metrics shows
the diffusion approach is well-suited to the phase retrieval task, with
performance surpassing both classical and modern methods.
Authors' comments: Accepted by ICASSP 2023
Naseem Shaik
Robots can complete all human-performed tasks, but because they currently lack the necessary knowledge, they still cannot complete some tasks with a high degree of success. With the right knowledge, robots can complete these tasks with a high degree of success, reducing the human effort required for daily tasks. This paper discusses the FOON, which describes the robot action success rate. The functional object-oriented network (FOON) is a knowledge representation for symbolic task planning that takes the shape of a graph. To demonstrate the adaptability of FOON in developing a novel and adaptive method of solving a problem using knowledge obtained from various sources, a graph retrieval methodology is shown to produce manipulation motion sequences from the FOON to accomplish a desired aim. The outcomes are illustrated using motion sequences created by the FOON to complete the desired objectives in a simulated environment.
Sandeep Bondalapati
Robotics is used to foster creativity. Humans can perform jobs in their
unique manner, depending on the circumstances. This situation applies to food
cooking. Robotic technology in the kitchen can speed up the process and reduce
its workload. However, the potential of robotics in the kitchen is still
unrealized. In this essay, the idea of FOON, a structural knowledge
representation built on insights from human manipulations, is introduced. To
reduce the failure rate and ensure that the task is effectively completed,
three different algorithms have been implemented where weighted values have
been assigned to the manipulations depending on the success rates of motion.
This knowledge representation was created using videos of open-sourced recipes.
Authors' comments: 5 pages, 6 figures
Yujie Qian, Jinhyuk Lee, Sai Meher Karthik Duddu, Zhuyun Dai, Siddhartha Brahma, Iftekhar Naim, Tao Lei, Vincent Y. Zhao
Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose AligneR, a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens (e.g. `dog' vs. `puppy') and per-token unary saliences reflecting their relative importance for retrieval. We show that controlling the sparsity of pairwise token alignments often brings significant performance gains. While most factoid questions focusing on a specific part of a document require a smaller number of alignments, others requiring a broader understanding of a document favor a larger number of alignments. Unary saliences, on the other hand, decide whether a token ever needs to be aligned with others for retrieval (e.g. `kind' from `kind of currency is used in new zealand'). With sparsified unary saliences, we are able to prune a large number of query and document token vectors and improve the efficiency of multi-vector retrieval. We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other methods to achieve sparsity. In a zero-shot setting, AligneR scores 51.1 points nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the BEIR benchmark. In addition, adapting pairwise alignments with a few examples (<= 8) further improves the performance up to 15.7 points nDCG@10 for argument retrieval tasks. The unary saliences of AligneR help us keep only 20% of the document token representations with minimal performance loss. We further show that our model often produces interpretable alignments and significantly improves its performance when initialized from larger language models.
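To make the idea of sparse pairwise alignment concrete, the following is a minimal sketch of one possible scoring rule: each query token is aligned with its k most similar document tokens, and the aligned similarities are averaged over query tokens. This is an illustrative assumption, not AligneR's learned alignment or its unary saliences; the function names and the top-k rule are hypothetical. With k = 1 the rule reduces to an averaged max-similarity score.

```python
from typing import Sequence

Vector = Sequence[float]

def dot(u: Vector, v: Vector) -> float:
    """Inner product used as the token-to-token similarity."""
    return sum(a * b for a, b in zip(u, v))

def topk_alignment_score(query_vecs: Sequence[Vector],
                         doc_vecs: Sequence[Vector],
                         k: int = 1) -> float:
    """Average, over query tokens, of the sum of their k largest similarities."""
    total = 0.0
    for q in query_vecs:
        sims = sorted((dot(q, d) for d in doc_vecs), reverse=True)
        total += sum(sims[:k])  # keep only the k aligned document tokens
    return total / len(query_vecs)

# Toy token vectors (hypothetical values).
query = [[1, 0], [0, 1]]
document = [[1, 0], [0.5, 0.5], [0, 2]]
narrow = topk_alignment_score(query, document, k=1)
broad = topk_alignment_score(query, document, k=2)
```

A small k mimics the few alignments the abstract associates with factoid questions, while a larger k lets each query token draw on more of the document.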
Shirin Shoushtari, Jiaming Liu, Ulugbek S. Kamilov
Phase retrieval refers to the problem of recovering an image from the magnitudes of its complex-valued linear measurements. Since the problem is ill-posed, the recovery requires prior knowledge on the unknown image. We present DOLPH as a new deep model-based architecture for phase retrieval that integrates an image prior specified using a diffusion model with a nonconvex data-fidelity term for phase retrieval. Diffusion models are a recent class of deep generative models that are relatively easy to train due to their implementation as image denoisers. DOLPH reconstructs high-quality solutions by alternating data-consistency updates with the sampling step of a diffusion model. Our numerical results show the robustness of DOLPH to noise and its ability to generate several candidate solutions given a set of measurements.
Frank Portman, Stephen Ragain, Ahmed El-Kishky
Providing personalized recommendations in an environment where items exhibit
ephemerality and temporal relevancy (e.g. in social media) presents a few
unique challenges: (1) inductively understanding ephemeral appeal for items in
a setting where new items are created frequently, (2) adapting to trends within
engagement patterns where items may undergo temporal shifts in relevance, (3)
accurately modeling user preferences over this item space where users may
express multiple interests. In this work we introduce MiCRO, a generative
statistical framework that models multi-interest user preferences and temporal
multi-interest item representations. Our framework is specifically formulated
to adapt to both new items and temporal patterns of engagement. MiCRO
demonstrates strong empirical performance on candidate retrieval experiments
performed on two large scale user-item datasets: (1) an open-source temporal
dataset of (User, User) follow interactions and (2) a temporal dataset of
(User, Tweet) favorite interactions which we will open-source as an additional
contribution to the community.
Authors' comments: Preprint
Chull Hwan Song, Jooyoung Yoon, Shunghyun Choi, Yannis Avrithis
Vision transformers have achieved remarkable progress in vision tasks such as
image classification and detection. However, in instance-level image retrieval,
transformers have not yet shown good performance compared to convolutional
networks. We propose a number of improvements that make transformers outperform
the state of the art for the first time. (1) We show that a hybrid architecture
is more effective than plain transformers, by a large margin. (2) We introduce
two branches collecting global (classification token) and local (patch tokens)
information, from which we form a global image representation. (3) In each
branch, we collect multi-layer features from the transformer encoder,
corresponding to skip connections across distant layers. (4) We enhance
locality of interactions at the deeper layers of the encoder, which is the
relative weakness of vision transformers. We train our model on all commonly
used training sets and, for the first time, we make fair comparisons separately
per training set. In all cases, we outperform previous models based on global
representation. Public code is available at
https://github.com/dealicious-inc/DToP.
Authors' comments: WACV 2023
Shawn Diaz
In this experiment, three different search algorithms are implemented for the
purpose of extracting a task tree from a large knowledge graph, known as the
Functional Object-Oriented Network (FOON). Using a universal FOON, which
contains knowledge extracted by annotating online cooking videos, and a desired
goal, a task tree can be retrieved. The process of searching the universal FOON
for task tree retrieval is tested using iterative deepening search and greedy
best-first search with two different heuristic functions. The performance of
these three algorithms is analyzed and compared. The results of the experiment
show that iterative deepening performs strongly overall. However, different
heuristics in an informed search proved to be beneficial for certain
situations.
Authors' comments: 3 pages, 1 figure
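For readers unfamiliar with iterative deepening search, here is a minimal sketch on a small hypothetical graph. It is a generic path search, not the task-tree retrieval over a universal FOON; the graph contents, node names, and function names are invented for illustration.

```python
from typing import Dict, List, Optional

Graph = Dict[str, List[str]]

def depth_limited(graph: Graph, node: str, goal: str,
                  limit: int, path: List[str]) -> Optional[List[str]]:
    """Depth-first search that gives up once the depth limit is used up."""
    if node == goal:
        return path
    if limit == 0:
        return None
    for nxt in graph.get(node, []):
        if nxt in path:  # avoid walking in cycles
            continue
        found = depth_limited(graph, nxt, goal, limit - 1, path + [nxt])
        if found is not None:
            return found
    return None

def iterative_deepening(graph: Graph, start: str, goal: str,
                        max_depth: int = 10) -> Optional[List[str]]:
    """Repeat depth-limited search with limits 0, 1, 2, ... up to max_depth."""
    for limit in range(max_depth + 1):
        found = depth_limited(graph, start, goal, limit, [start])
        if found is not None:
            return found
    return None

# Hypothetical toy graph: each object points to the objects it is made from.
toy = {
    "salad": ["chopped lettuce", "dressing"],
    "chopped lettuce": ["lettuce"],
    "dressing": ["oil", "vinegar"],
}
path = iterative_deepening(toy, "salad", "vinegar")
```

Because the limit grows one level at a time, the first path found is a shallowest one, at the cost of re-expanding the upper levels on every pass.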
Jean-Jacques Godeme, Jalal Fadili, Xavier Buet, Myriam Zerrad, Michel Lequime, Claude Amra
In this paper, we consider the problem of phase retrieval, which consists of recovering an $n$-dimensional real vector from the magnitude of its $m$ linear measurements. We propose a mirror descent (or Bregman gradient descent) algorithm based on a wisely chosen Bregman divergence, which allows us to remove the classical global Lipschitz continuity requirement on the gradient of the non-convex phase retrieval objective to be minimized. We apply the mirror descent to two types of random measurements: i.i.d. standard Gaussian measurements and those obtained by multiple structured illuminations through Coded Diffraction Patterns (CDP). For the Gaussian case, we show that when the number of measurements $m$ is large enough, then with high probability, for almost all initializers, the algorithm recovers the original vector up to a global sign change. For both measurements, the mirror descent exhibits a local linear convergence behaviour with a dimension-independent convergence rate. Our theoretical results are finally illustrated with various numerical experiments, including an application to the reconstruction of images in precision optics.
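The following is a minimal sketch of a single mirror descent step for real-valued phase retrieval. The abstract does not name the Bregman divergence, so the kernel $\psi(x) = \|x\|^4/4 + \|x\|^2/2$ (a common choice for quartic objectives), the objective $f(x) = \frac{1}{4m}\sum_i ((a_i^\top x)^2 - y_i)^2$, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence

def loss_gradient(x: Sequence[float], A: Sequence[Sequence[float]],
                  y: Sequence[float]) -> List[float]:
    """Gradient of f(x) = 1/(4m) * sum_i ((a_i . x)^2 - y_i)^2."""
    m = len(A)
    grad = [0.0] * len(x)
    for a, y_i in zip(A, y):
        ax = sum(a_j * x_j for a_j, x_j in zip(a, x))
        residual = ax * ax - y_i
        for j, a_j in enumerate(a):
            grad[j] += residual * ax * a_j / m
    return grad

def solve_norm(c: float) -> float:
    """Unique t >= 0 with t**3 + t = c, found by bisection."""
    lo, hi = 0.0, max(1.0, c)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid ** 3 + mid < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mirror_step(x: Sequence[float], grad: Sequence[float],
                step: float) -> List[float]:
    """One Bregman step for the assumed kernel psi(x) = ||x||^4/4 + ||x||^2/2."""
    sq_norm = sum(v * v for v in x)
    # Dual point p = grad psi(x) - step * grad f(x), with grad psi(x) = (||x||^2 + 1) x.
    p = [(sq_norm + 1.0) * v - step * g for v, g in zip(x, grad)]
    t = solve_norm(math.sqrt(sum(v * v for v in p)))
    # Invert grad psi: the new iterate is parallel to p and has norm t.
    return [v / (t * t + 1.0) for v in p]
```

The step never needs a global Lipschitz constant for the gradient; the only extra work compared with plain gradient descent is the scalar cubic solved in `solve_norm`.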
Zhaoqiang Liu, Xinshao Wang, Jiulong Liu
In this paper, we study phase retrieval under model misspecification and
generative priors. In particular, we aim to estimate an $n$-dimensional signal
$\mathbf{x}$ from $m$ i.i.d. realizations of the single index model $y =
f(\mathbf{a}^T\mathbf{x})$, where $f$ is an unknown and possibly random
nonlinear link function and $\mathbf{a} \in \mathbb{R}^n$ is a standard
Gaussian vector. We make the assumption
$\mathrm{Cov}[y,(\mathbf{a}^T\mathbf{x})^2] \ne 0$, which corresponds to the
misspecified phase retrieval problem. In addition, the underlying signal
$\mathbf{x}$ is assumed to lie in the range of an $L$-Lipschitz continuous
generative model with bounded $k$-dimensional inputs. We propose a two-step
approach, for which the first step plays the role of spectral initialization
and the second step refines the estimated vector produced by the first step
iteratively. We show that both steps enjoy a statistical rate of order
$\sqrt{(k\log L)\cdot (\log m)/m}$ under suitable conditions. Experiments on
image datasets are performed to demonstrate that our approach performs on par
with or even significantly outperforms several competing methods.
Authors' comments: NeurIPS 2022
Prajit Nadkarni, Narendra Varma Dasararaju
Visual search is of great assistance in reseller commerce, especially for
non-tech savvy users with affinity towards regional languages. It allows
resellers to accurately locate the products that they seek, unlike textual
search which recommends products from head brands. Product attributes available
in e-commerce have a great potential for building better visual search systems
as they capture fine grained relations between data points. In this work, we
design a visual search system for reseller commerce using a multi-task learning
approach. We also highlight and address challenges faced in reseller commerce,
such as image compression, cropping, and scribbling on the image. Our model
consists of three different tasks: attribute classification, triplet ranking,
and a variational autoencoder (VAE). A masking technique is used for designing the
attribute classification. Next, we introduce an offline triplet mining
technique which utilizes information from multiple attributes to capture
relative order within the data. This technique displays a better performance
compared to the traditional triplet mining baseline, which uses single
label/attribute information. We also compare and report incremental gain
achieved by our unified multi-task model over each individual task separately.
The effectiveness of our method is demonstrated using the in-house dataset of
product images from the Lifestyle business-unit of Flipkart, India's largest
e-commerce company. To efficiently retrieve the images in production, we use
the Approximate Nearest Neighbor (ANN) index. Finally, we highlight our
production environment constraints and present the design choices and
experiments conducted to select a suitable ANN index.
Authors' comments: 10 pages, 5 figures
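As background for the triplet ranking task, here is a minimal sketch of the standard margin-based triplet loss on embedding vectors. The abstract does not give the exact loss, distance, or margin used, so the Euclidean distance, the margin value, and the function name are assumptions for illustration.

```python
import math
from typing import Sequence

def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Hinge loss that is zero once the negative is farther than the positive by the margin."""
    d_pos = math.dist(anchor, positive)  # distance to the similar item
    d_neg = math.dist(anchor, negative)  # distance to the dissimilar item
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical 2-D embeddings.
satisfied = triplet_loss([0, 0], [3, 4], [6, 8])
violated = triplet_loss([0, 0], [6, 8], [3, 4])
```

Triplet mining decides which (anchor, positive, negative) combinations are fed to such a loss; triplets that already satisfy the margin contribute nothing to training.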
D. Freeman, T. Oikhberg, B. Pineau, M. A. Taylor
Let $(\Omega,\Sigma,\mu)$ be a measure space, and $1\leq p\leq \infty$. A
subspace $E\subseteq L_p(\mu)$ is said to do stable phase retrieval (SPR) if
there exists a constant $C\geq 1$ such that for any $f,g\in E$ we have $$
\inf_{|\lambda|=1} \|f-\lambda g\|\leq C\||f|-|g|\|. $$
In this case, if $|f|$ is known, then $f$ is uniquely determined up to an
unavoidable global phase factor $\lambda$; moreover, the phase recovery map is
$C$-Lipschitz. Phase retrieval appears in several applied circumstances,
ranging from crystallography to quantum mechanics.
In this article, we construct various subspaces doing stable phase retrieval,
and make connections with $\Lambda(p)$-set theory. Moreover, we set the
foundations for an analysis of stable phase retrieval in general function
spaces. This, in particular, allows us to show that Hölder stable phase
retrieval implies stable phase retrieval, improving the stability bounds in a
recent article of M. Christ and the third and fourth authors. We also
characterize those compact Hausdorff spaces $K$ such that $C(K)$ contains an
infinite dimensional SPR subspace.
Authors' comments: 58 pages
Wedad Alharbi, Salah Alshabhi, Daniel Freeman, Dorsa Ghoreishi
A frame $(x_j)_{j\in J}$ for a Hilbert space $H$ is said to do phase
retrieval if for all distinct vectors $x,y\in H$ the magnitude of the frame
coefficients $(|\langle x, x_j\rangle|)_{j\in J}$ and $(|\langle y,
x_j\rangle|)_{j\in J}$ distinguish $x$ from $y$ (up to a unimodular scalar). We
consider the weaker condition where the magnitude of the frame coefficients
distinguishes $x$ from every vector $y$ in a small neighborhood of $x$ (up to a
unimodular scalar). We prove that some of the important theorems for phase
retrieval hold for this local condition, whereas some theorems are completely
different. We prove as well that when considering stability of phase retrieval,
the worst stability inequality is always witnessed at orthogonal vectors. This
allows for much simpler calculations when considering optimization problems for
phase retrieval.
Authors' comments: Added some additional comments and references
Soumya Basu, Ankit Singh Rawat, Manzil Zaheer
Many modern high-performing machine learning models such as GPT-3 primarily rely on scaling up models, e.g., transformer networks. Simultaneously, a parallel line of work aims to improve the model performance by augmenting an input instance with other (labeled) instances during inference. Examples of such augmentations include task-specific prompts and similar examples retrieved from the training data by a nonparametric component. Remarkably, retrieval-based methods have enjoyed success on a wide range of problems, ranging from standard natural language processing and vision tasks to protein folding, as demonstrated by many recent efforts, including WebGPT and AlphaFold. Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. In this paper, we present a formal treatment of retrieval-based models to characterize their generalization ability. In particular, we focus on two classes of retrieval-based classification approaches: First, we analyze a local learning framework that employs an explicit local empirical risk minimization based on retrieved examples for each input instance. Interestingly, we show that breaking down the underlying learning task into local sub-tasks enables the model to employ a low complexity parametric component to ensure good overall accuracy. The second class of retrieval-based approaches we explore learns a global model using kernel methods to directly map an input instance and retrieved examples to a prediction, without explicitly solving a local learning task.
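The local learning framework described above can be sketched in its simplest form as follows. The sketch retrieves the k nearest training examples and then minimizes the empirical 0-1 risk over constant predictors on that retrieved set, which amounts to a majority vote. The paper's framework allows richer local function classes; the Euclidean retrieval rule, the constant-predictor class, and the function names are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]

def retrieve(x: Sequence[float], data: Sequence[Example], k: int) -> List[Example]:
    """Return the k training examples closest to x in Euclidean distance."""
    return sorted(data, key=lambda ex: math.dist(x, ex[0]))[:k]

def local_erm_predict(x: Sequence[float], data: Sequence[Example], k: int = 3) -> str:
    """Local ERM over constant predictors: the majority label minimizes local 0-1 risk."""
    neighbours = retrieve(x, data, k)
    counts = Counter(label for _, label in neighbours)
    return counts.most_common(1)[0][0]

# Hypothetical labelled training set.
train = [
    ([0, 0], "a"), ([0, 1], "a"), ([1, 0], "a"),
    ([5, 5], "b"), ([5, 6], "b"), ([6, 5], "b"),
]
near_origin = local_erm_predict([0.2, 0.1], train)
near_corner = local_erm_predict([5.5, 5.5], train)
```

Even this low-complexity local model separates the two clusters, because each local sub-task only has to fit the examples retrieved for one input.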