Sandeep Bondalapati
Robotics is used to foster creativity. Humans can perform jobs in their own
unique manner, depending on the circumstances. This also applies to cooking.
Robotic technology in the kitchen can speed up the process and reduce the
workload. However, the potential of robotics in the kitchen is still
unrealized. In this essay, the idea of FOON, a structural knowledge
representation built on insights from human manipulations, is introduced. To
reduce the failure rate and ensure that the task is effectively completed,
three different algorithms have been implemented where weighted values have
been assigned to the manipulations depending on the success rates of motion.
This knowledge representation was created using videos of open-sourced recipes.
Authors' comments: 5 pages, 6 figures
Yujie Qian, Jinhyuk Lee, Sai Meher Karthik Duddu, Zhuyun Dai, Siddhartha Brahma, Iftekhar Naim, Tao Lei, Vincent Y. Zhao
Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose AligneR, a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens (e.g. `dog' vs. `puppy') and per-token unary saliences reflecting their relative importance for retrieval. We show that controlling the sparsity of pairwise token alignments often brings significant performance gains. While most factoid questions focusing on a specific part of a document require a smaller number of alignments, others requiring a broader understanding of a document favor a larger number of alignments. Unary saliences, on the other hand, decide whether a token ever needs to be aligned with others for retrieval (e.g. `kind' from `kind of currency is used in new zealand'). With sparsified unary saliences, we are able to prune a large number of query and document token vectors and improve the efficiency of multi-vector retrieval. We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other methods to achieve sparsity. In a zero-shot setting, AligneR scores 51.1 points nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the BEIR benchmark. In addition, adapting pairwise alignments with a few examples (<= 8) further improves the performance up to 15.7 points nDCG@10 for argument retrieval tasks. The unary saliences of AligneR help us to keep only 20% of the document token representations with minimal performance loss. We further show that our model often produces interpretable alignments and significantly improves its performance when initialized from larger language models.
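As a rough illustration of the scoring idea in this abstract, the following minimal sketch combines sparse top-k pairwise alignments with per-token unary saliences. It is not the AligneR implementation: the function name `sparse_alignment_score`, the top-k sparsification rule, and the hand-set salience weights are assumptions made for illustration, whereas the paper learns both the alignments and the saliences.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two token vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_alignment_score(
    query_vecs: List[Vector],
    doc_vecs: List[Vector],
    query_salience: List[float],
    doc_salience: List[float],
    k: int = 1,
) -> float:
    """Score a query-document pair with sparse alignments (illustrative only).

    Each query token is aligned with at most k document tokens (its k most
    similar ones). A salience of 0.0 prunes a token entirely, so it never
    takes part in any alignment.
    """
    total = 0.0
    for q_vec, q_sal in zip(query_vecs, query_salience):
        if q_sal == 0.0:
            continue  # pruned query token
        weighted = [
            d_sal * dot(q_vec, d_vec)
            for d_vec, d_sal in zip(doc_vecs, doc_salience)
            if d_sal > 0.0  # pruned document tokens are skipped
        ]
        top_k = sorted(weighted, reverse=True)[:k]
        total += q_sal * sum(top_k)
    return total
```

With k = 1 each query token contributes only its best match, which mirrors the small number of alignments the abstract associates with factoid questions; a larger k admits more alignments per token.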
Shirin Shoushtari, Jiaming Liu, Ulugbek S. Kamilov
Phase retrieval refers to the problem of recovering an image from the magnitudes of its complex-valued linear measurements. Since the problem is ill-posed, the recovery requires prior knowledge on the unknown image. We present DOLPH as a new deep model-based architecture for phase retrieval that integrates an image prior specified using a diffusion model with a nonconvex data-fidelity term for phase retrieval. Diffusion models are a recent class of deep generative models that are relatively easy to train due to their implementation as image denoisers. DOLPH reconstructs high-quality solutions by alternating data-consistency updates with the sampling step of a diffusion model. Our numerical results show the robustness of DOLPH to noise and its ability to generate several candidate solutions given a set of measurements.
Frank Portman, Stephen Ragain, Ahmed El-Kishky
Providing personalized recommendations in an environment where items exhibit
ephemerality and temporal relevancy (e.g. in social media) presents a few
unique challenges: (1) inductively understanding ephemeral appeal for items in
a setting where new items are created frequently, (2) adapting to trends within
engagement patterns where items may undergo temporal shifts in relevance, (3)
accurately modeling user preferences over this item space where users may
express multiple interests. In this work we introduce MiCRO, a generative
statistical framework that models multi-interest user preferences and temporal
multi-interest item representations. Our framework is specifically formulated
to adapt to both new items and temporal patterns of engagement. MiCRO
demonstrates strong empirical performance on candidate retrieval experiments
performed on two large scale user-item datasets: (1) an open-source temporal
dataset of (User, User) follow interactions and (2) a temporal dataset of
(User, Tweet) favorite interactions which we will open-source as an additional
contribution to the community.
Authors' comments: Preprint
Chull Hwan Song, Jooyoung Yoon, Shunghyun Choi, Yannis Avrithis
Vision transformers have achieved remarkable progress in vision tasks such as
image classification and detection. However, in instance-level image retrieval,
transformers have not yet shown good performance compared to convolutional
networks. We propose a number of improvements that make transformers outperform
the state of the art for the first time. (1) We show that a hybrid architecture
is more effective than plain transformers, by a large margin. (2) We introduce
two branches collecting global (classification token) and local (patch tokens)
information, from which we form a global image representation. (3) In each
branch, we collect multi-layer features from the transformer encoder,
corresponding to skip connections across distant layers. (4) We enhance
locality of interactions at the deeper layers of the encoder, which is the
relative weakness of vision transformers. We train our model on all commonly
used training sets and, for the first time, we make fair comparisons separately
per training set. In all cases, we outperform previous models based on global
representation. Public code is available at
https://github.com/dealicious-inc/DToP.
Authors' comments: WACV 2023
Shawn Diaz
In this experiment, three different search algorithms are implemented for the
purpose of extracting a task tree from a large knowledge graph, known as the
Functional Object-Oriented Network (FOON). Using a universal FOON, which
contains knowledge extracted by annotating online cooking videos, and a desired
goal, a task tree can be retrieved. The process of searching the universal FOON
for task tree retrieval is tested using iterative deepening search and greedy
best-first search with two different heuristic functions. The performance of
these three algorithms is analyzed and compared. The results of the experiment
show that iterative deepening performs strongly overall. However, different
heuristics in an informed search proved to be beneficial for certain
situations.
Authors' comments: 3 pages, 1 figure
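The two search strategies named in this abstract can be sketched on a toy directed graph. This is only a minimal illustration under stated assumptions: the adjacency-dictionary graph, the node names, and the heuristic passed to `greedy_best_first_search` are hypothetical and do not reproduce FOON's functional units or the heuristics used in the experiment.

```python
import heapq
from typing import Callable, Dict, List, Optional

Graph = Dict[str, List[str]]


def depth_limited_search(
    graph: Graph, node: str, goal: str, limit: int, path: List[str]
) -> Optional[List[str]]:
    """Depth-first search that gives up below a fixed depth limit."""
    if node == goal:
        return path
    if limit == 0:
        return None
    for child in graph.get(node, []):
        if child in path:  # do not revisit nodes on the current path
            continue
        found = depth_limited_search(graph, child, goal, limit - 1, path + [child])
        if found is not None:
            return found
    return None


def iterative_deepening_search(
    graph: Graph, start: str, goal: str, max_depth: int = 10
) -> Optional[List[str]]:
    """Repeat depth-limited search with limits 0, 1, 2, ... up to max_depth."""
    for limit in range(max_depth + 1):
        found = depth_limited_search(graph, start, goal, limit, [start])
        if found is not None:
            return found
    return None


def greedy_best_first_search(
    graph: Graph, start: str, goal: str, heuristic: Callable[[str], float]
) -> Optional[List[str]]:
    """Always expand the frontier node with the smallest heuristic value."""
    frontier = [(heuristic(start), start, [start])]
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for child in graph.get(node, []):
            if child not in visited:
                heapq.heappush(frontier, (heuristic(child), child, path + [child]))
    return None
```

Iterative deepening returns a path with the fewest edges, while greedy best-first follows whatever the heuristic prefers, so the two can return different paths on the same graph.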
Jean-Jacques Godeme, Jalal Fadili, Xavier Buet, Myriam Zerrad, Michel Lequime, Claude Amra
In this paper, we consider the problem of phase retrieval, which consists of recovering an $n$-dimensional real vector from the magnitude of its $m$ linear measurements. We propose a mirror descent (or Bregman gradient descent) algorithm based on a wisely chosen Bregman divergence, which allows us to remove the classical global Lipschitz continuity requirement on the gradient of the non-convex phase retrieval objective to be minimized. We apply the mirror descent to two types of random measurements: i.i.d. standard Gaussian measurements and those obtained by multiple structured illuminations through Coded Diffraction Patterns (CDP). For the Gaussian case, we show that when the number of measurements $m$ is large enough, then with high probability, for almost all initializers, the algorithm recovers the original vector up to a global sign change. For both measurements, the mirror descent exhibits a local linear convergence behaviour with a dimension-independent convergence rate. Our theoretical results are finally illustrated with various numerical experiments, including an application to the reconstruction of images in precision optics.
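For orientation, a mirror descent scheme of this kind can be written as follows. This is a sketch under assumptions, not a statement of the paper's exact setup: the squared-magnitude data $y_r = |\langle a_r, \bar{x}\rangle|^2$, the quartic objective $f$, the kernel $\psi$, and the step size $\gamma$ follow a formulation common in the Bregman-gradient literature, and the abstract itself does not specify them.
$$ f(x) = \frac{1}{4m}\sum_{r=1}^{m}\bigl(|\langle a_r, x\rangle|^2 - y_r\bigr)^2, \qquad \psi(x) = \frac{1}{4}\|x\|^4 + \frac{1}{2}\|x\|^2, $$
$$ \nabla\psi(x_{k+1}) = \nabla\psi(x_k) - \gamma\,\nabla f(x_k), \qquad \nabla\psi(x) = (\|x\|^2 + 1)\,x. $$
With this kernel, $f$ is smooth relative to $\psi$ even though $\nabla f$ is not globally Lipschitz, which is the kind of requirement the abstract says is removed.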
Zhaoqiang Liu, Xinshao Wang, Jiulong Liu
In this paper, we study phase retrieval under model misspecification and
generative priors. In particular, we aim to estimate an $n$-dimensional signal
$\mathbf{x}$ from $m$ i.i.d.~realizations of the single index model $y =
f(\mathbf{a}^T\mathbf{x})$, where $f$ is an unknown and possibly random
nonlinear link function and $\mathbf{a} \in \mathbb{R}^n$ is a standard
Gaussian vector. We make the assumption
$\mathrm{Cov}[y,(\mathbf{a}^T\mathbf{x})^2] \ne 0$, which corresponds to the
misspecified phase retrieval problem. In addition, the underlying signal
$\mathbf{x}$ is assumed to lie in the range of an $L$-Lipschitz continuous
generative model with bounded $k$-dimensional inputs. We propose a two-step
approach, for which the first step plays the role of spectral initialization
and the second step refines the estimated vector produced by the first step
iteratively. We show that both steps enjoy a statistical rate of order
$\sqrt{(k\log L)\cdot (\log m)/m}$ under suitable conditions. Experiments on
image datasets are performed to demonstrate that our approach performs on par
with or even significantly outperforms several competing methods.
Authors' comments: NeurIPS 2022
Prajit Nadkarni, Narendra Varma Dasararaju
Visual search is of great assistance in reseller commerce, especially for
non-tech savvy users with affinity towards regional languages. It allows
resellers to accurately locate the products that they seek, unlike textual
search which recommends products from head brands. Product attributes available
in e-commerce have a great potential for building better visual search systems
as they capture fine grained relations between data points. In this work, we
design a visual search system for reseller commerce using a multi-task learning
approach. We also highlight and address challenges faced in reseller commerce,
such as image compression, cropping, and scribbling on the image. Our model
consists of three different tasks: attribute classification, triplet ranking
and a variational autoencoder (VAE). A masking technique is used for designing the
attribute classification. Next, we introduce an offline triplet mining
technique which utilizes information from multiple attributes to capture
relative order within the data. This technique displays a better performance
compared to the traditional triplet mining baseline, which uses single
label/attribute information. We also compare and report the incremental gain
achieved by our unified multi-task model over each individual task separately.
The effectiveness of our method is demonstrated using the in-house dataset of
product images from the Lifestyle business-unit of Flipkart, India's largest
e-commerce company. To efficiently retrieve the images in production, we use
the Approximate Nearest Neighbor (ANN) index. Finally, we highlight our
production environment constraints and present the design choices and
experiments conducted to select a suitable ANN index.
Authors' comments: 10 pages, 5 figures
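One way to read the multi-attribute triplet mining described in this abstract is sketched below. It is a hypothetical illustration, not the authors' procedure: the attribute dictionaries, the `shared_attributes` count, and the rule "the positive shares strictly more attribute values with the anchor than the negative" are all assumptions.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Attributes = Dict[str, str]


def shared_attributes(a: Attributes, b: Attributes) -> int:
    """Number of attribute keys on which two products have the same value."""
    return sum(1 for key in a if key in b and a[key] == b[key])


def mine_triplets(items: Dict[str, Attributes]) -> List[Tuple[str, str, str]]:
    """Return (anchor, positive, negative) triplets ordered by attribute overlap.

    A pair (positive, negative) is kept when the positive shares strictly
    more attribute values with the anchor than the negative does, so the
    triplets encode a relative order rather than a single-label match.
    """
    triplets = []
    for anchor in items:
        others = [name for name in items if name != anchor]
        for pos, neg in permutations(others, 2):
            pos_overlap = shared_attributes(items[anchor], items[pos])
            neg_overlap = shared_attributes(items[anchor], items[neg])
            if pos_overlap > neg_overlap:
                triplets.append((anchor, pos, neg))
    return triplets
```

Because the overlap count uses several attributes at once, an item that matches on colour only can still serve as a positive against an item that matches on nothing, which a single-label baseline cannot express.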
D. Freeman, T. Oikhberg, B. Pineau, M. A. Taylor
Let $(\Omega,\Sigma,\mu)$ be a measure space, and $1\leq p\leq \infty$. A
subspace $E\subseteq L_p(\mu)$ is said to do stable phase retrieval (SPR) if
there exists a constant $C\geq 1$ such that for any $f,g\in E$ we have $$
\inf_{|\lambda|=1} \|f-\lambda g\|\leq C\||f|-|g|\|. $$
In this case, if $|f|$ is known, then $f$ is uniquely determined up to an
unavoidable global phase factor $\lambda$; moreover, the phase recovery map is
$C$-Lipschitz. Phase retrieval appears in several applied circumstances,
ranging from crystallography to quantum mechanics.
In this article, we construct various subspaces doing stable phase retrieval,
and make connections with $\Lambda(p)$-set theory. Moreover, we set the
foundations for an analysis of stable phase retrieval in general function
spaces. This, in particular, allows us to show that H\"older stable phase
retrieval implies stable phase retrieval, improving the stability bounds in a
recent article of M. Christ and the third and fourth authors. We also
characterize those compact Hausdorff spaces $K$ such that $C(K)$ contains an
infinite dimensional SPR subspace.
Authors' comments: 58 pages
Wedad Alharbi, Salah Alshabhi, Daniel Freeman, Dorsa Ghoreishi
A frame $(x_j)_{j\in J}$ for a Hilbert space $H$ is said to do phase
retrieval if for all distinct vectors $x,y\in H$ the magnitude of the frame
coefficients $(|\langle x, x_j\rangle|)_{j\in J}$ and $(|\langle y,
x_j\rangle|)_{j\in J}$ distinguish $x$ from $y$ (up to a unimodular scalar). We
consider the weaker condition where the magnitude of the frame coefficients
distinguishes $x$ from every vector $y$ in a small neighborhood of $x$ (up to a
unimodular scalar). We prove that some of the important theorems for phase
retrieval hold for this local condition, whereas some theorems are completely
different. We prove as well that when considering stability of phase retrieval,
the worst stability inequality is always witnessed at orthogonal vectors. This
allows for much simpler calculations when considering optimization problems for
phase retrieval.
Authors' comments: Added some additional comments and references
Soumya Basu, Ankit Singh Rawat, Manzil Zaheer
Many modern high-performing machine learning models such as GPT-3 primarily rely on scaling up models, e.g., transformer networks. Simultaneously, a parallel line of work aims to improve the model performance by augmenting an input instance with other (labeled) instances during inference. Examples of such augmentations include task-specific prompts and similar examples retrieved from the training data by a nonparametric component. Remarkably, retrieval-based methods have enjoyed success on a wide range of problems, ranging from standard natural language processing and vision tasks to protein folding, as demonstrated by many recent efforts, including WebGPT and AlphaFold. Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. In this paper, we present a formal treatment of retrieval-based models to characterize their generalization ability. In particular, we focus on two classes of retrieval-based classification approaches: First, we analyze a local learning framework that employs an explicit local empirical risk minimization based on retrieved examples for each input instance. Interestingly, we show that breaking down the underlying learning task into local sub-tasks enables the model to employ a low complexity parametric component to ensure good overall accuracy. The second class of retrieval-based approaches we explore learns a global model using kernel methods to directly map an input instance and retrieved examples to a prediction, without explicitly solving a local learning task.
Soham Deshmukh, Benjamin Elizalde, Huaming Wang
Audio-Text retrieval takes a natural language query to retrieve relevant audio files in a database. Conversely, Text-Audio retrieval takes an audio file as a query to retrieve relevant natural language descriptions. Most of the literature trains retrieval systems with one audio captioning dataset, but evaluating the benefit of training with multiple datasets is underexplored. Moreover, retrieval systems have to learn the alignment between elaborated sentences describing audio content of variable length ranging from a few seconds to several minutes. In this work, we propose a new collection of web audio-text pairs and a new framework for retrieval. First, we provide a new collection of about five thousand web audio-text pairs that we refer to as WavText5K. When used to train our retrieval system, WavText5K improved performance more than other audio captioning datasets. Second, our framework learns to connect language and audio content by using a text encoder, two audio encoders, and a contrastive learning objective. Combining both audio encoders helps to process variable length audio. The two contributions beat state-of-the-art performance for AudioCaps and Clotho on Text-Audio retrieval by a relative 2% and 16%, and Audio-Text retrieval by 6% and 23%.
Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong, Jian Jiao, Jingwen Lu et al.
Knowledge distillation is an effective way to transfer knowledge from a
strong teacher to an efficient student model. Ideally, we expect the better the
teacher is, the better the student. However, this expectation does not always
come true. It is common that a better teacher model results in a bad student
via distillation due to the nonnegligible gap between teacher and student. To
bridge the gap, we propose PROD, a PROgressive Distillation method, for dense
retrieval. PROD consists of a teacher progressive distillation and a data
progressive distillation to gradually improve the student. We conduct extensive
experiments on five widely-used benchmarks, MS MARCO Passage, TREC Passage 19,
TREC Document 19, MS MARCO Document and Natural Questions, where PROD achieves
state-of-the-art results among distillation methods for dense retrieval. The
code and models will be released.
Authors' comments: Accepted by WWW2023
Dawn Lawrie, Eugene Yang, Douglas W. Oard, James Mayfield
Providing access to information across languages has been a goal of
Information Retrieval (IR) for decades. While progress has been made on Cross
Language IR (CLIR) where queries are expressed in one language and documents in
another, the multilingual (MLIR) task to create a single ranked list of
documents across many languages is considerably more challenging. This paper
investigates whether advances in neural document translation and pretrained
multilingual neural language models enable improvements in the state of the art
over earlier MLIR techniques. The results show that although combining neural
document translation with neural ranking yields the best Mean Average Precision
(MAP), 98% of that MAP score can be achieved with an 84% reduction in indexing
time by using a pretrained XLM-R multilingual language model to index documents
in their native language, and that 2% difference in effectiveness is not
statistically significant. Key to achieving these results for MLIR is to
fine-tune XLM-R using mixed-language batches from neural translations of MS
MARCO passages.
Authors' comments: 17 pages, 3 figures, accepted at ECIR 2023
Euna Jung, Jungwon Park, Jaekeol Choi, Sungyoon Kim, Wonjong Rhee
The recent advancement in language representation modeling has broadly
affected the design of dense retrieval models. In particular, many of the
high-performing dense retrieval models evaluate representations of query and
document using BERT, and subsequently apply a cosine-similarity based scoring
to determine the relevance. BERT representations, however, are known to follow
an anisotropic distribution of a narrow cone shape and such an anisotropic
distribution can be undesirable for the cosine-similarity based scoring. In
this work, we first show that BERT-based DR also follows an anisotropic
distribution. To cope with the problem, we introduce unsupervised
post-processing methods of Normalizing Flow and whitening, and develop a
token-wise method in addition to the sequence-wise method for applying the
post-processing methods to the representations of dense retrieval models. We
show that the proposed methods can effectively enhance the representations to
be isotropic, then we perform experiments with ColBERT and RepBERT to show that
the performance (NDCG at 10) of document re-ranking can be improved by
5.17\%$\sim$8.09\% for ColBERT and 6.88\%$\sim$22.81\% for RepBERT. To examine
the potential of isotropic representation for improving the robustness of DR
models, we investigate out-of-distribution tasks where the test dataset differs
from the training dataset. The results show that isotropic representation can
achieve generally improved performance. For instance, when the training dataset
is MS-MARCO and the test dataset is Robust04, isotropy post-processing can improve
the baseline performance by up to 24.98\%. Furthermore, we show that an
isotropic model trained with an out-of-distribution dataset can even outperform
a baseline model trained with the in-distribution dataset.
Authors' comments: 9 pages, 4 figures
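The narrow-cone effect mentioned in this abstract can be made concrete with a small measurement. The sketch below computes the mean pairwise cosine similarity of a set of vectors before and after mean-centering. Mean-centering is only the first step of whitening and is used here as a simplified stand-in; the paper's Normalizing Flow and full whitening procedures are not reproduced, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def mean_pairwise_cosine(vectors: List[Vector]) -> float:
    """Average cosine similarity over all unordered pairs of vectors.

    Values close to 1.0 indicate that the vectors sit in a narrow cone.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def center(vectors: List[Vector]) -> List[Vector]:
    """Subtract the per-dimension mean from every vector."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in vectors]
```

On vectors that share a large common component, the average cosine similarity is close to 1 before centering and drops sharply afterwards, which is why cosine-based scoring is easier to interpret on more isotropic representations.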
Xin Zhang, Yong Jiang, Xiaobin Wang, Xuming Hu, Yueheng Sun, Pengjun Xie, Meishan Zhang
Successful machine-learning-based Named Entity Recognition (NER) models can fail
on texts from some special domains, for instance, Chinese addresses and
e-commerce titles, which require adequate background knowledge. Such texts are
also difficult for human annotators. In fact, we can obtain potentially
helpful information from correlated texts, which share some common entities, to
aid text understanding. One can then easily reason out the correct answer
by referencing correlated samples. In this paper, we suggest enhancing NER
models with correlated samples. We draw correlated samples by the sparse BM25
retriever from large-scale in-domain unlabeled data. To explicitly simulate the
human reasoning process, we perform training-free entity type calibration by
majority voting. To capture correlation features in the training stage, we
suggest modeling correlated samples with a transformer-based multi-instance
cross-encoder. Empirical results on datasets of the above two domains show the
efficacy of our methods.
Authors' comments: Accepted by COLING 2022, added dev results of the address data
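The training-free calibration by majority voting mentioned in this abstract might look roughly like the sketch below. The data layout (one dictionary from mention text to predicted type per sample), the function name `calibrate_entity_types`, and the tie-breaking rule that keeps the original prediction are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter
from typing import Dict, List

Predictions = Dict[str, str]  # mention text -> predicted entity type


def calibrate_entity_types(
    predicted: Predictions, correlated_predictions: List[Predictions]
) -> Predictions:
    """Re-label each mention with the majority type across correlated samples.

    The input text's own prediction counts as one vote. On a tie that
    includes the original type, the original prediction is kept.
    """
    calibrated = {}
    for mention, own_type in predicted.items():
        votes = Counter([own_type])
        for sample in correlated_predictions:
            if mention in sample:
                votes[sample[mention]] += 1
        top_count = max(votes.values())
        winners = [etype for etype, count in votes.items() if count == top_count]
        calibrated[mention] = own_type if own_type in winners else winners[0]
    return calibrated
```

In this sketch the correlated samples would come from a retriever such as BM25 run over unlabeled in-domain text, as the abstract describes; the retrieval step itself is omitted.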
Nima Sadri
Although representational retrieval models based on Transformers have been able to make major advances in the past few years, and despite the widely accepted conventions and best practices for testing such models, a $\textit{standardized}$ evaluation framework for testing them has not been developed. In this work, we formalize the best practices and conventions followed by researchers in the literature, paving the way for more standardized evaluations and therefore fairer comparisons between the models. Our framework (1) embeds the documents and queries; (2) for each query-document pair, computes the relevance score based on the dot product of the document and query embedding; (3) uses the $\texttt{dev}$ set of the MSMARCO dataset to evaluate the models; (4) uses the $\texttt{trec_eval}$ script to calculate MRR@100, which is the primary metric used to evaluate the models. Most importantly, we showcase the use of this framework by experimenting on some of the most well-known dense retrieval models.
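Steps (2) and (4) of the framework above can be illustrated with a small self-contained sketch: dot-product scoring followed by mean reciprocal rank with a cutoff. This is not the framework's code and it does not call `trec_eval`; the function names and the in-memory dictionaries for embeddings and relevance judgments are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Set

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Relevance score: dot product of query and document embeddings."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(query_vec: Vector, doc_vecs: Dict[str, Vector]) -> List[str]:
    """Return document ids sorted by decreasing dot-product score."""
    return sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)


def mrr_at_k(
    rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 100
) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.

    Queries with no relevant document in the top k contribute 0.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0
```

A typical use is to build `rankings` by calling `rank_documents` once per query and then pass it to `mrr_at_k` together with the relevance judgments.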
Dahlia Shehata, Negar Arabzadeh, Charles L. A. Clarke
Despite the advantages of their low-resource settings, traditional sparse retrievers depend on exact matching approaches between high-dimensional bag-of-words (BoW) representations of both the queries and the collection. As a result, retrieval performance is restricted by semantic discrepancies and vocabulary gaps. On the other hand, transformer-based dense retrievers introduce significant improvements in information retrieval tasks by exploiting low-dimensional contextualized representations of the corpus. While dense retrievers are known for their relative effectiveness, they suffer from lower efficiency and weaker generalization when compared to sparse retrievers. For a lightweight retrieval task, high computational resources and time consumption are major barriers encouraging the renunciation of dense models despite potential gains. In this work, we propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities in two formats for the entity names: 1) explicit and 2) hashed. We employ a zero-shot end-to-end dense entity linking system for entity recognition and disambiguation to augment the corpus. By leveraging advanced entity linking methods, we believe that the effectiveness gap between sparse and dense retrievers can be narrowed. We conduct our experiments on the MS MARCO passage dataset. Since we are concerned with the early stage retrieval in cascaded ranking architectures of large information retrieval systems, we evaluate our results using recall@1000. Our approach is also capable of retrieving documents for query subsets judged to be particularly difficult in prior work. We further demonstrate that the non-expanded and the expanded runs with both explicit and hashed entities retrieve complementary results. Consequently, we adopt a run fusion approach to maximize the benefits of entity linking.
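The two expansion formats named in this abstract can be pictured with the following minimal sketch. It assumes the entity linker has already produced a list of entity names for a text. The function name `expand_with_entities` and the use of an MD5 hex digest as the "hashed" form are assumptions for illustration; the abstract does not say which hashing scheme is used.

```python
import hashlib
from typing import List


def expand_with_entities(text: str, entities: List[str], mode: str = "explicit") -> str:
    """Append linked entity names to a query or document (illustrative only).

    mode="explicit": append each entity name as plain words.
    mode="hashed":   append one opaque token per entity, so a sparse
                     retriever matches the entity as a whole rather than
                     its individual words.
    """
    if mode == "explicit":
        extra = list(entities)
    elif mode == "hashed":
        # MD5 is an assumed stand-in for whatever hashing the paper uses.
        extra = [hashlib.md5(name.encode("utf-8")).hexdigest() for name in entities]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return " ".join([text] + extra)
```

Because the same entity name always maps to the same hashed token, a query and a document that mention the same linked entity share an exact-match term even when their surface wording differs.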
Albert Fannjiang
This paper develops uniqueness theory for 3D phase retrieval with finite,
discrete measurement data for strong phase objects and weak phase objects,
including:
(i) {\em Unique determination of (phase) projections from diffraction
patterns} -- General measurement schemes with coded and uncoded apertures are
proposed and shown to ensure unique reduction of diffraction patterns to the
phase projection for a strong phase object (respectively, the projection for a
weak phase object) in each direction separately without the knowledge of
relative orientations and locations. (ii) {\em Uniqueness for 3D phase
unwrapping} -- General conditions for unique determination of a 3D strong phase
object from its phase projection data are established, including, but not
limited to, random tilt schemes densely sampled from a spherical triangle of
vertexes in three orthogonal directions and other deterministic tilt schemes.
(iii) {\em Uniqueness for projection tomography} -- Unique determination of an
object of $n^3$ voxels from generic $n$ projections or $n+1$ coded diffraction
patterns is proved.
This approach of reducing 3D phase retrieval to the problem of (phase)
projection tomography has the practical implication of enabling classification
and alignment, when relative orientations are unknown, to be carried out in
terms of (phase) projections, instead of diffraction patterns.
The applications with the measurement schemes such as single-axis tilt,
conical tilt, dual-axis tilt, random conical tilt and general random tilt are
discussed.
Authors' comments: Revision of the previously titled "3D UNWRAPPED PHASE RETRIEVAL WITH
CODED APERTURE IS REDUCIBLE TO PROJECTION TOMOGRAPHY"