Darío Garigliotti, Krisztian Balog
Today, the practice of returning entities from a knowledge base in response
to search queries has become widespread. One of the distinctive characteristics
of entities is that they are typed, i.e., assigned to some hierarchically
organized type system (type taxonomy). The primary objective of this paper is
to gain a better understanding of how entity type information can be utilized
in entity retrieval. We perform this investigation in an idealized "oracle"
setting, assuming that we know the distribution of target types of the relevant
entities for a given query. We perform a thorough analysis of three main
aspects: (i) the choice of type taxonomy, (ii) the representation of
hierarchical type information, and (iii) the combination of type-based and
term-based similarity in the retrieval model. Using a standard entity search
test collection based on DBpedia, we find that type information proves most
useful when using large type taxonomies that provide very specific types. We
provide further insights on the extensional coverage of entities and on the
utility of target types.
Authors' comments: Proceedings of the 3rd ACM International Conference on the Theory of
Information Retrieval (ICTIR '17), 2017
Subhadip Mukherjee, Chandra Sekhar Seelamantula
We consider the problem of signal reconstruction from quadratic measurements that are encoded as +1 or -1 depending on whether they exceed a predetermined positive threshold or not. Binary measurements are fast to acquire and inexpensive in terms of hardware. We formulate the problem of signal reconstruction using a consistency criterion, wherein one seeks to find a signal that is in agreement with the measurements. To enforce consistency, we construct a convex cost using a one-sided quadratic penalty and minimize it using an iterative accelerated projected gradient-descent (APGD) technique. The PGD scheme reduces the cost function in each iteration, whereas incorporating momentum into PGD, notwithstanding the lack of such a descent property, exhibits faster convergence than PGD empirically. We refer to the resulting algorithm as binary phase retrieval (BPR). Considering additive white noise contamination prior to quantization, we also derive the Cramer-Rao Bound (CRB) for the binary encoding model. Experimental results demonstrate that the BPR algorithm yields a signal-to- reconstruction error ratio (SRER) of approximately 25 dB in the absence of noise. In the presence of noise prior to quantization, the SRER is within 2 to 3 dB of the CRB.
Tom Kenter, Alexey Borisov, Christophe Van Gysel, Mostafa Dehghani, Maarten de Rijke, Bhaskar Mitra
Machine learning plays a role in many aspects of modern IR systems, and deep
learning is applied in all of them. The fast pace of modern-day research has
given rise to many different approaches for many different IR problems. The
amount of information available can be overwhelming both for junior students
and for experienced researchers looking for new research topics and directions.
Additionally, it is interesting to see what key insights into IR problems the
new technologies are able to give us. The aim of this full-day tutorial is to
give a clear overview of current tried-and-trusted neural methods in IR and how
they benefit IR research. It covers key architectures, as well as the most
promising future directions.
Authors' comments: Overview of full-day tutorial at SIGIR 2017
Chenglong Bao, George Barbastathis, Hui Ji, Zuowei Shen, Zhengyun Zhang
The mutual intensity and its equivalent phase-space representations quantify
an optical field's state of coherence and are important tools in the study of
light propagation and dynamics, but they can only be estimated indirectly from
measurements through a process called coherence retrieval, otherwise known as
phase-space tomography. As practical considerations often rule out the
availability of a complete set of measurements, coherence retrieval is usually
a challenging high-dimensional ill-posed inverse problem. In this paper, we
propose a trace-regularized optimization model for coherence retrieval and a
provably-convergent adaptive accelerated proximal gradient algorithm for
solving the resulting problem. Applying our model and algorithm to both
simulated and experimental data, we demonstrate an improvement in
reconstruction quality over previous models as well as an increase in
convergence speed compared to existing first-order methods.
Authors' comments: 28 pages, 10 figures, accepted for publication in SIAM Journal on
Imaging Sciences
Veit Elser, Ti-Yen Lan, Tamir Bendory
In recent years, the mathematical and algorithmic aspects of the phase
retrieval problem have received considerable attention. Many papers in this
area mention crystallography as a principal application. In crystallography,
the signal to be recovered is periodic and comprised of atomic distributions
arranged homogeneously in the unit cell of the crystal. The crystallographic
problem is both the leading application and one of the hardest forms of phase
retrieval. We have constructed a graded set of benchmark problems for
evaluating algorithms that perform this type of phase retrieval. The data,
publicly available online, is provided in an easily interpretable format. We
also propose a simple and unambiguous success/failure criterion based on the
actual needs in crystallography. Baseline runtimes were obtained with an
iterative algorithm that is similar but more transparent than those used in
crystallography. Empirically, the runtimes grow exponentially with respect to a
new hardness parameter: the sparsity of the signal autocorrelation. We also
review the algorithms used by the leading software packages. This set of
benchmark problems, we hope, will encourage the development of new algorithms
for the phase retrieval problem in general, and crystallography in particular.
Authors' comments: 27 pages, 10 figures, new references, modified appendix, improved
presentation
Federico Nanni, Bhaskar Mitra, Matt Magnusson, Laura Dietz
Retrieving paragraphs to populate a Wikipedia article is a challenging task. The new TREC Complex Answer Retrieval (TREC CAR) track introduces a comprehensive dataset that targets this retrieval scenario. We present early results from a variety of approaches -- from standard information retrieval methods (e.g., tf-idf) to complex systems that using query expansion using knowledge bases and deep neural networks. The goal is to offer future participants of this track an overview of some promising approaches to tackle this problem.
Bhaskar Mitra, Nick Craswell
Neural ranking models for information retrieval (IR) use shallow or deep neural networks to rank search results in response to a query. Traditional learning to rank models employ machine learning techniques over hand-crafted IR features. By contrast, neural models learn representations of language from raw text that can bridge the gap between query and document vocabulary. Unlike classical IR models, these new machine learning based approaches are data-hungry, requiring large scale training data before they can be deployed. This tutorial introduces basic concepts and intuitions behind neural IR models, and places them in the context of traditional retrieval models. We begin by introducing fundamental concepts of IR and different neural and non-neural approaches to learning vector representations of text. We then review shallow neural IR methods that employ pre-trained neural term embeddings without learning the IR task end-to-end. We introduce deep neural networks next, discussing popular deep architectures. Finally, we review the current DNN models for information retrieval. We conclude with a discussion on potential future directions for neural IR.
Christina Lioma, Birger Larsen, Wei Lu
Typically, every part in most coherent text has some plausible reason for its presence, some function that it performs to the overall semantics of the text. Rhetorical relations, e.g. contrast, cause, explanation, describe how the parts of a text are linked to each other. Knowledge about this socalled discourse structure has been applied successfully to several natural language processing tasks. This work studies the use of rhetorical relations for Information Retrieval (IR): Is there a correlation between certain rhetorical relations and retrieval performance? Can knowledge about a document's rhetorical relations be useful to IR? We present a language model modification that considers rhetorical relations when estimating the relevance of a document to a query. Empirical evaluation of different versions of our model on TREC settings shows that certain rhetorical relations can benefit retrieval effectiveness notably (> 10% in mean average precision over a state-of-the-art baseline).
Walid Shalaby, Wlodek Zadrozny
With the ever increasing number of filed patent applications every year, the need for effective and efficient systems for managing such tremendous amounts of data becomes inevitably important. Patent Retrieval (PR) is considered the pillar of almost all patent analysis tasks. PR is a subfield of Information Retrieval (IR) which is concerned with developing techniques and methods that effectively and efficiently retrieve relevant patent documents in response to a given search request. In this paper we present a comprehensive review on PR methods and approaches. It is clear that, recent successes and maturity in IR applications such as Web search cannot be transferred directly to PR without deliberate domain adaptation and customization. Furthermore, state-of-the-art performance in automatic PR is still around average in terms of recall. These observations motivate the need for interactive search tools which provide cognitive assistance to patent professionals with minimal effort. These tools must also be developed in hand with patent professionals considering their practices and expectations. We additionally touch on related tasks to PR such as patent valuation, litigation, licensing, and highlight potential opportunities and open directions for computational scientists in these domains.
Joani Mitro
This paper functions as a tutorial for individuals interested to enter the
field of information retrieval but wouldn't know where to begin from. It
describes two fundamental yet efficient image retrieval techniques, the first
being k - nearest neighbors (knn) and the second support vector machines(svm).
The goal is to provide the reader with both the theoretical and practical
aspects in order to acquire a better understanding. Along with this tutorial we
have also developed the equivalent software1 using the MATLAB environment in
order to illustrate the techniques, so that the reader can have a hands-on
experience.
Authors' comments: Technical Report
P. Chen, A. Fannjiang, G. Liu
The null vector method, based on a simple linear algebraic concept, is proposed as a solution to the phase retrieval problem. In the case with complex Gaussian random measurement matrices, a non-asymptotic error bound is derived, yielding an asymptotic regime of accurate approximation comparable to that for the spectral vector method.
Antonia Creswell, Anil Anthony Bharath
Generative Adversarial Networks (GAN) are able to learn excellent
representations for unlabelled data which can be applied to image generation
and scene classification. Representations learned by GANs have not yet been
applied to retrieval. In this paper, we show that the representations learned
by GANs can indeed be used for retrieval. We consider heritage documents that
contain unlabelled Merchant Marks, sketch-like symbols that are similar to
hieroglyphs. We introduce a novel GAN architecture with design features that
make it suitable for sketch retrieval. The performance of this sketch-GAN is
compared to a modified version of the original GAN architecture with respect to
simple invariance properties. Experiments suggest that sketch-GANs learn
representations that are suitable for retrieval and which also have increased
stability to rotation, scale and translation compared to the standard GAN
architecture.
Authors' comments: Accepted to ECCV2016 VisArt Workshop
Kalpa Gunaratna
Document retrieval has been an important research problem over many years in the information retrieval community. State-of-the-art techniques utilize various methods in matching documents to a given document including keywords, phrases, and annotations. In this paper, we propose a new approach for document retrieval that utilizes predications (subject-predicate-object triples) extracted from the documents. We represent documents as sets of predications. We measure the similarity between predications to compute the similarity between documents. Our approach utilizes the hierarchical information available in ontologies in computing concept-concept similarity, making the approach flexible. Predication-based document similarity is more precise and forms the basis for a semantically aware document retrieval system. We show that the approach is competitive with an existing state-of-the-art related document retrieval technique in the biomedical domain.
Kinjalk Lochan, Sumanta Chakraborty, T. Padmanabhan
It is generally believed that, when matter collapses to form a black hole,
the complete information about the initial state of the matter cannot be
retrieved by future asymptotic observers, through local measurements. This is
contrary to the expectation from a unitary evolution in quantum theory and
leads to (a version of) the black hole information paradox. Classically,
nothing else, apart from mass, charge and angular momentum is expected to be
revealed to such asymptotic observers after the formation of a black hole.
Semi-classically, black holes evaporate after their formation through the
Hawking radiation. The dominant part of the radiation is expected to be thermal
and hence one cannot know anything about the initial data from the resultant
radiation. However, there can be sources of distortions which make the
radiation non-thermal. Although the distortions are not strong enough to make
the evolution unitary, these distortions carry some part of information
regarding the in-state. In this work, we show how one can decipher the
information about the in-state of the field from these distortions. We show
that the distortions of a particular kind --- which we call {\it non-vacuum
distortions} --- can be used to \emph{fully} reconstruct the initial data. The
asymptotic observer can do this operationally by measuring certain well-defined
observables of the quantum field at late times. We demonstrate that a general
class of in-states encode all their information content in the correlation of
late time out-going modes. Further, using a $1+1$ dimensional CGHS model to
accommodate back-reaction self-consistently, we show that observers can also
infer and track the information content about the initial data, during the
course of evaporation, unambiguously. Implications of such information
extraction are discussed.
Authors' comments: 23 pages; 2 figures
Maura B. Paterson, Douglas R. Stinson, Jalaj Upadhyay
There has been considerable recent interest in "cloud storage" wherein a user
asks a server to store a large file. One issue is whether the user can verify
that the server is actually storing the file, and typically a
challenge-response protocol is employed to convince the user that the file is
indeed being stored correctly. The security of these schemes is phrased in
terms of an extractor which will recover the file given any "proving algorithm"
that has a sufficiently high success probability. This forms the basis of
\emph{proof-of-retrievability} ($\mathsf{PoR}$) systems.
In this paper, we study multiple server $\mathsf{PoR}$ systems. We formalize
security definitions for two possible scenarios: (i) when a threshold of
servers succeed with high enough probability (worst-case) and (ii) when the
average of the success probability of all the servers is above a threshold
(average-case). We also motivate the study of confidentiality of the outsourced
message. We give $\mathsf{M}\mbox{-}\mathsf{PoR}$ schemes which are secure
under both these security definitions and provide reasonable confidentiality
guarantees even when there is no restriction on the computational power of the
servers. We also show how classical statistical techniques used by Paterson,
Stinson and Upadhyay (Journal of Mathematical Cryptology: 7(3)) can be extended
to evaluate whether the responses of the provers are accurate enough to permit
successful extraction. We also look at one specific instantiation of our
construction when instantiated with the unconditionally secure version of the
Shacham-Waters scheme (Asiacrypt, 2008). This scheme gives reasonable security
and privacy guarantee. We show that, in the multi-server setting with
computationally unbounded provers, one can overcome the limitation that the
verifier needs to store as much secret information as the provers.
Authors' comments: 22 pages
Veit Elser
Bit retrieval is the problem of reconstructing a binary sequence from its
periodic autocorrelation, with applications in cryptography and x-ray
crystallography. After defining the problem, with and without noise, we
describe and compare various algorithms for solving it. A geometrical
constraint satisfaction algorithm, relaxed-reflect-reflect, is currently the
best algorithm for noisy bit retrieval.
Authors' comments: 42 pages, 15 figures
Irène Waldspurger
We describe a new algorithm to solve a particular phase retrieval problem,
that has wide applications in audio processing: the reconstruction of a
function from its scalogram, that is from the modulus of its wavelet transform.
It is a multiscale iterative algorithm. To efficiently propagate phase
information from low to high frequencies, it uses an equivalent formulation of
the phase retrieval problem that involves the holomorphic extension of the
wavelet transform. Our numerical results, on audio and non-audio signals, show
that reconstruction is precise and stable to noise. The algorithm has a linear
complexity in the size of the signal, up to logarithmic factors, and can thus
be used on large signals.
Authors' comments: Analysis of the algorithm for Cauchy wavelets + various updates
Emad Elabd, Eissa Alshari, Hatem Abdulkader
Arabic language is one of the most widely spoken languages. This language has
a complex morphological structure and is considered as one of the most prolific
languages in terms of article linguistic. Therefore, Arabic Information
Retrieval (AIR) models need specific techniques to deal with this complex
morphological structure. This paper aims to develop an integrate AIR
frameworks. It lists and analysis the different Information Retrieval (IR)
methods and techniques such as query processing, stemming and indexing which
are used in AIR systems. We conclude that AIR frameworks have a weakness to
deal with semantic in term of indexing, Boolean model, Latent Semantic Analysis
(LSA), Latent Semantic Index (LSI) and semantic ranking. Therefore, semantic
Boolean IR framework is proposed in this paper. This model is implemented and
the precision, recall and run time are measured and compared with the
traditional IR model.
Authors' comments: in The International Arab Journal of Information Technology, Vol. 12,
No. 3, May 2015
Eissa M. Alshari
The continuous increasing in the amount of the published and stored information requires a special Information Retrieval (IR) frameworks to search and get information accurately and speedily. Currently, keywords-based techniques are commonly used in information retrieval. However, a major drawback of the keywords approach is its inability of handling the polysemy and synonymy phenomenon of the natural language. For instance, the meanings of words and understanding of concepts differ in different communities. Same word use for different concepts (polysemy) or use different words for the same concept (synonymy). Most of information retrieval frameworks have a weakness to deal with the semantics of the words in term of (indexing, Boolean model, Latent Semantic Analysis (LSA) , Latent semantic Index (LSI) and semantic ranking, etc.). Traditional Arabic Information Retrieval (AIR) models performance insufficient with semantic queries, which deal with not only the keywords but also with the context of these keywords. Therefore, there is a need for a semantic information retrieval model with a semantic index structure and ranking algorithm based on semantic index.
Sara Botelho-Andrade, Peter G. Casazza, Hanh Van Nguyen, Janet C. Tremain
In 2006, Balan/Casazza/Edidin \cite{BCE} introduced the frame theoretic study of phaseless reconstruction. Since then, this has turned into a very active area of research. Over the years, many people have replaced the term {\it phaseless reconstruction} with {\it phase retrieval}. Casazza then asked: {\it Are these really the same?} In this paper, we will show that phase retrieval is equivalent to phaseless reconstruction. We then show, more generally, that phase retrieval by projections is equivalent to phaseless reconstruction by projections. Finally, we study {\it weak phase retrieval} and discover that it is very different from phaseless reconstruction.