Urmila Shrawankar, Anjali Mahajan
Along with the rapid development of information technology, the amount of
information generated at a given time far exceeds human's ability to organize,
search, and manipulate without the help of automatic systems. Now a days so
many tools and techniques are available for storage and retrieval of
information. User uses interface to interact with these techniques, mostly text
user interface (TUI) or graphical user interface (GUI). Here, I am trying to
introduce a new interface i.e. speech for information retrieval. The goal of
this project is to develop a speech interface that can search and read the
required information from the database effectively, efficiently and more
friendly. This tool will be highly useful to blind people, they will able to
demand the information to the computer by giving voice command/s (keyword)
through microphone and listen the required information using speaker or
headphones.
Authors' comments: Pages: 06 Figures: 01; Conference Proceedings International
Conference on Digital Libraires (ICDL 06), 2006, TERI, India
D. García Yárnoz, J. P. Sánchez, C. R. McInnes
Asteroids and comets are of strategic importance for science in an effort to understand the formation, evolution and composition of the Solar System. Near-Earth Objects (NEOs) are of particular interest because of their accessibility from Earth, but also because of their speculated wealth of material resources. The exploitation of these resources has long been discussed as a means to lower the cost of future space endeavours. In this paper, we consider the currently known NEO population and define a family of so-called Easily Retrievable Objects (EROs), objects that can be transported from accessible heliocentric orbits into the Earth's neighbourhood at affordable costs. The asteroid retrieval transfers are sought from the continuum of low energy transfers enabled by the dynamics of invariant manifolds; specifically, the retrieval transfers target planar, vertical Lyapunov and halo orbit families associated with the collinear equilibrium points of the Sun-Earth Circular Restricted Three Body problem. The judicious use of these dynamical features provides the best opportunity to find extremely low energy Earth transfers for asteroid material. A catalogue of asteroid retrieval candidates is then presented. Despite the highly incomplete census of very small asteroids, the ERO catalogue can already be populated with 12 different objects retrievable with less than 500 m/s of {\Delta}v. Moreover, the approach proposed represents a robust search and ranking methodology for future retrieval candidates that can be automatically applied to the growing survey of NEOs.
Diederik Aerts, Jan Broekaert, Sandro Sozzo, Tomas Veloz
In recent years, quantum-based methods have promisingly integrated the
traditional procedures in information retrieval (IR) and natural language
processing (NLP). Inspired by our research on the identification and
application of quantum structures in cognition, more specifically our work on
the representation of concepts and their combinations, we put forward a
'quantum meaning based' framework for structured query retrieval in text
corpora and standardized testing corpora. This scheme for IR rests on
considering as basic notions, (i) 'entities of meaning', e.g., concepts and
their combinations and (ii) traces of such entities of meaning, which is how
documents are considered in this approach. The meaning content of these
'entities of meaning' is reconstructed by solving an 'inverse problem' in the
quantum formalism, consisting of reconstructing the full states of the entities
of meaning from their collapsed states identified as traces in relevant
documents. The advantages with respect to traditional approaches, such as
Latent Semantic Analysis (LSA), are discussed by means of concrete examples.
Authors' comments: 11 pages
Marc Sloan, Jun Wang
Many Information Retrieval (IR) models make use of offline statistical
techniques to score documents for ranking over a single period, rather than use
an online, dynamic system that is responsive to users over time. In this paper,
we explicitly formulate a general Multi Period Information Retrieval problem,
where we consider retrieval as a stochastic yet controllable process. The
ranking action during the process continuously controls the retrieval system's
dynamics, and an optimal ranking policy is found in order to maximise the
overall users' satisfaction over the multiple periods as much as possible. Our
derivations show interesting properties about how the posterior probability of
the documents relevancy evolves from users feedbacks through clicks, and
provides a plug-in framework for incorporating different click models. Based on
the Multi-Armed Bandit theory, we propose a simple implementation of our
framework using a dynamic ranking rule that takes rank bias and exploration of
documents into account. We use TREC data to learn a suitable exploration
parameter for our model, and then analyse its performance and a number of
variants using a search log data set; the experiments suggest an ability to
explore document relevance dynamically over time using user feedback in a way
that can handle rank bias.
Authors' comments: 8 pages, 3 tables, published at the Workshop on Web Search Click Data
2013
Kishore Jaganathan, Samet Oymak, Babak Hassibi
We consider the problem of recovering signals from their power spectral
density. This is a classical problem referred to in literature as the phase
retrieval problem, and is of paramount importance in many fields of applied
sciences. In general, additional prior information about the signal is required
to guarantee unique recovery as the mapping from signals to power spectral
density is not one-to-one. In this paper, we assume that the underlying signals
are sparse. Recently, semidefinite programming (SDP) based approaches were
explored by various researchers. Simulations of these algorithms strongly
suggest that signals upto $o(\sqrt{n})$ sparsity can be recovered by this
technique. In this work, we develop a tractable algorithm based on reweighted
$l_1$-minimization that recovers a sparse signal from its power spectral
density for significantly higher sparsities, which is unprecedented.
We discuss the square-root bottleneck of the existing convex algorithms and
show that a $k$-sparse signal can be efficiently recovered using $O(k^2logn)$
phaseless Fourier measurements. We also show that a $k$-sparse signal can be
recovered using only $O(k log n)$ phaseless measurements if we are allowed to
design the measurement matrices.
Authors' comments: 6 pages, 2 figures
Fatiha Boubekeur, Wassila Azzoug
Traditional information retrieval systems rely on keywords to index documents
and queries. In such systems, documents are retrieved based on the number of
shared keywords with the query. This lexical-focused retrieval leads to
inaccurate and incomplete results when different keywords are used to describe
the documents and queries. Semantic-focused retrieval approaches attempt to
overcome this problem by relying on concepts rather than on keywords to
indexing and retrieval. The goal is to retrieve documents that are semantically
relevant to a given user query. This paper addresses this issue by proposing a
solution at the indexing level. More precisely, we propose a novel approach for
semantic indexing based on concepts identified from a linguistic resource. In
particular, our approach relies on the joint use of WordNet and WordNetDomains
lexical databases for concept identification. Furthermore, we propose a
semantic-based concept weighting scheme that relies on a novel definition of
concept centrality. The resulting system is evaluated on the TIME test
collection. Experimental results show the effectiveness of our proposition over
traditional IR approaches.
Authors' comments: 18 pages, 5 tables, 3 figures
Bernhard G. Bodmann, Nathaniel Hammen
We investigate the recovery of vectors from magnitudes of frame coefficients
when the frames have a low redundancy, meaning a small number of frame vectors
compared to the dimension of the Hilbert space. We first show that for vectors
in d dimensions, 4d-4 suitably chosen frame vectors are sufficient to uniquely
determine each signal, up to an overall unimodular constant, from the
magnitudes of its frame coefficients. Then we discuss the effect of noise and
show that 8d-4 frame vectors provide a stable recovery if part of the frame
coefficients is bounded away from zero. In this regime, perturbing the
magnitudes of the frame coefficients by noise that is sufficiently small
results in a recovery error that is at most proportional to the noise level.
Authors' comments: 12 pages AMSLaTeX, 1 figure
Yoav Shechtman, Amir Beck, Yonina C. Eldar
We consider the problem of phase retrieval, namely, recovery of a signal from
the magnitude of its Fourier transform, or of any other linear transform. Due
to the loss of the Fourier phase information, this problem is ill-posed.
Therefore, prior information on the signal is needed in order to enable its
recovery. In this work we consider the case in which the signal is known to be
sparse, i.e., it consists of a small number of nonzero elements in an
appropriate basis. We propose a fast local search method for recovering a
sparse signal from measurements of its Fourier transform (or other linear
transform) magnitude which we refer to as GESPAR: GrEedy Sparse PhAse
Retrieval. Our algorithm does not require matrix lifting, unlike previous
approaches, and therefore is potentially suitable for large scale problems such
as images. Simulation results indicate that GESPAR is fast and more accurate
than existing techniques in a variety of settings.
Authors' comments: Generalized to non-Fourier measurements, added 2D simulations, and a
theorem for convergence to stationary point
Andrey Nikolaev
Spinlocks are widely used in database engines for processes synchronization.
KGX mutexes is new retrial spinlocks appeared in contemporary Oracle versions
for submicrosecond synchronization. The mutex contention is frequently observed
in highly concurrent OLTP environments.
This work explores how Oracle mutexes operate, spin, and sleep. It develops
predictive mathematical model and discusses parameters and statistics related
to mutex performance tuning, as well as results of contention experiments.
Authors' comments: Proceedings of International Conference on Informatics MEDIAS2012.
Cyprus, Limassol, May 7--14, 2012. ISBN 978-5-88835-023-2. 12 pages, 15
figures
Roberto Konow, Gonzalo Navarro
An optimal index solving top-k document retrieval [Navarro and Nekrich,
SODA12] takes O(m + k) time for a pattern of length m, but its space is at
least 80n bytes for a collection of n symbols. We reduce it to 1.5n to 3n
bytes, with O(m+(k+log log n) log log n) time, on typical texts. The index is
up to 25 times faster than the best previous compressed solutions, and requires
at most 5% more space in practice (and in some cases as little as one half).
Apart from replacing classical by compressed data structures, our main idea is
to replace suffix tree sampling by frequency thresholding to achieve
compression.
Authors' comments: 10 pages
Yonina C. Eldar, Shahar Mendelson
We consider stability and uniqueness in real phase retrieval problems over general input sets. Specifically, we assume the data consists of noisy quadratic measurements of an unknown input x in R^n that lies in a general set T and study conditions under which x can be stably recovered from the measurements. In the noise-free setting we derive a general expression on the number of measurements needed to ensure that a unique solution can be found in a stable way, that depends on the set T through a natural complexity parameter. This parameter can be computed explicitly for many sets T of interest. For example, for k-sparse inputs we show that O(k\log(n/k)) measurements are needed, and when x can be any vector in R^n, O(n) measurements suffice. In the noisy case, we show that if one can find a value for which the empirical risk is bounded by a given, computable constant (that depends on the set T), then the error with respect to the true input is bounded above by an another, closely related complexity parameter of the set. By choosing an appropriate number N of measurements, this bound can be made arbitrarily small, and it decays at a rate faster than N^{-1/2+\delta} for any \delta>0. In particular, for k-sparse vectors stable recovery is possible from O(k\log(n/k)\log k) noisy measurements, and when x can be any vector in R^n, O(n \log n) noisy measurements suffice. We also show that the complexity parameter for the quadratic problem is the same as the one used for analyzing stability in linear measurements under very general conditions. Thus, no substantial price has to be paid in terms of stability if there is no knowledge of the phase.
Yi Wang
This paper gives a summary of the content-based Image Retrieval and Content-based Audio Retrieval, which are two parts of the Content-based Retrieval. Content-based Retrieval is the retrieval based on the features of the content. Generally, it is a way to extract features of the media data and find other data with the similar features from the database automatically. Content-based Retrieval can not only work on discrete media like texts, but also can be used on continuous media, such as video and audio.
Laxmi Choudhary, Bhawani Shankar Burdak
As the use of web is increasing more day by day, the web users get easily
lost in the web's rich hyper structure. The main aim of the owner of the
website is to give the relevant information according their needs to the users.
We explained the Web mining is used to categorize users and pages by analyzing
user's behavior, the content of pages and then describe Web Structure mining.
This paper includes different Page Ranking algorithms and compares those
algorithms used for Information Retrieval. Different Page Rank based algorithms
like Page Rank (PR), WPR (Weighted Page Rank), HITS (Hyperlink Induced Topic
Selection), Distance Rank and EigenRumor algorithms are discussed and compared.
Simulation Interface has been designed for PageRank algorithm and Weighted
PageRank algorithm but PageRank is the only ranking algorithm on which Google
search engine works.
Authors' comments: Keywords: Page Rank, Web Mining, Web Structured Mining, Web Content
Mining
Udayan Khurana, Amol Deshpande
We address the problem of managing historical data for large evolving information networks like social networks or citation networks, with the goal to enable temporal and evolutionary queries and analysis. We present the design and architecture of a distributed graph database system that stores the entire history of a network and provides support for efficient retrieval of multiple graphs from arbitrary time points in the past, in addition to maintaining the current state for ongoing updates. Our system exposes a general programmatic API to process and analyze the retrieved snapshots. We introduce DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compactly recording the historical information, and that supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right parameters for a specific scenario. In addition, we present strategies for materializing portions of the historical graph state in memory to further speed up the retrieval process. Secondly, we present an in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner. We present a comprehensive experimental evaluation that illustrates the effectiveness of our proposed techniques at managing historical graph information.
Nieves R. Brisaboa, Ana Cerdeira-Pena, Gonzalo Navarro, Oscar Pedreira
Ranked document retrieval is a fundamental task in search engines. Such
queries are solved with inverted indexes that require additional 45%-80% of the
compressed text space, and take tens to hundreds of microseconds per query. In
this paper we show how ranked document retrieval queries can be solved within
tens of milliseconds using essentially no extra space over an in-memory
compressed representation of the document collection. More precisely, we
enhance wavelet trees on bytecodes (WTBCs), a data structure that rearranges
the bytes of the compressed collection, so that they support ranked conjunctive
and disjunctive queries, using just 6%-18% of the compressed text space.
Authors' comments: This is an extended version of the paper that will appear in Proc. of
SPIRE'2012
Rahul Shah, Cheng Sheng, Sharma V. Thankachan, Jeffrey Scott Vitter
Let ${\cal{D}}$ = $\{d_1, d_2, d_3, ..., d_D\}$ be a given set of $D$
(string) documents of total length $n$. The top-$k$ document retrieval problem
is to index $\cal{D}$ such that when a pattern $P$ of length $p$, and a
parameter $k$ come as a query, the index returns the $k$ most relevant
documents to the pattern $P$. Hon et. al. \cite{HSV09} gave the first linear
space framework to solve this problem in $O(p + k\log k)$ time. This was
improved by Navarro and Nekrich \cite{NN12} to $O(p + k)$. These results are
powerful enough to support arbitrary relevance functions like frequency,
proximity, PageRank, etc. In many applications like desktop or email search,
the data resides on disk and hence disk-bound indexes are needed. Despite of
continued progress on this problem in terms of theoretical, practical and
compression aspects, any non-trivial bounds in external memory model have so
far been elusive. Internal memory (or RAM) solution to this problem decomposes
the problem into $O(p)$ subproblems and thus incurs the additive factor of
$O(p)$. In external memory, these approaches will lead to $O(p)$ I/Os instead
of optimal $O(p/B)$ I/O term where $B$ is the block-size. We re-interpret the
problem independent of $p$, as interval stabbing with priority over tree-shaped
structure. This leads us to a linear space index in external memory supporting
top-$k$ queries (with unsorted outputs) in near optimal $O(p/B + \log_B n +
\log^{(h)} n + k/B)$ I/Os for any constant $h${$\log^{(1)}n =\log n$ and
$\log^{(h)} n = \log (\log^{(h-1)} n)$}. Then we get $O(n\log^*n)$ space index
with optimal $O(p/B+\log_B n + k/B)$ I/Os.
Authors' comments: 3 figures
Albert Fannjiang, Wenjing Liao
This paper presents a detailed, numerical study on the performance of the standard phasing algorithms with random phase illumination (RPI). Phasing with high resolution RPI and the oversampling ratio $\sigma=4$ determines a unique phasing solution up to a global phase factor. Under this condition, the standard phasing algorithms converge rapidly to the true solution without stagnation. Excellent approximation is achieved after a small number of iterations, not just with high resolution but also low resolution RPI in the presence of additive as well multiplicative noises. It is shown that RPI with $\sigma=2$ is sufficient for phasing complex-valued images under a sector condition and $\sigma=1$ for phasing nonnegative images. The Error Reduction algorithm with RPI is proved to converge to the true solution under proper conditions.
Hassania Ouchetto, Ouail Ouchetto, Ounsa Roudies
The semantic e-government is a new application field accompanying the development of semantic web where the ontologies have become a fertile field of investigation. This is due firstly to both the complexity and the size of e-government systems and secondly to the importance of the issues. However, permitting easy and personalized access to e-government services has become, at this juncture, an arduous and not spontaneous process. Indeed, the provided e-gov services to the user represent a critical contact point between administrations and users. The encountered problems in the e-gov services retrieving process are: the absence of an integrated one-stop government, the difficulty of localizing the services' sources, the lack of mastery of search terms and the deficiency of multilingualism of the online services. In order to solve these problems, to facilitate access to e-gov services and to satisfy the needs of potential users, we propose an original approach to this issue. This approach incorporates a semantic layer as a crucial element in the retrieving process. It consists in implementing a personalized search system that integrates ontology of the e-gov domain in this process.
Roman Zapatrin
Document ranking based on probabilistic evaluations of relevance is known to
exhibit non-classical correlations, which may be explained by admitting a
complex structure of the event space, namely, by assuming the events to emerge
from multiple sample spaces. The structure of event space formed by overlapping
sample spaces is known in quantum mechanics, they may exhibit some
counter-intuitive features, called quantum contextuality. In this Note I
observe that from the structural point of view quantum contextuality looks
similar to personalization of information retrieval scenarios. Along these
lines, Knowledge Revision is treated as operationalistic measurement and a way
to quantify the rate of personalization of Information Retrieval scenarios is
suggested.
Authors' comments: 11 pages
Weimao Ke
We proposed a Least Information theory (LIT) to quantify meaning of
information in probability distribution changes, from which a new information
retrieval model was developed. We observed several important characteristics of
the proposed theory and derived two quantities in the IR context for document
representation. Given probability distributions in a collection as prior
knowledge, LI Binary (LIB) quantifies least information due to the binary
occurrence of a term in a document whereas LI Frequency (LIF) measures least
information based on the probability of drawing a term from a bag of words.
Three fusion methods were also developed to combine LIB and LIF quantities for
term weighting and document ranking. Experiments on four benchmark TREC
collections for ad hoc retrieval showed that LIT-based methods demonstrated
very strong performances compared to classic TF*IDF and BM25, especially for
verbose queries and hard search topics. The least information theory offers a
new approach to measuring semantic quantities of information and provides
valuable insight into the development of new IR models.
Authors' comments: 10 pages, 3 figures