Haohan Zhu, George Kollios, Vassilis Athitsos
This paper proposes a general framework for matching similar subsequences in
both time series and string databases. The matching results are pairs of query
subsequences and database subsequences. The framework finds all possible pairs
of similar subsequences if the distance measure satisfies the "consistency"
property, which is a property introduced in this paper. We show that most
popular distance functions, such as the Euclidean distance, DTW, ERP, the
Frechet distance for time series, and the Hamming distance and Levenshtein
distance for strings, are all "consistent". We also propose a generic index
structure for metric spaces named "reference net". The reference net occupies
O(n) space, where n is the size of the dataset, and is optimized to work well
with our framework. The experiments demonstrate the ability of our method to
improve retrieval performance when combined with diverse distance measures. The
experiments also illustrate that the reference net scales well in terms of
space overhead and query time.
Authors' comments: VLDB2012
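To make the notion of matching similar subsequences concrete, here is a minimal brute-force sketch in Python for one of the distance measures named above, the Levenshtein distance. It is not the paper's framework and does not use the reference net. The function names, the fixed window length, and the threshold `eps` are assumptions made only for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar_windows(query: str, database: str, length: int, eps: int):
    """Return (query_start, database_start, distance) for every pair of
    windows of the given length whose distance is at most eps.
    Brute force: every pair of windows is compared (hypothetical baseline)."""
    pairs = []
    for i in range(len(query) - length + 1):
        for j in range(len(database) - length + 1):
            d = levenshtein(query[i:i + length], database[j:j + length])
            if d <= eps:
                pairs.append((i, j, d))
    return pairs
```

For example, `similar_windows("abcd", "xbcdx", 3, 0)` reports the single exact match between the window `"bcd"` of the query and the same window of the database string. An index structure would aim to avoid most of the pairwise distance computations that this baseline performs.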
Swathi Rajasurya, Tamizhamudhu Muralidharan, Sandhiya Devi, S. Swamynathan
Today's conventional search engines rarely provide the essential content relevant to the user's search query, because the context and semantics of the user's request are not fully analyzed. Hence the need for semantic web search (SWS), an emerging area of web search that combines Natural Language Processing and Artificial Intelligence. The objective of this work is to design, develop and implement a semantic search engine, SIEU (Semantic Information Extraction in University Domain), confined to the university domain. SIEU uses an ontology as a knowledge base for the information retrieval process. It is not a mere keyword search: it works one layer above what Google or other search engines retrieve by analyzing keywords alone. The query is analyzed both syntactically and semantically, and the system retrieves web results more relevant to the user query through keyword expansion. Because the query is analyzed semantically, the accuracy of the results is expected to improve enough to satisfy the user's request. The system will be of great use to developers and researchers who work on the web. The Google results are re-ranked and optimized to provide the relevant links, using a ranking algorithm that fetches more apt results for the user query.
Mohammad Nabil Almunawar
Content-based multimedia information retrieval is an interesting research
area since it allows retrieval based on inherent characteristics of multimedia
objects, for example retrieval based on visual characteristics such as colour,
shapes or textures of objects in images, or retrieval based on spatial
relationships among objects in the media (images or video clips). This paper
reviews some work done in image and video retrieval and then proposes an
integrated model that can handle images and video clips uniformly. Using this
model, retrieval on images or video clips can be done based on the same
framework.
Authors' comments: 15 pages, conference paper
Mahyuddin K. M. Nasution, Shahrul Azman Noah
Future information retrieval, especially in connection with the internet,
will incorporate content descriptions generated with social network
extraction technologies, preferably using probability theory to assign
semantics. Although interest in social network extraction is increasing,
little of this work has had a significant impact on information retrieval.
Therefore, this paper proposes a model of information retrieval based on
social network extraction.
Authors' comments: 5 pages
Shuyu Zhou, Shanchao Zhang, Chang Liu, J. F. Chen, Jianming Wen, M. M. T. Loy, G. K. L. Wong, Shengwang Du
We report an experimental demonstration of optimal storage and retrieval of
heralded single-photon wave packets using electromagnetically induced
transparency (EIT) in cold atoms at a high optical depth. We obtain an optimal
storage efficiency of (49+/-3)% for single-photon waveforms with a temporal
likeness of 96%. Our result brings the EIT quantum light-matter interface close
to practical quantum information applications.
Authors' comments: 5 pages, 4 figures
Sudhir Ahuja, Mr. Rinkaj Goyal
The web is a huge repository of data, and new information is added to it every day. The more information there is, the greater the demand for tools to access it. Answering users' queries about online information intelligently is one of the great challenges of information retrieval in intelligent systems. In this paper, we start with a brief introduction to information retrieval and intelligent systems and explain how Swoogle, the semantic search engine, uses its algorithms and techniques to search for the desired content on the web. We then continue with the clustering technique used to group similar items together and discuss the machine learning technique called self-organizing maps [6], or SOM, a data visualization technique that reduces the dimensionality of data through the use of self-organizing neural networks. We then discuss how SOM is used to visualize the contents of the data in the form of maps by following the steps of the algorithm. We conclude that websites or machines can be used to retrieve exactly the information that users want from them.
Thomas Lüke, Philipp Schaer, Philipp Mayr
Choosing the right terms to describe an information need is becoming more
difficult as the amount of available information increases.
Search-Term-Recommendation (STR) systems can help to overcome these problems.
This paper evaluates the benefits that may be gained from the use of STRs in
Query Expansion (QE). We create 17 STRs, 16 based on specific disciplines and
one giving general recommendations, and compare the retrieval performance of
these STRs. The main findings are: (1) QE with specific STRs leads to
significantly better results than QE with a general STR, (2) QE with specific
STRs selected by a heuristic mechanism of topic classification leads to better
results than the general STR, however (3) selecting the best matching specific
STR in an automatic way is a major challenge of this process.
Authors' comments: 6 pages; to be published in Proceedings of Theory and Practice of
Digital Libraries 2012 (TPDL 2012)
Mohammadreza Keyvanpour, Reza Tavoli
Feature weighting is a technique used to approximate the optimal degree of influence of individual features. This paper presents a feature weighting method for a Document Image Retrieval System (DIRS) based on keyword spotting. In this method, we weight the features using the coefficient of multiple correlation, which can be used to describe the synthesized effects and correlation of each feature. The aim of this paper is to show that feature weighting increases the performance of DIRS. After applying the feature weighting method to DIRS, the average precision is 93.23% and the average recall is 98.66%.
Youssef Bassil, Paul Semaan
With the advent of the Internet, a new era of digital information exchange
has begun. Currently, the Internet encompasses more than five billion online
sites and this number is exponentially increasing every day. Fundamentally,
Information Retrieval (IR) is the science and practice of storing documents and
retrieving information from within these documents. Mathematically, IR systems
are at the core based on a feature vector model coupled with a term weighting
scheme that weights terms in a document according to their significance with
respect to the context in which they appear. Practically, Vector Space Model
(VSM), Term Frequency (TF), and Inverse Document Frequency (IDF) are among the
long-established techniques employed in mainstream IR systems. However, present
IR models only target generic text documents, in that they do not
consider specific file formats such as HTML web documents. This paper
proposes a new semantic-sensitive web information retrieval model for HTML
documents. It consists of a vector model called SWVM and a weighting scheme
called BTF-IDF, particularly designed to support the indexing and retrieval of
HTML web documents. The chief advantage of the proposed model is that it
assigns extra weights for terms that appear in certain pre-specified HTML tags
that are correlated to the semantics of the document. Additionally, the model
is semantic-sensitive as it generates synonyms for every term being indexed and
later weights them appropriately to increase the likelihood of retrieving
documents with similar context but different vocabulary terms. The experiments
conducted revealed a substantial enhancement in the precision of web IR systems
and a marked increase in the number of relevant documents being retrieved. As
further research, the proposed model is to be upgraded so as to support the
indexing and retrieval of web images in multimedia-rich web documents.
Authors' comments: LACSC - Lebanese Association for Computational Sciences,
http://www.lacsc.org/; European Journal of Scientific Research, Vol. 69, No.
4, February 2012
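The idea of giving extra weight to terms that appear in selected HTML tags can be sketched as a tag-boosted variant of TF-IDF. The Python sketch below is not the SWVM or BTF-IDF definition from the paper: the boost values in `TAG_BOOST`, the function names, and the input format of (tag, term) pairs are all hypothetical, and synonym generation is left out.

```python
import math
from collections import Counter

# Hypothetical boost per HTML tag, chosen only for illustration.
TAG_BOOST = {"title": 3.0, "h1": 2.0, "body": 1.0}


def boosted_tf(doc):
    """doc is a list of (tag, term) pairs. Each occurrence of a term
    contributes the boost of the tag it appears in (default 1.0)."""
    tf = Counter()
    for tag, term in doc:
        tf[term] += TAG_BOOST.get(tag, 1.0)
    return tf


def boosted_tf_idf(docs):
    """Return one {term: weight} dict per document, where
    weight = boosted term frequency * log(N / document frequency)."""
    n = len(docs)
    tfs = [boosted_tf(doc) for doc in docs]
    df = Counter(term for tf in tfs for term in tf)  # documents containing term
    return [{term: freq * math.log(n / df[term]) for term, freq in tf.items()}
            for tf in tfs]
```

With two documents, a term that occurs once in a `title` tag of only one of them receives three times the weight of a term that occurs once in the `body` of only one document, while a term present in both documents receives weight zero.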
Mónica Marrero, Sonia Sánchez-Cuadrado, Julián Urbano, Jorge Morato, José-Antonio Moreiro
The terminology used in Biomedicine shows lexical peculiarities that have
required the elaboration of terminological resources and information retrieval
systems with specific functionalities. The main characteristics are the high
rates of synonymy and homonymy, due to phenomena such as the proliferation of
polysemic acronyms and their interaction with common language. Information
retrieval systems in the biomedical domain use techniques oriented to the
treatment of these lexical peculiarities. In this paper we review some of the
techniques used in this domain, such as the application of Natural Language
Processing (BioNLP), the incorporation of lexical-semantic resources, and the
application of Named Entity Recognition (BioNER). Finally, we present the
evaluation methods adopted to assess the suitability of these techniques for
retrieving biomedical resources.
Authors' comments: 6 pages, 4 tables
Tranos Zuva, Oludayo O. Olugbara, Sunday O. Ojo, Seleman M. Ngwira
Research is taking place to find effective algorithms for content-based image
representation and description. A substantial number of algorithms are
available that use visual features (color, shape, texture). The shape feature has
attracted so much attention from researchers that there are many shape
representation and description algorithms in the literature. These shape
representation and description algorithms are usually not application
independent or robust, making them undesirable for generic shape description.
This paper presents an object shape representation using Kernel Density Feature
Points Estimator (KDFPE). In this method, the density of feature points within
defined rings around the centroid of the image is obtained. The KDFPE is then
applied to the vector of the image. KDFPE is invariant to translation, scale
and rotation. This method of image representation shows an improved retrieval rate
when compared to the Density Histogram Feature Points (DHFP) method. An
analytical evaluation is carried out to justify our method, which was compared
with DHFP to demonstrate its robustness.
Authors' comments: ISSN 0975-5578 (Online) 0975-5934 (Print)
Tarek El-Shishtawy, Fatma El-Ghannam
Despite its robust syntax, semantic cohesion, and lower ambiguity, lemma-level
analysis and generation has not yet received focus in the Arabic NLP literature.
In the current research, we propose the first accurate non-statistical Arabic
lemmatizer algorithm that is suitable for information retrieval (IR) systems.
The proposed lemmatizer makes use of different Arabic language knowledge
resources to generate an accurate lemma form and its relevant features that
support IR purposes. As a POS tagger, the experimental results show that the
proposed algorithm achieves a maximum accuracy of 94.8%. For first-seen
documents, an accuracy of 89.15% is achieved, compared to 76.7% for the up-to-date
Stanford accurate Arabic model on the same dataset.
Authors' comments: 9 pages
Sergey Petrov, Jose F. Fontanari, Leonid I. Perlovsky
The categorization of emotion names, i.e., the grouping of emotion words that have similar emotional connotations together, is a key tool of Social Psychology used to explore people's knowledge about emotions. Without exception, the studies following that research line were based on the gauging of the perceived similarity between emotion names by the participants of the experiments. Here we propose and examine a new approach to study the categories of emotion names - the similarities between target emotion names are obtained by comparing the contexts in which they appear in texts retrieved from the World Wide Web. This comparison does not account for any explicit semantic information; it simply counts the number of common words or lexical items used in the contexts. This procedure allows us to write the entries of the similarity matrix as dot products in a linear vector space of contexts. The properties of this matrix were then explored using Multidimensional Scaling Analysis and Hierarchical Clustering. Our main findings, namely, the underlying dimension of the emotion space and the categories of emotion names, were consistent with those based on people's judgments of emotion names similarities.
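The step of writing similarity entries as dot products can be illustrated with a small Python sketch. It assumes that each emotion name comes with a collection of context words and that contexts are encoded as binary indicator vectors over a shared vocabulary, so that the dot product of two vectors equals the number of words the contexts have in common. The function name and the encoding are assumptions for illustration, not the authors' exact procedure.

```python
def similarity_matrix(contexts):
    """contexts maps each emotion name to an iterable of context words.
    Returns (names, matrix) where matrix[i][j] is the dot product of the
    binary context vectors of names[i] and names[j], i.e. the number of
    distinct words shared by the two contexts."""
    names = sorted(contexts)
    vocab = sorted({word for words in contexts.values() for word in words})
    vectors = {}
    for name in names:
        present = set(contexts[name])
        vectors[name] = [1 if word in present else 0 for word in vocab]
    matrix = [[sum(x * y for x, y in zip(vectors[a], vectors[b]))
               for b in names]
              for a in names]
    return names, matrix
```

A matrix built this way is symmetric, and its diagonal holds the number of distinct context words of each name. It could then be passed to standard multidimensional scaling or hierarchical clustering routines.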
Eliyahu Osherovich, Michael Zibulevsky, Irad Yavneh
In this work we develop an algorithm for signal reconstruction from the magnitude of its Fourier transform in a situation where some (non-zero) parts of the sought signal are known. Although our method does not assume that the known part comprises the boundary of the sought signal, this is often the case in microscopy: a specimen is placed inside a known mask, which can be thought of as a known light source that surrounds the unknown signal. Therefore, in the past, several algorithms were suggested that solve the phase retrieval problem assuming known boundary values. Unlike our method, these methods do rely on the fact that the known part is on the boundary. Besides the reconstruction method we give an explanation of the phenomena observed in previous work: the reconstruction is much faster when there is more energy concentrated in the known part. Quite surprisingly, this can be explained using our previous results on phase retrieval with approximately known Fourier phase.
Sarah Tang, Afzal Godil
As the usage of 3D models increases, so does the importance of developing
accurate 3D shape retrieval algorithms. A common approach is to calculate a
shape descriptor for each object, which can then be compared to determine two
objects' similarity. However, these descriptors are often evaluated
independently and on different datasets, making them difficult to compare.
Using the SHREC 2011 Shape Retrieval Contest of Non-rigid 3D Watertight Meshes
dataset, we systematically evaluate a collection of local shape descriptors. We
apply each descriptor to the bag-of-words paradigm and assess the effects of
varying the dictionary's size and the number of sample points. In addition,
several salient point detection methods are used to choose sample points; these
methods are compared to each other and to random selection. Finally,
information from two local descriptors is combined in two ways and changes in
performance are investigated. This paper presents the results of these experiments.
Authors' comments: IS&T/SPIE Electronic Imaging 2012, Proceedings Vol. 8290
Three-Dimensional Image Processing (3DIP) and Applications II, Atilla M.
Baskurt; Robert Sitnik, Editors, 82900N Dates: Tuesday-Thursday 24 - 26
January 2012, Paper 8290-22
Scott Hand
This paper proposes a novel statistical approach to intelligent document retrieval. It seeks to offer a more structured and extensible mathematical approach to the term generalization done in the popular Latent Semantic Analysis (LSA) approach to document indexing. A Markov Random Field (MRF) is presented that captures relationships between terms and documents as probabilistic dependence assumptions between random variables. From there, it uses the MRF-Gibbs equivalence to derive joint probabilities as well as local probabilities for document variables. A parameter learning method is proposed that utilizes rank reduction with singular value decomposition in a manner similar to LSA to reduce the dimensionality of document-term relationships to that of a latent topic space. Experimental results confirm the ability of this approach to effectively and efficiently retrieve documents from substantial data sets.
My Abdellah Kassimi, Omar El beqqali
The size of 3D models used on the web or stored in databases is becoming
increasingly large. An efficient method that allows users to find 3D objects
similar to a given 3D model query has therefore become necessary. Keywords and
the geometry of a 3D model cannot meet users' retrieval needs because they
do not include semantic information. In this paper, a new method is
proposed for 3D model retrieval using semantic concepts combined with shape
indexes. To obtain these concepts, we use machine learning methods to label
3D models with the k-means algorithm in the space of measures and shape indexes.
Moreover, the semantic concepts are organized and represented in the ontology
language OWL, and spatial relationships are used to disambiguate among models of
similar appearance. The SPARQL query language is used to query the information
represented in this language and to compute the similarity between two 3D models.
We interpret our results using the Princeton Shape Benchmark Database, and the
results show the performance of the proposed approach to 3D model
retrieval. Keywords: 3D Model, 3D retrieval, measures, shape indexes, semantic,
ontology
Authors' comments: IJCSI International Journal of Computer Science Issues, Vol. 8, Issue
3, May 2011
Henrik Ohlsson, Allen Y. Yang, Roy Dong, S. Shankar Sastry
Given a linear system in a real or complex domain, linear regression aims to
recover the model parameters from a set of observations. Recent studies in
compressive sensing have successfully shown that under certain conditions, a
linear program, namely, l1-minimization, guarantees recovery of sparse
parameter signals even when the system is underdetermined. In this paper, we
consider a more challenging problem: when the phase of the output measurements
from a linear system is omitted. Using a lifting technique, we show that even
though the phase information is missing, the sparse signal can be recovered
exactly by solving a simple semidefinite program when the sampling rate is
sufficiently high, albeit the exact solutions to both sparse signal recovery
and phase retrieval are combinatorial. The results extend the range of
applications of compressive sensing to those where only output
magnitudes can be observed. We demonstrate the accuracy of the algorithms
through theoretical analysis, extensive simulations and a practical experiment.
Authors' comments: Parts of the derivations have been submitted to the 16th IFAC Symposium on
System Identification, SYSID 2012, and parts to the 51st IEEE Conference on
Decision and Control, CDC 2012
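The lifting step mentioned in this abstract can be written out as follows. The notation is assumed here for illustration and is not taken from the paper: $a_i$ are the measurement vectors, $b_i$ the observed squared magnitudes, $x$ the unknown sparse signal, and $\lambda \ge 0$ a weight. The first line is the standard lifting identity; the second is one common convex relaxation that replaces the rank-one and sparsity requirements with trace and entrywise $\ell_1$ penalties, and may differ from the program actually solved by the authors.

```latex
b_i = |a_i^{H} x|^2 = a_i^{H} x x^{H} a_i = \operatorname{Tr}\!\left(a_i a_i^{H} X\right),
\qquad X = x x^{H} \succeq 0, \quad \operatorname{rank}(X) = 1 .

\min_{X \succeq 0} \; \operatorname{Tr}(X) + \lambda \, \|X\|_1
\quad \text{subject to} \quad \operatorname{Tr}\!\left(a_i a_i^{H} X\right) = b_i, \quad i = 1, \dots, N .
```

The measurements are quadratic in $x$ but linear in the lifted matrix $X$, which is what makes a semidefinite formulation possible.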
Gonzalo Navarro, Daniel Valenzuela
Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel algorithms and data structures dominate almost all the space/time tradeoff.
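For reference, the query being supported can be stated as a naive Python baseline that scans every document. This is only a specification of the top-k problem as described above, not one of the reduced-space structures studied in the paper. The function names and the tie-breaking rule (lower document index first) are assumptions.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs, pattern: str, k: int):
    """Return up to k (doc_index, count) pairs for the documents in which
    pattern occurs most frequently; documents without a match are skipped.
    Ties are broken in favour of the lower document index."""
    scored = []
    for i, doc in enumerate(docs):
        c = count_occurrences(doc, pattern)
        if c > 0:
            scored.append((i, c))
    return heapq.nlargest(k, scored, key=lambda ic: (ic[1], -ic[0]))
```

This baseline takes time proportional to the total length of the collection for every query, which is the cost that index-based solutions are designed to avoid.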
Sourav Dutta, Souvik Bhattacherjee, Ankur Narang
Balanced allocation of online balls-into-bins has long been an active area of research for efficient load balancing and hashing applications. There exists a large number of results in this domain for different settings, such as parallel allocations~\cite{parallel}, multi-dimensional allocations~\cite{multi}, weighted balls~\cite{weight} etc. For sequential multi-choice allocation, where $m$ balls are thrown into $n$ bins with each ball choosing $d$ (constant) bins independently and uniformly at random, the maximum load of a bin is $O(\log \log n) + m/n$ with high probability~\cite{heavily_load}. This offers the current best known allocation scheme. However, for $d = \Theta(\log n)$, the gap reduces to $O(1)$~\cite{soda08}. A similar constant gap bound has been established for parallel allocations with $O(\log^* n)$ communication rounds~\cite{lenzen}. In this paper we propose a novel multi-choice allocation algorithm, \emph{Improved D-choice with Estimated Average} ($IDEA$), achieving a constant gap with high probability for the sequential single-dimensional online allocation problem with constant $d$. We achieve a maximum load of $\lceil m/n \rceil$ with high probability for the constant $d$ choice scheme with an \emph{expected} constant number of retries or rounds per ball. We also show that the bound holds even for an arbitrarily large number of balls, $m \gg n$. Further, we generalize this result to (i)~the weighted case, where balls have weights drawn from an arbitrary weight distribution with finite variance, (ii)~the multi-dimensional setting, where balls have $D$ dimensions with $f$ randomly and uniformly chosen filled dimensions for $m=n$, and (iii)~the parallel case, where $n$ balls arrive and are placed in parallel in the bins. We show that the gap in these cases is also a constant w.h.p. (independent of $m$) for a constant value of $d$ with an expected constant number of retries per ball.
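The sequential multi-choice setting described above can be simulated with a short Python sketch. It implements the classical greedy $d$-choice rule (each ball goes to the least loaded of $d$ randomly chosen bins), not the IDEA algorithm proposed in the paper, whose retry mechanism is not specified in this abstract. The function name, the `seed` parameter, and the sampling of candidate bins with replacement are assumptions made for this illustration.

```python
import random


def greedy_d_choice(m: int, n: int, d: int, seed: int = 0):
    """Throw m balls into n bins one at a time. Each ball samples d bins
    independently and uniformly at random (with replacement) and is
    placed in the least loaded of them. Returns the list of bin loads."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(m):
        candidates = [rng.randrange(n) for _ in range(d)]
        target = min(candidates, key=lambda b: loads[b])
        loads[target] += 1
    return loads


def gap(loads):
    """Difference between the maximum load and the average load m/n."""
    return max(loads) - sum(loads) / len(loads)
```

Running `gap(greedy_d_choice(m, n, d))` for `d = 1` and `d = 2` with the same `m` and `n` gives an empirical impression of how the gap above the average load changes once each ball has more than one choice.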