Laxmi Choudhary, Bhawani Shankar Burdak
As use of the web grows day by day, web users easily get lost in the web's
rich hyperlink structure. The main aim of a website owner is to give users
the relevant information according to their needs. We explain how Web mining
is used to categorize users and pages by analyzing user behavior and page
content, and then describe Web Structure mining. This paper surveys different
page ranking algorithms and compares those algorithms used for Information
Retrieval. Different PageRank-based algorithms such as PageRank (PR), WPR
(Weighted PageRank), HITS (Hyperlink-Induced Topic Search), Distance Rank and
EigenRumor are discussed and compared. A simulation interface has been
designed for the PageRank and Weighted PageRank algorithms, but PageRank is
the only ranking algorithm on which the Google search engine works.
Authors' comments: Keywords: Page Rank, Web Mining, Web Structured Mining, Web Content
Mining
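As an illustration of the basic PageRank idea discussed in this abstract, here is a minimal power-iteration sketch. It is not the authors' simulation interface; the damping factor of 0.85, the iteration count, the handling of pages without out-links, and the toy graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank sketch.

    links: dict mapping each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page shares its current rank equally among its out-links.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web: D links out but nobody links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph page C collects the most rank because three pages link to it, while D, which has no in-links, keeps only the baseline share. Weighted PageRank differs by splitting a page's rank unevenly among its out-links; that variant is not shown here.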
Udayan Khurana, Amol Deshpande
We address the problem of managing historical data for large evolving information networks like social networks or citation networks, with the goal of enabling temporal and evolutionary queries and analysis. We present the design and architecture of a distributed graph database system that stores the entire history of a network and provides support for efficient retrieval of multiple graphs from arbitrary time points in the past, in addition to maintaining the current state for ongoing updates. Our system exposes a general programmatic API to process and analyze the retrieved snapshots. We introduce DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compactly recording the historical information, and that supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right parameters for a specific scenario. In addition, we present strategies for materializing portions of the historical graph state in memory to further speed up the retrieval process. We also present an in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner. We present a comprehensive experimental evaluation that illustrates the effectiveness of our proposed techniques at managing historical graph information.
Nieves R. Brisaboa, Ana Cerdeira-Pena, Gonzalo Navarro, Oscar Pedreira
Ranked document retrieval is a fundamental task in search engines. Such
queries are solved with inverted indexes that require an additional 45%-80% of the
compressed text space, and take tens to hundreds of microseconds per query. In
this paper we show how ranked document retrieval queries can be solved within
tens of milliseconds using essentially no extra space over an in-memory
compressed representation of the document collection. More precisely, we
enhance wavelet trees on bytecodes (WTBCs), a data structure that rearranges
the bytes of the compressed collection, so that they support ranked conjunctive
and disjunctive queries, using just 6%-18% of the compressed text space.
Authors' comments: This is an extended version of the paper that will appear in Proc. of
SPIRE'2012
Rahul Shah, Cheng Sheng, Sharma V. Thankachan, Jeffrey Scott Vitter
Let ${\cal{D}}$ = $\{d_1, d_2, d_3, ..., d_D\}$ be a given set of $D$
(string) documents of total length $n$. The top-$k$ document retrieval problem
is to index $\cal{D}$ such that when a pattern $P$ of length $p$, and a
parameter $k$ come as a query, the index returns the $k$ most relevant
documents to the pattern $P$. Hon et al. \cite{HSV09} gave the first linear
space framework to solve this problem in $O(p + k\log k)$ time. This was
improved by Navarro and Nekrich \cite{NN12} to $O(p + k)$. These results are
powerful enough to support arbitrary relevance functions like frequency,
proximity, PageRank, etc. In many applications like desktop or email search,
the data resides on disk and hence disk-bound indexes are needed. Despite
continued progress on this problem in terms of theoretical, practical and
compression aspects, non-trivial bounds in the external memory model have so
far been elusive. Internal-memory (or RAM) solutions to this problem decompose
it into $O(p)$ subproblems and thus incur an additive factor of
$O(p)$. In external memory, these approaches lead to $O(p)$ I/Os instead
of the optimal $O(p/B)$ I/O term, where $B$ is the block size. We re-interpret the
problem independent of $p$, as interval stabbing with priority over tree-shaped
structure. This leads us to a linear space index in external memory supporting
top-$k$ queries (with unsorted outputs) in near-optimal $O(p/B + \log_B n +
\log^{(h)} n + k/B)$ I/Os for any constant $h$, where $\log^{(1)} n = \log n$ and
$\log^{(h)} n = \log (\log^{(h-1)} n)$. We then obtain an $O(n\log^* n)$-space index
with optimal $O(p/B+\log_B n + k/B)$ I/Os.
Authors' comments: 3 figures
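The bounds in this abstract use the iterated logarithm $\log^{(h)} n$ and the function $\log^* n$. The following sketch shows how slowly both grow. It assumes base-2 logarithms, which the abstract does not specify; the function names are illustrative.

```python
import math


def iterated_log(n, h):
    """log^(h) n: apply the logarithm h times (base 2 is an assumption).

    Raises ValueError if an intermediate value becomes non-positive.
    """
    value = float(n)
    for _ in range(h):
        value = math.log2(value)
    return value


def log_star(n):
    """log* n: how many times log must be applied until the value is <= 1."""
    count = 0
    value = float(n)
    while value > 1.0:
        value = math.log2(value)
        count += 1
    return count


print(iterated_log(65536, 1))  # one application of log2
print(iterated_log(65536, 2))  # two applications
print(log_star(65536))         # applications needed to reach 1
```

For any constant $h$, the $\log^{(h)} n$ term is tiny compared with $\log_B n$ for realistic input sizes, which is why the first bound is described as near-optimal.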
Albert Fannjiang, Wenjing Liao
This paper presents a detailed numerical study of the performance of the standard phasing algorithms with random phase illumination (RPI). Phasing with high resolution RPI and the oversampling ratio $\sigma=4$ determines a unique phasing solution up to a global phase factor. Under this condition, the standard phasing algorithms converge rapidly to the true solution without stagnation. Excellent approximation is achieved after a small number of iterations, not just with high resolution but also with low resolution RPI, in the presence of additive as well as multiplicative noise. It is shown that RPI with $\sigma=2$ is sufficient for phasing complex-valued images under a sector condition and $\sigma=1$ for phasing nonnegative images. The Error Reduction algorithm with RPI is proved to converge to the true solution under proper conditions.
Hassania Ouchetto, Ouail Ouchetto, Ounsa Roudies
Semantic e-government is a new application field accompanying the development of the semantic web, in which ontologies have become a fertile field of investigation. This is due firstly to both the complexity and the size of e-government systems and secondly to the importance of the issues. However, providing easy and personalized access to e-government services has become, at this juncture, an arduous and far from spontaneous process. Indeed, the e-gov services provided to the user represent a critical contact point between administrations and users. The problems encountered in the e-gov service retrieval process are: the absence of an integrated one-stop government, the difficulty of locating the services' sources, the lack of mastery of search terms and the lack of multilingual support in the online services. In order to solve these problems, to facilitate access to e-gov services and to satisfy the needs of potential users, we propose an original approach to this issue. This approach incorporates a semantic layer as a crucial element in the retrieval process. It consists in implementing a personalized search system that integrates an ontology of the e-gov domain into this process.
Roman Zapatrin
Document ranking based on probabilistic evaluations of relevance is known to
exhibit non-classical correlations, which may be explained by admitting a
complex structure of the event space, namely, by assuming the events to emerge
from multiple sample spaces. The structure of an event space formed by
overlapping sample spaces is known in quantum mechanics; such structures may
exhibit some counter-intuitive features, called quantum contextuality. In this Note I
observe that from the structural point of view quantum contextuality looks
similar to personalization of information retrieval scenarios. Along these
lines, Knowledge Revision is treated as operationalistic measurement and a way
to quantify the rate of personalization of Information Retrieval scenarios is
suggested.
Authors' comments: 11 pages
Weimao Ke
We proposed a Least Information theory (LIT) to quantify the meaning of
information in probability distribution changes, from which a new information
retrieval model was developed. We observed several important characteristics of
the proposed theory and derived two quantities in the IR context for document
representation. Given probability distributions in a collection as prior
knowledge, LI Binary (LIB) quantifies least information due to the binary
occurrence of a term in a document whereas LI Frequency (LIF) measures least
information based on the probability of drawing a term from a bag of words.
Three fusion methods were also developed to combine LIB and LIF quantities for
term weighting and document ranking. Experiments on four benchmark TREC
collections for ad hoc retrieval showed that LIT-based methods demonstrated
very strong performance compared to classic TF*IDF and BM25, especially for
verbose queries and hard search topics. The least information theory offers a
new approach to measuring semantic quantities of information and provides
valuable insight into the development of new IR models.
Authors' comments: 10 pages, 3 figures
R. K. Roul, S. K. Sahay
A search engine returns thousands of web pages for a single user query, most
of which are not relevant. In this context, effective information
retrieval from the expanding web is a challenging task, in particular if the
query is ambiguous. The major question that arises here is how to get the
relevant pages for an ambiguous query. We propose an approach that improves the
results for an ambiguous query by forming community vectors based on the
association concept of data mining, using the vector space model and the
freedictionary. We develop clusters by computing the similarity between
community vectors and document vectors formed from the web pages extracted by
the search engine. We use the Gensim package to implement the algorithm because
of its simplicity and robust nature. Analysis shows that our approach is an
effective way to form clusters for an ambiguous query.
Authors' comments: 11 Pages, 1 figure
Youssef Bassil
The Big Bang of the Internet in the early 90's dramatically increased the
number of images being distributed and shared over the web. As a result, image
information retrieval systems were developed to index and retrieve image files
spread over the Internet. Most of these systems are keyword-based: they search
for images based on their textual metadata, and are thus imprecise, since
describing an image in human language is inherently vague. There also exist
content-based image retrieval systems, which search for images based on their
visual information. However, content-based systems are still immature and
not very effective, as they suffer from low retrieval recall and precision.
This paper proposes a new hybrid image information retrieval model for indexing
and retrieving web images published in HTML documents. The distinguishing mark
of the proposed model is that it is based on both graphical content and textual
metadata. The graphical content is denoted by color features and color
histogram of the image; while textual metadata are denoted by the terms that
surround the image in the HTML document, more particularly, the terms that
appear in the tags p, h1, and h2, in addition to the terms that appear in the
image's alt attribute, filename, and class-label. Moreover, this paper presents
a new term weighting scheme called VTF-IDF, short for Variable Term
Frequency-Inverse Document Frequency, which, unlike traditional schemes,
exploits the HTML tag structure and assigns an extra bonus weight to terms
that appear within certain particular HTML tags that are correlated to the
semantics of the image. Experiments conducted to evaluate the proposed IR model
showed a high retrieval precision rate that outpaced other current models.
Authors' comments: LACSC - Lebanese Association for Computational Sciences,
http://www.lacsc.org/; International Journal of Computer Science & Emerging
Technologies (IJCSET), Vol. 3, No. 1, February 2012
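The idea of giving a bonus weight to terms that occur in particular HTML contexts can be sketched as a tag-weighted variant of TF-IDF. This is only an illustration of the general idea, not the paper's VTF-IDF formula: the bonus values in `TAG_BONUS`, the function names, and the sample data are all hypothetical, since the abstract does not give the actual weights.

```python
import math
from collections import Counter

# Hypothetical bonus per HTML context; the abstract does not state the real values.
TAG_BONUS = {"alt": 3.0, "h1": 2.0, "h2": 1.5, "p": 1.0}


def weighted_tf(tagged_terms):
    """Sum a tag-dependent bonus for every (tag, term) occurrence."""
    tf = Counter()
    for tag, term in tagged_terms:
        tf[term] += TAG_BONUS.get(tag, 1.0)
    return tf


def tag_weighted_tfidf(docs):
    """docs: dict image_id -> list of (tag, term) pairs found around the image."""
    n_docs = len(docs)
    tfs = {doc_id: weighted_tf(terms) for doc_id, terms in docs.items()}
    # Document frequency: in how many documents each term appears at all.
    df = Counter()
    for tf in tfs.values():
        df.update(tf.keys())
    return {
        doc_id: {term: w * math.log(n_docs / df[term]) for term, w in tf.items()}
        for doc_id, tf in tfs.items()
    }


# Hypothetical terms extracted around two images.
docs = {
    "img1": [("alt", "sunset"), ("p", "beach"), ("p", "photo")],
    "img2": [("h1", "mountain"), ("p", "photo")],
}
weights = tag_weighted_tfidf(docs)
print(weights["img1"])
```

Here "sunset" outweighs "beach" for `img1` only because it appears in the alt attribute, and "photo" gets zero weight because it occurs in every document.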
Eliyahu Osherovich, Michael Zibulevsky, Irad Yavneh
We present a new method for real- and complex-valued image reconstruction from two intensity measurements made in the Fourier plane: the Fourier magnitude of the unknown image, and the intensity of the interference pattern arising from superimposition of the original signal with a reference beam. This approach can provide significant advantages in digital holography since it poses less stringent requirements on the reference beam. In particular, it does not require spatial separation between the sought signal and the reference beam. Moreover, the reference beam need not be known precisely, and in fact, may contain severe errors, without leading to a deterioration in the reconstruction quality. Numerical simulations are presented to demonstrate the speed and quality of reconstruction.
Muhammad Fahad Khan, Saira Beg
This paper presents a way of transferring stereo images using SMS over a GSM
network. Generally, a stereo image is composed of two stereoscopic images
arranged in such a way that they give a three-dimensional effect when viewed.
GSM has two messaging services that can transfer images, sounds, etc. These
services are known as MMS (Multimedia Messaging Service) and EMS (Enhanced
Messaging Service). EMS can send predefined sounds, animations and images, but
has the limitation that it is not widely supported. MMS can send much larger
content than EMS but needs 3G and other network capabilities in order to send
large data, up to 1000 bytes. Other limitations are portability, content
adaptation, etc. Our major aim in this paper is to provide an alternative way
of sending stereo images over SMS, which is more widely supported than EMS. We
develop an application using the J2ME platform.
Authors' comments: 3 pages, 3 figures, Journal
Awny Sayed
The continuous growth of XML information repositories has been matched by
increasing efforts in the development of XML retrieval systems, in large part
aimed at supporting content-oriented XML retrieval. These systems exploit the
available structural information, as marked up in XML documents, in order to
return document components, the so-called XML elements, instead of
complete documents in response to the user query. In this paper, we provide an
overview of the different XML information retrieval systems and classify them
according to their storage and query evaluation strategies.
Authors' comments: 10 pages, 25 references
Abdelghni Lakehal, Omar El Beqqali
Recent technological progress in the acquisition, modeling and processing of
3D data has led to the proliferation of a large number of 3D object databases.
Consequently, techniques for content-based 3D retrieval have become
necessary. In this paper, we introduce a new method for 3D object recognition
and retrieval using a set of binary images called CLI (characteristic level
images). We propose a 3D indexing and search approach based on the similarity
between characteristic level images, using Hu moments to index them. To
measure the similarity between 3D objects we compute the Hausdorff distance
between their descriptor vectors. The performance of this new approach is
evaluated on a set of 3D objects from a well-known database, the NTU (National
Taiwan University) database.
Authors' comments: 10 pages, 5 figures, publication paper
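The Hausdorff distance mentioned in this abstract can be sketched as follows. This is a generic implementation between two finite sets of vectors under the Euclidean metric; how the paper builds its descriptor vectors from Hu moments is not shown, and the sample points are made up for the example.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point of a to its nearest point of b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical 2D descriptor sets for two objects.
object_a = [(0.0, 0.0), (1.0, 0.0)]
object_b = [(0.0, 0.0), (4.0, 0.0)]
print(hausdorff(object_a, object_b))
```

The two directed distances differ here (1.0 from `object_a` to `object_b`, 3.0 the other way), which is why the symmetric version takes the maximum of both.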
Fidelia Ibekwe-Sanjuan, Fernandez Silvia, Sanjuan Eric, Charton Eric
We present a methodology combining surface NLP and Machine Learning
techniques for ranking abstracts and generating summaries based on annotated
corpora. The corpora were annotated with meta-semantic tags indicating the
category of information a sentence is bearing (objective, findings, newthing,
hypothesis, conclusion, future work, related work). The annotated corpus is fed
into an automatic summarizer for query-oriented abstract ranking and multi-
abstract summarization. To adapt the summarizer to these two tasks, two novel
weighting functions were devised in order to take into account the distribution
of the tags in the corpus. Results, although still preliminary, are encouraging
us to pursue this line of work and find better ways of building IR systems that
can take into account semantic annotations in a corpus.
Authors' comments: ECIR'08 Workshop on: Exploiting Semantic Annotations for Information
Retrieval, Glasgow : United Kingdom (2008)
Yin Ye, Yunshan Cao, Xin-Qi Li, Shmuel Gurvitz
We found that, in contrast with the common premise, a measurement on the
environment of an open quantum system can {\em reduce} its decoherence rate. We
demonstrate this by studying an example of indirect measurement of a qubit, where
the information on its state is hidden in the environment. This information is
extracted by a distant device coupled to the environment. We also show that
the reduction of decoherence generated by this device is accompanied by a
diminution of the environmental noise in the vicinity of the qubit. An
interpretation of these results in terms of quantum interference on large
scales is presented.
Authors' comments: 9 pages, 8 figures, additional explanations added, Phys. Rev. B, in
press
E. Di Sciascio, F. M. Donini, M. Mongiello
We propose a structured approach to the problem of retrieval of images by content and present a description logic that has been devised for the semantic indexing and retrieval of images containing complex objects. As other approaches do, we start from low-level features extracted with image analysis to detect and characterize regions in an image. However, in contrast with feature-based approaches, we provide a syntax to describe segmented regions as basic objects and complex objects as compositions of basic ones. Then we introduce a companion extensional semantics for defining reasoning services, such as retrieval, classification, and subsumption. These services can be used for both exact and approximate matching, using similarity measures. Using our logical approach as a formal specification, we implemented a complete client-server image retrieval system, which allows a user to pose both queries by sketch and queries by example. A set of experiments has been carried out on a testbed of images to assess the retrieval capabilities of the system in comparison with expert users' rankings. Results are presented adopting a well-established measure of quality borrowed from textual information retrieval.
C. Yang, J. Qian, A. Schirotzek, F. Maia, S. Marchesini
Ptychography promises diffraction-limited resolution without the need for
high-resolution lenses. To achieve high resolution one has to solve the phase
problem for many partially overlapping frames. Here we review some of the
existing methods for solving the ptychographic phase retrieval problem from a
numerical analysis point of view, and propose alternative methods based on
numerical optimization.
Authors' comments: 32 pages, 15 figures
Ye Ji
This report presents the results and details of a content-based image retrieval project using the Top-surf descriptor. The experimental results are preliminary; however, they show the capability of deducing objects from parts of the objects or from similar objects. This paper uses a dataset consisting of 1200 images, of which 800 images are equally divided into 8 categories, namely airplane, beach, motorbike, forest, elephants, horses, bus and building, while the other 400 images are randomly picked from the Internet. The best results are achieved for the building category.
Heiko Hellweg, Jürgen Krause, Thomas Mandl, Jutta Marx, Matthias N. O. Müller, Peter Mutschke, Robert Strötgen
The first step to handle semantic heterogeneity should be the attempt to
enrich the semantic information about documents, i.e. to fill the gaps in
the documents' meta-data automatically. Section 2 describes a set of cascading
deductive and heuristic extraction rules, which were developed in the project
CARMEN for the domain of Social Sciences. The mapping between different
terminologies can be done by using intellectual, statistical and/or neural
network transfer modules. Intellectual transfers use cross-concordances between
different classification schemes or thesauri. Section 3 describes the creation,
storage and handling of such transfers.
Authors' comments: Technical Report (Arbeitsbericht) GESIS - Leibniz Institute for the
Social Sciences