Philipp Mayr, Philipp Schaer, Peter Mutschke
This paper is about a better understanding of the structure and dynamics of
science and the usage of these insights for compensating the typical problems
that arise in metadata-driven Digital Libraries. Three science model driven
retrieval services are presented: co-word analysis based query expansion,
re-ranking via Bradfordizing and author centrality. The services are evaluated
with relevance assessments from which two important implications emerge: (1)
precision values of the retrieval services are the same as or better than those
of the tf-idf retrieval baseline and (2) each service retrieved a disjoint set
of documents. Each service favors documents that differ considerably from those
favored by pure term-frequency based rankings, yet are still relevant. The proposed models and
derived retrieval services therefore open up new viewpoints on the scientific
knowledge space and provide an alternative framework to structure scholarly
information systems.
Authors' comments: 8 pages, 4 figures, Cologne Conference on Interoperability and
Semantics in Knowledge Organization
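As a rough illustration of re-ranking via Bradfordizing, the sketch below assumes the common reading of the technique: documents in a result set are reordered so that those from the journals contributing the most documents to that result set come first. The function name `bradfordize` and the sample journals are hypothetical and are not taken from the paper.

```python
from collections import Counter

def bradfordize(results):
    """Re-rank (doc_id, journal) pairs so documents from the most productive journals come first.

    Assumption: "productivity" is the number of documents a journal contributes
    to this result set; the paper's exact procedure may differ.
    """
    journal_counts = Counter(journal for _, journal in results)
    # sorted() is stable, so documents whose journals have equal counts
    # keep their original relative order.
    return sorted(results, key=lambda doc: -journal_counts[doc[1]])

# Hypothetical result list in its original (e.g. tf-idf) order.
results = [
    ("d1", "J. Small"),
    ("d2", "J. Core"),
    ("d3", "J. Mid"),
    ("d4", "J. Core"),
    ("d5", "J. Core"),
    ("d6", "J. Mid"),
]
reranked = bradfordize(results)
print([doc_id for doc_id, _ in reranked])
```

The stable sort is a deliberate choice here: within one journal, the baseline ranking is preserved.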
Shiro Ikeda, Hidetoshi Kono
In this paper, we propose the SPR (sparse phase retrieval) method, which is a
new phase retrieval method for coherent x-ray diffraction imaging (CXDI).
Conventional phase retrieval methods effectively solve the problem for high
signal-to-noise ratio measurements, but would not be sufficient for imaging
single biomolecules, which is expected to be realized with femtosecond x-ray
free electron laser pulses. The SPR method is based on Bayesian statistics.
It does not need the object boundary constraint that is required by the
commonly used hybrid input-output (HIO) method; instead, a prior distribution is
defined with an exponential distribution and used for the estimation.
Simulation results demonstrate that the proposed method reconstructs the
electron density under noisy conditions even when some central pixels are masked.
Authors' comments: 13 pages, 13 figures, submitted to a journal
Ranjeet Devarakonda, Giri Palanisamy
In recent years, there has been significant advancement in scientific data management and retrieval techniques, especially in terms of standards and protocols for archiving data. The Oak Ridge National Laboratory Distributed Data Archive Center for biogeochemical dynamics is building advanced toolsets for these purposes. Mercury is a web-based metadata harvesting, data discovery and access system, built for researchers to search for, share and obtain biogeochemical data. Originally developed for a single National Aeronautics and Space Administration (NASA) project, Mercury is now used by more than fourteen projects across three US federal agencies. Mercury provides various capabilities including metadata management, indexing, searching, data sharing, and software reusability.
Samir AbdelRahman, Basma Hassan, Reem Bahgat
The email retrieval task has recently attracted much attention as a way to help the user retrieve the email(s) related to a submitted query. To our knowledge, existing email retrieval ranking approaches sort the retrieved emails based on heuristic rules, which are either search clues or predefined user criteria rooted in email fields. Unfortunately, the user usually does not know which rule yields the best ranking for his query. This paper presents a new email retrieval ranking approach to tackle this problem. It ranks the retrieved emails based on a scoring function that depends on crucial email fields, namely subject, content, and sender. The paper also proposes an architecture that allows every user in a network or group of users to find out, if permissible, the most important network senders who are interested in his submitted query words. The experimental evaluation on the Enron corpus shows that our approach outperforms known email retrieval ranking approaches.
Authors' comments: 20 pages
Ranjeet Devarakonda, Giri Palanisamy, Jim Green
Storing data is easy, but finding and using data is not. It is desirable that data be stored in a structured format that can be preserved and retrieved in the future. Creating metadata for the data is one way of creating structured data formats. Metadata can provide multidisciplinary data access and will foster more robust scientific discoveries. In recent years, there has been significant advancement in scientific data management and retrieval techniques, particularly in terms of standards and protocols for archiving data and metadata. New search technologies are being implemented around these protocols, which make searching easy, fast and yet robust. Scientific data is generally rich, not easy to understand, and spread across different places. To integrate these pieces, a data archive and associated metadata are generated. This data should be stored in a format that is locatable, retrievable and understandable; more importantly, it should be in a form, such as XML, that will continue to be accessible as technology changes.
Philipp Schaer, Philipp Mayr, Peter Mutschke
This paper is a short description of an information retrieval system enhanced
by three model driven retrieval services: (1) co-word analysis based query
expansion, re-ranking via (2) Bradfordizing and (3) author centrality. The
services each favor documents that differ considerably from those favored by
pure term-frequency based rankings, yet are still relevant. The services can be
interactively combined with one another to allow an iterative retrieval refinement.
Authors' comments: 2 pages, 1 figure, ASIST 2010 conference, Pittsburgh, PA, USA
Veit Elser, Stefan Eisebitt
Previous criteria for the feasibility of reconstructing phase information
from intensity measurements, both in x-ray crystallography and more recently in
coherent x-ray imaging, have been based on the Maxwell constraint counting
principle. We propose a new criterion, based on Shannon's mutual information,
that is better suited for noisy data or contrast that has strong priors not
well modeled by continuous variables. A natural application is magnetic domain
imaging, where the criterion for uniqueness in the reconstruction takes the
form that the number of photons, per pixel of contrast in the image, exceeds a
certain minimum. Detailed studies of a simple model show that the uniqueness
transition is of the type exhibited by spin glasses.
Authors' comments: 19 pages, 8 figures
Antti Ukkonen
We consider the evaluation of approximate top-k queries from relations with a-priori unknown values. Such relations can arise, for example, in the context of expensive predicates or cloud-based data sources. The task is to find an approximate top-k set that is close to the exact one while keeping the total processing cost low. The cost of a query is the sum of the costs of the entries that are read from the hidden relation. A novel aspect of this work is that we consider prior information about the values in the hidden matrix. We propose an algorithm that uses regression models at query time to assess whether a row of the matrix can enter the top-k set given that only a subset of its values is known. The regression models are trained with existing data that follows the same distribution as the relation subjected to the query. To evaluate the algorithm and to compare it with a method proposed previously in the literature, we conduct experiments using data from a context-sensitive Wikipedia search engine. The results indicate that the proposed method outperforms the baseline algorithms in terms of cost while maintaining a high accuracy of the returned results.
Alberto Costa, Fabio Roda
In this paper we present a method for reformulating the Recommender Systems problem as an Information Retrieval one. In our tests we have a dataset of users who give ratings for some movies; we hide some values from the dataset, and we try to predict them again using its remaining portion (the so-called "leave-n-out approach"). In order to use an Information Retrieval algorithm, we reformulate this Recommender Systems problem in this way: a user corresponds to a document, a movie corresponds to a term, the active user (whose rating we want to predict) plays the role of the query, and the ratings are used as weights, in place of the weighting schema of the original IR algorithm. The output is the ranking list of the documents ("users") relevant for the query ("active user"). We use the ratings of these users, weighted according to the rank, to predict the rating of the active user. We evaluate by means of a typical metric, namely the accuracy of the predictions returned by the algorithm compared with the real ratings from users. In our first tests, we use two different Information Retrieval algorithms: LSPR, a recently proposed model based on Discrete Fourier Transform, and a simple vector space model.
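The mapping described above (user as document, movie as term, rating as weight, active user as query) can be sketched with a simple vector space model. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity as the ranking function, the 1/rank weighting, the `top_n` cutoff, and the toy ratings are all hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {movie: rating} dicts."""
    shared = set(u) & set(v)
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_rating(ratings, active_user, movie, top_n=2):
    """Predict the active user's rating for `movie` from the top-ranked similar users."""
    # The active user's known ratings act as the query; the target movie is hidden.
    query = {m: r for m, r in ratings[active_user].items() if m != movie}
    # Every other user who rated the movie is a candidate "document".
    candidates = [
        (cosine(query, vec), user)
        for user, vec in ratings.items()
        if user != active_user and movie in vec
    ]
    ranked = sorted(candidates, reverse=True)[:top_n]
    if not ranked:
        return None
    # Assumed weighting scheme: the user at rank r contributes with weight 1/r.
    weights = [1.0 / rank for rank in range(1, len(ranked) + 1)]
    weighted = sum(w * ratings[user][movie] for w, (_, user) in zip(weights, ranked))
    return weighted / sum(weights)

# Hypothetical toy dataset: ann has not rated "dune".
ratings = {
    "ann": {"alien": 5, "brazil": 4, "casablanca": 1},
    "bob": {"alien": 5, "brazil": 5, "casablanca": 1, "dune": 5},
    "cat": {"alien": 1, "brazil": 2, "casablanca": 5, "dune": 2},
}
prediction = predict_rating(ratings, "ann", "dune")
print(prediction)
```

In this toy data bob ranks above cat for ann's query, so bob's rating of "dune" dominates the prediction.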
Md. Saiful Islam, Md. Haider Ali
Due to the rapid development of the World Wide Web (WWW) and imaging technology,
more and more images are available on the Internet and stored in databases.
Searching for images related to a query image is becoming tedious and
difficult. Most of the images on the web are compressed by methods based on the
discrete cosine transform (DCT), including Joint Photographic Experts
Group (JPEG) and H.261. This paper presents an efficient content-based image
indexing technique for searching similar images using discrete cosine transform
features. Experimental results demonstrate its superiority over existing
techniques.
Authors' comments: 9 pages, 4 figures, 4 tables
Simin Feng
We apply the equivalent theory to orthorhombic anisotropic materials and
provide a general unit-cell design criterion for achieving a length-independent
retrieval of the effective material parameters from a single layer of unit
cells. We introduce a graphical retrieval method and phase unwrapping
techniques. The graphical method utilizes the linear regression technique. Our
method can reduce the uncertainty of experimental measurements and the
ambiguity of phase unwrapping. Moreover, the graphical method can
simultaneously determine the bulk values of the six effective material
parameters, permittivity and permeability tensors, from a single layer of unit
cells.
Authors' comments: Accepted for publication in Optics Express
Marek Karpinski, Yakov Nekrich
In this paper we describe a new efficient (in fact optimal) data structure for the top-$K$ color problem. Each element of an array $A$ is assigned a color $c$ with priority $p(c)$. For a query range $[a,b]$ and a value $K$, we have to report $K$ colors with the highest priorities among all colors that occur in $A[a..b]$, sorted in reverse order by their priorities. We show that such queries can be answered in $O(K)$ time using an $O(N\log \sigma)$ bits data structure, where $N$ is the number of elements in the array and $\sigma$ is the number of colors. Thus our data structure is asymptotically optimal with respect to the worst-case query time and space. As an immediate application of our results, we obtain optimal time solutions for several document retrieval problems. The method of the paper could also be of independent interest.
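To make the query semantics concrete, the following is a naive baseline for the top-$K$ color problem. It is not the $O(K)$-time data structure from the paper: it scans the whole range on every query. The function name, the 0-indexed inclusive range, and the reading of "reverse order" as highest priority first are assumptions made for illustration.

```python
def top_k_colors_naive(A, priority, a, b, K):
    """Return up to K distinct colors occurring in A[a..b] (inclusive), highest priority first.

    Naive baseline: cost grows with the range length, unlike the paper's O(K) structure.
    """
    colors_in_range = set(A[a:b + 1])
    return sorted(colors_in_range, key=lambda c: priority[c], reverse=True)[:K]

# Hypothetical example data with distinct priorities.
A = ["red", "blue", "red", "green", "blue", "yellow"]
priority = {"red": 3, "blue": 1, "green": 4, "yellow": 2}

answer = top_k_colors_naive(A, priority, a=1, b=4, K=2)
print(answer)
```

A baseline of this kind is mainly useful for checking the output of a faster structure on small inputs.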
Sonal Chawla, R. K. Singla
In today's world, designing adaptable course material requires new technical
knowledge, which involves a need for a uniform protocol that allows organizing
resources with emphasis on quality and learning. This can be achieved by
bundling the resources in a known and prescribed fashion called Learning
Objects. Learning Objects are composed of two aspects, namely "Learning" and
"Object". The Learning aspect of Learning Objects refers to education. Since
education is a process, the primary aim of Learning Objects tends to be
facilitating acquisition, assessment and conversion of content into Learning
Objects while fostering the assimilation of these Learning Objects into
learning modules and instruction. The Object part of Learning Objects relates
to the digital electronic format of the resources, that is, it deals
with the physical resource that forms the Learning Objects. The objects in LOs
are analogous to objects used in object-oriented modeling (OOM). The analogy
helps visualize how LOs will be packaged, processed and transported across the
digital library as well as utilized in course building. OOM concepts such as
encapsulation, classification, polymorphism, inheritance and reuse can be
borrowed to describe the operations on LOs in the digital library. Thus, the
aim of this paper is threefold: first, to discuss the background of this
research and the concept of Learning Objects; second, to provide a framework
for an adaptive mechanism for the retrieval of Learning Objects; and third, to
highlight the benefits that the proposed framework will bring.
Authors' comments: Submitted to Journal of Telecommunications, see
http://sites.google.com/site/journaloftelecommunications/volume-2-issue-2-may-2010
Uday Pratap Singh, Sanjeev Jain, Gulfishan Firdose Ahmed
Digital image data is rapidly expanding in quantity and heterogeneity.
Traditional information retrieval techniques do not meet users' demands,
so there is a need to develop an efficient system for content-based image
retrieval. Content-based image retrieval means retrieving images from a
database on the basis of visual features of the image, such as color and texture.
In our proposed method, features are extracted after applying Phong shading to
the input image. Phong shading flattens out the dull surfaces of the image. The
features are extracted using color, texture and edge density methods. The
extracted feature values are used to find the similarity between the input query
image and the database images, which is measured by the Euclidean distance. The
experimental results show that the proposed approach achieves better retrieval
results with Phong shading.
Authors' comments: IEEE Publication format, International Journal of Computer Science
and Information Security, IJCSIS, Vol. 8 No. 1, April 2010, USA. ISSN
1947-5500, http://sites.google.com/site/ijcsis/
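The similarity step in the abstract above, comparing a query image's feature vector with database feature vectors by Euclidean distance, can be sketched as follows. The feature values and image names are hypothetical placeholders; the actual color, texture and edge density extraction (and the Phong shading step) is not shown.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def rank_by_similarity(query_features, database):
    """Return database image names ordered from most to least similar to the query."""
    return sorted(database, key=lambda name: euclidean(query_features, database[name]))

# Hypothetical 3-dimensional feature vectors (e.g. color, texture, edge density).
query_features = [1, 2, 2]
database = {
    "img_a": [1, 2, 2],
    "img_b": [4, 6, 2],
    "img_c": [1, 2, 4],
}
ranking = rank_by_similarity(query_features, database)
print(ranking)
```

In practice the feature dimensions would usually be normalized first so that no single feature dominates the distance; that step is omitted here.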
Abderrahim El Qadi, Driss Aboutajedine, Yassine Ennouary
In this paper we describe a mechanism to improve Information Retrieval (IR)
on the web. The method is based on Formal Concept Analysis (FCA), which
establishes semantic relations during the queries and allows the answers
provided by a search engine to be reorganized in the shape of a lattice of
concepts. We propose for IR an incremental algorithm based on the Galois
lattice. This algorithm allows a formal clustering of the data sources, and the
results it returns are ordered by relevance. The control of relevance is
exploited in clustering. We improve the results by using an ontology in the
field of image processing and by reformulating the user queries, which makes it
possible to retrieve more relevant documents.
Authors' comments: Pages IEEE format, International Journal of Computer Science and
Information Security, IJCSIS, Vol. 7 No. 2, February 2010, USA. ISSN
1947-5500, http://sites.google.com/site/ijcsis/
Kathrin Knautz, Simone Soubusta, Wolfgang G. Stock
The paper presents our design of a next-generation information retrieval system based on tag co-occurrences and subsequent clustering. We help users access digital data through information visualization in the form of tag clusters. Current problems, such as the absence of interactivity and semantics between tags or the difficulty of adding further search arguments, are solved. In the evaluation, based upon SERVQUAL and IT systems quality indicators, we found that tag clusters are perceived as more useful than tag clouds, as much more trustworthy, and as more enjoyable to use.
Patricio Galeas, Ralph Kretschmer, Bernd Freisleben
In addition to the frequency of terms in a document collection, the
distribution of terms plays an important role in determining the relevance of
documents. In this paper, a new approach for representing term positions in
documents is presented. The approach allows an efficient evaluation of
term-positional information at query evaluation time. Three applications are
investigated: a function-based ranking optimization representing a user-defined
document region, a query expansion technique based on overlapping the term
distributions in the top-ranked documents, and cluster analysis of terms in
documents. Experimental results demonstrate the effectiveness of the proposed
approach.
Authors' comments: 12 pages, submitted to proceedings of ECIR-2010
James Schombert
The first step in a science project is the acquisition and understanding of
the relevant data. This paper outlines the results of a project to design and
test network tools specifically oriented toward retrieving astronomical data. The
tools range from simple data transfer methods to more complex browser-emulating
scripts. When integrated with a defined sample or catalog, these scripts
provide seamless techniques to retrieve and store data of varying types.
Examples are given of how these tools can be used to leapfrog from website to
website to acquire multi-wavelength datasets. This project demonstrates the
capability to use multiple data websites, in conjunction, to perform the type
of calculations once reserved for on-site datasets.
Authors' comments: 10 pages, no figures, software at
http://abyss.uoregon.edu/~js/network
Daniel Sonntag, Romàn R. Zapatrin
We present a method to geometrize massive data sets from search engine query logs. For this purpose, a macrodynamic-like quantitative model of the Information Retrieval (IR) process is developed, whose paradigm is inspired by basic constructions of Einstein's general relativity theory in which all IR objects are uniformly placed in a common Room. The Room has a structure similar to Einsteinian spacetime, namely that of a smooth manifold. Documents and queries are treated as matter objects and sources of material fields. Relevance, the central notion of IR, becomes a dynamical issue controlled by both gravitation (or, more precisely, motion in a curved spacetime) and forces originating from the interactions of matter fields. The spatio-temporal description ascribes dynamics to any document or query, thus providing a uniform description for documents of both initially static and dynamical nature. Within the IR context, the techniques presented are based on two ideas. The first is the placement of all objects participating in IR into a common continuous space. The second idea is the `objectivization' of the IR process; instead of expressing users' wishes, we consider the overall IR as an objective physical process, representing the IR process in terms of motion in a given external-fields configuration. Various semantic environments are treated as various IR universes.