Conceptual clustering is a discovery process that groups a set of data in such a way that intra-cluster similarity is maximized and inter-cluster similarity is minimized. Traditional clustering algorithms employ some measure of distance between data points in n-dimensional space. However, not all data types can be represented in a metric space, so no natural distance function is available for them. We address the problem of clustering sequences of categorical values. We present a measure of similarity for the sequences and an agglomerative hierarchical algorithm that uses frequent sequential patterns found in the database to efficiently generate the resulting clusters. The algorithm iteratively merges smaller, similar clusters into bigger ones until the requested number of clusters is reached.
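The merge loop described in this abstract can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not confirm: the frequent patterns are supplied up front rather than mined, cluster similarity is the Jaccard coefficient of the pattern sets the clusters support, and the names `contains`, `jaccard`, and `cluster` are hypothetical. It is not the authors' similarity measure or algorithm.

```python
from itertools import combinations

def contains(seq, pattern):
    """True if pattern occurs in seq as an order-preserving subsequence."""
    it = iter(seq)
    return all(item in it for item in pattern)

def jaccard(a, b):
    """Jaccard coefficient of two sets (assumed similarity, not the paper's)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(sequences, patterns, k):
    """Merge the most similar pair of clusters until only k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Each cluster is (member indices, indices of patterns its members support).
    clusters = [
        ([i], {j for j, p in enumerate(patterns) if contains(s, p)})
        for i, s in enumerate(sequences)
    ]
    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda ab: jaccard(clusters[ab[0]][1], clusters[ab[1]][1]),
        )
        merged = (clusters[a][0] + clusters[b][0], clusters[a][1] | clusters[b][1])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return [sorted(members) for members, _ in clusters]
```

For example, `cluster(["abc", "abd", "xyz", "xyw"], ["ab", "xy"], 2)` groups the first two and the last two sequences, because each pair supports the same pattern.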
With the vastly growing data resources on the Internet, XML is one of the most important standards for document management. Not only does it provide enhancements to document exchange and storage, but it is also helpful in a variety of information retrieval tasks. Document clustering is one of the most interesting research areas that utilize XML's semi-structured nature. In this paper, we put forward a new XML clustering algorithm that relies solely on document structure. We propose the use of maximal frequent subtrees and an operator called Satisfy/Violate to divide documents into groups. The algorithm is experimentally evaluated on real and synthetic data sets with promising results.
Notes on Numerical Fluid Mechanics and Multidisciplinary Design, 2002
"The Times They Are A-Changing" (B. Dylan), and with them the structures, schemas, master data, etc. of data warehouses. For the correct treatment of such changes in OLAP queries, the orthogonality assumption of star schemas has to be abandoned. We propose the COMET model, which represents not only changes of transaction data, as is usual in data warehouses, but also changes of schema and structure data. The COMET model can then be used as the basis of OLAP tools that are aware of structural changes and permit correct query results spanning multiple periods and thus different versions of dimension data. In this paper we present the COMET metamodel in detail with all necessary integrity constraints and show how the intervals of structural stability can be computed for all components of a data warehouse.
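One way to picture an interval of structural stability is as a stretch of time in which no component version starts or ends. The Python sketch below works under that reading and is not taken from the COMET paper: validity periods are assumed to be half-open `(start, end)` pairs of comparable timestamps, and the function name `stability_intervals` is hypothetical.

```python
def stability_intervals(validity):
    """Split time into intervals in which no validity period starts or ends.

    validity: iterable of half-open (start, end) validity periods.
    Gaps in which no component version is valid are left out.
    """
    validity = list(validity)
    # Every start or end of a validity period is a potential structural change.
    points = sorted({t for start, end in validity for t in (start, end)})
    return [
        (lo, hi)
        for lo, hi in zip(points, points[1:])
        if any(start <= lo and hi <= end for start, end in validity)
    ]
```

With validity periods `(2000, 2005)`, `(2003, 2010)`, and `(2012, 2015)`, the sketch yields four intervals and skips the gap between 2010 and 2012, where nothing is valid.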
Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML's semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms that use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which tries to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework's distinguishing feature is the combination of pattern clustering and document-cluster assignment, which allows documents to be grouped according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match other XML clustering approaches in terms of accuracy, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.
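The pattern mining step can be illustrated with a small Python sketch that extracts maximal frequent paths from a handful of XML documents. It simplifies heavily and should not be read as PathXP itself: paths are assumed to be root-anchored tag sequences, support is counted once per document, a path is treated as maximal when no frequent path extends it, and the function names are hypothetical. Pattern clustering and document assignment are not shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def root_paths(xml_text):
    """Set of root-anchored tag paths (as tuples) occurring in one document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths

def maximal_frequent_paths(docs, min_support):
    """Paths found in at least min_support documents that no frequent path extends."""
    support = Counter(p for d in docs for p in root_paths(d))
    frequent = {p for p, n in support.items() if n >= min_support}
    return sorted(
        p for p in frequent
        if not any(len(q) > len(p) and q[:len(p)] == p for q in frequent)
    )
```

Given the documents `<a><b><c/></b></a>`, `<a><b><c/></b><d/></a>`, and `<a><d/></a>` with a minimum support of 2, the maximal frequent paths are `a/b/c` and `a/d`.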
The number of patterns discovered by data mining can become tremendous, in some cases exceeding the size of the original database. Therefore, there is a need for querying previously generated mining results and for querying the database against discovered patterns. In this paper, we focus on developing methods for the storage and querying of large collections of sequential patterns. We describe a family of algorithms that address the problem of considering the ordering among elements, which is crucial when dealing with sequential patterns. Moreover, we take into account the fact that the distribution of elements within sequential patterns is highly skewed to propose a novel approach for the effective encoding of patterns. Experimental results, which examine a variety of factors, illustrate the efficiency of the proposed method.
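To show why element ordering matters when querying stored sequential patterns, here is a naive baseline in Python: a linear scan that checks order-preserving containment. It is not the encoding proposed in the paper. Patterns and sequences are assumed to be lists of itemsets, and `is_contained` and `query` are hypothetical names.

```python
def is_contained(pattern, sequence):
    """True if each itemset of pattern is a subset of a distinct element of
    sequence, with the matched elements appearing in the same order."""
    pos = 0
    for element in pattern:
        # Greedily advance to the earliest remaining element that covers it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def query(patterns, sequence):
    """Return the stored patterns contained in the given sequence."""
    return [p for p in patterns if is_contained(p, sequence)]
```

For the sequence `[{"a"}, {"b", "c"}, {"d"}]`, the pattern `[{"a"}, {"d"}]` is contained, while `[{"d"}, {"a"}]` is not, even though both use the same items.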
One of the important research and technological problems in data warehouse query optimization concerns star queries. So far, most of the research has focused on optimizing such queries by means of join indexes, bitmap join indexes, or various multidimensional indexes. These structures neither support navigation along dimension hierarchies well nor optimize joins with the Time dimension, which in practice is used in most star queries. In this paper we propose an index, called Time-HOBI, for optimizing star queries that compute aggregates along dimension hierarchies. Time-HOBI, created on a dimension hierarchy, is composed of (1) a Hierarchically Organized Bitmap Index (HOBI), where one bitmap index is maintained for each dimension level, and (2) a Time Index (TI) that implicitly encodes time in every dimension. HOBI allows a quick search for fact rows satisfying predicates defined on different levels of dimension hierarchies. With the support of TI, joining a fact table with the Time dimension is avoided. Thus, Time-HOBI supports a broad class of star queries. We explain how query execution plans for star queries can profit from Time-HOBI. We show, based on experiments, the efficiency of Time-HOBI for different classes of queries, as compared to HOBI and a traditional bitmap index. Based on the experiments, we also demonstrate how sensitive Time-HOBI is to variable selectivity of queries, and we analyze the maintenance time of Time-HOBI as compared to HOBI and a traditional bitmap index. The experiments were conducted on a real dataset coming from Allegro.pl, the biggest East-European Internet auction platform. They show that Time-HOBI can be successfully applied to the optimization of star queries, as it offers promising performance improvement.
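The idea of keeping one bitmap index per dimension level can be sketched in Python for a two-level hierarchy. This toy is built on assumptions, not on the paper's design: Python integers stand in for bitmaps, an upper-level bitmap is the OR of its children's bitmaps, and `build_hobi` and `rows` are hypothetical names. The Time Index is omitted because the abstract does not say how it is encoded.

```python
from collections import defaultdict

def build_hobi(fact_keys, parent_of):
    """Build bitmaps for two levels of one dimension hierarchy.

    fact_keys[i] is the lowest-level dimension key of fact row i;
    parent_of maps each lowest-level key to its parent one level up.
    """
    leaf = defaultdict(int)
    for row, key in enumerate(fact_keys):
        leaf[key] |= 1 << row            # set the bit of fact row `row`
    upper = defaultdict(int)
    for key, bitmap in leaf.items():
        upper[parent_of[key]] |= bitmap  # parent bitmap = OR of child bitmaps
    return dict(leaf), dict(upper)

def rows(bitmap):
    """Fact row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]
```

A predicate on the upper level, such as category = "electronics", can then be answered from a single bitmap lookup without touching the lower-level bitmaps.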
Papers by Tadeusz Morzy