Investigating Binary String Encoding for Compact Representation of XML Documents
2017, Computer Science & Information Technology (CS & IT)
https://doi.org/10.5121/CSIT.2017.70402…
8 pages
Sign up for access to the world's latest research
Abstract
Since Extensible Markup Language abbreviated as XML, became an official World Wide Web Consortium recommendation in 1998, XML has emerged as the predominant mechanism for data storage and exchange, in particular over the World Web. Due to the flexibility and the easy use of XML, it is nowadays widely used in a vast number of application areas and new information is increasingly being encoded as XML documents. Because of the widespread use of XML and the large amounts of data that are represented in XML, it is therefore important to provide a repository for XML documents, which supports efficient management and storage of XML data. Since the logical structure of an XML document is an ordered tree consisting of tree nodes, establishing a relationship between nodes is essential for processing the structural part of the queries. Therefore, tree navigation is essential to answer XML queries. For this purpose, many proposals have been made, the most common ones are node labeling schemes. On the other hand, XML repeatedly uses tags to describe the data itself. This self-describing nature of XML makes it verbose with the result that the storage requirements of XML are often expanded and can be excessive. In addition, the increased size leads to increased costs for data manipulation. Therefore, it also seems natural to use compression techniques to increase the efficiency of storing and querying XML data. In our previous works, we aimed at combining the advantages of both areas (labeling and compaction technologies), Specially, we took advantage of XML structural peculiarities for attempting to reduce storage space requirements and to improve the efficiency of XML query processing using labeling schemes. In this paper, we continue our investigations on variations of binary string encoding forms to decrease the label size. Also We report the experimental results to examine the impact of binary string encoding on reducing the storage size needed to store the compacted XML documents.
Related papers
2008
With the rapidly increasing popularity of XML as a data format, there is a large demand for efficient techniques in storing and querying XML documents. However XML is by nature verbose, due to repeatedly used tags that describe data. For this reason the storage requirements of XML can be excessive and lead to increased costs for data manipulation. Therefore, it seems natural to use compression techniques to increase the efficiency of storing and querying XML data. In this paper, we propose a new approach called SCQX for Storing, Compressing and Querying XML documents. This approach compresses the structure of an XML document based on exploiting repetitive consecutive tags in the structure, and then SCQX stores the compressed XML structure and the data separately in a robust storage structure that includes a set of access support structures to guarantee fast query performance. Moreover, SCQX supports querying of the compressed XML structure directly and efficiently without requiring decompression. An experimental evaluation on sets of XML data shows the effectiveness of our approach.
2007 2nd International Conference on Digital Information Management, 2007
In recent years, the method of assigning labels to the nodes of an XML tree is getting more attraction. Various functions in an RDBMS can be easily utilized by storing the labeled XML documents into the RDB. However, in traditional labeling methods, a number of nodes need to be relabeled, when the XML documents are updated. To address this problem, we proposed DO-VLEI code combining VLEI code with the Dewey Order method. DO-VLEI code is effective to reduce the update cost, but the label size increases rapidly when handling large XML documents. To reduce the label size, we presented Compressed-bit-string DO-VLEI (C-DO-VLEI) code. However, it is difficult to handle the length of C-DO-VLEI because it is a variable-length code. In this paper, we propose two effective methods, VLEI-ABL and VLEI-EOL for handling the code length of C-DO-VLEI. We perform experiments to compare the storage consumption of the proposed methods with the previously known OR-DPATH. The experimental results show that our methods considerably outperform the ORDPATH.
2012
Extensible Markup Language (XML) is proposed as a standardized data format designed for specifying and exchanging data on the Web. With the proliferation of mobile devices, such as palmtop computers, as a means of communication in recent years, it is reasonable to expect that in the foreseeable future, a massive amount of XML data will be generated and exchanged between applications in order to perform dynamic computations over the Web. However, XML is by nature verbose, since terseness in XML markup is not considered a pressing issue from the design perspective. In practice, XML documents are usually large in size as they often contain much redundant data. The size problem hinders the adoption of XML, since it substantially increases the costs of data processing, data storage, and data exchanges over the Web. As the common generic text compressors, such as Gzip, Bzip2, WinZip, PKZIP, or MPEG-7 (BiM), are not able to produce usable XML compressed data, many XML specific compression technologies have been recently proposed. The essential idea of these technologies is that, by utilizing the exposed structure information in the input XML document during the compression process, they pursue two important goals at the same time. First, they aim at achieving a good compression ratio and time compared to the generic text compressors. Second, they aim at generating a compressed XML document that is able to support efficient evaluation of queries over the data. This paper discuses survey of some of the Adaptive Compression Techniques for XML namely Xmill ,Xpress ,Xgrind.
2007
In order to efficiently determine structural relationships among XML elements and to avoid re-labeling for updates, much research about labeling schemes has been conducted, recently. However, a harmonic support of efficient query processing and updating has not been achieved. In this paper, we propose an efficient XML encoding and labeling scheme, called EXEL, which is a variant of the region numbering scheme using bit strings. In order to generate the ordinal and insert-friendly bit strings in EXEL, a novel binary encoding method is devised. Also, we devise a labeling scheme for a newly inserted node which incurs no re-labeling of pre-existing labels. These encoding and inserting methods are the bases of efficient query processing and the complete avoidance of re-labeling for updates. Moreover, EXEL supports all structural relationships in XPath and the relationships can be checked by SQL statements supported by an RDBMS. Finally, the experimental results show that EXEL provides fairly reasonable query processing performance while completely avoiding re-labeling for updates.
Journal of Software, 2011
Recently, the researchers have proposed a number of labeling schemes. In these labeling schemes, the approach which can extract structural information between nodes and process query efficiently is more outstanding. However, most of these labeling schemes do not well support update operations. To achieve update-friendly operations, some of the methods keep intervals between labeling numbers, but it requires whole relabeling when the intervals are used up. Several labeling schemes support dynamic XML documents, but most of these labeling schemes allow only leaf node insertions. OrdPathX supports both leaf node insertions and internal node insertions. Inspired by the method of inserting internal nodes of OrdPathX and extending the C-DO-VLEI code, in this paper we propose two dimensions VLEI code. We discuss how this labeling scheme labels nodes and how we can get the structural information of nodes from their labels. We design experiments to evaluate the efficiency of producing labels, the storage consumption and the querying performance of two dimensions VLEI code we proposed, and compare those with the OrdPathX.
In Last few years XML has became standard of data interchange in most of the applications over the web. Most of the Enterprise level application, Middleware software adopted XML as their standard medium for data interchange. So in these days of Internet and web technology popularity of XML is huge. The power of XML lies in its self describing abilities. This same ability makes XML verbose thus introducing significant amount of redundancy that adds no particular value. Increased size of XML documents create burden on application and network bandwidth. So there is a clear requirement for good and effective data compression algorithm for XML. Traditional ZIP utility like GZIP compressed the XML data in a non readable format means we can't query those data until it's decompressed. So in this paper I had developed an efficient XML compression utility named as XLARGE. XLARGE has shown a significant amount of improvement in compression ratio over GZip.
Lecture Notes in Computer Science, 2009
Lossy compression techniques have been applied to image and text compression, yielding compression factors that are vastly superior to lossless compression schemes. In this paper, we present a preliminary study on a set of lossy transformations for XML documents that preserve the semantics. Inspired by previous techniques, e.g. lossy text compression and literate programming, we apply a simple algorithm to XML syntactic constructs to loose superfluous layout information and redundant text. The obtained XML keeps the human-readability and machine-readability properties. Additionally, it can lead to a considerable reduction of its space occupancy and boost the application of conventional text compressors, thus representing a promising technology for several data management tasks.
Background: Bioinformatics data are the description of all the chemical interactions or the protein structure to form the gene in the living bodies. To make these data easy to be stored, transmit, retrieved, and unified the best way is to represent these data as XML representation. However, XML documents suffer from high redundancy in its structure. Objective: This paper produces a new XML compressor (BioXComp) to compress the Bioinformatics XML documents (Bio-XML) and to retrieve information from the compressed XML documents without the need to decompress them. The proposed algorithm depends on generating the Structure Indexed Tree (SIT) for the Bio-XML document to use it for the retrieval purposes according to different types of queries. Results: BioXComp achieves 68.7% compression ratio and retrieves information based on different kinds of XQuery queries. Conclusion: As the importance of using XML documents in representing the Bioinformatics data increases, the need to compress these data is increasing as well to transmit these documents using less bandwidth. Instead of decompressing these files each time the user needs to retrieve a specific portion from them, BioXComp provides a way to retrieve the information from the compressed data using different types of XQuery query language.
Lecture Notes in Computer Science, 2006
Sharing of common subtrees has been reported useful not only for XML compression but also for main-memory XML query processing. This method compresses subtrees only when they exhibit identical structure. Even slight irregularities among subtrees dramatically reduce the performance of compression algorithms of this kind. Furthermore, when XML documents are large, the chance of having large number of identical subtrees is inherently low. In this paper, we proposed a method of decomposing XML documents for better compression. We proposed a heuristic method of locating minor irregularities in XML documents. The irregularities are then projected out from the original XML document. We refered this process to as document decomposition. We demonstrated that better compression can be achieved by compressing the decomposed documents separately. Experimental results demonstrated that the compressed skeletons, for all real-world datasets, to our knowledge, fit comfortably into main memory of commodity computers nowadays. Preliminary results on querying compressed skeletons validate the effectiveness our approach.
Advances in Databases and Information Systems, 2007
This paper describes a new XML compression scheme that offers both high compression ratios and short query response time. Its core is a fully reversible transform featuring substitution of every word in an XML document using a semi-dynamic dictionary, effective encoding of dictionary indices, as well as numbers, dates and times found in the document, and grouping data within the same structural context in individual containers. The results of conducted tests show that the proposed scheme attains compression ratios rivaling the best available algorithms, and fast compression, decompression, and query processing.
References (16)
- R. Alkhatib and M. H. Scholl. Cxqu: A compact xml storage for efficient query and update processing. In P. Pichappan and A. Abraham, editors, ICDIM, pages 605-612. IEEE, 2008.
- R. Alkhatib and M. H. Scholl. Compacting xml structures using a dynamic labeling scheme. In A. P. Sexton, editor, BNCOD, volume 5588 of Lecture Notes in Computer Science, pages 158-170. Springer, 2009.
- M. Ali, M. A. Khan: Efficient parallel compression and decompression for large XML files. Int. Arab J. Inf. Technol. 13(4):403-408, 2016
- H. AlZadjali, Siobhán North: XML Labels Compression using Prefix-encodings. WEBIST, pages 69- 75, 2016
- J. Bosak. Xml-tagged religion. Oct 1998. http://xml.coverpages.org.
- S. Böttcher, R. Hartel, C. Krislin: CluX -Clustering XML Sub-trees. ICEIS, pages 142, 150, 2010
- T. Härder, M. P. Haustein, C.Mathis, andM.W. 0002. Node labeling schemes for dynamic xml documents reconsidered. Data Knowl. Eng., 60(1):126-149, 2007.
- M. Lohrey, S. Maneth, R. Mennicke: XML tree structure compression using RePair. Inf. Syst. 38(8): 1150-1167, 2013
- D. R. MADDISON, K.-S. SCHULZ, and W. P. MADDI-SON. The tree of life web project. ZOOTAXA, pages 19-40, 20 Nov. 2007.
- W. May. Information extraction and integration with FLORID: The MONDIAL case study. Technical Report 131, Universität Freiburg, Institut für Informatik, 1999. Available from http://dbis.informatik.uni-goettingen.de/Mondial,.
- G. Miklau. Xml repository. http://www.cs.washington.edu/research/xmldatasets.
- NCBI. National center for biotechnology information(ncbi) xml data format. http://www.ncbi.nlm.nih.gov/index.html.
- NLM. National library of medicine (nlm) xml data format. http://xml.coverpages.org
- P. E. O'Neil, E. J. O'Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. Ordpaths: Insert-friendly xml node labels. In G.Weikum, A. C. K¨ onig, and S. Deßloch, editors, SIGMOD Conference, pages 903-908. ACM, 2004.
- Tatarinov, S. Viglas, K. S. Beyer, J. Shanmugasundaram, E. J. Shekita, and C. Zhang. Storing and querying ordered xml using a relational database system. In M. J. Franklin, B. Moon, and A. Ailamaki, editors, SIGMOD Conference, pages 204-215. ACM, 2002.
- T. web project. the tol tree structure. 1998. http://tolweb.org/tree
Ramez Alkhatib