Recently, Yang and Kieffer' proposed a novel lossless grammar-based data compression algorithm, called the YK algorithm, in which a greedy sequential grammar transform is applied to the original data to construct an irreducible context...
moreRecently, Yang and Kieffer' proposed a novel lossless grammar-based data compression algorithm, called the YK algorithm, in which a greedy sequential grammar transform is applied to the original data to construct an irreducible context free grammar, which is encoded indirectly by using an arithmetic coder. The YK algorithm has been shown to be universal for the class of stationary, ergodic sources. Experimental results show that the YK algorithm significantly outperforms other commonly used sequential lossless data compression algorithms such as Lempel-Ziv types of codes. Moreover, the YK algorithm is effective on a virtually unlimited range of data sizes. The basic implementation of the YK encoding algorithm consists of a sequentially iterative application of three fundamental steps-Parsing, Arithmetic Encoding, and Updating. The parsing operation searches for the longest prefix of the remaining part of the original data sequence that is representable by the current grammar. The arithmetic encoding operation encodes the parsed phrase using frequency counts over a dynamic alphabet. The updating operation uses the parsed substring to update the grammar and the frequency counts. This paper proposes five modifications of the basic YK algorithm, motivated by applications of the algorithm to the internet transmission of web-data: 1. Fast YK encoder: The parsing operation is a major step of the YK algorithm, which can contribute to a significant portion of the overall time taken by the encoder. A variant of the trie data structure, which captures the structure of the context-free grammar in a non-redundant manner, is proposed for fast parsing. This is applicable for real-time compression of IP datagrams, particularly at high data rates. 2. Pre-defined source statistics: Known source statistics can be exploited to improve compression efficiency, which is particularly effective for small IP datagrams with a known structure, such as NNTP datagrams. 3. Pre-defined grammar: Starting with a "typical" pre-defined grammar can significantly improve the compression efficiency for applications such as HTML web-page compression, because HTML responses originating at a given server tend to have identical layout and similar content, in addition to having standard HTML tags and keywords. 4. Memory constrained implementation: During YK compression, as the length of the data sequence increases, the grammar also continues to grow in size, which can potentially exhaust the available memory in the system. This paper proposes a way to check memory requirement by reusing variables in the grammar, once a user-chosen limit on grammar size is reached. 5. Error handling capability: The paper identifies all possible contingencies that can arise when an erroneous bit-stream is fed to the YK decoder, and provides explicit ways to handle these. This is important in applications where compressed IP datagrams are transmitted over unreliable links.