Academia.eduAcademia.edu

File compression

description13 papers
group2 followers
lightbulbAbout this topic
File compression is the process of reducing the size of a digital file by encoding its data more efficiently, thereby minimizing the amount of storage space required and facilitating faster transmission over networks. This is achieved through various algorithms that eliminate redundancy and optimize data representation.
lightbulbAbout this topic
File compression is the process of reducing the size of a digital file by encoding its data more efficiently, thereby minimizing the amount of storage space required and facilitating faster transmission over networks. This is achieved through various algorithms that eliminate redundancy and optimize data representation.

Key research themes

1. How do lossless compression algorithms leverage statistical and structural redundancies to optimize data representation?

This theme explores lossless compression techniques that focus on identifying and removing redundancies and optimizing codeword assignments in digital data, especially text and images, to minimize file sizes without any information loss. It matters because lossless compression guarantees perfect data recovery, vital for scenarios like medical imaging, legal documents, and executable files.

Key finding: This paper categorizes and explains three fundamental types of redundancies exploited in image compression: coding redundancy (removable via optimal codes like Huffman coding), interpixel redundancy (correlations between... Read more
Key finding: By detailing the procedural application and comparative performance of Huffman coding and arithmetic coding, this study confirms that arithmetic coding generally outperforms Huffman by achieving shorter average codeword... Read more
Key finding: This study quantitatively compares classical lossless compression algorithms—Shannon-Fano, Huffman, Run-Length Encoding, and Lempel-Ziv-Welch (LZW)—evaluating them on the basis of compression ratio, entropy, and processing... Read more
Key finding: This paper introduces the adaptation of bit recycling techniques into arithmetic coding, exploiting the multiplicity of encoding to surpass Huffman-based bit recycling limitations. The adapted method theoretically achieves... Read more
Key finding: The proposed lossless compression algorithm leverages the Burrows-Wheeler transform coupled with key-based character reduction and Huffman encoding to significantly enhance compression ratios for highly repetitive character... Read more

2. What novel data structures and heuristics can enhance lossless text compression beyond classical dictionary and statistical coding methods?

This research area investigates innovative methods such as word lookup tables, self-organizing lists, and pattern-based optimizations to improve compression by indexing word-level repetitions and exploiting locality. Enhancing dictionary structures and dynamic coding heuristics matter for efficient compression and decompression, especially for large-scale text data and streaming applications.

Key finding: This paper demonstrates that employing a word lookup table as an OS-level reduction step to replace entire words by fixed-size address values effectively reduces the persistent storage requirements of English text by... Read more
Key finding: Introducing a compression scheme based on a self-organizing sequential search list with a move-to-front heuristic, this method exploits locality of reference by dynamically adjusting codeword positions. Theoretical analysis... Read more
Key finding: The compression scheme encodes textual data by substituting characters with dynamic-size differential values of ASCII codes relative to a calculated midpoint reference, exploiting the small differential range in textual... Read more
Key finding: Besides earlier mentioned transform benefits, this paper also innovates by introducing two keys to reduce consecutive repeated characters post-Burrows-Wheeler transform, combined with Huffman encoding on frequent patterns.... Read more

3. How can statistical modeling and feature-based predictors enable accurate and efficient estimation of lossy compression ratios for scientific and high-dimensional data?

This theme focuses on predicting lossy compression performance through machine-learned statistical frameworks that analyze spatial correlations, entropy measures, and data quantization impacts. Accurate prediction frameworks matter for optimizing compression configurations and selecting the best algorithms without exhaustive trial-and-error, particularly in data-intensive scientific computing environments.

Key finding: This work establishes a two-step data-driven framework combining compressor-agnostic statistical predictors capturing spatial correlation and quantized entropy with supervised models to predict lossy compression ratios across... Read more
Key finding: While primarily focused on lossless compression, this thesis introduces probabilistic models and arithmetic coding techniques with Bayesian inference to explicitly model input probability distributions, enabling tailored... Read more

All papers in File compression

Mass storage systems are finding greater use in scientific computing research environments for retrieving and archiving the large volumes of data generated and manipulated by scientific computations. This paper presents a queuing network... more
A noiseless-channel coding system in which the source probabilities change continuously in time is introduced. Using a geometry defined by the second derivative matrix of the information-theoretic entropy function, a bound on the number... more
A simple scheme was proposed by Knuth to generate balanced codewords from a random binary information sequence. However, this method presents a redundancy which is twice as that of the full sets of balanced codewords, that is the minimal... more
A simple scheme was proposed by Knuth to generate binary balanced codewords from any information word. However, this method is limited in the sense that its redundancy is twice as that of the full sets of balanced codes. The gap between... more
Since the Boyer-Moore algorithm was described in 1977, it has been the standard benchmark for the practical string search literature. Yet this yardstick compares badly with current practice. We describe two algorithms that perform 47%... more
Analysis of thermal data requires the processing of large amounts of temporal image data. The processing of the data for quantitative information can be time intensive especially out in the field where large areas are inspected resulting... more
Since the Boyer-Moore algorithm was described in 1977, it has been the standard benchmark for the practical string search literature. Yet this yardstick compares badly with current practice. We describe two algorithms that perform 47%... more
Here we propose GP-zip3, a system which uses Genetic Programming to find optimal ways to combine standard compression algorithms for the purpose of compressing files and archives. GP-zip3 evolves programs with multiple components. One... more
This paper surveys a variety of data compression methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce... more
A simple scheme was proposed by Knuth to generate balanced codewords from a random binary information sequence. However, this method presents a redundancy which is twice as that of the full sets of balanced codewords, that is the minimal... more
In this paper, the construction of binary balanced codes is revisited. Binary balanced codes refer to sets of bipolar codewords where the number of "1"s in each codeword equals that of "0"s. The first algorithm for balancing codes was... more
The commonly used data compression techniques do not necessarily provide maximal compression and neither do they define the most efficient framework for transmission of data. In this thesis we investigate variants of the standard... more
Data compression aims to reduce the size of data so that it requires less storage space and less communication channels bandwidth. Many compression techniques (such as LZ77 and its variants) su er from a problem that we call the... more
Mass storage systems are finding greater use in scientific computing research environments for retrieving and archiving the large volumes of data generated and manipulated by scientific computations. This paper presents a queuing network... more
Our approach to deduplication and compression in cloud computing aims at reduction in storage space and bandwidth usage during file transfers. The design depends on multiple metadata structures for deduplication. Only a copy of the... more
A Huffman code is a particular type of optimal prefix code that is commonly used for loss-less data compression. The process of finding such a code is known as Huffman coding. The output from Huffman's algorithm can be viewed as a... more
The bit recycling compression technique has been introduced to minimize the redundancy caused by the multiplicity of encoding feature present in many compression techniques. It has achieved about 9% as a reduction in the size of the files... more
•R educing strings overa rbitrary alphabet Σ o to strings overa fixed alphabet Σ c to standardize machine operations (|Σ c |<|Σ o |). − Binary representation of both operands and operators in machine instructions in computers.
We consider the problem of source coding. We investigate the cases of known and unknown statistics. The efficiency of the compression codes can be estimated by three characteristics: 1) the rebundancy (r), defined as the maximal... more
Research has been a challenge for most academic institutions. There are tools available to find research articled, but It is still hard to get inputs from experts, or experienced people, since most of the time researchers are assigned... more
... variablerate coding, IEEE Trans. Inf. Theory, IT-24 (1978), 530-536. /~1 John Cleary received a B. Sc. Hons in 1971 and an M. Sc. degree in 1974, both in mathematics from the University of Canterbury in New Zealand. From 1975 to ...
A novel fast recursive coding technique is proposed. It operates with only integer values not longer 8 bits and is multiplication free. Recursion the algorithm is based on indirectly provides rather effective coding of symbols for very... more
A novel fast recursive coding technique is proposed. It operates with only integer values not longer 8 bits and is multiplication free. Recursion the algorithm is based on indirectly provides rather effective coding of symbols for very... more
The web has become a resourceful tool for almost all domains today. Search engines prominently use inverted indexing technique to locate the web pages having the users query. The performance of inverted index fundamentally depends upon... more
The state of the art in data compression is arithmetic coding, not the betterknown Huffman method. Arithmetic coding gives greater compression, is faster for adaptive models, and clearly separates the model from the channel encoding.
by p H
The state of the art in data compression is arithmetic coding, not the betterknown Huffman method. Arithmetic coding gives greater compression, is faster for adaptive models, and clearly separates the model from the channel encoding.
Our approach to deduplication and compression in cloud computing aims at reduction in storage space and bandwidth usage during file transfers. The design depends on multiple metadata structures for deduplication. Only a copy of the... more
Download research papers for free!