Academia.edu

Textual Data Compression

58 papers
306 followers
About this topic
Textual data compression is the process of encoding textual information using fewer bits than the original representation, aiming to reduce the size of data for storage or transmission while preserving the original content. This involves algorithms that exploit redundancy and patterns in the text to achieve efficient encoding.

Key research themes

1. How can locality of reference and dictionary-based heuristics improve static text compression efficiency and parallelizability?

This research theme explores text compression schemes that exploit locality of reference (where certain words or patterns appear frequently within short intervals) to achieve better compression than classic Huffman coding. It also investigates the greedy approach to dictionary-based static text compression, particularly its execution in distributed systems. The focus is on heuristics for word caching, factorization, and dictionary design, balancing compression effectiveness with computational efficiency and scalability on parallel architectures.

Key finding: Introduces a defined-word compression scheme using a move-to-front heuristic to exploit locality of reference in text. The scheme dynamically organizes a sequential word list, encoding recently used words with shorter... Read more
Key finding: Presents the implementation of a greedy factorization method for dictionary-based static text compression that can be executed efficiently via finite state machines and parallelized across distributed systems with minimal... Read more
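
To make the move-to-front idea concrete, here is a minimal sketch in Python. It is not the implementation from the paper summarized above: the function names `mtf_encode` and `mtf_decode`, the tuple used as an escape for unseen words, and the plain Python list used as the recency list are all assumptions made for illustration. The sketch only shows how recently used words end up with small indices; a real scheme would additionally map those indices to a variable-length integer code.

```python
def mtf_encode(words):
    """Encode each word as its position in a recency list (move-to-front)."""
    recency = []   # most recently used word sits at index 0
    output = []
    for word in words:
        if word in recency:
            idx = recency.index(word)
            output.append(idx)                    # small index = recently used
            recency.pop(idx)
        else:
            # Hypothetical escape: unseen words are sent literally,
            # tagged with the current list length.
            output.append((len(recency), word))
        recency.insert(0, word)                   # move (or add) to the front
    return output


def mtf_decode(codes):
    """Invert mtf_encode by maintaining the same recency list."""
    recency = []
    words = []
    for code in codes:
        if isinstance(code, tuple):
            word = code[1]            # literal word from an escape
        else:
            word = recency.pop(code)  # look the word up by its position
        recency.insert(0, word)
        words.append(word)
    return words


if __name__ == "__main__":
    sample = "a a a b b".split()
    print(mtf_encode(sample))
```

Because the decoder rebuilds the same recency list as the encoder, no dictionary has to be transmitted separately. Immediate repetitions always encode as index 0, which is where locality of reference pays off.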

2. What are the advancements and challenges in achieving fully-compressed suffix trees supporting dynamic updates for text indexing?

Suffix trees are fundamental data structures for string processing with widespread applications but traditionally suffer from large space requirements. This theme investigates the development of fully compressed suffix trees (FCSTs) that achieve space usage close to the entropy of the text while supporting efficient queries. It further studies dynamic FCSTs that can handle text updates and their complexity trade-offs, aiming to attain polylogarithmic query times within optimal compressed space.

Key finding: Develops a framework for dynamic fully compressed suffix trees occupying asymptotically optimal space proportional to the entropy of the text and supporting all suffix tree operations in polylogarithmic time. This extends... Read more

3. How can combined techniques of Burrows-Wheeler transform, pattern matching, and Huffman coding enhance lossless text compression?

This theme addresses the integration of statistical transforms and coding techniques to improve lossless text compression ratios. It focuses on leveraging the Burrows-Wheeler transform (BWT) to cluster repeated characters for efficient run-length representation, followed by pattern matching to detect frequently occurring substrings, and applying Huffman coding based on character frequencies. These combined methods aim to achieve superior compression performance compared to classical schemes while maintaining decompression efficiency.

Key finding: Proposes a novel lossless compression algorithm that applies the Burrows-Wheeler transform with a two-key approach to reduce consecutive character repetitions. Subsequently, frequent patterns are identified for additional... Read more
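
The clustering effect of the Burrows-Wheeler transform can be illustrated with a small sketch. This is the textbook rotation-sorting formulation, not the two-key variant proposed in the paper above; the function names, the `$` sentinel, and the quadratic-time construction are assumptions chosen for readability rather than efficiency.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text (naive rotation sort)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Sort all cyclic rotations and keep the last character of each.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row ending in the sentinel is the original string plus sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


if __name__ == "__main__":
    print(bwt("banana"))  # annb$aa
```

In the output for "banana", equal characters end up adjacent ("nn", "aa"). Those runs are what a later run-length or Huffman stage can exploit.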

All papers in Textual Data Compression

Federated Learning (FL) is a machine learning technique that helps safeguard data privacy by allowing financial institutions to collaborate on training AI models without exchanging real data. In FL, the AI model is distributed to each institution,... more
The "XOR-Torus" Implementation This document outlines the systematic approach to establishing the Blackwell Block, pairing the SHD-CCP Stream, and optimizing the Linguistic Crystallization pipeline for benchmarking on NVIDIA SM100... more
This study explores the use of sentiment analysis and machine learning models to predict the market trends of meme coins. By analyzing social media sentiment and financial metrics, the research achieved a 74% accuracy rate in forecasting... more
Accurate real estate price prediction is crucial in today's market to aid buyers, sellers, and investors in making informed decisions. This study employs machine learning algorithms-specifically Linear Regression, Decision Tree... more
Unification is known to be the most repeated operation in logic programming and PROLOG interpreters. To speed up the execution of logic programs, the performance of unification must be improved. We propose a parallel unification machine... more
The development of a highly resilient architecture for mission-critical systems is an integrated approach aimed at minimizing operational risks and ensuring the continuity of vital services. In the face of growing threats, including... more
This paper presents an efficient data compression technique based on using Lempel-Ziv coding algorithms such as the LZ-78 algorithm. The conventional LZ-78 algorithm was applied directly to a non-binary information source (i.e., original... more
Data compression has a major effect on data warehouses, reducing data size and improving query processing. Distinct compression techniques are feasible at different levels, each of which either gives a good compression ratio or... more
Research on text compression responds to the ever-increasing demands of the modern information society. These demands are intricately tied to the efficient utilization of knowledge and the continuous pursuit... more
Wireless sensor networks have numerous applications in various fields. Saving energy in such networks is a critical issue that must be considered in order to prolong the network lifespan. Clustering in these networks is additionally... more
Both communication and general-purpose file compression technologies now make heavy use of efficient data compression methods. This paper surveys a variety of data compression methods. The aim of data... more
A representation method using the non-symmetry and anti-packing model (NAM) for data compression of binary images is presented. The NAM representation algorithm is compared with the popular linear quadtree and run length encoding... more
Data compression is a technique used to reduce the number of bits required to transmit a given piece of information. The goal of data compression is to eliminate redundancy in data in order to reduce its size. Data compression can... more
A new notion, that of semi-lossless text compression, is introduced, and its applicability in various settings is investigated. First results suggest that it might be hard to exploit the additional redundancy of English texts, but the new... more
It seems reasonable to expect from a good compression method that its output should not be further compressible, because it should behave essentially like random data. We investigate this premise for a variety of known lossless... more
Huffman coding is known to be optimal, yet its dynamic version may yield smaller compressed files. The best known bound is that the number of bits used by dynamic Huffman coding in order to encode a message of n characters is at most... more
With the increasing need to store data in less memory, several lossless compression techniques have been developed. This paper provides a performance analysis of lossless compression techniques over various parameters like compression... more
Data storage hardware has developed very rapidly over time. In line with this development, the amount of digital data shared on the internet is increasing every day. As a result, no matter how big the size of the... more
The main goal of data compression is to decrease redundancy in stored or communicated data, thereby increasing effective data density. It is a common requirement for most applications. Data compression is highly relevant in the... more
Data compression is now a common requirement for almost every application, as it is a means of saving channel bandwidth and storage space. Data compression is a technique for reducing the volume of data, i.e. excess... more
The purpose of the study was to compare the compression ratios of file size, file complexity, and time used in compressing each text file in the four selected compression algorithms on a given modern computer running Windows 7. The... more
As the volume and importance of textual data in data science continues to grow, combined with advancements in its techniques, it has created numerous opportunities for extracting valuable insights from textual information. However,... more
In this paper, an energy-efficient data collection method is proposed that exploits an integration of the Discrete Cosine Transform (DCT) matrix and clustering in wireless sensor networks (WSNs). Based on the fact that sensory data... more
Video compression is the compression of video data; it involves reducing the video size and audio format. In other words, video compression is an encoding format for video that can have a smaller memory size than the... more
We introduce DashHashLM, an efficient data structure that stores an n-gram language model compactly while making minimal trade-offs on runtime lookup latency. The data structure implements a finite state transducer with a lossless... more
The optimal configuration for a Large Scale Wireless Sensor Network (LS-WSN) is the one that minimizes the sampling rate, the CPU time and the channel accesses (thus maximizing the network lifetime), with a controlled distortion in the... more
Hard disk failure is something most companies and people fear, because of concerns about data loss. Predicting HDD failure is therefore important for ensuring the storage security of the data center. There... more
Noise is an ever present phenomenon while dealing with recording devices, be it digital or analog, be it specks in images or background hiss in music recordings. Therefore, this paper aims at ways of reducing the effects of these forms of... more
Modern daily life activities produce a huge amount of data, which makes storing and communicating it a big challenge. For example, hospitals produce a huge amount of data on a daily basis, which makes it a big challenge to store... more
The objective of the ICompress utility, described in this article, is to study under real conditions the compression of massive data sequences having a limited number of colors. Because current standards do not offer sufficient... more
Autonomous driving is gaining importance due to advancements in technology. With the intention of improving safety during human driving and with the longer-term aim to act as a communication enabler for autonomous driving, vehicle to... more
Plagiarism occurs when the content is copied without permission or citation. One of the contributing factors is that many text documents on the internet are easily copied and accessed. This paper introduces a plagiarism detection... more
Flight ticket prices increase or decrease frequently depending on various factors such as the timing of the flights, destination, and duration of flights. In the proposed system a predictive model will be created by applying machine... more
This work proposes a preprocessing method for image compression based on attribute filtering. This method is completely shape preserving and computationally cheap. Three filters were investigated, including one derived from the power... more
Insurance is a policy that helps cover or reduce losses from expenses incurred through various risks. A number of variables affect how much insurance costs. These considerations of different factors contribute to the... more
A wireless sensor network consists of a large number of wireless nodes that are responsible for sensing, processing, and monitoring the environment. These sensor nodes are battery operated. Clustering is a standard approach for achieving efficient... more