Next: II. BASIC APPROACH Up: SEQUENTIAL NEURAL TEXT COMPRESSION Previous: SEQUENTIAL NEURAL TEXT COMPRESSION

I. INTRODUCTION

Text compression is important (e.g. [1]). It is cheaper to communicate compressed text files instead of original text files. Moreover, compressed files are cheaper to store.

For such reasons, various text encoding algorithms have been developed, plus the corresponding decoding algorithms. A text encoding algorithm takes a text file and generates a shorter compressed file from it. The compressed file contains all the information necessary to restore the original file, which can be done by calling the corresponding decoding algorithm. Unlike with image compression, text compression typically requires loss-free compression. The most widely used text compression algorithms are based on Lempel-Ziv techniques (e.g., [13]). Lempel-Ziv compresses symbol strings sequentially, essentially replacing substrings by pointers to equal substrings encountered earlier. As the file size goes to infinity, Lempel-Ziv becomes asymptotically optimal in a certain information theoretic sense [12].

The average ratio between the lengths of original and compressed files is called the average compression ratio. We cite a statement from Held's book [2], where he refers to text represented by 8 bits per character:

"In general, good algorithms can be expected to achieve an average compression ratio of 1.5, while excellent algorithms based upon sophisticated processing techniques will achieve an average compression ratio exceeding 2.0."

This paper will show that neural networks may be used to design ``excellent'' text compression algorithms.

Outline of paper. Section II describes the basic approach combining neural nets and the technique of predictive coding. Section III focuses on the details of a neural predictor of conditional probabilities. In addition, section III describes three alternative coding techniques to be used in conjunction with the predictor. Section IV presents comparative simulations. Section V discusses limitations and extensions.

Next: II. BASIC APPROACH Up: SEQUENTIAL NEURAL TEXT COMPRESSION Previous: SEQUENTIAL NEURAL TEXT COMPRESSION

Juergen Schmidhuber 2003-02-13