SIMULATIONS

Our current computing environment does not permit extensive experimental evaluations of the method above. The predictor updates turn out to be quite time consuming, which suggests that special neural net hardware would be advisable. The limited software simulations presented in this section, however, show that the ``neural'' compression technique can achieve ``excellent'' compression ratios. Here the term ``excellent'' is defined by a statement from [5]:
"In general, good algorithms can be expected to achieve an average compression ratio of 1.5, while excellent algorithms based upon sophisticated processing techniques will achieve an average compression ratio exceeding 2.0."
Here the average compression ratio is the mean, taken over all files, of the ratio between the length of the original file and the length of its compressed version.
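In symbols (the notation is ours, added for clarity; it does not appear in [5]): $R = \frac{1}{N}\sum_{i=1}^{N} |f_i| / |c(f_i)|$, where $f_i$ is the $i$-th original file, $c(f_i)$ its compressed version, $|\cdot|$ denotes length in bytes, and $N$ is the number of files. A ratio of 2.0 thus means that compressed files are on average half as long as the originals.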

In collaboration with Stefan Heil (a former student at TUM), the method was applied to German newspaper articles [27,28]. The results were compared to those obtained with standard encoding techniques provided by the UNIX operating system, namely ``pack'', ``compress'', and ``gzip''. The corresponding decoding algorithms are ``unpack'', ``uncompress'', and ``gunzip'', respectively. ``pack'' is based on Huffman coding (e.g. [5]), while ``compress'' and ``gzip'' are based on the Lempel-Ziv technique [43]. As the file size goes to infinity, Lempel-Ziv coding becomes asymptotically optimal in a certain information-theoretic sense [42].
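As an aside, a baseline of this kind is easy to reproduce today. The following minimal Python sketch is our illustration, not part of the original experiments; the directory name is a placeholder, and zlib's DEFLATE at level 9 stands in for UNIX ``gzip -9''. It measures the average compression ratio over a set of text files:

    import zlib
    from pathlib import Path

    def compression_ratio(path, level=9):
        """Length of the original file divided by the length of its
        DEFLATE-compressed version (level 9, as in ``gzip -9'')."""
        data = path.read_bytes()
        return len(data) / len(zlib.compress(data, level))

    # Placeholder directory standing in for the ten test articles.
    files = sorted(Path("testset").glob("*.txt"))
    ratios = [compression_ratio(p) for p in files]
    print(f"average compression ratio: {sum(ratios) / len(ratios):.2f}")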

The training set for the predictor was given by a set of 40 articles from the newspaper Münchner Merkur, each containing between 10000 and 20000 characters. The alphabet consisted of $k=80$ possible characters, including upper and lower case letters, digits, punctuation symbols, and special German letters such as ``ö'', ``ü'', and ``ä''.

The test set consisted of 10 newspaper articles excluded from the training set, each containing between 10000 and 20000 characters. Table 1 lists the average compression ratios. The ``neural'' method outperformed the strongest conventional competitor, the UNIX ``gzip'' function based on the asymptotically optimal Lempel-Ziv algorithm.


Table 1: Compression ratios of various compression algorithms for short German text files ($<20000$ bytes) from the test set, which was disjoint from the training set.
    Method                                 Compression Ratio
    Huffman coding (UNIX: pack)                  1.41
    Lempel-Ziv coding (UNIX: compress)           1.98
    Improved Lempel-Ziv (UNIX: gzip -9)          2.25
    ``Neural'' method, $n=5$                     2.68


Note that the time window was quite small ($n=5$). In general, a larger time window makes more information available to the predictor, which should improve prediction quality and hence the compression ratio. We therefore expect even better results for $n>5$ and for recurrent predictor networks.
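To make the role of the time window concrete, the following sketch shows a window-based next-character predictor in Python with numpy. The architecture, layer sizes, and training details are illustrative assumptions on our part, not the exact setup of [27,28]; the point is only that the net maps the $n$ previous characters to a probability distribution over the $k=80$ possible next characters, which an arithmetic coder can then consume:

    import numpy as np

    k, n, hidden = 80, 5, 100      # alphabet size, window size, hidden units
    rng = np.random.default_rng(0)

    # Toy stand-in for the newspaper training text (assumption: characters
    # have already been mapped to integer codes in [0, k)).
    text = rng.integers(0, k, size=10000)

    def one_hot_window(chars):
        """Concatenated one-hot encoding of the n previous characters."""
        x = np.zeros(n * k)
        for i, c in enumerate(chars):
            x[i * k + c] = 1.0
        return x

    # Single-hidden-layer predictor: window -> probabilities over k chars.
    W1 = rng.normal(0, 0.1, (hidden, n * k))
    W2 = rng.normal(0, 0.1, (k, hidden))

    def predict(x):
        h = np.tanh(W1 @ x)
        z = W2 @ h
        p = np.exp(z - z.max())        # numerically stable softmax
        return h, p / p.sum()

    # One pass of online gradient descent on the cross-entropy loss.
    lr = 0.1
    for t in range(n, len(text)):
        x = one_hot_window(text[t - n:t])
        h, p = predict(x)
        target = np.zeros(k)
        target[text[t]] = 1.0
        dz = p - target                  # gradient of cross-entropy w.r.t. logits
        dh = (W2.T @ dz) * (1 - h ** 2)  # backprop through tanh
        W2 -= lr * np.outer(dz, h)
        W1 -= lr * np.outer(dh, x)

Since an arithmetic coder spends roughly $-\log_2 P$ bits on a character predicted with probability $P$, any improvement in prediction quality translates directly into a higher compression ratio.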

