

Our current computing environment prohibits extensive experimental evaluations of the three methods above. On an HP 700 workstation, the training phase for the predictor is quite time consuming, taking days of computation time. Even once the predictor is trained, the method still tends to be on the order of 1000 times slower than standard methods. In many data transmission applications, communication is not expensive enough to justify this in the absence of specialized hardware (given the current state of workstation technology). This leads us to recommend special neural net hardware for our approach. The software simulations presented in this section, however, will show that ``neural'' compression techniques can sometimes achieve excellent compression ratios.

We applied our off-line methods to German newspaper articles (simulations were run and evaluated by the second author). We compared the results to those obtained with standard encoding techniques provided by the UNIX operating system, namely ``pack'', ``compress'', and ``gzip''. The corresponding decoding algorithms are ``unpack'', ``uncompress'', and ``gunzip'', respectively. ``pack'' is based on Huffman coding (e.g. [4]), while ``compress'' and ``gzip'' are based on the asymptotically ``optimal'' Lempel-Ziv technique [2]. It should be noted that ``pack'', ``compress'', and ``gzip'' ought to be classified as on-line methods: they adapt to the specific text file they see. In contrast, the competing ``neural'' methods ran off-line, due to time limitations. Our comparison was therefore unfair in the sense that it was biased against the ``neural'' methods. See section V, however, for on-line ``neural'' alternatives.
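As a point of reference, the baseline ratios of such conventional tools are easy to measure directly. The following is a minimal sketch (not part of the original experiments) using Python's zlib module, which implements the same DEFLATE scheme underlying ``gzip''; the sample text is a hypothetical stand-in for a newspaper article:

```python
import zlib

def compression_ratio(data: bytes, level: int = 9) -> float:
    """Ratio of original size to compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, level))

# Hypothetical sample standing in for a German newspaper article.
sample = ("Die Zeitung berichtet ueber die Ereignisse des Tages. " * 200).encode("latin-1")
print(round(compression_ratio(sample), 2))
```

Ratios obtained this way on natural text correspond to the ``gzip'' column of the tables below, up to differences in header overhead and compression level.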

The training set for the predictor consisted of 40 articles from the newspaper Münchner Merkur, each containing between 10000 and 20000 characters. The alphabet consisted of $k=80$ possible characters, including upper case and lower case letters, digits, punctuation symbols, and special German letters like ``ö'', ``ü'', ``ä''. $P$ had 430 hidden units. A ``true'' unit with constant activation 1.0 was connected to all hidden and output units. The learning rate was 0.2. The training phase consisted of 25 sweeps through the training set, taking 3 days of computation time on an HP 700 workstation. Why just 25 sweeps? On a separate test set, numbers of sweeps between 20 and 40 were empirically found to lead to acceptable performance. Note that a single sweep actually provides many different training examples for the predictor.
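A minimal numpy sketch of such a time-window predictor follows. It uses the dimensions stated above ($k=80$, $n=5$, 430 hidden units, learning rate 0.2, a ``true'' unit with constant activation 1.0); the softmax output normalization and the weight initialization are assumptions for illustration, not details taken from the original setup:

```python
import numpy as np

k, n, hidden = 80, 5, 430   # alphabet size, window length, hidden units (as in the text)
lr = 0.2                    # learning rate (as in the text)
rng = np.random.default_rng(0)

# Weight matrices; the extra column is the "true" unit with constant activation 1.0.
W1 = rng.normal(0.0, 0.1, (hidden, n * k + 1))
W2 = rng.normal(0.0, 0.1, (k, hidden + 1))

def one_hot_window(chars):
    """Concatenate one-hot codes of the n previous characters, plus the bias unit."""
    x = np.zeros(n * k + 1)
    for i, c in enumerate(chars):
        x[i * k + c] = 1.0
    x[-1] = 1.0  # "true" unit
    return x

def forward(x):
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))   # logistic hidden layer
    h = np.append(h, 1.0)                 # "true" unit for the output layer
    z = W2 @ h
    e = np.exp(z - z.max())
    return h, e / e.sum()                 # softmax estimate of P(next char | window)

def train_step(chars, target):
    """One backprop step on the cross-entropy error for a single window/target pair."""
    global W1, W2
    x = one_hot_window(chars)
    h, p = forward(x)
    delta_out = p.copy()
    delta_out[target] -= 1.0                                  # dE/dz at the output
    delta_h = (W2.T @ delta_out)[:-1] * h[:-1] * (1.0 - h[:-1])
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_h, x)
```

A sweep through the training set then amounts to calling `train_step` once per character position, which is why a single sweep already yields many training examples.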

The test set consisted of 20 newspaper articles (from the same newspaper), each containing between 10000 and 20000 characters. Of course, the test set did not overlap with the training set. Table 1 lists the average compression ratios and the corresponding variances. Our methods achieved ``excellent'' performance (according to Held's statement quoted in the introduction). Even METHOD 3 led to an ``excellent'' compression ratio, although it does not make use of all the information about the conditional probabilities. The best performance was obtained with METHOD 2, which outperformed the strongest conventional competitor, the UNIX ``gzip'' function based on the asymptotically optimal Lempel-Ziv algorithm. Note that variance goes up (but always remains within acceptable limits) as compression performance improves.

------------ INSERT TABLE 1 HERE ----------------

The hidden units were actually necessary to achieve good performance: a network without hidden units was not able to achieve average compression ratios exceeding 2.0. The precise number of hidden units, however, did not appear to be critical. A network with 300 hidden units achieved performance similar to that of the network above.

How does a neural net trained on articles from Münchner Merkur perform on articles from other sources? Without retraining the neural predictor, we applied all competing methods to 10 articles from another German newspaper (the Frankenpost). The results are given in table 2.

------------ INSERT TABLE 2 HERE ----------------

The Frankenpost articles were harder to compress for all algorithms, but relative performance remained comparable.

Note that we used quite a small time window ($n=5$). In general, larger time windows make more information available to the predictor; this improves prediction quality and increases the compression ratio. We therefore expect even better results for $n>5$ and for recurrent predictor networks (note that recurrent nets are less limited than the time-window approach: in principle they can emit predictions based on all previous characters). Another reason for optimism comes from a performance comparison with three human subjects who had to predict characters (randomly selected from the test files) from the $n$ preceding characters. With $n=5$, the humans correctly predicted 52 percent of all characters, while our predictor predicted 49 percent (the character with the highest predicted probability was taken as the prediction). With $n=10$, humans predicted about 59 percent of all characters; with $n=15$, about 63 percent. We expect that $P$ will remain close to human performance for $n>5$. More training data, however, are required to avoid overfitting.
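The percentages above are top-1 prediction accuracies: the fraction of positions at which the most probable predicted character equals the actual one. A sketch of this measure, assuming a hypothetical `predict_probs` function that maps a window of $n$ character codes to a probability vector over the alphabet (as the trained predictor $P$ does):

```python
def top1_accuracy(text, predict_probs, n=5):
    """Fraction of characters whose most probable prediction matches the
    actual character, given the n preceding characters."""
    correct = 0
    for t in range(n, len(text)):
        p = predict_probs(text[t - n:t])          # probability vector over the alphabet
        if max(range(len(p)), key=p.__getitem__) == text[t]:
            correct += 1
    return correct / (len(text) - n)
```

The human subjects were scored in exactly the same way, with their single guess playing the role of the argmax prediction.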

Juergen Schmidhuber 2003-02-19