next up previous
Next: E. COMPARISON OF METHODS Up: D. METHOD 3 Previous: D. METHOD 3

Application to Text Compression

Like with METHODs 1 and 2, the ``time-window'' corresponding to the predictor input is sequentially shifted across the unknown text file. The $ P^f_m$, however, are used in a different way. The character $z_i$ whose representation $r(z_i)$ has minimal Euclidean distance to $ P^f_m$ is taken as the predictor's deterministic prediction (if there is more than one character with minimal distance to the output, then we take the one with lowest ASCII value). If $c^f_m$ does not match the prediction, then it is stored in a second file, together with a number indicating how many characters were processed since the last non-matching character [6][7]. Expected characters are simply ignored - they represent redundant information. To avoid confusions between unexpected numbers from the original file and numbers indicating how many correct predictions went by since the last wrong prediction, we introduce an escape character to mark unexpected number characters in the second file. The escape character is used to mark unexpected escape characters, too. Finally we apply Huffman-Coding (as embodied by the UNIX function pack) to the second file and obtain the final compressed file.

The ``uncompress'' algorithm works as follows: we first unpack the compressed file by inverse Huffman-Coding (as employed by the UNIX function unpack). Then, starting from $n$ default characters, the predictor sequentially tries to predict each character of the original file from the $n$ previous characters (deterministic predictions are obtained like with the compression procedure above.) The numbers in the unpacked file contain all information about which predictions are wrong, and the associated characters tell us how to correct wrong predictions: if the unpacked file indicates that the current prediction is correct, it is fed back to the predictor input and becomes part of the basis for the next prediction. If the unpacked file indicates that the current prediction is wrong, the corresponding entry in the unpacked file (the correct character associated with the number indicating how many correct predictions went by since the last unexpected character) replaces the prediction and is fed back to the predictor input where it becomes part of the basis for the next prediction.


next up previous
Next: E. COMPARISON OF METHODS Up: D. METHOD 3 Previous: D. METHOD 3
Juergen Schmidhuber 2003-02-19