
## A. THE PREDICTOR NETWORK P

Assume that the alphabet contains $k$ possible characters $z_1, z_2, \ldots, z_k$. The (local) representation of $z_i$ is a binary $k$-dimensional vector $r(z_i)$ with exactly one non-zero component (at the $i$-th position). $P$ has $nk$ input units and $k$ output units; $n$ is called the "time-window" size. We insert $n$ default characters $z_0$ at the beginning of each file. The representation of the default character, $r(z_0)$, is the $k$-dimensional zero-vector. The $m$-th character of file $f$ (starting from the first default character) is called $c^f_m$.
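The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `represent`, `pad_file`, and `DEFAULT` are hypothetical, the alphabet is a toy three-character list, and `None` stands in for the default character.

```python
from typing import List, Optional, Sequence

# Hypothetical stand-in for the default character inserted at the file start.
DEFAULT = None


def represent(char: Optional[str], alphabet: Sequence[str]) -> List[int]:
    """Local representation: a binary vector of length k = len(alphabet).

    A regular character gets a single 1 at its alphabet index;
    the default character maps to the all-zero vector.
    """
    vec = [0] * len(alphabet)
    if char is not DEFAULT:
        vec[alphabet.index(char)] = 1
    return vec


def pad_file(text: str, n: int) -> List[Optional[str]]:
    """Prepend n default characters, where n is the time-window size."""
    return [DEFAULT] * n + list(text)


alphabet = ["a", "b", "c"]          # toy alphabet, k = 3
padded = pad_file("cab", n=2)       # two default characters, then the text
vectors = [represent(ch, alphabet) for ch in padded]
```

Here `vectors` starts with two zero-vectors for the padding, followed by one vector per character of the text, each with exactly one non-zero component.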

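For all files $f$ and all possible positions $m > n$, $P$ receives as an input

$$r(c^f_{m-n}) \circ r(c^f_{m-n+1}) \circ \ldots \circ r(c^f_{m-1}),$$

where $\circ$ is the concatenation operator for vectors. $P$ produces as an output $P^f_m$, a $k$-dimensional output vector. Using back-propagation [8][9], $P$ is trained to minimize

$$\frac{1}{2} \sum_f \sum_{m>n} \left\| r(c^f_m) - P^f_m \right\|^2. \tag{1}$$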
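As a rough sketch of how the input window and the objective (1) fit together, the following Python evaluates the summed squared error for an arbitrary predictor function. Everything named here (`represent`, `window_input`, `squared_error`, the `uniform` predictor) is an assumption made for illustration; in particular, `uniform` is an untrained placeholder, not the network $P$, and no back-propagation is performed.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def represent(char: Optional[str], alphabet: Sequence[str]) -> Vector:
    """One-hot vector for a character; None (default character) gives zeros."""
    vec = [0.0] * len(alphabet)
    if char is not None:
        vec[alphabet.index(char)] = 1.0
    return vec


def window_input(padded: Sequence[Optional[str]], m: int, n: int,
                 alphabet: Sequence[str]) -> Vector:
    """Concatenate the representations of the n characters before position m.

    Positions are 1-based, so the m-th character is padded[m - 1].
    """
    out: Vector = []
    for ch in padded[m - n - 1:m - 1]:
        out.extend(represent(ch, alphabet))
    return out


def squared_error(files: Sequence[str], n: int, alphabet: Sequence[str],
                  predictor: Callable[[Vector], Vector]) -> float:
    """Half the summed squared error over all files and all positions m > n."""
    total = 0.0
    for text in files:
        padded = [None] * n + list(text)
        for m in range(n + 1, len(padded) + 1):
            target = represent(padded[m - 1], alphabet)
            output = predictor(window_input(padded, m, n, alphabet))
            total += sum((t - o) ** 2 for t, o in zip(target, output))
    return 0.5 * total


alphabet = ["a", "b"]
uniform = lambda _x: [0.5, 0.5]  # hypothetical untrained predictor
loss = squared_error(["ab"], n=2, alphabet=alphabet, predictor=uniform)
```

With this toy setup each of the two predicted positions contributes a squared error of 0.5, so `loss` comes out at 0.5 after the factor of one half.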

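(1) is minimal if $P^f_m$ always equals

$$E\left( r(c^f_m) \mid c^f_{m-n}, \ldots, c^f_{m-1} \right), \tag{2}$$

the conditional expectation of $r(c^f_m)$, given $r(c^f_{m-n}) \circ \ldots \circ r(c^f_{m-1})$. Due to the local character representation, this is equivalent to $(P^f_m)_i$ being equal to the conditional probability

$$\Pr\left( c^f_m = z_i \mid c^f_{m-n}, \ldots, c^f_{m-1} \right) \tag{3}$$

for all $f$ and for all appropriate $m > n$, where $(P^f_m)_j$ denotes the $j$-th component of the vector $P^f_m$.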
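The reasoning behind this claim can be spelled out with the standard decomposition of squared error. The notation below is introduced only for this sketch: $x$ denotes a fixed context (the window of preceding characters), $r$ the one-hot target vector that follows it, $y$ the predictor's output for $x$, and $\mu = E(r \mid x)$. It assumes the predictor is flexible enough to choose $y$ freely for each context.

$$
\begin{aligned}
E\left( \| r - y \|^2 \mid x \right)
  &= E\left( \| r - \mu \|^2 \mid x \right)
   + 2\, E\left( r - \mu \mid x \right)^{\top} (\mu - y)
   + \| \mu - y \|^2 \\
  &= E\left( \| r - \mu \|^2 \mid x \right) + \| \mu - y \|^2 .
\end{aligned}
$$

The cross term vanishes because $E(r - \mu \mid x) = 0$, and the first term does not depend on $y$, so the expected error is smallest at $y = \mu$. Since each component of $r$ is either 0 or 1, the $i$-th component of $\mu$ is $1 \cdot \Pr(r_i = 1 \mid x) + 0 \cdot \Pr(r_i = 0 \mid x) = \Pr(r_i = 1 \mid x)$, which is the probability that the character following the context is the $i$-th character of the alphabet.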

For instance, assume that a given "context string" of size $n$ is followed by a certain character in one third of all training exemplars involving this string. Then, given the context, the predictor's corresponding output unit will tend to predict a value of about 0.3333.
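The one-third example can be reproduced by simple counting. The sketch below does not train a network; it computes the empirical conditional frequencies that an ideal minimizer of the squared error would be expected to approach. The function name `conditional_frequencies` and the toy training string are assumptions, and the default-character padding is omitted for brevity.

```python
from collections import Counter, defaultdict
from typing import Dict


def conditional_frequencies(text: str, n: int) -> Dict[str, Dict[str, float]]:
    """Relative frequency of each next character, per context of length n."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for i in range(n, len(text)):
        context = text[i - n:i]
        counts[context][text[i]] += 1
    return {
        context: {ch: c / sum(ctr.values()) for ch, c in ctr.items()}
        for context, ctr in counts.items()
    }


# The context "ab" occurs three times, followed once each by "x", "y", "z".
freqs = conditional_frequencies("abxabyabz", n=2)
target_value = freqs["ab"]["x"]  # one third of the exemplars for context "ab"
```

Here `target_value` is one third, which is the value the output unit for "x" would tend toward when the network sees the context "ab".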

In practical applications, the $(P^f_m)_i$ will not always sum up to 1. To obtain outputs satisfying the properties of a proper probability distribution, we normalize by defining

$$P^f_m(i) = \frac{(P^f_m)_i}{\sum_{j=1}^{k} (P^f_m)_j}. \tag{4}$$
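A minimal sketch of this normalization step in Python follows. The function name `normalize` is hypothetical, and the guard against a non-positive sum is an added assumption: the text does not say how such a case should be handled.

```python
from typing import List, Sequence


def normalize(outputs: Sequence[float]) -> List[float]:
    """Divide each output by the sum of all outputs so the results sum to 1."""
    total = sum(outputs)
    if total <= 0:
        # Assumption: the source does not cover this case, so we refuse it.
        raise ValueError("outputs must have a positive sum")
    return [o / total for o in outputs]


raw = [0.5, 0.25, 0.5]      # toy predictor outputs that sum to 1.25
probs = normalize(raw)      # roughly [0.4, 0.2, 0.4]
```

Note that dividing by the sum only yields a proper distribution if every raw output is non-negative, which this sketch does not enforce.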

Juergen Schmidhuber 2003-02-19