Assume that the alphabet contains possible characters . The (local) representation of is a binary -dimensional vector with exactly one non-zero component (at the -th position). has input units and output units. is called the ``time-window'' size. We insert default characters at the beginning of each file. The representation of the default character, , is the -dimensional zero-vector. The -th character of file (starting from the first default character) is called .

For all and all possible ,
receives as an input

where is the concatenation operator for vectors. produces as an output , a -dimensional output vector. Using back-propagation [8][9], is trained to minimize

(1) |

(2) |

(3) |

For instance, assume that a given ``context string'' of size is followed by a certain character in one third of all training exemplars involving this string. Then, given the context, the predictor's corresponding output unit will tend to predict a value of 0.3333.

In practical applications, the
will not always sum up to 1.
To obtain outputs satisfying the properties of
a proper probability distribution,
we normalize by defining

(4) |