next up previous
Next: B. METHOD 1 Up: III. OFF-LINE METHODS Previous: III. OFF-LINE METHODS

A. THE PREDICTOR NETWORK P

Assume that the alphabet contains $k$ possible characters $z_1, z_2, \ldots, z_k$. The (local) representation of $z_i$ is a binary $k$-dimensional vector $r(z_i)$ with exactly one non-zero component (at the $i$-th position). $P$ has $nk$ input units and $k$ output units. $n$ is called the ``time-window'' size. We insert $n$ default characters $z_0$ at the beginning of each file. The representation of the default character, $r(z_0)$, is the $k$-dimensional zero-vector. The $m$-th character of file $f$ (starting from the first default character) is called $c^f_m$.
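The local coding and the sliding time window can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation; the alphabet, window size, and file contents are assumptions chosen for the example.

```python
import numpy as np

k = 4                                   # alphabet size (illustrative)
n = 3                                   # time-window size (illustrative)
alphabet = ["a", "b", "c", "d"]
index = {ch: i for i, ch in enumerate(alphabet)}

def r(ch):
    """Local representation: a binary k-vector with exactly one
    non-zero component; the default character z_0 (here: None)
    maps to the k-dimensional zero vector."""
    v = np.zeros(k)
    if ch is not None:
        v[index[ch]] = 1.0
    return v

# Insert n default characters at the beginning of the file.
f = [None] * n + list("abca")

# Input for predicting the m-th character (0-based, m >= n):
# the concatenation of the n preceding character representations,
# giving an nk-dimensional input vector.
m = n
x = np.concatenate([r(c) for c in f[m - n:m]])
assert x.shape == (n * k,)
```

Because the first window consists entirely of default characters, `x` is the zero vector here; once real characters enter the window, `x` contains one 1 per non-default position.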

For all $f \in F$ and all possible $m>n$, $P$ receives as an input

\begin{displaymath}
r(c^f_{m-n})
\circ
r(c^f_{m-n+1})
\circ
\ldots
\circ
r(c^f_{m-1}) ,
\end{displaymath}

where $\circ$ is the concatenation operator for vectors. $P$ produces as an output $ P^f_m$, a $k$-dimensional output vector. Using back-propagation [8][9], $P$ is trained to minimize
\begin{displaymath}
\frac{1}{2}
\sum_{f \in F}
\sum_{m > n}
\left\| r(c^f_{m}) - P^f_m \right\|^2.
\end{displaymath} (1)

(1) is minimal if $ P^f_m$ always equals
\begin{displaymath}
E( r(c^f_{m}) \mid c^f_{m-n}, \ldots, c^f_{m-1}),
\end{displaymath} (2)

the conditional expectation of $r(c^f_{m})$, given $r(c^f_{m-n})
\circ
r(c^f_{m-n+1})
\circ
\ldots
\circ
r(c^f_{m-1})$. Due to the local character representation, this is equivalent to $(P^f_m)_i$ being equal to the conditional probability
\begin{displaymath}
Pr(c^f_m = z_i \mid c^f_{m-n}, \ldots, c^f_{m-1})
\end{displaymath} (3)

for all $f$ and for all appropriate $m>n$, where $(P^f_m)_j$ denotes the $j$-th component of the vector $ P^f_m$.

For instance, assume that a given ``context string'' of size $n$ is followed by a certain character in one third of all training examples involving this string. Then, given the context, the predictor's corresponding output unit will tend toward the value $1/3 \approx 0.3333$.
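This frequency-matching property can be checked numerically: for a single output unit with binary targets, the constant prediction minimizing the squared error of Equation (1) is the relative frequency of the target character. The data below are fabricated solely to illustrate the 1/3 case from the text.

```python
import numpy as np

# Targets for one output unit: the character follows the given
# context in exactly 1/3 of the training examples (illustrative).
targets = np.array([1.0, 0.0, 0.0] * 100)

# Mean squared error as a function of a constant prediction p,
# evaluated on a grid of candidate values in [0, 1].
candidates = np.linspace(0.0, 1.0, 1001)
errors = [np.mean((targets - p) ** 2) for p in candidates]
best = candidates[int(np.argmin(errors))]
# The minimizer sits at the relative frequency 1/3, matching the
# conditional expectation of Equation (2).
```

The same argument applies per component: minimizing (1) drives $(P^f_m)_i$ toward the conditional probability in (3).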

In practical applications, the components $(P^f_m)_i$ will not always sum to 1. To obtain outputs satisfying the properties of a proper probability distribution, we normalize by defining

\begin{displaymath}
P^f_m(i) = \frac {(P^f_m)_i}{\sum_{j=1}^k (P^f_m)_j }.
\end{displaymath} (4)
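The normalization of Equation (4) is a one-liner; the sketch below uses made-up raw outputs purely for illustration.

```python
import numpy as np

def normalize(p):
    """Rescale raw predictor outputs so that they form a proper
    probability distribution, as in Equation (4): divide each
    component by the sum of all components."""
    p = np.asarray(p, dtype=float)
    return p / p.sum()

raw = np.array([0.30, 0.50, 0.10, 0.15])   # illustrative raw outputs
probs = normalize(raw)
assert np.isclose(probs.sum(), 1.0)
```

Note that this rescaling preserves the ratios between components, so the relative ranking of candidate characters is unchanged.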


Juergen Schmidhuber 2003-02-19