next up previous
Next: B. METHOD 1 Up: III. OFF-LINE METHODS Previous: III. OFF-LINE METHODS

A. THE PREDICTOR NETWORK P

Assume that the alphabet contains $k$ possible characters $z_1, z_2, \ldots, z_k$. The (local) representation of $z_i$ is a binary $k$-dimensional vector $r(z_i)$ with exactly one non-zero component (at the $i$-th position). $P$ has $nk$ input units and $k$ output units. $n$ is called the ``time-window'' size. We insert $n$ default characters $z_0$ at the beginning of each file. The representation of the default character, $r(z_0)$, is the $k$-dimensional zero-vector. The $m$-th character of file $f$ (starting from the first default character) is called $c^f_m$.
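The local coding and the sliding time window can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation; the alphabet, window size, and file contents are assumptions chosen for the example.

```python
import numpy as np

k = 4                                   # alphabet size (illustrative)
n = 3                                   # time-window size (illustrative)
alphabet = ["a", "b", "c", "d"]
index = {ch: i for i, ch in enumerate(alphabet)}

def r(ch):
    """Local representation: a binary k-vector with exactly one
    non-zero component; the default character z_0 (here: None)
    maps to the k-dimensional zero vector."""
    v = np.zeros(k)
    if ch is not None:
        v[index[ch]] = 1.0
    return v

# Insert n default characters at the beginning of the file.
f = [None] * n + list("abca")

# Input for predicting the m-th character (0-based, m >= n):
# the concatenation of the n preceding character representations,
# giving an nk-dimensional input vector.
m = n
x = np.concatenate([r(c) for c in f[m - n:m]])
assert x.shape == (n * k,)
```

Because the first window consists entirely of default characters, `x` is the zero vector here; once real characters enter the window, `x` contains one 1 per non-default position.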

For all $f \in F$ and all possible $m>n$, $P$ receives as an input

\begin{displaymath}
r(c^f_{m-n})
\circ
r(c^f_{m-n+1})
\circ
\ldots
\circ
r(c^f_{m-1}) ,
\end{displaymath}

where $\circ$ is the concatenation operator for vectors. $P$ produces as an output $ P^f_m$, a $k$-dimensional output vector. Using back-propagation [8][9], $P$ is trained to minimize
\begin{displaymath}
\frac{1}{2}
\sum_{f \in F}
\sum_{m > n}
\left\| r(c^f_{m}) - P^f_m \right\|^2.
\end{displaymath} (1)

(1) is minimal if $ P^f_m$ always equals
\begin{displaymath}
E( r(c^f_{m}) \mid c^f_{m-n}, \ldots, c^f_{m-1}),
\end{displaymath} (2)

the conditional expectation of $r(c^f_{m})$, given $r(c^f_{m-n})
\circ
r(c^f_{m-n+1})
\circ
\ldots
\circ
r(c^f_{m-1})$. Due to the local character representation, this is equivalent to $(P^f_m)_i$ being equal to the conditional probability
\begin{displaymath}
Pr(c^f_m = z_i \mid c^f_{m-n}, \ldots, c^f_{m-1})
\end{displaymath} (3)

for all $f$ and for all appropriate $m>n$, where $(P^f_m)_j$ denotes the $j$-th component of the vector $ P^f_m$.

For instance, assume that a given ``context string'' of size $n$ is followed by a certain character in one third of all training examples involving this string. Then, given the context, the predictor's corresponding output unit will tend toward the value $1/3 \approx 0.3333$.
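This frequency-matching property can be checked numerically: for a single output unit with binary targets, the constant prediction minimizing the squared error of Equation (1) is the relative frequency of the target character. The data below are fabricated solely to illustrate the 1/3 case from the text.

```python
import numpy as np

# Targets for one output unit: the character follows the given
# context in exactly 1/3 of the training examples (illustrative).
targets = np.array([1.0, 0.0, 0.0] * 100)

# Mean squared error as a function of a constant prediction p,
# evaluated on a grid of candidate values in [0, 1].
candidates = np.linspace(0.0, 1.0, 1001)
errors = [np.mean((targets - p) ** 2) for p in candidates]
best = candidates[int(np.argmin(errors))]
# The minimizer sits at the relative frequency 1/3, matching the
# conditional expectation of Equation (2).
```

The same argument applies per component: minimizing (1) drives $(P^f_m)_i$ toward the conditional probability in (3).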

In practical applications, the components $(P^f_m)_i$ will not always sum to 1. To obtain outputs satisfying the properties of a proper probability distribution, we normalize by defining

\begin{displaymath}
P^f_m(i) = \frac {(P^f_m)_i}{\sum_{j=1}^k (P^f_m)_j }.
\end{displaymath} (4)
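The normalization of Equation (4) is a one-liner; the sketch below uses made-up raw outputs purely for illustration.

```python
import numpy as np

def normalize(p):
    """Rescale raw predictor outputs so that they form a proper
    probability distribution, as in Equation (4): divide each
    component by the sum of all components."""
    p = np.asarray(p, dtype=float)
    return p / p.sum()

raw = np.array([0.30, 0.50, 0.10, 0.15])   # illustrative raw outputs
probs = normalize(raw)
assert np.isclose(probs.sum(), 1.0)
```

Note that this rescaling preserves the ratios between components, so the relative ranking of candidate characters is unchanged.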


Juergen Schmidhuber 2003-02-19