
PREDICTING CONDITIONAL PROBABILITIES

With the offline variant of the approach, $P$'s training phase is based on a set $F$ of training files. Assume that the alphabet contains $k$ possible characters $z_1, z_2, \ldots, z_k$. The (local) representation of $z_i$ is a binary $k$-dimensional vector $r(z_i)$ with exactly one non-zero component (at the $i$-th position). $P$ has $nk$ input units and $k$ output units, where $n$ is called the ``time-window size''. We insert $n$ default characters $z_0$ at the beginning of each file; the representation of the default character, $r(z_0)$, is the $k$-dimensional zero vector. The $m$-th character of file $f$ (counting from the first default character) is denoted $c^f_m$.
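For concreteness, the following Python sketch shows one way to implement the local representation and the default-character padding; the alphabet, the window size, and all helper names are illustrative assumptions, not taken from the text.

import numpy as np

# Illustrative alphabet of k characters z_1, ..., z_k (an assumption).
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
K = len(ALPHABET)
N = 5  # time-window size n (an arbitrary choice)

def r(ch):
    """The local representation r(z_i): a binary k-dimensional vector
    with a single non-zero component at position i. The default
    character z_0 (encoded here as None) maps to the zero vector."""
    v = np.zeros(K)
    if ch is not None:
        v[ALPHABET.index(ch)] = 1.0
    return v

def pad(text):
    """Insert n default characters z_0 at the beginning of a file."""
    return [None] * N + list(text)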

For all $f \in F$ and all possible $m>n$, $P$ receives as an input

\begin{displaymath}
r(c^f_{m-n})
\circ
r(c^f_{m-n+1})
\circ
\ldots
\circ
r(c^f_{m-1}) ,
\end{displaymath}

where $\circ$ is the concatenation operator for vectors. In response, $P$ produces a $k$-dimensional output vector $P^f_m$. Using back-propagation [36][9][16][19], $P$ is trained to minimize

\begin{displaymath}
\frac{1}{2}
\sum_{f \in F}
\sum_{m > n}
\left\| r(c^f_{m}) - P^f_m \right\| ^2.
\end{displaymath}
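Continuing the sketch above, the training pairs and this objective might be assembled as follows; the single-layer linear predictor trained by plain gradient descent (the simplest instance of back-propagation) is a stand-in assumption, since the text does not fix $P$'s architecture.

def training_pairs(text):
    """Yield (input, target) pairs for one file: the concatenation of
    the n previous character representations, and the target r(c^f_m)."""
    chars = pad(text)
    for m in range(N, len(chars)):  # all positions with n predecessors
        x = np.concatenate([r(c) for c in chars[m - N:m]])  # nk inputs
        yield x, r(chars[m])

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(K, N * K))  # toy predictor's weights

for epoch in range(20):
    for x, t in training_pairs("the quick brown fox"):
        p = W @ x                       # output vector P^f_m
        W -= 0.05 * np.outer(p - t, x)  # gradient step on (1/2)||t - P^f_m||^2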

Let $(P^f_m)_j$ denote the $j$-th component of the vector $ P^f_m$. Due to the local character representation, this error function is minimized if, for all $f$ and for all appropriate $m>n$, $(P^f_m)_i$ is equal to the conditional probability

\begin{displaymath}
P(c^f_m = z_i \mid c^f_{m-n}, \ldots, c^f_{m-1}).
\end{displaymath}
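This follows from a standard argument: fix a particular context $c^f_{m-n}, \ldots, c^f_{m-1}$ and let $t_i$ denote the $i$-th component of the target $r(c^f_m)$. Over all training examples sharing this context, the expected squared error decomposes as

\begin{displaymath}
E\left[ \left( t_i - (P^f_m)_i \right)^2 \right] =
E\left[ \left( t_i - E[t_i] \right)^2 \right] +
\left( E[t_i] - (P^f_m)_i \right)^2 ,
\end{displaymath}

so it is minimized by $(P^f_m)_i = E[t_i]$. Since $t_i$ equals $1$ exactly when $c^f_m = z_i$ and $0$ otherwise, $E[t_i]$ is precisely the conditional probability above.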

Since the components of $P^f_m$ need not sum to exactly one, for normalization purposes we define

\begin{displaymath}
P^f_m(i) = \frac {(P^f_m)_i}{\sum_{j=1}^k (P^f_m)_j }.
\end{displaymath}
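The normalization itself is a one-liner in the sketch's terms; the clipping step below is our added assumption (squared-error training can leave individual outputs non-positive), ensuring a valid probability distribution.

def normalize(p):
    """Map the raw output vector P^f_m to the normalized P^f_m(i)."""
    p = np.clip(p, 1e-12, None)  # assumption: guard against outputs <= 0
    return p / p.sum()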

