
PREDICTING CONDITIONAL PROBABILITIES

With the offline variant of the approach, $P$'s training phase is based on a set $F$ of training files. Assume that the alphabet contains $k$ possible characters $z_1, z_2, \ldots, z_k$. The (local) representation of $z_i$ is a binary $k$-dimensional vector $r(z_i)$ with exactly one non-zero component (at the $i$-th position). $P$ has $nk$ input units and $k$ output units, where $n$ is called the ``time-window size''. We insert $n$ default characters $z_0$ at the beginning of each file; the representation of the default character, $r(z_0)$, is the $k$-dimensional zero vector. The $m$-th character of file $f$ (counting from the first default character) is denoted $c^f_m$.
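For concreteness, the following Python sketch shows one way to implement the local representation and the default-character padding; the alphabet, the window size, and all helper names are illustrative assumptions, not taken from the text.

import numpy as np

# Illustrative alphabet of k characters z_1, ..., z_k (an assumption).
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
K = len(ALPHABET)
N = 5  # time-window size n (an arbitrary choice)

def r(ch):
    """The local representation r(z_i): a binary k-dimensional vector
    with a single non-zero component at position i. The default
    character z_0 (encoded here as None) maps to the zero vector."""
    v = np.zeros(K)
    if ch is not None:
        v[ALPHABET.index(ch)] = 1.0
    return v

def pad(text):
    """Insert n default characters z_0 at the beginning of a file."""
    return [None] * N + list(text)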

For all $f \in F$ and all possible $m>n$, $P$ receives as an input

\begin{displaymath}
r(c^f_{m-n})
\circ
r(c^f_{m-n+1})
\circ
\ldots
\circ
r(c^f_{m-1}) ,
\end{displaymath}

where $\circ$ is the concatenation operator for vectors. In response, $P$ produces a $k$-dimensional output vector $P^f_m$. Using back-propagation [36][9][16][19], $P$ is trained to minimize

\begin{displaymath}
\frac{1}{2}
\sum_{f \in F}
\sum_{m > n}
\left\| r(c^f_{m}) - P^f_m \right\| ^2.
\end{displaymath}
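Continuing the sketch above, the training pairs and this objective might be assembled as follows; the single-layer linear predictor trained by plain gradient descent (the simplest instance of back-propagation) is a stand-in assumption, since the text does not fix $P$'s architecture.

def training_pairs(text):
    """Yield (input, target) pairs for one file: the concatenation of
    the n previous character representations, and the target r(c^f_m)."""
    chars = pad(text)
    for m in range(N, len(chars)):  # all positions with n predecessors
        x = np.concatenate([r(c) for c in chars[m - N:m]])  # nk inputs
        yield x, r(chars[m])

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(K, N * K))  # toy predictor's weights

for epoch in range(20):
    for x, t in training_pairs("the quick brown fox"):
        p = W @ x                       # output vector P^f_m
        W -= 0.05 * np.outer(p - t, x)  # gradient step on (1/2)||t - P^f_m||^2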

Let $(P^f_m)_j$ denote the $j$-th component of the vector $ P^f_m$. Due to the local character representation, this error function is minimized if, for all $f$ and for all appropriate $m>n$, $(P^f_m)_i$ is equal to the conditional probability

\begin{displaymath}
P(c^f_m = z_i \mid c^f_{m-n}, \ldots, c^f_{m-1}).
\end{displaymath}
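This follows from a standard argument: fix a particular context $c^f_{m-n}, \ldots, c^f_{m-1}$ and let $t_i$ denote the $i$-th component of the target $r(c^f_m)$. Over all training examples sharing this context, the expected squared error decomposes as

\begin{displaymath}
E\left[ \left( t_i - (P^f_m)_i \right)^2 \right] =
E\left[ \left( t_i - E[t_i] \right)^2 \right] +
\left( E[t_i] - (P^f_m)_i \right)^2 ,
\end{displaymath}

so it is minimized by $(P^f_m)_i = E[t_i]$. Since $t_i$ equals $1$ exactly when $c^f_m = z_i$ and $0$ otherwise, $E[t_i]$ is precisely the conditional probability above.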

Since the components of $P^f_m$ need not sum to exactly one, for normalization purposes we define

\begin{displaymath}
P^f_m(i) = \frac {(P^f_m)_i}{\sum_{j=1}^k (P^f_m)_j }.
\end{displaymath}
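The normalization itself is a one-liner in the sketch's terms; the clipping step below is our added assumption (squared-error training can leave individual outputs non-positive), ensuring a valid probability distribution.

def normalize(p):
    """Map the raw output vector P^f_m to the normalized P^f_m(i)."""
    p = np.clip(p, 1e-12, None)  # assumption: guard against outputs <= 0
    return p / p.sum()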

