
MDL JUSTIFICATION OF FLATNESS CONDITION 2

Let us assume a sender wants to send a description of the function induced by $w$ to a receiver who knows the inputs $x_p$ but not the targets $y_p$, where $(x_p,y_p) \in D_0$. The MDL principle suggests that the sender should minimize the expected description length of the net function. Let $ED_{mean}(w, X_0)$ denote the mean value of $ED$ on the box. The expected description length is approximated by $\mu ED_{mean}(w,X_0) + B(w,X_0) + c$, where $c$ and $\mu$ are positive constants. One way of seeing this is to apply Hinton and van Camp's ``bits back'' argument to a uniform weight prior ($ED_{mean}$ corresponds to the output variance). However, we prefer a different argument: we encode each weight $w_{ij}$ of the box center $w$ by a bitstring according to the following procedure, for which a code sketch is given below ($ \Delta w_{ij} $ is given):
(0) Define a variable interval $I_{ij} \subset R$.
(1) Initialize $I_{ij}$ to the interval constraining possible weight values.
(2) While $I_{ij} \not\subset [w_{ij}-\Delta w_{ij}, w_{ij} + \Delta w_{ij}]$:
Divide $I_{ij}$ into two equally sized disjoint intervals $I_1$ and $I_2$.
If $w_{ij} \in I_1$ then $I_{ij} \leftarrow I_1$; write `1'.
If $w_{ij} \in I_2$ then $I_{ij} \leftarrow I_2$; write `0'.
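To make the procedure concrete, here is a minimal Python sketch of the bisection encoding for a single weight. The function name encode_weight and the default initial range are illustrative assumptions, not part of the original scheme; steps (0)-(2) above determine the loop.

def encode_weight(w, delta_w, lo=-10.0, hi=10.0):
    """Bisection-encode a single weight w (hypothetical initial range [lo, hi]).

    The interval [lo, hi] plays the role of I_{ij}: it is halved until it
    fits inside [w - delta_w, w + delta_w], and every halving writes one bit.
    Returns the bitstring and the final interval (one side of the "bit-box").
    """
    bits = []
    # Step (2): loop while I_{ij} is not contained in [w - delta_w, w + delta_w].
    while not (w - delta_w <= lo and hi <= w + delta_w):
        mid = (lo + hi) / 2.0
        if w <= mid:       # w lies in the lower half-interval I_1: write '1'
            hi = mid
            bits.append('1')
        else:              # w lies in the upper half-interval I_2: write '0'
            lo = mid
            bits.append('0')
    return ''.join(bits), (lo, hi)

# A large delta_w (much local flatness) needs few bits, a small one many:
print(encode_weight(0.3, 1.0))    # short bitstring
print(encode_weight(0.3, 0.05))   # longer bitstring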
The final set $\{I_{ij}\}$ corresponds to a ``bit-box'' within our box. This ``bit-box'' contains $M_w$'s center $w$ and is described by a bitstring of length $\tilde B(w,X_0) + c$, where the constant $c$ is independent of the box $M_w$. From $ED(w,w_b-w)$ ($w_b$ is the center of the ``bit-box'') and the bitstring describing the ``bit-box'', the receiver can compute $w$ as follows: he selects an initialization weight vector within the ``bit-box'' and uses gradient descent to decrease $B(w_a,X_0)$ until $ED(w_a,w_b-w_a)= ED(w,w_b-w)$, where $w_a$ denotes the receiver's current approximation of $w$ (continually updated and always kept within the ``bit-box''). This is like ``FMS without targets'': recall that the receiver knows the inputs $x_p$. Since $w$ is the weight vector with the highest degree of local flatness within the ``bit-box'', the receiver will find the correct $w$.
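Why the bitstring length is $\tilde B(w,X_0)$ up to a constant can be made explicit. Assume, for illustration only, that every weight is initially constrained to an interval of the same length $|I^0|$. Since $I_{ij}$ always contains $w_{ij}$, the halting condition of step (2) is met at the latest once $|I_{ij}| \leq \Delta w_{ij}$, so the number of bits written for $w_{ij}$ is at most $\lceil \log_2 (|I^0| / \Delta w_{ij}) \rceil$. Summing over all $L$ weights yields approximately $- \sum_{i,j} \log_2 \Delta w_{ij} + L \log_2 |I^0|$, which is (up to a factor $\frac{1}{\log 2}$ and an additive constant independent of $M_w$) the negative logarithm of the box volume.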

$ED(w,w_b-w)$ is described by a Gaussian distribution with mean zero. Hence the description length of $ED(w,w_b-w)$ is $\mu ED(w,w_b-w)$: for a zero-mean Gaussian, the Shannon-optimal code length $-\log p(x)$ grows proportionally to $x^2$, up to an additive constant (Shannon, 1948). $w_b$, the center of the ``bit-box'', cannot be known before training. However, we do know the expected description length of the net function, which is $\mu ED_{mean} + \tilde B(w,X_0) + c$ ($c$ is a constant independent of $w$). Let us approximate $ED_{mean}$:
$ED_{l,mean}(w,\delta w) := \frac{1}{V(\delta w)} \int_{AM_w} ED_l(w,\delta v) \, d\delta v = \frac{1}{V(\delta w)} \, 2^L \, \frac{1}{3} \sum_{i,j} (\delta w_{ij})^3 \prod_{(u,v) \neq (i,j)} \delta w_{uv} \sum_{k} \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2 = \frac{1}{3} \sum_{i,j} (\delta w_{ij})^2 \sum_{k} \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2$.
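The integral can be sanity-checked symbolically. The following sympy snippet treats the two-weight, single-output case and assumes the linearization $ED_l(w,\delta v) = \left( \sum_{i,j} \delta v_{ij} \frac{\partial o}{\partial w_{ij}} \right)^2$ (all variable names are ours): the cross term vanishes by symmetry, and the box mean is $\frac{1}{3} \sum_{i,j} (\delta w_{ij})^2 \left( \frac{\partial o}{\partial w_{ij}} \right)^2$.

import sympy as sp

# Perturbations v1, v2 range over the box; d1, d2 are the box extensions;
# g1, g2 stand for the output gradients do/dw_ij at the box center.
v1, v2 = sp.symbols('v1 v2')
d1, d2, g1, g2 = sp.symbols('d1 d2 g1 g2', positive=True)

# Linearized squared output change for a single output unit.
ED_l = (v1 * g1 + v2 * g2) ** 2

# Average over the box [-d1, d1] x [-d2, d2], whose volume is 4*d1*d2.
mean = sp.expand(sp.integrate(ED_l, (v1, -d1, d1), (v2, -d2, d2)) / (4 * d1 * d2))
print(mean)  # d1**2*g1**2/3 + d2**2*g2**2/3: the cross term has vanished

assert sp.simplify(mean - (d1**2 * g1**2 + d2**2 * g2**2) / 3) == 0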

Among those $w$ that lead to equal $B(w,X_0)$ (the negative logarithm of the box volume plus $L \log 2$), we want to find those with minimal description length of the function induced by $w$. Using Lagrange multipliers (viewing the $\delta w_{ij}$ as variables), it can be shown that $ED_{l,mean}$ is minimal under the condition $B(w,X_0) = \mathrm{constant}$ iff flatness condition 2 holds; the computation is sketched below. To conclude: for a given box volume, flatness condition 2 is necessary to minimize the expected description length of the function induced by $w$.
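To sketch the Lagrange computation (the shorthand $G_{ij} := \sum_k \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2$ and the dropped constants are ours): minimize $\frac{1}{3} \sum_{i,j} (\delta w_{ij})^2 G_{ij}$ subject to $- \sum_{i,j} \log \delta w_{ij} = \mathrm{constant}$. Setting the derivative of the Lagrangian with respect to $\delta w_{ij}$ to zero gives $\frac{2}{3} \delta w_{ij} G_{ij} - \frac{\lambda}{\delta w_{ij}} = 0$, hence $(\delta w_{ij})^2 G_{ij} = \frac{3 \lambda}{2}$ for all $(i,j)$. That is, $(\delta w_{ij})^2 \sum_k \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2$ takes the same value for every weight, which is flatness condition 2.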



