
MDL JUSTIFICATION OF FLATNESS CONDITION 2

Let us assume a sender wants to send a description of the function induced by $w$ to a receiver who knows the inputs $x_p$ but not the targets $y_p$, where $(x_p,y_p) \in D_0$. The MDL principle suggests that the sender should minimize the expected description length of the net function. Let $ED_{mean}(w, X_0)$ denote the mean value of $ED$ on the box. The expected description length is approximated by $\mu ED_{mean}(w,X_0) + B(w,X_0) + c$, where $c$ and $\mu$ are positive constants. One way of seeing this is to apply Hinton and van Camp's ``bits back'' argument to a uniform weight prior ($ED_{mean}$ corresponds to the output variance). However, we prefer a different argument: we encode each weight $w_{ij}$ of the box center $w$ by a bitstring according to the following procedure, for which a code sketch is given below ($ \Delta w_{ij} $ is given):
(0) Define a variable interval $I_{ij} \subset R$.
(1) Initialize $I_{ij}$ to the interval constraining possible weight values.
(2) While $I_{ij} \not\subset [w_{ij}-\Delta w_{ij}, w_{ij} + \Delta w_{ij}]$:
Divide $I_{ij}$ into two equally sized disjoint intervals $I_1$ and $I_2$.
If $w_{ij} \in I_1$ then $I_{ij} \leftarrow I_1$; write `1'.
If $w_{ij} \in I_2$ then $I_{ij} \leftarrow I_2$; write `0'.
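To make the procedure concrete, here is a minimal Python sketch of the bisection encoding for a single weight. The function name encode_weight and the default initial range are illustrative assumptions, not part of the original scheme; steps (0)-(2) above determine the loop.

def encode_weight(w, delta_w, lo=-10.0, hi=10.0):
    """Bisection-encode a single weight w (hypothetical initial range [lo, hi]).

    The interval [lo, hi] plays the role of I_{ij}: it is halved until it
    fits inside [w - delta_w, w + delta_w], and every halving writes one bit.
    Returns the bitstring and the final interval (one side of the "bit-box").
    """
    bits = []
    # Step (2): loop while I_{ij} is not contained in [w - delta_w, w + delta_w].
    while not (w - delta_w <= lo and hi <= w + delta_w):
        mid = (lo + hi) / 2.0
        if w <= mid:       # w lies in the lower half-interval I_1: write '1'
            hi = mid
            bits.append('1')
        else:              # w lies in the upper half-interval I_2: write '0'
            lo = mid
            bits.append('0')
    return ''.join(bits), (lo, hi)

# A large delta_w (much local flatness) needs few bits, a small one many:
print(encode_weight(0.3, 1.0))    # short bitstring
print(encode_weight(0.3, 0.05))   # longer bitstring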
The final set $\{I_{ij}\}$ corresponds to a ``bit-box'' within our box. This ``bit-box'' contains $M_w$'s center $w$ and is described by a bitstring of length $\tilde B(w,X_0) + c$, where the constant $c$ is independent of the box $M_w$. From $ED(w,w_b-w)$ ($w_b$ is the center of the ``bit-box'') and the bitstring describing the ``bit-box'', the receiver can compute $w$ as follows: he selects an initialization weight vector within the ``bit-box'' and uses gradient descent to decrease $B(w_a,X_0)$ until $ED(w_a,w_b-w_a)= ED(w,w_b-w)$, where $w_a$ denotes the receiver's current approximation of $w$ (continually updated and always kept within the ``bit-box''). This is like ``FMS without targets'': recall that the receiver knows the inputs $x_p$. Since $w$ is the weight vector with the highest degree of local flatness within the ``bit-box'', the receiver will find the correct $w$.
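Why the bitstring length is $\tilde B(w,X_0)$ up to a constant can be made explicit. Assume, for illustration only, that every weight is initially constrained to an interval of the same length $|I^0|$. Since $I_{ij}$ always contains $w_{ij}$, the halting condition of step (2) is met at the latest once $|I_{ij}| \leq \Delta w_{ij}$, so the number of bits written for $w_{ij}$ is at most $\lceil \log_2 (|I^0| / \Delta w_{ij}) \rceil$. Summing over all $L$ weights yields approximately $- \sum_{i,j} \log_2 \Delta w_{ij} + L \log_2 |I^0|$, which is (up to a factor $\frac{1}{\log 2}$ and an additive constant independent of $M_w$) the negative logarithm of the box volume.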

$ED(w,w_b-w)$ is described by a Gaussian distribution with mean zero. Hence the description length of $ED(w,w_b-w)$ is $\mu ED(w,w_b-w)$: for a zero-mean Gaussian, the Shannon-optimal code length $-\log p(x)$ grows proportionally to $x^2$, up to an additive constant (Shannon, 1948). $w_b$, the center of the ``bit-box'', cannot be known before training. However, we do know the expected description length of the net function, which is $\mu ED_{mean} + \tilde B(w,X_0) + c$ ($c$ is a constant independent of $w$). Let us approximate $ED_{mean}$:
$ED_{l,mean}(w,\delta w) := \frac{1}{V(\delta w)} \int_{AM_w} ED_l(w,\delta v) \, d\delta v = \frac{1}{V(\delta w)} \, 2^L \, \frac{1}{3} \sum_{i,j} (\delta w_{ij})^3 \prod_{(u,v) \neq (i,j)} \delta w_{uv} \sum_{k} \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2 = \frac{1}{3} \sum_{i,j} (\delta w_{ij})^2 \sum_{k} \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2$.
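The integral can be sanity-checked symbolically. The following sympy snippet treats the two-weight, single-output case and assumes the linearization $ED_l(w,\delta v) = \left( \sum_{i,j} \delta v_{ij} \frac{\partial o}{\partial w_{ij}} \right)^2$ (all variable names are ours): the cross term vanishes by symmetry, and the box mean is $\frac{1}{3} \sum_{i,j} (\delta w_{ij})^2 \left( \frac{\partial o}{\partial w_{ij}} \right)^2$.

import sympy as sp

# Perturbations v1, v2 range over the box; d1, d2 are the box extensions;
# g1, g2 stand for the output gradients do/dw_ij at the box center.
v1, v2 = sp.symbols('v1 v2')
d1, d2, g1, g2 = sp.symbols('d1 d2 g1 g2', positive=True)

# Linearized squared output change for a single output unit.
ED_l = (v1 * g1 + v2 * g2) ** 2

# Average over the box [-d1, d1] x [-d2, d2], whose volume is 4*d1*d2.
mean = sp.expand(sp.integrate(ED_l, (v1, -d1, d1), (v2, -d2, d2)) / (4 * d1 * d2))
print(mean)  # d1**2*g1**2/3 + d2**2*g2**2/3: the cross term has vanished

assert sp.simplify(mean - (d1**2 * g1**2 + d2**2 * g2**2) / 3) == 0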

Among those $w$ that lead to equal $B(w,X_0)$ (the negative logarithm of the box volume plus $L \log 2$), we want to find those with minimal description length of the function induced by $w$. Using Lagrange multipliers (viewing the $\delta w_{ij}$ as variables), it can be shown that $ED_{l,mean}$ is minimal under the condition $B(w,X_0) = \mathrm{constant}$ iff flatness condition 2 holds; the computation is sketched below. To conclude: for a given box volume, flatness condition 2 is necessary to minimize the expected description length of the function induced by $w$.
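To sketch the Lagrange computation (the shorthand $G_{ij} := \sum_k \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2$ and the dropped constants are ours): minimize $\frac{1}{3} \sum_{i,j} (\delta w_{ij})^2 G_{ij}$ subject to $- \sum_{i,j} \log \delta w_{ij} = \mathrm{constant}$. Setting the derivative of the Lagrangian with respect to $\delta w_{ij}$ to zero gives $\frac{2}{3} \delta w_{ij} G_{ij} - \frac{\lambda}{\delta w_{ij}} = 0$, hence $(\delta w_{ij})^2 G_{ij} = \frac{3 \lambda}{2}$ for all $(i,j)$. That is, $(\delta w_{ij})^2 \sum_k \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2$ takes the same value for every weight, which is flatness condition 2.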



