
A.2. HOW TO FLATTEN THE NETWORK OUTPUT

To find nets with flat outputs, two conditions will be defined to specify $B(w,D_0)$ (see section 3). The first condition ensures flatness. The second condition enforces ``equal flatness'' in all weight-space directions. In both cases, linear approximations will be made (to be justified in [4]). We are looking for weights (causing tolerable error) that can be perturbed without causing significant output changes. Perturbing the weights $w$ by $\delta w$ (with components $\delta w_{ij}$), we obtain $ED(w,\delta w) := \sum_{k} (o^k(w + \delta w) - o^k(w))^{2}$, where $o^k(w)$ expresses $o^k$'s dependence on $w$ (in what follows, $w$ will often be suppressed for convenience). Linear approximation (justified in [4]) gives us ``Flatness Condition 1'':
$\displaystyle ED(w,\delta w)
\approx \sum_{k} \left( \sum_{i,j} \frac{\partial o^k}{\partial w_{ij}} \delta w_{ij} \right)^{2}
\leq \sum_{k} \left( \sum_{i,j} \left\vert \frac{\partial o^k}{\partial w_{ij}} \right\vert \, \vert \delta w_{ij}\vert \right)^{2}
\leq \epsilon \mbox{ ,}$     (4)

where $\epsilon > 0$ defines tolerable output changes within a box and is small enough to allow for linear approximation (it does not appear in $B(w,D_0)$'s gradient, see section 3).
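
To make flatness condition 1 concrete, the following minimal Python sketch (assuming a small tanh network, a finite-difference Jacobian, and arbitrarily chosen sizes, none of which are prescribed by the text) compares the exact output change $ED(w,\delta w)$ with its linear approximation and with the upper bound appearing in (4):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
W1 = rng.normal(scale=0.5, size=(n_hid, n_in))   # hidden-layer weights
W2 = rng.normal(scale=0.5, size=(n_out, n_hid))  # output-layer weights
x = rng.normal(size=n_in)                        # one input pattern

def outputs(w_flat):
    # net output o^k as a function of the flattened weight vector w
    W1_ = w_flat[:W1.size].reshape(W1.shape)
    W2_ = w_flat[W1.size:].reshape(W2.shape)
    return W2_ @ np.tanh(W1_ @ x)

w = np.concatenate([W1.ravel(), W2.ravel()])

# Jacobian d o^k / d w_ij by central finite differences
h = 1e-6
J = np.zeros((n_out, w.size))
for j in range(w.size):
    e = np.zeros(w.size)
    e[j] = h
    J[:, j] = (outputs(w + e) - outputs(w - e)) / (2.0 * h)

delta_w = rng.normal(scale=1e-3, size=w.size)    # a small perturbation

ED_exact  = np.sum((outputs(w + delta_w) - outputs(w)) ** 2)
ED_linear = np.sum((J @ delta_w) ** 2)                  # first term in (4)
ED_bound  = np.sum((np.abs(J) @ np.abs(delta_w)) ** 2)  # middle term in (4)
print(ED_exact, ED_linear, ED_bound)  # ED_exact ~ ED_linear <= ED_bound
\end{verbatim}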

Many $M_w$ satisfy flatness condition 1. To select a particular, very flat $M_w$, the following ``Flatness Condition 2'' uses up degrees of freedom left by (4):

$\displaystyle \forall i,j,u,v: \quad
(\delta w_{ij})^{2} \sum_{k} \left( \frac{\partial o^k}{\partial w_{ij}} \right)^{2} =
(\delta w_{uv})^{2} \sum_{k} \left( \frac{\partial o^k}{\partial w_{uv}} \right)^{2} \mbox{ .}$     (5)

Flatness Condition 2 enforces equal ``directed errors''
$ED_{ij}(w,\delta w_{ij}) = \sum_{k} (o^k(w_{ij} + \delta w_{ij}) - o^k(w_{ij}))^{2} \approx
\sum_{k} (\frac{\partial o^k}{\partial w_{ij}} \delta w_{ij})^{2}$, where $o^k(w_{ij})$ has the obvious meaning. It can be shown (see [4]) that, for a given box volume, flatness condition 2 is needed to minimize the expected description length of the box center. Flatness condition 2 influences the algorithm as follows: (1) The algorithm prefers to increase the $\delta w_{ij}$'s of weights that are currently not important for generating the target output. (2) The algorithm enforces equal sensitivity of all output units with respect to the weights. Hence, the algorithm tends to group hidden units according to their relevance for groups of output units. Flatness condition 2 is essential: flatness condition 1 by itself corresponds to nothing more than first-order derivative reduction (ordinary sensitivity reduction, e.g. []). Linear approximation is justified by the choice of $\epsilon$ in equation (4).
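
The solved form of flatness condition 2 can be illustrated directly: choosing $\vert\delta w_{ij}\vert$ inversely proportional to the gradient norm $\sqrt{\sum_k (\partial o^k / \partial w_{ij})^2}$ equalizes all directed errors $ED_{ij}$, so the least relevant weights receive the widest intervals (point (1) above). In the sketch below the gradient norms and the constant are invented numbers, used only for illustration:

\begin{verbatim}
import numpy as np

# per-weight gradient norms g_ij = sqrt(sum_k (d o^k / d w_ij)^2);
# the numbers are invented for illustration
g = np.array([2.0, 0.5, 0.1, 1.0])

c = 0.01                             # common constant, fixed later via (4)/(6)
delta_w = c / g                      # flatness condition 2 in solved form

ED_dir = (delta_w ** 2) * (g ** 2)   # approximate directed errors ED_ij
print(delta_w)                       # widest interval for least relevant weight
print(ED_dir)                        # all entries equal c**2
\end{verbatim}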

We first solve equation (5) for $\vert\delta w_{ij}\vert = \vert\delta w_{uv}\vert
\left(\sqrt{\sum_k \left( \frac{\partial o^k}{\partial w_{uv}} \right)^2} /
\sqrt{\sum_k \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2} \right)$ (fixing $u,v$ for all $i,j$). Then we insert $\vert\delta w_{ij}\vert$ into equation (4) (replacing the second ``$\leq$'' in (4) by ``$=$''). This gives us an equation for the $\vert\delta w_{ij}\vert$ (which depend on $w$, but this is notationally suppressed):

\begin{displaymath}
\vert\delta w_{ij}\vert = \sqrt{\epsilon}/\left( \sqrt{\sum_{k} \left( \frac{\partial o^k}{\partial w_{ij}} \right)^{2}} \;
\sqrt{\sum_{k} \left( \sum_{u,v} \frac{ \left\vert \frac{\partial o^k}{\partial w_{uv}} \right\vert }{ \sqrt{\sum_{k'} \left( \frac{\partial o^{k'}}{\partial w_{uv}} \right)^{2}} } \right)^{2}} \right)
\mbox{.}
\end{displaymath} (6)

The $\vert\delta w_{ij}\vert$ approximate the $\Delta w_{ij}$ from section 2. Thus, $\tilde B(w,D_0)$ (see section 3) can be approximated by $B(w,D_0):= \sum_{i,j} - \log \vert \delta w_{ij}\vert $. This immediately leads to the algorithm given by equation (1).
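
As a sketch of how equation (6) and $B(w,D_0)$ could be evaluated once the output derivatives are available, the following Python fragment (the Jacobian and $\epsilon$ are placeholder values, not taken from the paper) computes the $\vert\delta w_{ij}\vert$ of (6), checks both flatness conditions numerically, and forms $B(w,D_0) = -\sum_{i,j} \log \vert\delta w_{ij}\vert$:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
J = rng.normal(size=(2, 6))   # J[k, m] = d o^k / d w_ij, flattened weight index m
eps = 1e-4

g = np.sqrt(np.sum(J ** 2, axis=0))                      # sqrt(sum_k (d o^k/d w_ij)^2)
s = np.sqrt(np.sum(np.sum(np.abs(J) / g, axis=1) ** 2))  # second square root in (6)
delta_w = np.sqrt(eps) / (g * s)                         # equation (6)

# flatness condition 2: (delta w_ij)^2 * sum_k (d o^k/d w_ij)^2 is constant
print((delta_w ** 2) * (g ** 2))

# flatness condition 1: the middle term of (4) equals eps
print(np.sum((np.abs(J) @ delta_w) ** 2), eps)

# the regularizer that enters the algorithm of equation (1)
B = -np.sum(np.log(delta_w))
print(B)
\end{verbatim}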

How can this approximation be justified? The learning process itself enforces its validity (see the justification in [4]). Initially, the conditions above are valid only in a very small environment of an ``initial'' acceptable minimum. But during the search for new acceptable minima with more associated box volume, the corresponding environments are enlarged, which implies that the absolute values of the entries in the Hessian decrease. It can be shown (see [4]) that the algorithm tends to suppress the following values: (1) unit activations, (2) first order activation derivatives, (3) the sum of all contributions of an arbitrary unit activation to the net output. Since weights, inputs, activation functions, and their first and second order derivatives are bounded, it can be shown (see [4]) that the entries in the Hessian decrease where the corresponding $\vert\delta w_{ij}\vert$ increase.

