
A.2. WHY DOES THE HESSIAN DECREASE?

Outline. This section shows that second order derivatives of the output function vanish during flat minimum search. This justifies the linear approximations in section 4.

Intuition. We show that the algorithm tends to suppress the following values: (1) unit activations, (2) first order activation derivatives, (3) the sum of all contributions of an arbitrary unit activation to the net output. Since weights, inputs, activation functions, and their first and second order derivatives are bounded, the entries in the Hessian decrease where the corresponding $\vert\delta w_{ij}\vert$ increase.

Formal details. We consider a strictly layered feedforward network with $K$ output units and $g$ layers. We use the same activation function $f$ for all units. For simplicity, in what follows we focus on a single input vector $x_p$. $x_p$ (and occasionally $w$ itself) will be notationally suppressed. We have


\begin{displaymath}
\frac{\partial y^{l}}{\partial w_{ij}} =
f'(s_{l})
\left\{ { y^{j} \mbox{ for } i=l
\atop
\sum_{m} w_{lm} \frac{\partial y^{m}}{\partial w_{ij}} \mbox{ for } i \not= l } \right\} \mbox{ ,}
\end{displaymath} (34)

where $y^{a}$ denotes the activation of the $a$-th unit, and $s_l = \sum_m w_{lm} y^m$.
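For instance, for a weight $w_{ij}$ with unit $i$ in the $(g-1)$th layer, applying (34) twice (only the $m=i$ summand survives) gives the illustrative special case

\begin{displaymath}
\frac{\partial o^{k}}{\partial w_{ij}} =
f'(s_{k}) \, w_{ki} \, \frac{\partial y^{i}}{\partial w_{ij}} =
f'(s_{k}) \, w_{ki} \, f'(s_{i}) \, y^{j} \mbox{ ,}
\end{displaymath}

which is exactly the pattern that produces the third, fourth, and fifth terms of equation (35) below.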

The last term of equation (1) (the ``regulator'') expresses output sensitivity (to be minimized) with respect to simultaneous perturbations of all weights. ``Regulation'' is done by equalizing the sensitivity of the output units with respect to the weights. The ``regulator'' does not influence the same particular units or weights for each training example. It may be ignored for the purposes of this section. Of course, the same holds for the first (constant) term in (1). We are left with the second term. With (34) we obtain:


\begin{eqnarray*}
\lefteqn{ \sum_{i,j} \log \sum_{k} \left( \frac{\partial o^{k}}{\partial w_{ij}} \right)^{2} \; = } \\
& & 2 \sum_{\mbox{\scriptsize unit $k$ in the $g$th layer}} (\mbox{\small fan-in of unit } k) \, \log \vert f'(s_{k})\vert \; + \\
& & 2 \sum_{\mbox{\scriptsize unit $j$ in the $(g-1)$th layer}} (\mbox{\small fan-out of unit } j) \, \log \vert y^{j}\vert \; + \\
& & \sum_{\mbox{\scriptsize unit $j$ in the $(g-1)$th layer}} (\mbox{\small fan-in of unit } j) \, \log \sum_{k} \left( f'(s_{k}) \, w_{kj} \right)^{2} \; + \\
& & 2 \sum_{\mbox{\scriptsize unit $j$ in the $(g-1)$th layer}} (\mbox{\small fan-in of unit } j) \, \log \vert f'(s_{j})\vert \; + \\
& & 2 \sum_{\mbox{\scriptsize unit $j$ in the $(g-2)$th layer}} (\mbox{\small fan-out of unit } j) \, \log \vert y^{j}\vert \; + \\
& & \sum_{\mbox{\scriptsize unit $j$ in the $(g-2)$th layer}} (\mbox{\small fan-in of unit } j) \, \log \sum_{k} \left( f'(s_{k}) \sum_{l} f'(s_{l}) \, w_{kl} \, w_{lj} \right)^{2} \; + \\
& & 2 \sum_{\mbox{\scriptsize unit $j$ in the $(g-2)$th layer}} (\mbox{\small fan-in of unit } j) \, \log \vert f'(s_{j})\vert \; + \\
& & 2 \sum_{\mbox{\scriptsize unit $j$ in the $(g-3)$th layer}} (\mbox{\small fan-out of unit } j) \, \log \vert y^{j}\vert \; + \\
& & \sum_{\mbox{\scriptsize $i,j$, where unit $i$ is in a layer with index $\leq g-3$}} \log \sum_{k} \left( f'(s_{k}) \sum_{l_1} f'(s_{l_1}) \, w_{k l_1} \sum_{l_2} w_{l_1 l_2} \frac{\partial y^{l_2}}{\partial w_{ij}} \right)^{2}
\end{eqnarray*} (35)

Let us have a closer look at this equation. We observe:
(1) Activations of units decrease in proportion to their fan-outs.
(2) First order derivatives of the activation functions decrease in proportion to their fan-ins.
(3) A term of the form $\sum_{k} \left( f'(s_{k}) \sum_{l_1}
f'(s_{l_1}) w_{k l_1} \sum_{l_2} ... \sum_{l_r} f'(s_{l_r}) w_{l_{r-1} l_r}
w_{l_r j} \right)^2$ expresses the sum of unit $j$'s squared contributions to the net output. Here $r$ ranges over $\{0, 1, \ldots, g-2 \}$, and unit $j$ is in the $(g-1-r)$th layer (for the special case $r=0$, we get $\sum_{k} \left( f'(s_{k}) w_{kj} \right)^{2}$). These terms also decrease in proportion to unit $j$'s fan-in. Analogously, equation (35) can be extended to the case of additional layers.
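As a concrete check of the first two terms of (35): for a weight $w_{kj}$ on a connection into output unit $k$, only $o^{k}$ depends on $w_{kj}$, and (34) yields

\begin{displaymath}
\log \sum_{k'} \left( \frac{\partial o^{k'}}{\partial w_{kj}} \right)^{2} =
\log \left( f'(s_{k}) \, y^{j} \right)^{2} =
2 \log \vert f'(s_{k})\vert + 2 \log \vert y^{j}\vert \mbox{ .}
\end{displaymath}

Summed over all weights into the output layer, $\log \vert f'(s_{k})\vert$ appears once per incoming connection of unit $k$ (its fan-in), and $\log \vert y^{j}\vert$ once per outgoing connection of unit $j$ (its fan-out), which are exactly the first and second terms of (35).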

Comment. Let us assume that achieving $f'(s_{j}) = 0$ and $f(s_{j}) = 0$ is ``difficult'', i.e., possible only by fine-tuning all weights on connections to unit $j$. Instead of minimizing $\vert f(s_{j})\vert$ or $\vert f'(s_{j})\vert$ by adjusting the net input of unit $j$ (which requires fine-tuning of many weights), our algorithm prefers pushing the weights $w_{kl}$ on connections to output units towards zero (other weights are less affected). If, on the other hand, achieving $f'(s_{j}) = 0$ and $f(s_{j}) = 0$ is not ``difficult'', then, unlike weight decay, our algorithm does not necessarily prefer weights close to zero. Instead, it prefers (possibly very strong) weights that push $f(s_j)$ or $f'(s_j)$ towards zero (e.g., with sigmoid units active in $[0,1]$, strong inhibitory weights are preferred; with Gaussian units, large absolute weight values are preferred). See the experiment in section 5.2.
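To make the sigmoid case concrete, consider the logistic activation $f(s) = 1/(1+e^{-s})$ as an illustrative choice. Here

\begin{displaymath}
f'(s) = f(s) \left( 1 - f(s) \right) \mbox{ ,} \qquad
f(s) \rightarrow 0 \mbox{ and } f'(s) \rightarrow 0 \mbox{ for } s \rightarrow - \infty \mbox{ ,}
\end{displaymath}

so a single strong inhibitory weight driving $s_j$ far into the negative regime already pushes both $f(s_j)$ and $f'(s_j)$ towards zero; no fine-tuning of the remaining weights on connections to unit $j$ is required.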

How does this influence the Hessian? The entries in the Hessian corresponding to output $o^k$ can be written as follows:

\begin{displaymath}
\frac{\partial^{2} o^{k}}{\partial w_{ij} \partial w_{uv}} =
\frac{f''(s_{k})}{(f'(s_{k}))^{2}} \,
\frac{\partial o^{k}}{\partial w_{ij}} \,
\frac{\partial o^{k}}{\partial w_{uv}} +
f'(s_{k}) \left( \sum_{l} w_{kl} \,
\frac{\partial^{2} y^{l}}{\partial w_{ij} \partial w_{uv}} +
\bar \delta_{ik} \, \frac{\partial y^{j}}{\partial w_{uv}} +
\bar \delta_{uk} \, \frac{\partial y^{v}}{\partial w_{ij}} \right) \mbox{ ,}
\end{displaymath} (36)

where $\bar \delta$ denotes the Kronecker delta. Searching for big boxes, we run into regions of acceptable minima whose $o^k$ are close to their targets (section 2). Thus, by scaling the targets, $\frac{f''(s_{k})}{(f'(s_{k}))^{2}}$ can be bounded. Therefore, the first term in equation (36) decreases during learning.
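As an illustrative computation (assuming logistic output units and targets scaled into, say, $[0.2, 0.8]$): from $f'(s) = f(s)(1 - f(s))$ and $f''(s) = f'(s)(1 - 2 f(s))$ we obtain

\begin{displaymath}
\frac{f''(s_{k})}{(f'(s_{k}))^{2}} = \frac{1 - 2 f(s_{k})}{f(s_{k}) \left( 1 - f(s_{k}) \right)} \mbox{ ,}
\end{displaymath}

whose absolute value stays below $0.6 / 0.16 = 3.75$ as long as $o^{k} = f(s_{k})$ remains within that range.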

According to the analysis above, the first order derivatives in the second term of (36) are pushed towards zero. So are the $w_{kl}$ of the sum in the second term of (36).

The only remaining expressions of interest are second order derivatives of units in layer $(g-1)$. The $\frac{\partial^{2}
y^{l}}{\partial w_{ij} \partial w_{uv}}$ are bounded if (a) the weights, (b) the activation functions, (c) their first and second order derivatives, and (d) the inputs are bounded. This is indeed the case, as will be shown for networks with one or two hidden layers:

Case 1: For a unit $l$ in the hidden layer of a net with a single hidden layer ($g=3$), we obtain


\begin{displaymath}
\left\vert \frac{\partial^{2} y^{l}}{\partial w_{ij} \partial w_{uv}} \right\vert =
\left\vert \bar \delta_{li} \, \bar \delta_{lu} \, f''(s_{l}) \, y^{j} y^{v} \right\vert < C_1
\mbox{ ,}
\end{displaymath} (37)

where $y^j, y^v$ are the components of an input vector $x_p$, and $C_1$ is a positive constant.

Case 2: For a unit $l$ in the third layer of a net with two hidden layers ($g=4$), we obtain


\begin{eqnarray*}
\left\vert \frac{\partial^{2} y^{l}}{\partial w_{ij} \partial w_{uv}} \right\vert
& = & \Big\vert \,
f''(s_{l}) \left( w_{li} \, f'(s_{i}) \, y^{j} + \bar \delta_{il} \, y^{j} \right)
\left( w_{lu} \, f'(s_{u}) \, y^{v} + \bar \delta_{ul} \, y^{v} \right) \; + \\
& & f'(s_{l}) \left( w_{li} \, \bar \delta_{iu} \, f''(s_{i}) \, y^{j} y^{v}
+ \bar \delta_{il} \, \bar \delta_{uj} \, f'(s_{j}) \, y^{v}
+ \bar \delta_{ul} \, \bar \delta_{iv} \, f'(s_{v}) \, y^{j} \right) \Big\vert \; < \; C_2
\mbox{ ,}
\end{eqnarray*} (38)

where $C_2$ is a positive constant. Analogously, the boundedness of second order derivatives can be shown for additional hidden layers.
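The recursions above are easy to check numerically. Below is a minimal sketch (assuming a logistic activation, randomly chosen small weights, and illustrative layer sizes; helper names such as dy_dw are hypothetical) that compares the first order derivatives obtained from recursion (34) with central finite differences for a strictly layered net with two hidden layers ($g=4$):

\begin{verbatim}
import numpy as np

def f(s):                      # logistic activation (illustrative choice)
    return 1.0 / (1.0 + np.exp(-s))

def f_prime(s):                # its first derivative f'(s) = f(s)(1 - f(s))
    return f(s) * (1.0 - f(s))

def forward(weights, x):
    """Strictly layered forward pass; returns net inputs s and activations y per layer."""
    y, s = [x], [None]
    for W in weights:
        s.append(W @ y[-1])
        y.append(f(s[-1]))
    return s, y

def dy_dw(weights, s, y, t, l, q, i, j):
    """Recursion (34): derivative of activation y[t][l] w.r.t. the weight weights[q][i, j]
    (the weight on the connection from unit j in layer q to unit i in layer q + 1)."""
    if t <= q:                                   # unit l lies below the weight: no influence
        return 0.0
    if t == q + 1:                               # the "i = l" branch of (34)
        return f_prime(s[t][l]) * y[q][j] if l == i else 0.0
    return f_prime(s[t][l]) * sum(               # the "i != l" branch of (34)
        weights[t - 1][l, m] * dy_dw(weights, s, y, t - 1, m, q, i, j)
        for m in range(weights[t - 1].shape[1]))

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]                             # illustrative: input, two hidden layers, output
weights = [rng.normal(scale=0.5, size=(sizes[t + 1], sizes[t])) for t in range(len(sizes) - 1)]
x = rng.normal(size=sizes[0])
s, y = forward(weights, x)

eps, worst = 1e-5, 0.0
for q in range(len(weights)):                    # check every weight in the net
    for i in range(weights[q].shape[0]):
        for j in range(weights[q].shape[1]):
            analytic = dy_dw(weights, s, y, len(sizes) - 1, 0, q, i, j)   # output unit 0
            Wp = [W.copy() for W in weights]; Wp[q][i, j] += eps
            Wm = [W.copy() for W in weights]; Wm[q][i, j] -= eps
            numeric = (forward(Wp, x)[1][-1][0] - forward(Wm, x)[1][-1][0]) / (2 * eps)
            worst = max(worst, abs(analytic - numeric))

print("max |recursion (34) - finite difference| =", worst)  # expected to be around 1e-8 or smaller
\end{verbatim}

Differentiating the same recursion once more yields the second order derivatives of the unit activations used in (36)-(38).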

Conclusion: As desired, our algorithm makes the Hessian entries $H_{ij, uv}^{k} = \frac{\partial^{2} o^{k}}{\partial w_{ij} \partial w_{uv}}$ decrease where $\vert\delta w_{ij}\vert$ or $\vert\delta w_{uv}\vert$ increase.

