FMS: A Novel Analysis

Simple basis functions (BFs). A BF is the function determining the activation of a code component in response to a given input. Minimizing $B$'s term

\begin{displaymath}
T1 \ := \ \sum_{i,j: \ i \in O \cup H}
\log \sum_{k \in O} \left(\frac{\partial y^k}{\partial w_{ij}}\right)^{2}
\end{displaymath}

obviously reduces output sensitivity with respect to weights (and therefore units). Hence $T1$ is responsible for pruning weights (and, therefore, units). $T1$ is also one reason why low-complexity (or simple) BFs are preferred: the precision (or complexity) required of a weight is mainly determined by $\frac{\partial y^k}{\partial w_{ij}}$.
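For concreteness, here is a minimal sketch (not from the paper) of how $T1$ could be computed for a toy two-layer net in JAX; the function \texttt{forward}, the parameter names \texttt{W1}/\texttt{W2}, and the constant \texttt{EPS} are illustrative assumptions, not part of FMS.

\begin{verbatim}
import jax
import jax.numpy as jnp

EPS = 1e-12  # guards the log numerically; not part of FMS itself

# Hypothetical toy net: input -> hidden code (H units) -> output (O units).
def forward(params, x):
    h = jnp.tanh(params["W1"] @ x)      # hidden (code) activations
    return jnp.tanh(params["W2"] @ h)   # output activations y^k, k in O

def T1(params, x):
    # jac[name][k, i, j] = d y^k / d w_{ij} for weight matrix `name`
    jac = jax.jacrev(forward)(params, x)
    total = 0.0
    for name in ("W1", "W2"):                       # weights into H and O units
        sq = jnp.sum(jac[name] ** 2, axis=0)        # sum over output units k
        total = total + jnp.sum(jnp.log(sq + EPS))  # sum over all weights w_{ij}
    return total

params = {"W1": 0.1 * jnp.ones((3, 4)), "W2": 0.1 * jnp.ones((2, 3))}
x = jnp.array([1.0, 0.5, -0.5, 0.2])
print(T1(params, x))
\end{verbatim}

Driving $T1$ down means making every output insensitive to each individual weight, which is exactly the pruning pressure described above.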

Sparseness. Because $T1$ tends to drive unit activations toward zero, it favors sparse codes. But $T1$ also favors a sparse hidden layer in the sense that few hidden units contribute to producing the output. $B$'s second term

\begin{displaymath}
T2 \ := \ W \log \sum_{k \in O}
\left( \sum_{i,j: \ i \in O \cup H}
\frac{\left\vert \frac{\partial y^k}{\partial w_{ij}} \right\vert}
{\sqrt{\sum_{k \in O} \left(\frac{\partial y^k}{\partial w_{ij}}\right)^{2}}} \right)^{2}
\end{displaymath}

punishes units with similar influence on the output. We reformulate it:
\begin{eqnarray*}
T2 & = & W \log
\left( \sum_{i,j: \ i \in O \cup H} \ \ \sum_{u,v: \ u \in O \cup H} \ \ \sum_{k \in O}
\frac{\left\vert \frac{\partial y^k}{\partial y^i} \right\vert \ \left\vert \frac{\partial y^k}{\partial y^u} \right\vert}
{\sqrt{\sum_{k \in O} \left(\frac{\partial y^k}{\partial y^i}\right)^{2}} \
\sqrt{\sum_{k \in O} \left(\frac{\partial y^k}{\partial y^u}\right)^{2}}} \right) \\
& = & W \log
\left( \left\vert O \right\vert \ \left\vert O \times H \right\vert
\ + \ \ldots \ + \
\sum_{i,j: \ i \in H} \ \ \sum_{u,v: \ u \in H} \ \ \sum_{k \in O}
\frac{\left\vert \frac{\partial y^k}{\partial y^i} \right\vert \ \left\vert \frac{\partial y^k}{\partial y^u} \right\vert}
{\sqrt{\sum_{k \in O} \left(\frac{\partial y^k}{\partial y^i}\right)^{2}} \
\sqrt{\sum_{k \in O} \left(\frac{\partial y^k}{\partial y^u}\right)^{2}}} \right) .
\end{eqnarray*}

The intermediate steps, and the mixed terms elided above, are given in [15]. We observe: (1) an output unit that is very sensitive to two given hidden units will contribute heavily to $T2$ (see the numerator in the last term of $T2$); (2) this large contribution can be reduced by making both hidden units have a large impact on other output units (see the denominator in the last term of $T2$).
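Analogously, here is a hypothetical JAX sketch of $T2$ in the same toy setting as the $T1$ sketch above (again, the network \texttt{forward} and \texttt{EPS} are illustrative assumptions, not part of FMS).

\begin{verbatim}
import jax
import jax.numpy as jnp

EPS = 1e-12  # numerical guard, not part of FMS

# Same hypothetical toy net as in the T1 sketch.
def forward(params, x):
    h = jnp.tanh(params["W1"] @ x)
    return jnp.tanh(params["W2"] @ h)

def T2(params, x):
    jac = jax.jacrev(forward)(params, x)       # jac[name][k, i, j] = d y^k / d w_{ij}
    n_out = forward(params, x).shape[0]        # |O|
    W = sum(p.size for p in params.values())   # number of weights
    inner = jnp.zeros(n_out)                   # holds sum_{i,j} (...) for each k
    for name in ("W1", "W2"):
        d = jac[name].reshape(n_out, -1)       # one column per weight w_{ij}
        norm = jnp.sqrt(jnp.sum(d ** 2, axis=0) + EPS)  # sqrt(sum_k (dy^k/dw_{ij})^2)
        inner = inner + jnp.sum(jnp.abs(d) / norm, axis=1)
    return W * jnp.log(jnp.sum(inner ** 2) + EPS)

params = {"W1": 0.1 * jnp.ones((3, 4)), "W2": 0.1 * jnp.ones((2, 3))}
x = jnp.array([1.0, 0.5, -0.5, 0.2])
print(T2(params, x))
\end{verbatim}

As stated in the summary below, $T1$ and $T2$ together make up $B$; a training objective would add $B$, weighted by a hyperparameter, to the reconstruction error (the weighting and any constant factors are omitted here).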

Few separated basis functions. Hence FMS tries to find a way of (1) using as few BFs as possible to determine the activation of each output unit, while simultaneously (2) using the same BFs to determine the activations of as many output units as possible (common BFs). (1) and $T1$ separate the BFs: the force towards simplicity (see $T1$) prevents input information from being channelled through a single BF, and the force towards few BFs per output makes them non-redundant. (1) and (2) together cause few, separated BFs to determine all outputs.
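This argument is phrased in terms of the hidden-unit sensitivities $\frac{\partial y^k}{\partial y^u}$. A small, purely illustrative JAX sketch (a toy decoder with hand-picked weights \texttt{W2}, not from the paper) shows how one could inspect them after training:

\begin{verbatim}
import jax
import jax.numpy as jnp

# Hypothetical decoder half: hidden code h -> outputs y.
def decode(W2, h):
    return jnp.tanh(W2 @ h)

def hidden_sensitivities(W2, h):
    # Matrix S with S[k, u] = d y^k / d y^u, evaluated at the code h.
    return jax.jacrev(decode, argnums=1)(W2, h)

# Toy numbers: a sparse, separated code shows few strong entries per row
# (few BFs per output), and near-zero columns correspond to pruned units.
W2 = jnp.array([[0.9, 0.0, 0.1],
                [0.0, 1.1, 0.0]])
h = jnp.array([0.5, -0.3, 0.0])
print(hidden_sensitivities(W2, h))   # shape (|O|, |H|)
\end{verbatim}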

Summary. Collectively $T1$ and $T2$ (which make up $B$) encourage sparse codes based on few separated, simple basis functions producing all outputs. Due to space limitations, a more detailed analysis (e.g., of the case of linear output activation functions) is left to the technical report [15] (available on the WWW).


