

FLAT MINIMUM SEARCH: REVIEW

FMS Overview. FMS is a general method for finding low-complexity networks with high generalization capability. FMS finds a large region in weight space such that each weight vector from that region has similarly small error. Such regions are called ``flat minima''. In MDL terminology, few bits of information are required to pick a weight vector in a ``flat'' minimum (corresponding to a low-complexity network) -- the weights may be given with low precision. In contrast, weights in a ``sharp'' minimum require a high-precision specification. As a natural by-product of net complexity reduction, FMS automatically prunes weights and units, and reduces output sensitivity with respect to remaining weights and units. Previous FMS applications focused on supervised learning (Hochreiter and Schmidhuber 1995, 1997a): FMS led to better stock market prediction results than ``weight decay'' and ``optimal brain surgeon'' (Hassibi and Stork 1993). In this paper, however, we will use it for unsupervised coding only.

Architecture. We use a 3-layer feedforward net. Each layer is fully connected to the next. Let $O,H,I$ denote index sets for output, hidden, input units, respectively. Let $\left\vert.\right\vert$ denote the number of elements in a set. For $l \in O \cup H$, the activation $y^l$ of unit $l$ is $y^l = f_l\left(s_l\right)$, where $s_l= \sum_m w_{lm} y^m$ is the net input of unit $l$ ($m \in H$ for $l \in O$ and $m \in I$ for $l \in H$), $w_{lm}$ denotes the weight on the connection from unit $m$ to unit $l$, $f_l$ denotes unit $l$'s activation function, and for $m \in I$, $y^m$ denotes the $m$-th component of an input vector. $W=\left\vert\left(O \times H\right) \cup \left(H \times I\right)\right\vert$ is the number of weights.
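
To make the notation concrete, the following JAX sketch implements the forward pass just described. The layer sizes, the logistic sigmoid chosen for the activation functions $f_l$, and the names (n_I, n_H, n_O, init_params, forward, W_HI, W_OH) are illustrative assumptions of this sketch, not specifications from the paper.

import jax
import jax.numpy as jnp

n_I, n_H, n_O = 8, 3, 8            # |I|, |H|, |O| -- assumed sizes

def init_params(key):
    k1, k2 = jax.random.split(key)
    return {
        "W_HI": 0.1 * jax.random.normal(k1, (n_H, n_I)),   # w_lm for l in H, m in I
        "W_OH": 0.1 * jax.random.normal(k2, (n_O, n_H)),   # w_lm for l in O, m in H
    }

def forward(params, x):
    # s_l = sum_m w_lm y^m ;  y^l = f_l(s_l), with f_l taken to be the logistic sigmoid here
    y_H = jax.nn.sigmoid(params["W_HI"] @ x)    # hidden activations
    y_O = jax.nn.sigmoid(params["W_OH"] @ y_H)  # output activations
    return y_O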

Algorithm. FMS' objective function $E$ features an unconventional error term:

\begin{displaymath}
B = \sum_{i,j \in O \times H \cup H \times I}
\log \sum_{k \in O} \left(\frac{\partial y^k}{\partial w_{ij}}\right)^{2}
+ W \log \sum_{k \in O} \left( \sum_{i,j \in O \times H \cup H \times I}
\frac{\left\vert\frac{\partial y^k}{\partial w_{ij}}\right\vert}
{\sqrt{\sum_{k \in O}\left(\frac{\partial y^k}{\partial w_{ij}}\right)^{2}}} \right)^{2}
\mbox{ .}
\end{displaymath}
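
As a concrete reading of this formula, the sketch below (continuing the JAX example above) evaluates $B$ for a single input by stacking the output derivatives $\frac{\partial y^k}{\partial w_{ij}}$ into a $\left\vert O\right\vert \times W$ matrix. The helper name B_term, the use of jax.jacrev, and the small epsilon guard against division by zero are choices of this sketch, not part of the paper.

def B_term(params, x):
    # Jacobian d y^k / d w_ij for every output unit k and every weight w_ij
    jac = jax.jacrev(forward)(params, x)
    J = jnp.concatenate(
        [j.reshape(n_O, -1) for j in jax.tree_util.tree_leaves(jac)], axis=1)  # shape (|O|, W)
    W_count = J.shape[1]
    eps = 1e-12                          # numerical guard, not part of the formula
    sq = jnp.sum(J ** 2, axis=0) + eps   # sum_k (dy^k/dw_ij)^2, one value per weight
    first = jnp.sum(jnp.log(sq))         # first term of B
    ratio = jnp.abs(J) / jnp.sqrt(sq)    # |dy^k/dw_ij| / sqrt(sum_k (dy^k/dw_ij)^2)
    second = W_count * jnp.log(jnp.sum(jnp.sum(ratio, axis=1) ** 2))  # second term of B
    return first + second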

$E = E_q + \lambda B$ is minimized by gradient descent, where $E_q$ is the training set mean squared error (MSE) and $\lambda$ is a positive ``regularizer constant'' scaling $B$'s influence. Defining $\lambda$ corresponds to choosing a tolerable error level (there is no a priori ``optimal'' way of doing so). $B$ measures the weight precision (number of bits needed to describe all weights in the net). Reducing $B$ without increasing $E_q$ means removing weight precision without increasing MSE. Given a constant number of output units, all of this can be done efficiently, namely, with standard BP's order of computational complexity. For details see Hochreiter and Schmidhuber's article (1997a) or their home pages. For even more general, algorithmic methods reducing net complexity see Schmidhuber (1997a).
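
Continuing the sketch, a minimal gradient-descent loop on $E = E_q + \lambda B$ might look as follows. The learning rate, the value of $\lambda$, the single-example squared error standing in for the training-set MSE, and the autoencoder-style choice of the input as its own target (matching the unsupervised coding setting of this paper) are assumptions of this sketch.

def loss(params, x, target, lam):
    y = forward(params, x)
    E_q = jnp.mean((y - target) ** 2)      # squared-error term (single example here)
    return E_q + lam * B_term(params, x)   # E = E_q + lambda * B

key, sub = jax.random.split(jax.random.PRNGKey(0))
params = init_params(key)
x = jax.random.uniform(sub, (n_I,))        # one toy input pattern
lam = 0.1                                  # assumed value of the regularizer constant
for _ in range(100):
    grads = jax.grad(loss)(params, x, x, lam)   # target = input, as in unsupervised coding
    params = jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)

For brevity this sketch differentiates $B$ with generic automatic differentiation; the efficient procedure with standard BP's order of computational complexity mentioned above is the one described by Hochreiter and Schmidhuber (1997a).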

