

THE ALGORITHM

Let $X_0 = \{x_p \mid (x_p,y_p) \in D_0 \}$ denote the inputs of the training set. We approximate $\Delta w(X)$ by $\Delta w(X_0)$, where $\Delta w(X_0)$ is defined like $\Delta w(X)$ in the previous section, with $X$ replaced by $X_0$. For simplicity, we abbreviate $\Delta w(X_0)$ by $\Delta w$ in what follows. Starting with a random initial weight vector, flat minimum search (FMS) tries to find a $w$ that not only has low $E(net(w),D_0)$ but also defines a box $M_w$ with maximal box volume $V(\Delta w)$ and, consequently, minimal $\tilde B(w,X_0) := - \log (\frac{1}{2^L} V(\Delta w)) = \sum_{i,j} - \log \Delta w_{ij}$. Note the relationship to MDL: $\tilde B$ is the number of bits required to describe the weights, whereas the number of bits needed to describe the $y_p$, given $w$ (with $(x_p,y_p) \in D_0$), can be bounded by fixing $E_{tol}$ (see appendix A.1). In the next section we derive the following algorithm. We use gradient descent to minimize $E(w,D_0) = E(net(w),D_0) + \lambda B(w,X_0)$, where $B(w,X_0) = \sum_{x_p \in X_0} B(w,x_p)$, and

\begin{displaymath}
B(w,x_p) =
\frac{1}{2} \left( - L \log \epsilon
+ \sum_{i,j} \log \sum_{k} \left( \frac{\partial o^{k}(w,x_p)}{\partial w_{ij}} \right)^{2}
+ L \log \sum_{i,j} \left( \sum_{k} \frac{ \left| \frac{\partial o^{k}(w,x_p)}{\partial w_{ij}} \right| }{ \sqrt{ \sum_{k} \left( \frac{\partial o^{k}(w,x_p)}{\partial w_{ij}} \right)^{2} } } \right)^{2} \right) \mbox{ .}
\end{displaymath} (1)

Here $o^k(w,x_p)$ is the activation of the $k$th output unit (given weight vector $w$ and input $x_p$), $\epsilon$ is a constant, and $\lambda$ is the regularization constant (or hyperparameter) that controls the trade-off between regularization and training error (see appendix A.1).
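As a concrete illustration of Eq. (1), the following minimal JAX sketch evaluates $B(w,x_p)$ for a toy two-layer network whose weights are stored in one flat vector. The names (net_output, flat_w, epsilon), the layer sizes, and the default value of epsilon are our own illustrative choices, not taken from the paper.

\begin{verbatim}
# Minimal JAX sketch of the penalty B(w, x_p) in Eq. (1), for a toy
# two-layer tanh net whose weights live in one flat vector.  All names
# (net_output, flat_w, epsilon) and all sizes are illustrative assumptions.
import jax
import jax.numpy as jnp

def net_output(flat_w, x):
    # flat_w packs W1 (3x2) and W2 (1x3); returns the output units o^k(w, x_p)
    W1 = flat_w[:6].reshape(3, 2)
    W2 = flat_w[6:9].reshape(1, 3)
    return W2 @ jnp.tanh(W1 @ x)

def B(flat_w, x, epsilon=1e-8):
    # Eq. (1): flatness penalty for a single input x_p
    L = flat_w.size                                      # number of weights
    J = jax.jacobian(net_output)(flat_w, x)              # J[k, ij] = d o^k / d w_ij
    sq = jnp.sum(J ** 2, axis=0)                         # sum_k (d o^k / d w_ij)^2
    first = jnp.sum(jnp.log(sq))                         # sum_ij log sum_k (...)^2
    ratio = jnp.sum(jnp.abs(J), axis=0) / jnp.sqrt(sq)   # sum_k |.| / sqrt(sum_k (.)^2)
    second = L * jnp.log(jnp.sum(ratio ** 2))
    return 0.5 * (-L * jnp.log(epsilon) + first + second)
\end{verbatim}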
To minimize $B(w,X_0)$, for each $x_p \in X_0$ we have to compute
\begin{displaymath}
\frac{\partial B(w,x_p)}{\partial w_{uv}} =
\sum_{k,i,j} \frac{\partial B(w,x_p)}{\partial \frac{\partial o^{k}(w,x_p)}{\partial w_{ij}}}
\frac{\partial^{2} o^{k}(w,x_p)}{\partial w_{ij} \partial w_{uv}}
\mbox{ for all } u,v \mbox{ . }
\end{displaymath} (2)

It can be shown that, by using Pearlmutter's and M{\o}ller's efficient second-order method, the gradient of $B(w,x_p)$ can be computed in $O(L)$ time (see appendix A.3 for details). Therefore, our algorithm has the same order of computational complexity as standard backpropagation.
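For readers who prefer code, the sketch below (continuing the one above) performs one gradient step on $E(w,D_0) = E(net(w),D_0) + \lambda B(w,X_0)$. It obtains $\partial B(w,x_p) / \partial w_{uv}$ by letting JAX differentiate straight through the Jacobian inside B, i.e., by generic nested automatic differentiation; this yields the quantity in Eq. (2) but does not implement the $O(L)$ Pearlmutter/M{\o}ller technique of appendix A.3. The plain squared-error term, the value of $\lambda$ (lambda_reg), the learning rate, and the toy data are assumptions, and the $E_{tol}$-based treatment of the training error is omitted.

\begin{verbatim}
# One FMS-style gradient step on E(w, D_0), continuing the sketch above.
# jax.grad differentiates through the Jacobian inside B (nested autodiff),
# which yields dB/dw_uv as in Eq. (2) but does NOT use the O(L)
# Pearlmutter/Moeller technique of appendix A.3.  The squared-error term,
# lambda_reg, lr, and the toy data are assumptions; E_tol is not handled.

def loss(flat_w, xs, ys, lambda_reg=0.01):
    preds = jax.vmap(net_output, in_axes=(None, 0))(flat_w, xs)    # batch of o(w, x_p)
    train_err = jnp.sum((preds - ys) ** 2)                         # stands in for E(net(w), D_0)
    penalty = jnp.sum(jax.vmap(B, in_axes=(None, 0))(flat_w, xs))  # sum_p B(w, x_p)
    return train_err + lambda_reg * penalty

def gradient_step(flat_w, xs, ys, lr=1e-3):
    return flat_w - lr * jax.grad(loss)(flat_w, xs, ys)            # gradient includes dB/dw_uv

# toy usage: four training pairs with 2-dim inputs and 1-dim targets
key = jax.random.PRNGKey(0)
flat_w = 0.1 * jax.random.normal(key, (9,))
xs = jax.random.normal(key, (4, 2))
ys = jnp.ones((4, 1))
flat_w = gradient_step(flat_w, xs, ys)
\end{verbatim}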

