Gradients of the error function

The results we are going to prove hold regardless of the particular kind of cost function used (as long as it is continuous in the output) and regardless of the particular algorithm employed to compute the gradient. Here we briefly explain how gradients are computed by the standard BPTT algorithm (e.g., [27], see also Chapter 14 for more details), because its analytical form is better suited to the analyses that follow. The error at time $t$ is denoted by $E(t)$. Considering only the error at time $t$, output unit $k$'s error signal is

\begin{displaymath}\delta_k(t)=
\frac{\partial E(t)}{\partial net_k(t)}\end{displaymath}

and some non-output unit $j$'s backpropagated error signal at time $\tau < t$ is

\begin{displaymath}\delta_j(\tau)=f'_j(net_j(\tau)) \ \left(\sum_i w_{ij} \ \delta_i(\tau+1)\right),\end{displaymath}

where

\begin{displaymath}net_i(\tau)=\sum_j w_{ij} \ a_j(\tau-1)\end{displaymath}

is unit $i$'s current net input,

\begin{displaymath}a_i(\tau)=f_i(net_i(\tau))\end{displaymath}

is the activation of a non-input unit $i$ with differentiable transfer function $f_i$, and $w_{ij}$ is the weight on the connection from unit $j$ to $i$. The corresponding contribution to $w_{jl}$'s total weight update is $\eta \ \delta_j(\tau) \ a_l(\tau-1)$, where $\eta$ is the learning rate, and $l$ stands for an arbitrary unit connected to unit $j$.
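To make the recursion concrete, the following is a minimal NumPy sketch of the backward pass just described, for a fully recurrent network with a single weight matrix $W$ with entries $w_{ij}$ (an illustrative assumption, not the paper's code); only the error signal injected at the final step $t$ is propagated backwards, matching the "error at time $t$" setting above, and `f_prime` stands for the derivative of the transfer function.

\begin{verbatim}
import numpy as np

def bptt_deltas(W, net, delta_t, f_prime):
    """Backpropagate the error signal delta(t) of the output units
    through earlier time steps (illustrative sketch only).

    W       : (n, n) matrix, W[i, j] = w_ij (connection from unit j to i)
    net     : list of net-input vectors net(tau), tau = 0..t
    delta_t : error signal delta(t) = dE(t)/dnet(t) at the last step
    f_prime : derivative of the transfer function, applied elementwise
    """
    t = len(net) - 1
    deltas = [None] * (t + 1)
    deltas[t] = delta_t
    # delta_j(tau) = f'_j(net_j(tau)) * sum_i w_ij * delta_i(tau + 1)
    for tau in range(t - 1, -1, -1):
        deltas[tau] = f_prime(net[tau]) * (W.T @ deltas[tau + 1])
    return deltas

def weight_update(deltas, activations, eta):
    """Accumulate the contributions eta * delta_j(tau) * a_l(tau - 1)
    to the update of each weight w_jl."""
    n = deltas[-1].shape[0]
    dW = np.zeros((n, activations[0].shape[0]))
    for tau in range(1, len(deltas)):
        dW += eta * np.outer(deltas[tau], activations[tau - 1])
    return dW
\end{verbatim}

With a logistic transfer function, for instance, one would pass `f_prime = lambda net: sigmoid(net) * (1 - sigmoid(net))`; the repeated multiplication by $f'(net)\,W^{\top}$ in the loop is precisely what drives the exponential decay analyzed in the next section.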