
The Task

A training sequence $p$ with $n_p$ discrete time steps (called an episode) consists of $n_p$ ordered pairs $(x^p(t),d^p(t)) \in R^n \times R^m$ , $0 < t \leq n_p$. At time $t$ of episode $p$ a learning system receives $x^p(t)$ as an input and produces the output $y^p(t)$. The goal of the learning system is to minimize

\begin{displaymath}
\hat{E}= \frac{1}{2} \sum_p \sum_t \sum_i (d^p_i(t)-y^p_i(t))^2 ,
\end{displaymath}

where $d^p_i(t)$ is the $i$th of the $m$ components of $d^p(t)$, and $y^p_i(t)$ is the $i$th of the $m$ components of $y^p(t)$.
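For concreteness, this objective can be written down directly in code. The following is a minimal sketch, not part of the original text: it assumes each episode is given as a list of pairs $(x^p(t),d^p(t))$ and the learning system is represented by an illustrative stateful callable (here called learner) with a reset() method that clears its short-term memory; both names are assumptions for illustration only.

\begin{verbatim}
import numpy as np

def total_error(episodes, learner):
    # episodes: list of episodes, each a list of (x, d) pairs,
    #           with x in R^n and d in R^m.
    # learner:  illustrative stateful callable mapping x to y in R^m,
    #           with a reset() method clearing its short-term memory.
    E_hat = 0.0
    for episode in episodes:
        learner.reset()                      # start of episode p
        for x, d in episode:                 # t = 1, ..., n_p
            y = learner(x)                   # y^p(t)
            E_hat += 0.5 * np.sum((np.asarray(d) - np.asarray(y)) ** 2)
    return E_hat                             # \hat{E}
\end{verbatim}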

In general, this task requires storage of input events in a short-term memory. Previous solutions to this problem have employed gradient-based dynamic recurrent nets (e.g., [Robinson and Fallside, 1987], [Pearlmutter, 1989], [Williams and Zipser, 1989]). In the next section an alternative gradient-based approach is described. For convenience, we henceforth drop the index $p$ that distinguishes the episodes.

The gradient of the error over all episodes is equal to the sum of the gradients for each episode. Thus we only require a method for minimizing the error observed during one particular episode:

\begin{displaymath}
\bar{E}= \sum_t E(t) ,
\end{displaymath}

where $E(t) = \frac{1}{2} \sum_i (d_i(t)-y_i(t))^2$. (In the practical on-line version of the algorithm below there will be no episode boundaries; one episode will 'blend' into the next [Williams and Zipser, 1989].)
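As a sketch of this on-line setting, under the same illustrative assumptions as above (a stateful callable learner and a stream of $(x(t),d(t))$ pairs, both names assumed for illustration), the per-step error $E(t)$ is simply accumulated over an unbounded stream with no resets, so one episode blends into the next:

\begin{verbatim}
import numpy as np

def online_errors(stream, learner):
    # stream:  an (unbounded) iterable of (x, d) pairs; there are no
    #          episode boundaries, so learner.reset() is never called
    #          and one episode 'blends' into the next.
    # Yields E(t) = 0.5 * sum_i (d_i(t) - y_i(t))^2 at each time step.
    for x, d in stream:
        y = learner(x)
        yield 0.5 * np.sum((np.asarray(d) - np.asarray(y)) ** 2)
\end{verbatim}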

