next up previous
Next: Dynamic Equilibria Through the Up: THE SYSTEM Previous: Outline and Motivation of

Formal Details

In the comparatively simple case considered here, the controller $C$ is a standard back-propagation network. There are discrete time steps. Each fovea trajectory involves $k$ discrete time steps 1 ... $k$. At time step $t$ of trajectory $p$, $C$'s input is the real-valued vector $x_p(t)$ which is determined by sensory perceptions from the artificial `fovea'. $C$'s output at time step $t$ of trajectory $p$ is the vector $c_p(t)$. At each time step $t$ motoric actions like `move fovea left', `rotate fovea' are based on $c_p(t)$. The actions cause a new input $x_p(t+1)$. The final desired input $d_{pfin}$ of the trajectory $p$ is a predefined activation pattern corresponding to the target to be found in a static visual scene. The task is to sequentially generate fovea trajectories such that for each trajectory $p$ $d_{pfin}$ matches $x_p(k)$. The final input error $e_{pfin}$ at the end of trajectory $p$ (externally interrupted at time step $k$) is

\begin{displaymath}e_{pfin} = (d_{pfin}-x_p(k))^T (d_{pfin}-x_p(k)). \end{displaymath}

Thus $e_{pfin}$ is determined by the differences between the desired final inputs and the actual final inputs.

In order to allow credit assignment to past output actions of the control network, we first train the model network $M$ (another standard back-propagation network) to emulate the visible environmental dynamics. This is done by training $M$ at a given time to predict $C$'s next input, given the previous input and output of $C$. The following discussion refers to the case where both $M$ and $C$ learn in parallel. In some of the experiments below we use two separate training phases for $M$ and $C$. However, the modifications are straight-forward and mainly notational.

$M$'s input vector at time $t$ of trajectory $p$ is the concatenation of $c_p(t)$ and $x_p(t)$. $M$'s real-valued output vector at time $t$ of trajectory $p$ is $m_p(t)$, where $ \mid m_p(t) \mid = \mid x_p(t) \mid$. (Here $\mid x \mid$ is the dimension of $x$, $M$ has as many output units as there are input units for $C$.) $m_p(t)$ is $M$'s prediction of $x_p(t+1)$. The error of $M$'s prediction at time $0 < t < k$ of trajectory $p$ is

\begin{displaymath}E_p(t) = (x_p(t+1) - m_p(t))^T (x_p(t+1) - m_p(t)). \end{displaymath}

$M$'s goal is to minimize $\sum_{p,t} E_p(t)$, which is done by conventional back-propagation [17][7][4][9]:

\begin{displaymath}\triangle W_M^T = -\alpha_M
\frac{\partial \sum_{p,t} E_p(t)}{\partial W_M} . \end{displaymath}

Here $W_M$ is $M$'s weight vector, $\triangle W_M$ its change caused by the back-propagation procedure, and $\alpha_M$ is $M$'s constant learning rate. (In the experiments described below we will deviate from pure gradient descent by changing $M$'s weights after each time step of each trajectory.)

$C$'s training phase is more complex than $M$'s. It is assumed that $ \sum_p e_{pfin}$ is a differentiable function of $W_C$, where $W_C$ is $C$'s weight vector. To approximate

\begin{displaymath}\frac{\partial \sum_p e_{pfin}}{\partial W_C}, \end{displaymath}

it is assumed that $M$ with fixed $W_M$ can substitute the environmental dynamics. As described below, $M$ is used to approximate the desired partial derivative, but only $C$'s weights are allowed to change, $W_M$ remains fixed. The `unfolding in time' algorithm [9][18] is applied to the recurrent combination of $M$ and $C$ (figure 3) to compute

\begin{displaymath}\triangle W_C^T = - \alpha_C \sum_p
\left( \frac{\partial m_p(k)}{\partial W_C} \right)^T
(d_{pfin} - x_p(k))
. \end{displaymath}

Here $\triangle W_C$ is $W_C$'s increment caused by the back-propagation procedure, and $\alpha_C$ is the learning rate of the controller. Note that the differences between target inputs and actual final inputs at the end of each trajectory are used for computing error signals for the controller. We do not use the differences between desired final inputs and predicted final inputs.

To apply the `unfolding in time' algorithm [9][18] to the recurrent combination of $M$ and $C$, do the following:

For all trajectories $p$:

1. During the activation spreading phase of $p$, for each time step $t$ of $p$ create a copy of $C$ (called $C(t)$) and a copy of $M$ (called $M(t)$).

2. Construct a large `unfolded' feed-forward back-propagation network consisting of $2k$ sub-modules by doing the following:

2.a) For $t>1$ replace each input unit $u$ of $C(t)$ by the unit in $M(t-1)$ which predicted $u$'s activation.

2.b) For $t \geq 1$: Replace each input unit of $M(t)$ whose activation was provided by an output unit $u$ of $C(t)$ by $u$.

3. Propagate the difference $(d_{pfin} - x_p(k))$ back through the entire `unfolded' network constructed in step 2. Change each weight of $C$ in proportion to the sum of the partial derivatives computed for the corresponding $k$ connection copies in the unfolded network. Do not change the weights of $M$.

Since the weights remain constant during the activation spreading phase of one trajectory, the practical algorithm used in the experiments does not really create copies of the weights. It is more efficient to introduce one additional variable for each controller weight: This variable is used for accumulating the corresponding sum of weight changes. During trajectory execution, it is convenient to push the time-varying activations of the units in $M$ and $C$ on stacks of activations, one for each unit. During the back-propagation phase these activations can be successively popped off for the computation of error signals.

next up previous
Next: Dynamic Equilibria Through the Up: THE SYSTEM Previous: Outline and Motivation of
Juergen Schmidhuber 2003-02-21

Back to Reinforcement Learning page
Back to main page on Learning attentive vision