In the comparatively simple case considered here, the controller is a standard back-propagation network $C$. Time is discrete: each fovea trajectory $p$ involves discrete time steps $1, \dots, n_p$. At time step $t$ of trajectory $p$, $C$'s input is the real-valued vector $x^p(t)$, which is determined by sensory perceptions from the artificial `fovea'. $C$'s output at time step $t$ of trajectory $p$ is the vector $o^p(t)$.
At each time step, motoric actions like `move fovea left' or `rotate fovea' are based on $o^p(t)$. The actions cause a new input $x^p(t+1)$. The final desired input $d^p$ of trajectory $p$ is a predefined activation pattern corresponding to the target to be found in a static visual scene. The task is to sequentially generate fovea trajectories such that for each trajectory $p$ the final input matches $d^p$.
The final input error at the end of trajectory $p$ (externally interrupted at time step $n_p$) is
$$E^p = \frac{1}{2}\left(d^p - x^p(n_p)\right)^T\left(d^p - x^p(n_p)\right).$$
Thus $E^p$ is determined by the differences between the desired final inputs and the actual final inputs.
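As a concrete illustration, the final input error is a simple squared-difference computation (a minimal numpy sketch; the vectors and names are illustrative, not taken from the original experiments):

```python
import numpy as np

def final_input_error(d, x_final):
    """E^p = 1/2 * (d^p - x^p(n_p))^T (d^p - x^p(n_p))."""
    diff = d - x_final
    return 0.5 * float(diff @ diff)

d = np.array([1.0, 0.0, 0.0])    # hypothetical desired final input d^p
x = np.array([0.8, 0.1, 0.0])    # hypothetical actual final input x^p(n_p)
print(final_input_error(d, x))   # 0.5 * (0.2^2 + 0.1^2) = 0.025
```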
In order to allow credit assignment to past output actions of the control network, we first train the model network $M$ (another standard back-propagation network) to emulate the visible environmental dynamics. This is done by training $M$ at a given time to predict $C$'s next input, given the previous input and output of $C$. The following discussion refers to the case where both $M$ and $C$ learn in parallel. In some of the experiments below we use two separate training phases for $M$ and $C$. However, the modifications are straightforward and mainly notational.
$M$'s input vector at time $t$ of trajectory $p$ is the concatenation of $x^p(t)$ and $o^p(t)$. $M$'s real-valued output vector at time $t$ of trajectory $p$ is $\hat{x}^p(t+1)$, where $\dim(\hat{x}^p(t+1)) = \dim(x^p(t+1))$. (Here $\dim(v)$ denotes the dimension of vector $v$; $M$ has as many output units as there are input units for $C$.) $\hat{x}^p(t+1)$ is $M$'s prediction of $x^p(t+1)$.
The error of $M$'s prediction at time $t$ of trajectory $p$ is
$$E_M^p(t+1) = \frac{1}{2}\left(x^p(t+1) - \hat{x}^p(t+1)\right)^T\left(x^p(t+1) - \hat{x}^p(t+1)\right).$$
$M$'s goal is to minimize $\sum_{p,t} E_M^p(t+1)$, which is done by conventional back-propagation [17][7][4][9]:
$$\Delta W_M = -\alpha_M \frac{\partial \sum_{p,t} E_M^p(t+1)}{\partial W_M},$$
where $W_M$ is $M$'s weight vector, $\Delta W_M$ is its increment, and $\alpha_M$ is $M$'s learning rate.
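One gradient step of this model-learning objective can be sketched for a linear $M$ (an illustrative numpy sketch; the dimensions, learning rate, and linearity are assumptions for the example, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_o = 4, 2                          # hypothetical input/action sizes
W_M = rng.normal(scale=0.1, size=(n_x, n_x + n_o))  # linear model network M

def model_step(W_M, x_t, o_t, x_next, lr=0.01):
    """Train M to predict x^p(t+1) from the concatenation of x^p(t) and o^p(t)."""
    inp = np.concatenate([x_t, o_t])     # M's input vector
    pred = W_M @ inp                     # M's prediction \hat{x}^p(t+1)
    err = pred - x_next
    E = 0.5 * float(err @ err)           # prediction error E_M^p(t+1)
    grad = np.outer(err, inp)            # dE/dW_M for the linear case
    return W_M - lr * grad, E

x_t, o_t = rng.normal(size=n_x), rng.normal(size=n_o)
x_next = rng.normal(size=n_x)
W_M, E1 = model_step(W_M, x_t, o_t, x_next)
_, E2 = model_step(W_M, x_t, o_t, x_next)
print(E2 < E1)                           # error on the same transition shrinks
```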
$C$'s training phase is more complex than $M$'s. It is assumed that $E^p$ is a differentiable function of $W_C$, where $W_C$ is $C$'s weight vector. To approximate $\frac{\partial E^p}{\partial W_C}$, gradient descent is performed:
$$\Delta W_C = -\alpha_C \frac{\partial E^p}{\partial W_C}.$$
Here $\Delta W_C$ is $W_C$'s increment caused by the back-propagation procedure, and $\alpha_C$ is the learning rate of the controller. Note that the differences between target inputs and actual final inputs at the end of each trajectory are used for computing error signals for the controller. We do not use the differences between desired final inputs and predicted final inputs.
To apply the `unfolding in time' algorithm [9][18] to the recurrent combination of $C$ and $M$, do the following:
For all trajectories $p$:
1. During the activation spreading phase of trajectory $p$, for each time step $t$ of $p$ create a copy of $C$ (called $C_t$) and a copy of $M$ (called $M_t$).
2. Construct a large `unfolded' feed-forward back-propagation network consisting of $2 n_p$ sub-modules by doing the following:
2.a) For $t > 1$: replace each input unit of $C_t$ by the unit in $M_{t-1}$ which predicted its activation.
2.b) For all $t$: replace each input unit of $M_t$ whose activation was provided by an output unit of $C$ by the corresponding output unit of $C_t$.
3. Propagate the difference
$d^p - x^p(n_p)$
back
through the entire `unfolded' network constructed in step 2.
Change each weight of $C$ in proportion to the sum of the
partial derivatives
computed for the corresponding connection copies in the unfolded
network. Do not change the weights of $M$.
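The steps above can be sketched for linear $C$ and $M$ (an illustrative numpy sketch; the sizes, linearity, and initial values are assumptions — the point is that the final-input difference flows back through the frozen model copies while only $C$'s weights accumulate changes):

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_o, T = 3, 2, 4                             # hypothetical sizes; T = n_p
W_C = rng.normal(scale=0.5, size=(n_o, n_x))      # controller C (trained)
W_M = rng.normal(scale=0.5, size=(n_x, n_x + n_o))  # model M (frozen)
d = np.array([1.0, -1.0, 0.5])                    # desired final input d^p

def unfolded_gradient(W_C, W_M, x0, d):
    # activation spreading through the unfolded chain C_1, M_1, ..., C_T, M_T
    xs = [x0]
    for t in range(T):
        o = W_C @ xs[-1]                          # C_t's action o^p(t)
        xs.append(W_M @ np.concatenate([xs[-1], o]))  # M_t predicts x^p(t+1)
    E = 0.5 * float((d - xs[-1]) @ (d - xs[-1]))
    # back-propagate the final-input difference through all copies
    Mx, Mo = W_M[:, :n_x], W_M[:, n_x:]           # M's weights on x- and o-inputs
    grad_C = np.zeros_like(W_C)
    delta_x = xs[-1] - d                          # dE/dx at the final input
    for t in reversed(range(T)):
        delta_o = Mo.T @ delta_x                  # error signal at C_t's outputs
        grad_C += np.outer(delta_o, xs[t])        # sum over connection copies
        delta_x = Mx.T @ delta_x + W_C.T @ delta_o  # back through M_t and C_t
    return grad_C, E

x0 = np.array([0.2, -0.1, 0.3])
grad_C, E = unfolded_gradient(W_C, W_M, x0, d)
W_C_new = W_C - 0.01 * grad_C                     # change only C; M stays fixed
```

The gradient can be verified against finite differences, which confirms that back-propagating through the frozen model copies yields the exact derivative for this linear case.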
Since the weights remain constant during the activation spreading phase of one trajectory, the practical algorithm used in the experiments does not really create copies of the weights. It is more efficient to introduce one additional variable for each controller weight: this variable is used for accumulating the corresponding sum of weight changes. During trajectory execution, it is convenient to push the time-varying activations of the units in $C$ and $M$ onto stacks of activations, one for each unit. During the back-propagation phase these activations can be successively popped off for the computation of error signals.
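The accumulator-and-stack bookkeeping can be sketched as follows (an illustrative numpy sketch of weight-shared back-propagation with one shared weight matrix; the tanh units and sizes are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(scale=0.5, size=(3, 3))   # one shared weight matrix
acc = np.zeros_like(W)                   # one accumulator variable per weight
stack = []                               # one activation record pushed per step
d = np.array([0.5, -0.5, 0.0])           # desired final activation

# activation spreading: push time-varying activations instead of copying weights
x = np.array([1.0, 0.5, -0.5])
for _ in range(3):
    pre = W @ x
    stack.append((x, pre))               # save what back-propagation will need
    x = np.tanh(pre)

# back-propagation: pop activations in reverse order, accumulate weight changes
delta = x - d                            # from the final error 1/2 ||x - d||^2
while stack:
    x_in, pre = stack.pop()
    delta = delta * (1.0 - np.tanh(pre) ** 2)   # through the tanh units
    acc += np.outer(delta, x_in)                # sum of derivatives over copies
    delta = W.T @ delta                         # error signal for the step before
```

After the loop, `acc` holds the summed partial derivatives over all weight copies, exactly as a single weight change per shared weight.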