
The basic approach

Figure 1 shows the basic components of a system consisting of a number of interacting neural networks. The heart of the system is a neural network with internal and external feedback, called the control network $C$. $C$ serves as a program executor. While producing a sequence of outputs $o_p(t)$ during the execution of program $p$, it receives as input a stationary vector $s_p$ representing the start state, a stationary vector $g_p$ representing a desired goal state, and time-varying input vectors $i_p(t)$ from the environment. The concatenation of $s_p$ and $g_p$ serves as a `program name'. We assume that $C$ has already learned to solve a number of tasks; that is, there are already various working programs that actually lead from the start states to the goal states by which the programs are indexed. These programs may have been learned by an algorithm for dynamic networks (as described by the authors mentioned above) or by a recursive application of the principle outlined below.
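To make the setup concrete, here is a minimal sketch of $C$'s interface, assuming JAX; the sizes, the single-layer recurrence, and all names are illustrative assumptions, not the original architecture:

\begin{verbatim}
import jax.numpy as jnp

def controller_step(c_params, h, s_p, g_p, i_t):
    # One step of the control network C: the stationary `program name'
    # (s_p, g_p) and the time-varying environmental input i_p(t) are
    # concatenated; h carries the internal feedback.
    (W_in, W_rec, b_h), (W_out, b_o) = c_params
    x = jnp.concatenate([s_p, g_p, i_t])
    h = jnp.tanh(W_in @ x + W_rec @ h + b_h)
    o_t = jnp.tanh(W_out @ h + b_o)   # output o_p(t)
    return h, o_t
\end{verbatim}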

Figure 1 shows a second important module: an adaptive evaluator network $E$, which receives as input a start state $s$ and a goal state $g$ and is trained to produce an output $e(s,g) \in [0,1]$ indicating whether $C$ knows a program that leads from the start state to the goal state, or `close' to it. An output of $0$ means that there is an appropriate sub-program; an output of $1$ means that there is no such sub-program. An output between $0$ and $1$ means that there is a sub-program leading from the start state to a state that comes close to the goal in a certain sense. This measure of closeness has to be provided by some evaluative process, which may or may not be adaptive (it might be an adaptive critic based on TD-methods, for instance). $E$ represents the system's current model of its own capabilities. We assume that $E$ has learned to correctly predict that each of the already existing sub-programs works. We also assume that $E$ can predict the closeness of a sub-program's end state to a goal state, given a start state. In the simplest case, $E$ can be trained during an exploratory phase in which various combinations of start and goal states are given to the program executor.
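In that simplest case, the exploratory phase reduces to ordinary supervised regression. The following sketch, again assuming JAX, regresses $E$ onto an observed closeness value; the particular closeness measure, the network shapes, and all names are hypothetical stand-ins:

\begin{verbatim}
import jax
import jax.numpy as jnp

def mlp(params, x):
    # Small two-layer net used for E in this sketch.
    (W1, b1), (W2, b2) = params
    return W2 @ jnp.tanh(W1 @ x + b1) + b2

def closeness(end_state, goal):
    # Hypothetical closeness measure in [0, 1]; 0 iff the goal is reached.
    return 1.0 - jnp.exp(-jnp.sum((end_state - goal) ** 2))

def e_loss(e_params, start, goal, end_state):
    # Squared error between E's prediction e(s, g) and the closeness
    # actually observed when the program executor ran from start.
    pred = jax.nn.sigmoid(mlp(e_params, jnp.concatenate([start, goal]))[0])
    return (pred - closeness(end_state, goal)) ** 2

grad_E = jax.grad(e_loss)   # standard supervised gradient for training E
\end{verbatim}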

Finally, the system contains an adaptive network $S$ which serves as a sub-goal generator. For a given task $p$ specified by a start/goal pair $(s_p^0, g_p = s_p^{n+1})$, the sub-goal generator receives as input the concatenation of $s_p^0$ and $s_p^{n+1}$.

The output of the sub-goal generator is a list of $n$ sub-goals $s_p^1, s_p^2, \ldots, s_p^n$. Like the goal, a sub-goal $s_p^i$ is an activation pattern describing the desired external input at the end of some sub-program, which is also the start input of the next sub-program. After training, the sub-goal generator's list of sub-goals should satisfy the following condition:

\begin{displaymath}
e(s_p^0,s_p^1) = e(s_p^1,s_p^2) = \ldots = e(s_p^n,s_p^{n+1}) = 0.
\end{displaymath}

This means that there exists a sub-program leading from the start state to the first sub-goal, another sub-program leading from the first sub-goal to the second sub-goal, and so on, until the final goal is reached.
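Once $E$ is trained, this condition is straightforward to test. A small sketch (any callable standing in for $E$'s forward pass; the tolerance is an assumption):

\begin{verbatim}
def chain_solved(evaluate, start, subgoals, goal, tol=1e-3):
    # True iff every copy of E along the chain reports (near) zero.
    states = [start] + list(subgoals) + [goal]
    return all(evaluate(s, g) < tol
               for s, g in zip(states[:-1], states[1:]))
\end{verbatim}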

How does the sub-goal generator, which is initially a tabula rasa, learn to generate appropriate sub-goals? We take $n+1$ copies of $E$ and connect them to the sub-goal generator as shown in figure 2. The desired output of each copy is $0$. Whenever a copy produces a positive output, error gradients are propagated through that copy down into the sub-goal generator. $E$ (and, of course, each of its copies) remains unchanged during this procedure. Only the weights of the sub-goal generator change, according to


\begin{displaymath}
\triangle W_S^T = \eta_S
\frac{\partial \sum_{i=0}^n \frac{1}{2}\, e(s_p^i,s_p^{i+1})^2}{\partial W_S}
= \eta_S \sum_{i=0}^n
\frac{\partial s_p^i}{\partial W_S}^T
\frac{\partial e(s_p^i,s_p^{i+1})}{\partial s_p^i}^T
e(s_p^i,s_p^{i+1}),
\end{displaymath}

where $\eta_S$ is the learning rate of the sub-goal generator, $W_S$ is its weight vector, and $\triangle W_S$ is its increment. For a given problem the procedure is iterated until the complete error is zero (corresponding to a solution obtained by combining the sub-programs) or until a local minimum is reached (no solution found). Gradient descent is thus used to search in sub-goal space. This works because the effects of programs are made differentiable with respect to program names (= start/goal combinations). This contrasts with the approach of `back-propagation through time', which makes the effects of programs differentiable with respect to whole action sequences instead of selectively focusing on more abstract, short representations of action sequences.
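The whole search can be stated compactly with automatic differentiation. The following sketch, assuming JAX, freezes $E$ and performs gradient descent on the summed squared evaluations with respect to $S$'s weights only; the networks, sizes, learning rate, and iteration count are illustrative assumptions, not the original implementation:

\begin{verbatim}
import jax
import jax.numpy as jnp

D, N_SUB = 4, 2   # assumed pattern dimensionality and number of sub-goals

def mlp(params, x):
    (W1, b1), (W2, b2) = params
    return W2 @ jnp.tanh(W1 @ x + b1) + b2

def init_mlp(key, n_in, n_hidden, n_out):
    k1, k2 = jax.random.split(key)
    return ((0.1 * jax.random.normal(k1, (n_hidden, n_in)),
             jnp.zeros(n_hidden)),
            (0.1 * jax.random.normal(k2, (n_out, n_hidden)),
             jnp.zeros(n_out)))

def evaluate(e_params, s, g):
    # e(s, g) in [0, 1]; E's weights stay fixed throughout the search.
    return jax.nn.sigmoid(mlp(e_params, jnp.concatenate([s, g]))[0])

def total_error(s_params, e_params, start, goal):
    # S maps the task (start, goal) to n sub-goals; the error is the sum
    # of squared outputs of the n+1 evaluator copies along the chain.
    subgoals = mlp(s_params,
                   jnp.concatenate([start, goal])).reshape(N_SUB, D)
    states = [start] + list(subgoals) + [goal]
    return sum(0.5 * evaluate(e_params, s, g) ** 2
               for s, g in zip(states[:-1], states[1:]))

grad_S = jax.grad(total_error)  # gradients reach S *through* the frozen E

key = jax.random.PRNGKey(0)
kE, kS, k1, k2 = jax.random.split(key, 4)
e_params = init_mlp(kE, 2 * D, 16, 1)           # stands in for a trained E
s_params = init_mlp(kS, 2 * D, 16, N_SUB * D)   # tabula rasa generator S
start, goal = jax.random.normal(k1, (D,)), jax.random.normal(k2, (D,))

for _ in range(200):  # iterate until the error vanishes or a minimum is hit
    g = grad_S(s_params, e_params, start, goal)
    s_params = jax.tree_util.tree_map(lambda p, dp: p - 0.1 * dp,
                                      s_params, g)
\end{verbatim}

Note that only s_params is updated; the frozen evaluator acts as a differentiable model of $C$'s competence through which the error signal reaches the sub-goal generator.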

