Consider figure 1.
An `animat' moves in the real plane defined by the and axis,
producing a trajectory in
. There are obstacles in the form of circular
swamps. As long as the `animat' does not cross a swamp,
there are no costs1
( negative
reinforcement). The -th swamp
with center
builds the basis of a
cone (growing in the third dimension) with tip
.
Crossing
costs
(1) |
A problem is defined by a start state and a goal state . In the above example, - start states and goal states are simply given by pairs of cartesian coordinates. We are looking for an action sequence leading from to with minimal costs.
It is true that in theory such sequences could be learned by conventional reinforcement learning algorithms (e.g. [Barto, 1989], [Barto et al., 1983], [Anderson, 1986], [Schmidhuber, 1991b], [Sutton, 1984], [Lin, 1991], [Williams, 1988], [Watkins, 1989]). For the sake of argument, assume that the maximal step size of the `animat' is just a tiny fraction of the obstacle diameter. Then all the above algorithms will take nearly forever to find appropriate cost-free trajectories for other than trivial start/goal combinations. One drawback of conventional algorithms is that they will try to learn each new task from scratch, instead of exploiting a possibility for speeding up learning and gaining efficiency by solving new tasks through composition of solutions for older tasks.