
DS Advantage 3: Straightforward Hierarchical Credit Assignment

There has been a lot of recent work on hierarchical DPRL. Some researchers address the case where an external teacher provides intermediate subgoals and/or prewired macro actions consisting of sequences of lower-level actions [Moore and Atkeson, 1993; Tham, 1995; Sutton, 1995; Singh, 1992; Humphrys, 1996; Digney, 1996; Sutton et al., 1999]. Others focus on the more ambitious goal of automatically learning useful subgoals and macros [Schmidhuber, 1991b; Eldracher and Baginski, 1993; Ring, 1991; Ring, 1994; Dayan and Hinton, 1993; Wiering and Schmidhuber, 1998; Sun and Sessions, 2000]. Compare also work presented at the recent NIPS*98 workshop on hierarchical RL organized by Doina Precup and Ron Parr [McGovern, 1998; Andre, 1998; Moore et al., 1998; Bowling and Veloso, 1998; Harada and Russell, 1998; Wang and Mahadevan, 1998; Kirchner, 1998; Coelho and Grupen, 1998; Huber and Grupen, 1998].

Most current work in hierarchical DPRL aims at speeding up credit assignment in fully observable environments. Approaches like HQ-learning [Wiering and Schmidhuber, 1998], however, additionally achieve a qualitative (as opposed to just quantitative) decomposition by learning to decompose problems that cannot be solved at all by standard DPRL into several DPRL-solvable subproblems and the corresponding macro-actions.

Generally speaking, non-trivial forms of hierarchical RL, even those with origins in the MDP framework, almost automatically run into problems of partial observability. Feudal RL [Dayan and Hinton, 1993], for instance, is subject to such problems (Ron Williams, personal communication). As Peter Dayan himself puts it (personal communication): "Higher level experts are intended to be explicitly ignorant of the details of the state of the agent at any resolution more detailed than their action choice. Therefore, the problem is really a POMDP from their perspective. It's easy to design unfriendly state decompositions that make this disastrous. The key point is that it is highly desirable to deny them information - the chief executive of [a major bank] doesn't really want to know how many paper clips his most junior bank clerk has - but arranging for this to be benign in general is difficult."

In the DS framework, however, hierarchical credit assignment via frequently used, automatically generated subprograms becomes trivial in principle. For instance, suppose policies are programs built from a general programming language that permits parameterized conditional jumps to arbitrary code addresses [Dickmanns et al., 1987; Ray, 1992; Wiering and Schmidhuber, 1996; Schmidhuber et al., 1997b; Schmidhuber et al., 1997a]. DS will simply keep successful hierarchical policies that partially reuse code (subprograms) via appropriate jumps. Again, partial observability is not an issue.
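To make this concrete, here is a minimal sketch in Python; the toy corridor environment, the four-instruction language, and the hill-climbing search are illustrative assumptions for this page, not the implementations used in the cited work. Each policy is a fixed-length program of (opcode, argument) pairs, one opcode is a parameterized conditional jump to an arbitrary address, and direct search simply keeps any mutated program whose total reward is at least as high as the incumbent's.

# Minimal sketch (toy environment, instruction set, and search are illustrative
# assumptions): policies are integer-encoded programs over a tiny instruction set
# with parameterized conditional jumps; direct search evaluates whole programs by
# their return only, with no per-step credit assignment.

import random

N_INSTR = 4          # 0: move left, 1: move right, 2: conditional jump, 3: halt
PROG_LEN = 12        # fixed program length for simplicity
MAX_STEPS = 50       # execution budget per rollout

def run(program, start=0, goal=9):
    """Execute a program in a toy 1-D corridor; reward 1 if the goal is reached."""
    pos, ip, steps = start, 0, 0
    while 0 <= ip < len(program) and steps < MAX_STEPS:
        op, arg = program[ip]
        if op == 0:
            pos = max(0, pos - 1)
        elif op == 1:
            pos = min(goal, pos + 1)
        elif op == 2:
            # parameterized conditional jump: if not yet at the goal, jump to
            # address `arg`, possibly reusing earlier code as a subprogram
            if pos != goal:
                ip = arg
                steps += 1
                continue
        elif op == 3:
            break
        ip += 1
        steps += 1
    return 1.0 if pos == goal else 0.0

def random_instruction():
    return (random.randrange(N_INSTR), random.randrange(PROG_LEN))

def mutate(program):
    child = list(program)
    child[random.randrange(PROG_LEN)] = random_instruction()
    return child

# Direct search: stochastic hill climbing over whole programs.
best = [random_instruction() for _ in range(PROG_LEN)]
best_return = run(best)
for _ in range(2000):
    cand = mutate(best)
    if run(cand) >= best_return:
        best, best_return = cand, run(cand)

print("best return:", best_return)

In this sketch a jump back to an earlier address effectively turns that stretch of code into a reusable subprogram, so a successful hierarchy is retained by the search as a whole; no separate machinery for assigning credit to individual subprograms, and no assumption of full observability, is needed.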



