
Advantage 6: Exploring Limited Spatio-Temporal Predictability

Knowledge of the world may boost performance, which is why exploration is a major RL research issue. How should a learner explore a spatio-temporal domain? By making predictions and learning from its successes and failures which aspects of the world are predictable and which are not.

Most previous work on exploring unknown data sets has focused on selecting single training exemplars maximizing traditional information gain [Fedorov 1972, Hwang et al. 1991, MacKay 1992, Plutowski et al. 1994, Cohn 1994]. Here the concept of surprise is typically defined in Shannon's sense [Shannon 1948]: an event's surprise value or information content is the negative logarithm of its probability. This inspired simple reinforcement learning approaches to pure exploration [Schmidhuber 1991a, Storck et al. 1995, Thrun and Möller 1992] that use adaptive predictors to predict the entire next input, given the current input and action. The basic idea is that the action-generating module gets rewarded whenever the predictor fails. Hence it is motivated to generate action sequences leading to yet unpredictable states that are ``informative'' in the classic sense. Some of these explorers actually like white noise simply because it is so unpredictable and thus conveys a lot of Shannon information. Compare alternative, hardwired exploration strategies [Sutton and Pinette 1985, Kaelbling 1993, Dayan and Sejnowski 1996, Koenig and Simmons 1996].
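For concreteness, here is a minimal Python sketch of such a prediction-error-driven explorer. It is an assumption-laden toy, not the cited systems: the tabular predictor and the names used below are illustrative placeholders. It also makes the flaw just mentioned explicit: white noise keeps the curiosity reward permanently high.

import numpy as np

class TabularPredictor:
    """Predicts the next observation for each (state, action) pair by a running average."""
    def __init__(self, n_states, n_actions, obs_dim):
        self.mean = np.zeros((n_states, n_actions, obs_dim))
        self.count = np.zeros((n_states, n_actions))

    def predict(self, s, a):
        return self.mean[s, a]

    def update(self, s, a, next_obs):
        self.count[s, a] += 1
        self.mean[s, a] += (next_obs - self.mean[s, a]) / self.count[s, a]

def curiosity_reward(predictor, s, a, next_obs):
    """Reward the action module with the predictor's error: surprising
    (badly predicted) transitions pay well.  Note that pure white noise
    keeps this reward high forever -- the weakness discussed in the text."""
    error = np.linalg.norm(next_obs - predictor.predict(s, a))
    predictor.update(s, a, next_obs)
    return error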

Most existing systems either always predict all details of the next input or are limited to picking out simple statistical regularities such as ``performing action A in discrete, fully observable environmental state B will lead to state C with probability 0.8.'' They are not able to limit their predictions solely to certain computable aspects of inputs [Schmidhuber and Prelinger 1993] or input sequences, while ignoring random and irregular aspects. For instance, they cannot even express (and therefore cannot find) complex, abstract, predictable regularities such as ``executing a particular sequence of eye movements, given a history of incomplete environmental inputs partially caused by a falling glass of red wine, will result in the view of a red stain on the carpet within the next 3 seconds, where details of the shape of the stain are expected to be unpredictable and left unspecified.''

General spatio-temporal abstractions and limited predictions of this kind apparently can be made only by systems that can run fairly general algorithms mapping input/action sequences to compact internal representations conveying only certain relevant information embedded in the original inputs. For instance, there are many different, realistic, plausible red stains -- all may be mapped onto the same compact internal representation predictable from all sequences compatible with the abstraction ``falling glass.'' If the final input sequence caused by eye movements scanning the carpet does not map onto the concept ``red stain'' (because the glass somehow decelerated just in time and for some strange reason never touched the ground), there will be a surprise. There won't be a surprise, however, if the stain exhibits a particular, unexpected, irregular shape, because there was no explicit confident expectation of a particular shape in the first place.
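To make the distinction concrete, here is a hedged Python sketch of such a limited prediction; the abstraction and prediction functions below are invented stand-ins chosen for the wine-glass example, not part of any described system. Surprise is registered only when a confident prediction about the abstract internal representation (IR) fails, never over unspecified details such as the stain's exact shape.

def abstract_ir(observation_sequence):
    """Maps a raw input sequence to a compact IR, here the single bit
    'a red stain is visible'.  All other detail is discarded."""
    return any(frame.get("red_stain", False) for frame in observation_sequence)

def predict_ir(earlier_ir):
    """Predicts the later IR from an earlier IR, e.g.
    'falling red wine glass' -> a red stain is expected."""
    return earlier_ir == "falling red wine glass"

def surprise(earlier_ir, later_observations):
    """Surprise occurs only if the confident IR prediction fails;
    unpredicted details of the observations cannot cause surprise."""
    return predict_ir(earlier_ir) != abstract_ir(later_observations)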

The central questions are: In a given environment, and in the absence of a state model, how can one extract the predictable concepts corresponding to algorithmic regularities that are not already known? How can novel spatio-temporal regularities be discovered automatically among the many random or unpredictable things that should be ignored? Which novel sequence-transforming algorithms do indeed compute reduced internal representations permitting reliable predictions?

Usually we cannot rely on a teacher telling the system which concepts are interesting, as in the EURISKO system [Lenat 1983]. The DPRL framework is ruled out by issues of partial observability. Lookup-table approaches like those used in more limited scenarios are infeasible due to the huge number of potentially interesting sequence-processing, event-memorizing algorithms.

On the other hand, it is possible to use DS-like methods for building a ``curious'' embedded agent that differs from previous explorers in that it can limit its predictions to fairly arbitrary, computable aspects of event sequences and thus can explicitly ignore almost arbitrary unpredictable, random aspects [Schmidhuber 1999]. It constructs initially random algorithms mapping event sequences to abstract internal representations (IRs). It also constructs algorithms predicting IRs from IRs computed earlier. Its goal is to learn novel algorithms creating IRs useful for correct IR predictions, without wasting time on those learned before. This is achieved by a co-evolutionary scheme involving two competing modules that collectively design single algorithms to be executed. The modules have actions for betting on the outcomes of IR predictions computed by the algorithms they have agreed upon. If their opinions differ, the system checks who is right, punishes the loser (the surprised one), and rewards the winner. A DS-like RL algorithm forces each module to increase its reward. This motivates each module to lure the other into agreeing upon algorithms involving predictions that surprise it. Since each module can essentially veto algorithms it does not consider profitable, the system is motivated to focus on those computable aspects of the environment where both modules still have confident but different opinions. Once both share the same opinion on a particular issue (via the loser's DS-based learning process, e.g., the winner is simply copied onto the loser), the winner loses a source of reward -- an incentive to shift the focus of interest onto novel, yet unknown algorithms. Simulations show that surprise-generation of this kind can actually help to speed up the collection of external reward [Schmidhuber 1999].
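As a rough illustration of this betting scheme, the following self-contained Python toy is a minimal sketch under strong simplifying assumptions (opinion tables over discrete ``issues'' and a ground-truth lookup standing in for actually executing the agreed-upon algorithm); it is not the system of [Schmidhuber 1999]. It shows the core loop: reward the module whose confident bet matches the outcome, punish the surprised loser, and let the loser adopt the winner's opinion, after which that disagreement ceases to be a source of reward.

class Module:
    def __init__(self):
        self.opinion = {}     # issue -> confident True/False bet (no entry acts as a veto)
        self.reward = 0.0

def play_round(left, right, issues, ground_truth):
    """One round: find an issue both modules confidently bet on but disagree about,
    settle it (here: a ground-truth lookup replaces running the algorithm),
    reward the winner, and punish and correct the surprised loser."""
    for issue in issues:
        if issue not in left.opinion or issue not in right.opinion:
            continue                      # a missing opinion is a veto
        if left.opinion[issue] == right.opinion[issue]:
            continue                      # agreement: no remaining reward source here
        outcome = ground_truth[issue]     # stand-in for executing the agreed algorithm
        winner, loser = (left, right) if left.opinion[issue] == outcome else (right, left)
        winner.reward += 1.0
        loser.reward -= 1.0
        loser.opinion[issue] = outcome    # loser adopts the winner's (correct) view
        return issue                      # focus shifts to the next open disagreement
    return None

The design point the toy makes explicit is the one stressed in the text: once the loser has copied the winner on an issue, that issue pays nothing, so the only way to keep earning reward is to move on to computable aspects of the environment that are not yet settled.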


Juergen Schmidhuber 2003-02-19

