
Typical Applications

Both DPRL and DS boast impressive practical successes. For instance, there is the world-class DPRL-based backgammon player [Tesauro 1994], although it is not yet quite clear (beyond mere intuition) why exactly it works so well. DS, in turn, has proven superior to traditional alternative methods in engineering domains such as wing design, combustion chamber design, and turbulence control (the historical origins of ``evolutionary computation''). Overviews with numerous references to earlier work are available [Schwefel 1995, Koumoutsakos 1998].

Will DS's successes in such domains eventually carry over to learning sequential behavior in domains traditionally approached by variants of DPRL? At the moment little work has been done on DS in search spaces whose elements are sequential behaviors [Schmidhuber et al. 1997b], but this may change soon. Of course, the most obvious temporal tasks to be attacked by DS are not board games but tasks that violate DPRL's Markovian assumptions: it does not make sense to apply low-bias methods like DS to domains that satisfy the preconditions of more appropriately biased DPRL approaches.

Parity. I will use the parity problem to illustrate this. The task is to separate bitstrings of length $n > 0$ ($n$ integer) containing an odd number of zeros from all others. In principle, $n$-bit parity is solvable by a 3-layer feedforward neural net with $n$ input units. But learning the task from training exemplars by, say, backpropagation, is hard for $n > 20$, due to such a net's numerous free parameters. On the other hand, a very simple finite state automaton with just one bit of internal state can correctly classify arbitrary bitstrings by processing them sequentially, one bit at a time, flipping the internal state bit whenever the current input bit is 0.
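
To make the automaton concrete, here is a minimal sketch (mine, not the paper's): a single state bit is flipped on every 0-input, and its final value is the class.

```python
# One-bit finite state automaton for the parity task: flip the internal
# state bit on every 0-input; the final state is the classification.

def parity_of_zeros(bits):
    state = 0                      # one bit of internal state
    for b in bits:
        if b == 0:
            state ^= 1             # flip on each zero
    return state                   # 1 iff odd number of zeros

assert parity_of_zeros([1, 0, 0, 1]) == 0         # two zeros: even
assert parity_of_zeros([0, 1, 1, 1, 0, 0]) == 1   # three zeros: odd
```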

A policy implementing such a sequential solution, however, cannot be efficiently learned by DPRL. The problem is that the task violates DPRL's essential Markovian precondition: the current input bit of a training sequence does not provide the information about the previous history that is necessary for correct classification.

Next we will see, however, that parity can be quickly learned by the most trivial DS method, namely, random search (RS). RS works as follows: REPEAT randomly initialize the policy and test the resulting net on a training set UNTIL a solution is found.
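
Stated as code, the loop looks as follows. This is a minimal sketch; `random_policy` and `classifies_correctly` are hypothetical names standing in for ``randomly initialize the policy'' and the per-sequence test (concrete versions appear in the sketch after the experiment description below).

```python
# Minimal random search (RS) sketch. `random_policy` and
# `classifies_correctly` are hypothetical helpers, not from the text.

def random_search(random_policy, classifies_correctly, training_set):
    trials = 0
    while True:                                    # REPEAT ...
        trials += 1
        policy = random_policy()                   # randomly initialize
        if all(classifies_correctly(policy, seq, target)
               for seq, target in training_set):   # test on training set
            return policy, trials                  # ... UNTIL solution found
```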

Experiment. Our policy is the weight matrix of a standard recurrent neural network. We use two architectures (A1, A2). A1($k$) is a fully connected net with 1 input unit, 1 output unit, and $k$ hidden units, each non-input unit receiving the traditional bias connection from a unit with constant activation 1.0. A2 is like A1(10), but less densely connected: each hidden unit receives connections from the input unit, the output unit, and itself; the output unit sees all other units. All activation functions are standard: $f(x)=(1 + e^{-x})^{-1} \in [0.0,1.0]$. Binary inputs are coded as -1.0 (for 0) and 1.0 (for 1). Sequence lengths $n$ are randomly chosen between 500 and 600. All variable activations are set to 0 at the beginning of each sequence. Target information is provided only at sequence ends (hence the relevant time delays comprise at least 500 steps; there are no intermediate rewards). Our training set consists of 100 sequences, 50 from class 1 (even number of zeros; target 0.0) and 50 from class 2 (odd; target 1.0). Correct classification of a sequence is defined as ``absolute error at sequence end below 0.1''. We stop RS once a random weight matrix (all weights randomly initialized in $[-100.0,100.0]$) correctly classifies all training sequences; then we test it on a separate test set of 100 sequences.
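
For concreteness, here is a sketch of one RS trial with architecture A1($k$), under one reading of the setup above; the synchronous update order and exact wiring are my assumptions, not details given in the text.

```python
import numpy as np

def sigmoid(x):
    # f(x) = (1 + e^{-x})^{-1}; clipping avoids overflow warnings for the
    # huge net inputs that weights in [-100, 100] produce
    return 1.0 / (1.0 + np.exp(-np.clip(x, -50.0, 50.0)))

def run_net(W, bits):
    """A1(k): 1 input unit, k hidden units, 1 output unit, plus a bias
    unit with constant activation 1.0. W has shape (k+1, k+3): one row
    per non-input unit, one column per source unit (input, k hidden
    units, output, bias)."""
    k = W.shape[0] - 1
    act = np.zeros(k + 1)                        # activations start at 0
    for bit in bits:
        x = 1.0 if bit == 1 else -1.0            # inputs coded as +1 / -1
        src = np.concatenate(([x], act, [1.0]))  # input, all units, bias
        act = sigmoid(W @ src)                   # synchronous update
    return act[-1]                               # output unit's activation

rng = np.random.default_rng()

def random_policy(k=1):
    # all weights drawn uniformly from [-100.0, 100.0]
    return rng.uniform(-100.0, 100.0, size=(k + 1, k + 3))

def classifies_correctly(W, bits, target):
    # correct iff absolute error at the sequence end is below 0.1
    return abs(run_net(W, bits) - target) < 0.1
```

Plugging `random_policy` and `classifies_correctly` into the RS loop above reproduces the REPEAT-UNTIL scheme. Note that with weights of magnitude up to 100 the sigmoids saturate almost immediately, so each random net effectively behaves like a discrete automaton; this may be part of the reason why pure guessing works at all here.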

In all simulations, RS eventually finds a solution that classifies all test set sequences correctly; average final absolute test set errors are always below 0.001, in most cases below 0.0001. In particular, RS with A1 ($k=1$) solves the problem within only 2906 trials on average (mean of 10 simulations). RS with A2 solves it within 2797 trials on average. RS with architecture A2, but without self-connections for hidden units, solves the problem within 250 trials on average. See a previous paper [Hochreiter and Schmidhuber 1997] for additional results in this vein.

RS is a dumb DS algorithm, of course. It will not work within acceptable time except for the most trivial problems. But this is beside the point of this section, whose purpose is to demonstrate that even primitive DS may yield results beyond DPRL's abilities. Later, however, I will discuss smarter DS methods.

What about GP? Given the RS results above, how can it be that parity is considered a difficult problem by many authors publishing in the DS-based field of ``Genetic Programming'' (GP), as can be seen by browsing through the proceedings of recent GP conferences? The reason is that most existing GP systems are extremely limited because they search spaces of programs that do not even allow for loops or recursion; to the best of my knowledge, the first exception was [Dickmanns et al. 1987]. Hence most GP approaches ignore a major motivation for searching program space, namely, the repetitive reuse of code in solutions with low algorithmic complexity [Kolmogorov 1965, Solomonoff 1964, Chaitin 1969, Li and Vitányi 1993].

Of course, all we need to make parity easy is a search space of programs that process inputs sequentially and allow for internal memory and loops or conditional jumps. A few thousand trials will then suffice to find a solution that generalizes perfectly to $n$-bit parity for arbitrary $n$, not just for special values like those used in the numerous GP papers on this topic (where typically $n \ll 30$).
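
As a toy illustration of this point (my own, not taken from the paper), consider random search over the smallest such program space: 2-state transition tables, i.e., programs with one memory bit and a single loop over the input. A table that fits a modest training set computes zero-parity exactly and therefore generalizes to arbitrary $n$.

```python
# Toy illustration: random search over 2-state transition tables
# (one memory bit + one loop over the input). A table that fits the
# training set computes zero-parity exactly, so it generalizes to any n.
import random

def classify(table, out, bits):
    s = 0                                   # one bit of memory
    for b in bits:
        s = table[(s, b)]                   # the single loop over the input
    return out[s]

def odd_zeros(bits):
    return bits.count(0) % 2

random.seed(0)
train = [[random.randint(0, 1) for _ in range(8)] for _ in range(50)]

trials = 0
while True:
    trials += 1
    table = {(s, b): random.randint(0, 1) for s in (0, 1) for b in (0, 1)}
    out = [random.randint(0, 1), random.randint(0, 1)]
    if all(classify(table, out, x) == odd_zeros(x) for x in train):
        break

print("solved after", trials, "trials")     # typically a few dozen
# the found program classifies n-bit parity for arbitrary n:
assert all(classify(table, out, [0] * n) == n % 2 for n in range(1, 1000))
```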


Juergen Schmidhuber 2003-02-19

