Both DPRL and DS boast impressive practical successes. For instance, there is the world-class DPRL-based backgammon player [TesauroTesauro1994], although it's not quite clear yet (beyond mere intuition) why exactly it works so well. And DS has proven superior to alternative traditional methods in engineering domains such as wing design, combustion chamber design, turbulence control (the historical origins of ``evolutionary computation''). There are overviews [SchwefelSchwefel1995,Koumoutsakos P. D.Koumoutsakos P. D.1998] with numerous references to earlier work.

Will DS' successes in such domains eventually carry over to learning sequential behavior in domains traditionally approached by variants of DPRL? At the moment little work has been done on DS in search spaces whose elements are sequential behaviors [Schmidhuber, Zhao, WieringSchmidhuber et al.1997b], but this may change soon. Of course, the most obvious temporal tasks to be attacked by DS are not board games but tasks that violate DPRL's Markovian assumptions. It does not make sense to apply low-bias methods like DS to domains that satisfy the preconditions of more appropriately biased DPRL approaches.

**Parity.**
I will use the parity problem to illustrate this.
The task requires to separate bitstrings of length ( integer)
with an odd number of zeros from others.
-bit parity in principle is solvable by a 3-layer
feedforward neural net with input units.
But learning the task from training exemplars
by, say, backpropagation, is hard for , due to
such a net's numerous free parameters.
On the other hand, a
very simple finite state automaton with just
one bit of internal state can correctly classify
arbitrary bitstrings by
sequentially processing them one bit at a time,
and switching the internal state bit on or off
depending on whether the current input is 1 or 0.

A policy implementing such a sequential solution, however, cannot efficiently be learned by DPRL. The problem is that the task violates DPRL's essential Markovian precondition: the current input bit in a training sequence does not provide the relevant information about the previous history necessary for correct classification.

Next we will see, however,
that parity can be quickly learned by
the most trivial DS method, namely, random search (RS).
RS works as follows: *REPEAT randomly initialize the policy and test
the resulting net on a training set UNTIL solution found*.

**Experiment.**
Our policy is the weight matrix of a standard recurrent neural network.
We use two architectures (A1, A2). A1() is a fully connected net with
1 input, 1 output, and hidden units, each non-input unit receiving the
traditional bias connection
from a unit with constant activation 1.0.
A2 is like A1(10), but less densely connected:
each hidden unit receives connections from the input unit,
the output unit, and itself;
the output unit sees all other units.
All activation functions are standard:
.
Binary inputs are -1.0 (for 0) and 1.0 (for 1).
Sequence lengths are randomly chosen between 500 and 600.
All variable activations are set to 0 at each sequence begin.
Target information is provided only at sequence ends
(hence the relevant time delays comprise at least 500 steps; there are no intermediate rewards).
Our training set consists of 100 sequences, 50 from class 1 (even; target 0.0)
and 50 from class 2 (odd; target 1.0).
Correct sequence classification is defined as ``absolute error
at sequence end below 0.1''.
We stop RS once a random
weight matrix (weights randomly initialized in [-100.0,100.0])
correctly classifies all training sequences.
Then we test on the test set (100 sequences).

In all simulations, RS eventually finds a solution that classifies all test set sequences correctly; average final absolute test set errors are always below 0.001 -- in most cases below 0.0001. In particular, RS with A1 () solves the problem within only 2906 trials (average of 10 trials). RS with A2 solves it within 2797 trials on average. RS for architecture A2, but without self-connections for hidden units, solves the problem within 250 trials on average. See a previous paper [Hochreiter SchmidhuberHochreiter Schmidhuber1997] for additional results in this vein.

RS is a dumb DS algorithm, of course. It won't work within acceptable time except for the most trivial problems. But this is besides the point of this section, whose purpose is to demonstrate that even primitive DS may yield results beyond DPRL's abilities. Later, however, I will discuss smarter DS methods.

**What about GP?** Given the RS results above, how can it be that parity
is considered a difficult problem by many authors publishing in the
DS-based field of ``Genetic Programming'' (GP), as can be seen by browsing
through the proceedings of recent GP conferences? The reason is: most
existing GP systems are extremely limited because they search in spaces
of programs that do not even allow for loops or recursion -- to the best
of my knowledge, the first exception was [Dickmanns, Schmidhuber, WinklhoferDickmanns
et al.1987]. Hence most
GP approaches ignore a major motivation for search in program space,
namely, the repetitive reuse of code in solutions with low algorithmic
complexity [KolmogorovKolmogorov1965,SolomonoffSolomonoff1964,ChaitinChaitin1969,Li VitányiLi Vitányi1993].

Of course, all we need to make parity easy is a search space of programs
that process inputs sequentially and allow for internal memory and loops
or conditional jumps. A few thousand trials will suffice to generalize
perfectly to n-bit parity for *arbitrary* , not just for special
values like those used in the numerous
GP papers on this topic (where typically ).

Back to Reinforcement Learning and POMDP page