The following experiments were conducted by the TUM students Josef Hochreiter and Klaus Bergner. See  and  for the full details.
1. Evolution of a flip-flop by reinforcement learning. A controller had to learn to behave like a flip-flop as described in . The main difficulty (the one that distinguishes this task from the supervised approach described in ) was that there was no teacher for the controller's (probabilistic) output units. Instead, the system had to generate alternative outputs in a variety of spatio-temporal contexts and to build a model of the often `painful' consequences. The controller's only goal information was the activation of a pain input unit whenever it produced an incorrect output. With , , and 20 out of 30 test runs with the parallel version required fewer than 1000000 time steps to produce an acceptable solution.
Why does solving the reinforcement flip-flop problem take much more time than solving the corresponding supervised flip-flop problem? One answer is: with supervised learning the controller gradient is given to the system, whereas with reinforcement learning the system has to discover the gradient itself.
2. `Non-Markovian' pole balancing. A cart-pole system was modeled by the same differential equations used for a related balancing task described in . In contrast to previous pole balancing tasks, however, no information about the temporal derivatives of cart position and pole angle was provided. (Similar experiments are mentioned in .)
In our experiments the cart-pole system would not stabilize indefinitely. However, significant performance improvement was obtained. The best results were achieved by using a `perfect model' as described above: Before learning, the average time until failure was about 25 time steps. Within a few hundred trials one could observe trials with balancing times of more than 1000 time steps. `Friendly' initial conditions could lead to balancing times of more than 3000 time steps.
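The observation restriction of experiment 2 can be illustrated with a minimal sketch. It assumes the frictionless cart-pole equations and the constants of the widely used benchmark version of this task; the exact equations and parameters of the experiments are given in the cited reference and may differ. All names (`step`, `observe`, `failed`) are hypothetical.

```python
import math

# Assumed constants (common benchmark values, not taken from this text).
GRAVITY = 9.8            # m/s^2
CART_MASS = 1.0          # kg
POLE_MASS = 0.1          # kg
POLE_HALF_LENGTH = 0.5   # m
FORCE_MAG = 10.0         # N, magnitude of the push applied to the cart
DT = 0.02                # s, Euler integration step
X_LIMIT = 2.4            # m, track half-width (assumed failure bound)
THETA_LIMIT = math.radians(12.0)  # assumed failure bound on the pole angle


def step(state, push_right):
    """Advance the full state (x, x_dot, theta, theta_dot) by one Euler step."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if push_right else -FORCE_MAG
    total_mass = CART_MASS + POLE_MASS
    pole_mass_length = POLE_MASS * POLE_HALF_LENGTH
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass
    return (
        x + DT * x_dot,
        x_dot + DT * x_acc,
        theta + DT * theta_dot,
        theta_dot + DT * theta_acc,
    )


def observe(state):
    """Return only cart position and pole angle, hiding both derivatives."""
    x, _, theta, _ = state
    return (x, theta)


def failed(state):
    """A trial ends when the cart leaves the track or the pole tilts too far."""
    x, _, theta, _ = state
    return abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT
```

Two states that differ only in their velocities yield the same value of `observe`, so the current input alone does not determine the future. The controller therefore has to extract velocity information from the history of its inputs.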
3. `Markovian' pole balancing with a vector-valued adaptive critic. The adaptive critic extension described above does not need a non-Markovian environment to demonstrate advantages over previous adaptive critics: A four-dimensional adaptive critic was tested on the pole balancing task described in . The critic component had four output units for predicting four different kinds of `pain', two for bumps against the two edges of the track and two for pole crashes.
None of the five test runs took more than 750 failures to achieve the first trial with more than 30000 time steps. (The longest run reported by  took about 29000 time steps; more than 7000 failures had to be experienced to achieve that result.)
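The four kinds of `pain' used in experiment 3 can be sketched as a vector-valued signal. This is only an illustration under stated assumptions: the failure bounds are the common benchmark values, the two pole-crash components are taken to mean falling to the left and to the right, and the names `pain_vector` and `scalar_pain` are hypothetical.

```python
import math

X_LIMIT = 2.4                     # assumed track half-width in metres
THETA_LIMIT = math.radians(12.0)  # assumed maximal pole angle


def pain_vector(x, theta):
    """Return four pain components: left bump, right bump, left fall, right fall."""
    return (
        1.0 if x < -X_LIMIT else 0.0,          # bump against the left edge
        1.0 if x > X_LIMIT else 0.0,           # bump against the right edge
        1.0 if theta < -THETA_LIMIT else 0.0,  # pole crashes to the left
        1.0 if theta > THETA_LIMIT else 0.0,   # pole crashes to the right
    )


def scalar_pain(x, theta):
    """Collapse the vector into the single failure signal of a scalar critic."""
    return 1.0 if any(pain_vector(x, theta)) else 0.0
```

A critic with four output units would predict each component separately, whereas a scalar critic sees only `scalar_pain` and cannot tell a bump against the left edge from a pole crash to the right.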