IDSIA, Corso Elvezia 36
I describe a novel paradigm for reinforcement learning (RL) with limited computational resources in realistic, non-resettable environments. The learner's policy is an arbitrary modifiable algorithm mapping environmental inputs and internal states to outputs and new internal states. As in the real world, any event in system life and any learning process computing policy modifications may affect future performance and preconditions of future learning processes. There is no need for pre-defined ``trials''. At a given time in system life, there is only a single training example to evaluate the current long-term usefulness of any given previous policy modification, namely the average reinforcement per time since that modification occurred. At certain times in system life called checkpoints, such singular observations are used by a stack-based backtracking method which invalidates certain previous policy modifications, such that the history of still valid modifications corresponds to a history of long-term reinforcement accelerations (up to the current checkpoint, each still valid modification has been followed by faster reinforcement intake than all the previous ones). Until the next checkpoint there is time to collect delayed reinforcement and to execute additional policy modifications; until then no previous policy modifications are invalidated; and until then the straightforward, temporary generalization assumption is: each modification that until now appeared to contribute to an overall speed-up will remain useful. The paradigm provides a foundation for (1) ``meta-learning'', and (2) multi-agent learning.
The principles are illustrated in (1) a single, self-referential, ``evolutionary'' system using an assembler-like programming language to modify its own policy, and to modify the way it modifies its policy, etc., and (2) another ``evolutionary'' system consisting of multiple agents, where each agent is in fact just a connection in a fully recurrent RL neural net.
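The checkpoint procedure summarized above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the systems described in this paper: the policy is simplified to a plain dictionary, and the names `Learner`, `Modification`, `step`, `modify`, and `checkpoint` are hypothetical. Each stack entry records the time and cumulative reinforcement at which a modification occurred, plus the information needed to undo it. At a checkpoint, the topmost modification is invalidated unless the average reinforcement per time since it occurred exceeds the average since the preceding valid modification (or since system start, if there is none). A modification with no elapsed time since it occurred is treated here as unsupported by evidence and is undone; that choice is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

_MISSING = object()  # marks "key did not exist before the modification"


@dataclass
class Modification:
    time: float           # system time when the modification occurred
    reward_so_far: float  # cumulative reinforcement at that time
    key: str              # which policy component was changed
    old_value: Any        # previous value, needed to undo the change


class Learner:
    """Hypothetical learner whose policy is simplified to a dictionary."""

    def __init__(self) -> None:
        self.policy: Dict[str, Any] = {}
        self.stack: List[Modification] = []
        self.time = 0.0
        self.reward = 0.0

    def step(self, reinforcement: float, dt: float = 1.0) -> None:
        # System life is never reset: time and reward only accumulate.
        self.time += dt
        self.reward += reinforcement

    def modify(self, key: str, new_value: Any) -> None:
        # Save undo information before changing the policy.
        old = self.policy.get(key, _MISSING)
        self.stack.append(Modification(self.time, self.reward, key, old))
        self.policy[key] = new_value

    def _rate_since(self, t0: float, r0: float) -> float:
        # Average reinforcement per time since (t0, r0).
        elapsed = self.time - t0
        if elapsed <= 0:
            return float("-inf")  # assumption: no elapsed time, no evidence
        return (self.reward - r0) / elapsed

    def _undo(self, mod: Modification) -> None:
        if mod.old_value is _MISSING:
            self.policy.pop(mod.key, None)
        else:
            self.policy[mod.key] = mod.old_value

    def checkpoint(self) -> int:
        """Pop and undo modifications until the top one shows a speed-up."""
        undone = 0
        while self.stack:
            top = self.stack[-1]
            if len(self.stack) > 1:
                prev = self.stack[-2]
                base_t, base_r = prev.time, prev.reward_so_far
            else:
                base_t, base_r = 0.0, 0.0  # compare against system start
            if (self._rate_since(top.time, top.reward_so_far)
                    > self._rate_since(base_t, base_r)):
                break  # the remaining history is a chain of accelerations
            self._undo(self.stack.pop())
            undone += 1
        return undone
```

Because each comparison involves only the top entry and the one beneath it, and popping repeats until the comparison succeeds, the surviving stack satisfies the chain of inequalities implied by the abstract: the reinforcement rate since each valid modification exceeds the rate since the one before it. In this sketch, calling `checkpoint()` after a period of poor reinforcement may undo several modifications in a row, since removing the top entry changes which pair is compared next.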
The biggest difference between time and space is that you can't reuse time.