There is an interesting alternative class of RL systems that also combines aspects of DS and traditional RL. It exploits ideas from the field of economy that seem naturally applicable in the context of multiagent RL, and (unlike Q-learning etc.) are not necessarily limited to learning reactive behavior. In what follows I will ignore the extensive original financial literature and briefly review solely some work on RL economies instead. Then I will relate this work to SSA.
Classifier Systems and Bucket Brigade. The first RL economy was Holland's meanwhile well-known bucket brigade algorithm for classifier systems [HollandHolland1985]. Messages in form of bitstrings of size can be placed on a global message list either by the environment or by entities called classifiers. Each classifier consists of a condition part and an action part defining a message it might send to the message list. Both parts are strings out of where the `_' serves as a `don't care' if it appears in the condition part. A non-negative real number is associated with each classifier indicating its `strength'. During one cycle all messages on the message list are compared with the condition parts of all classifiers of the system. Each matching classifier computes a `bid' based on its strength. The highest bidding classifiers may place their message on the message list of the next cycle, but they have to pay with their bid which is distributed among the classifiers active during the last time step which set up the triggering conditions (this explains the name bucket brigade). Certain messages result in an action within the environment (like moving a robot one step). Because some of these actions may be regarded as 'useful' by an external critic who can give payoff by increasing the strengths of the currently active classifiers, learning may take place. The central idea is that classifiers which are not active when the environment gives payoff but which had an important role for setting the stage for directly rewarded classifiers can earn credit by participating in `bucket brigade chains'. The success of some active classifier recursively depends on the success of classifiers that are active at the following time ticks. Unsuccessful classifiers are replaced by new ones generated with the help of GAs.
Holland's original scheme is similar to DPRL algorithms such as Q-learning in the sense that the bids of the agents correspond to predictions of future reward. On the other hand, the scheme does not necessarily require full environmental observability. Instead it partly depends on DS-like evolutionary pressure absent in traditional RL. E.g., bankrupt agents who spent all their money are removed from the system.
Holland's approach, however, leaves several loopholes that allow agents to make money without contributing to the success of the entire system. This led to a lot of follow-up research on more stable RL classifier economies [WilsonWilson1994,WilsonWilson1995,WeissWeiss1994,Weiss SenWeiss Sen1996] and other related types of RL ecomomies (see below). This work has closed some but possibly not all of the original loopholes.
Prototypical Self-referential Associating Learning Mechanisms. Pages 23-51 of earlier work [SchmidhuberSchmidhuber1987] are devoted to systems called PSALM1 - PSALM3. Like in Holland's scheme, competing/cooperating RL agents bid for executing actions. Winners may receive external reward for achieving goals. Unlike in Holland's scheme, agents are supposed to learn the credit assignment process itself (metalearning). For this purpose they can execute actions for collectively constructing / connecting / modifying agents, for assigning credit (reward) to agents, and for transferring credit from one agent to another. To the best of my knowledge, PSALMs are the first machine learning systems that enforce the important constraint of total credit conservation (except for consumption and external reward): no agent can generate money from nothing. This constraint is not enforced in the original bucket brigade economy, where new agents enter with freshly created money (this may cause inflation and other problems). Reference [SchmidhuberSchmidhuber1987] also inspired the neural bucket brigade (NBB), a slightly more recent but less general approach enforcing money conservation, where money is "weight substance" of a reinforcement learning neural net [SchmidhuberSchmidhuber1989].
Hayek Machine [Baum DurdanovicBaum Durdanovic1998]. PSALM3 does not strictly enforce individual property rights. For instance, agents may steal money from other agents and temporally use it in a way that does not contribute to the system's overall progress. Hayek machines are constitutional economies that apparently do not suffer from such ``parasite problems" (although there is no proof yet at the moment). Hayek2 [Baum DurdanovicBaum Durdanovic1998] -- the most recent Hayek variant -- is somewhat reminiscent of PSALM3 (agents may construct new agents, and there is money conservation), but avoids several loopholes in the credit assignment process that may allow some agent to profit from actions that are not beneficial for the system as a whole. Property rights of agents are strictly enforced. Agents can create children agents and invest part of their money into them, and profit from their success. Hayek2 learned to solve rather complex blocks world problems [Baum DurdanovicBaum Durdanovic1998]. The authors admit, however, that possibly not all potential loopholes rewarding undesired behavior have been closed by Hayek2.
COINs/Wonderful Life Utility [Wolpert, Tumer, FrankWolpert et al.1999]. Perhaps the only current RL economy with a sound theoretical foundation is the COllective Intelligence (COIN) by Wolpert, Tumer & Frank (1999). A COIN partitions its set of agents into ``subworlds." Each agent of a given subworld shares the same local utility function. Global utility is optimized by provably making sure that no agent in some subworld can profit if the system as a whole does not profit. COINs were successfully used in an impressive routing application.
SSA and Market Models. One way of viewing SSA in an economy-inspired framework is this: the current ``credit" of a policy change equals the reward since its creation divided by the time since its creation. A policy change gets undone as soon as its credit falls below the credit of the most recent change that has not been undone yet. After any given SSA invocation the yet undone changes reflect a success-story of long-term credit increases.
To use the parent/child analogy [Baum DurdanovicBaum Durdanovic1998]: at a given time, any still valid policy change may be viewed as a (grand-)parent of later policy changes for which it set the stage. Children that are more profitable than their ancestors protect the latter from being undone. In this way the ancestors profit from making successful children. Ancestors who increase the probability of non-profitable offspring, however, will eventually risk oblivion.