A small environment.
The first test environment consists of states. There are
possible actions, and 100 possible experiments.
The transition probabilities are:
A bigger environment.
The second test environment consists of states. There are
possible actions, and 10000 possible experiments.
The transition probabilities are:
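The test environments above can be sketched as non-deterministic Markov environments with a stochastic transition table. The exact numbers of states and actions are elided in the text, so the sizes below are hypothetical placeholders; a randomly generated table stands in for the unspecified transition probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- the actual counts are not given in this excerpt.
N_STATES, N_ACTIONS = 5, 2

# P[s, a] is a probability distribution over successor states.
P = rng.random((N_STATES, N_ACTIONS, N_STATES))
P /= P.sum(axis=-1, keepdims=True)

def experiment(s, a):
    """One experiment: execute action a in state s, sample the successor."""
    return rng.choice(N_STATES, p=P[s, a])
```

Each call to `experiment` corresponds to one of the "possible experiments" counted in the environment descriptions.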
For random search and for RDIA based on entropy changes (with parameters , , and ), Table 2 shows the number of time steps required to achieve given entropy values. The only state allowing the acquisition of a lot of information is . RDIA quickly discovers this and establishes a policy that moves the agent as quickly as possible to from every other state. Random exploration, in contrast, wastes much of its time on the states ... . Again, for small entropy margins the advantage of reinforcement driven information acquisition is less pronounced than in later stages, because Q-learning needs some time to settle on a strategy for performing experiments. As the entropy margin approaches the optimum, however, reinforcement driven information acquisition becomes much faster, by at least an order of magnitude.
Future work. 1. ``Exploitation/exploration trade-off''. In this paper, exploration was studied in isolation from exploitation. Is there an ``optimal'' way of combining both? For which kinds of goal-directed learning should RDIA be recommended? It is always possible to design environments where ``curiosity'' (the drive to explore the world) may ``kill the cat'', or at least have a negative influence on exploitation performance. This is illustrated by additional experiments presented in [10]: in one environment described therein, exploration speeds up exploitation; in another, curiosity slows exploitation down. The ``exploitation/exploration trade-off'' remains an open problem.
2. Additional experimental comparisons. It will be interesting to compare RDIA to stronger competitors than random exploration, such as Kaelbling's Interval Estimation algorithm [5].
3. Function approximators. It will also be interesting to replace the Q-table with function approximators such as backprop networks. Previous experimental work by various authors indicates that in certain environments this might improve performance, even though the theoretical foundations of combining Q-learning with function approximators are still weak.