... time[*]
Note that even for the more limited DPRL algorithms, until very recently there were no theorems guaranteeing finite convergence time [Kearns and Singh, 1999].
... $P(a \mid POL, S, E).$[*]
We do not use the expression policy for the conditional probability distribution $P$ itself; instead we reserve it for the agent's modifiable data structure POL.
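For instance (one illustrative possibility, not a definition used here), if POL stored a modifiable real-valued preference $POL(S,a)$ for each action, the induced distribution could be written as $P(a \mid POL, S, E) = \exp(POL(S,a)) \, / \, \sum_{a'} \exp(POL(S,a'))$; a policy-search step then modifies entries of POL and thereby changes $P$, which is exactly the distinction drawn above.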
... 0.[*]
The problem resembles another two-armed bandit problem for which there is an optimal method due to Gittins [Gittins, 1989]. Our unknown reward delays, however, prevent this method from being applicable -- it cannot discover that the current reward depends not on the current input but on an event in the past. In addition, Gittins' method needs to discount future rewards relative to immediate rewards. In any case, this footnote is beside the point of this paper, whose focus is on direct policy search -- Gittins' method is not a direct one.
... take?[*]
In fact, population-based approaches will suffer even more than simple SHC from unknown delays and stochasticity, simply because they need to test many policies, not just one.
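As a rough sketch of why this matters, consider the following illustrative Python fragment (the toy environment, the multi-trial noisy evaluation, and the perturbation scheme are all assumptions made up for this example, not the paper's setup): simple SHC pays one noisy, delay-afflicted evaluation per candidate policy, whereas a population of, say, 50 candidates would pay 50 such evaluations per generation.

    import random

    def noisy_return(policy, trials=20):
        # Hypothetical stand-in for running one policy in the environment:
        # rewards are stochastic and arrive with unknown delay, so a single
        # trial is unreliable and several trials must be averaged.
        true_quality = -sum((p - 0.5) ** 2 for p in policy)
        return sum(true_quality + random.gauss(0.0, 1.0) for _ in range(trials)) / trials

    def perturb(policy, step=0.1):
        # Random local change to the current policy parameters.
        return [p + random.uniform(-step, step) for p in policy]

    def simple_shc(n_steps=100, dim=5):
        # Simple stochastic hill-climbing: one candidate policy per step,
        # hence one (multi-trial) noisy evaluation per step.
        best = [random.random() for _ in range(dim)]
        best_score = noisy_return(best)
        for _ in range(n_steps):
            cand = perturb(best)
            score = noisy_return(cand)
            if score > best_score:
                best, best_score = cand, score
        return best

    # A population-based variant would evaluate many candidate policies per
    # generation, multiplying the number of noisy, delayed-reward evaluations
    # by the population size.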

