
1. INTRODUCTION

Efficient reinforcement learning requires a model of the environment. What is an efficient strategy for acquiring a model of a non-deterministic Markov environment (NME)? Reinforcement driven information acquisition (RDIA), the method described in this paper, extends previous work on ``query learning'' and ``experimental design'' (see e.g. [3] for an overview, see [1,6,4,7,2] for more recent contributions) and on ``active exploration'', e.g. [9,8,11]. The method combines the notion of information gain with reinforcement learning: the latter is used to devise exploration strategies that maximize the former. Experiments demonstrate significant advantages of RDIA.

Basic set-up / Q-Learning. An agent lives in an NME. At a given discrete time step $t$, the environment is in state $S(t)$ (one of $n$ possible states $S_1, S_2, \ldots, S_n$), and the agent executes action $a(t)$ (one of $m$ possible actions $a_1, a_2, \ldots, a_m$). This affects the environmental state: if $S(t) = S_i$ and $a(t) = a_j$, then with probability $p_{ijk}$, $S(t+1) = S_k$. At certain times $t$, there is reinforcement $R(t)$. At time $t$, the goal is to maximize the discounted sum of future reinforcement $\sum_{k = 0}^{\infty} \gamma^{k} R(t+k+1)$, where $0 < \gamma < 1$ is the discount factor. We use Watkins' Q-learning [12] for this purpose: $Q(S,a)$ is the agent's evaluation (initially zero) of the state/action pair $(S,a)$. The central loop of the algorithm is as follows:

1. Observe current state $S(t)$. Draw $p$ uniformly at random from $[0, 1]$. If $p \leq \mu$, where $\mu \in [0, 1]$ is a fixed exploration rate, pick $a(t)$ at random. Otherwise pick $a(t)$ such that $Q(S(t), a(t))$ is maximal.

2. Execute $a(t)$, observe $S(t+1)$ and $R(t)$.

3.

\begin{displaymath}
Q(S(t), a(t)) \leftarrow (1 - \alpha)\, Q(S(t), a(t)) + \alpha
\left( R(t) + \gamma \max_b Q(S(t+1), b) \right),
\end{displaymath}

where the learning rate $\alpha$ and the discount factor $\gamma$ satisfy $0 < \alpha < 1$, $0 < \gamma < 1$. Go to step 1.
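For concreteness, here is a minimal Python sketch of the basic set-up and the central loop above. The transition table $p_{ijk}$, the reinforcement function (a reward of 1 for entering state $S_n$), and all parameter values are invented purely for illustration; they are assumptions, not the settings used in the experiments reported later.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical NME: n states, m actions,
# p[i, j, k] = P(S(t+1) = S_k | S(t) = S_i, a(t) = a_j).
n, m = 4, 2
p = rng.dirichlet(np.ones(n), size=(n, m))
reward = np.zeros(n)
reward[n - 1] = 1.0      # invented: reinforcement for entering the last state

alpha, gamma, mu = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate
Q = np.zeros((n, m))              # Q(S, a), initially zero

s = 0
for t in range(10000):
    # Step 1: with probability mu pick a random action, else a greedy one.
    a = rng.integers(m) if rng.random() <= mu else int(np.argmax(Q[s]))
    # Step 2: execute a(t); the NME draws S(t+1) from p[s, a]; observe R(t).
    s_next = int(rng.choice(n, p=p[s, a]))
    r = reward[s_next]
    # Step 3: the Q-learning update.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
    s = s_next                    # go to step 1
\end{verbatim}

With a fixed $\mu > 0$, exploration here is purely random; the point developed in the following sections is that such undirected exploration can be replaced by exploration driven by expected information gain about the model of the NME.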

