Many biological learning systems, particularly the more complex ones, show an interplay of goal-directed learning and explorative learning. In addition to certain permanent goals (like avoiding pain), goals are generated whose sole immediate purpose is to increase knowledge about the world. So far this interplay has not been addressed at all in the connectionist literature.
The explorative side of learning (related to something usually called curiosity) is not completely unsupervised, as is sometimes assumed. Curiosity helps to learn how the world works, which in turn helps to satisfy certain goals. However, the goal-directedness of curiosity is less obvious than that of the algorithm described above (and of less general algorithms described in other papers on goal-directed learning).
Curiosity is related to what one already knows about the world: one gets curious as soon as one believes that there is something one does not know. However, the goal of learning how the world works is dominated by other goals (like avoiding pain): One does not know exactly how it feels to put one's hand into the meat grinder. However, one does not want to know.
Since curiosity makes sense only for systems that can dynamically influence what they learn, and since curiosity aims at minimizing a dynamically changing quantity, namely the degree of ignorance about something, it makes sense only in on-line learning situations where there is some sort of dynamic attention.
Thus the precondition of curiosity is something like our on-line learning algorithm described above. This algorithm builds a world model and uses it for goal-directed learning of the controller. The controller's potential for dynamic attention is given by the external feedback. The world model adapts itself to whatever the controller focuses on (see  for an application of similar adaptive control techniques to the problem of learning selective attention). The direct goal of curiosity and boredom is to improve the world model. The indirect goal is to ease the learning of new goal-directed action sequences. The contribution of this section is to show one possibility for augmenting the algorithm with curiosity and its counterpart, boredom.
The basic idea is simple: We introduce an additional reinforcement unit for the controller (see figure 1). This unit, hereafter called the curiosity unit, is activated by a process which at every time step measures the Euclidean distance between reality and the prediction of the model network. The activation of the curiosity unit is a function of this distance. Its desired value is a positive number corresponding to the ideal mismatch between belief and reality. The effect of the algorithm described in the first section is that there is positive reinforcement whenever the model network fails to correctly predict the environment. Thus the usual credit assignment process for the controller encourages certain past actions in order to repeat situations similar to the mismatch situation.
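As an illustration, the measurement feeding the curiosity unit can be sketched as follows. This is a minimal sketch; the function name, the `scale' parameter, and the simple linear mapping from distance to activation are assumptions for illustration, not details fixed by the algorithm:

```python
import numpy as np

def curiosity_activation(prediction, reality, scale=1.0):
    # Euclidean distance between the model network's prediction
    # and the actual next input ("reality").
    mismatch = np.linalg.norm(np.asarray(reality) - np.asarray(prediction))
    # Here the activation is simply a scaled linear function of the
    # mismatch; the text leaves the exact function open.
    return scale * mismatch
```

A perfect prediction yields zero activation and hence zero curiosity reinforcement; any prediction failure yields positive reinforcement for the controller.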
As soon as the model network has learned to correctly predict the environment in former `mismatch situations', actions leading to such situations are automatically de-reinforced, because the activation of the curiosity unit goes back to zero. Boredom becomes associated with the corresponding situations.
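A toy numerical illustration of this de-reinforcement (all numbers hypothetical): as the model network's prediction of a fixed situation improves, the curiosity unit's activation, and with it the reinforcement for re-entering that situation, decays toward zero:

```python
import numpy as np

reality = np.array([1.0, -0.5])   # fixed situation to be predicted
prediction = np.zeros(2)          # model network's initial prediction
curiosity_signal = []
for step in range(20):
    curiosity_signal.append(np.linalg.norm(reality - prediction))
    # a crude stand-in for one learning step of the model network
    prediction += 0.3 * (reality - prediction)
# curiosity_signal now decays monotonically toward zero: boredom.
```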
The important point is: The same complex mechanism which is used for `normal' goal-directed learning is used for implementing curiosity and boredom. There is no need to devise a separate system which aims at improving the world model.
The controller's credit assignment process is aimed at repeatedly entering situations where the model network's performance is not optimal. It is important to observe that this process itself makes use of the model network! The model network has to predict the activations of the curiosity unit. Thus the model network partly has to model its own ignorance: it has to learn that it does not know certain details.
What is the ideal mismatch mentioned above? In conventional AI the saying goes that a system cannot learn something that it does not already almost know. If we adopt this view, then a consequence is that the function translating mismatches into reinforcement is not a linear one: zero reinforcement should be given in case of perfect matches, high reinforcement in case of `near-misses', and low reinforcement again in case of strong mismatches. This corresponds to a notion from `esthetic information theory' which tries to explain the feeling of `beauty' by means of the quotient of `subjective complexity' and `subjective order', or the quotient of `unfamiliarity' and `familiarity' (measured in an information-theoretic manner). This quotient should achieve a certain ideal value. (See Nake  for an overview of approaches to formalizing `esthetic information'. Interestingly, the number plays a significant role in at least some of these approaches.) However, at the moment the precise nature of a good mapping between (mis)matches and reinforcement is unclear and the subject of ongoing research.
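One candidate non-linear mapping with these three properties can be sketched as follows. This is a hypothetical choice, since the precise form is left open above: reinforcement vanishes for a perfect match, peaks exactly at the ideal mismatch, and decays again for strong mismatches:

```python
import numpy as np

def peaked_reinforcement(mismatch, ideal=1.0):
    # Zero reinforcement for a perfect match (mismatch == 0),
    # maximal reinforcement (value 1.0) at mismatch == ideal,
    # and decay toward zero for strong mismatches.
    x = mismatch / ideal
    return x * np.exp(1.0 - x)
```

The `ideal' parameter here plays the role of the ideal mismatch between belief and reality; its useful value would have to be determined experimentally.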
Experimental research is currently under way to answer the following questions: What are useful learning rates (it is assumed that the model network should learn clearly faster than the controller)? What are useful relative strengths of pure goal-directed reinforcement and `curiosity reinforcement'? And what are the properties of a good mapping from mismatches to reinforcement?
Although these questions are still open, preliminary experiments with a linear mapping from mismatches to reinforcement have already demonstrated that the errors of the model network can be reduced by generating curiosity reinforcement in an on-line manner.