Intuitively, surprise or novelty implies unexpectedness. For instance, many would be surprised by the view of a flying elephant. Unexpectedness by itself, however, does not imply surprise. For instance, few would be surprised by the unpredictable and therefore unexpected details of the elephant's skin texture.
One difference between the flying elephant and its skin texture is that the former violates a confident prediction (``this elephant won't fly") while the latter does not. Obviously surprise has to be measured with respect to some given predictor. Only a predictor explicitly expressing an expectation that may be wrong can leave room for surprise and novelty: no surprise without the commitment of a prediction.
Now consider that in general it is not particular inputs that are surprising, but sequences of inputs. For instance, many would say an airborne elephant that jumped off the roof of a building is less surprising than one lifting off from the street and disappearing in the clouds. In fact, the predictable and the surprising things are usually spatio-temporal abstractions of the input sequences. For example, all possible visual input sequences caused by physically plausible trajectories of falling elephants (modulated by eye and head movements and varying lighting conditions) may map onto the same abstract internal representation ``falling elephant," as opposed to the more surprising ``flying elephant."
Let's have a closer look now at how the concept of a surprise is implemented in some existing, ``curious," ``inquisitive'' machine learning systems designed to explore a given environment, and how this differs from what we intuitively might want.
Most previous work on exploring unknown data sets has focused on selecting single training exemplars maximizing traditional information gain [5,10,16,18,4]. Here typically the concept of a surprise is defined in Shannon's sense : some event's surprise value or information content is the negative logarithm of its probability. This inspired simple reinforcement learning approaches to pure exploration [24,23,40] that use adaptive predictors to predict the entire next input, given current input and action. The basic idea is that the action-generating module gets rewarded in the case of predictor failures. Since it tries to maximize reward, it is motivated to generate action sequences leading to yet unpredictable states that are ``informative'' in the classic sense. Some of these explorers actually like white noise simply because it is so unpredictable, thus conveying a lot of Shannon information.
Most existing systems are limited to picking out simple statistical regularities such as ``performing action A in discrete, fully observable environmental state B will lead to state C with probability 0.8.'' Essentially, they always predict all the details of single inputs (or the next state among a set of predefined states), and are not able to limit their predictions solely to certain computable aspects of inputs (as requested in ) or input sequences, while ignoring random and irregular aspects. For instance, they cannot even express (and therefore cannot find) complex, abstract, predictable regularities such as ``executing a particular sequence of eye movements, given a history of incomplete environmental inputs partially caused by a falling elephant, will result in the appearance of a big red stain on the street within the next 3 seconds, where details of the shape of the stain are expected to be unpredictable and left unspecified.''
General spatio-temporal abstractions of this kind apparently can be made only by systems that can run fairly general algorithms mapping input/action sequences to compact internal representations conveying only certain relevant information embedded in the original inputs. For instance, there are many different, realistic, plausible big red stains -- all may be mapped onto the same compact internal representation predictable from all sequences corresponding to the abstraction ``falling elephant." Only if the final input sequence caused by eye movements scanning the street does not map onto the concept ``big red stain" (because the elephant somehow decelerated in time and for some strange reason never touched the ground) will there be a surprise.
The central questions are: In a given environment, how does one extract the predictable concepts corresponding to algorithmic regularities that are not already known? Which novel input sequence-transforming algorithms do indeed compute reduced internal representations permitting reliable predictions? Usually we cannot rely on a teacher telling the system which concepts are interesting, such as in the EURISKO system . So how to discover novel spatio-temporal regularities automatically among the many random or unpredictable things that should be ignored?
To study these questions, I will use a rather general algorithmic setup. Consider an agent exposed to a lifelong sequence of complex (for example, visual) inputs from a real world-like environment that it can manipulate. The agent is able to compose algorithms written in a Turing-equivalent programming language (one that permits construction of a universal Turing machine). The instruction set includes instructions that can access the current input (for example, pixels of visual scenes), modify the environment via actions, and modify an internal memory consisting of many addressable memory cells. It also includes arithmetic instructions and conditional jumps (conditioned on current memory cell contents).
An important role is played by so-called ``Bet!'' instructions that can be used as parts of algorithms that make predictions about memory cell contents. For instance, suppose there already exists a complex algorithm that writes value 7 into memory cell 22 whenever there has occurred one of the many visual input sequences caused by typical falling elephants. Suppose there exists another piece of code that writes value 7 into memory cell 44 shortly after a big red stain in the street is identified by an appropriate algorithm. Finally, suppose shortly after cell 44 is filled another algorithm looks at cell 22 and a history of recent eye movements represented in other memory cells and claims that cell 44's content is 7, without looking at it. The claim is simply implemented via a Bet! instruction that addresses the two memory cells and compares their contents. It corresponds to a prediction expressing abstract knowledge about the world. The prediction may be wrong. This will constitute a surprise -- a violation of an explicit (confident) prediction limited to a very particular, abstract transformation of a complex input sequence.
Now, in the absence of an external programmer or teacher, how could such algorithms focusing on the essential, predictable aspects of the environment be learned? How can we motivate the system to focus on such novel algorithms without wasting time on previously learned algorithms whose effects are already known? How can we prevent it from focusing on trivial transformations mapping all input sequences on the same internal representation which is always predictable? Clearly, we somehow need to evaluate whether something is trivial or already known or novel or just plain irregular. The model of what's known needs to be adaptive, quickly accessible, and able to generalize from experience with, say, flying elephants, to yet unseen instances of flying elephants. A look-up table model, for instance, is out of the question, due to its lack of generalization capability, and the large number of possible algorithms to investigate.
Here I propose to implicitly represent both algorithmic knowledge and model of knowledge in an adaptive data structure evolving through co-evolution of two essentially identical, algorithm-generating modules that collectively design experiments and bet on their outcomes, competing and cooperating at the same time. I will motivate each module to show the other regularities it does not yet believe in.
Each module is a set of modifiable, real-valued parameters, and owns initially equal amounts of ``money" represented by two real-valued variables -- the money idea in the context of machine learning can be traced back at least to Holland's bucket brigade system . There is a function that maps the current values of the parameters of both modules onto a single, possibly complex algorithm that may include Bet! instructions corresponding to statements such as ``cells 22 and 44 are now equal." Thus the modules collectively design and run an algorithm they have agreed upon.
For each Bet! instruction the modules can bet in advance whether the prediction is wrong or true (bids are real values, either fixed or based on the current module parameters). If they bet on different outcomes, given a particular Bet! instruction, then the system will immediately check which module is right. The winner gets rewarded (it receives the other's bid which is added to its money), while the surprised loser is punished (loses its bid).
It is absolutely essential that both modules agree on each experiment they assemble and execute. This is to make sure that no module can cheat. In the elephant example above, for instance, cheating would be possible if one of the modules could insert into its current algorithm instructions (not authorized by the other module) that look at cell 44's contents and use this information before computing an appropriate bid. This would be like two differently privileged viewers watching a magic trick, one in the audience being surprised by something the other finds entirely predictable because the latter can observe hidden movements of the magician from backstage. By agreeing on the entire instruction protocol of the current experiment the modules implicitly agree on all the information that may be used for the current computation and the current predictions.
In trying to maximize reward, each module is motivated to surprise the other as often as possible. Each is also motivated to veto computations whose outcome it does not consider profitable. As a consequence, each is essentially motivated to lure the other into agreeing on experiments that will surprise it. But by agreeing, each module expresses its belief that it is actually the other module which is in for a surprise. Using an appropriate evolutionary or reinforcement learning (RL) algorithm, a surprised module will eventually adapt. In turn, the other will lose a source of reward -- an incentive to shift the exploration focus and try new algorithms reveiling novel, yet unknown, predictable regularities.
One of the simplest such RL algorithms divides the learners' life times into trials, each lasting, say, 1000 successive instructions. After each trial, the current loser is replaced by a copy of the winner (the module with the most money) if there is one, thus ``learning" what the other knew. Money is then equally redistributed among both modules, and the parameters of one of them are modified via a random mutation such that the new competitors in the next trial are not exactly alike and tend to bet on different outcomes.
In the experiments, however, I will use a more sophisticated RL scheme that (1) takes into account the possibility of additional external reward from the environment (to be maximized by both modules), (2) does not depend on pre-wired trial lengths, and (3) replaces the simple mutation process mentioned above by more powerful, self-referential ``sequences of module parameter-modifying instructions" that allow the algorithm-generating modules to modify themselves in a highly directed manner [35,34]. Section 2 will present formal details, Section 3 will describe details of the RL algorithm, Section 4 will show that algorithmic surprise-generation of this kind can actually help to speed up external reward, Section 5 will conclude, and the appendix will describe details of the concrete implementation used in the experiments.