Formal Theory of Creativity & Fun & Intrinsic Motivation (1990-2010)
JS has built curious, creative agents that may be viewed as simple artificial
scientists & artists with an intrinsic desire to explore the world by continually inventing new
experiments. They never stop generating novel & surprising stuff, and consist
of two learning modules: (A) an adaptive predictor or compressor or model of the growing data history as the agent is interacting with its environment, and (B) a general reinforcement learner (RL) selecting the actions that shape the history. The learning progress of (A) can be precisely measured and is the agent's fun: the intrinsic reward of (B). That is, (B) is motivated to learn to invent interesting things that (A) does not yet know but can easily learn. To maximize future expected reward, in the absence of external reward such as food, (B) learns more and more complex behaviors that yield initially surprising (but eventually boring) novel patterns that make (A) quickly improve.
Many papers on this
have appeared since 1990.
Key papers include those of 1991, 1995, 1997 (2002), 2006, 2007, 2011
(see also bottom of this page).
The agents embody a simple, but general, formal theory of fun & creativity
explaining essential aspects of human or non-human
science, art, music, humor (as discussed in the next column).
Let O(t) denote the state of some subjective observer O at time t.
Let H(t) denote its history of
previous actions & sensations & rewards until time t.
O has some adaptive method for compressing H(t) or parts of it.
We identify the subjective momentary simplicity or compressibility or regularity or
beauty B(D,O(t)) of any data D
(but not its interestingness or aesthetic / artistic value - see below)
as the negative number of bits required to encode D,
given the observer's current limited prior
knowledge and limited compression method.
We define the time-dependent subjective interestingness
or novelty or surprise or aesthetic reward or
aesthetic value or internal joy or fun
I(D,O(t)) of data D for observer O at
discrete time step t>0 by
the change or the first derivative of subjective simplicity or
beauty: as the learning agent improves its compression algorithm,
formerly apparently random data parts become subjectively more
regular and beautiful, requiring fewer and fewer bits
for their encoding. As long as this process is not over
the data remains interesting, but eventually
it becomes boring even if it is beautiful.
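One way to write these two definitions compactly, as a sketch: let C(D,O(t)) be the number of bits O needs to encode D at time t. The symbol C is introduced here only for illustration, and reading the "first derivative" as a difference between successive discrete time steps is an assumption of this sketch.

```latex
B(D, O(t)) = -\,C(D, O(t)), \qquad
I(D, O(t)) = B(D, O(t)) - B(D, O(t-1)) = C(D, O(t-1)) - C(D, O(t)).
```

Under this reading, I is positive whenever learning has shortened the encoding of D since the previous step, and it drops to zero once no further bits can be saved, which is when the data becomes boring.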
At time t, let
ri(t)=I(H(t),O(t)) denote the
momentary fun or intrinsic reward
for compression progress through discovery of a novel pattern somewhere in
H(t), the history of actions and sensations until t,
re(t) the current external reward if there is any, and
r(t)=g(ri(t),re(t)) the total current reward,
where g is a function
weighing external vs intrinsic rewards, e.g., g(a,b)=a+b.
The agent's goal at time t0 is to maximize
$E\left[\sum_{t=t_0}^{T} r(t)\right]$,
where E is the expectation operator, and
T is death. This can be done with one of our
reinforcement learning algorithms.
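The reward signal described above can be sketched in a few lines of Python. Everything named here is hypothetical: the toy BigramCompressor stands in for the observer's adaptive compressor, the chunk size is arbitrary, and the code length is an idealized -log2 p per symbol that ignores the cost of storing the model itself. The sketch computes ri(t) as the number of bits saved on the whole history H(t) by one learning step, and combines it with re(t) through g(a,b)=a+b.

```python
import math
import random
from collections import defaultdict


class BigramCompressor:
    """Toy adaptive predictor of the next symbol given the previous one."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        # counts[prev][sym] = how often sym followed prev in the data learned so far
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, prev, sym):
        row = self.counts[prev]
        # Add-one smoothing keeps every probability strictly positive.
        return (row[sym] + 1) / (sum(row.values()) + len(self.alphabet))

    def code_length(self, history):
        """Bits needed to encode history under the current model."""
        bits, prev = 0.0, None
        for sym in history:
            bits -= math.log2(self.prob(prev, sym))
            prev = sym
        return bits

    def learn(self, chunk, prev=None):
        """Update the transition counts with newly observed symbols."""
        for sym in chunk:
            self.counts[prev][sym] += 1
            prev = sym


def run(stream, alphabet="ab", chunk_size=16, g=lambda ri, re: ri + re):
    """Feed the stream chunk by chunk and return the total reward r(t) per step."""
    compressor = BigramCompressor(alphabet)
    history, rewards = "", []
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        prev = history[-1] if history else None
        history += chunk
        bits_before = compressor.code_length(history)
        compressor.learn(chunk, prev)
        bits_after = compressor.code_length(history)
        ri = bits_before - bits_after   # compression progress on H(t)
        re = 0.0                        # no external reward in this toy setting
        rewards.append(g(ri, re))
    return rewards


rng = random.Random(0)
periodic = "ab" * 40                                      # novel but learnable regularity
noise = "".join(rng.choice("ab") for _ in range(80))      # nothing to learn

periodic_rewards = run(periodic)
noise_rewards = run(noise)
print("periodic:", [round(r, 2) for r in periodic_rewards])
print("noise:   ", [round(r, 2) for r in noise_rewards])
```

On the periodic stream the reward is large at first and shrinks as the pattern becomes known; on the noise stream there is little to gain at any time. Because the toy model is scored on data it has just learned, small spurious gains on noise are possible.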
The following variants were implemented.
[1991a]: Non-traditional RL
(without restrictive Markovian assumptions) based on
adaptive recurrent neural networks as predictive world models maximizes
intrinsic reward as measured by prediction error;
[1991b]: Traditional RL maximizes
intrinsic reward as measured by improvements in prediction error;
[1995]: Traditional RL maximizes intrinsic reward as
measured by relative entropies between the agent's
priors and posteriors;
[1997-2002]: Learning of probabilistic, hierarchical programs and skills through
zero-sum intrinsic reward games
of two players betting against each other, each trying to out-predict
or surprise the other by inventing algorithmic experiments where both
modules disagree on the predicted experimental outcome,
taking into account the computational
costs of learning, and learning when to learn and what to learn.
The papers 1991-2002 also showed experimentally how intrinsic rewards can
substantially accelerate goal-directed
learning and external reward intake.
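To make the difference between the first three reward measures concrete, here is a minimal sketch on a deliberately unlearnable data source, a fair coin. It is not the setup of the cited papers, which used neural networks and RL environments: the grid-based Bayesian coin model, the function names, and the stream length are all assumptions chosen for brevity. The three signals are the squared prediction error (1991a style), the drop in that error caused by one update (1991b style), and the relative entropy between posterior and prior in bits (1995 style).

```python
import math
import random

GRID = [i / 100 for i in range(1, 100)]   # candidate coin biases 0.01 .. 0.99


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def signals(outcomes):
    """Per-step (error, error_improvement, info_gain) for a stream of 0/1 outcomes."""
    belief = normalize([1.0] * len(GRID))            # uniform prior over the bias
    rows = []
    for x in outcomes:
        p_old = sum(b * th for b, th in zip(belief, GRID))     # predicted P(x = 1)
        posterior = normalize(
            [b * (th if x == 1 else 1.0 - th) for b, th in zip(belief, GRID)]
        )
        p_new = sum(b * th for b, th in zip(posterior, GRID))
        error = (x - p_old) ** 2                     # 1991a-style signal
        improvement = error - (x - p_new) ** 2       # 1991b-style signal
        info_gain = sum(                             # 1995-style signal, in bits
            q * math.log2(q / p) for q, p in zip(posterior, belief) if q > 0
        )
        rows.append((error, improvement, info_gain))
        belief = posterior
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


rng = random.Random(0)
fair_coin = [1 if rng.random() < 0.5 else 0 for _ in range(200)]   # pure noise
rows = signals(fair_coin)
early, late = rows[:10], rows[-50:]
for name, k in (("prediction error", 0), ("error improvement", 1), ("information gain", 2)):
    print(f"{name:18s} early={mean(r[k] for r in early):.4f} late={mean(r[k] for r in late):.4f}")
```

The raw prediction error stays high forever on the coin, so an agent rewarded by it would keep staring at noise. The two progress-based signals fade towards zero once nothing more can be learned, which is the behavior the later variants were designed to produce.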
JS also described
mathematically optimal, intrinsically motivated systems driven by prediction progress or
compression progress [2006-]. All publications on this
are listed here.
Our continually learning artificial agents go through developmental stages:
something interesting catches their attention
for a while, but they get bored
once they can't further improve their predictions,
then try to create new tasks for
themselves to learn novel, more complex skills
on top of what they already know; then this gets boring as well,
and so on, in open-ended fashion.
Selected Videos and Invited Talks on Artificial Creativity etc
13 June 2012: JS' work featured in Through the Wormhole with Morgan Freeman on the Science
Full video at youtube - check 31:20 ff and 2:30 ff.
20 Jan 2012: TEDx Talk
(uploaded 10 March) at
When creative machines overtake man (12:47).
15 Jan 2011: Winter Intelligence Conference, Oxford
(on universal AI and theory of fun). See
video of Sept 2011 at
22 Sep 2010:
Banquet Talk in the historic
de la Música Catalana for Joint Conferences ECML / PKDD 2010,
Formal Theory of Fun & Creativity. 4th slide.
On Dec 14, videolectures.net posted a video and all slides of this talk.
12 Nov 2009:
Keynote in the historic Cinema Corso (Lugano)
for Multiple Ways to Design Research 09: Art & Science
3 Oct 2009: Invited talk for Singularity Summit in the historic
Kaufmann Concert Hall, New York City.
10 min video.
12 Jul 2009:
Dagstuhl Castle Seminar on
3 Sep 2008:
Keynote for KES 2008
2 Oct 2007:
for Algorithmic Learning Theory (ALT 2007) and
Discovery Science (DS 2007), Sendai, Japan
23 Aug 2007: Keynote for A*STAR Meeting on
Expectation & Surprise, Singapore
12 July 2007: Keynote for Art Meets
Science 2007: "Randomness vs
simplicity & beauty in physics and the fine arts"
Left: JS giving a
talk on creativity theory & art & science & humor at the
Singularity Summit 2009
in New York City.
40min (original at Vimeo),
JS' theory was also subject
of a TV documentary (BR "Faszination Wissen", 29 May 2008; several
repeats on other channels).
Compare H+ interview and
How the Theory Explains Art.
Artists (and observers of art) get
rewarded for making (and observing) novel patterns:
data that is neither arbitrary (like incompressible
random white noise) nor regular in an already known way, but
regular in a way that is new with respect
to the observer's current knowledge, yet learnable (that is, after learning
fewer computational resources are needed to encode the data).
While the formal theory of creativity explains the
desire to create or observe all
kinds of art,
Low-Complexity Art (1997) applies and
illustrates it in a particularly clear way. Example to the right:
Many observers report they derive pleasure from
discovering simple but novel patterns while actively
scanning this self-similar Femme Fractale.
The observer's learning process causes
a reduction of the subjective complexity of the data,
yielding a temporarily high derivative of subjective beauty:
a temporarily steep learning curve. Similarly, the computer-aided
artist got reward for discovering a satisfactory way of using fractal circles to create this low-complexity artwork, although it took him a long time and thousands of frustrating trials. Here is the explanation of the artwork's low algorithmic complexity: The frame is a circle; its leftmost point is the center of another circle of the same size. Wherever two circles of equal size touch or intersect, there are centers of two more circles with equal and half size, respectively. Each line of the drawing is a segment of some circle; its endpoints are where circles touch or intersect. There are few big circles and many small ones. This can be used to encode the image very efficiently through a very short program. That is,
the Femme Fractale has very low algorithmic information or Kolmogorov complexity.
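The contrast between incompressible noise and data generated by a very short program can be illustrated with an off-the-shelf compressor. This is only a rough sketch: zlib gives a crude upper bound on description length, not the Kolmogorov complexity itself, and the one-line rule below is an arbitrary stand-in, not the construction of the Femme Fractale.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Bits used by zlib at maximum effort (a crude upper-bound proxy only)."""
    return 8 * len(zlib.compress(data, 9))


n = 4096
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(n))    # no structure to exploit
regular = bytes((i * i) % 7 for i in range(n))         # produced by a one-line rule

for name, data in (("noise", noise), ("regular", regular)):
    print(f"{name:8s} raw={8 * len(data)} bits  compressed={compressed_bits(data)} bits")
```

The noise stays at roughly its raw size, while the rule-generated bytes shrink to a small fraction of theirs. Neither case by itself is interesting in the sense of the theory; what counts is the moment an observer's own compressor first learns to exploit such a rule.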
(The expression Femme Fractale was coined in 1997: J. Schmidhuber. Femmes Fractales. Report IDSIA-99-97, IDSIA, Switzerland, 1997. In 2012, the artwork first appeared on TV in Through the Wormhole with Morgan Freeman on the Science
How the Theory Explains Music.
Why are some musical pieces more interesting or aesthetically
rewarding than others? Not the one the listener (composer)
just heard (played) fifty times in a row without any noticeable change.
It became too subjectively predictable in the
process. Not the weird one with completely
unfamiliar rhythm and tonality. It seems too irregular and contains
too much arbitrariness and subjective noise.
The observer (creator) of the data is interested in
melodies that are unfamiliar enough to contain somewhat
unexpected harmonies or beats etc., but
familiar enough to allow for quickly recognizing the presence of a
new learnable regularity or compressibility in the sound stream:
a novel pattern!
Sure, it will get boring over time, but not yet.
All of this perfectly fits the theory:
The current predictor / compressor of the observer or data creator tries to compress
his history of acoustic and other inputs where possible (whatever you can
predict you can compress as you don't have to store it extra). The action selector
tries to find history-influencing actions such that the continually growing historic data
allows for improving the performance of the predictor / compressor. The interesting or
aesthetically rewarding musical and other subsequences are
precisely those with previously unknown yet learnable
types of regularities, because they lead to
compressor improvements. The boring patterns are those
that are either already perfectly known or
arbitrary or random, or whose structure seems too hard to understand.
How the Theory Explains Humor.
Consider the following statement:
Biological organisms are driven by the "Four Big F's":
Feeding, Fighting, Fleeing, Mating.
Some subjective observers who read this for the first time think it is funny.
Why? As the eyes are sequentially
scanning the text the brain receives a complex visual input stream.
The latter is subjectively partially compressible as it relates to the
observer's previous knowledge about letters and words.
That is, given
the reader's current knowledge and current compressor, the raw data can be encoded
by fewer bits than required to store random data of the same size.
But the punch line after the last comma is unexpected for those who
expected another "F".
Initially this failed expectation results in sub-optimal
data compression - storage of expected events does not cost anything,
but deviations from predictions require extra bits to encode them.
The compressor, however, does not stay
the same forever: within a short time interval
its learning algorithm kicks in and improves its performance on the data seen so far, by
discovering the non-random, non-arbitrary and therefore compressible pattern
relating the punch line to previous text and previous elaborate
predictive knowledge about the "Four Big F's."
This saves a few bits of storage. The number of saved bits
(or a similar measure of learning progress) becomes the observer's intrinsic
reward, possibly strong enough to
motivate him to read on in search of more
reward through additional yet unknown patterns.
While previous attempts at explaining humor (e.g., Raskin 1985) also
focus on the element of surprise, they lack the essential
concept of novel pattern detection measured by
compression progress due to learning.
This progress is zero whenever the
unexpected is just random white noise, and thus no fun at all.
Applications of Schmidhuber's theory of humor can be found in
How the Theory Explains Science.
If the history of the entire
universe were computable,
and there is no evidence against this possibility
(Schmidhuber, Nature 438, 2005),
then its simplest explanation would be the
program that computes it.
Unfortunately there is no general way of finding
the shortest program computing any given data.
Therefore physicists have traditionally proceeded incrementally,
analyzing just a small aspect of the world at any given
time, trying to find simple laws that allow for
describing their limited observations better than the
best previously known law, essentially
trying to find a program that compresses
the observed data better than
the best previously known program.
An unusually large compression breakthrough deserves the name discovery.
For example, Newton's law of gravity
can be formulated as a short piece of code which
allows for substantially compressing many observation
sequences involving falling apples and other objects.
Although its predictive power is limited - for example, it does
not explain quantum fluctuations of apple atoms - it still allows
for greatly reducing the number of bits required to encode the
data stream, by assigning short codes to events that are
predictable with high probability under the
assumption that the law holds.
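A rough numerical sketch of this claim follows. The measurement setup is invented for illustration: distances fallen are sampled at 100 time points, measured to an assumed resolution of 1 mm with assumed Gaussian noise of 5 mm. The helper gaussian_bits is hypothetical and uses the standard approximation for the code length of finely quantized Gaussian residuals; the few bits needed to state the law and the fitted variance are ignored.

```python
import math
import random

G = 9.81        # m/s^2, assumed gravitational acceleration
DELTA = 0.001   # m, assumed measurement resolution


def gaussian_bits(residuals, delta=DELTA):
    """Approximate bits to encode residuals quantized to delta with a
    zero-mean Gaussian code whose variance is fitted to the residuals."""
    n = len(residuals)
    variance = sum(r * r for r in residuals) / n
    bits_per_sample = 0.5 * math.log2(2 * math.pi * math.e * variance) - math.log2(delta)
    return n * bits_per_sample


rng = random.Random(0)
times = [i / 100 for i in range(1, 101)]                           # 0.01 s .. 1.00 s
drops = [0.5 * G * t * t + rng.gauss(0.0, 0.005) for t in times]   # noisy distances fallen

# Without the law: encode each measurement as a deviation from the overall mean.
mean_drop = sum(drops) / len(drops)
bits_without_law = gaussian_bits([d - mean_drop for d in drops])

# With the law: encode only the small deviations from the predicted distance.
bits_with_law = gaussian_bits([d - 0.5 * G * t * t for d, t in zip(drops, times)])
print(round(bits_without_law), round(bits_with_law))
```

Only the unexplained residuals have to be stored once the law is known, so the second number is much smaller than the first. The difference is the compression progress that the law provides on this data.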
Einstein's general relativity theory
yields additional compression progress as it
compactly explains many previously unexplained
deviations from Newton's predictions.
Most physicists believe
there is still room for further advances,
and this is what's driving their ongoing research.
When not occupied with optimizing external reward,
physicists are just following their compression progress drive!
Key Papers Since 1990
J. Schmidhuber.
A possibility for implementing curiosity and boredom in
model-building neural controllers.
In J. A. Meyer and S. W. Wilson, editors, Proc. of the
International Conference on Simulation
of Adaptive Behavior: From Animals to
Animats, pages 222-227. MIT Press/Bradford Books, 1991.
(Based on TR FKI-126-90,
TUM, 1990.)
J. Schmidhuber.
Curious model-building control systems.
In Proc. International Joint Conference on Neural Networks,
Singapore, volume 2, pages 1458-1463. IEEE, 1991.
J. Storck, S. Hochreiter, and J. Schmidhuber.
Reinforcement-driven information acquisition in non-deterministic
environments.
In Proc. ICANN'95, vol. 2, pages 159-164.
EC2 & CIE, Paris, 1995.
J. Schmidhuber.
Low-Complexity Art.
Leonardo, Journal of the
International Society for the Arts, Sciences, and
Technology, 30(2):97-103, MIT Press, 1997.
J. Schmidhuber.
Artificial Curiosity Based on Discovering Novel Algorithmic
Predictability Through Coevolution.
In P. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, Z.
Zalzala, eds., Congress on Evolutionary Computation, p. 1612-1618,
IEEE Press, Piscataway, NJ, 1999. (Based on:
What's interesting? TR IDSIA-35-97, 1997.)
J. Schmidhuber.
Exploring the Predictable.
In A. Ghosh, S. Tsutsui, eds., Advances in Evolutionary Computing,
p. 579-612, Springer, 2002.
J. Schmidhuber.
Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts.
Connection Science, 18(2):173-187, 2006.
J. Schmidhuber.
Simple Algorithmic Principles of Discovery, Subjective Beauty,
Selective Attention, Curiosity & Creativity.
In V. Corruble, M. Takeda, E. Suzuki, eds.,
Proc. 10th Intl. Conf. on Discovery Science (DS 2007)
p. 26-38, LNAI 4755, Springer, 2007.
J. Schmidhuber.
POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem.
Frontiers in Cognitive Science, 2013.
ArXiv preprint (2011).
R. K. Srivastava, B. R. Steunebrink, J. Schmidhuber.
First Experiments with PowerPlay.
Neural Networks, 2013.
ArXiv preprint (2012).
How the Theory Explains Fun Through Learning Motor Skills.
In many ways the laughs provoked by witty jokes
are similar to those provoked by the
acquisition of new skills by both babies and adults.
Past the age of 25
JS learnt to juggle three balls. It was not a sudden
process but an incremental and rewarding one: in the beginning
he managed to juggle them for maybe one second
before they fell down, then two seconds, four
seconds, etc., until he was able to do it right.
Watching himself in the mirror (as recommended by juggling teachers)
he noticed an idiotic grin across his face whenever he made progress.
Later his little daughter grinned just
like that when she was able to stand on her own feet for the first time.
All of this fits the theory perfectly:
such grins are
triggered by intrinsic reward
for generating a data stream with previously unknown
novel patterns, such as the sensory input sequence
corresponding to observing oneself juggling, which may be quite
different from the more familiar experience of
observing somebody else juggling, and therefore
truly novel and intrinsically rewarding, until
the adaptive predictor / compressor
(e.g., a recurrent neural network)
gets used to it.
Picture: iCub baby robot
as used in JS'
EU project IM-CLEVER on developmental robotics and on
implementing the theory of fun & creativity
on robots. (So far, however, the iCub has been unable
to juggle the three balls for more than 60 seconds -
much remains to be done.)
How the Theory
Generalizes Active Learning (e.g., Fedorov, 1972).
To optimize a function may require expensive data evaluations.
Original active learning is limited to supervised classification tasks,
asking which data points to evaluate next to maximize information gain, typically (but not
necessarily) using 1 step look-ahead,
assuming all data point evaluations are equally costly.
The objective (to improve classification error) is given externally;
there is no explicit intrinsic reward in the sense discussed here.
The more general framework of creativity theory also takes formally into account:
(1) Reinforcement learning agents embedded in an environment where there
may be arbitrary delays between experimental actions and corresponding information gains,
e.g., papers of 1991 & 1995,
(2) The highly environment-dependent costs of obtaining or creating not just
individual data points but data sequences of a priori unknown size,
(3) Arbitrary algorithmic or statistical dependencies in sequences of actions &
sensory inputs, e.g., papers of 2002 & 2006,
(4) The computational cost of learning new skills, e.g., the 2002 paper.
Unlike previous approaches, these systems
measure and maximize algorithmic novelty
(learnable but previously unknown compressibility or predictability)
of self-made, general, spatio-temporal patterns in the history of data and actions,
e.g., papers 2006-2010.
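For comparison, the original active learning setting that this framework generalizes can be sketched as follows. The example is a deliberately simple assumption: a supervised task with a hidden threshold, noise-free labels, equal cost for every query, and a one-step look-ahead that picks the data point with the largest expected information gain. All function names and numbers are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, thresholds, x):
    """Expected entropy reduction from learning the noise-free label of x,
    where label(x) = 1 if x >= threshold else 0."""
    p_one = sum(p for p, th in zip(prior, thresholds) if x >= th)
    gain = entropy(prior)
    for label, label_prob in ((1, p_one), (0, 1.0 - p_one)):
        if label_prob <= 0:
            continue
        # Keep only the hypotheses consistent with this label, renormalized.
        posterior = [
            p / label_prob
            for p, th in zip(prior, thresholds)
            if (x >= th) == (label == 1)
        ]
        gain -= label_prob * entropy(posterior)
    return gain


thresholds = [i + 0.5 for i in range(8)]      # hypotheses: 0.5, 1.5, ..., 7.5
prior = [1 / len(thresholds)] * len(thresholds)
candidates = list(range(9))                   # data points that could be evaluated

best = max(candidates, key=lambda x: expected_information_gain(prior, thresholds, x))
print(best, expected_information_gain(prior, thresholds, best))
```

With a uniform prior the chosen query is the point that splits the hypotheses in half, which yields exactly one bit. Note what is missing relative to the list above: there is no delay between action and information gain, no query cost, no sequence structure, and no cost of learning.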
No Objective Ideal Ratio Between Expected and Unexpected.
Some of the previous attempts at explaining aesthetic experience in the context
of math and information theory (Birkhoff 1933, Moles 1968, Bense 1969, Frank 1964, Nake 1974, Franke 1979)
focused on the idea of an "ideal" ratio between expected and unexpected
information conveyed by some aesthetic
object (its order vs its complexity). Note that
the alternative approach of JS does
not have to postulate an objective ideal ratio of this kind.
Instead his dynamic measure
of interestingness reflects the change in the number
of bits required to encode an object, and explicitly takes into
account the subjective observer's prior knowledge as well as
its limited compression improvement algorithm.
Hence the value of an aesthetic experience is not defined by the observed object per se, but
by the algorithmic compression
progress of the subjective, learning observer.
Summary. To build a creative
system we need just a few crucial ingredients:
(1) A predictor or compressor (e.g., an
RNN) of the continually growing history of actions and sensory inputs,
reflecting what's currently known about how the world works,
(2) A learning algorithm that continually improves the predictor or compressor
(detecting novel spatio-temporal patterns that subsequently become known patterns),
(3) Intrinsic rewards measuring the predictor's or compressor's improvements (= first derivatives of compressibility) due
to the learning algorithm,
(4) A separate reward optimizer or reinforcement learner (could be an evolutionary algorithm),
which translates those rewards into action
sequences or behaviors expected to optimize future reward - the
creative agent is intrinsically motivated to
make additional novel patterns predictable or compressible in hitherto unknown ways,
thus maximizing learning progress of the predictor / compressor.
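The four ingredients can be put together in a very small sketch. Every design choice here is an assumption made to keep the example short: a linear predictor instead of an RNN, a greedy value-based action selector instead of a full reinforcement learner, two hypothetical data sources named "pattern" and "noise", and learning progress measured as the drop in squared error on freshly drawn data.

```python
import random

rng = random.Random(0)


def sample_batch(source, n=20):
    """Draw n (x, y) pairs. 'pattern' hides a learnable law, 'noise' has none."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    if source == "pattern":
        return [(x, 2.0 * x - 1.0) for x in xs]
    return [(x, rng.uniform(-1.0, 1.0)) for x in xs]


class LinearPredictor:
    """Ingredient (1): a tiny predictor y ~ w*x + b of what a source will emit."""

    def __init__(self):
        self.w, self.b = 0.0, 0.0

    def mse(self, batch):
        return sum((self.w * x + self.b - y) ** 2 for x, y in batch) / len(batch)

    def learn(self, batch, lr=0.1):
        """Ingredient (2): one gradient step on the mean squared error."""
        n = len(batch)
        grad_w = sum(2 * (self.w * x + self.b - y) * x for x, y in batch) / n
        grad_b = sum(2 * (self.w * x + self.b - y) for x, y in batch) / n
        self.w -= lr * grad_w
        self.b -= lr * grad_b


sources = ["pattern", "noise"]
predictors = {s: LinearPredictor() for s in sources}
value = {s: 0.0 for s in sources}      # running estimate of intrinsic reward per source
actions = []

for step in range(40):
    # Ingredient (4): a greedy reward optimizer, after trying each source once.
    source = sources[step] if step < len(sources) else max(sources, key=value.get)
    model = predictors[source]
    held_out = sample_batch(source)
    error_before = model.mse(held_out)
    model.learn(sample_batch(source))
    # Ingredient (3): intrinsic reward = learning progress on fresh data.
    reward = error_before - model.mse(held_out)
    value[source] += 0.3 * (reward - value[source])
    actions.append(source)

print(actions)
print({s: round(v, 4) for s, v in value.items()})
```

After sampling both sources once, the agent concentrates on the source whose law it can still learn, because that is where its predictor improves. As the law becomes known the reward from that source shrinks towards zero, the toy analogue of getting bored.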
Alternative Summary. Apart from external reward, how much fun can a subjective
observer extract from some sequence of actions and observations?
His intrinsic fun is the difference between how many
resources (bits & time) he needs to encode the data before and after learning.
A separate reinforcement learner maximizes
expected fun by finding or creating data that is better compressible in some yet unknown but learnable way,
such as jokes, songs, paintings, or scientific observations obeying novel, unpublished laws.
Copyright notice (2010):
Text and graphics and
Fibonacci web design by
Member of the
Academy of Sciences and Arts,
will be delighted if you use parts of
the data and graphics in this web page
for educational and non-commercial purposes, including
Wikipedia and similar sites,
provided you mention the source and provide a link.
Overview Papers Etc Since 2009
J. Schmidhuber. Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes.
In G. Pezzulo, M. V. Butz, O. Sigaud, G. Baldassarre, eds.: Anticipatory Behavior in Adaptive Learning Systems, from Sensorimotor to Higher-level Cognitive Capabilities, Springer, LNAI, 2009.
J. Schmidhuber. Simple Algorithmic Theory of Subjective Beauty, Novelty, Surprise,
Interestingness, Attention, Curiosity, Creativity, Art,
Science, Music, Jokes. Journal of SICE, 48(1):21-32, 2009.
J. Schmidhuber. Art & science as by-products of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Multiple ways to design research. Research cases that reshape the design discipline, Milano-Lugano, Swiss Design Network - Et al. Edizioni, 2009, pp. 98-112.
J. Schmidhuber. Artificial Scientists & Artists Based on the Formal Theory of Creativity.
Proceedings of the Third Conference on Artificial General Intelligence (AGI-2010), Lugano, Switzerland.
J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010.
Comment on Nature 473, 450 (26 May 2011): Neuroscience: What makes us laugh.
J. Schmidhuber. A Formal Theory of Creativity to Model the Creation of Art.
In J. McCormack (ed.), Computers and Creativity. MIT Press, 2012.