Edit of 2021: There is now a
more detailed publication
[AC20] on this topic.
One of the most important goals of research on
Artificial Neural Networks (NNs)
is to create algorithms that learn
the statistics of given data.
To achieve this, I introduced
a new type of unsupervised learning in our
Annus Mirabilis of
It is based on the principles of gradient descent/ascent in a minimax game where one NN minimizes the objective function maximized by another.
This duel between two unsupervised adversarial NNs was called
(PM) [PM3, PM1, PM2, PM4, ZIF].
PM requires an encoder network with initially random connection strengths or weights.
It receives data samples (such as images) through its input units
(white circles in the figure to the right)
and generates codes thereof across its output units called code units (grey).
Each code unit can be active in the interval [0,1].
A separate predictor network (black output units in the figure)
is trained to predict each
from the remaining code units. Both encoder and predictor may have hidden
units (not shown here).
The predictor learns the conditional
expectation of each code unit, given the other code units.
But the encoder is trained to maximize exactly the
same objective function
minimized by the predictor (e.g., mean squared error). That is, each
code unit wants to become as unpredictable as possible.
Predictor and encoder fight each other,
to motivate the encoder to achieve a "holy grail" of unsupervised learning, namely,
an ideal, disentangled, binary, factorial code
of the data, where the code units are statistically independent of each other
- see relevant proof by Dayan & Zemel & Pouget (1992) [PM5, PM2].
Ideally, after learning, the probability of a given data pattern is simply the product of the probabilities of its code units; these probabilities equal the predictor's outputs.
(Pattern codes can be decoded again [PM1].)
Generative Adversarial Networks
(GANs, 2014) [GAN1] also use an
unsupervised minimax principle to model the statistics of given data.
The well-known NIPS 2014 GAN paper [GAN1]
claims that PM is not
based on a minimax game with a value function that one agent seeks to maximise and the other seeks to minimise, and that for GANs
"the competition between the networks is the sole training criterion, and is sufficient on its own to train the network," while PM "is only a regularizer that encourages the hidden units of a neural network to be statistically independent while they accomplish some other task; it is not a primary training criterion" [GAN1].
However, good old
PM is indeed a pure minimax game, too, e.g.,
[PM3], Equation 2 (there is no "other task").
In particular, PM was also trained [PM3, PM1] (also on images [PM3, PM4]) such that "the competition between the networks is the sole training criterion, and is sufficient on its own to train the network."
Unsupervised Minimax for Artificial Curiosity
in Adversarial Reinforcement Learning
I extended these ideas to the case of
unsupervised adversarial Reinforcement Learning (RL),
to build agents with Artificial Curiosity.
The first work on this
dates back to 1990
[PhD, UARL1-2] (when Predictability Minimization was invented, too). A reward-maximising neural control network C
learns to generate action sequences or experiments
in an environment. It gets intrinsic
reward in proportion to the prediction errors of a separate neural network
called the world model M. M learns to predict future inputs, given past inputs and actions.
Again, in the absence of external reward, C is maximising exactly the same value function that M is minimising. This motivates C to invent and generate experiments that lead to "novel" situations where M does not yet know how to predict well.
Many recent papers on curiosity for RL are based on this simple principle.
Certain drawbacks of the approach were overcome in numerous follow-up papers
starting in 1991, e.g., [AC91, AC91b], and
In particular, in an adversarial approach of 1997 [UARL3-5],
two dueling, reward-maximizing modules (both general computers)
called the left brain and the right brain
can collectively design an experiment:
a (probabilistic) program that defines how to execute an
action sequence in the environment, and how to
compute the final experimental outcome through an
instruction sequence implementing
a computable function (e.g., a binary yes/no classification)
of the observation sequence triggered by the experiment.
Both brains can
predict experimental outcomes before they are known.
If their predictions or hypotheses differ,
after having generated and executed the experiment,
the surprised loser pays an intrinsic
reward to the winner in a zero sum game.
This motivates the unsupervised two brain system to focus on the "interesting" things,
losing interest in boring aspects of the world that are consistently predictable by both brains, as well as seemingly random aspects of the world that are currently still hard to predict by any brain.
Again, in the absence of external reward, each unsupervised
module is trying to maximise the value function minimised by the other.
This type of
artificial curiosity & creativity
artificial scientists and artists,
e.g., [AC09], and
may also accelerate the intake of
external reward, e.g., [UARL5], intuitively because a better understanding
of the world can help to solve certain problems faster.
Today, many are using the unsupervised minimax principle, both for modeling data distributions and for artificial curiosity (and also for generating algorithmic art).
Compare recent slides [ICML18ws].
[PM1] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992. Based on TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF.
Netzwerkarchitekturen, Zielfunktionen und Kettenregel
(Network architectures, objective functions, and chain rule).
Habilitation (postdoctoral thesis),
Institut for informatics, Tech Univ. Munich, 1993.
[PM3] J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimzation produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996.
N. N. Schraudolph, M. Eldracher, J. Schmidhuber.
Processing Images by Semi-Linear Predictability Minimization.
Network, 10(2): 133-169, 1999.
P. Dayan, R. Zemel, and A. Pouget, 1992. Personal Communication. See [PM2], Sec. 6.9.3.
Neural predictors for detecting and removing redundant information.
In H. Cruse, J. Dean, and H. Ritter, editors, Adaptive Behavior
and Learning. Kluwer, 1998.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, Y. Bengio.
Generative adversarial nets. NIPS 2014, 2672-2680, Dec 2014.
J. Schmidhuber. Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem (Dynamic neural nets and the fundamental spatio-temporal credit assignment problem). Dissertation, TUM, 1990. PDF.
Making the world differentiable: On using fully recurrent
self-supervised neural networks for dynamic reinforcement learning and
planning in non-stationary environments.
Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990.
A possibility for implementing curiosity and boredom in
model-building neural controllers.
In J. A. Meyer and S. W. Wilson, editors, Proc. of the
International Conference on Simulation
of Adaptive Behavior: From Animals to
Animats, pages 222-227. MIT Press/Bradford Books, 1991.
[UARL3] J. Schmidhuber.
Technical Report IDSIA-35-97, IDSIA, July 1997. More on artificial curiosity.
[UARL4] J. Schmidhuber.
Artificial Curiosity Based on Discovering Novel Algorithmic
Predictability Through Coevolution.
In Proc. CEC 1999, p. 1612-1618,
IEEE Press, Piscataway, NJ, 1999.
[UARL5] J. Schmidhuber.
Exploring the Predictable.
In Ghosh, S. Tsutsui, eds., Advances in Evolutionary Computing,
p. 579-612, Springer, 2002.
Adaptive curiosity and adaptive confidence.
Technical Report FKI-149-91, Institut fuer Informatik, Technische
Universitaet Muenchen, April 1991.
Curious model-building control systems.
In Proc. International Joint Conference on Neural Networks,
Singapore, volume 2, pages 1458-1463. IEEE, 1991.
on artificial curiosity followed later.
J. Schmidhuber. Art & science as by-products of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-112.
PDF. (More on
artificial scientists and artists.)
J. Schmidhuber. Unsupervised Minimax:
Nets That Fight Each Other.
Slides for the
ICML 2018 Workshop
on Theoretical Foundations and Applications
of Deep Generative Models.
J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991).
Neural Networks, Volume 127, p 58-66, 2020.
PS: Earlier adversarial settings include reinforcement learners playing against themselves, e.g., [SAM59], or predator prey games in artificial evolution [HIL90]. Chess programs date back to 1945 [ZUSE45], and for many decades have used a recursive minimax procedure with continually shrinking look-ahead.
Such earlier adversarial settings, however, neither were unsupervised, nor gradient-based, nor about modeling the statistics of given data, nor about intrinsic rewards for artificial curiosity:
A. L. Samuel.
Some studies in machine learning using the game of checkers.
IBM Journal on Research and Development, 3:210-229, 1959.
W. D. Hillis.
Co-evolving parasites improve simulated evolution as an optimization
Physica D: Nonlinear Phenomena, 42(1-3):228-234, 1990.
Chess programs, in The Plankalkül.
Rept. No. 106, Gesellschaft für Mathematik und
Datenverarbeitung, pages 201-244, 1976. Translation of German original, 1945.