Unsupervised Neural Nets Fight Each Other in a Minimax Game where One Net Maximizes the Value Function Minimized by the Other

Jürgen Schmidhuber (2018)
(Pronounce: You_again Shmidhoobuh)

One of the most important goals of research on Artificial Neural Networks (NNs) is to create algorithms that learn the statistics of given data. To achieve this, I introduced a new type of unsupervised learning in the 1990s. It is based on the principles of gradient descent/ascent in a minimax game where one NN minimizes the objective function maximized by another. This duel between two unsupervised adversarial NNs was called Predictability Minimization (PM) [PM3, PM1, PM2, PM4, ZIF].

[Figure: Predictability Minimization, an unsupervised minimax game in which one neural network minimizes the objective function maximized by another.]

PM requires an encoder network with initially random connection strengths (weights). It receives data samples (such as images) through its input units (white circles in the figure) and generates codes thereof across its output units, called code units (grey circles). Each code unit's activation lies in the interval [0,1]. A separate predictor network (black output units in the figure) is trained to predict each code unit from the remaining code units. Both encoder and predictor may have hidden units (not shown). The predictor learns the conditional expectation of each code unit, given the other code units. The encoder, however, is trained to maximize exactly the same objective function minimized by the predictor (e.g., mean squared error): each code unit wants to become as unpredictable as possible. Predictor and encoder fight each other, to motivate the encoder to achieve a "holy grail" of unsupervised learning, namely, an ideal, disentangled, binary, factorial code of the data, whose code units are statistically independent of each other; see the relevant proof by Dayan & Zemel & Pouget (1992) [PM5, PM2]. Ideally, after learning, the probability of a given data pattern is simply the product of the probabilities of its code units, and these probabilities equal the predictor's outputs. (Pattern codes can be decoded again [PM1].)
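As a concrete illustration, here is a minimal PM sketch in PyTorch. It follows the description above (the predictor minimizes the mean squared prediction error of each code unit given the others; the encoder performs gradient ascent on the same error); all network sizes, learning rates, and the random stand-in data are hypothetical choices for illustration, not prescribed by PM.

    import torch
    import torch.nn as nn

    n_in, n_code = 64, 8  # hypothetical dimensions

    # Encoder: maps a data sample to code units with activations in [0,1].
    encoder = nn.Sequential(nn.Linear(n_in, 32), nn.Sigmoid(),
                            nn.Linear(32, n_code), nn.Sigmoid())

    # One predictor per code unit: guesses unit i from the other units.
    predictors = nn.ModuleList(
        [nn.Sequential(nn.Linear(n_code - 1, 16), nn.Sigmoid(),
                       nn.Linear(16, 1), nn.Sigmoid()) for _ in range(n_code)])

    opt_enc = torch.optim.SGD(encoder.parameters(), lr=0.1)
    opt_pred = torch.optim.SGD(predictors.parameters(), lr=0.1)

    def prediction_error(code):
        # Mean squared error of each predictor on its code unit,
        # given the remaining code units.
        err = 0.0
        for i, p in enumerate(predictors):
            others = torch.cat([code[:, :i], code[:, i + 1:]], dim=1)
            err = err + ((p(others) - code[:, i:i + 1]) ** 2).mean()
        return err / len(predictors)

    for step in range(1000):
        x = torch.rand(128, n_in)  # stand-in for real data samples

        # Predictor step: minimize the error (code detached, so only
        # the predictors' weights move).
        loss_pred = prediction_error(encoder(x).detach())
        opt_pred.zero_grad()
        loss_pred.backward()
        opt_pred.step()

        # Encoder step: maximize exactly the same objective, i.e.,
        # gradient ascent, implemented as descent on the negated error.
        loss_enc = -prediction_error(encoder(x))
        opt_enc.zero_grad()
        loss_enc.backward()
        opt_enc.step()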

Generative Adversarial Networks (GANs, 2014) [GAN1] also use an unsupervised minimax principle to model the statistics of given data. The well-known NIPS 2014 GAN paper [GAN1] claims that PM is not based on a minimax game with a value function that one agent seeks to maximize and the other seeks to minimize, that for GANs "the competition between the networks is the sole training criterion, and is sufficient on its own to train the network," and that PM "is only a regularizer that encourages the hidden units of a neural network to be statistically independent while they accomplish some other task; it is not a primary training criterion" [GAN1]. However, good old PM is indeed a pure minimax game, too; see, e.g., [PM3], Equation 2 (there is no "other task"). In particular, PM was also trained [PM3, PM1] (also on images [PM3, PM4]) such that "the competition between the networks is the sole training criterion, and is sufficient on its own to train the network."
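For comparison, the two value functions can be written schematically (my notation, not quoted from [PM3] or [GAN1]; E(x)_i denotes the i-th code unit produced by encoder E, and E(x)_{\neg i} the remaining code units):

    % PM (schematic): the encoder E maximizes the squared prediction
    % error that the predictor P minimizes
    \max_E \min_P V_{\mathrm{PM}}(E,P) =
      \mathbb{E}_{x \sim p_{\mathrm{data}}}
      \sum_i \bigl( P_i(E(x)_{\neg i}) - E(x)_i \bigr)^2

    % GAN: the discriminator D maximizes the value function that the
    % generator G minimizes
    \min_G \max_D V_{\mathrm{GAN}}(D,G) =
      \mathbb{E}_{x \sim p_{\mathrm{data}}} [\log D(x)]
      + \mathbb{E}_{z \sim p_z} [\log (1 - D(G(z)))]

In both cases, the competition between the two networks is the sole training criterion.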

[Figure: Curiosity is not necessarily good for you.]

Unsupervised Minimax for Artificial Curiosity in Adversarial Reinforcement Learning

I extended these ideas to the case of unsupervised adversarial Reinforcement Learning (RL), to build agents with Artificial Curiosity. The first work on this dates back to 1990 [PhD, UARL1-2] (when Predictability Minimization was invented, too). A reward-maximizing neural control network C learns to generate action sequences, or experiments, in an environment. It gets intrinsic reward in proportion to the prediction errors of a separate neural network called the world model M, which learns to predict future inputs, given past inputs and actions. Again, in the absence of external reward, C is maximizing exactly the same value function that M is minimizing. This motivates C to invent and generate experiments that lead to "novel" situations where M does not yet know how to predict well.
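A minimal sketch of this intrinsic-reward loop, assuming a world model M trained by gradient descent; the RL update of the controller C is abstracted into a hypothetical callback, since any reward-maximizing RL method can be plugged in:

    import torch
    import torch.nn as nn

    obs_dim, act_dim = 16, 4  # hypothetical sizes
    M = nn.Linear(obs_dim + act_dim, obs_dim)  # world model: predicts next input
    opt_M = torch.optim.SGD(M.parameters(), lr=0.01)

    def curiosity_step(obs, action, next_obs, controller_update):
        pred = M(torch.cat([obs, action], dim=-1))
        err = ((pred - next_obs) ** 2).mean()

        # M minimizes its prediction error ...
        opt_M.zero_grad()
        err.backward()
        opt_M.step()

        # ... while C receives that very error as intrinsic reward, so a
        # reward-maximizing RL update drives C toward situations that M
        # cannot yet predict well.
        controller_update(intrinsic_reward=err.item())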

Many recent papers on curiosity for RL are based on this simple principle. Certain drawbacks of the approach were overcome in numerous follow-up papers, starting in 1991 with [AC91, AC91b].

In particular, in an adversarial approach of 1997 [UARL3-5], two dueling, reward-maximizing modules (both general computers) called the left brain and the right brain collectively design an experiment: a (probabilistic) program that defines how to execute an action sequence in the environment, and how to compute the final experimental outcome through an instruction sequence implementing a computable function (e.g., a binary yes/no classification) of the observation sequence triggered by the experiment. Both brains can predict experimental outcomes before they are known. If their predictions or hypotheses differ, then after the experiment has been generated and executed, the surprised loser pays an intrinsic reward to the winner in a zero-sum game.

[Figure: Two reward-maximizing modules (both general computers), called the left brain and the right brain, bet on computational outcomes of experimental protocols or algorithms they have agreed upon.]

This motivates the unsupervised two-brain system to focus on the "interesting" things, losing interest in boring aspects of the world that are consistently predictable by both brains, as well as in seemingly random aspects that are currently still hard to predict for either brain; see the sketch below. Again, in the absence of external reward, each unsupervised module is trying to maximize the value function minimized by the other. This type of artificial curiosity & creativity may drive artificial scientists and artists, e.g., [AC09], and may also accelerate the intake of external reward, e.g., [UARL5], intuitively because a better understanding of the world can help to solve certain problems faster.
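A toy sketch of the 1997 bet settlement, assuming both brains have already agreed on an experiment and committed to binary outcome predictions (function and variable names are mine, for illustration only):

    def settle_bet(left_prediction, right_prediction, execute_experiment, stake=1.0):
        """Zero-sum settlement: if the two brains' predicted outcomes differ,
        run the agreed experiment; the surprised loser pays the winner."""
        if left_prediction == right_prediction:
            return 0.0, 0.0  # no disagreement, no bet, no intrinsic reward
        outcome = execute_experiment()  # e.g., a binary yes/no classification
        if left_prediction == outcome:
            return +stake, -stake  # left brain wins the right brain's stake
        return -stake, +stake      # right brain wins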

Today, many are using the unsupervised minimax principle, both for modeling data distributions and for artificial curiosity (and also for generating algorithmic art). Compare recent slides [ICML18ws].

References

[PM1] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992. Based on TR CU-CS-565-91, Univ. Colorado at Boulder, 1991.

[PM2] J. Schmidhuber. Netzwerkarchitekturen, Zielfunktionen und Kettenregel (Network architectures, objective functions, and chain rule). Habilitation (postdoctoral thesis), Institut für Informatik, Tech. Univ. Munich, 1993.

[PM3] J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996.

[PM4] N. N. Schraudolph, M. Eldracher, J. Schmidhuber. Processing images by semi-linear predictability minimization. Network, 10(2):133-169, 1999.

[PM5] P. Dayan, R. Zemel, and A. Pouget, 1992. Personal Communication. See [PM2], Sec. 6.9.3.

[ZIF] J. Schmidhuber. Neural predictors for detecting and removing redundant information. In H. Cruse, J. Dean, and H. Ritter, editors, Adaptive Behavior and Learning. Kluwer, 1998.

[GAN1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial nets. NIPS 2014, 2672-2680, Dec 2014.

[PhD] J. Schmidhuber. Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem (Dynamic neural nets and the fundamental spatio-temporal credit assignment problem). Dissertation, TUM, 1990.

[UARL1] J. Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990.

[UARL2] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222-227. MIT Press/Bradford Books, 1991.

[UARL3] J. Schmidhuber. What's interesting? Technical Report IDSIA-35-97, IDSIA, July 1997.

[UARL4] J. Schmidhuber. Artificial Curiosity Based on Discovering Novel Algorithmic Predictability Through Coevolution. In Proc. CEC 1999, p. 1612-1618, IEEE Press, Piscataway, NJ, 1999.

[UARL5] J. Schmidhuber. Exploring the Predictable. In A. Ghosh and S. Tsutsui, eds., Advances in Evolutionary Computing, pp. 579-612, Springer, 2002.

[AC91b] J. Schmidhuber. Adaptive curiosity and adaptive confidence. Technical Report FKI-149-91, Institut für Informatik, Technische Universität München, April 1991.

[AC91] J. Schmidhuber. Curious model-building control systems. In Proc. International Joint Conference on Neural Networks, Singapore, volume 2, pages 1458-1463. IEEE, 1991. Numerous papers on artificial curiosity followed later.

[AC09] J. Schmidhuber. Art & science as by-products of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-112.

[ICML18ws] J. Schmidhuber. Unsupervised Minimax: Nets That Fight Each Other. Slides for the ICML 2018 Workshop on Theoretical Foundations and Applications of Deep Generative Models.

PS: Earlier adversarial settings include reinforcement learners playing against themselves, e.g., [SAM59], or predator-prey games in artificial evolution [HIL90]. Chess programs date back to 1945 [ZUSE45] and for many decades have used a recursive minimax procedure with continually shrinking look-ahead. Such earlier adversarial settings, however, were neither unsupervised, nor gradient-based, nor about modeling the statistics of given data, nor about intrinsic rewards for artificial curiosity:

[SAM59] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal on Research and Development, 3:210-229, 1959.

[HIL90] W. D. Hillis. Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D: Nonlinear Phenomena, 42(1-3):228-234, 1990.

[ZUSE45] K. Zuse. Chess programs, in The Plankalkül. Rept. No. 106, Gesellschaft für Mathematik und Datenverarbeitung, pages 201-244, 1976. Translation of German original, 1945.

