End-to-end differentiable neural attention 1990-93

Jürgen Schmidhuber (October 2020)
Pronounce: You_again Shmidhoobuh


Abstract. In 2020, we are celebrating the 30-year anniversary of our end-to-end differentiable sequential neural attention and goal-conditional reinforcement learning (RL) [ATT0] [ATT1]. This work was conducted in 1990 at TUM with my student Rudolf Huber. A few years later, I also described the learning of internal spotlights of attention in end-to-end differentiable fashion for outer product-based fast weights [FAST2]. That is, back then we already had both of the now common types of neural sequential attention: end-to-end differentiable "soft" attention (in latent space) through multiplicative units within neural networks [FAST2], and "hard" attention (in observation space) in the context of RL [ATT0] [ATT1]. Today, similar techniques are widely used.

Learning Sequential Attention Through Neural Networks (1990)

Unlike traditional artificial neural networks (NNs), humans use sequential gaze shifts and selective attention to detect and recognize patterns. This can be much more efficient than the highly parallel approach of traditional feedforward NNs. That's why we introduced sequential attention-learning NNs three decades ago (1990 and onwards) [ATT0] [ATT1]; compare also [AC90] [PLAN2-3] [PHD].

Our work [ATT0] also introduced another concept that is widely used in today's Reinforcement Learning (RL): extra goal-defining input patterns that encode various tasks, such that the RL machine knows which task to execute next. In references [ATT0] [ATT1], such goal-conditional RL is used to train a neural controller with the help of an adaptive, neural, predictive world model [AC90]. The world model learns to predict parts of future inputs from past inputs and actions of the controller. Using the world model, the controller learns to find objects in visual scenes, by steering a fovea through sequences of saccades, thus learning sequential attention. User-defined goals (target objects) are provided to the system by special "goal input vectors" that remain constant (Sec. 3.2 of [ATT1]) while the controller shapes its stream of visual inputs through fovea-shifting actions.
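The data flow described above can be illustrated with a minimal Python sketch. It is not the system of [ATT0] [ATT1]: the one-dimensional scene, the 3-pixel fovea, the two-element goal code, and the untrained random linear controller are all assumptions made for illustration, and training through the world model is omitted entirely. The sketch only shows that the goal vector stays constant during an episode while the fovea inputs change with each shift.

```python
import random

def fovea_view(scene, pos, radius=1):
    """Return the pixels under the fovea; positions outside the scene read as 0."""
    return [scene[i] if 0 <= i < len(scene) else 0.0
            for i in range(pos - radius, pos + radius + 1)]

def controller(weights, view, goal):
    """Hypothetical linear policy on [view, goal]; returns a shift in {-1, 0, +1}."""
    x = view + goal  # list concatenation: the goal is just an extra input pattern
    s = sum(w * xi for w, xi in zip(weights, x))
    return (s > 0.5) - (s < -0.5)

def run_episode(scene, goal, weights, start=0, steps=5):
    """Shift the fovea for a few steps; record positions and controller inputs."""
    pos, trajectory, inputs = start, [start], []
    for _ in range(steps):
        view = fovea_view(scene, pos)
        inputs.append(view + goal)
        shift = controller(weights, view, goal)
        pos = max(0, min(len(scene) - 1, pos + shift))  # stay inside the scene
        trajectory.append(pos)
    return trajectory, inputs

rng = random.Random(0)
scene = [0.0, 0.2, 0.9, 0.1, 0.0, 0.7, 0.0, 0.3]
goal = [1.0, 0.0]  # assumed one-hot code for "find target object A"
weights = [rng.uniform(-1, 1) for _ in range(3 + len(goal))]  # untrained
trajectory, inputs = run_episode(scene, goal, weights)
```

In a trained system, the weights would be adjusted so that the trajectory ends on the object named by the goal vector; here they are random, so the trajectory itself carries no meaning.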

In the same year of 1990, we also used goal-conditional RL for hierarchical RL with end-to-end differentiable subgoal generators [HRL0-2]. An NN with task-defining inputs of the form (start, goal) learns to predict the costs of going from start states to goal states. (Compare my former student Tom Schaul's "universal value function approximator" at DeepMind a quarter century later [UVF15].) In turn, the gradient descent-based subgoal generator NN learns to use such predictions to come up with better subgoals. More in Sec. 10 & Sec. 12 of [MIR].
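The interplay of cost predictor and subgoal generator can be sketched in a few lines of Python. This is a toy under strong assumptions, not the architecture of [HRL0-2]: the learned cost predictor is replaced by a fixed squared Euclidean distance, and the subgoal generator is a two-parameter linear map (hypothetical parameters a and b) instead of an NN. What it keeps is the principle that the gradient of the predicted costs flows back through the subgoal into the generator's parameters.

```python
import random

def cost(p, q):
    """Stand-in for the learned cost predictor: squared Euclidean distance."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def subgoal(params, start, goal):
    """Tiny subgoal generator: a weighted combination of start and goal."""
    a, b = params
    return [a * s + b * g for s, g in zip(start, goal)]

def train_generator(steps=2000, lr=0.05, seed=0):
    """Gradient descent on cost(start, sub) + cost(sub, goal) for random tasks."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    for _ in range(steps):
        start = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        goal = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        sub = subgoal((a, b), start, goal)
        # Gradient of the summed predicted costs with respect to the subgoal.
        grad_sub = [2 * (x - s) + 2 * (x - g) for x, s, g in zip(sub, start, goal)]
        # Chain rule: propagate that gradient into the generator's parameters.
        grad_a = sum(gs * s for gs, s in zip(grad_sub, start))
        grad_b = sum(gs * g for gs, g in zip(grad_sub, goal))
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

a, b = train_generator()
start, goal = [0.0, 0.0], [1.0, 1.0]
sub = subgoal((a, b), start, goal)
```

With this particular stand-in cost, the generator converges to the midpoint between start and goal (a and b both approach 0.5), which has a lower summed cost than going from start to goal directly.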

Section 5 of my overview paper for CMSS 1990 [ATT2] summarised our early work on attention, apparently the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component with an attentional component (the fixation controller). More on this in Sec. 9 of [MIR] and Sec. XVI & XVII of [T20].

Learning Internal Spotlights of Attention

A typical NN has many more connections than neurons. In traditional NNs, neuron activations change quickly, while connection weights change slowly. That is, the numerous weights cannot implement short-term memories or temporal variables, only the few neuron activations can. Non-traditional NNs with quickly changing "fast weights" overcome this limitation.

End-to-end Differentiable Fast Weights: NNs Learn to Program NNs (1991)

Dynamic links or fast weights for NNs were introduced by Christoph v. d. Malsburg in 1981 [FAST]. However, he did not have an end-to-end differentiable system that learns by gradient descent to quickly manipulate the fast weight storage. I published such a system in 1991 [FAST0] [FAST1]. In it, a slow NN learns to control the weights of a separate fast NN. One year later, I introduced gradient descent-based, active control of fast weights through 2D tensors or outer product updates for recurrent NNs [FAST2] (compare our more recent work on this [FAST3] [FAST3a]). The motivation was to get many more temporal variables under end-to-end differentiable control than is possible in standard recurrent NNs of the same size: O(H^2) instead of O(H), where H is the number of hidden units. To achieve this, Section 2 of [FAST2] explicitly introduced the learning of "internal spotlights of attention" in end-to-end differentiable networks. More in Sec. 8 of [MIR] and Sec. XVI & XVII of [T20].
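A minimal Python sketch may help to see why outer product updates give O(H^2) temporal variables from O(H) outputs. It is an illustration only: the key and value vectors below are set by hand, whereas in the systems described above they would be produced by a slow NN trained by gradient descent, and the helper names (outer, add_into, matvec) are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add_into(F, D):
    """Add the update D to the fast weight matrix F in place."""
    for row_f, row_d in zip(F, D):
        for j, d in enumerate(row_d):
            row_f[j] += d

def matvec(F, x):
    """Apply the fast weight matrix to an input vector."""
    return [sum(f * xj for f, xj in zip(row, x)) for row in F]

H = 3
F = [[0.0] * H for _ in range(H)]  # H*H fast weights, all initially zero

# Each write needs only two vectors of size H but changes up to H*H weights.
writes = [
    ([1.0, 0.0, 0.0], [1.0, 2.0, 3.0]),  # (key, value)
    ([0.0, 1.0, 0.0], [4.0, 5.0, 6.0]),
]
for key, value in writes:
    add_into(F, outer(value, key))

recalled = matvec(F, [0.0, 1.0, 0.0])  # query with the second key
```

Because the two keys are orthogonal, querying with a key returns exactly the value stored with it. With non-orthogonal keys the retrieved values would interfere.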

Today, the most famous end-to-end differentiable fast weight-based NN [FAST0] is actually our vanilla LSTM network of 2000 [LSTM2], whose forget gates learn to control the fast weights on self-recurrent connections of internal LSTM cells. All the major IT companies are now massively using vanilla LSTM [DL4]. More in [DEC] and Sec. 4 & Sec. 8 of [MIR].
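The role of the forget gate as a quickly changing weight on a cell's self-recurrent connection can be sketched as follows. This is a deliberately stripped-down scalar cell, not the full architecture of [LSTM2]: the gates here depend only on the current input through assumed fixed weights (w_f, w_i, w_g), and the hidden state feedback, biases, and output gate are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cell_step(c_prev, x, w_f=4.0, w_i=4.0, w_g=1.0):
    """One update of a single simplified memory cell."""
    f = sigmoid(w_f * x)    # forget gate: time-varying weight on the self-connection
    i = sigmoid(w_i * x)    # input gate
    g = math.tanh(w_g * x)  # candidate cell input
    c = f * c_prev + i * g  # self-recurrent connection scaled by f
    return c, f

c = 0.0
history = []
for x in [1.0, 1.0, -3.0]:
    c, f = cell_step(c, x)
    history.append((c, f))
```

For the first two inputs the forget gate is close to 1 and the cell accumulates its content. For the third input the gate is close to 0, so the self-connection is effectively switched off and the cell state is reset.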


Thanks to several expert reviewers for useful comments. (Let me know at juergen@idsia.ch if you can spot any remaining error.) The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites.

References

[DL1] J. Schmidhuber, 2015. Deep Learning in neural networks: An overview. Neural Networks, 61, 85-117. More.

[DL2] J. Schmidhuber, 2015. Deep Learning. Scholarpedia, 10(11):32832.

[DL4] J. Schmidhuber, 2017. Our impact on the world's most valuable public companies: 1. Apple, 2. Alphabet (Google), 3. Microsoft, 4. Facebook, 5. Amazon ... HTML.

[ATT0] J. Schmidhuber and R. Huber. Learning to generate focus trajectories for attentive vision. Technical Report FKI-128-90, Institut für Informatik, Technische Universität München, 1990. PDF.

[ATT1] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135-141, 1991. Based on TR FKI-128-90, TUM, 1990. PDF. More.

[ATT2] J. Schmidhuber. Learning algorithms for networks with internal and external feedback. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, editors, Proc. of the 1990 Connectionist Models Summer School, pages 52-61. San Mateo, CA: Morgan Kaufmann, 1990. PS. (PDF.)

[HRL0] J. Schmidhuber. Towards compositional learning with dynamic neural networks. Technical Report FKI-129-90, Institut für Informatik, Technische Universität München, 1990. PDF.

[HRL1] J. Schmidhuber. Learning to generate sub-goals for action sequences. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 967-972. Elsevier Science Publishers B.V., North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990. HTML & images in German.

[HRL2] J. Schmidhuber and R. Wahnsiedler. Planning simple trajectories using neural subgoal generators. In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc. of the 2nd International Conference on Simulation of Adaptive Behavior, pages 196-202. MIT Press, 1992. PDF. HTML & images in German.

[HRL4] M. Wiering and J. Schmidhuber. HQ-Learning. Adaptive Behavior 6(2):219-246, 1997. PDF.

[UVF15] T. Schaul, D. Horgan, K. Gregor, D. Silver. Universal value function approximators. Proc. ICML 2015, pp. 1312-1320, 2015.

[FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie, Max-Planck Institut f. Biophysik und Chemie, Goettingen, 1981.

[FAST0] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, Institut für Informatik, Technische Universität München, March 1991. PDF.

[FAST1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. PDF. HTML. Pictures (German).

[FAST2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993. PDF.

[FAST3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. Workshop on Meta-Learning, @NIPS 2017, Long Beach, CA, USA.

[FAST3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (NIPS), Montreal, 2018. Preprint: arXiv:1811.12143. PDF.

[FAST5] F. J. Gomez and J. Schmidhuber. Evolving modular fast-weight networks for control. In W. Duch et al. (Eds.): Proc. ICANN'05, LNCS 3697, pp. 383-389, Springer-Verlag Berlin Heidelberg, 2005. PDF. HTML overview.

[T20] J. Schmidhuber (June 2020). Critique of 2018 Turing Award. Link.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]

[PHD] J. Schmidhuber. Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem (Dynamic neural nets and the fundamental spatio-temporal credit assignment problem). Dissertation, Institut für Informatik, Technische Universität München, 1990. PDF. HTML.

[AC90] J. Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990. PDF.

[PLAN2] J. Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In Proc. IEEE/INNS International Joint Conference on Neural Networks, San Diego, volume 2, pages 253-258, 1990. Based on [AC90].

[PLAN3] J. Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, NIPS'3, pages 500-506. San Mateo, CA: Morgan Kaufmann, 1991. PDF. Partially based on [AC90].

[MIR] J. Schmidhuber (10/4/2019). Deep Learning: Our Miraculous Year 1990-1991. See also arXiv:2005.05744 (May 2020).

[DEC] J. Schmidhuber (02/20/2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.

