**Abstract.** In 2020, we are celebrating the
30th anniversary of our work on
end-to-end differentiable sequential neural attention and goal-conditional reinforcement learning
(RL) [ATT0] [ATT1],
conducted in 1990 at TUM with my student Rudolf Huber.
A few years later, I also described the learning
of internal spotlights of attention in end-to-end differentiable fashion for *outer
product-based fast weights* [FAST2].
That is, back then we already had *both* of the now common types of neural sequential attention:
end-to-end differentiable *"soft"* attention (in *latent* space)
through multiplicative units within neural networks [FAST2],
and
*"hard"* attention (in *observation* space) in
the context of RL [ATT0] [ATT1].
Today, similar techniques are widely used.

Unlike traditional artificial neural networks (NNs), humans use sequential gaze shifts and **selective attention** to detect and recognize patterns.
This can be much more efficient than the highly parallel approach of traditional feedforward NNs.
That's why we introduced
sequential attention-learning NNs three decades ago (1990 and onwards) [ATT0]
[ATT1]; compare also [AC90] [PLAN2-3] [PHD].

Our work [ATT0] also introduced another concept that is widely used in today's **Reinforcement Learning** (RL):
extra *goal-defining input patterns* that encode various tasks,
such that the RL machine knows which task to execute next.
In references [ATT0] [ATT1], such **goal-conditional RL** is used to train
a neural controller with the help of an adaptive, neural, predictive **world model** [AC90]. The world model learns to predict parts of future inputs from
past inputs and actions of the controller. Using the world model,
the controller learns to find objects in visual scenes by steering a fovea through sequences of saccades, thus learning sequential attention. User-defined goals (target objects) are provided to the system by special "goal input vectors" that remain constant (Sec. 3.2 of [ATT1]) while the controller shapes its stream of visual inputs through fovea-shifting actions.
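The controller/world model loop can be sketched in a few lines of numpy. This is only an illustrative toy, not the 1990 implementation: the layer sizes, the one-hot goal encoding, and the untrained random weights are all assumptions made here for brevity; in [ATT0] [ATT1] both networks were trained by gradient descent, with the world model providing the differentiable path from fovea-shifting actions back to the controller.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(params, x):
    """Minimal two-layer network: tanh hidden layer, linear output."""
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1 + b1) @ W2 + b2

def init(n_in, n_hid, n_out):
    return (rng.normal(0, 0.1, (n_in, n_hid)), np.zeros(n_hid),
            rng.normal(0, 0.1, (n_hid, n_out)), np.zeros(n_out))

# Hypothetical sizes: 16-dim fovea observation, 4 possible goals, 2-dim saccade action.
OBS, GOALS, ACT = 16, 4, 2
controller  = init(OBS + GOALS, 32, ACT)   # input = current observation + goal vector
world_model = init(OBS + ACT, 32, OBS)     # predicts next observation from (obs, action)

obs  = rng.normal(size=OBS)
goal = np.eye(GOALS)[1]                    # constant goal-defining input pattern

for t in range(5):                         # a short sequence of saccades
    action = np.tanh(mlp(controller, np.concatenate([obs, goal])))
    obs = mlp(world_model, np.concatenate([obs, action]))  # model's predicted next input
```

Because the goal vector is just another input pattern, the same controller weights can serve many tasks; switching tasks means switching the goal input, not retraining.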

In the same year of 1990, we also used **goal-conditional RL
for
hierarchical RL** with
end-to-end differentiable **subgoal generators** [HRL0-2]. An NN with task-defining inputs
of the form *(start, goal)* learns to predict the costs of going from start states to goal states.
(Compare my former student Tom Schaul's *"universal value function approximator"* at DeepMind a quarter century later [UVF15].)
In turn, the gradient descent-based subgoal generator NN learns to use such predictions to come up with better subgoals.
More in
Sec. 10 & Sec. 12 of [MIR].
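The subgoal-generation mechanism can be illustrated with a toy sketch. Here the trained *(start, goal)* cost-predictor NN of [HRL0-2] is replaced by a hand-coded quadratic cost whose gradient is known in closed form (an assumption for brevity); the point is only the mechanism: gradient descent through the frozen cost predictor moves a candidate subgoal toward one minimizing the predicted cost of *start → subgoal → goal*.

```python
import numpy as np

# Stand-in for the trained evaluator NN that predicts cost(start, goal);
# a simple quadratic, so its gradient is available in closed form.
def predicted_cost(a, b):
    return np.sum((a - b) ** 2)

def grad_wrt_first(a, b):
    return 2.0 * (a - b)

start = np.array([0.0, 0.0])
goal  = np.array([4.0, 2.0])
sub   = np.array([3.5, -1.0])   # initial subgoal guess

# Gradient descent on cost(start, sub) + cost(sub, goal),
# differentiating through the (frozen) evaluator w.r.t. the subgoal only.
for _ in range(200):
    g = grad_wrt_first(sub, start) + grad_wrt_first(sub, goal)
    sub -= 0.05 * g

# sub converges toward the midpoint [2., 1.], the minimizer of the total cost
```

With a learned, nonlinear cost predictor the same recipe applies; the subgoal generator NN simply amortizes this inner optimization.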

Section 5 of my
overview paper for CMSS 1990
[ATT2] summarised our early work on attention:
apparently the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component
with an attentional component (the fixation controller).
More on this in
Sec. 9 of [MIR]
and Sec. XVI & XVII of [T20].

## Learning Internal Spotlights of Attention

A typical NN has many more connections than neurons.
In traditional NNs, neuron activations change quickly,
while connection weights change slowly.
That is, the numerous weights cannot implement short-term memories
or temporal variables, only the few neuron activations can.
Non-traditional NNs with quickly changing *"fast weights"* overcome this limitation.
Dynamic links or fast weights for NNs were introduced by Christoph v. d. Malsburg in 1981 [FAST]. However, he did not have an *end-to-end differentiable* system that learns by gradient descent to quickly manipulate the fast weight storage. I published such a system in 1991 [FAST0] [FAST1].
There a slow NN learns to control the weights of a separate fast NN.
One year later, I introduced gradient descent-based, active control of fast weights through **2D tensors or outer product updates** for recurrent NNs [FAST2] (compare our more recent work on this [FAST3] [FAST3a]).
The motivation
was to get many more temporal variables under end-to-end differentiable control than what's possible in standard recurrent NNs of the same size: O(H^2) instead of O(H), where H is the number of hidden units. To achieve this,
Section 2 of
[FAST2] explicitly introduced **the learning of "internal spotlights of attention"** in end-to-end differentiable networks.
More in
Sec. 8 of [MIR]
and Sec. XVI & XVII of [T20].
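The outer-product mechanism can be sketched in numpy. This is an assumption-laden toy (untrained random slow weights, tanh nonlinearities, a purely additive Hebb-like write), not the original recurrent formulation of [FAST2]; it only shows how a slow net with O(H) activations controls O(H^2) fast weights through rank-1 updates.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                    # hidden size: O(H^2) fast weights vs. O(H) activations

W_slow = rng.normal(0, 0.5, (H, 2 * H))  # slow net emits the two factors of each update
F = np.zeros((H, H))                     # fast weight matrix, rewritten at every step

xs = rng.normal(size=(5, H))             # a short input sequence
for x in xs:
    kv = np.tanh(x @ W_slow)
    k, v = kv[:H], kv[H:]
    F = F + np.outer(v, k)               # rank-1 outer-product update ("write")
    y = np.tanh(F @ x)                   # apply the current fast weights ("read")
```

Gradients flow through both the write and the read, so the slow weights can learn *which* temporal variables to store in the fast weight matrix; this is the end-to-end differentiable "internal spotlight of attention."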

Today, the most famous end-to-end differentiable fast weight-based NN [FAST0] is actually our vanilla LSTM network of 2000 [LSTM2], whose forget gates learn to control the fast weights on self-recurrent connections of internal LSTM cells. All the major IT companies are now massively using vanilla LSTM [DL4].
More in [DEC] and
Sec. 4 & Sec. 8 of [MIR].

## Acknowledgments

Thanks to several expert reviewers for useful comments. (Let me know under *juergen@idsia.ch* if you can spot any remaining error.) The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites.

## References

[DL1] J. Schmidhuber, 2015.
Deep Learning in neural networks: An overview. Neural Networks, 61, 85-117.

[DL2] J. Schmidhuber, 2015.
Deep Learning.
Scholarpedia, 10(11):32832.

[DL4] J. Schmidhuber, 2017. Our impact on the world's most valuable public companies: 1. Apple, 2. Alphabet (Google), 3. Microsoft, 4. Facebook, 5. Amazon ...

[ATT0] J. Schmidhuber and R. Huber.
Learning to generate focus trajectories for attentive vision.
Technical Report FKI-128-90, Institut für Informatik, Technische
Universität München, 1990.

[ATT1] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135-141, 1991. Based on TR FKI-128-90, TUM, 1990.

[ATT2]
J. Schmidhuber.
Learning algorithms for networks with internal and external feedback.
In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton,
editors, *Proc. of the 1990 Connectionist Models Summer School*, pages
52-61. San Mateo, CA: Morgan Kaufmann, 1990.

[HRL0]
J. Schmidhuber.
Towards compositional learning with dynamic neural networks.
Technical Report FKI-129-90, Institut für Informatik, Technische
Universität München, 1990.

[HRL1]
J. Schmidhuber. Learning to generate sub-goals for action sequences. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 967-972. Elsevier Science Publishers B.V., North-Holland, 1991. Extending TR FKI-129-90, TUM, 1990.

[HRL2]
J. Schmidhuber and R. Wahnsiedler.
Planning simple trajectories using neural subgoal generators.
In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, *Proc.
of the 2nd International Conference on Simulation of Adaptive Behavior*,
pages 196-202. MIT Press, 1992.

[HRL4]
M. Wiering and J. Schmidhuber. HQ-Learning. Adaptive Behavior 6(2):219-246, 1997.

[UVF15]
T. Schaul, D. Horgan, K. Gregor, D. Silver. Universal value function approximators. Proc. ICML 2015, pp. 1312-1320, 2015.

[FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie,
Max-Planck Institut f. Biophysik und Chemie, Goettingen, 1981.

[FAST0]
J. Schmidhuber.
Learning to control fast-weight memories: An alternative to recurrent nets.
Technical Report FKI-147-91, Institut für Informatik, Technische
Universität München, March 1991.

[FAST1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992.

[FAST2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993.

[FAST3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. Workshop on Meta-Learning, @NIPS 2017, Long Beach, CA, USA.

[FAST3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (NIPS), Montreal, 2018.
Preprint: arXiv:1811.12143.

[FAST5]
F. J. Gomez and J. Schmidhuber.
Evolving modular fast-weight networks for control.
In W. Duch et al. (Eds.):
*Proc. ICANN'05,*
LNCS 3697, pp. 383-389, Springer-Verlag Berlin Heidelberg, 2005.

[T20] J. Schmidhuber (June 2020). Critique of 2018 Turing Award.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
*[The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]*

[PHD]
J. Schmidhuber.
Dynamische neuronale Netze und das fundamentale raumzeitliche
Lernproblem
(Dynamic neural nets and the fundamental spatio-temporal
credit assignment problem).
Dissertation,
Institut für Informatik, Technische
Universität München, 1990.

[AC90]
J. Schmidhuber.
Making the world differentiable: On using fully recurrent
self-supervised neural networks for dynamic reinforcement learning and
planning in non-stationary environments.
Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990.

[PLAN2]
J. Schmidhuber.
An on-line algorithm for dynamic reinforcement learning and planning
in reactive environments.
In *Proc. IEEE/INNS International Joint Conference on Neural
Networks, San Diego*, volume 2, pages 253-258, 1990.
Based on [AC90].

[PLAN3]
J. Schmidhuber.
Reinforcement learning in Markovian and non-Markovian environments.
In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, *
Advances in Neural Information Processing Systems 3, NIPS'3*, pages 500-506. San
Mateo, CA: Morgan Kaufmann, 1991.
Partially based on [AC90].

[MIR] J. Schmidhuber (10/4/2019). Deep Learning: Our Miraculous Year 1990-1991. See also arxiv:2005.05744 (May 2020).

[DEC] J. Schmidhuber (02/20/2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
