Highway Networks (May 2015):
First Working Really Deep Feedforward
Neural Networks With Over 100 Layers
In 2009-2010, our team triggered the
supervised deep learning revolution [MLP1-2].
Back then, both our deep feedforward neural networks (FNNs) and our earlier very deep recurrent NNs (RNNs, e.g., CTC-LSTM for connected handwriting recognition [LSTM5]) were able to beat all competing algorithms on important problems of that time.
However, in 2010, our deepest FNNs were still limited. They had at most 10 layers of neurons or so.
In subsequent years, FNNs achieved at most a few tens of layers, e.g., 20-30 layers.
On the other hand, our earlier work since 1991 on
RNNs with unsupervised pretraining [UN1-2] and on
supervised LSTM RNNs [LSTM1]
suggested that much greater depth (up to 1000 and more) should be possible. And since depth is essential for
deep learning,
we wanted to transfer the principles of our deep RNNs to deep FNNs.
In May 2015 we achieved this goal.
Our Highway Networks [HW1][HW1a] were the first working really deep
feedforward neural networks with hundreds of layers. This was made possible
through the work of my PhD students Rupesh Kumar Srivastava and Klaus Greff.
Highway Nets are essentially feedforward versions of recurrent Long Short-Term Memory (LSTM) networks [LSTM1] with forget gates (or "gated recurrent units") [LSTM2].
Let g, t, h denote nonlinear differentiable functions. Each non-input layer of a Highway Net computes
g(x)x + t(x)h(x),
where x is the data from the previous layer (as in LSTM RNNs [LSTM1] with forget gates [LSTM2]).
This is the basic ingredient required to overcome the fundamental deep learning problem of vanishing or exploding gradients, which my very first student Sepp Hochreiter identified and analyzed in 1991, years before anybody else did [VAN1].
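The layer formula g(x)x + t(x)h(x) can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation from [HW1]: the sigmoid gates, the tanh candidate h, the small Gaussian weights, the positive bias on the gate g, and the names `highway_layer`, `affine` and `init_params` are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Return W x + b, where W is given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, params):
    """One highway layer: y = g(x) * x + t(x) * h(x), elementwise."""
    # Assumption: both gates are sigmoid units, h is a tanh unit.
    g = [sigmoid(z) for z in affine(params["Wg"], params["bg"], x)]
    t = [sigmoid(z) for z in affine(params["Wt"], params["bt"], x)]
    h = [math.tanh(z) for z in affine(params["Wh"], params["bh"], x)]
    return [gi * xi + ti * hi for gi, xi, ti, hi in zip(g, x, t, h)]


def init_params(dim, rng, carry_bias=2.0):
    """Hypothetical initialisation: small random weights, g biased toward open."""
    def matrix():
        return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    return {
        "Wg": matrix(), "bg": [carry_bias] * dim,
        "Wt": matrix(), "bt": [0.0] * dim,
        "Wh": matrix(), "bh": [0.0] * dim,
    }


# Stack 100 layers and push one input vector through them.
rng = random.Random(0)
dim = 4
layers = [init_params(dim, rng) for _ in range(100)]
x = [0.5, -1.0, 0.25, 2.0]
y = x
for params in layers:
    y = highway_layer(y, params)
print(y)
```

Because the gated copy term g(x)x keeps the layer width fixed, input and output have the same dimension here. With the bias on g set to a positive value, each layer in this toy setup starts out mostly copying its input forward.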
If we open the gates by setting g(x)=t(x)=1 and keep them open,
we obtain the so-called
Residual Net or ResNet [HW2] (December 2015),
a version of our Highway Net [HW1].
It is
essentially a feedforward variant of the original
LSTM [LSTM1] without gates,
or with gates initialised in a standard way, namely, fully open.
That is, the basic LSTM principle is central not only to deep RNNs but also to deep FNNs.
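The open-gate special case can be checked numerically. The sketch below fixes both gates to 1 and compares the result with a plain residual update x + h(x). The transform `h` and the helper names `gated_layer`, `residual_layer` and `ones` are hypothetical, chosen only for illustration.

```python
import math


def gated_layer(x, g, t, h):
    """Highway-style layer: g(x) * x + t(x) * h(x), elementwise."""
    gx, tx, hx = g(x), t(x), h(x)
    return [gi * xi + ti * hi for gi, xi, ti, hi in zip(gx, x, tx, hx)]


def residual_layer(x, h):
    """Residual layer: x + h(x), elementwise."""
    return [xi + hi for xi, hi in zip(x, h(x))]


def ones(x):
    """A gate that is always fully open."""
    return [1.0] * len(x)


def h(x):
    """Hypothetical nonlinear transform, used only for this example."""
    return [math.tanh(0.5 * xi) for xi in x]


x = [0.5, -1.0, 2.0]
highway_out = gated_layer(x, ones, ones, h)
residual_out = residual_layer(x, h)
print(highway_out == residual_out)  # prints True
```

Any other choice of h gives the same agreement, since multiplying by a gate value of exactly 1 leaves both terms unchanged.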
Microsoft Research won the ImageNet 2015 contest with a very deep ResNet of 152 layers [HW2][IM15].
Highway Nets showed how very deep NNs with skip connections work.
This is now also relevant for
Transformers, e.g., [TR1][TR2][FWP0-1,6].
Contrary to certain claims (e.g., [HW2]),
the earlier Highway Nets perform roughly as well as ResNets on ImageNet [HW3].
Highway layers are also often used for natural language processing, where the simpler residual layers do
not work as well [HW3].
Compare [MIR][DEC][T22].
In the 2010s,
LSTM concepts kept invading CNN territory, e.g., [7a-f],
also through GPU-friendly multi-dimensional LSTMs [LSTM16].
Deep learning
is all about NN depth [DL1].
LSTMs
brought essentially unlimited depth to supervised recurrent NNs; Highway Nets brought it to feedforward NNs [MOST].
References
[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers. Let g, t, h denote nonlinear differentiable functions. Each non-input layer of a Highway Net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates [LSTM2] for RNNs.) ResNets [HW2] are a special case of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets [HW2] on ImageNet [HW3]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [HW3].
[HW1a]
R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015.
[HW2] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a special case of Highway Nets [HW1] where the gates are open: g(x)=1 (a typical highway net initialization) and t(x)=1.
[HW3]
K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint
arXiv:1612.07771 (2016). Also at ICLR 2017.
[DL1] J. Schmidhuber, 2015.
Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
Got the first Best Paper Award ever issued by the journal Neural Networks, founded in 1988.
[IM15]
ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015):
Results
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.
Based on [LSTM0].
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
[The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's TensorFlow.]
[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
[LSTM16]
M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation. Advances in Neural Information Processing Systems (NIPS), 2015.
Preprint: arXiv:1506.07452.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular.
[FWP]
J. Schmidhuber (AI Blog, 26 March 2021).
26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff!
30-year anniversary of a now popular
alternative^{[FWP0-1]} to recurrent NNs.
A slow feedforward NN learns by gradient descent to program the changes of
the fast weights of
another NN.
Such Fast Weight Programmers^{[FWP0-7]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-2]}).
The similar Transformers^{[TR1-2]} combine this with projections
and softmax and
are now widely used in natural language processing.
For long input sequences, their efficiency was improved through
Transformers with linearized self-attention^{[TR5-6]}
which are formally equivalent to the 1991 Fast Weight Programmers (apart from normalization).
In 1993, I introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
extended the approach to
RNNs that program themselves.
[FWP0]
J. Schmidhuber.
Learning to control fast-weight memories: An alternative to recurrent nets.
Technical Report FKI-147-91, Institut für Informatik, Technische
Universität München, 26 March 1991.
First paper on fast weight programmers: a slow net learns by gradient descent to compute weight changes of a fast net.
[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992.
[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993.
First recurrent fast weight programmer based on outer products. Introduced the terminology of learning "internal spotlights of attention."
[FWP3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. Workshop on Meta-Learning, @N(eur)IPS 2017, Long Beach, CA, USA.
[FWP3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (N(eur)IPS), Montreal, 2018.
Preprint: arXiv:1811.12143.
[FWP6] I. Schlag, K. Irie, J. Schmidhuber.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber.
Going Beyond Linear Transformers with Recurrent Fast Weight Programmers.
Advances in Neural Information Processing Systems (NeurIPS), 2021.
Preprint: arXiv:2106.06295. See also the
blog post.
[UN]
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pretraining. Unsupervised hierarchical predictive coding finds compact internal representations of sequential data to facilitate downstream learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth >1000.
[UN0]
J. Schmidhuber.
Neural sequence chunkers.
Technical Report FKI-148-91, Institut für Informatik, Technische
Universität München, April 1991.
[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]}
First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales),
overcoming the vanishing gradient problem through unsupervised pretraining and predictive coding.
Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills; such approaches are now widely used.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993.
Contains an ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pretraining for a stack of recurrent NNs (depth > 1000).
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber).
[More on the Fundamental Deep Learning Problem.]
[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12):3207-3220, 2010. ArXiv Preprint.
Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pretraining.
[MLP2] J. Schmidhuber
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pretraining.
By 2010, when compute was 100 times more expensive than today, both our feedforward NNs^{[MLP1]} and our earlier recurrent NNs were able to beat all competing algorithms on important problems of that time. This deep learning revolution quickly spread from Europe to North America and Asia. The rest is history.
[TR1]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.
[TR2]
J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.
[TR5]
A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret.
Transformers are RNNs: Fast autoregressive Transformers
with linear attention. In Proc. Int. Conf. on Machine
Learning (ICML), July 2020.
[TR6]
K. Choromanski, V. Likhosherstov, D. Dohan, X. Song,
A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin,
L. Kaiser, et al. Rethinking attention with Performers.
In Int. Conf. on Learning Representations (ICLR), 2021.
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21 (v3), IDSIA, Lugano, Switzerland, 22 June 2022.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, revised 2021). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020. The deep learning neural networks of our team have revolutionised pattern recognition and machine learning, and are now heavily used in academia and industry. In 2020-21, we celebrate that many of the basic ideas behind this revolution were published within fewer than 12 months in our "Annus Mirabilis" 1990-1991 at TU Munich.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020; revised 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.
[MOST]
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
(4) Generative Adversarial Networks (an instance of my earlier
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers).
Most of this started with our
Annus Mirabilis of 1990-1991.^{[MIR]}
[7a] 2011: First superhuman CNNs
[7b] 2011: First human-competitive CNNs for handwriting
[7c] 2012: First CNN to win segmentation contest
[7d] 2012: First CNN to win contest on object discovery in large images
[7e] Deep Learning.
Scholarpedia, 10(11):32832, 2015
[7f] History of computer vision contests won by deep CNNs on GPUs (2017)
