Highway Networks, the first working really deep feedforward neural networks with over 100 layers

Jürgen Schmidhuber (2015, updated 2022)
Pronounce: You_again Shmidhoobuh

Highway Networks (May 2015):
First Working Really Deep Feedforward
Neural Networks With Over 100 Layers

In 2009-2010, our team triggered the supervised deep learning revolution [MLP1-2]. Back then, both our deep feedforward neural networks (FNNs) and our earlier very deep recurrent NNs (RNNs, e.g., CTC-LSTM for connected handwriting recognition [LSTM5]) were able to beat all competing algorithms on important problems of that time.

However, in 2010, our deepest FNNs were still limited. They had at most 10 layers of neurons or so. In subsequent years, FNNs achieved at most a few tens of layers, e.g., 20-30 layers. On the other hand, our earlier work since 1991 on RNNs with unsupervised pre-training [UN1-2] and on supervised LSTM RNNs [LSTM1] suggested that much greater depth (up to 1000 and more) should be possible. And since depth is essential for deep learning, we wanted to transfer the principles of our deep RNNs to deep FNNs.

In May 2015 we achieved this goal. Our Highway Networks [HW1][HW1a] were the first working really deep feedforward neural networks with hundreds of layers. This was made possible through the work of my PhD students Rupesh Kumar Srivastava and Klaus Greff. Highway Nets are essentially feedforward versions of recurrent Long Short-Term Memory (LSTM) networks [LSTM1] with forget gates (or "gated recurrent units") [LSTM2].

Let g, t, h denote non-linear differentiable functions. Each non-input layer of a Highway Net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like in LSTM RNNs [LSTM1] with forget gates [LSTM2].)

This is the basic ingredient required to overcome the fundamental deep learning problem of vanishing or exploding gradients, which my very first student Sepp Hochreiter identified and analyzed in 1991, years before anybody else did [VAN1].

If we open the gates by setting g(x)=t(x)=1 and keep them open, we obtain the so-called Residual Net or ResNet [HW2] (December 2015), a version of our Highway Net [HW1]. It is essentially a feedforward variant of the original [LSTM1] without gates, or with gates initialised in a standard way, namely, fully open. That is, the basic LSTM principle is not only central to deep RNNs but also to deep FNNs. Microsoft Research won the ImageNet 2015 contest with a very deep ResNet of 150 layers [HW2][IM15].

Highway Nets showed how very deep NNs with skip connections work. This is now also relevant for Transformers, e.g., [TR1][TR2][FWP0-1,6].

Contrary to certain claims (e.g., [HW2]), the earlier Highway Nets perform roughly as well as ResNets on ImageNet [HW3]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [HW3]. Compare [MIR][DEC][T22].

In the 2010s, LSTM concepts kept invading CNN territory, e.g., [7a-f], also through GPU-friendly multi-dimensional LSTMs [LSTM16].

Deep learning is all about NN depth [DL1]. LSTMs brought essentially unlimited depth to supervised recurrent NNs; Highway Nets brought it to feedforward NNs [MOST].

References

[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers. Let g, t, h, denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates [LSTM2] for RNNs.) Resnets [HW2] are a special case of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets [HW2] on ImageNet [HW3]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [HW3]. More.

[HW1a] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015. Link.

[HW2] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a special case of Highway Nets [HW1] where the gates are open: g(x)=1 (a typical highway net initialization) and t(x)=1. More.

[HW3] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arxiv:1612.07771 (2016). Also at ICLR 2017.

[DL1] J. Schmidhuber, 2015. Deep learning in neural networks: An overview. Neural Networks, 61, 85-117. More. Got the first Best Paper Award ever issued by the journal Neural Networks, founded in 1988.

[IM15] ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015): Results

[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]

[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.

[LSTM16] M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation. Advances in Neural Information Processing Systems (NIPS), 2015. Preprint: arxiv:1506.07452.

[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention (1990) and soft attention (1991-93).^[FWP] Today, both types are very popular.

[FWP] J. Schmidhuber (AI Blog, 26 March 2021). 26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff! 30-year anniversary of a now popular alternative^[FWP0-1] to recurrent NNs. A slow feedforward NN learns by gradient descent to program the changes of the fast weights of another NN. Such Fast Weight Programmers^[FWP0-7] can learn to memorize past data, e.g., by computing fast weight changes through additive outer products of self-invented activation patterns^[FWP0-1] (now often called keys and values for self-attention^[TR1-2]). The similar Transformers^[TR1-2] combine this with projections and softmax and are now widely used in natural language processing. For long input sequences, their efficiency was improved through Transformers with linearized self-attention^[TR5-6] which are formally equivalent to the 1991 Fast Weight Programmers (apart from normalization). In 1993, I introduced the attention terminology^[FWP2] now used in this context,^[ATT] and extended the approach to RNNs that program themselves.

[FWP0] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, Institut für Informatik, Technische Universität München, 26 March 1991. PDF. First paper on fast weight programmers: a slow net learns by gradient descent to compute weight changes of a fast net.

[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. PDF. HTML. Pictures.

[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993. PDF. First recurrent fast weight programmer based on outer products. Introduced the terminology of learning "internal spotlights of attention."

[FWP3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. Workshop on Meta-Learning, @N(eur)IPS 2017, Long Beach, CA, USA.

[FWP3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (N(eur)IPS), Montreal, 2018. Preprint: arXiv:1811.12143. PDF.

[FWP6] I. Schlag, K. Irie, J. Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.

[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber. Going Beyond Linear Transformers with Recurrent Fast Weight Programmers. Advances in Neural Information Processing Systems (NeurIPS), 2021. Preprint: arXiv:2106.06295 . See also the Blog Post.

[UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised hierarchical predictive coding finds compact internal representations of sequential data to facilitate downstream learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth >1000.

[UN0] J. Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991. PDF.

[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991.^[UN0] PDF. First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales), overcoming the vanishing gradient problem through unsupervised pre-training and predictive coding. Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. More.

[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NN can be found here (depth > 1000).

[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. [More on the Fundamental Deep Learning Problem.]

[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint. Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.

[MLP2] J. Schmidhuber (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both our feedforward NNs^[MLP1] and our earlier recurrent NNs were able to beat all competing algorithms on important problems of that time. This deep learning revolution quickly spread from Europe to North America and Asia. The rest is history.

[TR1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.

[TR2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.

[TR5] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proc. Int. Conf. on Machine Learning (ICML), July 2020.

[TR6] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with Performers. In Int. Conf. on Learning Representations (ICLR), 2021.

[T22] J. Schmidhuber (AI Blog, 2022). Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21 (v3), IDSIA, Lugano, Switzerland, 22 June 2022.

[MIR] J. Schmidhuber (AI Blog, Oct 2019, revised 2021). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020. The deep learning neural networks of our team have revolutionised pattern recognition and machine learning, and are now heavily used in academia and industry. In 2020-21, we celebrate that many of the basic ideas behind this revolution were published within fewer than 12 months in our "Annus Mirabilis" 1990-1991 at TU Munich.

[DEC] J. Schmidhuber (AI Blog, 02/20/2020; revised 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.

[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), (4) Generative Adversarial Networks (an instance of my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers). Most of this started with our Annus Mirabilis of 1990-1991.^[MIR]

[7a] 2011: First superhuman CNNs
[7b] 2011: First human-competitive CNNs for handwriting
[7c] 2012: First CNN to win segmentation contest
[7d] 2012: First CNN to win contest on object discovery in large images
[7e] Deep Learning. Scholarpedia, 10(11):32832, 2015
[7f] History of computer vision contests won by deep CNNs on GPUs (2017)

Can you spot the Fibonacci pattern in the graphics?
.

Highway Networks (May 2015): First Working Really Deep Feedforward Neural Networks With Over 100 Layers

References

Highway Networks (May 2015):
First Working Really Deep Feedforward
Neural Networks With Over 100 Layers