Jürgen Schmidhuber (2015, updated Nov 2020)
Pronounce: You_again Shmidhoobuh

Highway Networks (May 2015):
First Working Really Deep Feedforward
Neural Networks With Over 100 Layers

In 2009-2010, our team triggered the supervised deep learning revolution [MLP1-2]. Back then, both our deep feedforward neural networks (FNNs) and our earlier very deep recurrent NNs (RNNs, e.g., CTC-LSTM for connected handwriting recognition [LSTM5]) were able to beat all competing algorithms on important problems of that time.

However, in 2010, our deepest FNNs were still limited. They had at most 10 layers of neurons or so. In subsequent years, FNNs achieved at most a few tens of layers, e.g., 20-30 layers. On the other hand, our earlier work since 1991 on RNNs with unsupervised pre-training [UN1-2] and on supervised LSTM RNNs [LSTM1] suggested that much greater depth (up to 1000 and more) should be possible. And since depth is essential for deep learning, we wanted to transfer the principles of our deep RNNs to deep FNNs.

In May 2015 we achieved this goal. Our Highway Networks [HW1] [HW1a] were the first working really deep feedforward neural networks with hundreds of layers. This was made possible through the work of my PhD students Rupesh Kumar Srivastava and Klaus Greff. Highway Nets are essentially feedforward versions of recurrent Long Short-Term Memory (LSTM) networks [LSTM1] with forget gates (or "gated recurrent units") [LSTM2].

Let g, t, h denote non-linear differentiable functions. Each non-input layer of a Highway Net computes g(x)x + t(x)h(x), where x is the data from the previous layer and the products are taken elementwise. (As in LSTM RNNs [LSTM1] with forget gates [LSTM2].)
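The layer rule g(x)x + t(x)h(x) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation from [HW1]: the sigmoid gates, the tanh transform, the parameter names (Wg, bg, Wt, bt, Wh, bh), the weight scale and the positive carry bias are all hypothetical choices made for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, x):
    """Return W x + b for a list-of-rows matrix W and list vectors b, x."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def highway_layer(x, params):
    """Compute g(x)*x + t(x)*h(x) elementwise for one layer.

    Assumption: g and t are sigmoid gates and h is a tanh transform,
    each with its own (hypothetical) weights and biases.
    """
    g = [sigmoid(z) for z in affine(params["Wg"], params["bg"], x)]
    t = [sigmoid(z) for z in affine(params["Wt"], params["bt"], x)]
    h = [math.tanh(z) for z in affine(params["Wh"], params["bh"], x)]
    return [g_i * x_i + t_i * h_i for g_i, x_i, t_i, h_i in zip(g, x, t, h)]


def init_params(n, rng, carry_bias=2.0):
    """Small random weights; a positive bias keeps the carry gate g mostly open."""
    def mat():
        return [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(n)]
    return {
        "Wg": mat(), "bg": [carry_bias] * n,
        "Wt": mat(), "bt": [0.0] * n,
        "Wh": mat(), "bh": [0.0] * n,
    }


rng = random.Random(0)
n = 4
x = [0.5, -1.0, 0.25, 2.0]

# Stack 100 layers: each layer consumes the output of the previous one.
layers = [init_params(n, rng) for _ in range(100)]
y = x
for p in layers:
    y = highway_layer(y, p)
print(y)
```

With the carry gate saturated at 1 and the transform gate near 0, a layer simply copies its input, which is what lets information pass through many layers unchanged.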

This is the basic ingredient required to overcome the fundamental deep learning problem of vanishing or exploding gradients, which my very first student Sepp Hochreiter identified and analyzed in 1991, years before anybody else did [VAN1].
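A short worked derivation may help to see why the gated form matters for gradients. It is a simplified sketch, assuming scalar layers and slowly varying gates; the symbols x_k, N and c are introduced here for illustration only.

```latex
% One scalar highway layer and its derivative (product rule):
\[
  y = g(x)\,x + t(x)\,h(x),
  \qquad
  \frac{dy}{dx} = g(x) + g'(x)\,x + t'(x)\,h(x) + t(x)\,h'(x).
\]
% Through N stacked layers x_0 -> x_1 -> ... -> x_N, the chain rule gives a product:
\[
  \frac{dx_N}{dx_0} = \prod_{k=0}^{N-1} \frac{dx_{k+1}}{dx_k}.
\]
% Plain layer x_{k+1} = h(x_k): if |h'(x_k)| <= c < 1 for all k, then
\[
  \left| \frac{dx_N}{dx_0} \right| \le c^{N} \longrightarrow 0
  \quad \text{as } N \to \infty .
\]
% Highway layer with g close to 1, t close to 0, and g', t' close to 0:
\[
  \frac{dx_{k+1}}{dx_k} \approx 1
  \quad\Longrightarrow\quad
  \frac{dx_N}{dx_0} \approx 1 .
\]
```

In this idealised case each factor stays near 1, so the product neither shrinks nor blows up with depth. In a trained network the gates move away from these values and the approximation holds only loosely.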

If we open the gates by setting g(x)=t(x)=1 and keep them open, we obtain the so-called Residual Net or ResNet [HW2] (December 2015), a special case of our Highway Net [HW1]. It is essentially a feedforward variant of the original LSTM [LSTM1] without gates, or with gates initialised in a standard way, namely, fully open. That is, the basic LSTM principle is central not only to deep RNNs but also to deep FNNs. Microsoft Research won the ImageNet 2015 contest with a very deep ResNet of 152 layers [HW2] [IM15].
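The special-case relation can be checked numerically. The sketch below is a minimal illustration and is not code from [HW1] or [HW2]; the helper names (highway, residual, ones) and the particular transform h are hypothetical.

```python
import math


def highway(x, g, t, h):
    """General layer: g(x)*x + t(x)*h(x), elementwise, with g, t, h as callables."""
    gx, tx, hx = g(x), t(x), h(x)
    return [a * b + c * d for a, b, c, d in zip(gx, x, tx, hx)]


def residual(x, h):
    """Residual layer: x + h(x)."""
    return [a + b for a, b in zip(x, h(x))]


def ones(x):
    """A gate that is always fully open: constant 1 in every component."""
    return [1.0] * len(x)


def h(x):
    """Hypothetical transform, chosen only for this example."""
    return [math.tanh(0.5 * v) for v in x]


x = [0.5, -1.0, 2.0]

# With both gates fixed at 1, the highway layer and the residual layer agree.
assert highway(x, ones, ones, h) == residual(x, h)
```

Multiplying by the constant 1.0 is exact in floating point, so the two outputs agree exactly rather than only approximately.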

Highway Nets showed how very deep NNs with skip connections work. This is now also relevant for Transformers, e.g., [TR1] [TR2].

Contrary to certain claims (e.g., [HW2]), the earlier Highway Nets perform roughly as well as ResNets on ImageNet [HW3]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [HW3]. Compare [MIR] [DEC] [T20].

In the 2010s, LSTM concepts kept invading CNN territory, e.g., [7a-f], also through GPU-friendly multi-dimensional LSTMs [LSTM16].


[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers. Let g, t, h denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates [LSTM2] for RNNs.) ResNets [HW2] are a special case of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets [HW2] on ImageNet [HW3]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [HW3]. More.

[HW1a] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015. Link.

[HW2] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a special case of Highway Nets [HW1] where the gates are open: g(x)=1 (a typical highway net initialization) and t(x)=1. More.

[HW3] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arXiv:1612.07771 (2016). Also at ICLR 2017.

[IM15] ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015): Results

[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]

[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.

[LSTM16] M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation. Advances in Neural Information Processing Systems (NIPS), 2015. Preprint: arXiv:1506.07452.

[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. [More on the Fundamental Deep Learning Problem.]

[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991 [UN0]. PDF. [First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales), overcoming the vanishing gradient problem through unsupervised pre-training and predictive coding. Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. More.]

[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. [An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NN can be found here (depth > 1000).]

[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint (1 March 2010). [Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.]

[TR1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.

[TR2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.

[MLP2] J. Schmidhuber (Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. The rest is history.

[T20] J. Schmidhuber (June 2020). Critique of 2018 Turing Award.

[MIR] J. Schmidhuber (10/4/2019). Deep Learning: Our Miraculous Year 1990-1991. See also arXiv:2005.05744 (May 2020).

[DEC] J. Schmidhuber (02/20/2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.

[7a] 2011: First superhuman CNNs
[7b] 2011: First human-competitive CNNs for handwriting
[7c] 2012: First CNN to win segmentation contest
[7d] 2012: First CNN to win contest on object discovery in large images
[7e] Deep Learning. Scholarpedia, 10(11):32832, 2015
[7f] History of computer vision contests won by deep CNNs on GPUs (2017)

Can you spot the Fibonacci pattern in the graphics?