Highway Networks (May 2015):
First Working Really Deep Feedforward
Neural Networks With Over 100 Layers
In 2009–2010, our team triggered the
supervised deep learning revolution [MLP1] [MLP2].
Back then, both our deep feedforward neural networks (FNNs) and our earlier very deep recurrent NNs (RNNs, e.g., CTC-LSTM for connected handwriting recognition [LSTM5]) were able to beat all competing algorithms on important problems of that time.
However, in 2010, our deepest FNNs were still limited to about 10 layers of neurons.
In subsequent years, FNNs reached at most a few tens of layers, e.g., 20–30 layers.
On the other hand, our earlier work since 1991 on
RNNs with unsupervised pretraining [UN1] [UN2] and on
supervised LSTM RNNs [LSTM1]
suggested that much greater depth (1,000 layers and more) should be possible. And since depth is essential for
deep learning,
we wanted to transfer the principles of our deep RNNs to deep FNNs.
In May 2015 we achieved this goal.
Our Highway Networks [HW1] [HW1a] were the first working really deep
feedforward neural networks with hundreds of layers. This was made possible
through the work of my PhD students Rupesh Kumar Srivastava and Klaus Greff.
Highway Nets are essentially feedforward versions of recurrent Long Short-Term Memory (LSTM) networks [LSTM1] with forget gates (or "gated recurrent units") [LSTM2].
Let g, t, h denote nonlinear differentiable functions. Each non-input layer of a Highway Net computes
g(x)·x + t(x)·h(x),
where x is the data from the previous layer. (Like in LSTM RNNs [LSTM1] with forget gates [LSTM2].)
This is the basic ingredient required to overcome the fundamental deep learning problem of vanishing or exploding gradients, which my very first student Sepp Hochreiter identified and analyzed in 1991, years before anybody else did [VAN1].
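As a concrete illustration, a single highway layer can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code; it assumes the coupled-gate variant described in [HW1], where the carry gate is tied to the transform gate via g(x) = 1 − t(x), with tanh for h and a logistic sigmoid for t:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: y = g(x)*x + t(x)*h(x).
    Coupled gates as in [HW1]: g(x) = 1 - t(x)."""
    h = np.tanh(W_h @ x + b_h)   # candidate transformation h(x)
    t = sigmoid(W_t @ x + b_t)   # transform gate t(x), in (0, 1)
    g = 1.0 - t                  # carry gate g(x)
    return g * x + t * h         # mix of carried input and transformation

# A strongly negative gate bias keeps the transform gate nearly closed,
# so the layer passes its input through almost unchanged -- this is how
# gradients can flow undiminished through hundreds of such layers.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 4)) * 0.1
y = highway_layer(x, W, np.zeros(4), W, np.full(4, -20.0))
```

With the gate bias at −20, t(x) is vanishingly small and y is numerically indistinguishable from x; initializing the transform-gate bias to a negative value is exactly the trick [HW1] uses so that a very deep stack starts out close to the identity mapping.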
If we open the gates by setting g(x)=t(x)=1 and keep them open,
we obtain the so-called
Residual Net or ResNet [HW2] (December 2015),
a special case of our Highway Net [HW1].
It is
essentially a feedforward variant of the original
[LSTM1] without gates,
or with gates initialised in a standard way, namely, fully open.
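To make the special-case relation concrete, here is another illustrative NumPy sketch (not code from [HW2]): fixing both gates open, g(x)=t(x)=1, collapses the highway computation g(x)·x + t(x)·h(x) to the residual computation x + h(x).

```python
import numpy as np

def residual_layer(x, W_h, b_h):
    # A highway layer with both gates fixed open, g(x) = t(x) = 1,
    # reduces to y = 1*x + 1*h(x) = x + h(x): the ResNet building block.
    return x + np.tanh(W_h @ x + b_h)

# With the transformation weights initialized to zero, h(x) = tanh(0) = 0,
# so the layer computes exactly the identity mapping.
x = np.array([1.0, -2.0, 0.5])
y = residual_layer(x, np.zeros((3, 3)), np.zeros(3))
```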
That is, the basic LSTM principle is not only central to deep RNNs but also to deep FNNs.
Microsoft Research won the ImageNet 2015 contest with a very deep ResNet of over 150 layers [HW2] [IM15].
Highway Nets showed how very deep NNs with skip connections work.
This is now also relevant for Transformers, e.g., [TR1] [TR2].
Contrary to certain claims (e.g., [HW2]),
the earlier Highway Nets perform roughly as well as ResNets on ImageNet [HW3].
Highway layers are also often used for natural language processing, where the simpler residual layers do
not work as well [HW3].
Compare [MIR] [DEC] [T20].
In the 2010s,
LSTM concepts kept invading CNN territory, e.g., [7a–f],
also through GPU-friendly multi-dimensional LSTMs [LSTM16].
References
[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers. Let g, t, h denote nonlinear differentiable functions. Each non-input layer of a highway net computes g(x)·x + t(x)·h(x), where x is the data from the previous layer. (Like LSTM with forget gates [LSTM2] for RNNs.) ResNets [HW2] are a special case of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets [HW2] on ImageNet [HW3]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [HW3].
More.
[HW1a]
R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10–11, 2015.
Link.
[HW2] He, K., Zhang,
X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint
arXiv:1512.03385
(Dec 2015). Residual nets are a special case of Highway Nets [HW1]
where the gates are open:
g(x)=1 (a typical highway net initialization) and t(x)=1.
More.
[HW3]
K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint
arxiv:1612.07771 (2016). Also at ICLR 2017.
[IM15]
ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015):
Results
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997. PDF.
Based on [LSTM0]. More.
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
PDF.
[The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]
[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
PDF.
[LSTM16]
M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation. Advances in Neural Information Processing Systems (NIPS), 2015.
Preprint: arxiv:1506.07452.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF.
[More on the Fundamental Deep Learning Problem.]
[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992. Based on TR FKI-148-91, TUM, 1991 [UN0]. PDF.
[First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales),
overcoming the vanishing gradient problem through unsupervised pretraining and predictive coding.
Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. More.]
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
[An ancient experiment on "Very Deep Learning" with credit assignment across 1,200 time steps or virtual layers and unsupervised pretraining for a stack of recurrent NNs
can be found here (depth > 1000).]
[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation, 22(12):3207–3220, 2010. ArXiv preprint (1 March 2010).
[Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pretraining.]
[TR1]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998–6008.
[TR2]
J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.
[MLP2] J. Schmidhuber
(Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pretraining. The rest is history.
[T20] J. Schmidhuber (June 2020). Critique of 2018 Turing Award.
[MIR] J. Schmidhuber (10/4/2019). Deep Learning: Our Miraculous Year 1990–1991. See also arxiv:2005.05744 (May 2020).
[DEC] J. Schmidhuber (02/20/2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[7a] 2011: First superhuman CNNs
[7b] 2011: First human-competitive CNNs for handwriting
[7c] 2012: First CNN to win segmentation contest
[7d] 2012: First CNN to win contest on object discovery in large images
[7e] Deep Learning.
Scholarpedia, 10(11):32832, 2015
[7f] History of computer vision contests won by deep CNNs on GPUs (2017)
