Highway Networks: First Working Feedforward Networks With Over 100 Layers
Jürgen Schmidhuber
Our Highway Networks of May 2015 [4] were the first working very deep feedforward neural networks with hundreds of layers. This was made possible
through the work of my PhD students Rupesh Kumar Srivastava and Klaus Greff.
Highway Nets are essentially feedforward versions of recurrent Long Short-Term Memory (LSTM) networks [3] with forget gates [5] (compare the later "gated recurrent units").
Let g, t, h denote nonlinear differentiable functions. Each non-input layer of a Highway Net computes

y = g(x)x + t(x)h(x),

where x is the data from the previous layer. (This mirrors LSTM [3] with forget gates [5] for recurrent networks.)
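As a concrete illustration, the layer computation above can be sketched in a few lines of numpy. This is a minimal sketch, not the paper's exact implementation: it assumes tanh for h, a sigmoid transform gate t, and the coupled-gate variant g(x) = 1 - t(x) used in the Highway Networks paper; all function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid, squashing gate pre-activations into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: y = g(x)*x + t(x)*h(x).

    Here g is coupled to the transform gate as g = 1 - t, so the layer
    interpolates between carrying x through unchanged and transforming it.
    """
    h = np.tanh(W_h @ x + b_h)   # nonlinear transform h(x)
    t = sigmoid(W_t @ x + b_t)   # transform gate t(x)
    g = 1.0 - t                  # carry gate g(x), coupled to t
    return g * x + t * h
```

With a strongly negative gate bias b_t, t(x) is near 0 and the layer simply copies x through; this is why very deep stacks of such layers remain trainable.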
This is the basic ingredient required to overcome the fundamental deep learning problem of vanishing or exploding gradients, which my very first student Sepp Hochreiter identified and analyzed in 1991, years before anybody else did [6].
Microsoft's Residual Net or ResNet [1] of December 2015 is a special case of our Highway Nets [4].
The CNN layers of ResNets [1] act like those of Highway Nets with g(x)=1 (a typical Highway Net initialisation) and t(x)=1, essentially like a feedforward LSTM [3] without gates.
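Fixing both gates at 1 in the highway formula recovers the residual mapping y = x + h(x). A minimal numpy sketch of this special case (names hypothetical; real ResNets use convolutions with ReLU and batch normalisation rather than a dense tanh layer):

```python
import numpy as np

def residual_layer(x, W_h, b_h):
    """ResNet-style layer y = x + h(x): the highway formula
    g(x)*x + t(x)*h(x) with the gates fixed at g(x) = 1 and t(x) = 1."""
    h = np.tanh(W_h @ x + b_h)   # transform h(x); a stand-in for the conv block
    return x + h                 # identity shortcut plus transform
```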
Microsoft Research dominated the ImageNet 2015 contest with a very deep ResNet of over 150 layers [1][2].
Contrary to certain claims (e.g., [1]),
the earlier Highway Nets perform roughly as well as ResNets on ImageNet [9].
Highway layers are also often used for natural language processing, where the simpler residual layers do
not work as well [9].
LSTM concepts keep invading CNN territory, e.g., [7a-f], also through GPU-friendly multi-dimensional LSTMs [8].
References
[1] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. TR arXiv:1512.03385, Dec 2015.
[2] ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015): Results.
[3] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. Based on TR FKI-207-95, TUM (1995). Led to much follow-up work, and is now heavily used by leading IT companies all over the world.
[4] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. TR arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS'2015.
[5] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
[6] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen (Investigations of dynamical neural networks). Diploma thesis, TU Munich, 1991. Advisor: J. Schmidhuber.
[7a] 2011: First superhuman CNNs
[7b] 2011: First human-competitive CNNs for handwriting
[7c] 2012: First CNN to win segmentation contest
[7d] 2012: First CNN to win contest on object discovery in large images
[7e] Deep Learning. Scholarpedia, 10(11):32832, 2015.
[7f] History of computer vision contests won by deep CNNs on GPUs (2017)
[8] M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, with Application to Fast Biomedical Volumetric Image Segmentation. NIPS 2015; arXiv:1506.07452.
[9] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arXiv:1612.07771 (2016). Also at ICLR 2017.
Can you spot the Fibonacci pattern in the graphics above?
