Microsoft Wins ImageNet 2015 through
Highway Net (or Feedforward LSTM) without Gates

Jürgen Schmidhuber

Note: this report was updated in 2025 to mark its 10th anniversary. See also: Who Invented Deep Residual Learning? Technical Report IDSIA-09-25, IDSIA, 2025 [4b].

Microsoft Research dominated the ImageNet 2015 contest with a very deep neural network of 150 layers [1]. Congrats to Kaiming He & Xiangyu Zhang & Shaoqing Ren & Jian Sun on the great results [2]!

Their Residual Net or ResNet of December 2015 [1] is an open-gated variant of our Highway Net of May 2015 [4], the first very deep feedforward neural network (FNN) with hundreds of layers, over 10 times deeper than previous FNNs. Highway nets are essentially unfolded feedforward versions [4b][4d] of our recurrent Long Short-Term Memory networks [3] with forget gates (or gated recurrent units) [5]. The earlier Highway Net [4] (a gated ResNet) performs roughly as well as a plain ResNet on ImageNet [4a].

Let g, t, h denote non-linear differentiable functions of real values. Each non-input layer of a Highway NN computes g(x)x + t(x)h(x), where x is the data from the previous layer (as in an unfolded LSTM [3] with forget gates [5] for recurrent networks). The crucial residual part is the term g(x)x: the Highway gates g(x) are typically initialised to 1.0 (like the forget gates of the 2000 LSTM [5]), to obtain plain residual connections with weight 1.0 (like in ResNets) that allow for very deep error propagation, as in the "constant error carrousels" of the 1997 LSTM [3][4b].
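For concreteness, one such layer can be sketched in NumPy. The coupled carry gate g(x) = 1 - t(x) and the tanh/sigmoid choices below are illustrative assumptions, not the only option; the key point is that a strongly negative transform-gate bias opens the carry gate (g(x) ≈ 1.0) at initialisation, so the layer starts out as a near-identity residual connection:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One Highway layer: y = g(x)*x + t(x)*h(x).

    Here we use the common coupled-gate choice g(x) = 1 - t(x).
    Initialising the transform-gate bias b_t to a large negative value
    makes t(x) ~ 0 and g(x) ~ 1, i.e. the layer begins training as a
    plain weight-1.0 residual (identity) connection.
    """
    h = np.tanh(x @ W_h + b_h)   # candidate transformation h(x)
    t = sigmoid(x @ W_t + b_t)   # transform gate t(x)
    g = 1.0 - t                  # carry gate g(x), coupled to t(x)
    return g * x + t * h

# With a strongly negative gate bias, the layer is near-identity:
d = 4
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W_h = rng.standard_normal((d, d)) * 0.1
W_t = rng.standard_normal((d, d)) * 0.1
y = highway_layer(x, W_h, np.zeros(d), W_t, np.full(d, -10.0))
assert np.allclose(y, x, atol=1e-2)
```

With the gates initially open, stacking hundreds of such layers still propagates the input (and, by the same token, the error signal) nearly unchanged.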

The residual connections in the 1997 LSTM [3], the initialized 2000 LSTM [5], the initialized Highway Net (May 2015) [4], and the ResNet (Dec 2015) [1] are the basic ingredient required to overcome the fundamental deep learning problem of vanishing or exploding gradients. The ResNet authors mention this problem [1], but do not mention my very first student Sepp Hochreiter (now a professor), who identified, analyzed, and solved it in 1991, years before anybody else did [6].
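The effect of those weight-1.0 residual connections on error propagation can be illustrated numerically: backpropagating through a deep stack of tanh layers, the gradient norm collapses without residual connections but survives with them. (A toy sketch; the depth, width, and weight scale below are arbitrary illustrative choices.)

```python
import numpy as np

def jacobian_product_norm(depth=50, d=16, residual=True, seed=1):
    """Norm of d(output)/d(input) through a deep stack of tanh layers.

    Each layer maps h -> tanh(h @ W), plus h itself if residual,
    i.e. a plain weight-1.0 residual connection as in the LSTM,
    initialized Highway Nets, and ResNets.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(d)
    J_total = np.eye(d)
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * 0.2 / np.sqrt(d)
        pre = h @ W
        # Jacobian of tanh(h @ W) w.r.t. h: diag(1 - tanh(pre)^2) @ W.T
        J = (1.0 - np.tanh(pre) ** 2)[:, None] * W.T
        if residual:
            J = J + np.eye(d)        # identity path of the residual connection
            h = np.tanh(pre) + h
        else:
            h = np.tanh(pre)
        J_total = J @ J_total
    return np.linalg.norm(J_total)

plain = jacobian_product_norm(residual=False)
skip = jacobian_product_norm(residual=True)
# Without residual connections the gradient vanishes by many orders
# of magnitude; the identity path keeps it from collapsing.
assert skip > 1e6 * plain
```

The identity term in each layer's Jacobian is exactly what the "constant error carrousel" provides in the recurrent case.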

Here is the timeline of the evolution of deep residual learning, taken from a separate report [4b]:

1991: Hochreiter's recurrent residual connections solve the vanishing gradient problem
1997 LSTM: plain recurrent residual connections (weight 1.0)
1999 LSTM: gated recurrent residual connections (gates initially open: 1.0)
2005: unfolding LSTM—from recurrent to feedforward residual NNs
May 2015: deep Highway Net—gated feedforward residual connections (initially 1.0)
Dec 2015: ResNet—like an open-gated Highway Net (or an unfolded 1997 LSTM)

The ResNet paper [1] calls the Highway Net [4] "concurrent," but it wasn't: the ResNet was published seven months later. It cites the earlier Highway Net, but does not make clear that ResNets are essentially open-gated Highway Nets (and Highway Nets gated ResNets), nor that the gates of the residual connections in Highway Nets are initially open anyway, so that Highway Nets start out with standard residual connections just like ResNets [4b]. A follow-up paper by the ResNet authors suffered from design flaws leading to incorrect conclusions about gated residual connections [4c].
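The open-gate relationship is easy to verify numerically: plugging constant gates g(x) = t(x) = 1 into the Highway formula g(x)x + t(x)h(x) yields exactly the plain residual update x + h(x) of a ResNet block. (The transform h below is an arbitrary stand-in; in a real network it is a learned layer.)

```python
import numpy as np

def highway(x, h, g, t):
    """Generic Highway layer update: y = g(x) * x + t(x) * h(x)."""
    return g(x) * x + t(x) * h(x)

def h(x):
    # A stand-in transform; any differentiable function works here.
    return np.tanh(x)

x = np.array([0.5, -1.0, 2.0])

# Open gates, g(x) = t(x) = 1 for all x ...
y_highway = highway(x, h, g=lambda v: 1.0, t=lambda v: 1.0)
# ... reduce the Highway update to a plain ResNet block x + h(x).
y_resnet = x + h(x)
assert np.allclose(y_highway, y_resnet)
```

Conversely, starting from the ResNet update and making the two constants input-dependent (and learnable) recovers the Highway Net.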

Apart from the quibbles above, I liked the paper [1]. LSTM concepts keep invading CNN territory [e.g., 7a-e], also through GPU-friendly multi-dimensional LSTMs [8].


References

[1] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Microsoft's ResNet paper refers to the Highway Net (May 2015) [4] as 'concurrent'. However, this is incorrect: ResNet was published seven months later. Although the ResNet paper acknowledges the problem of vanishing/exploding gradients, it fails to recognise that S. Hochreiter first identified the issue in 1991 and developed the residual connection solution (weight 1.0) [6][4b]. The ResNet paper cites the earlier Highway Net in a way that does not make it clear that ResNets are essentially open-gated Highway Nets and that Highway Nets are gated ResNets. It also fails to mention that the gates of residual connections in Highway Nets are initially open (1.0), meaning that Highway Nets start out with standard residual connections, to achieve deep residual learning (Highway Nets were ten times deeper than previous gradient-based feedforward nets). The residual parts of a Highway Net are like those of an unfolded 2000 LSTM [5], while the residual parts of a ResNet are like those of an unfolded 1997 LSTM [3][4b]. A follow-up paper by the ResNet authors was flawed in its design, leading to incorrect conclusions about gated residual connections [4c]. See also [4b]: who invented deep residual learning?

[2] ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015): Results

[3] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. Based on TR FKI-207-95, TUM (1995). Led to a lot of follow-up work, and is now heavily used by leading IT companies all over the world.

[4] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (Training Very Deep Networks; July 2015). Also at NeurIPS 2015. The first working very deep gradient-based feedforward neural nets (FNNs) with hundreds of layers, ten times deeper than previous gradient-based FNNs. Let g, t, h denote non-linear differentiable functions. Each non-input layer of a Highway Net computes g(x)x + t(x)h(x), where x is the data from the previous layer. The gates g(x) are typically initialised to 1.0, to obtain plain residual connections (weight 1.0) [6][4b]. This allows for very deep error propagation, which is what makes Highway Nets so deep. The later ResNet (Dec 2015) [1] adopted this principle. It is like a Highway Net variant whose gates are always open: g(x)=t(x)=const=1. That is, Highway Nets are gated ResNets: set the gates to 1.0 to obtain a ResNet. Highway Nets perform roughly as well as ResNets on ImageNet [4a]. The residual parts of a Highway Net are like those of an unfolded 2000 LSTM [5], while the residual parts of a ResNet are like those of an unfolded 1997 LSTM [3][4b].

[4a] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arXiv:1612.07771 (2016). Also at ICLR 2017.

[4b] J. Schmidhuber (AI Blog, 2025). Who Invented Deep Residual Learning? Technical Report IDSIA-09-25, IDSIA, 2025. Preprint arXiv:2509.24732.

[4c] R. K. Srivastava (January 2025). Weighted Skip Connections are Not Harmful for Deep Nets. Shows that a follow-up paper by the authors of [1] suffered from design flaws leading to incorrect conclusions about gated residual connections.

[4d] J. Schmidhuber (AI Blog, 2015, updated 2025 for 10-year anniversary). Overview of Highway Networks: First working really deep feedforward neural networks with hundreds of layers.

[5] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.

[6] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich, 1991. Advisor: J. Schmidhuber.

[7a] 2011: First superhuman CNNs
[7b] 2011: First human-competitive CNNs for handwriting
[7c] 2012: First CNN to win segmentation contest
[7d] 2012: First CNN to win contest on object discovery in large images
[7e] Deep Learning. Scholarpedia, 10(11):32832, 2015

[8] M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, with Application to Fast Biomedical Volumetric Image Segmentation. NIPS 2015; arXiv:1506.07452.

