Very Deep Learning with Highway Networks

How can we train deep neural networks with very long credit assignment paths from inputs to outputs to solve complex AI problems? Many solutions to this problem have been proposed over the years, but most do not work well for very deep networks with arbitrary non-linear (and possibly recurrent) transformations between layers.

This project presents a different take on the problem. We simply redesign neural networks in a way that makes them easier to optimize even for very large depths.

Our Highway Networks are based on Long Short-Term Memory (LSTM) recurrent networks and allow training of deep, efficient networks (even with hundreds of layers) with conventional gradient-based methods. Even when large depths are not required, highway layers can be used instead of traditional neural layers to allow the network to adaptively copy or transform representations.
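The copy-or-transform behavior comes from gating: a highway layer computes y = H(x)·T(x) + x·(1 − T(x)), where H is an ordinary non-linear transform and T is a learned "transform gate". A minimal NumPy sketch (parameter names W_H, b_H, W_T, b_T are illustrative, not from any released code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x)).

    H is a plain non-linear transform; T is the transform gate.
    Where T is close to 0 the layer copies its input unchanged,
    where T is close to 1 it applies the full transform H.
    """
    H = np.tanh(x @ W_H + b_H)        # candidate transform
    T = sigmoid(x @ W_T + b_T)        # transform gate in (0, 1)
    return H * T + x * (1.0 - T)
```

Because the copy path is the identity, gradients can flow through many stacked highway layers without vanishing, which is what makes very large depths trainable.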

NEW: In recent work we have used highway layers to easily train multi-layer state transitions in recurrent neural networks. The resulting Recurrent Highway Networks substantially outperform existing recurrent architectures and open up a new degree of freedom in recurrent model design.
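In a Recurrent Highway Network, the state transition at each time step is itself a stack of highway layers, with the external input entering only at the first layer of the stack. A hedged sketch of one time step (parameter names and layout are illustrative assumptions, not the released implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rhn_step(x, s_prev, params, depth=3):
    """One time step of a Recurrent Highway Network cell.

    The recurrent transition is a stack of `depth` highway layers;
    the input x is fed only into the first layer of the stack.
    Each params[l] holds (Wx_H, Wx_T, R_H, R_T, b_H, b_T).
    """
    s = s_prev
    for l in range(depth):
        Wx_H, Wx_T, R_H, R_T, b_H, b_T = params[l]
        inp_H = x @ Wx_H if l == 0 else 0.0   # input only at layer 0
        inp_T = x @ Wx_T if l == 0 else 0.0
        h = np.tanh(inp_H + s @ R_H + b_H)    # candidate state update
        t = sigmoid(inp_T + s @ R_T + b_T)    # transform gate
        s = h * t + s * (1.0 - t)             # highway state transition
    return s
```

The recurrence depth `depth` is the new degree of freedom: it controls how many non-linear layers the state passes through per time step, independently of the number of time steps.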


Highway and Residual Networks learn Unrolled Iterative Estimation
K. Greff, R. K. Srivastava, and J. Schmidhuber
ICLR 2017 Conference Track arXiv:1612.07771

Recurrent Highway Networks
J. G. Zilly, R. K. Srivastava, J. Koutnik and J. Schmidhuber
International Conference on Machine Learning (ICML 2017)
arXiv Preprint arXiv:1607.03474
Code for reproducing experimental results.

Training Very Deep Networks
R. K. Srivastava, K. Greff and J. Schmidhuber
Neural Information Processing Systems (NIPS 2015 Spotlight) arXiv:1507.06228
Download logs for all 800 optimization runs here, with instructions.

Highway Networks
R. K. Srivastava, K. Greff and J. Schmidhuber
Deep Learning Workshop (ICML 2015). arXiv:1505.00387 poster


Frequently Asked Questions

Q: How do I set the bias for the transform gates when initializing a highway network?

A: You can think of the initial bias as a prior over the behavior of your network at initialization. In general this is a hyper-parameter which will depend on the given problem and network architecture. However, a suggestion which has worked for several problems: initialize the transform gate biases to a negative value (e.g. between -1 and -3), so that each layer initially copies its inputs and learns to transform them where needed; deeper networks generally benefit from more negative values.
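A minimal sketch of such an initialization, assuming a plain NumPy highway layer with weights W_H, W_T and biases b_H, b_T (all names illustrative):

```python
import numpy as np

def init_highway_params(dim, rng, gate_bias=-2.0):
    """Initialize one highway layer's parameters.

    gate_bias is the transform-gate bias; a negative value (roughly
    -1 to -3) biases the layer toward copying its input early in
    training, which helps gradients flow through very deep stacks.
    """
    scale = 1.0 / np.sqrt(dim)
    W_H = rng.uniform(-scale, scale, size=(dim, dim))
    W_T = rng.uniform(-scale, scale, size=(dim, dim))
    b_H = np.zeros(dim)
    b_T = np.full(dim, gate_bias)   # negative bias: gate starts mostly closed
    return W_H, b_H, W_T, b_T
```

With b_T = -2, the transform gate starts near sigmoid(-2) ≈ 0.12, so each layer passes roughly 88% of its input through unchanged at initialization.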

Q: Is the highway gating mechanism related to how information flow is regulated in the brain?

A: Information processing in the brain is not yet well understood. However, the idea that the brain uses similar gating mechanisms to regulate information flow has been seriously considered by neuroscientists.

For example, see: Gisiger, T., & Boukadoum, M. (2011). Mechanisms Gating the Flow of Information in the Cortex: What They Might Look Like and What Their Uses May Be. Frontiers in Computational Neuroscience, 5, 1. link

Some Publications which use Highway Networks

  1. Kim, Yoon, et al. "Character-Aware Neural Language Models." arXiv preprint arXiv:1508.06615 (2015).
  2. Zhang et al. "Highway Long Short-Term Memory RNNs for Distant Speech Recognition." arXiv preprint arXiv:1510.08983 (2015).
  3. Bowman, Samuel R., et al. "Generating sentences from a continuous space." arXiv preprint arXiv:1511.06349 (2015).
  4. Jozefowicz, Rafal, et al. "Exploring the limits of language modeling." arXiv preprint arXiv:1602.02410 (2016).
  5. Schmaltz, Allen, et al. "Sentence-level grammatical error identification as sequence-to-sequence correction." arXiv preprint arXiv:1604.04677 (2016).
  6. Lu, Liang. "Sequence Training and Adaptation of Highway Deep Neural Networks." arXiv preprint arXiv:1607.01963 (2016).
  7. Lu, Liang, Michelle Guo, and Steve Renals. "Knowledge Distillation for Small-footprint Highway Networks." arXiv preprint arXiv:1608.00892 (2016).
  8. Vylomova, Ekaterina, et al. "Word Representation Models for Morphologically Rich Languages in Neural Machine Translation." arXiv preprint arXiv:1606.04217 (2016).