Composing Music with LSTM Recurrent Networks - Blues Improvisation

Note: This page was created by Schmidhuber's former postdoc Doug Eck (now assistant professor at Univ. Montreal), on the LSTM long time lag project.

Here are some multimedia files related to the LSTM music composition project. The files are in MP3 (high resolution, 128 kbps, and low resolution, 32 kbps) and MIDI formats. A helpful reference document for understanding the compositions is IDSIA Technical Report IDSIA-07-02, A First Look at Music Composition using LSTM Recurrent Neural Networks [postscript or pdf].

These compositions were made by an LSTM recurrent neural network. After learning by example from the training set (next-step prediction), the network generated these musical examples. When composing, the network worked with no guidance whatsoever: it had to reproduce both the chords and the melodies on its own. For the simple training set used here, the chord structure was fixed, so the network did not need to generalize over chords. Even in this simple case, however, neither a feed-forward network nor, very likely, a traditional recurrent neural network (RNN) can learn these chords. See the technical report for more.
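The unguided composition step can be sketched as a free-running loop: after a seed, the network's thresholded prediction at each time step is fed back in as the next input. The sketch below assumes a binary chord-plus-melody time-slice representation; `predict_step` is a hypothetical stand-in for one forward pass of the trained LSTM (which would also carry its internal cell state between calls), not the actual model from the report.

```python
def generate(predict_step, seed, n_steps):
    """Free-running generation: after the seed, the network's own
    prediction at each step becomes its input at the next step, so
    it composes with no external guidance.

    predict_step: hypothetical callable mapping the current binary
    chord+melody slice to predicted activations for the next slice.
    """
    slices = [list(s) for s in seed]
    for _ in range(n_steps):
        acts = predict_step(slices[-1])
        # Threshold the activations back into a binary piano-roll
        # slice before feeding the result in as the next input.
        slices.append([1 if a >= 0.5 else 0 for a in acts])
    return slices
```

With a real trained network, errors early in the loop compound, which is consistent with Composition 1 below: once the chords stabilize, the fed-back input matches the training distribution and the melody follows.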

This work marks, we believe, a first step towards a neural network music composer that can learn and use global musical structure. Previous attempts at this task showed that RNNs can capture local structure in music but fail to capture the long-term structure that defines a musical form. These compositions show that LSTM can capture and reproduce long-term musical structure and can generate new (sometimes somewhat pleasing) examples of a form. In short, though the chord structure was fixed and the training set somewhat dull, these results are promising enough to warrant further research.

A note on terminology: We are using the terms composition and improvisation loosely. It is probably more accurate to describe the behavior of LSTM as improvisation, because it is inventing new melodies on top of a set form; however, the end goal is the creation of new melodies and new forms, hence the somewhat optimistic use of the term composition.

Background Information

Chords: LSTM was trained using a form of blues common in jazz bebop improvisation. The form is 12 bars long and contains the following chords:

You can listen to these bebop jazz blues chords [MP3 hi-res (493Kb), MP3 lo-res (123Kb), MIDI (2Kb)]. Note, in these examples the letter name of the chord (the tonic) is used to form a melody line.

Notes: The possible chord notes were limited to the octave below middle C. The possible melody notes were limited to the octave above (and including) middle C.
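These pitch ranges can be made concrete with MIDI note numbers (middle C = 60). The encoding below is an illustrative assumption, not necessarily the exact representation from the technical report: one unit per semitone, with chord notes drawn from the octave below middle C and melody notes from the octave starting at middle C, giving a binary input/target vector.

```python
MIDDLE_C = 60  # MIDI note number for middle C

# Assumed ranges: chord notes in the octave below middle C
# (MIDI 48-59), melody notes in the octave above and including
# middle C (MIDI 60-71).
CHORD_LO, MELODY_LO = MIDDLE_C - 12, MIDDLE_C

def encode_slice(chord_notes, melody_note=None):
    """Encode one time slice as a 24-unit binary vector:
    units 0-11 for chord notes, units 12-23 for the melody note."""
    vec = [0] * 24
    for n in chord_notes:
        assert CHORD_LO <= n < MIDDLE_C, "chord notes: octave below middle C"
        vec[n - CHORD_LO] = 1
    if melody_note is not None:
        assert MELODY_LO <= melody_note < MELODY_LO + 12
        vec[12 + (melody_note - MELODY_LO)] = 1
    return vec
```

Keeping chords and melody in separate, non-overlapping octaves means a single vector can represent both without ambiguity about which voice a note belongs to.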

The improvisations were based around the pentatonic scale:

You can listen to the pentatonic scale [MP3 hi-res (94Kb), MP3 lo-res (24Kb), MIDI (1Kb)]. You can also hear how random quarter notes chosen from the pentatonic scale sound [MP3 hi-res (5.5Mb), MP3 lo-res (1.4Mb), MIDI (24Kb)]. In this example the random notes are played along with the chords described above.

Training Set [MP3 hi-res (5.5Mb), MP3 lo-res (1.4Mb), MIDI (21Kb)]: The training sets were composed by choosing randomly from melodic segments that fit the blues form. Those segments were worked out by me on the piano. Be warned: this is a boring training set! The goal of these experiments was to see whether LSTM could learn a fixed chord structure while in parallel learning elements of a varying melody structure, so it was easier to stick with a basic melody. Note that every 12-bar segment is unique; however, because only one or two bars are changed at a time, you may have to listen for a while to hear the differences. We are currently working on a much more interesting set of training melodies and chords.
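The construction described above can be sketched as follows. All names here are hypothetical illustrations of the idea, not the actual generation script: each 12-bar chorus starts from a fixed bar-by-bar melody and replaces only one or two bars with alternative segments that fit the same spot in the blues form, which is why successive choruses sound so similar.

```python
import random

def make_training_chorus(base_segments, variants, n_changes=(1, 2), rng=random):
    """Build one 12-bar chorus by swapping a few bars of a base melody.

    base_segments: list of 12 melodic segments, one per bar.
    variants: dict mapping bar index -> list of alternative segments
              that fit the blues form at that bar.
    """
    chorus = list(base_segments)
    bars = rng.sample(sorted(variants), rng.choice(list(n_changes)))
    for b in bars:
        chorus[b] = rng.choice(variants[b])
    return chorus
```

Concatenating many such choruses yields a training set where every 12-bar segment is unique but the chord structure, and most of the melody, repeats exactly.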

Some Compositions

These examples are quite long. Rather than listening to them in their entirety, it's probably better to use your audio player to skip to the playing times mentioned in the descriptions.

Composition 1 [MP3 hi-res (5.6Mb), MP3 lo-res (1.4Mb), MIDI (22Kb)]: This first example shows initial failure followed by stabilization. At first, the network does not correctly reproduce the chords, resulting in a bad melody line as well. However, at around 25 seconds the chords fall into place, and the melody follows. This isn't a particularly good example of composition, but it is an effective example of how the chord structure constrains the melody line: once the chords are reproduced correctly by the network, the melody follows. Compare, for example, the first 25 seconds to the 12 bars starting at around 2:04 and around 4:54.

Composition 2 [MP3 hi-res (5.6Mb), MP3 lo-res (1.4Mb), MIDI (22Kb)]: This second example shows how the network can drift from fairly close reproduction of training set melodies to freer improvisation. Listen from the beginning and notice how at around 0:14 the network begins reproducing the melody with some variation. At around 0:50 the network drifts somewhat from the melody and recovers at around 1:00. Then at 1:13 it begins to alternate between constrained and freer improvisation. Notice also the passages starting at 3:50.

Composition 3 [MP3 hi-res (5.6Mb), MP3 lo-res (1.4Mb), MIDI (22Kb)]: This third example is similar to Composition 2. It contains some nice sections (presuming, of course, you think any of this is nice). Starting at 0:28 and continuing through 1:12 is freer improvisation. At 1:12 is an example of the network repeating a motif not found in the training set.

Composition 4 [MP3 hi-res (5.6Mb), MP3 lo-res (1.4Mb), MIDI (22Kb)]: This fourth example comes from a network trained longer than those used for examples 1 through 3. Thanks to the longer training time, the network does a better job of reproducing the training set. It still improvises freely, but with less departure from the target melodies.

Go to Doug's Homepage at Univ. Montreal