Recurrent nets discover new motifs for protein classification

Sepp Hochreiter 


Abstract:


We apply recurrent nets to protein classification. We employ the
Long Short-Term Memory (LSTM) recurrent net because LSTM can store
the occurrence of certain amino acid patterns while scanning the
sequence, whereas other architectures cannot store patterns over
extended periods. Through its gating mechanism, the LSTM architecture
detects those parts of the amino acid sequence that have dependencies
with the class label; that is, LSTM extracts motifs indicating the
protein class. Compared with traditional alignment methods on the
PROSITE protein database, LSTM yields a lower misclassification rate
and finds new motifs. Superimposing the LSTM-extracted motifs yields
a motif equal to the one found by alignment methods. Thus, LSTM
generalizes alignment methods by identifying dependencies within
motifs and, additionally, can correlate motifs that are far apart
from each other in the sequence.
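
As a rough illustration of the gating mechanism described above, the
following minimal sketch scans an amino acid sequence with a single
LSTM layer and records how far the input gate opens at each residue.
It is not the architecture or training setup used in this work: it
assumes the commonly used LSTM formulation with input, forget, and
output gates, a one-hot encoding of the 20 standard amino acids, and
random untrained weights. All names (one_hot, init_params,
scan_sequence) and the example sequence are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(residue):
    """Encode one amino acid letter as a 20-dimensional indicator vector."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(n_hidden, seed=0):
    """Random, untrained weights (assumption: illustration only)."""
    rng = random.Random(seed)
    # Each unit sees the current residue, the previous hidden state, and a bias.
    n_in = len(AMINO_ACIDS) + n_hidden + 1
    return {
        name: [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_hidden)]
        for name in ("input_gate", "forget_gate", "output_gate", "candidate")
    }


def affine(weights, vec):
    """Matrix-vector product with the weight rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def scan_sequence(sequence, params, n_hidden):
    """Scan a protein left to right; return final states and a gate trace."""
    h = [0.0] * n_hidden  # hidden state (cell output)
    c = [0.0] * n_hidden  # cell state (the long-term memory)
    input_gate_trace = []
    for residue in sequence:
        z = one_hot(residue) + h + [1.0]
        i = [sigmoid(a) for a in affine(params["input_gate"], z)]
        f = [sigmoid(a) for a in affine(params["forget_gate"], z)]
        o = [sigmoid(a) for a in affine(params["output_gate"], z)]
        g = [math.tanh(a) for a in affine(params["candidate"], z)]
        # The input gate decides how much of the current residue is stored;
        # the forget gate decides how much of the old memory is kept.
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        # Mean input-gate activation at this sequence position.
        input_gate_trace.append(sum(i) / n_hidden)
    return h, c, input_gate_trace


params = init_params(n_hidden=4)
h, c, trace = scan_sequence("MKTAYIAKQR", params, n_hidden=4)
```

With trained weights, positions where the gate trace is high would be
candidates for motif residues, and the cell state would carry that
information forward to later positions. With the random weights used
here the trace carries no biological meaning.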