Lococodes are justified not only by reference to previous ideas on what constitutes a ``desirable'' code. Next we will show that they can also help to achieve superior generalization performance on a standard supervised learning benchmark problem. This section's focus on speech data also illustrates LOCOCODE's versatility: its applicability is not limited to visual data.
Task. We recognize vowels, using vowel data from Scott Fahlman's CMU benchmark collection (see also Robinson 1989). There are 11 vowels and 15 speakers. Each speaker spoke each vowel 6 times. Data from the first 8 speakers are used for training; the remaining data are used for testing. This yields 528 frames for training and 462 frames for testing. Each frame consists of 10 input components obtained by low-pass filtering at 4.7kHz, digitized to 12 bits with a 10 kHz sampling rate. A twelfth-order linear predictive analysis was carried out on six 512-sample Hamming-windowed segments from the steady part of the vowel. The reflection coefficients were used to calculate 10 log area parameters, providing the 10-dimensional input space.
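The conversion from reflection coefficients to log area parameters can be sketched as follows. This is a minimal illustration using the standard lossless-tube relation between a reflection coefficient $k_i$ and adjacent tube areas, $A_{i+1}/A_i = (1 - k_i)/(1 + k_i)$; the exact sign and ordering convention used in the benchmark's preprocessing is an assumption here, and the coefficient values below are made up.

```python
import numpy as np

def log_area_ratios(reflection_coeffs):
    """Convert LPC reflection coefficients k_i to log area parameters.

    Uses g_i = log((1 - k_i) / (1 + k_i)), the standard lossless-tube
    convention; the benchmark's exact sign convention is an assumption.
    """
    k = np.asarray(reflection_coeffs, dtype=float)
    if np.any(np.abs(k) >= 1.0):
        raise ValueError("reflection coefficients must lie in (-1, 1)")
    return np.log((1.0 - k) / (1.0 + k))

# Ten coefficients (made-up values) yield a 10-dimensional input vector
# as in the benchmark data described above.
g = log_area_ratios([0.1, -0.2, 0.05, 0.3, -0.1, 0.0, 0.15, -0.25, 0.2, -0.05])
```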
Coding. The training data are coded by an FMS AA with architecture 10-30-10. The input components are linearly scaled to [-1,1]. The AA is trained by pattern presentation; its weights are then frozen.
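The linear scaling of the input components into [-1,1] can be sketched as below. This is an illustrative per-component min/max scaling fitted on the training set; the paper does not specify the scaling procedure in more detail, so treat the details (per-component extrema from the training data, random placeholder data) as assumptions.

```python
import numpy as np

def fit_scaler(train):
    """Per-component minimum and maximum, taken from the training data."""
    return train.min(axis=0), train.max(axis=0)

def scale(x, lo, hi):
    """Linearly map each input component into [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

rng = np.random.default_rng(0)
train = rng.normal(size=(528, 10))   # placeholder for the 528 training frames
lo, hi = fit_scaler(train)
scaled = scale(train, lo, hi)
```

Test frames would be scaled with the same `lo` and `hi` fitted on the training data, so test components may slightly exceed [-1,1].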
Classification. From now on, the vowel codes across all nonconstant HUs are used as inputs for a conventional supervised BP classifier, which is trained to recognize the vowels from the code. The classifier's architecture is $(30-p)$-11-11, where $p$ is the number of pruned HUs in the AA. The hidden and output units are sigmoid units, each receiving an additional bias input. The classifier is trained with further pattern presentations.
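A forward pass of such a classifier can be sketched as follows. This is a minimal illustration under stated assumptions: a standard logistic sigmoid, small random weights, and $p = 7$ pruned units are all placeholders, not the paper's actual initialization or training procedure (the BP weight updates are omitted).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(code, W1, b1, W2, b2):
    """One forward pass of a (30-p)-11-11 sigmoid classifier.

    `code` holds the activations of the 30 - p nonconstant AA hidden
    units; both layers use sigmoid units with an additional bias input.
    """
    h = sigmoid(code @ W1 + b1)    # 11 hidden units
    return sigmoid(h @ W2 + b2)    # 11 output units, one per vowel

rng = np.random.default_rng(1)
p = 7                              # assumed number of pruned AA units
n_in = 30 - p
W1 = rng.normal(scale=0.1, size=(n_in, 11)); b1 = np.zeros(11)
W2 = rng.normal(scale=0.1, size=(11, 11));   b2 = np.zeros(11)

probs = forward(rng.normal(size=n_in), W1, b1, W2, b2)
vowel = int(np.argmax(probs))      # predicted vowel index
```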
Parameters. AA net: learning rate 0.02. Backprop classifier: learning rate 0.002.
Overfitting. We confirm Robinson's results: the classifier tends to overfit when trained by simple BP -- during learning, the test error rate first decreases and then increases again.
Comparison. We compare: (1) Various neural nets (see Table 1). (2) Nearest neighbor: classifies an item as belonging to the class of the closest example in the training set (using Euclidean distance). (3) LDA: linear discriminant analysis. (4) Softmax: observation assigned to the class with the best fit value. (5) QDA: quadratic discriminant analysis (observations are classified as belonging to the class with the closest centroid, using Mahalanobis distance based on the class-specific covariance matrix). (6) CART: classification and regression tree (coordinate splits and default input parameter values are used). (7) FDA/BRUTO: flexible discriminant analysis using additive models with adaptive selection of terms and spline smoothing parameters. BRUTO provides a set of basis functions for better class separation. (8) Softmax/BRUTO: best fit value for classification using BRUTO. (9) FDA/MARS: FDA using multivariate adaptive regression splines. MARS builds a basis expansion for better class separation. (10) Softmax/MARS: best fit value for classification using MARS. (11) LOCOCODE/Backprop: ``unsupervised'' codes generated by LOCOCODE with FMS, fed into a conventional, overfitting BP classifier.
Results. See Table 3. FMS generates 3 different lococodes. Each is fed into 10 BP classifiers with different weight initializations: the table entry for ``LOCOCODE/Backprop'' represents the mean of 30 trials. The results for neural nets and nearest neighbor are taken from Robinson (1989). The other results (except for LOCOCODE's) are taken from Hastie et al. (1993). Our method led to excellent generalization results. The error rates after BP learning vary between 39% and 45%.
Backprop fed with LOCOCODE code sometimes goes down to a 38% error rate, but due to overfitting the error rate increases again (of course, test set performance may not influence the training procedure). Given that BP by itself is a very naive approach, it seems quite surprising that excellent generalization performance can be obtained simply by feeding BP with non-goal-specific lococodes.
Typical feature detectors. The number of pruned HUs (those with constant activation) varies between 5 and 10; 2 to 5 HUs become binary, and 4 to 7 become trinary. In all codes we observed that certain HUs apparently become feature detectors for speaker identification. Another HU's activation is near 1.0 for the words ``heed'' and ``hid'' (``i'' sounds). Yet another HU's activation is high for the words ``hod'', ``hoard'', ``hood'' and ``who'd'' (``o'' words) and low but nonzero for ``hard'' and ``heard''. LOCOCODE supports feature detection.
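Classifying a hidden unit as constant, binary, or trinary amounts to counting its distinct activation levels over the training set. The sketch below merges activations that are closer than a tolerance into one level; the tolerance value and the merging rule are illustrative assumptions, not the paper's procedure, and the activation lists are made up.

```python
import numpy as np

def unit_type(activations, tol=0.05):
    """Rough classification of a hidden unit by its distinct activation levels.

    Sorted activations whose neighbors differ by more than `tol` are
    counted as separate levels; 1, 2, or 3 levels give a constant,
    binary, or trinary unit. The tolerance is an illustrative assumption.
    """
    a = np.sort(np.asarray(activations, dtype=float))
    levels = 1 + int(np.sum(np.diff(a) > tol))
    return {1: "constant", 2: "binary", 3: "trinary"}.get(levels, "real-valued")

constant_unit = unit_type([0.50, 0.50, 0.50, 0.50])
binary_unit   = unit_type([0.01, 0.02, 0.98, 0.99, 0.01])
trinary_unit  = unit_type([0.00, 0.00, 0.50, 0.50, 1.00])
```

A constant unit found this way corresponds to a pruned HU; the remaining nonconstant units form the code fed to the classifier.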
Why no sparse code? The real-valued input components cannot be described precisely by the activations of the few feature detectors generated by LOCOCODE. Additional real-valued HUs are necessary for representing the missing information.
Better results with additional information. Hastie et al. also obtained additional, even slightly better results with an FDA/MARS variant: down to a 39% average error rate. It should be mentioned, however, that their data were subject to goal-directed preprocessing with splines, such that there were many clearly defined classes. Furthermore, to determine the input dimension, Hastie et al. used a special kind of generalized cross-validation error, in which one constant was obtained by unspecified ``simulation studies''. Hastie and Tibshirani (1996) also obtained an average error rate of 38% with discriminant adaptive nearest-neighbor classification. About the same error rate was obtained by Flake (1998) with RBF networks and hybrid architectures. Also, recent experiments (mostly conducted while this paper was under review) showed that even better results can be obtained by using additional context information to improve classification performance, e.g., Turney (1993), Herrmann (1997), and Tenenbaum and Freeman (1997); for an overview see Schraudolph (1998). It will be interesting to combine these methods with LOCOCODE.
Conclusion. Although we made no attempt at preventing classifier overfitting, we achieved excellent results. From this we conclude that the lococodes fed into the classifier already conveyed the ``essential'', almost noise-free information necessary for excellent classification. We are led to believe that LOCOCODE is a promising method for data preprocessing.