jncc20
Class ArffParser

java.lang.Object
  extended by jncc20.ArffParser

 class ArffParser
extends java.lang.Object

Implementation of ARFF parser, used to parse training and testing files. Remarks:

1) Variables must be either nominal or numerical. Unlike Weka, the parser does *not* handle variables of type String or Date.

2) It assumes that the class of the problem is named "class" in the Arff file and is declared as the last variable in the header.
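
The first remark can be illustrated with a minimal sketch of how a single "@attribute" declaration might be classified as numerical or nominal. This is not the ArffParser code: the class name AttributeLineSketch and the method parseType are hypothetical, and the sketch assumes unquoted attribute names without spaces and that the ARFF keywords numeric, real and integer all count as numerical.

```java
import java.util.Locale;

class AttributeLineSketch {
    /**
     * Classifies one "@attribute" declaration (hypothetical helper).
     * Returns null for a numerical attribute, or the declared category
     * names for a nominal one. String, Date and anything else is rejected.
     */
    static String[] parseType(String declaration) {
        // Split into keyword, attribute name, and the rest (the type).
        String[] parts = declaration.trim().split("\\s+", 3);
        if (parts.length < 3 || !parts[0].equalsIgnoreCase("@attribute")) {
            throw new IllegalArgumentException("Not an attribute declaration: " + declaration);
        }
        String type = parts[2].trim();
        // Nominal attributes list their categories between braces.
        if (type.startsWith("{") && type.endsWith("}")) {
            String[] categories = type.substring(1, type.length() - 1).split(",");
            for (int i = 0; i < categories.length; i++) {
                categories[i] = categories[i].trim();
            }
            return categories;
        }
        // Assumption: these three keywords denote numerical attributes.
        String lower = type.toLowerCase(Locale.ROOT);
        if (lower.equals("numeric") || lower.equals("real") || lower.equals("integer")) {
            return null;
        }
        throw new IllegalArgumentException("Unsupported attribute type: " + type);
    }
}
```

For example, parseType("@attribute class {yes, no}") yields the two category names, while a declaration of type string is rejected with an exception.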


Field Summary
private  java.lang.String arffFileAddress
          Absolute Path of the main Arff file
private  java.lang.String arffTestingFileAddress
          Name of the testing Arff file
private  java.util.ArrayList<java.lang.String[]> categoryNames
          Matrix of Strings with rows of different lengths: each row corresponds to a different feature, and different features can have different numbers of categories.
private  java.util.ArrayList<java.lang.String> classNames
          Names of the output classes.
private  java.lang.String datasetName
          Dataset Name as read from the field "@relation" in the Arff file
private  double[][] discretizationIntervals
          Matrix with rows of different length; stores the bin ranges for numerical features
private  java.util.ArrayList<java.lang.String> featureNames
          Names of input features
private  java.util.ArrayList<java.lang.String> nonMarFeatureNamesTesting
          Names of NonMar features in testing
private  java.util.ArrayList<java.lang.String> nonMarFeatureNamesTraining
          Names of NonMar features in training
private  java.util.ArrayList<java.lang.Integer> notUsedFeatures
          Indexes of features that are not used (because discretized into a single bin)
private  java.util.ArrayList<java.lang.Boolean> numFlags
          Array of flags indicating whether each feature is numerical (true) or not (false)
private  java.util.ArrayList<double[]> RawDataset
          Copy of the data read from the Arff file, with missing data marked as -9999 and category names substituted by the corresponding indexes.
private  java.util.ArrayList<java.lang.String[]> rawTestingSet
          Raw testing set exactly as read from file.
private  java.util.ArrayList<java.lang.Integer>[] rowsClassIdx
          Indexes of the rows, in RawDataset, which have the same output class.
private  java.util.ArrayList<int[]> testingSet
          Testing set, as accessed by the classifier: numerical variables discretized, category names substituted by their indexes, missing data marked as -9999, classes substituted with indexes.
private  java.util.ArrayList<java.lang.Integer> usedFeatures
          Indexes of used features (i.e., categorical features and numerical features discretized into several bins)
private  java.lang.String validationMethod
          Set either to "CV" or to the name of the testing Arff file
private  java.lang.String workingPath
          Path where the files for the given experiment (Arff files, NonMar.txt) reside, and where the output will be saved
 
Constructor Summary
ArffParser(java.lang.String UserSuppliedWorkingPath, java.lang.String UserSuppliedArffName, java.lang.String UserSuppliedValidationMethod)
          Initializes data members; then scans the Arff file, checking the formal correctness of variable declarations and the coherence of the data with the declarations; stores the information and the data loaded from file.
 
Method Summary
(package private)  java.util.ArrayList<java.lang.String[]> getCategoryNames()
           
(package private)  java.util.ArrayList<java.lang.String> getClassNames()
           
(package private)  java.lang.String getDatasetName()
           
private  int getDiscretizationIdx(java.lang.Double currentValue, int FeatureIdx)
          Return the bin in which a numerical value of a given feature falls.
(package private)  java.util.ArrayList<java.lang.String> getFeatureNames()
           
(package private)  java.util.ArrayList<java.lang.String> getNonMarFeatureNamesTesting()
           
(package private)  java.util.ArrayList<java.lang.String> getNonMarFeatureNamesTraining()
           
(package private)  java.util.ArrayList<java.lang.Boolean> getNumFlags()
           
(package private)  java.util.ArrayList<double[]> getRawDataset()
           
(package private)  java.util.ArrayList<java.lang.String[]> getRawTestingSet()
           
(package private)  java.util.ArrayList<java.lang.Integer>[] getRowsClassIdx()
           
(package private)  java.util.ArrayList<int[]> getTestingSet()
           
private  void parseArffFile()
          Scans the main Arff file.
(package private)  void parseTestingArffFile(boolean UnknownClasses)
          Parses the testing file, checking that all declarations are coherent with those already loaded from the training Arff file; if the classes are unknown, it reads only the instances, without looking for the classes.
private  void readNonMar()
          Reads the file NonMar.txt, containing the list of nonMar variables; if no file is found, all variables are assumed to be MAR.
(package private)  void setArffTestingFileAddress(java.lang.String suppliedArffTestingFileAddress)
           
(package private)  void setDiscretizationIntervals(double[][] suppliedDiscretizationIntervals)
           
(package private)  void setNotUsedFeatures(java.util.ArrayList<java.lang.Integer> suppliedNotUsedFeatures)
           
(package private)  void setUsedFeatures(java.util.ArrayList<java.lang.Integer> suppliedUsedFeatures)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

arffFileAddress

private java.lang.String arffFileAddress
Absolute Path of the main Arff file


arffTestingFileAddress

private java.lang.String arffTestingFileAddress
Name of the testing Arff file


categoryNames

private java.util.ArrayList<java.lang.String[]> categoryNames
Matrix of Strings with rows of different lengths: each row corresponds to a different feature, and different features can have different numbers of categories.


classNames

private java.util.ArrayList<java.lang.String> classNames
Names of the output classes.


datasetName

private java.lang.String datasetName
Dataset Name as read from the field "@relation" in the Arff file


discretizationIntervals

private double[][] discretizationIntervals
Matrix with rows of different length; stores the bin ranges for numerical features


featureNames

private java.util.ArrayList<java.lang.String> featureNames
Names of input features


nonMarFeatureNamesTesting

private java.util.ArrayList<java.lang.String> nonMarFeatureNamesTesting
Names of NonMar features in testing


nonMarFeatureNamesTraining

private java.util.ArrayList<java.lang.String> nonMarFeatureNamesTraining
Names of NonMar features in training


notUsedFeatures

private java.util.ArrayList<java.lang.Integer> notUsedFeatures
Indexes of features that are not used (because discretized into a single bin)


numFlags

private java.util.ArrayList<java.lang.Boolean> numFlags
Array of flags indicating whether each feature is numerical (true) or not (false)


RawDataset

private java.util.ArrayList<double[]> RawDataset
Copy of the data read from the Arff file, with missing data marked as -9999 and category names substituted by the corresponding indexes.


rawTestingSet

private java.util.ArrayList<java.lang.String[]> rawTestingSet
Raw testing set exactly as read from file. Used when the testing file carries no classes, so that the original instances can later be dumped to file, followed by a last column containing the classification. Because its rows are String arrays, it can host numbers as well as categories.


rowsClassIdx

private java.util.ArrayList<java.lang.Integer>[] rowsClassIdx
Indexes of the rows, in RawDataset, which have the same output class. For instance, the first row collects the indexes of all the rows in RawDataset having output class c1, and so on.
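
The grouping described above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: the names RowsByClassSketch and groupRowsByClass are hypothetical, the class index is assumed to sit in the last column of each row (consistent with the class being declared last in the header), and a List of Lists is used instead of an array of ArrayLists to avoid generic array creation.

```java
import java.util.ArrayList;
import java.util.List;

class RowsByClassSketch {
    /**
     * Groups row indexes by output class (hypothetical helper).
     * Entry c of the result lists the indexes of all rows whose
     * last column holds class index c.
     */
    static List<List<Integer>> groupRowsByClass(List<double[]> rows, int numClasses) {
        List<List<Integer>> byClass = new ArrayList<>();
        for (int c = 0; c < numClasses; c++) {
            byClass.add(new ArrayList<>());
        }
        for (int r = 0; r < rows.size(); r++) {
            double[] row = rows.get(r);
            // Assumption: the class index is stored in the last column.
            int classIdx = (int) row[row.length - 1];
            byClass.get(classIdx).add(r);
        }
        return byClass;
    }
}
```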


testingSet

private java.util.ArrayList<int[]> testingSet
Testing set, as accessed by the classifier: numerical variables discretized, category names substituted by their indexes, missing data marked as -9999, classes substituted with indexes.


usedFeatures

private java.util.ArrayList<java.lang.Integer> usedFeatures
Indexes of used features (i.e., categorical features and numerical features discretized into several bins)


validationMethod

private java.lang.String validationMethod
Set either to "CV" or to the name of the testing Arff file


workingPath

private java.lang.String workingPath
Path where the files for the given experiment (Arff files, NonMar.txt) reside, and where the output will be saved

Constructor Detail

ArffParser

ArffParser(java.lang.String UserSuppliedWorkingPath,
           java.lang.String UserSuppliedArffName,
           java.lang.String UserSuppliedValidationMethod)
Initializes data members; then scans the Arff file, checking the formal correctness of variable declarations and the coherence of the data with the declarations; stores the information and the data loaded from file. In particular, it instantiates the data members featureNames, numFlags (whether each feature is numerical or not), categoryNames (names of the categories of each categorical feature) and RawDataset (a matrix of doubles that contains the data as read from file, with missing values substituted by -9999, category names substituted by numerical indexes, and numerical values unchanged). Moreover, it reads the list of NonMar variables, which are then stored in nonMarFeatureNamesTraining and nonMarFeatureNamesTesting.
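
The encoding of a single value into RawDataset can be sketched as below. This is a minimal illustration, not the constructor's code: the names RawValueEncodingSketch and encode are hypothetical, and the sketch assumes that missing values appear in the file as "?", the standard ARFF marker.

```java
import java.util.Arrays;

class RawValueEncodingSketch {
    /** Marker used for missing data, as stated in the documentation. */
    static final double MISSING = -9999;

    /**
     * Encodes one token of a data row (hypothetical helper).
     * Missing values become -9999, numerical values are kept unchanged,
     * and category names are replaced by their index in the declaration.
     */
    static double encode(String token, boolean isNumeric, String[] categories) {
        String value = token.trim();
        // Assumption: "?" marks a missing value in the file.
        if (value.equals("?")) {
            return MISSING;
        }
        if (isNumeric) {
            return Double.parseDouble(value);
        }
        int idx = Arrays.asList(categories).indexOf(value);
        if (idx < 0) {
            // The data is not coherent with the declarations.
            throw new IllegalArgumentException("Undeclared category: " + value);
        }
        return idx;
    }
}
```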

Method Detail

getCategoryNames

java.util.ArrayList<java.lang.String[]> getCategoryNames()

getClassNames

java.util.ArrayList<java.lang.String> getClassNames()

getDatasetName

java.lang.String getDatasetName()

getDiscretizationIdx

private int getDiscretizationIdx(java.lang.Double currentValue,
                                 int FeatureIdx)
Return the bin in which a numerical value of a given feature falls.
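
A bin lookup of this kind might look like the sketch below. The layout of discretizationIntervals is not documented here, so the sketch makes an assumption: each row holds the ascending cut points of one feature, and a value equal to a cut point belongs to the lower bin. The names BinLookupSketch and binIndex are hypothetical.

```java
class BinLookupSketch {
    /**
     * Returns the index of the bin in which a value falls (hypothetical helper).
     * Assumption: cutPoints is sorted in ascending order; k cut points define
     * k + 1 bins, and a value equal to a cut point falls in the lower bin.
     */
    static int binIndex(double value, double[] cutPoints) {
        int bin = 0;
        // Advance past every cut point that the value strictly exceeds.
        while (bin < cutPoints.length && value > cutPoints[bin]) {
            bin++;
        }
        return bin;
    }
}
```

With no cut points at all the method always returns 0, which matches the case of a feature discretized into a single bin.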


getFeatureNames

java.util.ArrayList<java.lang.String> getFeatureNames()

getNonMarFeatureNamesTesting

java.util.ArrayList<java.lang.String> getNonMarFeatureNamesTesting()

getNonMarFeatureNamesTraining

java.util.ArrayList<java.lang.String> getNonMarFeatureNamesTraining()

getNumFlags

java.util.ArrayList<java.lang.Boolean> getNumFlags()

getRawDataset

java.util.ArrayList<double[]> getRawDataset()

getRawTestingSet

java.util.ArrayList<java.lang.String[]> getRawTestingSet()

getRowsClassIdx

java.util.ArrayList<java.lang.Integer>[] getRowsClassIdx()

getTestingSet

java.util.ArrayList<int[]> getTestingSet()

parseArffFile

private void parseArffFile()
Scans the main Arff file.


parseTestingArffFile

void parseTestingArffFile(boolean UnknownClasses)
Parses the testing file, checking that all declarations are coherent with those already loaded from the training Arff file; if the classes are unknown, it reads only the instances, without looking for the classes. Data are stored in testingSet: nominal features are stored as category indexes, while numerical features are discretized using the bins available from discretizationIntervals.


readNonMar

private void readNonMar()
Reads the file NonMar.txt, containing the list of NonMar variables; if no file is found, all variables are assumed to be MAR. If the name of a variable is not preceded by any token, the feature is treated as NonMar on both the training and the testing set; if it is preceded by "training" ["testing"], it is treated as NonMar on training [testing] only, and hence as MAR on testing [training].

Then, puts the names of the NonMar variables in nonMarFeatureNamesTraining and nonMarFeatureNamesTesting.
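
The token rule above can be sketched as follows. The exact layout of NonMar.txt is not specified here, so this sketch assumes one entry per line with whitespace-separated tokens; the class NonMarListSketch and its method addLine are hypothetical, and lines that do not match the rule are silently ignored.

```java
import java.util.ArrayList;
import java.util.List;

class NonMarListSketch {
    final List<String> training = new ArrayList<>();
    final List<String> testing = new ArrayList<>();

    /**
     * Processes one line of the NonMar list (hypothetical helper).
     * A bare name is NonMar on both sets; a name preceded by
     * "training" or "testing" is NonMar on that set only.
     */
    void addLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length == 1 && !tokens[0].isEmpty()) {
            // No leading token: NonMar on training and testing.
            training.add(tokens[0]);
            testing.add(tokens[0]);
        } else if (tokens.length == 2 && tokens[0].equals("training")) {
            training.add(tokens[1]);
        } else if (tokens.length == 2 && tokens[0].equals("testing")) {
            testing.add(tokens[1]);
        }
        // Anything else (blank or malformed lines) is ignored in this sketch.
    }
}
```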


setArffTestingFileAddress

void setArffTestingFileAddress(java.lang.String suppliedArffTestingFileAddress)

setDiscretizationIntervals

void setDiscretizationIntervals(double[][] suppliedDiscretizationIntervals)

setNotUsedFeatures

void setNotUsedFeatures(java.util.ArrayList<java.lang.Integer> suppliedNotUsedFeatures)

setUsedFeatures

void setUsedFeatures(java.util.ArrayList<java.lang.Integer> suppliedUsedFeatures)