jncc20
Class Jncc

java.lang.Object
  extended by jncc20.Jncc

public class Jncc
extends java.lang.Object

Main class of the project. It loads data from the file specified by the user (using ArffParser objects), then trains and validates NBC and NCC. Class Jncc implements three kinds of experiments: (1) 10 runs of stratified 10-fold cross-validation; (2) validation via a testing file (a single training/testing experiment); (3) classification of a testing file with unknown classes. In the first two cases, accuracy statistics are reported to file (via ResultsReporter objects), as the true classes are known. In the last case, only the NCC predictions are reported to file, as the true classes are unknown. Numerical features are discretized via MDL-entropy-based supervised discretization (using MdlDiscretizer objects). Note that discretization intervals are always computed on the training set and then applied unchanged to the testing set.


Nested Class Summary
private static class Jncc.ResultsReporter
          Helper class for Jncc, which accomplishes the following tasks: reads the temporary file where NBC and NCC predictions are stored; computes performance indexes; produces the output file, which reports both the discretization log (i.e., whether some numerical features have been discretized into a single bin) and the classifiers' results.
 
Field Summary
private  ArffParser aParser
          Object for parsing ARFF files
private  java.lang.String arffTestingFile
          Absolute Path of the testing Arff file
private  java.lang.String arffTestingFileAddress
          Name of the testing Arff file
private  java.util.ArrayList<java.lang.String[]> categoryNames
          Matrix of Strings with rows of different length, as different features (each row of the matrix corresponds to a different feature) can have different numbers of categories.
private  java.util.ArrayList<java.lang.String> classesNames
          Names of the output classes.
private  int[] cvFoldsIdx
          Indexes for cross validation: in which fold each row of RawDataset falls
private  double[][] discretizationIntervals
          Matrix with rows of different length; stores the bin ranges for numerical features
private  int[] discretLog
          How many times each feature has been discretized into a single bin, over the different training/testing experiments.
private  java.util.ArrayList<java.lang.String> featNames
          Names of input features
private  int[] foldsSize
          How many instances are in each fold
private  java.util.ArrayList<java.lang.String> nonMarFeatureNamesTesting
          Names of NonMar features in testing
private  java.util.ArrayList<java.lang.String> nonMarFeatureNamesTraining
          Names of NonMar features in training
private  java.util.ArrayList<java.lang.Integer> nonMarInCurrentTestingDataset
          Positions of the NonMar features in the current testing set (positions might change during CV, as different variables can get discretized into a single bin)
private  java.util.ArrayList<java.lang.Integer> nonMarInCurrentTrainingDataset
          Positions of the NonMar features in the current training set (positions might change during CV, as different variables can get discretized into a single bin)
private  java.util.ArrayList<java.lang.Integer> notUsedFeatures
          Variables not used in the current experiment, because they were discretized into a single bin; indexes refer to RawDataset
private  java.util.ArrayList<java.lang.Integer> numClassesNonMarTesting
          Number of classes of the NonMar variables in the testing set.
private  java.util.ArrayList<java.lang.Integer> numClassForEachUsedFeature
          Number of classes for each used feature
private  int numCrossVRuns
          Number of Cross validation Runs
private  int numCvFolds
          Number of folds used by cross-validation
private  java.util.ArrayList<java.lang.Boolean> numFlags
          Array of flags indicating whether each feature is numerical (true) or not (false)
private  java.lang.String predsFile
          Absolute Path of the predictions file for CV
private  java.util.ArrayList<double[]> rawDataset
          Copy of the data read from the Arff file, with -9999 as the marker for missing data and category names substituted by the corresponding indexes.
private  java.util.ArrayList<java.lang.String[]> rawTestingSet
          Raw testing set exactly as read from file.
private  java.lang.String resFile
          File that reports avg and std dev of performance indicators; this is the ultimate output file
private  java.util.ArrayList<java.lang.Integer>[] rowsClassIdx
          Indexes of the rows, in RawDataset, which have the same output class.
private  java.util.ArrayList<int[]> testingSet
          Testing set, accessed by the classifier: numerical variables are discretized, while category names and classes are substituted by indexes; missing data denoted as -9999.
private  java.util.ArrayList<int[]> trainingSet
          Training set, accessed by the classifier: numerical variables are discretized, while category names and classes are substituted by indexes; missing data denoted as -9999.
private  java.util.ArrayList<java.lang.Integer> usedFeatures
          Variables used in the current experiment, hence excluding those discretized into a single bin.
private  java.util.ArrayList<java.lang.String> usedFeaturesNames
          Names of the variables used in the current experiment, hence excluding those discretized into a single bin.
private  java.lang.String validationMethod
          Set either to "CV" or to the name of the testing Arff file
private  java.lang.String workPath
          Path where the files for the given experiment (Arff files, NonMar.txt) are found, and where output files will be saved.
 
Constructor Summary
Jncc(java.lang.String UserSuppliedWorkingPath, java.lang.String UserSuppliedArffName, java.lang.String UserSuppliedValidationName)
          Initializes the necessary data members, scans the main Arff file and then instantiates the data members FeatureNames, NumFlags, CategoryNames and RawDataset.
 
Method Summary
private  void discretizeNumFeaturesOnTrainingData(java.util.ArrayList<double[]> TrainingData)
          Discretizes all the numerical features on the Training Set, and instantiates DiscretizationIntervals, UsedFeatures, UsedFeaturesNames, NumClassForEachUsedFeature; updates DiscretizationLog.
private  void drawCVindexes()
          Draws stratified folds for cross-validation, instantiating CvFoldsIdx.
private  void findNonMarInCurrentDataset()
          Prepares the NonMarInCurrentDataset data member.
private  int getDiscretizationIdx(java.lang.Double currentValue, int FeatureIdx)
          Given a numerical value of a certain discretized feature, returns the index of the bin in which the value falls
static void main(java.lang.String[] args)
          Arguments of the main: (1) the working path; (2) the name of the main Arff file; (3) "cv" or the name of the testing Arff file; (4) [OPTIONAL] "unknownClasses", in case the actual classes of the testing set are unknown.
private  void predictionsToFileNbcNcc(int NumFold, int[] NBCPredictions, int[][] CredalPredictions)
          Dumps to file the predictions issued by both NBC and NCC on testing set(s).
private  void prepareDataSetFromRawData(java.util.ArrayList<double[]> SourceData, java.util.ArrayList<int[]> DestinationData)
          Takes a raw set of data (undiscretized features) and puts it into a dataset to be accessed by classifiers; categorical variables are copied unchanged, while numerical variables are converted to categorical according to DiscretizationIntervals; numerical variables discretized into a single bin (and hence listed in NotUsedFeatures) are discarded.
private  void validateViaCV(java.lang.String[] args)
          Validates NBC and NCC via 10 runs of 10-fold cross-validation.
private  void validateViaTestingFile(java.lang.String TestingFile)
          Validates NBC and NCC via testing file.
private  void validateViaTestingFileUnknownClasses()
          Learns NCC; classifies the instances of the testing file via NCC, and writes the classifications to file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

aParser

private ArffParser aParser
Object for parsing ARFF files


arffTestingFile

private java.lang.String arffTestingFile
Absolute Path of the testing Arff file


arffTestingFileAddress

private java.lang.String arffTestingFileAddress
Name of the testing Arff file


categoryNames

private java.util.ArrayList<java.lang.String[]> categoryNames
Matrix of Strings with rows of different length, as different features (each row of the matrix corresponds to a different feature) can have different numbers of categories.


classesNames

private java.util.ArrayList<java.lang.String> classesNames
Names of the output classes.


cvFoldsIdx

private int[] cvFoldsIdx
Indexes for cross validation: in which fold each row of RawDataset falls


discretizationIntervals

private double[][] discretizationIntervals
Matrix with rows of different length; stores the bin ranges for numerical features


discretLog

private int[] discretLog
How many times each feature has been discretized into a single bin, over the different training/testing experiments.


featNames

private java.util.ArrayList<java.lang.String> featNames
Names of input features


foldsSize

private int[] foldsSize
How many instances are in each fold


nonMarFeatureNamesTesting

private java.util.ArrayList<java.lang.String> nonMarFeatureNamesTesting
Names of NonMar features in testing


nonMarFeatureNamesTraining

private java.util.ArrayList<java.lang.String> nonMarFeatureNamesTraining
Names of NonMar features in training


nonMarInCurrentTestingDataset

private java.util.ArrayList<java.lang.Integer> nonMarInCurrentTestingDataset
Positions of the NonMar features in the current testing set (positions might change during CV, as different variables can get discretized into a single bin)


nonMarInCurrentTrainingDataset

private java.util.ArrayList<java.lang.Integer> nonMarInCurrentTrainingDataset
Positions of the NonMar features in the current training set (positions might change during CV, as different variables can get discretized into a single bin)


notUsedFeatures

private java.util.ArrayList<java.lang.Integer> notUsedFeatures
Variables not used in the current experiment, because they were discretized into a single bin; indexes refer to RawDataset


numClassesNonMarTesting

private java.util.ArrayList<java.lang.Integer> numClassesNonMarTesting
Number of classes of the NonMar variables in the testing set. Useful when the NCC builds all the possible realizations of the NonMar variables
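
The enumeration of all possible realizations of a group of variables can be sketched as follows. This is a minimal illustration, not the Jncc code: the class NonMarRealizationsSketch, the method allRealizations and the parameter numCategories are hypothetical names, and the odometer-style counter is only one way to walk the Cartesian product of the category indexes.

```java
import java.util.ArrayList;
import java.util.List;

final class NonMarRealizationsSketch {

    /**
     * Enumerates every combination of category indexes for a group of features.
     * numCategories[i] is the number of categories of the i-th feature
     * (hypothetical input, playing the role of numClassesNonMarTesting).
     */
    static List<int[]> allRealizations(int[] numCategories) {
        for (int n : numCategories) {
            if (n < 1) {
                throw new IllegalArgumentException("each feature needs at least one category");
            }
        }
        List<int[]> realizations = new ArrayList<>();
        int[] current = new int[numCategories.length];
        while (true) {
            realizations.add(current.clone());
            // Increment like an odometer, starting from the last feature.
            int pos = current.length - 1;
            while (pos >= 0 && ++current[pos] == numCategories[pos]) {
                current[pos] = 0;
                pos--;
            }
            if (pos < 0) {
                break; // every position wrapped around: enumeration complete
            }
        }
        return realizations;
    }
}
```

The number of realizations is the product of the entries of numCategories, so it grows quickly with the number of NonMar variables.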


numClassForEachUsedFeature

private java.util.ArrayList<java.lang.Integer> numClassForEachUsedFeature
Number of classes for each used feature


numCrossVRuns

private int numCrossVRuns
Number of Cross validation Runs


numCvFolds

private int numCvFolds
Number of folds used by cross-validation


numFlags

private java.util.ArrayList<java.lang.Boolean> numFlags
Array of flags indicating whether each feature is numerical (true) or not (false)


predsFile

private java.lang.String predsFile
Absolute Path of the predictions file for CV


rawDataset

private java.util.ArrayList<double[]> rawDataset
Copy of the data read from the Arff file, with -9999 as the marker for missing data and category names substituted by the corresponding indexes.
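
How a single token of an Arff data row might be turned into the double stored in such a dataset is sketched below. The names RawValueEncodingSketch and encode are hypothetical, and the sketch assumes that a category is encoded as its position in the feature's array of category names; the actual parsing is done by ArffParser and may differ.

```java
final class RawValueEncodingSketch {

    /** Marker for missing data, as stated in the field description. */
    static final double MISSING = -9999;

    /**
     * Encodes one token of an Arff data row.
     * categories is only read for non-numerical features (assumed layout).
     */
    static double encode(String token, boolean numerical, String[] categories) {
        String value = token.trim();
        if ("?".equals(value)) {
            return MISSING; // "?" is how Arff files denote a missing value
        }
        if (numerical) {
            return Double.parseDouble(value);
        }
        for (int i = 0; i < categories.length; i++) {
            if (categories[i].equals(value)) {
                return i; // category name replaced by its index
            }
        }
        throw new IllegalArgumentException("unknown category: " + value);
    }
}
```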


rawTestingSet

private java.util.ArrayList<java.lang.String[]> rawTestingSet
Raw testing set exactly as read from file. Used when a testing file with unknown classes is provided, to eventually dump to file the values of the instances. Since its rows are String arrays, it can hold numbers as well as categories.


resFile

private java.lang.String resFile
File that reports avg and std dev of performance indicators; this is the ultimate output file
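
As an illustration of the two summary statistics, the sketch below computes the average and the standard deviation of an indicator over several runs. The names RunStatsSketch, mean and sampleStdDev are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption: the documentation does not say which variant Jncc reports.

```java
final class RunStatsSketch {

    /** Arithmetic mean of the indicator values collected over the runs. */
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample standard deviation (divides by n - 1); an assumed choice. */
    static double sampleStdDev(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(values);
        double squared = 0.0;
        for (double v : values) {
            squared += (v - m) * (v - m);
        }
        return Math.sqrt(squared / (values.length - 1));
    }
}
```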


rowsClassIdx

private java.util.ArrayList<java.lang.Integer>[] rowsClassIdx
Indexes of the rows, in RawDataset, which have the same output class. For instance, the first row collects the indexes of all the rows in RawDataset having output class c1, and so on.
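
A structure of this kind can be built in a single pass over the class column, as in the sketch below. RowsByClassSketch, groupRowsByClass and classOfRow are hypothetical names; a List of Lists is used instead of an array of ArrayLists only to avoid generic array creation in the example.

```java
import java.util.ArrayList;
import java.util.List;

final class RowsByClassSketch {

    /**
     * classOfRow[r] is the class index of row r (hypothetical input).
     * Element c of the result lists the rows whose output class is c.
     */
    static List<List<Integer>> groupRowsByClass(int[] classOfRow, int numClasses) {
        List<List<Integer>> rowsByClass = new ArrayList<>();
        for (int c = 0; c < numClasses; c++) {
            rowsByClass.add(new ArrayList<>());
        }
        for (int row = 0; row < classOfRow.length; row++) {
            rowsByClass.get(classOfRow[row]).add(row);
        }
        return rowsByClass;
    }
}
```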


testingSet

private java.util.ArrayList<int[]> testingSet
Testing set, accessed by the classifier: numerical variables are discretized, while category names and classes are substituted by indexes; missing data denoted as -9999.


trainingSet

private java.util.ArrayList<int[]> trainingSet
Training set, accessed by the classifier: numerical variables are discretized, while category names and classes are substituted by indexes; missing data denoted as -9999.


usedFeatures

private java.util.ArrayList<java.lang.Integer> usedFeatures
Variables used in the current experiment, hence excluding those discretized into a single bin. Indexes refer to RawDataset


usedFeaturesNames

private java.util.ArrayList<java.lang.String> usedFeaturesNames
Names of the variables used in the current experiment, hence excluding those discretized into a single bin.


validationMethod

private java.lang.String validationMethod
Set either to "CV" or to the name of the testing Arff file


workPath

private java.lang.String workPath
Path where the files for the given experiment (Arff files, NonMar.txt) are found, and where output files will be saved.

Constructor Detail

Jncc

Jncc(java.lang.String UserSuppliedWorkingPath,
     java.lang.String UserSuppliedArffName,
     java.lang.String UserSuppliedValidationName)
Initializes the necessary data members, scans the main Arff file and then instantiates the data members FeatureNames, NumFlags, CategoryNames and RawDataset.

Method Detail

discretizeNumFeaturesOnTrainingData

private void discretizeNumFeaturesOnTrainingData(java.util.ArrayList<double[]> TrainingData)
Discretizes all the numerical features on the Training Set, and instantiates DiscretizationIntervals, UsedFeatures, UsedFeaturesNames, NumClassForEachUsedFeature; updates DiscretizationLog.


drawCVindexes

private void drawCVindexes()
Draws stratified folds for cross-validation, instantiating CvFoldsIdx.
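
One common way to draw stratified folds is to shuffle the rows of each class and deal them to the folds in turn, so that every fold receives roughly the same share of each class. The sketch below follows that idea; it is not taken from the Jncc source. StratifiedFoldsSketch, drawFolds and the seed parameter are hypothetical, and the input is assumed to list every row index from 0 to n - 1 exactly once, grouped by output class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class StratifiedFoldsSketch {

    /**
     * rowsByClass.get(c) holds the indexes of the rows with output class c.
     * Returns, for each row, the index of the fold it is assigned to.
     */
    static int[] drawFolds(List<List<Integer>> rowsByClass, int numFolds, long seed) {
        int numRows = 0;
        for (List<Integer> rows : rowsByClass) {
            numRows += rows.size();
        }
        int[] foldOfRow = new int[numRows];
        Random rng = new Random(seed);
        int dealt = 0; // keeps counting across classes so fold sizes stay balanced
        for (List<Integer> rows : rowsByClass) {
            List<Integer> shuffled = new ArrayList<>(rows); // leave the input untouched
            Collections.shuffle(shuffled, rng);
            for (int row : shuffled) {
                foldOfRow[row] = dealt % numFolds;
                dealt++;
            }
        }
        return foldOfRow;
    }
}
```

Calling drawFolds once per run with a different seed would give the distinct partitions needed for repeated cross-validation.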


findNonMarInCurrentDataset

private void findNonMarInCurrentDataset()
Prepares the NonMarInCurrentDataset data member.


getDiscretizationIdx

private int getDiscretizationIdx(java.lang.Double currentValue,
                                 int FeatureIdx)
Given a numerical value of a certain discretized feature, returns the index of the bin in which the value falls
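
A bin lookup of this kind can be sketched as follows. The layout of DiscretizationIntervals is not documented here, so the sketch assumes that each feature has an ascending array of cut points and that a value equal to a cut point belongs to the lower bin; BinLookupSketch, binIndex and cutPoints are hypothetical names.

```java
final class BinLookupSketch {

    /**
     * Returns the index of the bin containing value.
     * With k cut points there are k + 1 bins, numbered 0..k.
     */
    static int binIndex(double value, double[] cutPoints) {
        int idx = 0;
        // Advance past every cut point that lies strictly below the value.
        while (idx < cutPoints.length && value > cutPoints[idx]) {
            idx++;
        }
        return idx;
    }
}
```

Under this assumption a feature with no cut points always maps to bin 0, which corresponds to the "single bin" case mentioned throughout this page.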


main

public static void main(java.lang.String[] args)
Arguments of the main: (1) the working path; (2) the name of the main Arff file; (3) "cv" or the name of the testing Arff file; (4) [OPTIONAL] "unknownClasses", in case the actual classes of the testing set are unknown.
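
The mapping from these arguments to the three kinds of experiment can be sketched as below. JnccArgsSketch, Experiment and select are hypothetical names; matching "cv" and "unknownClasses" case-insensitively and rejecting "cv" combined with a fourth argument are assumptions of the sketch, not documented behaviour.

```java
final class JnccArgsSketch {

    enum Experiment { CROSS_VALIDATION, TESTING_FILE, TESTING_FILE_UNKNOWN_CLASSES }

    /** Picks the experiment from the command-line arguments described above. */
    static Experiment select(String[] args) {
        if (args.length < 3 || args.length > 4) {
            throw new IllegalArgumentException("expected 3 or 4 arguments, got " + args.length);
        }
        boolean crossValidation = "cv".equalsIgnoreCase(args[2]);
        if (args.length == 4) {
            if (!"unknownClasses".equalsIgnoreCase(args[3])) {
                throw new IllegalArgumentException("unexpected fourth argument: " + args[3]);
            }
            if (crossValidation) {
                // Assumed: unknown classes only make sense with a testing file.
                throw new IllegalArgumentException("\"unknownClasses\" requires a testing file");
            }
            return Experiment.TESTING_FILE_UNKNOWN_CLASSES;
        }
        return crossValidation ? Experiment.CROSS_VALIDATION : Experiment.TESTING_FILE;
    }
}
```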


predictionsToFileNbcNcc

private void predictionsToFileNbcNcc(int NumFold,
                                     int[] NBCPredictions,
                                     int[][] CredalPredictions)
Dumps to file the predictions issued by both NBC and NCC on testing set(s). It produces a file which contains the features (apart from the ones discretized into a unique bin, which can change between different runs of CV), the actual class, the NCC classification (i.e., a number of columns equal to the number of classes containing either the outputted class, or 6666 to mean that not all classes have been outputted by NCC), and finally the NBC prediction.
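
One reading of the NCC columns described above is that the classes returned by NCC are written first and the remaining columns are filled with 6666. The sketch below builds such a row under that assumption; CredalRowSketch, padPrediction and returnedClasses are hypothetical names, and the exact column order used by Jncc is not confirmed by this page.

```java
import java.util.Arrays;

final class CredalRowSketch {

    /** Filler value named in the method description. */
    static final int FILLER = 6666;

    /**
     * Builds a row with one column per class: the classes returned by NCC
     * come first, any unused column holds the filler.
     */
    static int[] padPrediction(int[] returnedClasses, int numClasses) {
        if (returnedClasses.length > numClasses) {
            throw new IllegalArgumentException("more returned classes than classes");
        }
        int[] row = new int[numClasses];
        Arrays.fill(row, FILLER);
        System.arraycopy(returnedClasses, 0, row, 0, returnedClasses.length);
        return row;
    }
}
```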


prepareDataSetFromRawData

private void prepareDataSetFromRawData(java.util.ArrayList<double[]> SourceData,
                                       java.util.ArrayList<int[]> DestinationData)
Takes a raw set of data (undiscretized features) and puts it into a dataset to be accessed by classifiers; categorical variables are copied unchanged, while numerical variables are converted to categorical according to DiscretizationIntervals; numerical variables discretized into a single bin (and hence listed in NotUsedFeatures) are discarded.
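
The conversion of a single row can be sketched as follows. All names (PrepareDatasetSketch, convertRow, isNumerical, cutPoints) are hypothetical; the sketch assumes that discarded features are simply absent from the list of used features, that each numerical feature has an ascending array of cut points, and that the missing-data marker -9999 is carried over unchanged.

```java
import java.util.List;

final class PrepareDatasetSketch {

    static final int MISSING = -9999;

    /** Converts one raw row, keeping only the features listed in usedFeatures. */
    static int[] convertRow(double[] rawRow, List<Integer> usedFeatures,
                            boolean[] isNumerical, double[][] cutPoints) {
        int[] converted = new int[usedFeatures.size()];
        for (int k = 0; k < converted.length; k++) {
            int feature = usedFeatures.get(k);
            double value = rawRow[feature];
            if (value == MISSING) {
                converted[k] = MISSING;                 // keep the missing marker
            } else if (isNumerical[feature]) {
                converted[k] = binIndex(value, cutPoints[feature]);
            } else {
                converted[k] = (int) value;             // category index, copied unchanged
            }
        }
        return converted;
    }

    /** Index of the bin containing value, given ascending cut points (assumed layout). */
    static int binIndex(double value, double[] cuts) {
        int idx = 0;
        while (idx < cuts.length && value > cuts[idx]) {
            idx++;
        }
        return idx;
    }
}
```

Because the cut points are passed in rather than recomputed, the same call can serve both the training and the testing rows, which matches the rule that intervals learned on the training set are applied unchanged to the testing set.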


validateViaCV

private void validateViaCV(java.lang.String[] args)
Validates NBC and NCC via 10 runs of 10-fold cross-validation. Reports to file the relevant accuracy measures.


validateViaTestingFile

private void validateViaTestingFile(java.lang.String TestingFile)
Validates NBC and NCC via testing file. Reports to file the relevant accuracy measures.


validateViaTestingFileUnknownClasses

private void validateViaTestingFileUnknownClasses()
Learns NCC; classifies the instances of the testing file via NCC, and writes the classifications to file.