jncc20
Class Jncc

java.lang.Object
  extended by jncc20.Jncc

public class Jncc
extends java.lang.Object

Main class of the project. It loads the data set from the file specified by the user, then trains and validates NBC and NCC according to the validation method specified by the user. Jncc implements three validation methods: 1) 10 runs of stratified 10-fold cross-validation; 2) validation via testing file (a single training/testing experiment); 3) testing file with unknown classes. In the first two cases, accuracy statistics are reported to file (via ResultsReporter objects), as the true classes are known. In the last case, only the NCC predictions are reported to file, as the true classes are unknown. Numerical features are discretized via MDL-entropy-based supervised discretization (using MdlDiscretizer objects). Note that the discretization intervals are computed on the training set and then applied unchanged to the testing set.


Nested Class Summary
private  class Jncc.ResultsReporter
          Helper class for Jncc, which accomplishes the following tasks: reads the temporary file where NBC and NCC predictions are stored; computes the performance indexes; produces the output files, i.e., ResultsTable.csv (performance indicators), ConfMatrices.txt (confusion matrices) and, if a testing file is supplied, Prediction-.csv (instances and predictions of the testing file).
 
Field Summary
private  java.lang.String arffFileAddress
          Absolute path of the main Arff file
private  java.lang.String arffTestingFile
          Absolute path of the testing Arff file
private  java.lang.String arffTestingFileName
          Name of the testing Arff file
private  java.util.ArrayList<java.lang.String[]> categoryNames
          Matrix of String with rows of different length; stores the names of the categories (each row corresponds to a different feature); meaningful for categorical features only.
private  java.util.ArrayList<java.lang.String> classNames
          Names of the output classes.
private  int currentCvFold
           
private  int[] cvFoldsIdx
          Indexes for cross validation: in which fold each row of rawDataset falls
private  java.lang.String datasetName
          Dataset Name as read from the field "@relation" in the Arff file
private  double[][] discretizationIntervals
          Matrix with rows of different length; stores the bin ranges for numerical features
private  int[] discretLog
          How many times each feature has been discretized in a single bin, over the different training/testing experiments.
private  java.util.ArrayList<java.lang.String> featNames
          Names of input features
private  int[] foldsSize
          How many instances are in each fold
(package private)  com.sun.management.OperatingSystemMXBean mxbean
          needed to track execution time
private  NaiveBayes nbc
          Naive Bayes classifier
private  NaiveCredalClassifier2 ncc2
          NCC2 classifier
private  java.util.ArrayList<java.lang.String> nonMarFeatsTesting
          Names of NonMar features in testing
private  java.util.ArrayList<java.lang.String> nonMarFeatsTraining
          Names of NonMar features in training
private  java.util.ArrayList<java.lang.Integer> nonMarTesting
          Indexes of the positions of NonMar features in the current testing set (positions might change during CV, as different variables can get discretized into a single bin)
private  java.util.ArrayList<java.lang.Integer> nonMarTraining
          Indexes of the positions of NonMar features in the current training set (positions might change during CV, as different variables can get discretized into a single bin)
private  java.util.ArrayList<java.lang.Integer> notUsedFeatures
          Variables not used in the current experiment, because discretized in a single bin; indexes refer to rawDataset
private  java.util.ArrayList<java.lang.Integer> numClassesNonMarTesting
          Number of classes of variables NonMar in the testing set.
private  java.util.ArrayList<java.lang.Integer> numClassForEachUsedFeature
          Number of classes for each used feature
private  int numCvFolds
          Number of folds used by cross-validation
private  int numCvRuns
          Number of cross-validation runs
private  java.util.ArrayList<java.lang.Boolean> numFlags
          Flags array, indicating whether features are numerical (1) or not (0)
private  java.lang.String predictionsFile
          Absolute path of the temporary predictions file
private  java.lang.String probabilitiesFile
          File that reports the probabilities estimated by the precise classifiers and whether the imprecise classifier is precise or not; used to compute the curve of precision vs. accuracy
private  java.util.ArrayList<double[]> rawDataset
          Copy of the data read from the Arff file, with -9999 as marker for missing data and category names substituted by the corresponding indexes.
private  java.util.ArrayList<java.lang.String[]> rawTestingSet
          Raw testing set exactly as read from file.
private  java.lang.String resultsFile
          File that reports avg and std dev of performance indicators; this is the ultimate output file
private  java.util.ArrayList<java.lang.Integer>[] rowsClassIdx
          Indexes of the rows, in rawDataset, which have the same output class.
private  double startTime
          time at which program is started
private  java.util.ArrayList<int[]> testingSet
          Testing set, accessed by the classifier: numerical variables are discretized, while category names and classes are substituted by indexes; missing data denoted as -9999.
private  java.util.ArrayList<int[]> trainingSet
          Training set, accessed by the classifier: numerical variables are discretized, while category names and classes are substituted by indexes; missing data denoted as -9999.
private  boolean unknownClasses
          Whether classes of the testing set are known or not
private  java.util.ArrayList<java.lang.Integer> usedFeatures
          Variables used in the current experiment, hence excluding those discretized in a single bin.
private  java.util.ArrayList<java.lang.String> usedFeaturesNames
          Names of the variables used in the current experiment, hence excluding those discretized in a single bin.
private  java.lang.String validationMethod
          Set either to "CV" or to the name of the testing Arff file
private  java.lang.String workPath
          Path where the files for the given experiment (Arff files, NonMar.txt) are found, and where output files will be saved.
 
Constructor Summary
Jncc(java.lang.String UserSuppliedWorkingPath, java.lang.String UserSuppliedArffName, java.lang.String UserSuppliedValidationName, int numArgs)
          Initializes the necessary data members, scans the main Arff file and then instantiates the data members FeatureNames, NumFlags, CategoryNames and rawDataset.
 
Method Summary
private static void checkArgs(java.lang.String[] args)
          Sanity-check of the parameters supplied by the user
private  void deleteFileIfExisting(java.lang.String file)
           
private  void discretizeNumFeats(java.util.ArrayList<double[]> trainingData)
          Discretizes all the numerical features on the Training Set, and instantiates DiscretizationIntervals, UsedFeatures, UsedFeaturesNames, NumClassForEachUsedFeature; updates DiscretizationLog.
private  void drawCVindexes()
          Draws stratified folds for cross-validation, instantiating CvFoldsIdx.
private  boolean findFeatName(java.lang.String tmpString)
           
private  void findNonMarInCurrentDataset()
          Prepares the NonMarInCurrentDataset data member.
private  int getDiscretizationIdx(java.lang.Double currentValue, int FeatureIdx)
          Given a numerical value of a certain discretized feature, returns the index of the bin in which the value falls
private  void initResultsFiles(java.lang.String validationFile)
          Initializes the files that store the predictions (which are only temporary) and the performance indicators; validationFile is the only available Arff file in case of CV, or the testing file in case of validation via testing file; it is not defined in case of unknown classes.
static void main(java.lang.String[] args)
          Arguments of the main: (1) the working path; (2) the name of the main Arff file; (3) "cv" or the name of the testing Arff file; (4) [OPTIONAL] "unknownClasses", in case the actual classes of the testing set are unknown.
private  void parseArffFile()
          Scans the main Arff file.
private  void parseArffTestingFile(boolean UnknownClasses)
          Parses the testing file, checking that all declarations are coherent with those already loaded from the training Arff file; if the classes are unknown, it reads only the instances, without looking for the classes.
private  void parseNonMar()
          Reads the file NonMar.txt, containing the list of nonMar variables; if no file is found, all variables are assumed to be MAR.
private  void prepareDataSetFromRawData(java.util.ArrayList<double[]> SourceData, java.util.ArrayList<int[]> DestinationData)
          Takes a raw set of data (undiscretized features) and puts it into a dataset to be accessed by the classifiers; categorical variables are copied unchanged, while numerical variables are converted to categorical according to DiscretizationIntervals; numerical variables discretized into a single bin (and hence listed in notUsedFeatures) are discarded.
private  void prepareTrainTestSet()
          Prepares training and testing sets for validation via testing set, discretizing also numerical variables.
private  void prepareTrainTestSet(int currentFold)
          Prepares training and testing sets for cross-validation, discretizing also numerical variables.
private static void printArgError()
           
private  void printElapsedTime()
           
private static void printHelp()
          Writes a help message to the user, specifying the syntax to be used with JNCC2.
private  void saveTmpPredictions()
          Dumps to file the predictions issued by the classifiers on the testing set(s); they will be later analyzed to compute the indicators, and eventually deleted.
private  void trainValidClassifiers()
          Trains the classifiers on the training set, validates them on the testing set and saves the predictions to a temporary file
private  void validateCV(java.lang.String[] args)
          Validates NBC and NCC via 10 runs of 10-fold cross-validation.
private  void validateTFile(java.lang.String TestingFile)
          Validates NBC and NCC via testing file.
private  void validateTFileUnkClasses()
          Learns NCC; classifies the instances of the testing file via NCC, and writes the classifications to file.
private  void writePerfIndicators()
          Once the classifiers have been validated (either via CV or single testing file), saves to file all the relevant information
private  void writePredictions()
          Writes to file the instances, actual classes, probability distribution computed by NBC and non-dominated classes identified by NCC2.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

arffFileAddress

private java.lang.String arffFileAddress
Absolute path of the main Arff file


arffTestingFile

private java.lang.String arffTestingFile
Absolute path of the testing Arff file


arffTestingFileName

private java.lang.String arffTestingFileName
Name of the testing Arff file


categoryNames

private java.util.ArrayList<java.lang.String[]> categoryNames
Matrix of String with rows of different length; stores the names of the categories (each row corresponds to a different feature); meaningful for categorical features only.


classNames

private java.util.ArrayList<java.lang.String> classNames
Names of the output classes.


currentCvFold

private int currentCvFold

cvFoldsIdx

private int[] cvFoldsIdx
Indexes for cross validation: in which fold each row of rawDataset falls


datasetName

private java.lang.String datasetName
Dataset Name as read from the field "@relation" in the Arff file


discretizationIntervals

private double[][] discretizationIntervals
Matrix with rows of different length; stores the bin ranges for numerical features


discretLog

private int[] discretLog
How many times each feature has been discretized in a single bin, over the different training/testing experiments.


featNames

private java.util.ArrayList<java.lang.String> featNames
Names of input features


foldsSize

private int[] foldsSize
How many instances are in each fold


mxbean

com.sun.management.OperatingSystemMXBean mxbean
needed to track execution time


nbc

private NaiveBayes nbc
Naive Bayes classifier


ncc2

private NaiveCredalClassifier2 ncc2
NCC2 classifier


nonMarFeatsTesting

private java.util.ArrayList<java.lang.String> nonMarFeatsTesting
Names of NonMar features in testing


nonMarFeatsTraining

private java.util.ArrayList<java.lang.String> nonMarFeatsTraining
Names of NonMar features in training


nonMarTesting

private java.util.ArrayList<java.lang.Integer> nonMarTesting
Indexes of the positions of NonMar features in the current testing set (positions might change during CV, as different variables can get discretized into a single bin)


nonMarTraining

private java.util.ArrayList<java.lang.Integer> nonMarTraining
Indexes of the positions of NonMar features in the current training set (positions might change during CV, as different variables can get discretized into a single bin)


notUsedFeatures

private java.util.ArrayList<java.lang.Integer> notUsedFeatures
Variables not used in the current experiment, because discretized in a single bin; indexes refer to rawDataset


numClassesNonMarTesting

private java.util.ArrayList<java.lang.Integer> numClassesNonMarTesting
Number of classes of variables NonMar in the testing set. Useful when the NCC builds all the possible realizations of the NonMar variables


numClassForEachUsedFeature

private java.util.ArrayList<java.lang.Integer> numClassForEachUsedFeature
Number of classes for each used feature


numCvFolds

private int numCvFolds
Number of folds used by cross-validation


numCvRuns

private int numCvRuns
Number of cross-validation runs


numFlags

private java.util.ArrayList<java.lang.Boolean> numFlags
Flags array, indicating whether features are numerical (1) or not (0)


predictionsFile

private java.lang.String predictionsFile
Absolute path of the temporary predictions file


probabilitiesFile

private java.lang.String probabilitiesFile
File that reports the probabilities estimated by the precise classifiers and whether the imprecise classifier is precise or not; used to compute the curve of precision vs. accuracy


rawDataset

private java.util.ArrayList<double[]> rawDataset
Copy of the data read from the Arff file, with -9999 as marker for missing data and category names substituted by the corresponding indexes.


rawTestingSet

private java.util.ArrayList<java.lang.String[]> rawTestingSet
Raw testing set exactly as read from file. Used when a testing file with unknown classes is provided, to eventually dump to file the values of the instances. Being stored as arrays of String, it hosts numbers as well as categories.


resultsFile

private java.lang.String resultsFile
File that reports avg and std dev of performance indicators; this is the ultimate output file
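
As an illustration of the kind of summary this file holds, the sketch below computes the average and standard deviation of one indicator over several runs. The class name, method names and sample values are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption; this page does not say which estimator JNCC2 applies.

```java
class IndicatorSummarySketch {
    // Arithmetic mean of the indicator over all runs.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Sample standard deviation (divides by n - 1); this choice is an assumption.
    static double stdDev(double[] values) {
        double m = mean(values);
        double squares = 0.0;
        for (double v : values) {
            squares += (v - m) * (v - m);
        }
        return Math.sqrt(squares / (values.length - 1));
    }

    public static void main(String[] args) {
        // Hypothetical accuracies measured on three runs.
        double[] accuracyPerRun = {0.80, 0.90, 1.00};
        System.out.printf("avg=%.3f std=%.3f%n", mean(accuracyPerRun), stdDev(accuracyPerRun));
    }
}
```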


rowsClassIdx

private java.util.ArrayList<java.lang.Integer>[] rowsClassIdx
Indexes of the rows, in rawDataset, which have the same output class. For instance, the first row collects the indexes of all the rows in rawDataset having output class c1, and so on.


startTime

private double startTime
time at which program is started


testingSet

private java.util.ArrayList<int[]> testingSet
Testing set, accessed by the classifier: numerical variables are discretized, while category names and classes are substituted by indexes; missing data denoted as -9999.


trainingSet

private java.util.ArrayList<int[]> trainingSet
Training set, accessed by the classifier: numerical variables are discretized, while category names and classes are substituted by indexes; missing data denoted as -9999.


unknownClasses

private boolean unknownClasses
Whether classes of the testing set are known or not


usedFeatures

private java.util.ArrayList<java.lang.Integer> usedFeatures
Variables used in the current experiment, hence excluding those discretized in a single bin. Indexes refer to rawDataset


usedFeaturesNames

private java.util.ArrayList<java.lang.String> usedFeaturesNames
Names of the variables used in the current experiment, hence excluding those discretized in a single bin.


validationMethod

private java.lang.String validationMethod
Set either to "CV" or to the name of the testing Arff file


workPath

private java.lang.String workPath
Path where the files for the given experiment (Arff files, NonMar.txt) are found, and where output files will be saved.

Constructor Detail

Jncc

Jncc(java.lang.String UserSuppliedWorkingPath,
     java.lang.String UserSuppliedArffName,
     java.lang.String UserSuppliedValidationName,
     int numArgs)
Initializes the necessary data members, scans the main Arff file and then instantiates the data members FeatureNames, NumFlags, CategoryNames and rawDataset.

Method Detail

checkArgs

private static void checkArgs(java.lang.String[] args)
Sanity-check of the parameters supplied by the user


deleteFileIfExisting

private void deleteFileIfExisting(java.lang.String file)

discretizeNumFeats

private void discretizeNumFeats(java.util.ArrayList<double[]> trainingData)
Discretizes all the numerical features on the Training Set, and instantiates DiscretizationIntervals, UsedFeatures, UsedFeaturesNames, NumClassForEachUsedFeature; updates DiscretizationLog.


drawCVindexes

private void drawCVindexes()
Draws stratified folds for cross-validation, instantiating CvFoldsIdx.
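
One common way to draw stratified folds is sketched below: rows are grouped by output class (as rowsClassIdx does), each group is shuffled, and rows are then dealt to the folds in round-robin order so that every fold receives a similar share of each class. This is an illustration only, not the actual body of drawCVindexes(); the class name, method signature and seed parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class StratifiedFoldsSketch {
    // Returns, for each row, the index of the fold it falls in.
    static int[] drawFolds(int[] classOfRow, int numClasses, int numFolds, long seed) {
        // Group the row indexes by output class.
        List<List<Integer>> rowsByClass = new ArrayList<>();
        for (int c = 0; c < numClasses; c++) {
            rowsByClass.add(new ArrayList<>());
        }
        for (int row = 0; row < classOfRow.length; row++) {
            rowsByClass.get(classOfRow[row]).add(row);
        }

        int[] foldOfRow = new int[classOfRow.length];
        Random rng = new Random(seed);
        int next = 0; // running counter, so that fold sizes stay balanced across classes
        for (List<Integer> rows : rowsByClass) {
            Collections.shuffle(rows, rng);
            for (int row : rows) {
                foldOfRow[row] = next % numFolds;
                next++;
            }
        }
        return foldOfRow;
    }

    public static void main(String[] args) {
        // Hypothetical data set: 6 rows of class 0 followed by 4 rows of class 1.
        int[] classOfRow = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1};
        System.out.println(Arrays.toString(drawFolds(classOfRow, 2, 2, 1L)));
    }
}
```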


findFeatName

private boolean findFeatName(java.lang.String tmpString)

findNonMarInCurrentDataset

private void findNonMarInCurrentDataset()
Prepares the NonMarInCurrentDataset data member.


getDiscretizationIdx

private int getDiscretizationIdx(java.lang.Double currentValue,
                                 int FeatureIdx)
Given a numerical value of a certain discretized feature, returns the index of the bin in which the value falls
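
The lookup can be sketched as follows, assuming that the row of discretizationIntervals for a feature holds its cut points in ascending order and that a value equal to a cut point belongs to the lower bin. Both conventions, as well as the class and method names, are assumptions made for illustration.

```java
import java.util.Arrays;

class BinLookupSketch {
    // cutPoints = {2.5, 7.0} defines the bins (-inf, 2.5], (2.5, 7.0], (7.0, +inf).
    static int binIndex(double value, double[] cutPoints) {
        int pos = Arrays.binarySearch(cutPoints, value);
        // Found: the value sits exactly on cut point 'pos' and goes to bin 'pos'.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point equals the number of cut points below the value.
        return pos >= 0 ? pos : -pos - 1;
    }

    public static void main(String[] args) {
        double[] cutPoints = {2.5, 7.0}; // hypothetical cut points
        for (double v : new double[] {1.0, 2.5, 3.0, 9.0}) {
            System.out.println(v + " -> bin " + binIndex(v, cutPoints));
        }
    }
}
```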


initResultsFiles

private void initResultsFiles(java.lang.String validationFile)
Initializes the files that store the predictions (which are only temporary) and the performance indicators; validationFile is the only available Arff file in case of CV, or the testing file in case of validation via testing file; it is not defined in case of unknown classes.


main

public static void main(java.lang.String[] args)
Arguments of the main: (1) the working path; (2) the name of the main Arff file; (3) "cv" or the name of the testing Arff file; (4) [OPTIONAL] "unknownClasses", in case the actual classes of the testing set are unknown.
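
The mapping from these arguments to the three validation methods might look like the sketch below. It is not the code of main() or checkArgs(): the enum, the method name, the case-insensitive comparison and the error handling are all assumptions, and the paths in the example are hypothetical.

```java
class ArgModeSketch {
    enum Mode { CROSS_VALIDATION, TESTING_FILE, TESTING_FILE_UNKNOWN_CLASSES }

    static Mode modeFor(String[] args) {
        if (args.length < 3 || args.length > 4) {
            throw new IllegalArgumentException("expected 3 or 4 arguments, got " + args.length);
        }
        // args[0] is the working path and args[1] the main Arff file name.
        if (args[2].equalsIgnoreCase("cv")) {
            return Mode.CROSS_VALIDATION;
        }
        // Otherwise args[2] names the testing Arff file.
        if (args.length == 4) {
            if (!args[3].equalsIgnoreCase("unknownClasses")) {
                throw new IllegalArgumentException("unexpected fourth argument: " + args[3]);
            }
            return Mode.TESTING_FILE_UNKNOWN_CLASSES;
        }
        return Mode.TESTING_FILE;
    }

    public static void main(String[] args) {
        // Hypothetical working path and file names.
        System.out.println(modeFor(new String[] {"/data/exp1", "train.arff", "cv"}));
        System.out.println(modeFor(new String[] {"/data/exp1", "train.arff", "test.arff"}));
        System.out.println(modeFor(new String[] {"/data/exp1", "train.arff", "test.arff", "unknownClasses"}));
    }
}
```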


parseArffFile

private void parseArffFile()
Scans the main Arff file. Initializes data members; then, scans the Arff file, checking the formal correctness of variable declarations and the coherence of the data with the declarations; stores the information and the data loaded from file. In particular, it instantiates the data members FeatureNames, NumFlags (whether each feature is numerical or not), CategoryNames (names of the categories of each categorical feature) and rawDataset (a matrix of double which contains the data as read from file, with missing values substituted by -9999, category names substituted by numerical indexes, and numerical values unchanged). Moreover, it reads the list of NonMar variables, which are then stored in NonMarFeatureNamesTraining and NonMarFeatureNamesTesting.
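
The encoding of a single data token into rawDataset can be sketched as below. The sketch relies on the ARFF convention that a missing value is written as "?"; the class name, the method signature and the choice of exception are hypothetical and do not reproduce the actual parser.

```java
import java.util.Arrays;
import java.util.List;

class RawEncodingSketch {
    static final double MISSING = -9999;

    // Encodes one token of an Arff data row as a double.
    static double encode(String token, boolean numerical, List<String> categories) {
        String t = token.trim();
        if (t.equals("?")) {
            return MISSING;                  // missing value marker
        }
        if (numerical) {
            return Double.parseDouble(t);    // numerical values are kept unchanged
        }
        int idx = categories.indexOf(t);     // category name -> numerical index
        if (idx < 0) {
            throw new IllegalArgumentException("category not declared: " + t);
        }
        return idx;
    }

    public static void main(String[] args) {
        // Hypothetical categorical feature declared as {sunny, overcast, rainy}.
        List<String> outlook = Arrays.asList("sunny", "overcast", "rainy");
        System.out.println(encode("rainy", false, outlook));
        System.out.println(encode("?", false, outlook));
        System.out.println(encode("71.5", true, null));
    }
}
```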


parseArffTestingFile

private void parseArffTestingFile(boolean UnknownClasses)
Parses the testing file, checking that all declarations are coherent with those already loaded from the training Arff file; if the classes are unknown, it reads only the instances, without looking for the classes. Data are stored in TestingSet: nominal features are simply stored, while numerical features are discretized using the bins available from DiscretizationIntervals.


parseNonMar

private void parseNonMar()
Reads the file NonMar.txt, containing the list of NonMar variables; if no file is found, all variables are assumed to be MAR. If the name of the variable is not preceded by any token, the feature is assumed to be NonMar on both the training and the testing set; if it is preceded by "training" ["testing"], then it is managed as NonMar on training [testing] only, and hence as MAR on testing [training].

Then, puts the names of the NonMar variables in TrainingNonMarFeatureNames and TestingNonMarFeatureNames.
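
The token rule can be sketched as follows. The sketch assumes one declaration per line with whitespace-separated tokens, which this page does not state; the class, field and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NonMarSketch {
    final List<String> nonMarTraining = new ArrayList<>();
    final List<String> nonMarTesting = new ArrayList<>();

    // Interprets one line of the NonMar list (assumed layout: "[training|testing] featureName").
    void addDeclaration(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return; // blank lines are ignored in this sketch
        }
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length == 1) {
            // No leading token: NonMar on both training and testing set.
            nonMarTraining.add(tokens[0]);
            nonMarTesting.add(tokens[0]);
        } else if (tokens.length == 2 && tokens[0].equalsIgnoreCase("training")) {
            nonMarTraining.add(tokens[1]); // MAR on testing
        } else if (tokens.length == 2 && tokens[0].equalsIgnoreCase("testing")) {
            nonMarTesting.add(tokens[1]);  // MAR on training
        } else {
            throw new IllegalArgumentException("cannot interpret line: " + line);
        }
    }

    public static void main(String[] args) {
        NonMarSketch sketch = new NonMarSketch();
        // Hypothetical feature names.
        sketch.addDeclaration("age");
        sketch.addDeclaration("training income");
        sketch.addDeclaration("testing weight");
        System.out.println("training: " + sketch.nonMarTraining);
        System.out.println("testing:  " + sketch.nonMarTesting);
    }
}
```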


prepareDataSetFromRawData

private void prepareDataSetFromRawData(java.util.ArrayList<double[]> SourceData,
                                       java.util.ArrayList<int[]> DestinationData)
Takes a raw set of data (undiscretized features) and puts it into a dataset to be accessed by the classifiers; categorical variables are copied unchanged, while numerical variables are converted to categorical according to DiscretizationIntervals; numerical variables discretized into a single bin (and hence listed in notUsedFeatures) are discarded.
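
A minimal sketch of the conversion of one row is given below. It assumes that each row of the cut-point matrix lists ascending cut points, that a value on a cut point falls into the lower bin, and that the missing-data marker -9999 is carried over unchanged; the handling of the class column is left out. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class PrepareRowSketch {
    static final int MISSING = -9999;

    // Converts one raw row into the integer row seen by the classifiers,
    // keeping only the features listed in usedFeatures.
    static int[] convertRow(double[] rawRow, List<Integer> usedFeatures,
                            boolean[] isNumerical, double[][] cutPoints) {
        int[] out = new int[usedFeatures.size()];
        for (int k = 0; k < out.length; k++) {
            int f = usedFeatures.get(k);
            double v = rawRow[f];
            if (v == MISSING) {
                out[k] = MISSING;            // missing data stay marked
            } else if (isNumerical[f]) {
                int bin = 0;                 // count the cut points below the value
                while (bin < cutPoints[f].length && v > cutPoints[f][bin]) {
                    bin++;
                }
                out[k] = bin;
            } else {
                out[k] = (int) v;            // categorical index copied unchanged
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical row with four features; feature 3 was discretized into a
        // single bin and is therefore absent from usedFeatures.
        double[] rawRow = {1.0, 5.3, -9999, 0.7};
        boolean[] isNumerical = {false, true, false, true};
        double[][] cutPoints = {{}, {2.0, 6.0}, {}, {}};
        List<Integer> usedFeatures = Arrays.asList(0, 1, 2);
        System.out.println(Arrays.toString(convertRow(rawRow, usedFeatures, isNumerical, cutPoints)));
    }
}
```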


prepareTrainTestSet

private void prepareTrainTestSet()
Prepares training and testing sets for validation via testing set, discretizing also numerical variables.


prepareTrainTestSet

private void prepareTrainTestSet(int currentFold)
Prepares training and testing sets for cross-validation, discretizing also numerical variables.


printArgError

private static void printArgError()

printElapsedTime

private void printElapsedTime()

printHelp

private static void printHelp()
Writes a help message to the user, specifying the syntax to be used with JNCC2.


saveTmpPredictions

private void saveTmpPredictions()
Dumps to file the predictions issued by the classifiers on the testing set(s); they will be later analyzed to compute the indicators, and eventually deleted. It produces a file which contains the features (apart from the ones discretized into a single bin, which can change between different runs of CV), the actual class, the NCC classification (i.e., a number of columns equal to the number of classes, each containing either a returned class or 6666, meaning that not all classes have been returned by NCC), and then the NBC and Bma predictions.
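
One possible reading of the NCC columns is sketched below: the returned classes are written first and the remaining columns are filled with 6666. The ordering of the columns and the left-aligned layout are assumptions, since this page only states the number of columns and the filler value; the class and method names are hypothetical.

```java
import java.util.Arrays;

class NccRowSketch {
    static final int FILLER = 6666;

    // Builds the fixed-width NCC part of a prediction row.
    static int[] encode(int[] returnedClasses, int numClasses) {
        if (returnedClasses.length > numClasses) {
            throw new IllegalArgumentException("more returned classes than classes");
        }
        int[] columns = new int[numClasses];
        Arrays.fill(columns, FILLER);
        // Copy the returned classes into the leading columns.
        System.arraycopy(returnedClasses, 0, columns, 0, returnedClasses.length);
        return columns;
    }

    public static void main(String[] args) {
        // Hypothetical case: NCC returns classes 0 and 2 out of three.
        System.out.println(Arrays.toString(encode(new int[] {0, 2}, 3)));
    }
}
```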


trainValidClassifiers

private void trainValidClassifiers()
Trains the classifiers on the training set, validates them on the testing set and saves the predictions to a temporary file


validateCV

private void validateCV(java.lang.String[] args)
Validates NBC and NCC via 10 runs of 10-fold cross-validation. Reports to file the relevant accuracy measures.


validateTFile

private void validateTFile(java.lang.String TestingFile)
Validates NBC and NCC via testing file. Reports to file the relevant accuracy measures.


validateTFileUnkClasses

private void validateTFileUnkClasses()
Learns NCC; classifies the instances of the testing file via NCC, and writes the classifications to file.


writePerfIndicators

private void writePerfIndicators()
Once the classifiers have been validated (either via CV or single testing file), saves to file all the relevant information


writePredictions

private void writePredictions()
Writes to file the instances, actual classes, probability distribution computed by NBC and non-dominated classes identified by NCC2. When the actual classes are unknown, this constitutes the only output of the classifier. Note that, because of the discretization, a different number of features might appear in different runs of CV.