NaN-Tb: A statistics toolbox ------------------------------------------------------------ Copyright (C) 2000-2005,2009,2010 Alois Schloegl FEATURES of the NaN-tb: ----------------------- - statistical toolbox - machine learning and classification toolbox - NaN's are treated as missing values - supports weightening of data - supports DIM argument - less round-off errors using extended double - less but more powerful functions (no nan-FUN needed) - supports unbiased estimation - fixes known bugs - compatible with Matlab and Octave - easy to use - The toolbox is tested with Octave 3.x and Matlab 7.x Currently are implemented: -------------------------- level 1: basic functions (not derived) SUMSKIPNAN SUM is a built-in function and cannot not be replaced, For this reason, a different name (than SUM) had to be chosen. SUMSKIPNAN is central, it implements skipping NaN's, the DIM-argument and returns the number of valid elements, too. COVM covariance estimation (several modes) Round-off errors avoided by using internally extended accuracy DECOVM decomposes the extended covarianced matrix into mean and cov XCOVF cross-correlation function NANFILTER filter function CONVSKIPNAN convolution CONV2SKIPNAN (CONV2NAN) 2-dimensional convolution FLAG_NANS_OCCURED returns 0 if no NaN's appeared in the input data of the last call to one of the following functions, and 1 otherwise: sumskipnan, covm, center, cor, coefficient of variation, corrcoef, geomean, harmmean, kurtosis, mad, mean, meandev, meansq, moment, nanmean, nanstd, nansum, rms, sem, skewness, statistic, std, var FLAG_IMPLICIT_SKIP_NAN can be used to turn off and on the NaN-skipping behaviour. This can be useful for debugging or for compatibility reasons. FLAG_ACCURACY_LEVEL can be used to increase the accuracy of summations (sumskipnan and covm) at the cost of speed. LOAD_FISHERIRIS loads famous fisher iris data set STR2ARRAY convert string to array - useful to extract numeric data from delimiter files XPTOPEN read and write SAS Transport Format (XPT); reads ARFF and STATA files level 2a: derived functions MEAN mean (options: arithmetic, geometric, harmonic) VAR variance STD standard deviation MEDIAN median (currently only for 2-dim matrices) SEM standard error of the mean (does not depend on distribution) TRIMMEAN trimmed mean medAbsDev median absolute deviation MEANSQ mean square RMS root mean square STATISTIC estimates various statistics at once MOMENT moment SKEWNESS skewness KURTOSIS excess * IQR interquartile range MAD mean absolute deviation * RANGE range (max-min) CENTER removes mean ZSCORE normalizes x to zero mean and variance 1 (z = (x-mean)/std) zScoreMedian non-parametric z-score, normalizes is to zero median and 1/(1.483*median absolute deviation) HARMMEAN harmonic mean GEOMEAN geometric mean NANTEST checks whether all functions have been replaced DETREND detrending of data with missing values and non-equidistant sampled data COR correlation matrix COV covariance matrix CORRCOEF correlation coefficient, including rank correlation, significance test and confidence intervals SPEARMAN, RANKCORR spearman's rank correlation coefficient. They might be replaced by CORRCOEF. PARTCORRCOEF partial correlation coefficient RANKS calculates ranks for non-parametric statistics TIEDRANK similar to RANKS, used for compatibility reasons QUANTILE q-th quantile PRCTILE,PERCENTILE p-th percentile TRIMEAN trimean ECDF empirical cumulative distribution function CDFPLOT plot empirical cumulative distribution function GSCATTER scatter plot of grouped data NORMPDF normal probability distribution NORMCDF normal cumulative distribution NORMINV inverse of the normal cumulative distribution TPDF student probability distribution TCDF student cumulative distribution TINV inverse of the student cumulative distribution NANSUM, NANSTD fixes for buggy versions included TTEST paired t-test TTEST2 (unpaired) t-test SIGNRANK wilcoxon's signed-rank test level 2b: classification, cross-validation TRAIN_SC train classifier TEST_SC test classifier CLASSIFY classify data (no cross validation) XVAL classify data with cross validation KAPPA performance evaluation TRAIN_LDA_SPARSE utility function FSS feature subset selection and feature ranking CAT2BIN converts categorial to binary data SVMTRAIN_MEX libSVM-training algorithm ROW_COL_DELETION heuristic to select rows and columns to remove missing values REFERENCE(S): ---------------------------------- [1] http://www.itl.nist.gov/ [2] http://mathworld.wolfram.com/ What is the difference to previous implementations? =================================================== 1) The default behavior of previous implementations is that NaNs in the input data results in NaNs in the output data. In many applications this behavior is not what you want. In this implementation, NaNs are handled as missing values and are skipped. 2) In previous implementations the workaround was using different functions like NANSUM, NANMEAN etc. In this toolbox, the same routines can be applied to data with and without NaNs. This enables more natural (better read- and understandable) applications. 3) SUMSKIPNAN is central to the other functions. It implements - the DIMENSION-argument, - handles NaNs as missing values or as exception signal (depending on a hidden FLAG), - and returns the number of valid elements (which are not NaNs) in the second output argument. (Note, NANSUM from Matlab does not support the DIM-argument, and NANSUM(NaN) gives NaN instead of 0); 4) [obsolete] 5) The DIMENSION argument is implemented in most routines. These should work in all Matlab and Octave versions. A workaround for a bug in Octave versions <=2.1.35 is implemented. Also several functions from Matlab have no support for the DIM argument (e.g. SKEWNESS, KURTOSIS, VAR) 6) Compatible to previous Octave implementation MEAN implements also the GEOMETRIC and HARMONIC mean. Handling of some special cases has been removed because its not necessary, anymore. MOMENT implements Mode 'ac' (absolute and/or central) moment as implemented in Octave. 7) Performance increase In most numerical applications, NaN's should be simply skipped. Therefore, it is efficient to skip NaN's in the default case. In case an explicit check for NaN's is necessary, implicit exception handling could be avoided. Eventually the overall performance could increase. 8) More readable code An explicit check for NaN's display the importance of this special case. Therefore, the application program might be more readable. 9) ZSCORE, MAD, HARMMEAN and GEOMEAN DIM-argument and skipping of NaN's implemented. None of these features is implemented in the Matlab versions. 10a) NANMEAN, NANVAR, NANMEDIAN These are not necessary anymore. They are implemented in SUMSKIPNAN, MEAN, VAR, STD and MEDIAN, respectively. 10b) NANSUM, NANSTD These functions are obsolete, too. However, previous implementations do not always provide the expected result. Therefore, a correct version is included for backward compatibility. 11) GPL license Permits to implement useful modifications. 12) NORMPDF, NORMCDF, NORMINV In the Matlab statistics toolbox V 3.0, NORMPDF, NORMCDF and NORMINV gave incorrect results for SIGMA=0; A similar problem was observed in Octave with NORMAL_INV, NORMAL_PDF, and NORMALCDF. The problem is fixed with this version. Furthermore, the check of the input arguments is implemented simpler and easier in this versions. 13) TPDF, TCDF, TINV In the Matlab statistics toolbox V3.0(12.1) and V4.0(13), TCDF and TINV do not handle NaNs correctly. TINV returns 0 instead of NaN, TCDF stops with an error message. In Stats-tb V2.2(R11) TINV has also the same problem. For these reasons, the NaN-tb is a bug fix. Furthermore, the check of the input arguments is implemented simpler. Overall, the code becomes cleaner and leaner. Q: WHY SKIPPING NaN's?: ------------------------ A: Usually, NaN means that the value is not available. This meaning is most common, even many different reasons might cause NaN's. In statistics, NaN's represent missing values, in biosignal processing such missing values might have been caused by some recording error. Other reasons for NaN's are, undetermined expressions like e.g. 0/0, inf-inf, data not available, unknown value, not a numeric value, etc. If NaN has the meaning of a missing value, it is only consequent to say, the sum of NaN's should be zero. Similar arguments hold for the other functions. The mean of X is undefined if and only if X contains no numbers. The implementation sum(X)/sum(~isnan(X)) gives 0/0=NaN, which is the desired result. The variance of X is undefined if and only if X contains less than 2 numbers. In most numerical applications, NaN's should be simply skipped. Therefore, it is efficient to skip NaN's in the default case. In the other cases, the NaN's can still be checked explicitly. This could eventually result in a more readable code and in improved performance, too. Q: What if I need to check for NaN's: ------------------------------------- A: You can always check whether there were some skipped NaN's in your data with the command FLAG_NANS_OCCURED(). m = mean(x); if flag_nans_occured() % do your error handling, e.g. error('there were NaN's in x, ignore m'); end; Its also easy to control the granularity of the checks flag_nans_occured(); % reset flag % do any statistical analysis you want if flag_nans_occured() % check, whether some NaN's occured. end; Installing the NaN-tb for Octave and Matlab: -------------------------------------------- a) Extract files and save them in /your/directory/structure/to/NaN/ b) Include the path with one of the following commands: addpath('/your/directory/structure/to/NaN/') path('/your/directory/structure/to/NaN/',path) Make sure the functions in the NaN-toolbox are found before the default functions. c) run NANINSTTEST This checks whether the installation was successful. d) Compile mex files: This is useful to improve speed, and is required if you used weighted samples. Check if precompiled binaries are provided. If your platform is not supported, compile the C-Mex-function SUMSKIPNAN_MEX.CPP using mex sumskipnan_mex.cpp mex covm_mex.cpp mex histo_mex.cpp Run NANINSTTEST again to check the stability of the compiled SUMSKIPNAN. e) [OPTIONAL] In case you want to use some other SVM classifiers (besides libSVM and LibLinear), you need to install additional toolboxes: OSU-SVM: https://sourceforge.net/projects/svm/ simpleSVM: https://sourceforge.net/projects/simplesvm/ $Id: README.TXT 7777 2010-09-27 08:55:37Z schloegl $ Copyright (C) 2000-2005,2009,2010 by Alois Schloegl WWW: http://biosig-consulting.com/matlab/NaN/ LICENSE: This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, see .