(crossvalidate)=

# `crossvalidate`

Internal testing can be performed with conformal or Venn-ABERS models. This program initially only supported *k*-fold cross-validation but now supports all type of testing strategies within CPSign (*k*-fold CV, leave-one-out CV and a single test-train split).

```{contents} Table of Contents
:backlinks: top
:depth: 3
```

## Usage manual

The full usage manual can be retrieved by running command:

```bash
> ./cpsign-[version]-fatjar.jar crossvalidate
```

## Example Usage

Example (ACP classification):

```bash
> ./cpsign-[version]-fatjar.jar crossvalidate \
   -pt 1 \
   -td sdf /path/to/datafile.sdf \
   -e "Ames test categorisation" \
   -l mutagen, nonmutagen \
   -cp 0.7,0.8,0.9 \ 
   -k 10


                          -= CPSign - CROSSVALIDATE =-

Validating arguments... [done]
Loading precomputed data set... 
Loaded precomputed data set with 123 records and 1930 features. Occurrences of 
each class: 'mutagen'=64, 'nonmutagen'=59.
Starting the validation using 10-fold cross-validation strategy... [done]


Overall statistics:
 - AverageC                       : 1.13
 - AverageC_SD                    : 0.14
 - Balanced Observed Fuzziness    : 0.175
 - Balanced Observed Fuzziness_SD : 0.0581
 - Observed Fuzziness             : 0.182
 - Observed Fuzziness_SD          : 0.0549
 - Unobserved Confidence          : 0.908
 - Unobserved Confidence_SD       : 0.0392
 - Unobserved Credibility         : 0.58
 - Unobserved Credibility_SD      : 0.0891
 - Balanced Accuracy              : 0.756
 - Balanced Accuracy_SD           : 0.121
 - Classifier Accuracy            : 0.741
 - Classifier Accuracy_SD         : 0.118
 - F1Score_weighted               : 0.742
 - F1Score_weighted_SD            : 0.119
 - F1Score_macro                  : 0.731
 - F1Score_macro_SD               : 0.122
 - F1Score_micro                  : 0.741
 - F1Score_micro_SD               : 0.118
 - NPV                            : 0.771
 - NPV_SD                         : 0.183
 - Precision                      : 0.721
 - Precision_SD                   : 0.179
 - ROC AUC                        : 0.826
 - ROC AUC_SD                     : 0.0967
 - Recall                         : 0.805
 - Recall_SD                      : 0.162

Calibration plot:
Confidence	Accuracy(nonmutagen)	Accuracy(nonmutagen)_SD	Accuracy	Accuracy_SD	Accuracy(mutagen)	Accuracy(mutagen)_SD	Proportion empty-label prediction sets	Proportion empty-label prediction sets_SD	Proportion multi-label prediction sets	Proportion multi-label prediction sets_SD	Proportion single-label prediction sets	Proportion single-label prediction sets_SD
0.7	0.751	0.175	0.662	0.167	0.602	0.243	0.144	0.138	0.0	0.0	0.856	0.138
0.8	0.838	0.121	0.789	0.0949	0.766	0.189	0.0154	0.0487	0.146	0.11	0.839	0.0979
0.9	0.921	0.108	0.869	0.0674	0.851	0.112	0.00769	0.0243	0.396	0.245	0.596	0.232
```

Note here that each metric output the mean out of the *k* fold as well as the standard deviation calculated across these folds. If a single test-train split is used only one statistic is given. In general we recommend to use either `TSV` or `CSV` out to make it easier to look at model calibration.


Example (AVAP classification):

```bash
> ./cpsign-[version]-fatjar.jar cv \
   -pt 5 \
   -td sdf /path/to/datafile.sdf \
   -e "Ames test categorisation" \
   -l mutagen, nonmutagen \
   -k 5


                          -= CPSign - CROSSVALIDATE =-

Validating arguments... [done]
Loading precomputed data set...
Loaded precomputed data set with 123 records and 1930 features. Occurrences of
each class: 'mutagen'=64, 'nonmutagen'=59.
Starting the validation using 5-fold cross-validation strategy... [done]

Overall statistics:
 - Classifier Accuracy    : 0.689
 - Classifier Accuracy_SD : 0.134
 - Balanced Accuracy      : 0.7
 - Balanced Accuracy_SD   : 0.141
 - Brier Score            : 0.427
 - Brier Score_SD         : 0.119
 - Log loss               : 0.62
 - Log loss_SD            : 0.138
 - F1Score_weighted       : 0.686
 - F1Score_weighted_SD    : 0.133
 - F1Score_macro          : 0.687
 - F1Score_macro_SD       : 0.133
 - F1Score_micro          : 0.689
 - F1Score_micro_SD       : 0.134
 - Precision              : 0.733
 - Precision_SD           : 0.225
 - Recall                 : 0.639
 - Recall_SD              : 0.18
 - NPV                    : 0.693
 - NPV_SD                 : 0.156
 - ROC AUC                : 0.754
 - ROC AUC_SD             : 0.177
 - Mean P0-P1 width       : 0.278
 - Mean P0-P1 width_SD    : 0.054
 - Median P0-P1 width     : 0.227
 - Median P0-P1 width_SD  : 0.0571

Calibration plot:
Expected probability	Num examples in bin	Observed frequency	Observed frequency_SD
0.15	19.0	0.29	0.415
0.25	30.0	0.336	0.237
0.35	15.0	0.481	0.0321
0.45	8.0	0.233	0.252
0.55	17.0	0.6	0.435
0.65	15.0	0.879	0.21
0.75	10.0	0.833	0.333
0.85	18.0	0.883	0.162

```

The VAP outputs a calibration curve to, that ideally should be a straight line with slope 1 and intersect 0. For this very small dataset the are too few examples to get a descent calibration curve. In case more/less points are desired on the calibration curve, set the desired points to the `--calibration-points` flag. For instance running with `--calibration-points 0.1:0.9:0.2` gave the following curve instead:

```bash
Calibration plot:
Expected probability	Num examples in bin	Observed frequency	Observed frequency_SD
0.1	19.0	0.29	0.415
0.3	45.0	0.354	0.203
0.5	24.0	0.577	0.441
0.7	25.0	0.75	0.319
0.9	18.0	0.883	0.162
```