(train)=

# `train`

The `train` program trains either Conformal or Venn-ABERS predictors from a precomputed data set generated by the {ref}`precompute` program. Some terminology is needed before diving deeper into the available parameters. In the CLI the underlying ML model is called the **scorer**: it produces a score (the prediction) that the nonconformity function takes as input to compute the nonconformity score. The same applies to the Venn-ABERS predictor, which applies isotonic regression to the score (the prediction from the ML model) when producing its predictions. In the same spirit, an error model is called an **error-scorer**: it produces another score (a prediction of the expected error of the scorer's prediction) that can be given to the nonconformity function (if applicable) in order to get normalized prediction widths for regression models. Using an error-scorer is sometimes referred to as normalized predictions or normalized nonconformity measures.

```{contents} Table of Contents
:backlinks: top
:depth: 3
```

## Usage manual

The full usage manual can be retrieved by running the command:

```bash
> ./cpsign-[version]-fatjar.jar train
```

Note that many parameters have default values, chosen to favor fast runtime so that the user does not start long-running jobs by mistake.

## Examples

### CCP classification

Here is an example that trains a Cross-Conformal Predictor (CCP) by selecting the "folded-stratified" sampling strategy. Using the {ref}`:-syntax` we set the *sub-parameter* `numSamples=8`; the available sub-parameters are listed by running `explain sampling`. We also use an RBF-kernel Support Vector Classifier (SVC) by setting `--scorer C_SVC`. This code snippet assumes there is a precomputed data set in `output/models/precomputed-clf.jar`. By giving the `--seed` parameter we use an explicit RNG seed in order to get reproducible results.

```bash
> ./cpsign-[version]-fatjar.jar train \
	--predictor-type acp_classification \
	--data-set output/models/precomputed-clf.jar \
	--scorer C_SVC \
	--model-out output/models/clf-trained.jar \
	--model-name "trained classifier model" \
	--seed 45612512 \
	--sampling-strategy folded-stratified:numSamples=8
```

### ACP regression

Here is an example of training an Aggregated Conformal Predictor (ACP) for a regression task. The difference from the CCP model in the example above is that sampling is done randomly instead of in a folded fashion (here using the `Random` splitting strategy). By setting the *sub-parameter* `numSamples=20` we choose to aggregate 20 ICPs, and setting `nCalib=100` fixes the number of training instances in the *calibration set* to 100 for each ICP model (all remaining training instances are placed in the *proper training set*). Again, we assume there is a precomputed data set in `output/models/precomputed-reg.jar`. In this example we use an **error-scorer** model to normalize the predictions based on the predicted difficulty of each object. By default the error scorer is of the same type and has identical hyper-parameters as the scorer model (`--scorer`), but in this case we use a linear-kernel SVR to save some computational time. The error scorer parameter only has an effect when the chosen nonconformity measure (`--ncm` parameter) uses such a normalizer model. We also specify that we wish to interpolate between the nonconformity values when calculating prediction intervals, using the `--pvalue-calc` parameter (see references [[10-11]](../references.md)).

```bash
> ./cpsign-[version]-fatjar.jar train \
	--predictor-type acp_regression \
	--data-set output/models/precomputed-reg.jar \
	--scorer EpsilonSVR \
	--error-scorer LinearSVR \
	--pvalue-calc linear-interpolation \
	--model-out output/models/reg-trained.jar \
	--sampling-strategy random:numSamples=20:nCalib=100
```
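
The effect of an error scorer can be illustrated with a normalized nonconformity measure of the kind commonly found in the conformal regression literature. This is only a sketch of the general idea, not necessarily the exact formula behind any particular `--ncm` option in CPSign, and all symbols are notation introduced here for illustration: {math}`\hat{y}(x)` is the scorer prediction, {math}`\hat{e}(x)` is the error-scorer prediction, {math}`\beta \ge 0` is a smoothing constant, and {math}`\alpha_{(\epsilon)}` is the calibration-set nonconformity score selected for significance level {math}`\epsilon`.

```{math}
\alpha_i = \frac{\lvert y_i - \hat{y}(x_i) \rvert}{\hat{e}(x_i) + \beta},
\qquad
\hat{y}(x) \pm \alpha_{(\epsilon)} \, \bigl(\hat{e}(x) + \beta\bigr)
```

Under this form, objects that the error scorer predicts to be difficult (large {math}`\hat{e}(x)`) receive wider prediction intervals, while easy objects receive narrower ones.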

### TCP classification

The Transductive Conformal Predictor (TCP) is only available for classification tasks and is *much more* computationally demanding than the inductive algorithm versions. The training step, however, is very quick as it only copies the data and sets the hyper-parameters that you give; actual model training is not performed until prediction time for TCP models (unless the `--percentiles` flag is given). In this example we choose a non-default *nonconformity measure* with `--ncm InverseProbability`, which also requires the **scorer** to be an ML model that can output probability scores. This is set using the `--scorer PlattScaledC_SVC` argument.

```bash
> ./cpsign-[version]-fatjar.jar train \
	--predictor-type tcp_classification \
	--data-set output/models/precomputed-clf.jar \
	--ncm InverseProbability \
	--scorer PlattScaledC_SVC \
	--model-out output/models/clf-tcp-trained.jar \
	--model-name "TCP classifier model"
```

### Single ICP models

If you wish to train an Inductive Conformal Predictor (ICP), i.e. a single partition into a *proper training set* and a *calibration set*, use either of the sampling strategies "Random" or "RandomStratified" with the *sub-parameter* `numSamples=1`, and set the `--predictor-type` parameter to either "ACP_classification" or "ACP_regression". In essence this performs an "aggregation" of a single model internally, which simply outputs the predictions of the single ICP itself.
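
A minimal sketch of such a single-ICP invocation, reusing the flags and the precomputed data set from the classification examples above. The helper function `icp_train_cmd` and the output file name are illustrative assumptions, and the snippet only prints the command (a dry run) so that it can be reviewed before anything is trained.

```bash
# Placeholder path: replace [version] with the release you have downloaded.
CPSIGN_JAR="./cpsign-[version]-fatjar.jar"

# Hypothetical helper that prints the train command for a single ICP,
# i.e. an "aggregate" of exactly one random split.
icp_train_cmd() {
	echo "$CPSIGN_JAR" train \
		--predictor-type acp_classification \
		--data-set output/models/precomputed-clf.jar \
		--scorer C_SVC \
		--sampling-strategy random:numSamples=1 \
		--model-out output/models/clf-icp-trained.jar
}

# Dry run: show the command that would be executed.
icp_train_cmd
```

Removing the `echo` inside the function turns the dry run into an actual training run.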

### CVAP classification

The Cross Venn-ABERS Predictor (CVAP) produces well-calibrated probability predictions for binary classification. Several of the available parameters do not apply to CVAP, namely `--ncm`, `--pvalue-calc` and `--error-scorer`. Here are some rudimentary arguments to train a CVAP model, again relying on a precomputed data set in `output/models/precomputed-clf.jar`. Note that according to the literature the sampling should be done in a folded fashion, but CPSign allows you to pick a random sampling strategy as well, which drops the "Cross" in CVAP and leads to the name "VAP_classification".

```bash
> ./cpsign-[version]-fatjar.jar train \
	--predictor-type VAP_classification \
	--data-set output/models/precomputed-clf.jar \
	--scorer C_SVC \
	--model-out output/models/cvap-trained.jar \
	--model-name "trained probability model" \
	--sampling-strategy folded-stratified
```


## Percentiles: important performance note

The optional parameter `--percentiles` controls whether, and from how many records, atom contributions are computed for the molecules specified by the `--percentiles-data` parameter. These atom contributions are only used if prediction images are generated at some point (see {ref}`Molecule Gradient<molecule-gradient>` for more details), where they serve to normalize the predicted gradients shown in the images. This step is **very time-consuming**, especially if you are training a TCP model. For this reason the default is to skip it completely, and it should only be enabled if you are going to generate images or look at feature importance.


## Nonconformity Measures

The available nonconformity functions can be retrieved using the `explain ncm` command, and more information about the ones supplied with CPSign can be found in the {ref}`Nonconformity measures <nonconf-measure>` section.
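
As a small sketch, the two `explain` topics mentioned on this page can be queried as shown below. It assumes that `explain` is invoked through the same fat jar as `train`, and it only prints the commands (a dry run).

```bash
# Placeholder path: replace [version] with the release you have downloaded.
CPSIGN_JAR="./cpsign-[version]-fatjar.jar"

# Dry run: print one `explain` invocation per topic mentioned on this page.
for topic in ncm sampling; do
	echo "$CPSIGN_JAR explain $topic"
done
```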


## Note on parallel model training 

CPSign does not support multithreading or parallel execution directly, partly due to third-party dependencies that are not thread safe. Instead, parallel training can be implemented by the user by first running the {ref}`precompute` step once, followed by several invocations of `train` made by different processes (in different JVMs). In this way an aggregated conformal prediction (ACP) model can be trained in parallel, with the different ICPs trained by separate processes and later aggregated into a complete ACP model by running either the {ref}`aggregate` or the {ref}`fast-aggregate` program. For most use cases this may not be worth the effort as training is usually fairly quick, but for large data sets or when a kernel-based SVM is used it might be worth trying out. When only training part of an ACP, the `--seed` parameter must be given (and be the same for all runs) and the explicit ICP indices must be given for each invocation of `train` (controlled by the `--splits` parameter).
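
The workflow described above can be sketched as a dry run that prints one `train` command per ICP. Several details are assumptions made for illustration and should be checked against the usage manual: that `--splits` accepts a single index per run, that indices start at 1, the choice of four ICPs, the helper function `train_cmd`, and the per-split output file names. The arguments of the `aggregate` and `fast-aggregate` programs are not covered on this page and are therefore left out.

```bash
# Placeholder path: replace [version] with the release you have downloaded.
CPSIGN_JAR="./cpsign-[version]-fatjar.jar"
SEED=45612512    # must be identical for every invocation
NUM_SAMPLES=4    # total number of ICPs in the final ACP model (illustrative)

# Hypothetical helper: print the train command for the ICP with index $1.
# The value format of --splits is an assumption.
train_cmd() {
	echo "$CPSIGN_JAR" train \
		--predictor-type acp_regression \
		--data-set output/models/precomputed-reg.jar \
		--scorer EpsilonSVR \
		--seed "$SEED" \
		--sampling-strategy "random:numSamples=$NUM_SAMPLES" \
		--splits "$1" \
		--model-out "output/models/reg-split-$1.jar"
}

# Dry run: print one command per ICP index.
split=1
while [ "$split" -le "$NUM_SAMPLES" ]; do
	train_cmd "$split"
	split=$((split + 1))
done
```

Each printed command can then be started as its own process, for example as a background job or a separate cluster job, before the partial models are combined.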