train

The train program trains either Conformal or Venn-ABERS predictors, based on a precomputed data set generated by the precompute program. Some terminology is needed before diving deeper into the available parameters. In the CLI the underlying ML model is called the scorer, i.e. the ML model produces a score (the prediction) that the nonconformity function takes as input to produce the nonconformity score. The same applies to the Venn-ABERS predictor, which acts upon the score (the prediction from the ML model) when producing its predictions using isotonic regression. In the same vein, an error model is called an error-scorer: it produces another score (a prediction of the expected error in the scorer prediction) that can be given to the nonconformity function (if applicable) in order to get normalized prediction widths for regression models. The use of an error-scorer is sometimes referred to as normalized predictions or normalized nonconformity measures.

Usage manual

The full usage manual can be retrieved by running the command:

> ./cpsign-[version]-fatjar.jar train

Note that many parameters have default values, chosen to favor fast runtime so that users do not start long-running jobs by mistake.

Examples

CCP classification

Here is an example that trains a Cross-Conformal Predictor (CCP) by selecting the “folded-stratified” sampling strategy. Using the :-syntax we set the sub-parameter numSamples=8; the available sub-parameters can be listed by running explain sampling (shown below). We also use an RBF-kernel Support Vector Classifier (SVC) by setting --scorer C_SVC. This code snippet assumes there is a precomputed data set in output/models/precomputed-clf.jar. By giving the --seed parameter we use an explicit RNG seed in order to get reproducible results.

> ./cpsign-[version]-fatjar.jar train \
	--predictor-type acp_classification \
	--data-set output/models/precomputed-clf.jar \
	--scorer C_SVC \
	--model-out output/models/clf-trained.jar \
	--model-name "trained classifier model" \ 
	--seed 45612512 \
	--sampling-strategy folded-stratified:numSamples=8
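
The available sampling strategies and their sub-parameters referenced above can be listed with the explain program:

> ./cpsign-[version]-fatjar.jar explain sampling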

ACP regression

Here is an example of training an Aggregated Conformal Predictor (ACP) for a regression task. The difference from the CCP model in the above example is that sampling is done randomly instead of in a folded fashion (here using the Random splitting strategy). By setting the sub-parameter numSamples=20 we choose to aggregate 20 ICPs, and setting nCalib=100 fixes the number of training instances in the calibration set to 100 for each ICP model (all remaining training instances are placed in the proper training set). Again, we assume there is a precomputed data set in output/models/precomputed-reg.jar. In this example we use an error-scorer model in order to normalize the predictions based on the predicted difficulty of each object. By default the error-scorer is of the same type and has identical hyper-parameters as the scorer model (--scorer), but here we use a linear-kernel SVR in order to save some computational time. The error-scorer parameter only has an effect in case a nonconformity measure (--ncm parameter) that uses such a normalizer model is chosen. We also specify that we wish to interpolate between the nonconformity values in the calculation of prediction intervals, using the --pvalue-calc parameter (see references [10-11]).

> ./cpsign-[version]-fatjar.jar train \
	--predictor-type acp_regression \
	--data-set output/models/precomputed-reg.jar \
	--scorer EpsilonSVR \
	--error-scorer LinearSVR \
	--pvalue-calc linear-interpolation \
	--model-out output/models/reg-trained.jar \
	--sampling-strategy random:numSamples=20:nCalib=100

TCP classification

The Transductive Conformal Predictor (TCP) is only available for classification tasks and is much more computationally demanding than the inductive algorithm versions. The training step, however, is very quick as it only copies the data and sets the hyper-parameters that you give; actual model training is not performed until prediction time for TCP models (unless the --percentiles flag is given). In this example we choose a non-default nonconformity measure with --ncm InverseProbability, which requires the scorer to be an ML model that can output probability scores; this is set using the --scorer PlattScaledC_SVC argument.

> ./cpsign-[version]-fatjar.jar train \
	--predictor-type tcp_classification \
	--data-set output/models/precomputed-clf.jar \
	--ncm InverseProbability \
	--scorer PlattScaledC_SVC \
	--model-out output/models/clf-tcp-trained.jar \
	--model-name "TCP classifier model" 

Single ICP models

If you wish to train an Inductive Conformal Predictor (ICP), i.e. a single partition into proper training set and calibration set, use either of the sampling strategies “Random” or “RandomStratified” with the sub-parameter numSamples=1, and set the --predictor-type parameter to either “ACP_classification” or “ACP_regression”. In essence this performs an “aggregation” of a single model internally, which simply outputs the predictions of the single ICP itself. A minimal example is given below.
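
Here is a minimal sketch of such a single-ICP run for classification, reusing the precomputed data set from the earlier examples; the output file name is illustrative:

> ./cpsign-[version]-fatjar.jar train \
	--predictor-type acp_classification \
	--data-set output/models/precomputed-clf.jar \
	--scorer C_SVC \
	--model-out output/models/icp-trained.jar \
	--sampling-strategy random:numSamples=1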

CVAP classification

The Cross Venn-ABERS Predictor (CVAP) produces well-calibrated probability predictions for binary classification. Many of the available parameters do not apply to CVAP, namely --ncm, --pvalue-calc and --error-scorer. Here are some rudimentary arguments to train a CVAP model, again relying on a precomputed data set in output/models/precomputed-clf.jar. Note that according to the literature the sampling should be done in a folded fashion, but CPSign also allows you to pick a random sampling strategy, thus dropping the “Cross” in CVAP; this is why the predictor type is named “VAP_classification”.

> ./cpsign-[version]-fatjar.jar train \
	--predictor-type VAP_classification \
	--data-set output/models/precomputed-clf.jar \
	--scorer C_SVC \
	--model-out output/models/cvap-trained.jar \
	--model-name "trained probability model" \ 
	--sampling-strategy folded-stratified

Percentiles: important performance note

The optional parameter --percentiles controls whether, and based on how many records, atom contributions should be computed; the records are taken from the data set specified by the --percentiles-data parameter. These atom contributions are only used in case prediction images are to be generated at some point (see Molecule Gradient for more details), in order to normalize the predicted gradients needed for the images. This step is very time-consuming, especially if you are training a TCP model. For this reason the default is to skip it completely, and it should only be used in case you are going to generate images or look at feature importance.
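
As a sketch, the earlier CCP training call could be extended as follows; the record count 200 and the file data/percentiles.smi are illustrative assumptions rather than values from this manual:

> ./cpsign-[version]-fatjar.jar train \
	--predictor-type acp_classification \
	--data-set output/models/precomputed-clf.jar \
	--scorer C_SVC \
	--model-out output/models/clf-trained.jar \
	--percentiles 200 \
	--percentiles-data data/percentiles.smi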

Nonconformity measures

The available nonconformity measures can be retrieved using the explain ncm command, and more information about the ones supplied with CPSign can be found in the Nonconformity measures section.
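
For example:

> ./cpsign-[version]-fatjar.jar explain ncm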

Note on parallel model training

CPSign does not support multithreading or parallel execution directly, partly due to third party dependencies that are not thread safe. Instead, parallel training can be implemented by the user by first running the precompute step once, followed by several invocations of train made by different processes (in different JVMs). In this way an Aggregated Conformal Predictor (ACP) can be trained in parallel, where the individual ICPs are trained concurrently and later aggregated into a complete ACP model by running either the aggregate or the fast-aggregate program. For most use cases this may not be worth the effort as the training time is usually fairly quick, but for large data sets, or in case a kernel based SVM is used, it might be worth trying out. When only training part of an ACP, the --seed parameter must be given (and be the same for all runs) and the explicit ICP indices must be given for each invocation of train (controlled by the --splits parameter). A sketch of this workflow is given below.
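
The following sketch trains the 20 ICPs of the earlier ACP regression example in two separate processes, 10 ICPs each. The comma-separated index format for --splits and the partitioning into indices 1-10 and 11-20 are assumptions for illustration; consult the usage manual for the exact syntax.

# process 1: train ICPs 1-10 (run in one JVM)
> ./cpsign-[version]-fatjar.jar train \
	--predictor-type acp_regression \
	--data-set output/models/precomputed-reg.jar \
	--scorer EpsilonSVR \
	--seed 45612512 \
	--sampling-strategy random:numSamples=20:nCalib=100 \
	--splits 1,2,3,4,5,6,7,8,9,10 \
	--model-out output/models/reg-part1.jar

# process 2: train ICPs 11-20 (run concurrently in another JVM,
# with the same --seed and --sampling-strategy as process 1)
> ./cpsign-[version]-fatjar.jar train \
	--predictor-type acp_regression \
	--data-set output/models/precomputed-reg.jar \
	--scorer EpsilonSVR \
	--seed 45612512 \
	--sampling-strategy random:numSamples=20:nCalib=100 \
	--splits 11,12,13,14,15,16,17,18,19,20 \
	--model-out output/models/reg-part2.jar

Afterwards the two partial models can be merged into a complete ACP model with the aggregate or fast-aggregate program.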