precompute

Precomputing data means parsing a dataset in a chemical file format and computing descriptors for all molecules. This is the first step in most pipelines and yields a precomputed data set, which serves as the input to (almost) all other programs. The output of precompute is a JAR file that includes the numerical representation of your input data (potentially altered by any data transformations) and all metadata needed in downstream tasks - such as which descriptors were used, the transformations needed for future predictions, textual labels, the property being modelled etc.

Usage manual

The full usage manual can be retrieved by running the command:

> ./cpsign-[version]-fatjar.jar precompute

Data transformations

The precompute program can optionally include data transformations, or for a clearer separation of tasks, these may be included in a separate step using the transform program. The required input to transform is a precomputed data set generated by the precompute program.

Example usage

Here is a simple example of running precompute, where we can walk through the parameters line by line:

  1. Invoke CPSign and specify that the precompute program should be run.

  2. Specify the data set (--train-data), the type (SDF) and the path to it (ames.sdf, should be in the same directory).

  3. Specify which property to use in future modeling steps ("Ames test categorisation").

  4. Specify the possible labels for the property; they can be either "mutagen" or "nonmutagen" - all records with a different value will be discarded.

  5. Specify the location in which the precomputed data set should be saved (--model-out <path>).

  6. Specify that the time needed to complete each step should be printed in the terminal.

> ./cpsign-[version]-fatjar.jar precompute \
	--train-data SDF ames.sdf \
	--property "Ames test categorisation" \
	--labels mutagen nonmutagen \
	--model-out /output/ames-precomputed.jar \
	--time

The output of running this will look something like the following:

                           -= CPSign - PRECOMPUTE =-

Validating arguments... [done]
(4 ms)
Reading train file and calculating descriptors...
Successfully parsed 123 molecules. Detected labels: 'mutagen'=64, 
'nonmutagen'=59. Generated 1930 new signatures. Skipped 3 molecule(s) due to 
parsing issues/Heavy Atom Count/ChemDescriptor calculation.
(614 ms)
Saving precomputed data set to file:
/output/ames-precomputed.jar ... [done]
(133 ms)
Program finished in 730 ms

“Exclusive” datasets

CPSign makes it possible to assign portions of data to always be included in model calibration or in training of the underlying scoring models (i.e. the calibration set and the proper training set). As this is considered non-standard usage, these parameters are hidden from the usage text of precompute, but information is available from the CLI using explain exclusive-data. The basic usage is to give either --model-data (mark data for the proper training set) or --calibration-data (mark data for the calibration set), supplying the file format and sub-arguments using the same syntax as for the --train-data parameter.
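As a sketch of what this could look like, building on the example above and assuming a separate calib.sdf file (a placeholder name, not from the original example) holding the records to reserve for the calibration set:

> ./cpsign-[version]-fatjar.jar precompute \
	--train-data SDF ames.sdf \
	--calibration-data SDF calib.sdf \
	--property "Ames test categorisation" \
	--labels mutagen nonmutagen \
	--model-out /output/ames-precomputed.jar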

For more information on why this can be useful, we refer to article [13] in the references.

Note: These parameters cannot be used with TCP, as all the data is used both for training the underlying model and for calibrating the predictions.

Picking descriptors

The default descriptor used in CPSign and precompute is the signatures descriptor [1-3], but there are several other descriptors available in CPSign. To list the available descriptors, run:

> ./cpsign-[version]-fatjar.jar explain descriptors

This feature is extensible, but the built-in descriptors are broadly:

  • Signatures: with configurable height(s), choice of standard or stereo-signatures and choice between count and bit version.

  • ECFP: with different diameters (0-6), configurable length as well as choice between count or bit version.

  • CDK physchem: A broad range of physico-chemical descriptors as implemented in CDK IMolecularDescriptor instances.

  • Supplied: User-supplied descriptors, which are loaded from input in the form of columns in CSV format or properties in SDFiles. These descriptors allow users to use their own descriptors from other software/measurements and make them available for modeling.

Custom descriptors

Custom descriptors can be entered in CPSign using the Supplied descriptor, which will look for properties given in the CSV or SDF input file(s). To exemplify this, let's assume that your CSV contains a header with the columns SMILES,target,feature1,feature2,feature3 - where target is the property that you wish to model and feature1 to feature3 are custom features. To specify this at the CLI you can use one of these arguments:

# Either explicitly name the ones you want to add
--descriptors Supplied:props=feature1,feature2,feature3
# Or by using all properties, except some explicit ones
--descriptors Supplied:props=all,-SMILES,-target

The first option is to explicitly list all properties that you wish to use. But if you have many custom features, it might be easier to instead include all properties (all) and then explicitly omit the properties that should not be included. Omission is performed by prepending a property/header name with a - character. In this case you should, for instance, remove the SMILES column and the column containing the values that you want to model. The --descriptors parameter can be given several times, so the first option above is also equivalent to the following three lines:
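To make the selection semantics concrete, here is a small Python sketch of the documented behaviour (an explicit list, or all with - exclusions, matched case-insensitively). This is only an illustration of the rules described above, not CPSign's actual implementation:

```python
def select_properties(spec, header):
    """Resolve a Supplied:props=... specification against a CSV header.

    spec   -- comma-separated property names; 'all' selects every column,
              and a leading '-' excludes a column (case-insensitive).
    header -- list of column names from the input file.
    """
    tokens = [t.strip() for t in spec.split(",")]
    # Map lower-cased names back to the original header spelling
    lower = {h.lower(): h for h in header}
    if any(t.lower() == "all" for t in tokens):
        selected = list(header)
        for t in tokens:
            if t.startswith("-"):
                name = lower.get(t[1:].lower())
                if name in selected:
                    selected.remove(name)
    else:
        selected = [lower[t.lower()] for t in tokens if t.lower() in lower]
    return selected

header = ["SMILES", "target", "feature1", "feature2", "feature3"]
# Both specifications resolve to the same three features:
print(select_properties("feature1,feature2,feature3", header))
print(select_properties("all,-SMILES,-target", header))
```

Both calls print ['feature1', 'feature2', 'feature3'], mirroring the equivalence of the two CLI variants shown above.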

--descriptors Supplied:props=feature1
--descriptors Supplied:props=feature2
--descriptors Supplied:props=feature3

If you want to use the default descriptor (Signatures) together with your custom descriptors, you will also have to explicitly specify --descriptors Signatures (otherwise your custom descriptors will override the default descriptor).
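Continuing the example above, combining the default descriptor with the custom features could then look like this:

# Keep the default Signatures descriptor and add the custom features
--descriptors Signatures --descriptors Supplied:props=feature1,feature2,feature3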

Note 1: Due to the syntax in the CLI, the header column names cannot include a , character, as the name would be split into two separate features.

Note 2: Header names and input to the --descriptors parameter are treated case-insensitively.

Note 3: Feature values must be numerical values, i.e., if your data contains categorical values such as low/medium/high these must be converted into numerical values such as 0/1/2 before reading it into CPSign.
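As a sketch of such a conversion (the column names and the low/medium/high mapping are purely illustrative), categorical values could be re-encoded in Python before the file is read by CPSign:

```python
import csv
import io

# Hypothetical input: a 'feature1' column with categorical values
raw = """SMILES,target,feature1
CCO,mutagen,low
CCN,nonmutagen,high
"""

MAPPING = {"low": 0, "medium": 1, "high": 2}

rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    # Replace the categorical value with its numeric encoding
    row["feature1"] = MAPPING[row["feature1"]]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["SMILES", "target", "feature1"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```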

Note 4: By default, CPSign allows missing values and will include a NaN (not a number) if e.g. a record does not have a valid entry for one of the columns. Thus, using custom features requires more care with regard to data sanitization, e.g. removing/imputing missing values and scaling of features.
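One common sanitization strategy is mean imputation, sketched here in plain Python on a small hypothetical feature matrix (this is just one way to handle missing values outside of CPSign):

```python
import math

# Hypothetical feature matrix with one missing value (NaN)
X = [
    [1.0, 2.0],
    [3.0, float("nan")],
    [5.0, 6.0],
]

# Mean-impute each column: replace NaN with the mean of the
# observed (non-NaN) values in that column.
n_cols = len(X[0])
for j in range(n_cols):
    observed = [row[j] for row in X if not math.isnan(row[j])]
    mean = sum(observed) / len(observed)
    for row in X:
        if math.isnan(row[j]):
            row[j] = mean

print(X)  # the NaN in the second column becomes (2.0 + 6.0) / 2 = 4.0
```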

Note 5: A CSV input file is required to include a SMILES column, even when exclusively using custom descriptors. If you want to perform modeling using custom descriptors only, without including any chemistry, you have to use the conf-ai module instead, which is only available as a Java API.