Input formats in CPSign¶
Raw - numerical data¶
CPSign loads and stores numerical data in the LIBSVM/LIBLINEAR file format:
<label> <index>:<value> <index>:<value> ..
<label> <index>:<value> <index>:<value> ..
..
Note that <index> must start at 1, not 0, to conform with LIBSVM/LIBLINEAR requirements.
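The 1-based index rule is easy to get wrong when exporting features from 0-based arrays. The following is a minimal Python sketch of writing and reading one record in this format. The function names `format_libsvm_line` and `parse_libsvm_line` are hypothetical helpers for illustration and are not part of CPSign.

```python
from typing import Dict, Tuple


def format_libsvm_line(label: float, features: Dict[int, float]) -> str:
    """Format one record as '<label> <index>:<value> ...' (hypothetical helper)."""
    if any(index < 1 for index in features):
        raise ValueError("feature indices must start at 1, not 0")
    # Write the pairs in ascending index order
    pairs = " ".join(f"{index}:{features[index]:g}" for index in sorted(features))
    return f"{label:g} {pairs}".rstrip()


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one '<label> <index>:<value> ...' record (hypothetical helper)."""
    label, *pairs = line.split()
    features: Dict[int, float] = {}
    for pair in pairs:
        index_text, value_text = pair.split(":", 1)
        index = int(index_text)
        if index < 1:
            raise ValueError("feature indices must start at 1, not 0")
        features[index] = float(value_text)
    return float(label), features
```

When converting from a 0-based feature vector, shifting every position by one (for example `{i + 1: v for i, v in enumerate(vector)}`) produces indices that satisfy the rule.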
Chemical data¶
CSV file format¶
CPSign reads CSV files flexibly: the delimiter and other parameters that differ between CSV dialects can be specified. Molecules must be encoded as SMILES strings. So that all fields can be located, the file must either contain an explicit header row or the user must supply the header through the CLI or the Java API. At least one header must contain the text “smiles” (case insensitive); the first header containing “smiles” is taken as the SMILES column. Example of a supported CSV file, using tab as the delimiter:
SMILES Sample_ID Activity Additional_Notes
OC(=O)\C=C/C(O)=O.C[C@]12CC=C3[C@@H](CCC4=CC(=O)C=C[C@]34C)[C@@H]1CC[C@@H]2C(=O)CN1CCN(CC1)C1=NC(=NC(=C1)N1CCCC1)N1CCCC1 NCGC00261900-01 POS Here's some additional information
[Na+].NC1=NC=NC2=C1N=C(Br)N2C1OC2CO[P@]([O-])(=O)O[C@@H]2C1O NCGC00260869-01 NEG More notes
O=C1N2CCC3=C(NC4=C3C=CC=C4)C2=NC2=C1C=CC=C2 NCGC00261776-01 NEG
Cl.FC1=CC=C(C=C1)C(OCCCC1=CNC=N1)C1=CC=C(F)C=C1 NCGC00261380-01 POS
CC1=CC=C(C=C1)S(=O)(=O)N[C@@H](CC1=CC=CC=C1)C(=O)CCl NCGC00261842-01 NEG Not all lines need to contain the additional notes
...
The column header for SMILES can contain more text, e.g. “canonical smiles” and “smiles-col” are both valid header names.
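The header rule (first header containing “smiles”, ignoring case) can be sketched in a few lines of Python. This only illustrates the rule as described here and is not CPSign's own implementation; `find_smiles_column` and `read_smiles` are hypothetical names.

```python
import csv
import io
from typing import List


def find_smiles_column(header: List[str]) -> int:
    """Return the position of the first header containing 'smiles' (case insensitive)."""
    for position, name in enumerate(header):
        if "smiles" in name.lower():
            return position
    raise ValueError("no header contains the text 'smiles'")


def read_smiles(text: str, delimiter: str = "\t") -> List[str]:
    """Collect the SMILES column from delimited text whose first row is the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    column = find_smiles_column(header)
    # Skip empty rows; shorter rows are fine as long as they reach the SMILES column
    return [row[column] for row in reader if row]
```

With the header `Sample_ID`, `canonical smiles`, `smiles-col`, the sketch selects the second column, because it is the first one whose name contains “smiles”.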
SDF¶
SD files (SDF) are supported in both the V2000 and V3000 formats.
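To check which version a given record uses, the version tag can be read from the counts line, which is the fourth line of a mol block and ends with `V2000` or `V3000`. The Python sketch below handles a single mol block, not a full multi-record SD file, and `molfile_version` is a hypothetical helper that is not part of CPSign.

```python
def molfile_version(mol_block: str) -> str:
    """Return 'V2000' or 'V3000' for a single mol block (hypothetical helper)."""
    lines = mol_block.splitlines()
    if len(lines) < 4:
        raise ValueError("a mol block needs three header lines and a counts line")
    # The counts line is the fourth line; its last token is the version tag
    tokens = lines[3].split()
    if tokens and tokens[-1] in ("V2000", "V3000"):
        return tokens[-1]
    raise ValueError("no V2000/V3000 tag found on the counts line")
```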
SMILES as single molecule¶
The predict and predict-online commands can predict single molecules using the --smiles parameter. The parameter takes a string that must start with a valid SMILES, optionally followed by a whitespace character (tab or space) and an identifier.
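The layout of the --smiles value (SMILES first, then optional whitespace and an identifier) can be illustrated with a short Python sketch. It only splits the string and does not check that the SMILES is chemically valid; `split_smiles_argument` is a hypothetical name, and CPSign's own parsing may differ in details.

```python
from typing import Optional, Tuple


def split_smiles_argument(argument: str) -> Tuple[str, Optional[str]]:
    """Split '<SMILES>[ <identifier>]' at the first run of whitespace."""
    parts = argument.strip().split(None, 1)
    if not parts:
        raise ValueError("the argument must start with a SMILES string")
    smiles = parts[0]
    # Everything after the first whitespace run is treated as the identifier
    identifier = parts[1].strip() if len(parts) > 1 else None
    return smiles, identifier
```

On the command line the whole value has to be quoted when it contains a space or tab, so that the shell passes it as a single argument.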
JSON file format¶
CPSign supports a JSON input format. The top level must be a JSON array, meaning that the first character must be an opening square bracket “[”. Each element of the array is one record, and each record must include a key-value pair holding the SMILES of the molecule. The key of this pair must be “SMILES”, “smiles” or “Smiles”. Here are some examples of the file format (the file does not need to be indented).
Example classification JSON file:
[
{
"cdk:Title" : "1728-95-6",
"Ames test categorisation" : "mutagen",
"smiles" : "C1(=C(C=2C=CC=CC2)N=C(N1)C3=CC=C(OC)C=C3)C=4C=CC=CC4"
},
{
"cdk:Title" : "91-08-7",
"Ames test categorisation" : "mutagen",
"smiles" : "C=1(C(=C(C=CC1)N=C=O)C)N=C=O"
}
]
Example regression JSON file:
[
{
"BIO" : "0.43",
"comment" : "This is a comment",
"smiles" : "SC1=C(C(F)(F)F)C=CC=C1"
},
{
"BIO" : "1.60",
"comment" : "Comment for second molecule",
"smiles" : "SC1=C(C(F)(F)F)C=C([N+]([O-])=O)C=C1"
}
]
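Before handing a JSON file to CPSign, the two structural requirements (a top-level array, and a SMILES key in every record) can be checked with a small script. The Python sketch below is one way to do this under the rules described in this section; `load_records` is a hypothetical helper, not a CPSign function.

```python
import json
from typing import Any, Dict, List, Tuple

# The three accepted spellings of the SMILES key
SMILES_KEYS = ("SMILES", "smiles", "Smiles")


def load_records(text: str) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (smiles, other_properties) for every record in a JSON array."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("the top level must be a JSON array")
    records = []
    for position, record in enumerate(data):
        if not isinstance(record, dict):
            raise ValueError(f"record {position} is not a JSON object")
        key = next((k for k in SMILES_KEYS if k in record), None)
        if key is None:
            raise ValueError(f"record {position} has no SMILES key")
        # Keep the remaining key-value pairs (labels, comments, identifiers)
        properties = {k: v for k, v in record.items() if k != key}
        records.append((record[key], properties))
    return records
```

Note that in the regression example above the value of "BIO" is a JSON string, so the sketch returns it as text and leaves any numeric conversion to the caller.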
Compressed input files¶
CPSign automatically reads files compressed in GZIP format.
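Scripts that prepare or inspect input files alongside CPSign may want the same convenience. A minimal Python sketch follows, assuming detection by the GZIP magic bytes (0x1f 0x8b) at the start of the file; how CPSign itself detects compression is not stated here, and `open_text` is a hypothetical helper.

```python
import gzip
from typing import TextIO

# Every GZIP stream starts with these two bytes
GZIP_MAGIC = b"\x1f\x8b"


def open_text(path: str, encoding: str = "utf-8") -> TextIO:
    """Open a plain or GZIP-compressed file for reading text."""
    with open(path, "rb") as raw:
        magic = raw.read(2)
    if magic == GZIP_MAGIC:
        # Decompress transparently while reading
        return gzip.open(path, "rt", encoding=encoding)
    return open(path, "r", encoding=encoding)
```

Checking the content rather than the file extension means that a compressed file without a `.gz` suffix is still read correctly.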