Input formats in CPSign

Raw - numerical data

CPSign loads and stores numerical data in the LIBSVM/LIBLINEAR file format:

<label> <index>:<value> <index>:<value> ..
<label> <index>:<value> <index>:<value> ..
..

Note that <index> values must start at 1, not 0, to conform with LIBSVM/LIBLINEAR requirements.
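The line layout and the 1-based index rule can be illustrated with a short Python sketch. This is not CPSign code; the function names `to_libsvm_line` and `parse_libsvm_line` are hypothetical, and skipping zero-valued features is a choice made here for the sparse layout rather than something CPSign is stated to require.

```python
def to_libsvm_line(label, features):
    """Format one record as '<label> <index>:<value> ...' with 1-based indices.

    `features` is a dense list of numbers. Zero values are skipped
    (an assumption of this sketch, common for sparse files).
    """
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


def parse_libsvm_line(line):
    """Parse one line into (label, {index: value}), rejecting indices below 1."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index_text, value_text = pair.split(":", 1)
        index = int(index_text)
        if index < 1:
            raise ValueError(f"index must start at 1, got {index}")
        features[index] = float(value_text)
    return float(label), features


print(to_libsvm_line(1, [0.5, 0, 2]))  # prints: 1 1:0.5 3:2
```

A line such as `0 0:1.0` would be rejected by `parse_libsvm_line`, because its first index is 0.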

Chemical data

CSV file format

CPSign supports CSV files in a fairly flexible manner, allowing the user to specify the separator character and other parameters that might differ between formats. Molecules should be encoded as SMILES strings. The CSV must either contain an explicit header row, or the user must specify the header from the CLI or the Java API, so that all fields can be located. At least one header must contain the text “smiles” (case insensitive); the first header containing “smiles” is taken as the SMILES column. Example of a supported CSV file, using tab as delimiter:

SMILES   Sample_ID   Activity Additional_Notes
OC(=O)\C=C/C(O)=O.C[C@]12CC=C3[C@@H](CCC4=CC(=O)C=C[C@]34C)[C@@H]1CC[C@@H]2C(=O)CN1CCN(CC1)C1=NC(=NC(=C1)N1CCCC1)N1CCCC1   NCGC00261900-01   POS   Here's some additional information
[Na+].NC1=NC=NC2=C1N=C(Br)N2C1OC2CO[P@]([O-])(=O)O[C@@H]2C1O   NCGC00260869-01   NEG   More notes
O=C1N2CCC3=C(NC4=C3C=CC=C4)C2=NC2=C1C=CC=C2  NCGC00261776-01   NEG
Cl.FC1=CC=C(C=C1)C(OCCCC1=CNC=N1)C1=CC=C(F)C=C1 NCGC00261380-01   POS
CC1=CC=C(C=C1)S(=O)(=O)N[C@@H](CC1=CC=CC=C1)C(=O)CCl  NCGC00261842-01   NEG   Not all lines need to contain the additional notes
...

The SMILES column header can contain additional text; for example, “canonical smiles” and “smiles-col” are both valid header names.
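The header rule above (the first header containing “smiles”, ignoring case, is the SMILES column) can be sketched in Python as follows. This only illustrates the rule and is not how CPSign itself is implemented; `find_smiles_column` and `read_records` are hypothetical names, and the sample data is made up for the example.

```python
import csv
import io


def find_smiles_column(header):
    """Return the position of the first header containing 'smiles' (case insensitive)."""
    for position, name in enumerate(header):
        if "smiles" in name.lower():
            return position
    raise ValueError("no header containing 'smiles' found")


def read_records(text, delimiter="\t"):
    """Read delimited text and return a list of (smiles, fields) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    smiles_col = find_smiles_column(header)
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # zip stops at the shorter sequence, so rows without the
        # trailing optional columns simply lack those fields
        fields = dict(zip(header, row))
        records.append((row[smiles_col], fields))
    return records


sample = "Sample_ID\tcanonical smiles\tActivity\tNotes\nid-1\tCCO\tPOS\tsome note\nid-2\tc1ccccc1\tNEG\n"
for smiles, fields in read_records(sample):
    print(smiles, fields["Activity"])
```

In the sample, the second row omits the notes column, mirroring the rows without additional notes in the example file above.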

SDF

SDF files are supported in both the V2000 and V3000 formats.

SMILES as single molecule

The predict and predict-online commands can predict single molecules using the --smiles parameter. This parameter takes a text string that must start with a valid SMILES, optionally followed by a whitespace character (tab or space) and an identifier.
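The shape of the --smiles argument can be illustrated with a small Python sketch that splits the text at the first whitespace. This is an illustration only, not CPSign's parser: `split_smiles_argument` is a hypothetical name, the SMILES is not validated, and treating everything after the first whitespace as the identifier is an assumption.

```python
def split_smiles_argument(text):
    """Split '<SMILES>[<whitespace><identifier>]' into (smiles, identifier).

    The identifier is None when only a SMILES is given.
    No chemical validation of the SMILES is attempted here.
    """
    parts = text.strip().split(None, 1)  # split once, on any run of whitespace
    if not parts:
        raise ValueError("expected a SMILES string")
    smiles = parts[0]
    identifier = parts[1] if len(parts) > 1 else None
    return smiles, identifier


print(split_smiles_argument("CCO\tmol-1"))  # prints: ('CCO', 'mol-1')
print(split_smiles_argument("c1ccccc1"))    # prints: ('c1ccccc1', None)
```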

JSON file format

CPSign supports a JSON input format. The top level must be a JSON array (meaning that the first character must be a square bracket “[”). Each element of the array is one record, and each record must include a key-value pair holding the SMILES of the molecule. The key of this pair must be “SMILES”, “smiles” or “Smiles”. Here are some examples of the file format (the file does not need to be properly indented).

Example classification JSON file:

[
   {
      "cdk:Title" : "1728-95-6",
      "Ames test categorisation" : "mutagen",
      "smiles" : "C1(=C(C=2C=CC=CC2)N=C(N1)C3=CC=C(OC)C=C3)C=4C=CC=CC4"
   },

   {
      "cdk:Title" : "91-08-7",
      "Ames test categorisation" : "mutagen",
      "smiles" : "C=1(C(=C(C=CC1)N=C=O)C)N=C=O"
   }
]

Example regression JSON file:

[
   {
      "BIO" : "0.43",
      "comment" : "This is a comment",
      "smiles" : "SC1=C(C(F)(F)F)C=CC=C1"
   },

   {
      "BIO" : "1.60",
      "comment" : "Comment for second molecule",
      "smiles" : "SC1=C(C(F)(F)F)C=C([N+]([O-])=O)C=C1"
   }
]
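The structural requirements for JSON input (a top-level array, and a “SMILES”, “smiles” or “Smiles” key in every record) can be checked with a short Python sketch before handing a file to CPSign. The function `load_records` is a hypothetical helper written for this illustration, not part of CPSign, and the order in which the three keys are tried is an assumption.

```python
import json

# The three key spellings named in the text above
SMILES_KEYS = ("SMILES", "smiles", "Smiles")


def load_records(text):
    """Parse JSON text and return a list of (smiles, record) tuples.

    Raises ValueError if the top level is not an array or if a
    record lacks one of the accepted SMILES keys.
    """
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("top level must be a JSON array")
    records = []
    for position, record in enumerate(data):
        if not isinstance(record, dict):
            raise ValueError(f"record {position} is not a JSON object")
        key = next((k for k in SMILES_KEYS if k in record), None)
        if key is None:
            raise ValueError(f"record {position} has no SMILES key")
        records.append((record[key], record))
    return records


example = '[{"BIO": "0.43", "smiles": "SC1=C(C(F)(F)F)C=CC=C1"}]'
for smiles, record in load_records(example):
    print(smiles, record["BIO"])
```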

Compressed input files

CPSign automatically reads files compressed in GZIP format.
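For reference, a GZIP-compressed input file can be produced and read back with Python's standard `gzip` module, as sketched below. How CPSign detects compression is not stated here; checking the two GZIP magic bytes (0x1f 0x8b) is an assumption of this sketch, and both function names are hypothetical.

```python
import gzip


def write_gzipped(path, text):
    """Write text to a GZIP-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(text)


def read_maybe_gzipped(path):
    """Read a text file, transparently decompressing it if it is GZIP data.

    Detection uses the two-byte GZIP magic number at the start of the file.
    """
    with open(path, "rb") as raw:
        magic = raw.read(2)
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return handle.read()
```

For example, `write_gzipped("data.csv.gz", text)` followed by `read_maybe_gzipped("data.csv.gz")` returns the original text, and the same reader also works on an uncompressed file.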
