Feature Representation Methods in ChemML

To build a machine learning model, raw chemical data must first be converted into a numerical representation that encodes the spatial or topological information defining a molecule. The resulting features are either continuous (molecular descriptors) or discrete (molecular fingerprints).

[1]:
from chemml.chem import Molecule
from chemml.datasets import load_organic_density
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Creating chemml.chem.Molecule objects from molecule SMILES

All feature representation methods available in ChemML require chemml.chem.Molecule objects as input.

[2]:
# Load an existing dataset bundled with ChemML
molecules, target, dragon_subset = load_organic_density()

# Build one Molecule object per SMILES string
mol_objs_list = []
for smi in molecules['smiles']:
    mol = Molecule(smi, 'smiles')
    mol.hydrogens('add')  # add explicit hydrogens
    # generate xyz coordinates using the MMFF94s force field
    mol.to_xyz('MMFF', maxIters=10000, mmffVariant='MMFF94s')
    mol_objs_list.append(mol)

Coulomb Matrix

A simple molecular descriptor that mimics the electrostatic interaction between nuclei.

[3]:
from chemml.chem import CoulombMatrix

# Coulomb matrix types: sorted (SC), unsorted (UM), unsorted triangular (UT), eigenspectrum (E), or random (RC)
CM = CoulombMatrix(cm_type='SC', n_jobs=-1)

features = CM.represent(mol_objs_list)
print(features[:5])
featurizing molecules in batches of 31 ...
500/500 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12s 24ms/step
Merging batch features ...    [DONE]
         0          1           2          3          4           5     \
0  388.023441  67.563795  388.023441  46.773304  71.039377  388.023441
1   73.516695  12.680660   73.516695  13.507005  15.308384   53.358707
2  388.023441  10.343694   73.516695  40.719814   5.612250   53.358707
3  388.023441  72.013552  388.023441  49.045255  31.222555   73.516695
4  388.023441  34.076060  388.023441  20.740383  20.314923   73.516695

        6          7          8          9     ...  1643  1644  1645  1646  \
0  43.471164  31.884828  23.619673  53.358707  ...   0.0   0.0   0.0   0.0
1  15.511761  10.387486   7.267909  53.358707  ...   0.0   0.0   0.0   0.0
2  22.032304   7.173553  20.331941  53.358707  ...   0.0   0.0   0.0   0.0
3  26.287638  24.264785  15.451307  73.516695  ...   0.0   0.0   0.0   0.0
4  21.673878  43.473700  12.535104  53.358707  ...   0.0   0.0   0.0   0.0

   1647  1648  1649  1650  1651  1652
0   0.0   0.0   0.0   0.0   0.0   0.0
1   0.0   0.0   0.0   0.0   0.0   0.0
2   0.0   0.0   0.0   0.0   0.0   0.0
3   0.0   0.0   0.0   0.0   0.0   0.0
4   0.0   0.0   0.0   0.0   0.0   0.0

[5 rows x 1653 columns]
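
The diagonal values printed above (388.02, 73.52, 53.36) agree with 0.5 * Z**2.4 for sulfur, oxygen, and nitrogen, and 1653 equals 57 * 58 / 2, the size of one triangle of a 57 x 57 matrix. The following is a minimal sketch of the textbook sorted Coulomb matrix that is consistent with those observations; it is not ChemML's implementation. The helper names, the water-like geometry, and the use of angstrom units are assumptions made for illustration.

```python
import math


def coulomb_matrix(charges, coords):
    """Textbook Coulomb matrix: 0.5 * Z**2.4 on the diagonal, Zi*Zj/Rij elsewhere."""
    n = len(charges)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cm[i][j] = 0.5 * charges[i] ** 2.4
            else:
                cm[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return cm


def sort_by_row_norm(cm):
    """Reorder rows and columns together by descending row norm."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in cm]
    order = sorted(range(len(cm)), key=lambda i: norms[i], reverse=True)
    return [[cm[i][j] for j in order] for i in order]


def lower_triangle(cm, max_atoms):
    """Flatten the lower triangle row by row, zero-padding up to max_atoms."""
    n = len(cm)
    flat = []
    for i in range(max_atoms):
        for j in range(i + 1):
            flat.append(cm[i][j] if i < n and j < n else 0.0)
    return flat


# Illustrative water-like geometry (approximate, assumed angstrom); not from the dataset
charges = [1, 8, 1]
coords = [(0.757, 0.586, 0.0), (0.0, 0.0, 0.0), (-0.757, 0.586, 0.0)]

cm = sort_by_row_norm(coulomb_matrix(charges, coords))
vector = lower_triangle(cm, max_atoms=4)
print(len(vector))
print(vector)
```

Sorting by row norm makes the vector independent of the atom order in the input, and zero-padding gives every molecule the same feature length.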

Fingerprints from RDKit

Molecular fingerprints are a way of encoding the structure of a molecule. The most common type of fingerprint is a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints allows you to determine the similarity between two molecules, to find matches to a query substructure, etc.

[4]:
from chemml.chem import RDKitFingerprint

# RDKit fingerprint types: 'morgan', 'hashed_topological_torsion' or 'htt', 'MACCS' or 'maccs', 'hashed_atom_pair' or 'hap'
morgan_fp = RDKitFingerprint(fingerprint_type='morgan', vector='bit', n_bits=1024, radius=3)
features = morgan_fp.represent(mol_objs_list)
print(features[:5])
   0     1     2     3     4     5     6     7     8     9     ...  1014  \
0     0     0     0     0     0     0     1     0     0     0  ...     0
1     0     0     0     0     0     0     0     0     0     0  ...     0
2     0     0     0     0     0     0     0     0     0     0  ...     0
3     0     0     0     0     0     0     1     0     0     0  ...     0
4     0     0     0     1     0     0     0     0     1     0  ...     0

   1015  1016  1017  1018  1019  1020  1021  1022  1023
0     0     0     0     0     1     0     0     0     0
1     0     0     0     0     0     0     0     0     0
2     0     0     0     0     0     0     0     0     0
3     0     0     0     0     0     0     0     0     0
4     0     0     0     0     1     0     1     0     0

[5 rows x 1024 columns]
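
To illustrate how bit fingerprints are compared, here is a minimal sketch of the Tanimoto (Jaccard) coefficient, one widely used similarity measure for binary fingerprints. The function name and the two short toy bit vectors are assumptions for illustration; they are not rows from the output above.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: bits set in both vectors / bits set in either vector."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Returning 0.0 for two all-zero vectors is a choice made in this sketch
    return both / either if either else 0.0


# Toy 8-bit fingerprints (hypothetical values)
fp_a = [0, 0, 0, 1, 0, 0, 1, 0]
fp_b = [0, 0, 0, 1, 0, 0, 0, 1]

similarity = tanimoto(fp_a, fp_b)
print(similarity)
```

The same function could be applied to two rows of a bit-vector feature table after converting each row to a list of integers.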

Molecule tensors from chemml.chem.Molecule objects

Molecule tensors can be used to create neural graph fingerprints using chemml.models.

[5]:
from chemml.chem import tensorise_molecules
atoms, bonds, edges = tensorise_molecules(
    molecules=mol_objs_list, max_degree=5, max_atoms=None,
    n_jobs=-1, batch_size=100, verbose=True,
)
Tensorising molecules in batches of 100 ...
500/500 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7s 13ms/step
Merging batch tensors ...    [DONE]

[6]:
print("Matrix for atom features (num_molecules, max_atoms, num_atom_features):\n", atoms.shape)
print("Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):\n", edges.shape)
print("Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):\n", bonds.shape)
Matrix for atom features (num_molecules, max_atoms, num_atom_features):
 (500, 57, 62)
Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):
 (500, 57, 5)
Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):
 (500, 57, 5, 6)
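
To make the (max_atoms, max_degree) layout of the connectivity matrix concrete, the sketch below builds such a matrix for a single molecule from a list of bonds. It assumes that unused neighbor slots are padded with -1, a common convention in neural fingerprint code; the padding value ChemML uses is not shown in this document. The function name and the three-atom example are hypothetical.

```python
def build_edge_matrix(bond_pairs, max_atoms, max_degree, pad=-1):
    """Return a max_atoms x max_degree table of neighbor indices, padded with `pad`."""
    edges = [[pad] * max_degree for _ in range(max_atoms)]
    degree = [0] * max_atoms
    for a, b in bond_pairs:
        # Each bond is stored twice, once from each end
        for src, dst in ((a, b), (b, a)):
            if degree[src] >= max_degree:
                raise ValueError(f"atom {src} exceeds max_degree={max_degree}")
            edges[src][degree[src]] = dst
            degree[src] += 1
    return edges


# Heavy-atom skeleton C-C-O (atoms 0, 1, 2), padded to 4 atoms
edges = build_edge_matrix([(0, 1), (1, 2)], max_atoms=4, max_degree=5)
for row in edges:
    print(row)
```

Row 3 belongs to a padding atom and contains only the pad value, which is how molecules of different sizes can share one fixed tensor shape.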

Descriptors from RDKit

RDKit provides a comprehensive set of molecular descriptors covering topological, geometrical, electronic, and constitutional properties. The calculation is efficient for large datasets, and either specific descriptors or the full set can be selected when the RDKDesc class is instantiated. The results integrate with other RDKit functions and Python workflows.

[7]:
from chemml.chem import RDKDesc

rdd = RDKDesc()
features = rdd.represent(mol_objs_list)
print(features[:5])
Calculating RDKit descriptors: 100%|██████████| 500/500 [00:04<00:00, 105.85it/s]
   MaxAbsEStateIndex  MaxEStateIndex  MinAbsEStateIndex  MinEStateIndex  \
0           8.638741        8.638741           0.039513       -3.852793
1           8.193508        8.193508           0.286175       -0.599619
2           8.666998        8.666998           0.049406       -3.394321
3           8.032986        8.032986           0.013750       -2.742153
4           9.189644        9.189644           0.166504       -3.306171

        qed        SPS    MolWt  HeavyAtomMolWt  ExactMolWt  \
0  0.816913  70.705882  285.503         266.351  285.067963
1  0.735869  16.444444  240.222         232.158  240.064725
2  0.801905  32.818182  313.386         298.266  313.099731
3  0.749729  51.384615  218.299         208.219  218.007136
4  0.772983  38.095238  319.415         306.311  319.056152

   NumValenceElectrons  ...  fr_sulfonamd  fr_sulfone  fr_term_acetylene  \
0                   94  ...             0           0                  0
1                   88  ...             0           0                  0
2                  112  ...             0           0                  0
3                   72  ...             0           0                  0
4                  108  ...             0           0                  0

   fr_tetrazole  fr_thiazole  fr_thiocyan  fr_thiophene  fr_unbrch_alkane  \
0             0            1            0             0                 0
1             0            0            0             0                 0
2             0            1            0             0                 0
3             0            0            0             0                 0
4             0            2            0             0                 0

   fr_urea                                 SMILES
0        0              c1nc(C2CSCCS2)sc1CC1CCCC1
1        0           Oc1nccnc1-c1coc(-c2cnccn2)c1
2        0  c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1
3        0                    Oc1occc1C1(O)CSCCS1
4        0       c1nc(-c2cocc2C2NCNCN2c2cscn2)cs1

[5 rows x 218 columns]
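
Many of the fragment-count columns above (for example fr_urea) are zero for every row shown, and the table carries a non-numeric SMILES column. A common preprocessing step before modeling is to drop columns that never vary. The sketch below shows the idea on a plain dictionary of columns filled with the five rows printed above; the function name is hypothetical, and a column that is constant over five rows is not necessarily constant over all 500 molecules.

```python
def drop_constant_columns(table, keep=("SMILES",)):
    """Keep columns with more than one distinct value, plus any listed in `keep`."""
    return {
        name: values
        for name, values in table.items()
        if name in keep or len(set(values)) > 1
    }


# First five rows of a few columns from the output above
table = {
    "MolWt": [285.503, 240.222, 313.386, 218.299, 319.415],
    "fr_thiazole": [0, 0, 0, 0, 0][:0] + [1, 0, 1, 0, 2],
    "fr_urea": [0, 0, 0, 0, 0],
    "SMILES": [
        "c1nc(C2CSCCS2)sc1CC1CCCC1",
        "Oc1nccnc1-c1coc(-c2cnccn2)c1",
        "c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1",
        "Oc1occc1C1(O)CSCCS1",
        "c1nc(-c2cocc2C2NCNCN2c2cscn2)cs1",
    ],
}

filtered = drop_constant_columns(table)
print(list(filtered))
```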

Descriptors from Mordred

Note: This function requires Mordred to be installed.

Mordred molecular descriptors are an open-source alternative to Dragon and RDKit descriptors. The library can generate more than 1800 descriptors, compared with more than 5200 for Dragon and about 200 for RDKit.

[8]:
from chemml.chem import Mordred

mord = Mordred()
features = mord.represent(mol_objs_list, quiet=False)
print(features[:5])
100%|██████████| 500/500 [00:08<00:00, 58.16it/s]
   nAcid  nBase    SpAbs_A   SpMax_A  SpDiam_A     SpAD_A   SpMAD_A   LogEE_A  \
0      0      0  22.998278  2.356990  4.586123  22.998278  1.352840  3.780704
1      0      0  24.114905  2.401132  4.700060  24.114905  1.339717  3.834684
2      0      1  30.133660  2.388743  4.753098  30.133660  1.369712  4.046753
3      0      0  16.550756  2.429396  4.799667  16.550756  1.273135  3.507942
4      0      2  28.597221  2.464180  4.847996  28.597221  1.361772  4.008031

          SM1_A     VE1_A  ...     TSRW10          MW       AMW  WPath  WPol  \
0 -5.218048e-15  3.771227  ...  64.739856  285.067963  7.918555    564    19
1  8.215650e-15  3.859715  ...  64.604946  240.064725  9.233259    624    25
2 -1.232348e-14  4.414919  ...  71.264318  313.099731  8.462155   1142    30
3  6.328271e-15  3.324299  ...  58.294869  218.007136  9.478571    226    18
4 -1.554312e-15  4.162396  ...  71.560531  319.056152  9.384004    870    29

   Zagreb1  Zagreb2  mZagreb1  mZagreb2                                 SMILES
0     88.0    101.0  3.694444  3.777778              c1nc(C2CSCCS2)sc1CC1CCCC1
1     94.0    110.0  4.555556  4.000000           Oc1nccnc1-c1coc(-c2cnccn2)c1
2    118.0    139.0  4.666667  4.833333  c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1
3     68.0     80.0  4.284722  2.861111                    Oc1occc1C1(O)CSCCS1
4    114.0    137.0  4.416667  4.638889       c1nc(-c2cocc2C2NCNCN2c2cscn2)cs1

[5 rows x 1373 columns]

Descriptors from PaDELPy

Note: This function requires PaDELPy and JRE 6+ to be installed.

PaDEL-Descriptor is open-source software for calculating molecular descriptors and fingerprints. Its original publication reports 797 descriptors (663 1D/2D and 134 3D) and 10 fingerprint types; the output below contains 1876 columns, so the version used here evidently computes more. It uses the Chemistry Development Kit together with custom implementations, offers both a GUI and a CLI, supports multiple file formats, and enables multithreading for efficient calculations.

[11]:
from chemml.chem import PadelDesc

padel = PadelDesc()
features = padel.represent(mol_objs_list[:10])
print(features[:5])
  nAcid                ALogP                 ALogp2                 AMR  \
0     0   1.4476999999999995      2.095835289999999  63.309999999999995
1     0  -0.6144000000000001    0.37748736000000005              4.1837
2     0  0.05900000000000016  0.0034810000000000192  31.210299999999997
3     0  0.16129999999999978   0.026017689999999927  38.772499999999994
4     0   0.3426999999999998    0.11744328999999985  38.349399999999996

                 apol naAromAtom nAromBond nAtom nHeavyAtom  nH  ... P2s E1s  \
0   45.34906699999998          5         5    36         17  19  ...
1           32.458344         17        19    26         18   8  ...
2   45.60389499999999         16        17    37         22  15  ...
3  28.953929999999986          5         5    23         13  10  ...
4  43.650308999999986         15        16    34         21  13  ...

  E2s E3s Ts As Vs Ks Ds                                 SMILES
0                                     c1nc(C2CSCCS2)sc1CC1CCCC1
1                                  Oc1nccnc1-c1coc(-c2cnccn2)c1
2                         c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1
3                                           Oc1occc1C1(O)CSCCS1
4                              c1nc(-c2cocc2C2NCNCN2c2cscn2)cs1

[5 rows x 1876 columns]
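
In the output above the values are printed with full string precision (for example 1.4476999999999995) and the 3D descriptor columns such as P2s are empty, which suggests the values are returned as strings. Under that assumption, the sketch below converts one row to floats and marks empty entries as NaN. The function name is hypothetical, and the row holds values copied from the first line of the output.

```python
import math


def to_float(value):
    """Convert a descriptor value to float; empty or unparsable text becomes NaN."""
    text = str(value).strip()
    if not text:
        return math.nan
    try:
        return float(text)
    except ValueError:
        return math.nan


# First row of the output above, restricted to a few columns
row = {
    "nAcid": "0",
    "ALogP": "1.4476999999999995",
    "P2s": "",
    "SMILES": "c1nc(C2CSCCS2)sc1CC1CCCC1",
}

# Skip the SMILES column, which is an identifier rather than a feature
numeric = {name: to_float(value) for name, value in row.items() if name != "SMILES"}
print(numeric)
```

On a full pandas DataFrame, `pandas.to_numeric` with `errors='coerce'` applied column by column serves the same purpose.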