Feature Representation Methods in ChemML

To build a machine learning model, raw chemical data is first converted into a numerical representation that encodes the spatial or topological information defining a molecule. The resulting features may be continuous (molecular descriptors) or discrete (molecular fingerprints).

[1]:
from chemml.chem import Molecule
from chemml.datasets import load_organic_density
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Creating chemml.chem.Molecule objects from SMILES strings

All feature representation methods available in ChemML require chemml.chem.Molecule objects as input.

[2]:
# Import an existing dataset from ChemML
molecules, target, dragon_subset = load_organic_density()
mol_objs_list = []
for smi in molecules['smiles']:
    mol = Molecule(smi, 'smiles')  # parse the SMILES string
    mol.hydrogens('add')  # add explicit hydrogens
    # generate 3D coordinates with the MMFF94s force field
    mol.to_xyz('MMFF', maxIters=10000, mmffVariant='MMFF94s')
    mol_objs_list.append(mol)

Coulomb Matrix

The Coulomb matrix is a simple molecular descriptor that mimics the electrostatic interaction between nuclei.

[3]:
from chemml.chem import CoulombMatrix

# The Coulomb matrix type can be sorted ('SC'), unsorted ('UM'), unsorted triangular ('UT'), eigenspectrum ('E'), or random ('RC')
CM = CoulombMatrix(cm_type='SC',n_jobs=-1)

features = CM.represent(mol_objs_list)
print(features[:5])
featurizing molecules in batches of 62 ...
500/500 [==================================================] - 2s 4ms/step
Merging batch features ...    [DONE]
         0          1           2          3          4           5     \
0  388.023441  72.409571  388.023441  49.814719  70.880993  388.023441
1   73.516695  14.960477   73.516695  12.181391  15.302647   53.358707
2  388.023441  11.859397   73.516695  40.634258   5.622341   53.358707
3  388.023441  74.210007  388.023441  48.421152  40.116506   73.516695
4  388.023441  34.568986  388.023441  20.742816  20.052745   73.516695

        6          7          8          9     ...  1643  1644  1645  1646  \
0  43.493386  29.659853  22.280913  53.358707  ...   0.0   0.0   0.0   0.0
1  20.835105   8.040493   7.380449  53.358707  ...   0.0   0.0   0.0   0.0
2  24.709312   7.173379  20.351225  53.358707  ...   0.0   0.0   0.0   0.0
3  26.725846  20.224880  15.430347  73.516695  ...   0.0   0.0   0.0   0.0
4  22.094433  43.457407  12.307203  53.358707  ...   0.0   0.0   0.0   0.0

   1647  1648  1649  1650  1651  1652
0   0.0   0.0   0.0   0.0   0.0   0.0
1   0.0   0.0   0.0   0.0   0.0   0.0
2   0.0   0.0   0.0   0.0   0.0   0.0
3   0.0   0.0   0.0   0.0   0.0   0.0
4   0.0   0.0   0.0   0.0   0.0   0.0

[5 rows x 1653 columns]
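
To make the numbers above easier to read, here is a minimal pure-Python sketch of the textbook Coulomb matrix: diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|. It is an illustration under stated assumptions, not ChemML's implementation: the helper names, the toy geometry, and the choice of atomic units are hypothetical, and sorting by descending row norm is the usual definition of the sorted variant rather than something confirmed from the library. The diagonal values do line up with the printed features, since 0.5 * 16^2.4 ≈ 388.0234 (sulfur), 0.5 * 8^2.4 ≈ 73.5167 (oxygen), and 0.5 * 7^2.4 ≈ 53.3587 (nitrogen).

```python
import math


def coulomb_matrix(charges, coords):
    """Unsorted Coulomb matrix for a single molecule.

    charges: nuclear charges Z_i
    coords:  (x, y, z) positions, assumed here to be in atomic units
    """
    n = len(charges)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: fit to the energy of the isolated atom
                cm[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal: Coulomb repulsion between nuclei i and j
                cm[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return cm


def sort_by_row_norm(cm):
    """Reorder rows and columns by descending row norm."""
    order = sorted(range(len(cm)), key=lambda i: -math.hypot(*cm[i]))
    return [[cm[i][j] for j in order] for i in order]


# Hypothetical toy input: a water-like geometry, not taken from the dataset
charges = [1, 8, 1]
coords = [(0.0, 1.43, 1.11), (0.0, 0.0, 0.0), (0.0, -1.43, 1.11)]

cm = sort_by_row_norm(coulomb_matrix(charges, coords))
for row in cm:
    print([round(value, 4) for value in row])
```

After sorting, the oxygen row comes first because it has the largest norm. The 1653 columns reported above equal 57 * 58 / 2, which is consistent with flattening one triangle of a matrix zero-padded to 57 atoms, although that reading is inferred from the output rather than taken from the ChemML source.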

Fingerprints from RDKit

Molecular fingerprints are a way of encoding the structure of a molecule. The most common type of fingerprint is a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints allows you to determine the similarity between two molecules, to find matches to a query substructure, etc.

[4]:
from chemml.chem import RDKitFingerprint

# RDKit fingerprint types: 'morgan', 'hashed_topological_torsion' or 'htt', 'MACCS' or 'maccs', 'hashed_atom_pair' or 'hap'
morgan_fp = RDKitFingerprint(fingerprint_type='morgan', vector='bit', n_bits=1024, radius=3)
features = morgan_fp.represent(mol_objs_list)
print(features[:5])
   0     1     2     3     4     5     6     7     8     9     ...  1014  \
0     0     0     0     0     0     0     0     0     0     0  ...     0
1     0     0     0     0     0     0     0     0     0     0  ...     0
2     0     0     0     0     0     0     0     0     0     0  ...     0
3     0     0     0     0     0     0     0     0     0     0  ...     0
4     0     0     0     0     0     0     0     1     0     0  ...     0

   1015  1016  1017  1018  1019  1020  1021  1022  1023
0     0     0     0     0     0     0     0     0     0
1     0     0     0     0     0     0     0     0     0
2     0     0     0     0     0     0     0     0     0
3     0     0     0     0     0     0     0     0     0
4     0     0     0     0     0     0     0     0     0

[5 rows x 1024 columns]
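
Bit-vector fingerprints such as the ones above are typically compared with the Tanimoto (Jaccard) coefficient: the number of bits set in both vectors divided by the number of bits set in either. The sketch below is a generic pure-Python illustration; the `tanimoto` helper and the two 8-bit vectors are hypothetical and are not part of ChemML or RDKit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length bit vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)    # bits set in both
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)   # bits set in at least one
    # Two all-zero fingerprints share no information; report 0.0 by convention
    return both / either if either else 0.0


# Hypothetical 8-bit fingerprints for illustration only
fp_1 = [0, 1, 1, 0, 1, 0, 0, 1]
fp_2 = [0, 1, 0, 0, 1, 1, 0, 1]

similarity = tanimoto(fp_1, fp_2)
print(similarity)  # 3 shared bits out of 5 set bits -> 0.6
```

Assuming `features` is a pandas DataFrame, as the printed output suggests, two rows could presumably be compared in the same way with `tanimoto(features.iloc[0], features.iloc[1])`.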

Molecule tensors from chemml.chem.Molecule objects

Molecule tensors can be used to create neural graph fingerprints with chemml.models.

[5]:
from chemml.chem import tensorise_molecules
atoms,bonds,edges = tensorise_molecules(molecules=mol_objs_list, max_degree=5, max_atoms=None, n_jobs=-1, batch_size=100, verbose=True)
Tensorising molecules in batches of 100 ...
500/500 [==================================================] - 1s 1ms/step
Merging batch tensors ...    [DONE]

[6]:
print("Matrix for atom features (num_molecules, max_atoms, num_atom_features):\n", atoms.shape)
print("Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):\n", edges.shape)
print("Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):\n", bonds.shape)
Matrix for atom features (num_molecules, max_atoms, num_atom_features):
 (500, 57, 62)
Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):
 (500, 57, 5)
Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):
 (500, 57, 5, 6)
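
The shapes above show that every molecule is padded to the same number of atoms (57) and every atom to the same number of neighbour slots (5). The following pure-Python sketch illustrates how a connectivity tensor of shape (num_molecules, max_atoms, max_degree) can be assembled from per-atom neighbour lists. It is an illustration only: the `build_edge_tensor` helper, the toy graphs, and the use of -1 as the padding value are assumptions, not ChemML's actual implementation.

```python
def build_edge_tensor(neighbor_lists, max_atoms, max_degree, pad=-1):
    """Pad per-atom neighbour lists into a nested list of shape
    (num_molecules, max_atoms, max_degree)."""
    tensor = []
    for neighbors in neighbor_lists:
        if len(neighbors) > max_atoms:
            raise ValueError("molecule has more atoms than max_atoms")
        mol = []
        for atom_neighbors in neighbors:
            if len(atom_neighbors) > max_degree:
                raise ValueError("atom has more neighbours than max_degree")
            # Fill unused neighbour slots with the padding value
            mol.append(list(atom_neighbors) + [pad] * (max_degree - len(atom_neighbors)))
        # Add fully padded rows for the atoms this molecule does not have
        mol.extend([[pad] * max_degree for _ in range(max_atoms - len(neighbors))])
        tensor.append(mol)
    return tensor


# Hypothetical heavy-atom graphs: ethanol (C-C-O) and methanol (C-O)
ethanol = [[1], [0, 2], [1]]
methanol = [[1], [0]]

edges_toy = build_edge_tensor([ethanol, methanol], max_atoms=3, max_degree=5)
print(len(edges_toy), len(edges_toy[0]), len(edges_toy[0][0]))  # shape: 2 3 5
print(edges_toy[0][1])  # neighbours of the central carbon in ethanol
```

Fixed shapes of this kind let molecules of different sizes be stacked into a single batch, at the cost of storing padding entries that a downstream model has to mask or ignore.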