Feature Representation Methods in ChemML
To build a machine learning model, raw chemical data must first be converted into a numerical representation. The representation encodes the spatial or topological information that defines a molecule. The resulting features may be either continuous (molecular descriptors) or discrete (molecular fingerprints).
[1]:
from chemml.chem import Molecule
from chemml.datasets import load_organic_density
import numpy as np
import warnings
warnings.filterwarnings('ignore')
Creating chemml.chem.Molecule objects from molecule SMILES
All feature representation methods available in ChemML require chemml.chem.Molecule objects as inputs.
[2]:
# Importing an existing dataset from ChemML
molecules, target, dragon_subset = load_organic_density()
mol_objs_list = []
for smi in molecules['smiles']:
    mol = Molecule(smi, 'smiles')
    mol.hydrogens('add')  # add explicit hydrogens
    # generate 3D coordinates optimized with the MMFF94s force field
    mol.to_xyz('MMFF', maxIters=10000, mmffVariant='MMFF94s')
    mol_objs_list.append(mol)
Coulomb Matrix
The Coulomb matrix is a simple molecular descriptor that mimics the electrostatic interaction between nuclei.
[3]:
from chemml.chem import CoulombMatrix
# cm_type options: sorted ('SC'), unsorted matrix ('UM'), unsorted triangular ('UT'), eigenspectrum ('E'), or random ('RC')
CM = CoulombMatrix(cm_type='SC', n_jobs=-1)
features = CM.represent(mol_objs_list)
print(features[:5])
featurizing molecules in batches of 62 ...
500/500 [==================================================] - 2s 4ms/step
Merging batch features ... [DONE]
0 1 2 3 4 5 \
0 388.023441 72.409571 388.023441 49.814719 70.880993 388.023441
1 73.516695 14.960477 73.516695 12.181391 15.302647 53.358707
2 388.023441 11.859397 73.516695 40.634258 5.622341 53.358707
3 388.023441 74.210007 388.023441 48.421152 40.116506 73.516695
4 388.023441 34.568986 388.023441 20.742816 20.052745 73.516695
6 7 8 9 ... 1643 1644 1645 1646 \
0 43.493386 29.659853 22.280913 53.358707 ... 0.0 0.0 0.0 0.0
1 20.835105 8.040493 7.380449 53.358707 ... 0.0 0.0 0.0 0.0
2 24.709312 7.173379 20.351225 53.358707 ... 0.0 0.0 0.0 0.0
3 26.725846 20.224880 15.430347 73.516695 ... 0.0 0.0 0.0 0.0
4 22.094433 43.457407 12.307203 53.358707 ... 0.0 0.0 0.0 0.0
1647 1648 1649 1650 1651 1652
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0
[5 rows x 1653 columns]
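To make the descriptor concrete, here is a minimal NumPy sketch of the standard Coulomb matrix definition: diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|. It is not ChemML's implementation. The function names, the toy geometry, and the choice to use distances exactly as given (no unit conversion) are assumptions made for illustration. The printed output above is consistent with this definition: 388.02, 73.52, and 53.36 equal 0.5 * Z^2.4 for Z = 16, 8, and 7, and 1653 = 57 * 58 / 2 is the number of lower-triangle entries of a 57 x 57 matrix, with zeros padding the smaller molecules.

```python
import numpy as np

def coulomb_matrix(charges, coords):
    """Return the unsorted Coulomb matrix for one molecule (illustrative sketch)."""
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise interatomic distances, shape (n_atoms, n_atoms)
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    # Put 1 on the diagonal so the division below never hits 0/0
    np.fill_diagonal(dist, 1.0)
    # Off-diagonal entries: Z_i * Z_j / |R_i - R_j|
    cm = np.outer(z, z) / dist
    # Diagonal entries: 0.5 * Z_i ** 2.4
    np.fill_diagonal(cm, 0.5 * z ** 2.4)
    return cm

def sorted_lower_triangle(cm):
    """Sort rows and columns by descending row norm, then flatten the lower triangle."""
    order = np.argsort(-np.linalg.norm(cm, axis=1))
    cm_sorted = cm[order][:, order]
    return cm_sorted[np.tril_indices(len(cm_sorted))]

# Toy water-like geometry (hypothetical coordinates, used as given)
charges = [8, 1, 1]
coords = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]

cm = coulomb_matrix(charges, coords)
features = sorted_lower_triangle(cm)
print(cm.shape, features.shape)
```

Sorting by row norm makes the vector independent of atom ordering, which is why a sorted variant is often preferred when molecules are listed with arbitrary atom indices.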
Fingerprints from RDKit
Molecular fingerprints are a way of encoding the structure of a molecule. The most common type of fingerprint is a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints allows you to determine the similarity between two molecules, to find matches to a query substructure, etc.
[4]:
from chemml.chem import RDKitFingerprint
# RDKit fingerprint types: 'morgan', 'hashed_topological_torsion' or 'htt' , 'MACCS' or 'maccs', 'hashed_atom_pair' or 'hap'
morgan_fp = RDKitFingerprint(fingerprint_type='morgan', vector='bit', n_bits=1024, radius=3)
features = morgan_fp.represent(mol_objs_list)
print(features[:5])
0 1 2 3 4 5 6 7 8 9 ... 1014 \
0 0 0 0 0 0 0 0 0 0 0 ... 0
1 0 0 0 0 0 0 0 0 0 0 ... 0
2 0 0 0 0 0 0 0 0 0 0 ... 0
3 0 0 0 0 0 0 0 0 0 0 ... 0
4 0 0 0 0 0 0 0 1 0 0 ... 0
1015 1016 1017 1018 1019 1020 1021 1022 1023
0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
[5 rows x 1024 columns]
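As an illustration of how bit-vector fingerprints are compared, the sketch below computes the Tanimoto (Jaccard) similarity, a common choice for binary fingerprints, using only NumPy. The function name `tanimoto`, the toy 8-bit vectors, and the convention of returning 0.0 for two all-zero fingerprints are assumptions made for this example and are not part of ChemML.

```python
import numpy as np

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints: |A and B| / |A or B|."""
    a = np.asarray(fp_a, dtype=bool)
    b = np.asarray(fp_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Convention chosen here: two empty fingerprints get similarity 0.0
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

# Toy 8-bit fingerprints (hypothetical values)
fp_1 = np.array([1, 1, 0, 0, 1, 0, 0, 0])
fp_2 = np.array([1, 0, 0, 0, 1, 1, 0, 0])

similarity = tanimoto(fp_1, fp_2)
print(similarity)  # 2 shared bits / 4 bits set in either = 0.5
```

Assuming `features` is the DataFrame printed above, two molecules could be compared with `tanimoto(features.iloc[0], features.iloc[1])`.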
Molecule tensors from chemml.chem.Molecule objects
Molecule tensors can be used to create neural graph fingerprints using chemml.models.
[5]:
from chemml.chem import tensorise_molecules
atoms, bonds, edges = tensorise_molecules(
    molecules=mol_objs_list,
    max_degree=5,
    max_atoms=None,
    n_jobs=-1,
    batch_size=100,
    verbose=True,
)
Tensorising molecules in batches of 100 ...
500/500 [==================================================] - 1s 1ms/step
Merging batch tensors ... [DONE]
[6]:
print("Matrix for atom features (num_molecules, max_atoms, num_atom_features):\n", atoms.shape)
print("Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):\n", edges.shape)
print("Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):\n", bonds.shape)
Matrix for atom features (num_molecules, max_atoms, num_atom_features):
(500, 57, 62)
Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):
(500, 57, 5)
Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):
(500, 57, 5, 6)
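The shapes above reflect padding: every molecule is padded to the largest atom count in the batch (57 here) and every atom to max_degree neighbor slots (5 here). The sketch below shows how a connectivity tensor of that form can be assembled from neighbor lists. It is a toy illustration, not ChemML's code. The helper `pad_edges`, the pad value of -1, and the hydrogen-free example graphs are all assumptions.

```python
import numpy as np

def pad_edges(neighbor_lists, max_atoms, max_degree, pad_value=-1):
    """Build a (max_atoms, max_degree) array of neighbor indices, padded with pad_value."""
    edges = np.full((max_atoms, max_degree), pad_value, dtype=int)
    for atom_idx, neighbors in enumerate(neighbor_lists):
        # Fill the first len(neighbors) slots; the remaining slots keep the pad value
        edges[atom_idx, :len(neighbors)] = neighbors
    return edges

# Heavy-atom graphs only (hydrogens omitted for brevity; hypothetical inputs)
ethanol = [[1], [0, 2], [1]]   # C-C-O: atom 1 is bonded to atoms 0 and 2
ethane = [[1], [0]]            # C-C

max_atoms, max_degree = 3, 5
edges_toy = np.stack([pad_edges(m, max_atoms, max_degree) for m in (ethanol, ethane)])
print(edges_toy.shape)  # (num_molecules, max_atoms, max_degree) = (2, 3, 5)
```

Because ethane has only two heavy atoms, its third row consists entirely of pad values, which lets downstream layers mask out atoms and neighbor slots that do not exist.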