Feature Representation Methods in ChemML
To build a machine learning model, raw chemical data must first be converted into a numerical representation that encodes the spatial or topological information defining a molecule. The resulting features may be continuous (molecular descriptors) or discrete (molecular fingerprints).
[1]:
from chemml.chem import Molecule
from chemml.datasets import load_organic_density
import numpy as np
import warnings
warnings.filterwarnings('ignore')
Creating chemml.chem.Molecule object from molecule SMILES
All feature representation methods available in ChemML require chemml.chem.Molecule objects as input.
[2]:
# Importing an existing dataset from ChemML
molecules, target, dragon_subset = load_organic_density()
mol_objs_list = []
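for smi in molecules['smiles']:
    # Build a Molecule from its SMILES string
    mol = Molecule(smi, 'smiles')
    # Add explicit hydrogens before generating 3D coordinates
    mol.hydrogens('add')
    # Optimize the geometry with the MMFF94s force field
    mol.to_xyz('MMFF', maxIters=10000, mmffVariant='MMFF94s')
    mol_objs_list.append(mol)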
Coulomb Matrix
The Coulomb matrix is a simple molecular descriptor that mimics the electrostatic interaction between nuclei.
[3]:
from chemml.chem import CoulombMatrix
# The Coulomb matrix type can be sorted (SC), unsorted (UM), unsorted triangular (UT), eigen spectrum (E), or random (RC)
CM = CoulombMatrix(cm_type='SC', n_jobs=-1)
features = CM.represent(mol_objs_list)
print(features[:5])
featurizing molecules in batches of 31 ...
500/500 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12s 24ms/step
Merging batch features ... [DONE]
0 1 2 3 4 5 \
0 388.023441 67.563795 388.023441 46.773304 71.039377 388.023441
1 73.516695 12.680660 73.516695 13.507005 15.308384 53.358707
2 388.023441 10.343694 73.516695 40.719814 5.612250 53.358707
3 388.023441 72.013552 388.023441 49.045255 31.222555 73.516695
4 388.023441 34.076060 388.023441 20.740383 20.314923 73.516695
6 7 8 9 ... 1643 1644 1645 1646 \
0 43.471164 31.884828 23.619673 53.358707 ... 0.0 0.0 0.0 0.0
1 15.511761 10.387486 7.267909 53.358707 ... 0.0 0.0 0.0 0.0
2 22.032304 7.173553 20.331941 53.358707 ... 0.0 0.0 0.0 0.0
3 26.287638 24.264785 15.451307 73.516695 ... 0.0 0.0 0.0 0.0
4 21.673878 43.473700 12.535104 53.358707 ... 0.0 0.0 0.0 0.0
1647 1648 1649 1650 1651 1652
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0
[5 rows x 1653 columns]
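The diagonal values in the output above (388.023441, 73.516695, 53.358707) match 0.5 * Z^2.4 for sulfur (Z=16), oxygen (Z=8), and nitrogen (Z=7), and 1653 = 57 * 58 / 2 is the number of entries in one triangle of a 57 x 57 matrix. The sketch below illustrates that definition in plain Python. It is not ChemML's implementation: the function names, the water geometry, the distance unit, and the row-norm sorting convention are all assumptions made for illustration.

```python
import math

def coulomb_matrix(charges, coords):
    """Unsorted Coulomb matrix: 0.5*Z_i**2.4 on the diagonal, Z_i*Z_j/r_ij elsewhere."""
    n = len(charges)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cm[i][j] = 0.5 * charges[i] ** 2.4
            else:
                cm[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return cm

def sort_by_row_norm(cm):
    """Reorder rows and columns together by decreasing row norm (assumed convention)."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in cm]
    order = sorted(range(len(cm)), key=lambda i: norms[i], reverse=True)
    return [[cm[i][j] for j in order] for i in order]

def lower_triangle(cm):
    """Flatten the lower triangle (diagonal included) row by row."""
    return [cm[i][j] for i in range(len(cm)) for j in range(i + 1)]

# Hypothetical water geometry (H, O, H); coordinates are illustrative only
charges = [1, 8, 1]
coords = [(0.757, 0.586, 0.0), (0.0, 0.0, 0.0), (-0.757, 0.586, 0.0)]

sorted_cm = sort_by_row_norm(coulomb_matrix(charges, coords))
cm_features = lower_triangle(sorted_cm)
print(len(cm_features), cm_features[0])
```

After sorting, the oxygen row comes first, so the first feature is the oxygen diagonal term. In a dataset, smaller molecules would be zero-padded to the size of the largest one, which would explain the trailing 0.0 columns in the output above.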
Fingerprints from RDKit
Molecular fingerprints are a way of encoding the structure of a molecule. The most common type of fingerprint is a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints allows you to determine the similarity between two molecules, to find matches to a query substructure, etc.
[4]:
from chemml.chem import RDKitFingerprint
# RDKit fingerprint types: 'morgan', 'hashed_topological_torsion' or 'htt', 'MACCS' or 'maccs', 'hashed_atom_pair' or 'hap'
morgan_fp = RDKitFingerprint(fingerprint_type='morgan', vector='bit', n_bits=1024, radius=3)
features = morgan_fp.represent(mol_objs_list)
print(features[:5])
0 1 2 3 4 5 6 7 8 9 ... 1014 \
0 0 0 0 0 0 0 1 0 0 0 ... 0
1 0 0 0 0 0 0 0 0 0 0 ... 0
2 0 0 0 0 0 0 0 0 0 0 ... 0
3 0 0 0 0 0 0 1 0 0 0 ... 0
4 0 0 0 1 0 0 0 0 1 0 ... 0
1015 1016 1017 1018 1019 1020 1021 1022 1023
0 0 0 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 1 0 1 0 0
[5 rows x 1024 columns]
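Comparing two bit-vector fingerprints is usually done with the Tanimoto (Jaccard) coefficient: the number of bits set in both vectors divided by the number set in either. The following is a minimal sketch in plain Python; the function name and the two short example vectors are hypothetical, and returning 0.0 for two all-zero vectors is a choice made here, not a fixed rule.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length bit vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    shared = sum(1 for a, b in zip(fp_a, fp_b) if a and b)   # bits on in both
    total = sum(1 for a, b in zip(fp_a, fp_b) if a or b)     # bits on in either
    return shared / total if total else 0.0

# Two illustrative 8-bit fingerprints (real Morgan vectors here have 1024 bits)
fp_x = [1, 1, 0, 0, 1, 0, 0, 0]
fp_y = [1, 0, 0, 0, 1, 1, 0, 0]

similarity = tanimoto(fp_x, fp_y)
print(similarity)
```

The same function could be applied to any two rows of the fingerprint table above once they are converted to lists.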
Molecule tensors from chemml.chem.Molecule objects
Molecule tensors can be used to build neural graph fingerprints with chemml.models.
[5]:
from chemml.chem import tensorise_molecules
atoms, bonds, edges = tensorise_molecules(
    molecules=mol_objs_list,
    max_degree=5,
    max_atoms=None,
    n_jobs=-1,
    batch_size=100,
    verbose=True,
)
Tensorising molecules in batches of 100 ...
500/500 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7s 13ms/step
Merging batch tensors ... [DONE]
[6]:
print("Matrix for atom features (num_molecules, max_atoms, num_atom_features):\n", atoms.shape)
print("Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):\n", edges.shape)
print("Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):\n", bonds.shape)
Matrix for atom features (num_molecules, max_atoms, num_atom_features):
(500, 57, 62)
Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):
(500, 57, 5)
Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):
(500, 57, 5, 6)
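To make the (num_molecules, max_atoms, max_degree) shape of the connectivity tensor concrete, the sketch below builds such a tensor by hand for two tiny molecular graphs. It assumes that unused neighbor slots are padded with -1, a common convention for this kind of tensor; whether ChemML uses the same padding value is not confirmed here. The function name and the example bond lists are hypothetical.

```python
def build_edge_tensor(bond_lists, max_atoms, max_degree, pad=-1):
    """Nested list of shape (num_molecules, max_atoms, max_degree) holding neighbor indices."""
    edges = []
    for bonds in bond_lists:
        neighbors = [[] for _ in range(max_atoms)]
        for a, b in bonds:
            # Each bond contributes one neighbor entry to both of its atoms
            neighbors[a].append(b)
            neighbors[b].append(a)
        mol_edges = []
        for nbrs in neighbors:
            if len(nbrs) > max_degree:
                raise ValueError("atom degree exceeds max_degree")
            # Pad unused neighbor slots so every atom row has max_degree entries
            mol_edges.append(nbrs + [pad] * (max_degree - len(nbrs)))
        edges.append(mol_edges)
    return edges

# Heavy-atom graphs: a three-atom chain (0-1-2) and a single atom with no bonds
bond_lists = [[(0, 1), (1, 2)], []]
edge_tensor = build_edge_tensor(bond_lists, max_atoms=3, max_degree=2)
print(edge_tensor)
```

Padding every molecule to the same max_atoms and max_degree is what lets a batch of differently sized molecules be stacked into one fixed-shape array.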
Descriptors from RDKit
RDKit provides a comprehensive set of molecular descriptors covering topological, geometrical, electronic, and constitutional properties. The calculation is efficient for large datasets, and the RDKDesc class declaration allows either specific descriptors or all of them to be selected. The results integrate with other RDKit functions and Python workflows.
[7]:
from chemml.chem import RDKDesc
rdd = RDKDesc()
features = rdd.represent(mol_objs_list)
print(features[:5])
Calculating RDKit descriptors: 100%|██████████| 500/500 [00:04<00:00, 105.85it/s]
MaxAbsEStateIndex MaxEStateIndex MinAbsEStateIndex MinEStateIndex \
0 8.638741 8.638741 0.039513 -3.852793
1 8.193508 8.193508 0.286175 -0.599619
2 8.666998 8.666998 0.049406 -3.394321
3 8.032986 8.032986 0.013750 -2.742153
4 9.189644 9.189644 0.166504 -3.306171
qed SPS MolWt HeavyAtomMolWt ExactMolWt \
0 0.816913 70.705882 285.503 266.351 285.067963
1 0.735869 16.444444 240.222 232.158 240.064725
2 0.801905 32.818182 313.386 298.266 313.099731
3 0.749729 51.384615 218.299 208.219 218.007136
4 0.772983 38.095238 319.415 306.311 319.056152
NumValenceElectrons ... fr_sulfonamd fr_sulfone fr_term_acetylene \
0 94 ... 0 0 0
1 88 ... 0 0 0
2 112 ... 0 0 0
3 72 ... 0 0 0
4 108 ... 0 0 0
fr_tetrazole fr_thiazole fr_thiocyan fr_thiophene fr_unbrch_alkane \
0 0 1 0 0 0
1 0 0 0 0 0
2 0 1 0 0 0
3 0 0 0 0 0
4 0 2 0 0 0
fr_urea SMILES
0 0 c1nc(C2CSCCS2)sc1CC1CCCC1
1 0 Oc1nccnc1-c1coc(-c2cnccn2)c1
2 0 c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1
3 0 Oc1occc1C1(O)CSCCS1
4 0 c1nc(-c2cocc2C2NCNCN2c2cscn2)cs1
[5 rows x 218 columns]
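The table above carries a non-numeric SMILES column, and several fragment-count columns (fr_*) are zero in every row shown. Before fitting a model it is common to drop such columns. The sketch below shows one way to do that on a plain dictionary of columns; the function name is hypothetical, the values are copied from the first three rows of the output above, and a real workflow would likely operate on the full table instead.

```python
def select_informative_columns(table, exclude=("SMILES",)):
    """Keep columns that are not excluded and that take more than one distinct value."""
    kept = {}
    for name, values in table.items():
        if name in exclude:
            continue  # identifier column, not a feature
        if len(set(values)) > 1:
            kept[name] = values  # constant columns carry no information
    return kept

# First three rows of a few columns from the RDKit descriptor output
table = {
    "MolWt": [285.503, 240.222, 313.386],
    "fr_urea": [0, 0, 0],
    "fr_thiazole": [1, 0, 1],
    "SMILES": [
        "c1nc(C2CSCCS2)sc1CC1CCCC1",
        "Oc1nccnc1-c1coc(-c2cnccn2)c1",
        "c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1",
    ],
}

kept = select_informative_columns(table)
print(list(kept))
```

A column that is constant in a small sample may still vary across the full dataset, so this kind of filter is best applied to all rows at once.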
Descriptors from Mordred
Note: This function requires Mordred to be installed.
Mordred molecular descriptors are an open-source alternative to Dragon and RDKit descriptors. The library can generate more than 1,800 descriptors, compared with more than 5,200 for Dragon and roughly 200 for RDKit.
[8]:
from chemml.chem import Mordred
mord = Mordred()
features = mord.represent(mol_objs_list, quiet=False)
print(features[:5])
100%|██████████| 500/500 [00:08<00:00, 58.16it/s]
nAcid nBase SpAbs_A SpMax_A SpDiam_A SpAD_A SpMAD_A LogEE_A \
0 0 0 22.998278 2.356990 4.586123 22.998278 1.352840 3.780704
1 0 0 24.114905 2.401132 4.700060 24.114905 1.339717 3.834684
2 0 1 30.133660 2.388743 4.753098 30.133660 1.369712 4.046753
3 0 0 16.550756 2.429396 4.799667 16.550756 1.273135 3.507942
4 0 2 28.597221 2.464180 4.847996 28.597221 1.361772 4.008031
SM1_A VE1_A ... TSRW10 MW AMW WPath WPol \
0 -5.218048e-15 3.771227 ... 64.739856 285.067963 7.918555 564 19
1 8.215650e-15 3.859715 ... 64.604946 240.064725 9.233259 624 25
2 -1.232348e-14 4.414919 ... 71.264318 313.099731 8.462155 1142 30
3 6.328271e-15 3.324299 ... 58.294869 218.007136 9.478571 226 18
4 -1.554312e-15 4.162396 ... 71.560531 319.056152 9.384004 870 29
Zagreb1 Zagreb2 mZagreb1 mZagreb2 SMILES
0 88.0 101.0 3.694444 3.777778 c1nc(C2CSCCS2)sc1CC1CCCC1
1 94.0 110.0 4.555556 4.000000 Oc1nccnc1-c1coc(-c2cnccn2)c1
2 118.0 139.0 4.666667 4.833333 c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1
3 68.0 80.0 4.284722 2.861111 Oc1occc1C1(O)CSCCS1
4 114.0 137.0 4.416667 4.638889 c1nc(-c2cocc2C2NCNCN2c2cscn2)cs1
[5 rows x 1373 columns]
Descriptors from PaDELPy
Note: This function requires PaDELPy and JRE 6+ to be installed.
PaDEL-Descriptor is open-source software for calculating molecular descriptors and fingerprints. As originally released, it computes 797 descriptors (663 1D/2D, 134 3D) and 10 fingerprint types; the 1876-column output below suggests that the version used here computes more. It uses the Chemistry Development Kit along with custom implementations, offers both a GUI and a CLI, supports multiple file formats, and enables multithreading for efficient calculations.
[11]:
from chemml.chem import PadelDesc
padel = PadelDesc()
features = padel.represent(mol_objs_list[:10])
print(features[:5])
nAcid ALogP ALogp2 AMR \
0 0 1.4476999999999995 2.095835289999999 63.309999999999995
1 0 -0.6144000000000001 0.37748736000000005 4.1837
2 0 0.05900000000000016 0.0034810000000000192 31.210299999999997
3 0 0.16129999999999978 0.026017689999999927 38.772499999999994
4 0 0.3426999999999998 0.11744328999999985 38.349399999999996
apol naAromAtom nAromBond nAtom nHeavyAtom nH ... P2s E1s \
0 45.34906699999998 5 5 36 17 19 ...
1 32.458344 17 19 26 18 8 ...
2 45.60389499999999 16 17 37 22 15 ...
3 28.953929999999986 5 5 23 13 10 ...
4 43.650308999999986 15 16 34 21 13 ...
E2s E3s Ts As Vs Ks Ds SMILES
0 c1nc(C2CSCCS2)sc1CC1CCCC1
1 Oc1nccnc1-c1coc(-c2cnccn2)c1
2 c1cc(-c2ccc(N3CNCN(c4cncs4)C3)nc2)co1
3 Oc1occc1C1(O)CSCCS1
4 c1nc(-c2cocc2C2NCNCN2c2cscn2)cs1
[5 rows x 1876 columns]
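Some columns in the output above (P2s, E1s, and the other 3D descriptors) are blank, and the numeric values are printed at full string precision. The sketch below shows one way to coerce such a table to floats and drop columns with missing values. It assumes the blank cells arrive as empty strings, which is not confirmed here; the function names are hypothetical and the values are copied from the first three rows above.

```python
import math

def to_float(value):
    """Convert a cell to float, mapping blanks and unparsable text to NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

def drop_incomplete_columns(table):
    """Convert every column to floats and keep only columns without NaN."""
    cleaned = {}
    for name, values in table.items():
        numbers = [to_float(v) for v in values]
        if not any(math.isnan(x) for x in numbers):
            cleaned[name] = numbers
    return cleaned

# A few columns from the PaDEL output; P2s is blank in every row shown
raw = {
    "ALogP": ["1.4476999999999995", "-0.6144000000000001", "0.05900000000000016"],
    "nAtom": ["36", "26", "37"],
    "P2s": ["", "", ""],
}

cleaned = drop_incomplete_columns(raw)
print(list(cleaned))
```

On a pandas DataFrame, `pandas.to_numeric(column, errors="coerce")` performs the same blank-to-NaN conversion for one column at a time.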