{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Representation Methods in ChemML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To build a machine learning model, raw chemical data is first converted into a numerical representation. The representation contains spatial or topological information that defines a molecule. The resulting features may either be in continuous (molecular descriptors) or discrete (molecular fingerprints) form." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from chemml.chem import Molecule\n", "from chemml.datasets import load_organic_density\n", "import numpy as np\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating `chemml.chem.Molecule` object from molecule SMILES\n", "\n", "All feature representation methods available in ChemML require `chemml.chem.Molecule` as inputs" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Importing an existing dataset from ChemML\n", "molecules, target, dragon_subset = load_organic_density()\n", "mol_objs_list = []\n", "for smi in molecules['smiles']:\n", " mol = Molecule(smi, 'smiles')\n", " mol.hydrogens('add')\n", " mol.to_xyz('MMFF', maxIters=10000, mmffVariant='MMFF94s')\n", " mol_objs_list.append(mol)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Coulomb Matrix](https://doi.org/10.1103/PhysRevLett.108.058301)\n", "\n", "Simple molecular descriptor which mimics the electro-static interaction between nuclei. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "featurizing molecules in batches of 62 ...\n", "500/500 [==================================================] - 2s 4ms/step\n", "Merging batch features ... [DONE]\n", " 0 1 2 3 4 5 \\\n", "0 388.023441 72.409571 388.023441 49.814719 70.880993 388.023441 \n", "1 73.516695 14.960477 73.516695 12.181391 15.302647 53.358707 \n", "2 388.023441 11.859397 73.516695 40.634258 5.622341 53.358707 \n", "3 388.023441 74.210007 388.023441 48.421152 40.116506 73.516695 \n", "4 388.023441 34.568986 388.023441 20.742816 20.052745 73.516695 \n", "\n", " 6 7 8 9 ... 1643 1644 1645 1646 \\\n", "0 43.493386 29.659853 22.280913 53.358707 ... 0.0 0.0 0.0 0.0 \n", "1 20.835105 8.040493 7.380449 53.358707 ... 0.0 0.0 0.0 0.0 \n", "2 24.709312 7.173379 20.351225 53.358707 ... 0.0 0.0 0.0 0.0 \n", "3 26.725846 20.224880 15.430347 73.516695 ... 0.0 0.0 0.0 0.0 \n", "4 22.094433 43.457407 12.307203 53.358707 ... 0.0 0.0 0.0 0.0 \n", "\n", " 1647 1648 1649 1650 1651 1652 \n", "0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 1653 columns]\n" ] } ], "source": [ "from chemml.chem import CoulombMatrix\n", "\n", "#The coulomb matrix type can be sorted (SC), unsorted(UM), unsorted triangular(UT), eigen spectrum(E), or random (RC)\n", "CM = CoulombMatrix(cm_type='SC',n_jobs=-1) \n", "\n", "features = CM.represent(mol_objs_list)\n", "print(features[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Fingerprints from RDKit](https://www.rdkit.org/)\n", "\n", "Molecular fingerprints are a way of encoding the structure of a molecule. The most common type of fingerprint is a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints allows you to determine the similarity between two molecules, to find matches to a query substructure, etc." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2 3 4 5 6 7 8 9 ... 1014 \\\n", "0 0 0 0 0 0 0 0 0 0 0 ... 0 \n", "1 0 0 0 0 0 0 0 0 0 0 ... 0 \n", "2 0 0 0 0 0 0 0 0 0 0 ... 0 \n", "3 0 0 0 0 0 0 0 0 0 0 ... 0 \n", "4 0 0 0 0 0 0 0 1 0 0 ... 0 \n", "\n", " 1015 1016 1017 1018 1019 1020 1021 1022 1023 \n", "0 0 0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 0 0 \n", "\n", "[5 rows x 1024 columns]\n" ] } ], "source": [ "from chemml.chem import RDKitFingerprint\n", "\n", "# RDKit fingerprint types: 'morgan', 'hashed_topological_torsion' or 'htt' , 'MACCS' or 'maccs', 'hashed_atom_pair' or 'hap' \n", "morgan_fp = RDKitFingerprint(fingerprint_type='morgan', vector='bit', n_bits=1024, radius=3)\n", "features = morgan_fp.represent(mol_objs_list)\n", "print(features[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Molecule tensors from `chemml.chem.Molecule` objects\n", "\n", "Molecule tensors can be used to create neural graph fingerprints using `chemml.models`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tensorising molecules in batches of 100 ...\n", "500/500 [==================================================] - 1s 1ms/step\n", "Merging batch tensors ... [DONE]\n" ] } ], "source": [ "from chemml.chem import tensorise_molecules\n", "atoms,bonds,edges = tensorise_molecules(molecules=mol_objs_list, max_degree=5, max_atoms=None, n_jobs=-1, batch_size=100, verbose=True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Matrix for atom features (num_molecules, max_atoms, num_atom_features):\n", " (500, 57, 62)\n", "Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):\n", " (500, 57, 5)\n", "Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):\n", " (500, 57, 5, 6)\n" ] } ], "source": [ "print(\"Matrix for atom features (num_molecules, max_atoms, num_atom_features):\\n\", atoms.shape)\n", "print(\"Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):\\n\", edges.shape)\n", "print(\"Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):\\n\", bonds.shape)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.12 ('chemml_dev')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "vscode": { "interpreter": { "hash": "449e066aefa9e8d62513c10717355272508479920eef85d560c0383291a2cfea" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 2 }