Neural Fingerprints

We create atom, bond, and edge tensors from molecule SMILES using chemml.chem.tensorise_molecules, then build neural fingerprints with the chemml.models.NeuralGraphHidden and chemml.models.NeuralGraphOutput layers. These fingerprints serve as features for a simple feed-forward neural network, built with TensorFlow, that predicts the densities of small organic compounds.

Here we import a sample dataset from the ChemML library, which contains the SMILES codes of 500 small organic molecules along with their densities in \(kg/m^3\).

[1]:
import numpy as np
from chemml.datasets import load_organic_density
molecules, target, dragon_subset = load_organic_density()
target = np.asarray(target['density_Kg/m3'])

Build chemml.chem.Molecule objects from the SMILES strings, add explicit hydrogens, and generate 3D coordinates with the MMFF94s force field.

[2]:
from chemml.chem import Molecule
mol_objs_list = []
for smi in molecules['smiles']:
    mol = Molecule(smi, 'smiles')
    mol.hydrogens('add')
    mol.to_xyz('MMFF', maxIters=10000, mmffVariant='MMFF94s')
    mol_objs_list.append(mol)

The Molecule objects are converted to atom, bond, and edge tensors, which chemml.models then uses to create neural graph fingerprints.

[3]:
from chemml.chem import tensorise_molecules
xatoms, xbonds, xedges = tensorise_molecules(molecules=mol_objs_list, max_degree=5,
                                        max_atoms=None, n_jobs=-1, batch_size=100, verbose=True)
Tensorising molecules in batches of 100 ...
500/500 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8s 16ms/step
Merging batch tensors ...    [DONE]
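
To make the layout of the three tensors concrete, here is a minimal NumPy sketch for a single toy molecule with three heavy atoms (C-C-O) padded to four atom slots. The one-hot features, the variable names, and the use of -1 as the padding value in the edge tensor are illustrative assumptions, not values taken from ChemML.

```python
import numpy as np

PAD = -1  # assumed padding value for "no neighbor"; hypothetical for this sketch

# Edge tensor for one molecule: row i lists the neighbor indices of atom i.
# Shape (max_atoms, max_degree) = (4, 2); the last row is a padding atom.
edges = np.array([
    [1, PAD],    # atom 0 (C) is bonded to atom 1
    [0, 2],      # atom 1 (C) is bonded to atoms 0 and 2
    [1, PAD],    # atom 2 (O) is bonded to atom 1
    [PAD, PAD],  # padding row, no real atom
])

# Toy atom features, one-hot [is_carbon, is_oxygen]; shape (max_atoms, 2).
atoms = np.array([
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
])

# Degree of each atom = number of non-padding entries in its edge row.
degree = (edges != PAD).sum(axis=1)

# Sum of neighbor features: append a zero row so that index -1 selects zeros.
padded_atoms = np.vstack([atoms, np.zeros((1, atoms.shape[1]))])
neighbor_sum = padded_atoms[edges].sum(axis=1)

print(degree)        # number of bonded neighbors per atom slot
print(neighbor_sum)  # summed features of each atom's neighbors
```

Gathering neighbor features through the edge tensor in this way is the basic operation that a graph-convolution layer repeats for every atom in every molecule of the batch.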

Splitting and preprocessing the data

[4]:
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler
y_scale = StandardScaler()
rs = ShuffleSplit(n_splits=1, test_size=.20, random_state=42)

for train, test in rs.split(mol_objs_list):
    xatoms_train = xatoms[train]
    xatoms_test = xatoms[test]
    xbonds_train = xbonds[train]
    xbonds_test = xbonds[test]
    xedges_train = xedges[train]
    xedges_test = xedges[test]
    target_train = target[train]
    target_test = target[test]
    target_train = y_scale.fit_transform(target_train.reshape(-1,1))
[5]:
print('Training data:\n')
print('Atoms: ',xatoms_train.shape)
print('Bonds: ',xbonds_train.shape)
print('Edges: ',xedges_train.shape)
print('Target: ',target_train.shape)

print('\nTesting data:\n')
print('Atoms: ',xatoms_test.shape)
print('Bonds: ',xbonds_test.shape)
print('Edges: ',xedges_test.shape)
print('Target: ',target_test.shape)
Training data:

Atoms:  (400, 57, 62)
Bonds:  (400, 57, 5, 6)
Edges:  (400, 57, 5)
Target:  (400, 1)

Testing data:

Atoms:  (100, 57, 62)
Bonds:  (100, 57, 5, 6)
Edges:  (100, 57, 5)
Target:  (100,)
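
The training target has shape (400, 1) because it was reshaped and standardized, while the test target keeps its original shape and units. The sketch below reproduces the idea with plain NumPy on made-up density values: the mean and standard deviation come from the training data only, and predictions are mapped back to the original units before being compared with the test targets. It mirrors what StandardScaler does by default (population standard deviation), but the numbers are invented for illustration.

```python
import numpy as np

# Made-up densities in kg/m^3, for illustration only.
y_train = np.array([700.0, 900.0, 1100.0])
y_test = np.array([800.0, 1000.0])

# Fit the scaling parameters on the training targets only.
mean = y_train.mean()
std = y_train.std()  # population standard deviation (ddof=0)

# Standardize the training targets: zero mean, unit variance.
y_train_scaled = (y_train - mean) / std

# Stand-in for model output in scaled units (here: a perfect prediction).
pred_scaled = (y_test - mean) / std

# Undo the scaling so predictions are in kg/m^3 again.
pred = pred_scaled * std + mean

print(y_train_scaled)
print(pred)
```

Fitting the scaler on the training split alone keeps information about the test targets out of the training procedure.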

Building the Neural Fingerprints

The atom, bond, and edge tensors are used here to build a neural fingerprint of length 200 (fp_length). Two graph-convolution layers each produce 8 features per atom (conv_width); every additional layer lets an atom's features draw on a wider neighborhood of the molecular graph.

[6]:
from chemml.models import NeuralGraphHidden, NeuralGraphOutput
from tensorflow.keras.layers import Input, add
import tensorflow as tf
tf.random.set_seed(42)

conv_width = 8
fp_length = 200

num_molecules = xatoms_train.shape[0]
max_atoms = xatoms_train.shape[1]
max_degree = xbonds_train.shape[2]
num_atom_features = xatoms_train.shape[-1]
num_bond_features = xbonds_train.shape[-1]

# Creating input layers for atom, bond, and edge information
atoms0 = Input(name='atom_inputs', shape=(max_atoms, num_atom_features),batch_size=None)
bonds = Input(name='bond_inputs', shape=(max_atoms, max_degree, num_bond_features),batch_size=None)
edges = Input(name='edge_inputs', shape=(max_atoms, max_degree), dtype='int32',batch_size=None)

# Defining the convolved atom feature layers
atoms1 = NeuralGraphHidden(conv_width, activation='relu', use_bias=False)([atoms0, bonds, edges])
atoms2 = NeuralGraphHidden(conv_width, activation='relu', use_bias=False)([atoms1, bonds, edges])

# Defining the outputs of each (convolved) atom feature layer to fingerprint
fp_out0 = NeuralGraphOutput(fp_length, activation='softmax')([atoms0,bonds,edges])
fp_out1 = NeuralGraphOutput(fp_length, activation='softmax')([atoms1,bonds,edges])
fp_out2 = NeuralGraphOutput(fp_length, activation='softmax')([atoms2,bonds,edges])

# Sum outputs to obtain fingerprint
final_fp = add([fp_out0, fp_out1, fp_out2])
print('Neural Fingerprint Shape: ',final_fp.shape)
Neural Fingerprint Shape:  (None, 200)
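
The readout step can be sketched in NumPy as follows: each layer's per-atom features pass through a dense map and a softmax, the resulting rows are summed over the real atoms, and the per-layer vectors are added together. This is a simplified illustration under stated assumptions, not ChemML's implementation: it ignores bond features, uses random weights, and the helper names (softmax, fingerprint_readout, atom_mask) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def fingerprint_readout(atom_feats, weights, bias, atom_mask):
    """Map (max_atoms, n_feat) atom features to one (fp_length,) vector."""
    scores = softmax(atom_feats @ weights + bias)      # (max_atoms, fp_length)
    return (scores * atom_mask[:, None]).sum(axis=0)   # skip padded atoms

rng = np.random.default_rng(0)
max_atoms, fp_length = 4, 6
atom_mask = np.array([1.0, 1.0, 1.0, 0.0])  # 3 real atoms, 1 padding slot

# Stand-ins for the input features and two convolved feature layers.
layer_feats = [
    rng.normal(size=(max_atoms, 5)),
    rng.normal(size=(max_atoms, 3)),
    rng.normal(size=(max_atoms, 3)),
]

# Sum the readouts of all layers to obtain the final fingerprint.
fingerprint = np.zeros(fp_length)
for feats in layer_feats:
    w = rng.normal(size=(feats.shape[1], fp_length))
    b = np.zeros(fp_length)
    fingerprint += fingerprint_readout(feats, w, b, atom_mask)

print(fingerprint.shape)
print(fingerprint.sum())
```

Because every real atom contributes one softmax row that sums to 1, the entries of this toy fingerprint add up to the number of real atoms times the number of layers.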

Building and training the neural network

Here, we build and train a simple feed-forward neural network using tensorflow.keras, with the neural fingerprint as its input features.

[7]:
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense

# Build and compile model for regression.
dense_layer0 = Dense(128,activation='relu',name='dense_layer0',
                     kernel_regularizer=tf.keras.regularizers.l2(0.01))(final_fp)
dense_layer1 = Dense(64,activation='relu',name='dense_layer1',
                     kernel_regularizer=tf.keras.regularizers.l2(0.01))(dense_layer0)
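# Note: dense_layer2 is defined but not connected to the output below;
# the prediction head is attached to dense_layer1, as the model summary shows.
dense_layer2 = Dense(32,activation='relu',name='dense_layer2',
                     kernel_regularizer=tf.keras.regularizers.l2(0.01))(dense_layer1)

main_prediction = Dense(1, activation='linear', name='main_prediction')(dense_layer1)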
model = Model(inputs=[atoms0, bonds, edges], outputs=[main_prediction])
model.compile(optimizer='adam', loss='mae')

# Show summary
model.summary()

model.fit([xatoms_train, xbonds_train, xedges_train], target_train, epochs=50,
          steps_per_epoch=None, batch_size=None,verbose=False,validation_split=0.1)
WARNING:tensorflow:TensorFlow GPU support is not available on native Windows for TensorFlow >= 2.11. Even if CUDA/cuDNN are installed, GPU will not be used. Please use WSL2 or the TensorFlow-DirectML plugin.
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ atom_inputs         │ (None, 57, 62)    │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ bond_inputs         │ (None, 57, 5, 6)  │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ edge_inputs         │ (None, 57, 5)     │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ neural_graph_hidden │ (None, 57, 8)     │      2,720 │ atom_inputs[0][0… │
│ (NeuralGraphHidden) │                   │            │ bond_inputs[0][0… │
│                     │                   │            │ edge_inputs[0][0] │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ neural_graph_hidde… │ (None, 57, 8)     │        560 │ neural_graph_hid… │
│ (NeuralGraphHidden) │                   │            │ bond_inputs[0][0… │
│                     │                   │            │ edge_inputs[0][0] │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ neural_graph_output │ (None, 200)       │     13,800 │ atom_inputs[0][0… │
│ (NeuralGraphOutput) │                   │            │ bond_inputs[0][0… │
│                     │                   │            │ edge_inputs[0][0] │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ neural_graph_outpu… │ (None, 200)       │      3,000 │ neural_graph_hid… │
│ (NeuralGraphOutput) │                   │            │ bond_inputs[0][0… │
│                     │                   │            │ edge_inputs[0][0] │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ neural_graph_outpu… │ (None, 200)       │      3,000 │ neural_graph_hid… │
│ (NeuralGraphOutput) │                   │            │ bond_inputs[0][0… │
│                     │                   │            │ edge_inputs[0][0] │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ add (Add)           │ (None, 200)       │          0 │ neural_graph_out… │
│                     │                   │            │ neural_graph_out… │
│                     │                   │            │ neural_graph_out… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense_layer0        │ (None, 128)       │     25,728 │ add[0][0]         │
│ (Dense)             │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense_layer1        │ (None, 64)        │      8,256 │ dense_layer0[0][… │
│ (Dense)             │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ main_prediction     │ (None, 1)         │         65 │ dense_layer1[0][… │
│ (Dense)             │                   │            │                   │
└─────────────────────┴───────────────────┴────────────┴───────────────────┘
 Total params: 57,129 (223.16 KB)
 Trainable params: 57,129 (223.16 KB)
 Non-trainable params: 0 (0.00 B)
[7]:
<keras.src.callbacks.history.History at 0x1e6eaea49e0>
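
The parameter counts in the summary can be checked with a little arithmetic. The sketch below assumes that each NeuralGraphHidden layer holds one bias-free weight matrix per degree acting on the concatenated atom and bond features, and that each NeuralGraphOutput layer is a single dense map with a bias. These layouts are inferred from the reported numbers, not taken from the ChemML source, and the dictionary keys are illustrative labels rather than the actual layer names.

```python
# Sizes read from the tensor shapes and hyperparameters in this notebook.
n_atom_features, n_bond_features = 62, 6
conv_width, fp_length, max_degree = 8, 200, 5

def hidden_params(n_in):
    # Assumed layout: one (n_in + bond) x conv_width matrix per degree, no bias.
    return (n_in + n_bond_features) * conv_width * max_degree

def output_params(n_in):
    # Assumed layout: one (n_in + bond) x fp_length matrix plus a bias vector.
    return (n_in + n_bond_features) * fp_length + fp_length

def dense_params(n_in, n_out):
    # Standard dense layer: weight matrix plus bias vector.
    return n_in * n_out + n_out

counts = {
    "hidden_1": hidden_params(n_atom_features),  # on raw atom features
    "hidden_2": hidden_params(conv_width),       # on convolved features
    "output_0": output_params(n_atom_features),
    "output_1": output_params(conv_width),
    "output_2": output_params(conv_width),
    "dense_layer0": dense_params(fp_length, 128),
    "dense_layer1": dense_params(128, 64),
    "main_prediction": dense_params(64, 1),
}
total = sum(counts.values())

for name, value in counts.items():
    print(f"{name:16s} {value:>7,d}")
print(f"{'total':16s} {total:>7,d}")
```

Under these assumptions every per-layer count and the total agree with the summary, which also confirms that dense_layer2 contributes no parameters to the trained model.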

We predict the densities of the test molecules, transform the predictions back to kg/m^3 with the fitted scaler, and evaluate the model against the true test values.

[8]:
from chemml.utils import regression_metrics

y_pred = model.predict([xatoms_test,xbonds_test,xedges_test])
y_pred = y_scale.inverse_transform(y_pred)
metrics_df = regression_metrics(target_test, list(y_pred.reshape(-1,)))
mae = metrics_df['MAE'].values[0]
r_2 = metrics_df['r_squared'].values[0]

print("Mean Absolute Error = {} kg/m^3".format(mae.round(3)))
print("R squared = {}".format(r_2.round(3)))
4/4 ━━━━━━━━━━━━━━━━━━━━ 6s 1s/step
Mean Absolute Error = 15.303 kg/m^3
R squared = 0.945
[9]:
metrics_df
[9]:
         ME        MAE         MSE       RMSE      MSLE     RMSLE      MAPE    MaxAPE     RMSPE       MPE      MaxAE   deltaMaxE  r_squared        std
0  -6.17789  15.302616  419.635458  20.485006  0.000286  0.016918  1.238328  7.243467  1.715904 -0.526769  72.840308  111.701436   0.945338  87.617825
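
For reference, the two headline metrics can be computed by hand. The sketch below uses the textbook definitions of mean absolute error and the coefficient of determination on a small made-up example; the exact formulas used inside chemml.utils.regression_metrics may differ, so this is an illustration of the concepts rather than a reimplementation.

```python
import numpy as np

# Made-up true and predicted densities in kg/m^3, for illustration only.
y_true = np.array([1000.0, 1100.0, 1200.0, 1300.0])
y_pred = np.array([1010.0, 1090.0, 1220.0, 1280.0])

# Mean absolute error: average magnitude of the prediction errors.
mae = np.mean(np.abs(y_true - y_pred))

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("MAE =", mae)              # 15.0
print("R squared =", r_squared)  # 0.98
```

The MAE is expressed in the units of the target, which is why the predictions must be inverse-transformed before the metric is computed.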