Build a simple ML model

This page provides an example of a simple machine learning model developed using the GUI. We use the sample dataset from our library, which contains the SMILES strings of 500 compounds and their respective Highest Occupied Molecular Orbital (HOMO) energies in eV. The compounds are represented as Morgan fingerprints, which are available through the RDKit library. The data is split into a training set (75%) and a testing set (25%).
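
As a rough illustration of the split described above, the sketch below applies scikit-learn's `train_test_split` to placeholder data of the same size. The random bit matrix, the 1024-bit fingerprint length, and the `random_state` value are assumptions made for this example; they are not taken from the sample dataset or from the workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 500 rows of random 1024-bit vectors stand in
# for the Morgan fingerprints, and random numbers stand in for HOMO energies.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 1024))
y = rng.normal(size=500)

# test_size=0.25 holds out 25% of the rows for testing; the rest is training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)  # (375, 1024) (125, 1024)
```

With 500 compounds, this ratio leaves 375 compounds for training and 125 for testing.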

We fit a Multi-Layer Perceptron Regressor, which is available through scikit-learn, to the Morgan fingerprints and HOMO energies of the training data. We then use the fitted model to predict the HOMO energies of the testing data.

[1]:
from chemml.wrapper.notebook import ChemMLNotebook
ui = ChemMLNotebook()
The computation graph will be displayed here:

The ChemML Wrapper's config file has been successfully saved ...
    config file path: simple_ML_workflow.txt
    current directory: /mnt/c/Aatish/UB/Mr. Hachmann/master_chemml_wrapper_v2/chemml/docs
    what's next? run the ChemML Wrapper using the config file with the following codes:
        >>> from chemml.wrapper.engine import run
        >>> run(INPUT_FILE = 'path_to_the_config_file', OUTPUT_DIRECTORY = 'CMLWrapper_out')
... you can also create a python script of the above codes and run it on any cluster that ChemML is installed.

The workflow gives a precise representation of all the intermediate steps, blocks used to develop the model, the saved data and the inputs/outputs to each task. Once the workflow is finalized, we save the input script with our desired file name in .txt format.

The GUI provides the file details and the steps to be followed to run the script, as shown in the output above.

Note: In this case, we specify our desired output directory as ‘Simple_ML_workflow’.

[3]:
from chemml.wrapper.engine import run
run(INPUT_FILE = '/mnt/c/Aatish/UB/Mr. Hachmann/master_chemml_wrapper_v2/chemml/docs/simple_ML_workflow.txt', OUTPUT_DIRECTORY = 'Simple_ML_workflow')
=================================================
=================================================
Fri May  7 09:26:39 2021

parsing the input file: /mnt/c/Aatish/UB/Mr. Hachmann/master_chemml_wrapper_v2/chemml/docs/simple_ML_workflow.txt ...

=================================================

======= block#1: (chemml, load_cep_homo)
| run ...

| ... done!
| execution time: 3.31s (0h 0m 3.31s)
=======


======= block#2: (chemml, SaveFile)
| run ...

| ... done!
| execution time: 0.03s (0h 0m 0.03s)
=======


======= block#3: (chemml, RDKitFingerprint)
| run ...

| ... done!
| execution time: 1.03s (0h 0m 1.03s)
=======


======= block#4: (chemml, SaveFile)
| run ...

| ... done!
| execution time: 0.11s (0h 0m 0.11s)
=======


======= block#5: (sklearn, train_test_split)
| run ...

| ... done!
| execution time: 0.02s (0h 0m 0.02s)
=======


======= block#6: (sklearn, MLPRegressor)
| run ...

/home/aatishpr/anaconda3/envs/v2_0.7/lib/python3.8/site-packages/sklearn/utils/validation.py:63: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(*args, **kwargs)
/home/aatishpr/anaconda3/envs/v2_0.7/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
| ... done!
| execution time: 3.26s (0h 0m 3.26s)
=======


======= block#7: (sklearn, MLPRegressor)
| run ...

| ... done!
| execution time: 0.04s (0h 0m 0.04s)
=======


======= block#8: (chemml, SaveFile)
| run ...

| ... done!
| execution time: 0.01s (0h 0m 0.01s)
=======


======= block#9: (chemml, scatter2D)
| run ...

| ... done!
| execution time: 0.06s (0h 0m 0.06s)
=======


======= block#11: (chemml, decorator)
| run ...

| ... done!
| execution time: 0.01s (0h 0m 0.01s)
=======


======= block#10: (chemml, SavePlot)
| run ...

The Plot has been saved at:  Simple_ML_workflow/./dfy_actual_vs_dfy_predict.png
| ... done!
| execution time: 0.14s (0h 0m 0.14s)
=======


Total execution time: 8.03s (0h 0m 8.03s)
2021-05-07 09:26:47
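
The two warnings in block#6 of the log above come from scikit-learn rather than from ChemML. One possible way to address them when calling scikit-learn directly, sketched below on small synthetic data, is to flatten the target with `ravel()` and to raise `max_iter` above its default of 200. The data, the hidden layer size, and `max_iter=2000` are illustrative assumptions, not the settings used in this workflow.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning, DataConversionWarning
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data (assumption): 60 samples with 32 binary features.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(60, 32)).astype(float)

# A column vector of shape (60, 1), the shape that triggers the first warning.
y_train = X_train[:, :4].sum(axis=1, keepdims=True)

# Flatten to shape (n_samples,), as the DataConversionWarning suggests.
y_flat = y_train.ravel()

# Allow more optimizer iterations than the default of 200.
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)

# Record any warnings raised during fitting so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X_train, y_flat)

shape_warnings = [w for w in caught if issubclass(w.category, DataConversionWarning)]
converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)

print("shape warnings:", len(shape_warnings))
print("iterations used:", model.n_iter_, "converged:", converged)
```

If `converged` is still `False`, the iteration limit was reached again, and a larger `max_iter` or a different learning rate may be worth trying.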

We create a plot to compare the predicted values of the HOMO energies to the actual values of the HOMO energies.

[4]:
from IPython.display import Image
Image(filename='Simple_ML_workflow/./dfy_actual_vs_dfy_predict.png')
[4]:
../_images/ipython_notebooks_simple_ml_model_5_0.png

We use a parity plot to compare the model’s predicted values with the actual values. If the predictions were perfectly accurate, all of the points would lie on the line y = x.
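
To put a number on how closely the points follow y = x, one could compute simple error metrics from the actual and predicted values. The helper below is a hypothetical sketch: the name `parity_metrics` and the toy values are made up for illustration and are not part of ChemML or of the workflow output.

```python
import numpy as np


def parity_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 of predictions measured against the line y = x."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_pred - y_true  # vertical distance of each point from y = x
    mae = float(np.mean(np.abs(residuals)))
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Toy values (not real HOMO energies).
actual = [1.0, 2.0, 3.0, 4.0]
perfect = parity_metrics(actual, actual)                # every point on y = x
shifted = parity_metrics(actual, [1.5, 2.5, 3.5, 4.5])  # constant offset of 0.5

print(perfect)
print(shifted)
```

Points lying exactly on y = x give an MAE and RMSE of 0 and an R² of 1. The constant offset of 0.5 in the second call gives an MAE and RMSE of 0.5 and lowers R² to 0.8.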