The following are the steps for using AutoML for a regression task:

Note: Setting featurization=True represents each molecule using the 5 representation techniques listed in step 2.

  1. Provide an input pandas dataframe consisting of two columns:

    • SMILES strings

    • target property values

  2. Molecules are represented as:

    • Coulomb matrix

    • RDKit Morgan fingerprints

    • MACCS keys

    • RDKit hashed topological torsion fingerprints

    • RDKit molecular descriptors (all)

  3. AutoML screens various scikit-learn regressor models:

    • Yields the ‘n-best’ models, with optimized hyperparameters.

    • Returns a dataframe of error metrics, machine learning model, algorithm, tuned hyperparameter values and featurization technique.
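
The input requirement in step 1 can be checked before any screening starts. The following is a minimal sketch, assuming only pandas; `check_screener_input` is a hypothetical helper and not part of ChemML, and the two example rows are copied from the sample output in this tutorial.

```python
import pandas as pd


def check_screener_input(df, smiles_col, target_col):
    """Hypothetical helper (not a ChemML function): validate a SMILES/target table."""
    # Both columns must be present
    missing = [c for c in (smiles_col, target_col) if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Every SMILES entry must be a string (this also rejects NaN)
    if not df[smiles_col].map(lambda s: isinstance(s, str)).all():
        raise ValueError("SMILES column must contain only strings")
    # The target property must be numeric and complete
    if not pd.api.types.is_numeric_dtype(df[target_col]):
        raise ValueError("target column must be numeric")
    if df[target_col].isna().any():
        raise ValueError("target column contains missing values")
    # Return only the two columns the screener needs
    return df[[smiles_col, target_col]].reset_index(drop=True)


example = pd.DataFrame({
    "smiles": ["CC1CCC(C1)C1CCCC1c1cccs1", "o1ccc(c1)CC1CCCC1c1cccs1"],
    "density_Kg/m3": [1005.60, 1088.77],
})
clean = check_screener_input(example, "smiles", "density_Kg/m3")
print(clean)
```

Running a check like this first is cheap compared with featurization, so a malformed table fails in seconds rather than partway through a long screening run.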

Load your data

[1]:
import pandas as pd
import numpy as np
from chemml.chem import Molecule
from chemml.datasets import load_organic_density
[2]:
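# Load the SMILES strings and densities, then join them into one dataframe
molecules, target, dragon_subset = load_organic_density()
df = pd.concat([molecules, target], axis=1)
# Keep a random sample of 25 molecules
df = df.sample(25)
df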
[2]:
smiles density_Kg/m3
188 n1ccc(cc1)c1scnc1c1ncccc1c1cccc2c1cccc2 1203.16
111 Cc1cc2ccccc2c(c1)c1sccc1c1cscc1 1199.41
0 C1CSC(CS1)c1ncc(s1)CC1CCCC1 1184.64
30 Cc1c(cc2c(c1c1cnccn1)cccc2)c1cscn1 1213.76
328 c1ccc(nc1)c1nnc(s1)Sc1cccs1 1374.07
270 Oc1ccc(c2c1cccc2c1ncsc1)c1ccco1 1290.56
68 SC1CCC(C1c1cnccn1)C1CSCCS1 1238.45
293 OC1NCN(CN1)c1ccc(cc1)c1scnn1 1366.07
379 C1CSC(CS1)c1ccccc1c1ccc(cc1)c1ccco1 1193.12
253 c1cnc(cn1)c1ccc2c(c1)cccc2c1csc(n1)c1ccco1 1258.63
107 n1ccc(cc1)c1cscc1c1cccc2c1cc(cc2)c1cccnc1 1186.29
115 c1cnc(cn1)c1csc(n1)C1(CCCC1)c1ccncc1 1209.59
15 CC1CCC(C1)C1CCCC1c1cccs1 1005.60
218 CC1(CCCC1)c1ccc(s1)c1scnn1 1199.65
86 c1cnc(cn1)c1coc(c1)c1nccc(c1)c1cccc2c1cccc2 1209.81
391 c1cnc(cn1)c1nc(c(s1)c1nncs1)c1cnccn1 1403.86
28 s1cnc(c1)c1cc(cc2c1cccc2)c1csc(c1)c1scnc1 1322.37
366 n1ccc(cc1)C1NCNCN1c1cncc(c1)c1cccnc1 1235.90
309 C1NC(NC(N1)c1nncs1)c1ccc2c(c1)cc(cc2)c1cccnc1 1282.84
401 C1CCC(C1)(c1cncs1)c1cc2ccccc2c(c1)c1cscn1 1211.16
25 Cc1c2ccccc2ccc1C1(CSCCS1)C1NCNCN1 1232.82
216 Cc1ccc2c(c1C1NCNCN1C1NCNCN1)cccc2 1196.18
469 C1SCC(SC1)c1nsnc1C1(CCCC1)c1ccco1 1270.09
47 o1ccc(c1)CC1CCCC1c1cccs1 1088.77
186 SC1CCC(C1)C1CSCC(S1)c1cccnc1 1198.04

Run AutoML for a regression task

[3]:
from chemml.autoML import ModelScreener
MS = ModelScreener(df, target="density_Kg/m3", featurization=True, smiles="smiles",
                   screener_type="regressor", output_file="testing.txt")
scores = MS.screen_models(n_best=4)
featurizing molecules in batches of 2 ...
25/25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15s 614ms/step
Merging batch features ...    [DONE]
[09:29:45] DEPRECATION WARNING: please use MorganGenerator
[09:29:45] DEPRECATION WARNING: please use TopologicalTorsionGenerator
split done!



--- 1460.8564009666443 seconds ---
split done!



--- 1754.8138766288757 seconds ---
split done!



--- 556.0454123020172 seconds ---
split done!



--- 2021.2931699752808 seconds ---
split done!



--- 450.1350498199463 seconds ---
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 4
      1 from chemml.autoML import ModelScreener
      2 MS = ModelScreener(df, target="density_Kg/m3", featurization=True, smiles="smiles",
      3                    screener_type="regressor", output_file="testing.txt")
----> 4 scores = MS.screen_models(n_best=4)

File c:\users\nitin\documents\ub\hachmann_group\chemml_dev_nitin\chemml\chemml\autoML\model_screener.py:459, in ModelScreener.screen_models(self, n_best)
    456     print("\n--- %s seconds ---" % (time.time() - start_time))
    458 # aggregate scores list
--> 459 best_models = self.aggregate_scores(scores_list=scores_list_final, n_best=n_best)
    461 return best_models

File c:\users\nitin\documents\ub\hachmann_group\chemml_dev_nitin\chemml\chemml\autoML\model_screener.py:374, in ModelScreener.aggregate_scores(self, scores_list, n_best)
    350 def aggregate_scores(self,  scores_list, n_best):
    351     """ 
    352     This function aggregates a list of scores, combines them into a pandas dataframe, sorts them by
    353     RMSE in ascending order, and returns the top n_best scores.
   (...)
    370         the top n_best scores from the combined scores list, sorted by RMSE in ascending order.
    371     """
--> 374     scores_combined = pd.concat(scores_list, ignore_index=True)
    376     if self.screener_type == "regressor":
    377         self.scores_combined = scores_combined.sort_values(by='RMSE', ascending=True)

File c:\Users\nitin\anaconda3\envs\chemml_dev_env\Lib\site-packages\pandas\core\reshape\concat.py:382, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    379 elif copy and using_copy_on_write():
    380     copy = False
--> 382 op = _Concatenator(
    383     objs,
    384     axis=axis,
    385     ignore_index=ignore_index,
    386     join=join,
    387     keys=keys,
    388     levels=levels,
    389     names=names,
    390     verify_integrity=verify_integrity,
    391     copy=copy,
    392     sort=sort,
    393 )
    395 return op.get_result()

File c:\Users\nitin\anaconda3\envs\chemml_dev_env\Lib\site-packages\pandas\core\reshape\concat.py:445, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    442 self.verify_integrity = verify_integrity
    443 self.copy = copy
--> 445 objs, keys = self._clean_keys_and_objs(objs, keys)
    447 # figure out what our result ndim is going to be
    448 ndims = self._get_ndims(objs)

File c:\Users\nitin\anaconda3\envs\chemml_dev_env\Lib\site-packages\pandas\core\reshape\concat.py:507, in _Concatenator._clean_keys_and_objs(self, objs, keys)
    504     objs_list = list(objs)
    506 if len(objs_list) == 0:
--> 507     raise ValueError("No objects to concatenate")
    509 if keys is None:
    510     objs_list = list(com.not_none(*objs_list))

ValueError: No objects to concatenate
[4]:
scores
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 scores

NameError: name 'scores' is not defined
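
The two errors above are linked: `pd.concat` raises `ValueError: No objects to concatenate` when it receives an empty list, which means the list of score tables passed to `aggregate_scores` was empty, and because `screen_models` never returned, `scores` was never assigned. Why no score tables were produced is not visible in the output. The sketch below, assuming only pandas, mimics the aggregation described in the docstring (concatenate, sort by RMSE ascending, keep the top `n_best`) and adds an explicit check for the empty case. `aggregate_scores_sketch` is a hypothetical stand-in, not ChemML's implementation, and the score tables use placeholder values.

```python
import pandas as pd


def aggregate_scores_sketch(scores_list, n_best):
    """Hypothetical stand-in for the aggregation step, not ChemML's code."""
    if len(scores_list) == 0:
        # Fail early with a message that points at the screening step
        raise RuntimeError("no score tables were produced, so there is nothing to rank")
    combined = pd.concat(scores_list, ignore_index=True)
    # Lower RMSE is better, so sort ascending and keep the first n_best rows
    return combined.sort_values(by="RMSE", ascending=True).head(n_best)


# Placeholder score tables with made-up values, for illustration only
table_a = pd.DataFrame({"Model": ["model_a", "model_b"], "RMSE": [3.0, 1.0]})
table_b = pd.DataFrame({"Model": ["model_c"], "RMSE": [2.0]})
top = aggregate_scores_sketch([table_a, table_b], n_best=2)
print(top)

# An empty list reproduces the error message from the traceback
empty_message = None
try:
    pd.concat([], ignore_index=True)
except ValueError as err:
    empty_message = str(err)
print(empty_message)
```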

Save scores to CSV

[5]:
scores.to_csv("autoML_test.csv", index=False)