The following are the steps for using AutoML for a regression task:
Note: Setting the flag featurization=True represents molecules using the 5 representation techniques listed below.
Requires an input pandas dataframe consisting of two columns:
SMILES strings
target property values
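A minimal sketch of the expected input, assuming the column names `smiles` and `density_Kg/m3` that this notebook uses later on. The three rows are copied from the sampled data shown further down; the name `df_example` and the sanity checks are illustrative additions, not part of ChemML.

```python
import pandas as pd

# Two columns: SMILES strings and the target property values.
# The column names are later passed to the screener by name.
df_example = pd.DataFrame({
    "smiles": [
        "OC1(CCCC1C1CCCC1)c1cnccn1",
        "Cc1csc(n1)c1ccc(cn1)c1cccs1",
        "Sc1cc(c(o1)c1cscc1)S",
    ],
    "density_Kg/m3": [1072.32, 1271.32, 1393.37],
})

# Basic sanity checks before screening: no missing values, numeric target.
has_no_missing = bool(df_example.notna().all().all())
target_is_numeric = pd.api.types.is_numeric_dtype(df_example["density_Kg/m3"])
print(df_example.shape, has_no_missing, target_is_numeric)
```

Checking for missing values and a numeric target up front is cheap and avoids failures partway through featurization.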
Molecules are represented as:
Coulomb matrix
RDKit Morgan fingerprints
MACCS keys
RDKit hashed topological torsion
RDKit molecular descriptors (all)
Screens through various sklearn regressor models:
Yields the ‘n-best’ models, with optimized hyperparameters.
Returns a dataframe of error metrics, machine learning model, algorithm, tuned hyperparameter values, and featurization technique.
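The general idea of screening several regressors and keeping the n best can be sketched with plain scikit-learn. This is not ChemML's internal implementation: the synthetic data, the three candidate models, the hyperparameter grids, and the use of `GridSearchCV` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a featurized dataset (hypothetical data).
X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Candidate models with small hyperparameter grids (illustrative choices).
candidates = {
    "Ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "Lasso": (Lasso(max_iter=10000), {"alpha": [0.01, 0.1, 1.0]}),
    "ElasticNet": (ElasticNet(max_iter=10000), {"alpha": [0.01, 0.1, 1.0]}),
}

rows = []
for name, (model, grid) in candidates.items():
    # Tune hyperparameters by cross-validation on the training split.
    search = GridSearchCV(model, grid, cv=3, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)
    # Score the tuned model on the held-out split.
    y_pred = search.predict(X_test)
    rows.append({
        "MAE": mean_absolute_error(y_test, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, y_pred))),
        "r_squared": r2_score(y_test, y_pred),
        "Model": name,
        "parameters": search.best_params_,
    })

# Keep the n best models, ranked by held-out RMSE (lowest first).
n_best = 2
ranking = pd.DataFrame(rows).sort_values("RMSE").head(n_best).reset_index(drop=True)
print(ranking)
```

Ranking on a held-out split rather than on the training data keeps the comparison between models honest, though with very small datasets the ranking can still be noisy.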
Load your data
[1]:
import pandas as pd
import numpy as np
from chemml.chem import Molecule
from chemml.datasets import load_organic_density
[2]:
molecules, target, dragon_subset = load_organic_density()
df=pd.concat([molecules, target], axis=1)
df = df.sample(25, random_state=42)
df
[2]:
| | smiles | density_Kg/m3 |
|---|---|---|
| 361 | c1scc(n1)c1ncc(s1)c1c(ccc2c1cccc2)c1cncs1 | 1328.35 |
| 73 | S1CC(SC(C1)C1CCCC1)c1cocc1c1ccccc1 | 1143.80 |
| 374 | N1CNC(NC1)c1sccc1C1(CSCCS1)c1cccs1 | 1351.33 |
| 155 | c1nc(c(s1)c1nsnc1)c1csc(n1)c1cccs1 | 1466.57 |
| 104 | Oc1ccc2c(c1c1cocc1)c(ccc2)c1ccccn1 | 1207.67 |
| 394 | Oc1ccc(s1)N1CNC(NC1)c1cnccn1 | 1355.15 |
| 377 | n1csc(n1)C1(CCCC1)c1csc(c1)c1scnc1 | 1311.30 |
| 124 | Oc1ccoc1c1ccnc(c1)c1ccc2c(c1)cccc2 | 1226.34 |
| 68 | SC1CCC(C1c1cnccn1)C1CSCCS1 | 1238.45 |
| 450 | o1ccc(c1)c1sc(cc1c1ccsc1)c1cccc2c1cccc2 | 1246.33 |
| 9 | c1ccc(nc1)c1csc(n1)Oc1ccc2c(c1)cccc2 | 1250.51 |
| 194 | c1cnc(cn1)c1nccnc1c1ccc(o1)c1nccs1 | 1321.35 |
| 406 | C1NC(NC(N1)c1cc(cc2c1cccc2)c1ccccn1)C1CCCC1 | 1106.21 |
| 84 | Sc1oc(c(c1)c1cnccn1)N1CNCNC1 | 1340.03 |
| 371 | n1ccc(cc1)c1nccc(c1c1cccs1)c1ccccn1 | 1190.34 |
| 388 | Sc1ccc2c(c1)c(ccc2)c1cc2ccccc2cc1c1scnc1 | 1215.30 |
| 495 | C1NCN(CN1)c1c(cncc1c1ccco1)c1cocc1 | 1267.09 |
| 30 | Cc1c(cc2c(c1c1cnccn1)cccc2)c1cscn1 | 1213.76 |
| 316 | c1cnc(cn1)C1(CSCC(S1)c1cscn1)c1ccc2c(c1)cccc2 | 1282.68 |
| 408 | Sc1cc(c(o1)c1cscc1)S | 1393.37 |
| 490 | c1nc(c(s1)c1ccco1)Cc1cccc2c1cccc2 | 1218.16 |
| 491 | Cc1csc(n1)c1ccc(cn1)c1cccs1 | 1271.32 |
| 280 | Sc1ccc(nc1)c1ccsc1c1cccc2c1cccc2 | 1227.39 |
| 356 | C1CCC(C1)c1ncc(s1)c1csc(n1)c1nsnc1 | 1335.61 |
| 76 | OC1(CCCC1C1CCCC1)c1cnccn1 | 1072.32 |
Run autoML for a regression task
[3]:
from chemml.autoML import ModelScreener
MS = ModelScreener(df, target="density_Kg/m3", featurization=True, smiles="smiles",
                   screener_type="regressor", output_file="testing.txt")
scores = MS.screen_models(n_best=10, multi_core=False)
featurizing molecules in batches of 2 ...
25/25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19s 758ms/step
Merging batch features ... [DONE]
[11:07:28] DEPRECATION WARNING: please use MorganGenerator
[11:07:28] DEPRECATION WARNING: please use TopologicalTorsionGenerator
split done!
Single-core complete
--- Screening complete for feature set CoulombMatrix, time taken: 24.063 seconds ---
split done!
Single-core complete
--- Screening complete for feature set morganfingerprints_radius3, time taken: 26.558 seconds ---
split done!
Single-core complete
--- Screening complete for feature set MACCS_radius3, time taken: 20.941 seconds ---
split done!
Single-core complete
--- Screening complete for feature set hashedtopologicaltorsion_radius3, time taken: 23.584 seconds ---
split done!
Single-core complete
--- Screening complete for feature set rdkit_descriptors, time taken: 19.822 seconds ---
split done!
Single-core complete
--- Screening complete for feature set mord_descriptors, time taken: 25.504 seconds ---
[4]:
scores
[4]:
| | ME | MAE | MSE | RMSE | MSLE | RMSLE | MAPE | MaxAPE | RMSPE | MPE | MaxAE | deltaMaxE | r_squared | std | time(seconds) | Model | parameters | Feature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.158392 | 3.541820 | 16.228670 | 4.028482 | 0.000010 | 0.003213 | 0.281598 | 0.409798 | 0.321640 | 0.008400 | 5.075142 | 9.790137 | 0.988459 | 37.498218 | 4.714186 | ElasticNet | {'alpha': 0.04832930238571757, 'copy_X': True,... | rdkit_descriptors |
| 18 | 2.210615 | 3.721152 | 16.695492 | 4.086012 | 0.000010 | 0.003225 | 0.292638 | 0.480407 | 0.322172 | 0.178923 | 6.087195 | 8.353000 | 0.988127 | 37.498218 | 2.938873 | SVR | {'C': 6.2105263157894735, 'cache_size': 200, '... | rdkit_descriptors |
| 17 | 4.800724 | 4.800724 | 45.260966 | 6.727627 | 0.000028 | 0.005321 | 0.376698 | 0.901028 | 0.530160 | 0.376698 | 11.416831 | 10.625138 | 0.967811 | 37.498218 | 2.872375 | Ridge | {'alpha': 5.0, 'copy_X': True, 'fit_intercept'... | rdkit_descriptors |
| 19 | 0.096653 | 10.271165 | 119.063575 | 10.911626 | 0.000074 | 0.008620 | 0.808893 | 1.232328 | 0.865350 | -0.012659 | 15.261767 | 24.330125 | 0.915325 | 37.498218 | 4.118833 | Lasso | {'alpha': 0.06951927961775611, 'copy_X': True,... | rdkit_descriptors |
| 23 | 2.001862 | 11.444190 | 195.578785 | 13.984949 | 0.000123 | 0.011071 | 0.897954 | 1.586112 | 1.103684 | 0.187122 | 19.643200 | 33.806692 | 0.860909 | 37.498218 | 8.573116 | Lasso | {'alpha': 0.005455594781168523, 'copy_X': True... | mord_descriptors |
| 24 | 0.098434 | 15.575122 | 306.234840 | 17.499567 | 0.000191 | 0.013812 | 1.220997 | 1.898368 | 1.377200 | 0.044581 | 23.510333 | 42.064068 | 0.782212 | 37.498218 | 10.132495 | ElasticNet | {'alpha': 0.0026366508987303605, 'copy_X': Tru... | mord_descriptors |
| 15 | 8.346180 | 18.830843 | 618.279153 | 24.865220 | 0.000397 | 0.019930 | 1.492118 | 3.217256 | 1.966877 | 0.652719 | 40.765533 | 54.517137 | 0.560293 | 37.498218 | 7.560452 | Lasso | {'alpha': 0.02335721469090124, 'copy_X': True,... | hashedtopologicaltorsion_radius3 |
| 14 | 24.377798 | 24.377798 | 738.728924 | 27.179568 | 0.000467 | 0.021607 | 1.909181 | 2.866370 | 2.134430 | 1.909181 | 35.498564 | 27.813276 | 0.474632 | 37.498218 | 6.243020 | Ridge | {'alpha': 24.6, 'copy_X': True, 'fit_intercept... | hashedtopologicaltorsion_radius3 |
| 22 | 17.411413 | 17.522432 | 784.644116 | 28.011500 | 0.000531 | 0.023046 | 1.412099 | 3.903867 | 2.261465 | 1.403741 | 48.347442 | 48.513971 | 0.441978 | 37.498218 | 5.094646 | SVR | {'C': 16.63157894736842, 'cache_size': 200, 'c... | mord_descriptors |
| 21 | 17.421228 | 17.602028 | 788.271260 | 28.076169 | 0.000534 | 0.023100 | 1.418320 | 3.912741 | 2.266680 | 1.404709 | 48.457341 | 48.728541 | 0.439398 | 37.498218 | 5.041715 | Ridge | {'alpha': 0.1, 'copy_X': True, 'fit_intercept'... | mord_descriptors |
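One way to read the table above is to pick the best model for each feature set. The sketch below rebuilds a few rows and columns by hand (values copied from the table) so that it runs on its own; with the real `scores` dataframe the same `groupby` call should apply, assuming the column names shown above. The names `scores_subset` and `best_per_feature` are illustrative.

```python
import numpy as np
import pandas as pd

# A few rows and columns copied from the table above (index = original row label).
scores_subset = pd.DataFrame(
    {
        "MSE": [16.228670, 16.695492, 195.578785, 618.279153],
        "RMSE": [4.028482, 4.086012, 13.984949, 24.865220],
        "r_squared": [0.988459, 0.988127, 0.860909, 0.560293],
        "Model": ["ElasticNet", "SVR", "Lasso", "Lasso"],
        "Feature": [
            "rdkit_descriptors",
            "rdkit_descriptors",
            "mord_descriptors",
            "hashedtopologicaltorsion_radius3",
        ],
    },
    index=[20, 18, 23, 15],
)

# Best model per feature set: the row with the lowest RMSE within each group.
best_rows = scores_subset.groupby("Feature")["RMSE"].idxmin()
best_per_feature = scores_subset.loc[best_rows].sort_values("RMSE")
print(best_per_feature[["Model", "Feature", "RMSE", "r_squared"]])

# Consistency check: RMSE should equal the square root of MSE.
rmse_matches = bool(
    np.allclose(np.sqrt(scores_subset["MSE"]), scores_subset["RMSE"], atol=1e-4)
)
print(rmse_matches)
```

In this run the RDKit descriptor feature set gives the lowest errors, but with only 25 molecules the ordering should be treated as indicative rather than conclusive.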
Save scores to CSV
[5]:
scores.to_csv("autoML_test.csv",index=False)
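When the file is read back later, dict-valued columns such as `parameters` come back as plain strings. The sketch below shows one way to restore them, using a small hypothetical stand-in for `scores` and a temporary file. It assumes the stored values are plain Python literals; `ast.literal_eval` will fail if the real column contains non-literal objects.

```python
import ast
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for `scores`, with a dict-valued `parameters` column.
scores_demo = pd.DataFrame({
    "RMSE": [4.03, 6.73],
    "Model": ["ElasticNet", "Ridge"],
    "parameters": [
        {"alpha": 0.05, "copy_X": True},
        {"alpha": 5.0, "copy_X": True},
    ],
})

# Round trip through a CSV file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "autoML_demo.csv")
    scores_demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# After the round trip the dicts are strings ...
type_after_reload = type(reloaded.loc[0, "parameters"])
print(type_after_reload)

# ... so parse them back into Python dicts.
reloaded["parameters"] = reloaded["parameters"].apply(ast.literal_eval)
print(reloaded.loc[1, "parameters"]["alpha"])
```

Note that `index=False` drops the original row labels, so keep the index if those labels are needed to match rows back to the screening run.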