The following are the steps for using AutoML for a regression task:

Note: Setting the flag featurization=True represents each molecule using 5 representation techniques.

  1. Requires an input pandas dataframe consisting of two columns:

    • SMILES strings

    • target property values

  2. Molecules are represented as:

    • Coulomb matrix

    • RDKit Morgan fingerprints

    • MACCS keys

    • RDKit hashed topological torsion fingerprints

    • RDKit molecular descriptors (all)

  3. Screens through various sklearn regressor models:

    • Yields the ‘n-best’ models, with optimized hyperparameters.

    • Returns a dataframe of error metrics, machine learning model, algorithm, tuned hyperparameter values and featurization technique.
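As a minimal sketch of the input described in step 1, the helper below keeps only a SMILES column and a numeric target column and drops incomplete rows. The function name `check_screener_input` is hypothetical and not part of ChemML, and the toy values are illustrative.

```python
import pandas as pd


def check_screener_input(df, smiles_col, target_col):
    """Return a cleaned two-column copy of df, or raise ValueError.

    Hypothetical helper for illustration; not a ChemML function.
    """
    missing = [c for c in (smiles_col, target_col) if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    out = df[[smiles_col, target_col]].copy()
    # Coerce the target to numeric; entries that cannot be parsed become NaN.
    out[target_col] = pd.to_numeric(out[target_col], errors="coerce")
    # Drop rows with a missing SMILES string or a missing target value.
    out = out.dropna(subset=[smiles_col, target_col])
    return out.reset_index(drop=True)


# Toy input: two complete rows, one missing SMILES, one unparsable target.
raw = pd.DataFrame({
    "smiles": ["Cc1csc(n1)c1ccc(cn1)c1cccs1", "Sc1cc(c(o1)c1cscc1)S", None, "c1ccccc1"],
    "density_Kg/m3": [1271.32, 1393.37, 1100.0, "n/a"],
})
clean = check_screener_input(raw, "smiles", "density_Kg/m3")
```

Only the two complete rows survive, so `clean` has the two-column shape that the screener is described as expecting.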

Load your data

[1]:
import pandas as pd
import numpy as np
from chemml.chem import Molecule
from chemml.datasets import load_organic_density
[2]:
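# Load SMILES strings and density targets (the Dragon descriptor subset is not used here)
molecules, target, dragon_subset = load_organic_density()
# Combine them into the two-column dataframe described in step 1
df = pd.concat([molecules, target], axis=1)
# Draw a reproducible random sample of 25 molecules
df = df.sample(25, random_state=42)
df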
[2]:
     smiles                                         density_Kg/m3
361  c1scc(n1)c1ncc(s1)c1c(ccc2c1cccc2)c1cncs1            1328.35
 73  S1CC(SC(C1)C1CCCC1)c1cocc1c1ccccc1                   1143.80
374  N1CNC(NC1)c1sccc1C1(CSCCS1)c1cccs1                   1351.33
155  c1nc(c(s1)c1nsnc1)c1csc(n1)c1cccs1                   1466.57
104  Oc1ccc2c(c1c1cocc1)c(ccc2)c1ccccn1                   1207.67
394  Oc1ccc(s1)N1CNC(NC1)c1cnccn1                         1355.15
377  n1csc(n1)C1(CCCC1)c1csc(c1)c1scnc1                   1311.30
124  Oc1ccoc1c1ccnc(c1)c1ccc2c(c1)cccc2                   1226.34
 68  SC1CCC(C1c1cnccn1)C1CSCCS1                           1238.45
450  o1ccc(c1)c1sc(cc1c1ccsc1)c1cccc2c1cccc2              1246.33
  9  c1ccc(nc1)c1csc(n1)Oc1ccc2c(c1)cccc2                 1250.51
194  c1cnc(cn1)c1nccnc1c1ccc(o1)c1nccs1                   1321.35
406  C1NC(NC(N1)c1cc(cc2c1cccc2)c1ccccn1)C1CCCC1          1106.21
 84  Sc1oc(c(c1)c1cnccn1)N1CNCNC1                         1340.03
371  n1ccc(cc1)c1nccc(c1c1cccs1)c1ccccn1                  1190.34
388  Sc1ccc2c(c1)c(ccc2)c1cc2ccccc2cc1c1scnc1             1215.30
495  C1NCN(CN1)c1c(cncc1c1ccco1)c1cocc1                   1267.09
 30  Cc1c(cc2c(c1c1cnccn1)cccc2)c1cscn1                   1213.76
316  c1cnc(cn1)C1(CSCC(S1)c1cscn1)c1ccc2c(c1)cccc2        1282.68
408  Sc1cc(c(o1)c1cscc1)S                                 1393.37
490  c1nc(c(s1)c1ccco1)Cc1cccc2c1cccc2                    1218.16
491  Cc1csc(n1)c1ccc(cn1)c1cccs1                          1271.32
280  Sc1ccc(nc1)c1ccsc1c1cccc2c1cccc2                     1227.39
356  C1CCC(C1)c1ncc(s1)c1csc(n1)c1nsnc1                   1335.61
 76  OC1(CCCC1C1CCCC1)c1cnccn1                            1072.32

Run AutoML for a regression task

[3]:
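from chemml.autoML import ModelScreener
# Featurize the SMILES column and screen regressors against the density target
MS = ModelScreener(df, target="density_Kg/m3", featurization=True, smiles="smiles",
                   screener_type="regressor", output_file="testing.txt")
# Keep the 10 best models; run on a single core
scores = MS.screen_models(n_best=10, multi_core=False)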
featurizing molecules in batches of 2 ...
25/25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19s 758ms/step
Merging batch features ...    [DONE]
[11:07:28] DEPRECATION WARNING: please use MorganGenerator
[11:07:28] DEPRECATION WARNING: please use TopologicalTorsionGenerator
split done!
Single-core complete


--- Screening complete for feature set CoulombMatrix, time taken: 24.063 seconds ---
split done!
Single-core complete


--- Screening complete for feature set morganfingerprints_radius3, time taken: 26.558 seconds ---
split done!
Single-core complete


--- Screening complete for feature set MACCS_radius3, time taken: 20.941 seconds ---
split done!
Single-core complete


--- Screening complete for feature set hashedtopologicaltorsion_radius3, time taken: 23.584 seconds ---
split done!
Single-core complete


--- Screening complete for feature set rdkit_descriptors, time taken: 19.822 seconds ---
split done!
Single-core complete


--- Screening complete for feature set mord_descriptors, time taken: 25.504 seconds ---
[4]:
scores
[4]:
           ME        MAE         MSE       RMSE      MSLE     RMSLE      MAPE    MaxAPE     RMSPE        MPE      MaxAE  deltaMaxE  r_squared        std  time(seconds)  Model       parameters                                         Feature
20   0.158392   3.541820   16.228670   4.028482  0.000010  0.003213  0.281598  0.409798  0.321640   0.008400   5.075142   9.790137   0.988459  37.498218       4.714186  ElasticNet  {'alpha': 0.04832930238571757, 'copy_X': True,...  rdkit_descriptors
18   2.210615   3.721152   16.695492   4.086012  0.000010  0.003225  0.292638  0.480407  0.322172   0.178923   6.087195   8.353000   0.988127  37.498218       2.938873  SVR         {'C': 6.2105263157894735, 'cache_size': 200, '...  rdkit_descriptors
17   4.800724   4.800724   45.260966   6.727627  0.000028  0.005321  0.376698  0.901028  0.530160   0.376698  11.416831  10.625138   0.967811  37.498218       2.872375  Ridge       {'alpha': 5.0, 'copy_X': True, 'fit_intercept'...  rdkit_descriptors
19   0.096653  10.271165  119.063575  10.911626  0.000074  0.008620  0.808893  1.232328  0.865350  -0.012659  15.261767  24.330125   0.915325  37.498218       4.118833  Lasso       {'alpha': 0.06951927961775611, 'copy_X': True,...  rdkit_descriptors
23   2.001862  11.444190  195.578785  13.984949  0.000123  0.011071  0.897954  1.586112  1.103684   0.187122  19.643200  33.806692   0.860909  37.498218       8.573116  Lasso       {'alpha': 0.005455594781168523, 'copy_X': True...  mord_descriptors
24   0.098434  15.575122  306.234840  17.499567  0.000191  0.013812  1.220997  1.898368  1.377200   0.044581  23.510333  42.064068   0.782212  37.498218      10.132495  ElasticNet  {'alpha': 0.0026366508987303605, 'copy_X': Tru...  mord_descriptors
15   8.346180  18.830843  618.279153  24.865220  0.000397  0.019930  1.492118  3.217256  1.966877   0.652719  40.765533  54.517137   0.560293  37.498218       7.560452  Lasso       {'alpha': 0.02335721469090124, 'copy_X': True,...  hashedtopologicaltorsion_radius3
14  24.377798  24.377798  738.728924  27.179568  0.000467  0.021607  1.909181  2.866370  2.134430   1.909181  35.498564  27.813276   0.474632  37.498218       6.243020  Ridge       {'alpha': 24.6, 'copy_X': True, 'fit_intercept...  hashedtopologicaltorsion_radius3
22  17.411413  17.522432  784.644116  28.011500  0.000531  0.023046  1.412099  3.903867  2.261465   1.403741  48.347442  48.513971   0.441978  37.498218       5.094646  SVR         {'C': 16.63157894736842, 'cache_size': 200, 'c...  mord_descriptors
21  17.421228  17.602028  788.271260  28.076169  0.000534  0.023100  1.418320  3.912741  2.266680   1.404709  48.457341  48.728541   0.439398  37.498218       5.041715  Ridge       {'alpha': 0.1, 'copy_X': True, 'fit_intercept'...  mord_descriptors
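To help read a few of the error columns in the table above, here is a minimal sketch of the conventional definitions of ME, MAE, MSE, RMSE, MAPE and r_squared in plain Python. These are the textbook formulas and are an assumption here; ChemML's own implementation may differ in details such as the sign convention for ME or whether MAPE is scaled to percent. The function name `regression_metrics` and the toy values are illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Conventional error metrics; ChemML's exact definitions may differ."""
    n = len(y_true)
    # Signed errors (prediction minus truth); the sign convention is an assumption.
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "ME": sum(errors) / n,                       # mean (signed) error
        "MAE": sum(abs(e) for e in errors) / n,      # mean absolute error
        "MSE": mse,                                  # mean squared error
        "RMSE": math.sqrt(mse),                      # root mean squared error
        # Mean absolute percentage error, scaled to percent (assumption).
        "MAPE": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "r_squared": 1.0 - ss_res / ss_tot,          # coefficient of determination
    }


# Toy densities, not taken from the screening run.
y_true = [1000.0, 1200.0, 1400.0]
y_pred = [1010.0, 1190.0, 1400.0]
metrics = regression_metrics(y_true, y_pred)
```

In this toy case ME is zero because the two errors cancel while MAE is not, which is why a signed and an absolute error are both worth reporting.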

Save scores to CSV

[5]:
scores.to_csv("autoML_test.csv",index=False)
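A CSV file stores every cell as text, so a dictionary-valued column such as `parameters` comes back as a string when the file is reloaded. The following is a minimal sketch of recovering the dictionary with the standard library, assuming the stored values are plain Python literals (numbers, strings, booleans, None). The single row written here is a hypothetical stand-in, not real output from the screening run.

```python
import ast
import csv
import io

# Hypothetical stand-in for a saved scores file: one row, illustrative values only.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["RMSE", "Model", "parameters", "Feature"])
writer.writerow([1.23, "Ridge", str({"alpha": 1.0, "copy_X": True}), "rdkit_descriptors"])
buffer.seek(0)

# Reading the file back yields strings for every field.
rows = list(csv.DictReader(buffer))
raw_params = rows[0]["parameters"]

# literal_eval parses Python literals only and does not execute arbitrary code.
params = ast.literal_eval(raw_params)
```

The recovered `params` dictionary could then be passed as keyword arguments to the matching scikit-learn estimator to rebuild a model.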