Hyperparameter Optimization using chemml.optimization.GeneticAlgorithm

We use a sample dataset from the ChemML library, which contains SMILES strings and Dragon molecular descriptors for 500 small organic molecules along with their densities in \(kg/m^3\).

For more information on the Genetic Algorithm, please refer to our paper.

[1]:
from chemml.datasets import load_organic_density
_, density, features = load_organic_density()

print(density.shape, features.shape)
density, features = density.values, features.values

# Standardize features and target with separate scalers, so that
# either transform can be inverted later if needed.
from sklearn.preprocessing import StandardScaler
scalerx = StandardScaler()
scalery = StandardScaler()
features = scalerx.fit_transform(features)
density = scalery.fit_transform(density)
(500, 1) (500, 200)
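
Because the target was standardized, model predictions will also be in standardized units. A minimal sketch of mapping values back to the original \(kg/m^3\) scale, using the separate target scaler (scalery) fitted above:

# Sketch: invert the target scaling to recover densities in kg/m^3
# (uses the scalery object fitted in the cell above).
density_kgm3 = scalery.inverse_transform(density)
print(density_kgm3[:3])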

Defining the hyperparameter space

Let's consider kernel ridge regression from scikit-learn for training. The hyperparameters of interest are alpha, kernel, and degree.

The space variable is a tuple with one dictionary per hyperparameter. Each dictionary is specified as:

{'name' : {'type' : <range>}}

For the ‘uniform’ hyperparameter type, an additional mutation key is also required; its value is the (mean, standard deviation) of the Gaussian distribution used to mutate the value.

[2]:
from sklearn.kernel_ridge import KernelRidge
space = (
        {'alpha'   :   {'uniform' : (0.1, 10), 'mutation': (0,1)}},
        {'kernels' :   {'choice'  : ['rbf', 'sigmoid', 'polynomial', 'linear']}},
        {'degree'  :   {'int'     : (1,5)}} )

Defining the objective function

The objective function receives one ‘individual’ from the genetic algorithm’s population: an ordered list of hyperparameter values, in the order they appear in the space variable (e.g. [2.93, 'linear', 1] for the space above). Within the objective function, the user performs all the required calculations and returns the metric to be optimized (as a tuple). If multiple metrics are returned, all of them are optimized according to the fitness tuple defined when initializing the GeneticAlgorithm class (a multi-metric sketch follows the code cell below).

[3]:
from chemml.utils import regression_metrics

def obj(individual):
    # Build a KRR model from the individual's hyperparameter values.
    krr = KernelRidge(alpha=individual[0], kernel=individual[1], degree=individual[2])
    # Train on the first 400 molecules, validate on the remaining 100.
    krr.fit(features[:400], density[:400])
    pred = krr.predict(features[400:])
    mae = regression_metrics(density[400:], pred)['MAE'].values[0]
    return mae
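
The function above returns a single metric. As a hypothetical sketch of the multi-metric case mentioned earlier (assuming the regression_metrics frame also exposes an 'RMSE' column), the objective could return a tuple, and the GA would then be initialized with a matching fitness tuple such as fitness=("min", "min"):

# Hypothetical multi-objective variant: return two metrics as a tuple.
# Assumes regression_metrics also reports an 'RMSE' column.
def obj_multi(individual):
    krr = KernelRidge(alpha=individual[0], kernel=individual[1], degree=individual[2])
    krr.fit(features[:400], density[:400])
    pred = krr.predict(features[400:])
    metrics = regression_metrics(density[400:], pred)
    return metrics['MAE'].values[0], metrics['RMSE'].values[0]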

Optimize the model

[4]:
from chemml.optimization import GeneticAlgorithm
import warnings
warnings.filterwarnings('ignore')

ga = GeneticAlgorithm(evaluate=obj, space=space, fitness=("min", ),
                    pop_size = 8, crossover_size=6, mutation_size=2, algorithm=3)
fitness_df, final_best_hyperparameters = ga.search(n_generations=5)

ga.search returns:

  • a dataframe with the best individuals of each generation along with their fitness values and the time taken to evaluate the model

  • a dictionary containing the best individual

[5]:
fitness_df
[5]:
   Best_individual                    Fitness_values   Time (hours)
0  (2.928571428571429, linear, 1)     0.102514         0.000390
1  (1.1057547599530448, linear, 1)    0.095582         0.000230
2  (1.1057547599530448, linear, 1)    0.095582         0.000312
3  (1.1057547599530448, linear, 1)    0.095582         0.000143
4  (1.1057547599530448, linear, 1)    0.095582         0.000309
[6]:
print(final_best_hyperparameters)
{'alpha': 1.1057547599530448, 'kernels': 'linear', 'degree': 1}
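
A minimal sketch of rebuilding the final model from this dictionary; note that the keys follow the names chosen in the space variable, so the 'kernels' entry maps onto KernelRidge's kernel argument:

# Sketch: retrain KRR with the best hyperparameters found by the GA.
best = final_best_hyperparameters
final_krr = KernelRidge(alpha=best['alpha'], kernel=best['kernels'],
                        degree=best['degree'])
final_krr.fit(features[:400], density[:400])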

Resume optimization

The Genetic Algorithm can resume the search for the best combination of hyperparameters from the last checkpoint. This feature is useful when the objective function is computationally expensive.

[7]:
fitness_df_resume, final_best_hyperparameters_resume = ga.search(n_generations=5)
[8]:
fitness_df_resume
[8]:
   Best_individual                    Fitness_values   Time (hours)
0  (1.1057547599530448, linear, 1)    0.095582         0.000271
1  (1.1057547599530448, linear, 1)    0.095582         0.000260
2  (1.1057547599530448, linear, 1)    0.095582         0.000471
3  (1.1057547599530448, linear, 1)    0.095582         0.000116
4  (1.0210089362161252, linear, 4)    0.095203         0.000316
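
To see how the search converged, the Fitness_values column of either dataframe can be plotted against the generation index (a minimal sketch, assuming matplotlib is available):

# Sketch: plot the best fitness (MAE) per generation.
import matplotlib.pyplot as plt
fitness_df_resume['Fitness_values'].plot(marker='o')
plt.xlabel('Generation')
plt.ylabel('Best MAE')
plt.show()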