Hyperparameter Optimization using chemml.optimization.GeneticAlgorithm

We use a sample dataset from the ChemML library, which contains SMILES strings and Dragon molecular descriptors for 500 small organic molecules along with their densities in \(kg/m^3\).

For more information on the Genetic Algorithm, please refer to our paper.

[1]:
from chemml.datasets import load_organic_density
_, density, features = load_organic_density()

print(density.shape, features.shape)
density, features = density.values, features.values

# Standardize features and target with separate scalers, so that
# either transform can be inverted later if needed.
from sklearn.preprocessing import StandardScaler
scalerx = StandardScaler()
scalery = StandardScaler()
features = scalerx.fit_transform(features)
density = scalery.fit_transform(density)
(500, 1) (500, 200)
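
Because the target was standardized, model predictions will also be in standardized units. A minimal sketch of mapping values back to the original \(kg/m^3\) scale, using the separate target scaler (scalery) fitted above:

# Sketch: invert the target scaling to recover densities in kg/m^3
# (uses the scalery object fitted in the cell above).
density_kgm3 = scalery.inverse_transform(density)
print(density_kgm3[:3])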

Defining the hyperparameter space

Let's consider kernel ridge regression from scikit-learn for training. The hyperparameters of interest are alpha, kernel, and degree.

The space variable is a tuple with one dictionary per hyperparameter. Each dictionary is specified as:

{'name' : {'type' : <range>}}

For the ‘uniform’ hyperparameter type, an additional mutation key is also required; its value is the (mean, standard deviation) of the Gaussian distribution used to mutate the value.

[2]:
from sklearn.kernel_ridge import KernelRidge
space = (
        {'alpha'   :   {'uniform' : (0.1, 10), 'mutation': (0,1)}},
        {'kernels' :   {'choice'  : ['rbf', 'sigmoid', 'polynomial', 'linear']}},
        {'degree'  :   {'int'     : (1,5)}} )

Defining the objective function

The objective function receives one ‘individual’ from the genetic algorithm’s population: an ordered list of hyperparameter values, in the order they appear in the space variable (e.g. [2.93, 'linear', 1] for the space above). Within the objective function, the user performs all the required calculations and returns the metric to be optimized (as a tuple). If multiple metrics are returned, all of them are optimized according to the fitness tuple defined when initializing the GeneticAlgorithm class (a multi-metric sketch follows the code cell below).

[3]:
from chemml.utils import regression_metrics

def obj(individual):
    # Build a KRR model from the individual's hyperparameter values.
    krr = KernelRidge(alpha=individual[0], kernel=individual[1], degree=individual[2])
    # Train on the first 400 molecules, validate on the remaining 100.
    krr.fit(features[:400], density[:400])
    pred = krr.predict(features[400:])
    mae = regression_metrics(density[400:], pred)['MAE'].values[0]
    return mae
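
The function above returns a single metric. As a hypothetical sketch of the multi-metric case mentioned earlier (assuming the regression_metrics frame also exposes an 'RMSE' column), the objective could return a tuple, and the GA would then be initialized with a matching fitness tuple such as fitness=("min", "min"):

# Hypothetical multi-objective variant: return two metrics as a tuple.
# Assumes regression_metrics also reports an 'RMSE' column.
def obj_multi(individual):
    krr = KernelRidge(alpha=individual[0], kernel=individual[1], degree=individual[2])
    krr.fit(features[:400], density[:400])
    pred = krr.predict(features[400:])
    metrics = regression_metrics(density[400:], pred)
    return metrics['MAE'].values[0], metrics['RMSE'].values[0]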

Optimize the model

[4]:
from chemml.optimization import GeneticAlgorithm
import warnings
warnings.filterwarnings('ignore')

ga = GeneticAlgorithm(evaluate=obj, space=space, fitness=("min", ),
                    pop_size = 8, crossover_size=6, mutation_size=2, algorithm=3)
fitness_df, final_best_hyperparameters = ga.search(n_generations=5)

ga.search returns:

  • a dataframe with the best individuals of each generation along with their fitness values and the time taken to evaluate the model

  • a dictionary containing the best individual

[5]:
fitness_df
[5]:
   Best_individual                    Fitness_values   Time (hours)
0  (2.928571428571429, linear, 1)     0.102514         0.000390
1  (1.1057547599530448, linear, 1)    0.095582         0.000230
2  (1.1057547599530448, linear, 1)    0.095582         0.000312
3  (1.1057547599530448, linear, 1)    0.095582         0.000143
4  (1.1057547599530448, linear, 1)    0.095582         0.000309
[6]:
print(final_best_hyperparameters)
{'alpha': 1.1057547599530448, 'kernels': 'linear', 'degree': 1}
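
A minimal sketch of rebuilding the final model from this dictionary; note that the keys follow the names chosen in the space variable, so the 'kernels' entry maps onto KernelRidge's kernel argument:

# Sketch: retrain KRR with the best hyperparameters found by the GA.
best = final_best_hyperparameters
final_krr = KernelRidge(alpha=best['alpha'], kernel=best['kernels'],
                        degree=best['degree'])
final_krr.fit(features[:400], density[:400])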

Resume optimization

The Genetic Algorithm can resume the search for the best combination of hyperparameters from the last checkpoint. This feature is useful when the objective function is computationally expensive.

[7]:
fitness_df_resume, final_best_hyperparameters_resume = ga.search(n_generations=5)
[8]:
fitness_df_resume
[8]:
   Best_individual                    Fitness_values   Time (hours)
0  (1.1057547599530448, linear, 1)    0.095582         0.000271
1  (1.1057547599530448, linear, 1)    0.095582         0.000260
2  (1.1057547599530448, linear, 1)    0.095582         0.000471
3  (1.1057547599530448, linear, 1)    0.095582         0.000116
4  (1.0210089362161252, linear, 4)    0.095203         0.000316
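
To see how the search converged, the Fitness_values column of either dataframe can be plotted against the generation index (a minimal sketch, assuming matplotlib is available):

# Sketch: plot the best fitness (MAE) per generation.
import matplotlib.pyplot as plt
fitness_df_resume['Fitness_values'].plot(marker='o')
plt.xlabel('Generation')
plt.ylabel('Best MAE')
plt.show()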