Genetic Algorithm GUI tutorial

A Python implementation of a real-valued genetic algorithm for solving hyperparameter optimization problems.

We recommend reading the "Hyperparameter Optimization using chemml.optimization.GeneticAlgorithm" section of the documentation first to understand the parameters required to run the genetic algorithm (GA). In this tutorial we show how to optimize the hyperparameters of an MLPRegressor model.

We use a sample dataset from the ChemML library that contains SMILES codes and HOMO energies (eV). We represent each molecule as a Morgan fingerprint (a 1024-bit vector) and split the data 90:10 into training and testing sets. We then predict the HOMO energies of the test set using the MLPRegressor model with optimized hyperparameters.
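As a rough illustration of the 90:10 split described above, the following sketch applies scikit-learn's train_test_split to a synthetic stand-in for the fingerprint matrix. The random bit matrix, the made-up target values, and the random_state are assumptions for the example only; in the tutorial itself the GUI performs this step.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: 100 "molecules", each a 1024-bit fingerprint (random bits here).
X = rng.integers(0, 2, size=(100, 1024))
# Stand-in target values playing the role of HOMO energies (eV).
y = rng.normal(loc=-5.0, scale=0.5, size=100)

# test_size=0.1 keeps 90% of the rows for training and 10% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

print(X_train.shape, X_test.shape)  # (90, 1024) (10, 1024)
```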

Five text files containing Python code are required to run the genetic algorithm from the GUI.

Text files required:
- evaluate
- error_metric
- single_obj
- space
- test_hyperparameters

evaluate

The file ga_eval.txt contains Python code that evaluates every individual (a tuple of hyperparameters) and yields the error metric results for each generation. Once the model type is selected in the Wrapper parameters section of the GUI, the model itself is defined in this code. For each individual, the hyperparameters of the model are set and the model is passed to the objective function (single_obj), which evaluates it.
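To make the mapping from an individual to model settings concrete, here is a minimal sketch of the decoding step. The helper name decode_individual is hypothetical; the logic (exponentiating the first gene and dropping zero-sized layers) mirrors what the ga_eval function in this tutorial does before it builds the MLPRegressor.

```python
import math

def decode_individual(indi):
    """Translate a GA individual into MLPRegressor keyword arguments.

    indi = (log_alpha, activation, neurons1, neurons2, neurons3)
    """
    # alpha is searched on a log scale, so exponentiate to get the real value.
    alpha = math.exp(indi[0])
    # A layer size of 0 means "no layer", so it is skipped.
    layers = tuple(n for n in indi[2:5] if n != 0)
    return {
        "alpha": alpha,
        "activation": indi[1],
        "hidden_layer_sizes": layers,
    }

params = decode_individual((math.log(0.001), "relu", 100, 0, 60))
print(params["hidden_layer_sizes"])  # (100, 60)
```

The resulting dictionary could be passed as `MLPRegressor(**params)`, which is one way to keep the decoding logic separate from the model construction.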

Hyperparameter optimization can be time-consuming. For convenience, we therefore use a temporary file to gauge and print the progress of the iterations. This is done by the following section of the code:

# Count iterations of the GA
with open("tmp.txt", "a") as count:
    count.write("GA search iteration in process... \n")
# Read the file back and count its non-empty lines
with open("tmp.txt", "r") as file:
    Content = file.read()
CoList = Content.split("\n")
Counter = 0
for i in CoList:
    if i:
        Counter += 1
print("GA search iteration in process... ", Counter)

Note: The ``tmp.txt`` file is deleted once the GA search is completed.
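The same bookkeeping can be wrapped in a small function that returns the running count. This is a sketch only: the function name log_iteration and its path argument are hypothetical and not part of ChemML, and a throwaway directory is used here instead of ``tmp.txt``.

```python
import os
import tempfile

def log_iteration(path):
    """Append one progress line to `path` and return the number of lines so far."""
    with open(path, "a") as fh:
        fh.write("GA search iteration in process... \n")
    with open(path) as fh:
        # Count non-empty lines; one line is written per call.
        return sum(1 for line in fh if line.strip())

# Demonstrate with a throwaway file that is removed automatically.
with tempfile.TemporaryDirectory() as tmpdir:
    progress_file = os.path.join(tmpdir, "progress.txt")
    counts = [log_iteration(progress_file) for _ in range(3)]

print(counts)  # [1, 2, 3]
```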

``ga_eval.txt`` has the following python code:

def ga_eval(indi):
    # Hidden layer sizes: genes 2-4, skipping layers with 0 neurons
    layers = [indi[i] for i in range(2, 5) if indi[i] != 0]

    # Count iterations of the GA
    with open("tmp.txt", "a") as count:
        count.write("GA search iteration in process... \n")
    # Read the file back and count its non-empty lines
    with open("tmp.txt", "r") as file:
        Content = file.read()
    Counter = 0
    for line in Content.split("\n"):
        if line:
            Counter += 1
    print("GA search iteration in process... ", Counter)

    # Build the model from the individual; alpha is searched on a log scale
    mlp = MLPRegressor(alpha=np.exp(indi[0]), activation=indi[1],
                       hidden_layer_sizes=tuple(layers), learning_rate='invscaling',
                       max_iter=10, early_stopping=True)
    # Cross-validated error of this individual
    ga_search = single_obj(mlp=mlp, x=X.values, y=Y.values, n_splits=n_splits)

    # Log the hyperparameters and their score
    with open("GA.txt", "a") as f:
        f.write("%f %s %d %d %d %f \n" % (float(np.exp(indi[0])), str(indi[1]), int(indi[2]),
                                           int(indi[3]), int(indi[4]), float(ga_search)))
    return ga_search
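Because ga_eval appends one line per evaluated individual to GA.txt, that file can be inspected after a run. The sketch below parses lines in the format written above (alpha, activation, three layer sizes, score) and picks the entry with the lowest score. The helper name parse_ga_line and the two sample lines are made up for illustration; a real run would read the lines from GA.txt instead.

```python
def parse_ga_line(line):
    """Parse one 'alpha activation n1 n2 n3 score' line as written by ga_eval."""
    alpha, activation, n1, n2, n3, score = line.split()
    return {
        "alpha": float(alpha),
        "activation": activation,
        "layers": (int(n1), int(n2), int(n3)),
        "score": float(score),
    }

# Made-up sample lines in the same format as GA.txt.
sample = [
    "0.000562 tanh 140 80 120 0.390616 ",
    "0.010000 relu 60 0 20 0.452100 ",
]
records = [parse_ga_line(line) for line in sample if line.strip()]

# Lower is better because the score is an error (MAE).
best = min(records, key=lambda r: r["score"])
print(best["activation"], best["layers"])  # tanh (140, 80, 120)
```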

error_metric

The file error_metric.txt contains Python code that returns the user-defined error metric score (in this case the mean absolute error, MAE) for the data. The error metric returned by this function is the criterion by which the “best model” is selected.

def error_metric(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    e = y_true - y_pred            # signed errors
    ae = np.absolute(e)            # absolute errors
    MAE = np.mean(ae)              # mean absolute error
    return MAE
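A small worked example may help to check the metric by hand. The function below is a self-contained restatement of the MAE computation under a hypothetical name (mean_absolute_error), and the input values are invented.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean of |y_true - y_pred|, the quantity returned by error_metric."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Absolute errors are 0.5, 0.0 and 1.0, so the mean is 1.5 / 3 = 0.5.
score = mean_absolute_error([-5.0, -6.0, -4.0], [-5.5, -6.0, -5.0])
print(score)  # 0.5
```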

single_obj

The file single_obj.txt contains Python code that fits the model, configured with the hyperparameters set in the ga_eval function, to the dataset. The training data is split into n_splits folds (defined by the user) and an error metric is computed for each fold.

The function returns the mean error_metric score over the K folds.

``single_obj.txt`` has the following python code:

def single_obj(mlp, x, y, n_splits=n_splits):
    kf = KFold(n_splits)                    # K-fold cross-validation (creates n_splits train/validation sets)
    accuracy_kfold = []
    for training, testing in kf.split(x):
        mlp.fit(x[training], y[training])
        y_pred = mlp.predict(x[testing])
        y_pred, y_act = y_pred.reshape(-1, 1), y[testing].reshape(-1, 1)
        model_accuracy = mae(y_act, y_pred)       # evaluation metric: MAE
        accuracy_kfold.append(model_accuracy)     # one error value per fold
    #print("def single_obj - completed")
    return np.mean(accuracy_kfold)
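The K-fold averaging can be tried in isolation with the sketch below. It is an illustration under stated assumptions, not the wrapper's code: the function name cv_mean_error is hypothetical, the data are synthetic, and a Ridge regressor stands in for the MLPRegressor so that the example runs quickly and deterministically.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def cv_mean_error(model, x, y, n_splits=5):
    """Fit on each training fold and return the mean MAE over the validation folds."""
    fold_errors = []
    for train_idx, val_idx in KFold(n_splits=n_splits).split(x):
        model.fit(x[train_idx], y[train_idx])
        y_pred = model.predict(x[val_idx])
        fold_errors.append(float(np.mean(np.abs(y[val_idx] - y_pred))))
    return float(np.mean(fold_errors)), fold_errors

# Synthetic regression data: y depends linearly on x plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
y = x @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)

mean_error, fold_errors = cv_mean_error(Ridge(alpha=1.0), x, y, n_splits=5)
print(len(fold_errors), round(mean_error, 3))
```

Returning the per-fold errors alongside the mean makes it easy to see how much the score varies between folds, which the single averaged value hides.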

space

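The file space.txt contains Python code that defines the space variable, i.e. the search space of the GA. It holds one entry per hyperparameter, each giving either a uniform range or a list of choices. Note that alpha is specified on a log scale; ga_eval applies np.exp to it before building the model.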

space = ({'alpha': {'uniform': [np.log(0.0001), np.log(0.1)], 'mutation': [0, 1]}},
         {'activation': {'choice': ['identity', 'logistic', 'tanh', 'relu']}},
         {'neurons1': {'choice': range(0, 220, 20)}},
         {'neurons2': {'choice': range(0, 220, 20)}},
         {'neurons3': {'choice': range(0, 220, 20)}})
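To show what a 'uniform' range and a 'choice' list mean in practice, the sketch below draws one random individual from a space with the same structure. How ChemML itself samples and mutates individuals is not covered in this tutorial, so the sample_individual helper is purely hypothetical, and the 'mutation' entry is left uninterpreted.

```python
import math
import random

# Same structure as the tutorial's space, with math.log in place of np.log.
space = (
    {"alpha": {"uniform": [math.log(0.0001), math.log(0.1)], "mutation": [0, 1]}},
    {"activation": {"choice": ["identity", "logistic", "tanh", "relu"]}},
    {"neurons1": {"choice": range(0, 220, 20)}},
    {"neurons2": {"choice": range(0, 220, 20)}},
    {"neurons3": {"choice": range(0, 220, 20)}},
)

def sample_individual(space, rng):
    """Draw one value per gene: uniform for ranges, random pick for choices."""
    genes = []
    for gene in space:
        spec = next(iter(gene.values()))  # each dict holds exactly one hyperparameter
        if "uniform" in spec:
            low, high = spec["uniform"]
            genes.append(rng.uniform(low, high))
        else:
            genes.append(rng.choice(list(spec["choice"])))
    return tuple(genes)

individual = sample_individual(space, random.Random(0))
print(individual)
```

The tuple has the same layout as the `indi` argument of ga_eval: log(alpha), activation, then three layer sizes.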

test_hyperparameters

The file test_hyperparameters.txt contains Python code that fits the model with the best hyperparameters found by the GA to the training dataset. The fitted model is then used to predict the values of the testing dataset. The function returns an error metric score (in this case MAE) comparing the predicted and actual values.

``test_hyperparameters.txt`` has the following python code:

def test_hyp(mlp, x, y, xtest, ytest):
    mlp.fit(x, y)
    ypred = mlp.predict(xtest)
    acc=mae(ytest,ypred)
    # print(" test_hyp completed ")
    return np.mean(acc)

With the five files in place, launch the ChemML Wrapper GUI in a Jupyter notebook:

from chemml.wrapper.notebook import ChemMLNotebook
ui = ChemMLNotebook()
The computation graph will be displayed here:

The ChemML Wrapper's config file has been successfully saved ...
    config file name: chemML_config.txt
    current directory: /mnt/c/Aatish/UB/Mr. Hachmann/master_chemml_wrapper_v2
    what's next? run the ChemML Wrapper using the config file with the following codes:
        >>> from chemml.wrapper.engine import run
        >>> run(INPUT_FILE = 'path_to_the_config_file', OUTPUT_DIRECTORY = 'CMLWrapper_out')
... you can also create a python script of the above codes and run it on any cluster that ChemML is installed.
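
The closing note in the output above suggests putting the two lines into a script for use on a cluster. A minimal sketch of such a script follows. The command-line arguments and the build_parser and main helpers are hypothetical; the run call and its INPUT_FILE and OUTPUT_DIRECTORY keywords are taken from the message above.

```python
import argparse

def build_parser():
    """Command-line interface: config file path plus optional output directory."""
    parser = argparse.ArgumentParser(description="Run a saved ChemML Wrapper config.")
    parser.add_argument("config", help="path to the config file, e.g. chemML_config.txt")
    parser.add_argument("--out", default="CMLWrapper_out", help="output directory")
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    # Imported here so that argument errors are reported before ChemML is loaded.
    from chemml.wrapper.engine import run
    run(INPUT_FILE=args.config, OUTPUT_DIRECTORY=args.out)

# Example of how the arguments are parsed (no ChemML call is made here).
args = build_parser().parse_args(["chemML_config.txt", "--out", "ga_out"])
print(args.config, args.out)  # chemML_config.txt ga_out
```

In an actual script, the last two lines would be replaced by a call to `main()` so that the arguments come from the command line.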

Run the Wrapper with the saved config file:

from chemml.wrapper.engine import run
run(INPUT_FILE = '/mnt/c/Aatish/UB/Mr. Hachmann/master_chemml_wrapper_v2/chemML_config.txt', OUTPUT_DIRECTORY = 'ga_out')
=================================================
=================================================
Thu Oct 21 13:15:40 2021

parsing the input file: /mnt/c/Aatish/UB/Mr. Hachmann/master_chemml_wrapper_v2/chemML_config.txt ...

1   Task: (Input,datasets)
        <<<<<<<
        host = chemml
        function = load_cep_homo
        >>>>>>>
        smiles -> send (id=0)
        homo -> send (id=4)
         :nothing to receive:

2   Task: (Output,file)
        <<<<<<<
        host = chemml
        function = SaveFile
        format = smi
        header = False
        filename = smiles
        >>>>>>>
        filepath -> send (id=1)
        df <- recv (id=0)

3   Task: (Represent,molecular descriptors)
        <<<<<<<
        host = chemml
        function = RDKitFingerprint
        >>>>>>>
        df -> send (id=2)
        df -> send (id=3)
        molfile <- recv (id=1)

4   Task: (Output,file)
        <<<<<<<
        host = chemml
        function = SaveFile
        filename = fps_rdkfp
        >>>>>>>
         :nothing to send:
        df <- recv (id=2)

5   Task: (Prepare,split)
        <<<<<<<
        host = sklearn
        function = train_test_split
        >>>>>>>
        dfx_train -> send (id=5)
        dfy_train -> send (id=6)
        dfx_test -> send (id=7)
        dfy_test -> send (id=8)
        dfx <- recv (id=3)
        dfy <- recv (id=4)

6   Task: (Optimize,genetic algorithm)
        <<<<<<<
        host = chemml
        function = GA
        algorithm = 3
        ml_model = MLPRegressor
        evaluate = ./chemml/chemml/datasets/GA_files/ga_eval.txt
        space = ./chemml/chemml/datasets/GA_files/space.txt
        error_metric = ./chemml/chemml/datasets/GA_files/error_metric.txt
        test_hyperparameters = ./chemml/chemml/datasets/GA_files/test_hyperparameters.txt
        single_obj = ./chemml/chemml/datasets/GA_files/single_obj.txt
        fitness = (<built-in function min>,)
        pop_size = 5
        crossover_size = 2
        mutation_size = 2
        n_splits = 5
        n_generations = 5
        >>>>>>>
        best_ind_df -> send (id=9)
        best_individual -> send (id=10)
        dfx_train <- recv (id=5)
        dfy_train <- recv (id=6)
        dfx_test <- recv (id=7)
        dfy_test <- recv (id=8)

7   Task: (Output,file)
        <<<<<<<
        host = chemml
        function = SaveFile
        filename = best_ind_df
        >>>>>>>
         :nothing to send:
        df <- recv (id=9)

8   Task: (Output,file)
        <<<<<<<
        host = chemml
        function = SaveFile
        filename = best_individual
        >>>>>>>
         :nothing to send:
        df <- recv (id=10)

=================================================

======= block#1: (chemml, load_cep_homo)
| run ...

| ... done!
| execution time: 4.78s (0h 0m 4.78s)
=======


======= block#2: (chemml, SaveFile)
| run ...

| ... done!
| execution time: 0.05s (0h 0m 0.05s)
=======


======= block#3: (chemml, RDKitFingerprint)
| run ...

| ... done!
| execution time: 1.83s (0h 0m 1.83s)
=======


======= block#4: (chemml, SaveFile)
| run ...

| ... done!
| execution time: 0.14s (0h 0m 0.14s)
=======


======= block#5: (sklearn, train_test_split)
| run ...

| ... done!
| execution time: 0.03s (0h 0m 0.03s)
=======


======= block#6: (chemml, GA)
| run ...

Hyperparameter optimization is a time consuming process - do not shutdown Kernel....

Total GA search iterations =  25
GA search iteration in process...  1
GA search iteration in process...  2
GA search iteration in process...  3
GA search iteration in process...  4
GA search iteration in process...  5
GA search iteration in process...  6
GA search iteration in process...  7
GA search iteration in process...  8
GA search iteration in process...  9
GA search iteration in process...  10
GA search iteration in process...  11
GA search iteration in process...  12
GA search iteration in process...  13
GA search iteration in process...  14
GA search iteration in process...  15
GA search iteration in process...  16
GA search iteration in process...  17
GA search iteration in process...  18
GA search iteration in process...  19
GA search iteration in process...  20
GA search iteration in process...  21
GA search iteration in process...  22
GA search iteration in process...  23
GA search iteration in process...  24
GA search iteration in process...  25
GeneticAlgorithm - complete!


genetic algorithm results for each generation:
                             Best_individual  Fitness_values  Time (hours)
0  (-7.483401552230648, tanh, 140, 80, 120)        0.390616      0.008216
1  (-7.483401552230648, tanh, 140, 80, 120)        0.390616      0.009518
2  (-7.483401552230648, tanh, 140, 80, 120)        0.390616      0.010988
3  (-7.483401552230648, tanh, 140, 80, 120)        0.390616      0.011194
4  (-7.483401552230648, tanh, 140, 80, 120)        0.390616      0.010027

best particle:  {'alpha': -7.483401552230648, 'activation': 'tanh', 'neurons1': 140, 'neurons2': 80, 'neurons3': 120}

Calculating accuracy on test data....


Test set error_metric (default = MAE) for the best GA hyperparameter:  0.7548308245753313

| ... done!
| execution time: 216.86s (0h 3m 36.86s)
=======


======= block#7: (chemml, SaveFile)
| run ...

| ... done!
| execution time: 0.03s (0h 0m 0.03s)
=======


======= block#8: (chemml, SaveFile)
| run ...

| ... done!
| execution time: 0.04s (0h 0m 0.04s)
=======


Total execution time: 223.80s (0h 3m 43.80s)
2021-10-21 13:19:23