Optimization module

The chemml.optimization module includes:

GeneticAlgorithm: GeneticAlgorithm()
ActiveLearning: ActiveLearning()

class chemml.optimization.ActiveLearning(model_creator, U, target_layer, train_size=100, test_size=100, test_type='passive', batch_size=[10], history=2)

An implementation of active learning for regression models using the BEMCM and QBC methods, together with approaches for alleviating distribution shift. The algorithm assumes that you have a pool of unlabeled data points and a limited budget to label them. It therefore combines the efficiency of machine learning models with an active learning approach to suggest an optimal number of calculations for providing labeled data.

The implementation of this algorithm follows an interactive approach. In other words, we often ask you to provide labels for the selected data points.

Parameters

model_creator: FunctionType

It’s a function that returns the model. We call this function a couple of times during the search to build fresh models with random weights. Note that you should also compile your model inside the function. We don’t provide options to compile the model. The compile (e.g., for Keras models) defines the loss function, the optimizer/learning rate, and the metrics.
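A minimal sketch of such a function, assuming a TensorFlow/Keras installation. The layer sizes, the layer names, the default of 20 input features, and the optimizer settings are illustrative assumptions, not values prescribed by chemml.

```python
def model_creator(n_features=20):
    """Return a freshly initialized, compiled Keras regression model."""
    # Imported lazily so that defining the function does not require TensorFlow.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu", name="dense1"),
        keras.layers.Dense(32, activation="relu", name="dense2"),
        # Linear output layer; "dense2" is the layer that is linearly mapped to it.
        keras.layers.Dense(1, activation="linear", name="output"),
    ])
    # Compile inside the function: loss, optimizer/learning rate, and metrics.
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss="mean_squared_error",
        metrics=["mean_absolute_error"],
    )
    return model
```

Because every call builds a new Sequential model, each call starts from new random weights, which is what the repeated calls during the search rely on.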

U: array-like

The features/descriptors of unlabeled candidates that are available to be labeled.

target_layer: str or list or FunctionType

If str, it is the name of a layer of the Keras model that is linearly mapped to the outputs. If list, it is a list of str in which each element is the name of such a layer. If a function, it should accept a model created by ‘model_creator’ and the X inputs, and return the outputs of the linear layer.
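The function form could look like the sketch below. It assumes a Keras model that contains a layer named "dense2" feeding the linear output; the layer name and the extra layer_name argument are hypothetical and only serve the illustration.

```python
def target_layer(model, X, layer_name="dense2"):
    """Return the outputs of the layer that feeds the linear output layer."""
    # Imported lazily so that defining the function does not require TensorFlow.
    from tensorflow import keras

    # Build a sub-model that stops at the requested hidden layer.
    extractor = keras.Model(
        inputs=model.inputs,
        outputs=model.get_layer(layer_name).output,
    )
    return extractor.predict(X)
```

Passing the plain string name of the layer instead of a function should be the simpler choice whenever a single named layer is enough.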

train_size: int, optional (default = 100)

It represents the absolute number of train samples that must be selected as the initial training set. The search will begin with this many training samples and labels are required immediately. Please choose a number based on:

your budget.

the minimum number that you think is enough to train your model.

test_size: int, optional (default = 100)

It represents the absolute number of test samples that will be held out for the evaluation of the model in all rounds of your active learning search. Note the test set will be acquired before search begins and won’t be updated later during search. Please choose a number based on:

your budget.

the diversity of the pool of candidates (U).

test_type: str, optional (default = ‘passive’)

The value must be either ‘passive’ or ‘active’. If passive, test set will be sampled randomly at the initialization. If active, test set will be sampled randomly at each round.

batch_size: list, optional (default = [10])

This is a list of at most three non-negative int values. Each value specifies the number of data points that the corresponding active learning approach should query in every round. The order of the active learning approaches is as follows:

Batch Expected Model Change Maximization (BEMCM)

Query By Committee (QBC)

Distribution Shift Alleviation (DSA)

Note that the last method (i.e., DSA) is a complement to the first two methods and cannot be specified alone.
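The positional meaning of batch_size can be sketched with a small helper. The helper describe_batch_size is hypothetical and not part of chemml, and treating missing trailing entries as zero is an assumption made for this illustration.

```python
METHODS = ("BEMCM", "QBC", "DSA")


def describe_batch_size(batch_size):
    """Map a batch_size list onto the method order described in the text."""
    if len(batch_size) > 3:
        raise ValueError("batch_size accepts at most three values")
    if any(not isinstance(n, int) or n < 0 for n in batch_size):
        raise ValueError("batch_size values must be non-negative integers")
    # Assumption: entries that are left out count as zero queries.
    padded = list(batch_size) + [0] * (3 - len(batch_size))
    bemcm, qbc, dsa = padded
    if dsa > 0 and bemcm == 0 and qbc == 0:
        raise ValueError("DSA complements BEMCM/QBC and cannot be used alone")
    return dict(zip(METHODS, padded))


print(describe_batch_size([10]))
print(describe_batch_size([5, 0, 5]))
```

Under this reading, the default [10] queries ten points per round with BEMCM only, while [5, 0, 5] splits the round between BEMCM and DSA.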

history: int, optional (default = 2)

This parameter must be an integer and greater than one. It specifies the number of previous active learning rounds to memorize for the distribution shift alleviation (DSA) approach.

Attributes

queries: list

This list provides information regarding the indices of queried candidates. For each element of the list:

Index 0 is a short description.

Index 1 is an array of indices.

query_number: int

The number of rounds you have run the active learning search method.

U_indices: ndarray

This is an array of the remaining unlabeled indices.

train_indices: ndarray

This is an array of all candidates’ indices that are used as the training data.

test_indices: ndarray

This is an array of all candidates’ indices that are used as the test data.

Y_pred: ndarray

The predicted Y values at the current stage. These values are updated after each run of the search method.

results: pandas.DataFrame

The final results of the active learning approach.

random_results: pandas.DataFrame

The final results of the random search.

Methods

initialize, deposit, search, random_search, visualize, get_target_layer

Notes

You won’t be able to resume the search unless you deposit the requested labeled data.

deposit(indices, Y)

This function helps you to deposit the data for candidates that were queried by initialize or search functions.

Parameters

indices: array-like: A 1-dimensional array of indices that were queried by the initialize or search methods. You can deposit the data partially; it doesn’t have to be the entire array that was queried.
Y: array-like: The 2-dimensional labels of the data points, as they will be used for training the model. The first dimension of the array should be equal to the number of indices. Y must be at least 2 dimensional.

Returns

check: bool: True, if deposited properly. False, otherwise.
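Since Y must be at least 2 dimensional, a flat list of scalar labels has to be reshaped into one row per index before it is deposited. The sketch below uses a hypothetical helper, as_2d_labels, and made-up index and label values purely for illustration.

```python
def as_2d_labels(values):
    """Turn a flat sequence of scalar labels into one single-column row each."""
    return [[float(v)] for v in values]


indices = [12, 45, 7]                  # hypothetical queried indices
Y = as_2d_labels([0.31, 1.20, 0.87])   # 3 rows x 1 column, one row per index

# The first dimension of Y has to match the number of indices.
assert len(Y) == len(indices)
print(Y)
```

With an ActiveLearning instance named al, the call would then be al.deposit(indices, Y). If you work with NumPy arrays, np.asarray(values).reshape(-1, 1) gives the same shape.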

get_target_layer(model, X)

The main function to get the latent features from the linear layer of the Keras model.

Returns

target_layer: array-like: The concatenated array of the hidden layers specified by the parameter target_layer.

ignore(indices)

If you find that the experimental setup or computational study for some of the candidates is not feasible, pass a list of their indices here and they will be removed from the list of queries.

Parameters

indices: array-like: A 1D array of all the indices that should be removed from the list of queries.

initialize(random_state=90)

The function to initialize the training and test set for the search. You can run this function only once before starting the search.

Parameters

random_state: int or RandomState, optional (default = 90): The random state will be directly passed to sklearn.model_selection.ShuffleSplit. Additional info at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn-model-selection-shufflesplit

Returns

train_indices: array-like: The training set indices (Python 0-index) from the pool of candidates (U). This is a 1D array.
test_indices: array-like: The test set indices (Python 0-index) from the pool of candidates (U).

random_search(Y, test_type='passive', scale=True, n_evaluation=10, random_state=90, **kwargs)

This function randomly selects the same number of data points as the active learning rounds and stores the results.

Parameters

Y: array-like: The 2-dimensional label for all the candidates in the pool. Basically, you won’t run this method unless you have the labels for all your samples. Otherwise, trust us and perform an active learning search.
test_type: str, optional (default = ‘passive’): The parameter value must be either ‘passive’ or ‘active’. If passive, the initial randomly selected test set in the initialize method will be used for evaluation. If active, the current test set of active learning approach will be used for evaluation. Thus, if the test_type in active learning method is ‘passive’, you should run active and random search back to back and then deposit the data. This way you make sure both active and random search are tested on the same test sets.
scale: bool or list, optional (default = True): If True, sklearn.preprocessing.StandardScaler will be used to scale X and Y before training. You can also pass a list of two scaler instances that perform sklearn-style fit_transform and transform methods for the X and Y, respectively.
n_evaluation: int, optional (default = 10): Number of times to repeat training of the model and evaluation on the test set.
random_state: int or RandomState, optional (default = 90): The random state will be directly passed to the sklearn.model_selection methods. Please find additional info at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
kwargs: Any argument (except input data) that should be passed to the model’s fit method.

Attributes

random_results: pandas dataframe: The results from the random sampling to provide the baseline for your active learning search. You need to

Notes

This method replicates the active learning training sizes with a random sampling approach. Thus, you can run this function only if results is not empty, i.e., you have run the active learning search at least once.

search(n_evaluation=3, ensemble='bootstrap', n_ensemble=4, normalize_input=True, normalize_internal=False, random_state=90, **kwargs)

The main function to start or continue an active learning search. The bootstrap approach is used to generate an ensemble of models that estimate the prediction distribution of the candidates’ labels.

Parameters

n_evaluation: int, optional (default = 3)

Number of times to repeat training of the model and evaluation on the test set.

ensemble: str, optional (default = ‘bootstrap’)

The sampling method to create n ensembles and estimate the predictive distributions.

‘bootstrap’: standard bootstrap method (random choice with replacement)
‘shuffle’ : sklearn.model_selection.ShuffleSplit
‘kfold’ : sklearn.model_selection.KFold

The ‘shuffle’ and ‘kfold’ methods draw samples that are smaller than training set.
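The difference between the sampling schemes can be sketched with the standard library alone. This is not chemml's internal code: both helpers and the train_fraction argument are hypothetical, and the second helper only mimics the size behaviour of a shuffle-style split.

```python
import random


def bootstrap_indices(n_samples, n_ensemble, seed=90):
    """Draw n_ensemble index sets of full size, with replacement."""
    rng = random.Random(seed)
    return [[rng.randrange(n_samples) for _ in range(n_samples)]
            for _ in range(n_ensemble)]


def subsample_indices(n_samples, n_ensemble, train_fraction=0.8, seed=90):
    """Draw n_ensemble index sets without replacement, smaller than the data."""
    rng = random.Random(seed)
    k = int(train_fraction * n_samples)
    return [rng.sample(range(n_samples), k) for _ in range(n_ensemble)]


boot = bootstrap_indices(n_samples=10, n_ensemble=4)
sub = subsample_indices(n_samples=10, n_ensemble=4)
print([len(b) for b in boot])  # each bootstrap set is as large as the training set
print([len(s) for s in sub])   # each subsample is smaller than the training set
```

A bootstrap set may contain repeated indices, while a subsample never does; each ensemble member is then trained on its own index set.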

n_ensemble: int, optional (default = 4)

The size of the ensemble based on the bootstrapping approach.

normalize_input: bool or list, optional (default = True)

If True, sklearn.preprocessing.StandardScaler will be used to normalize X and Y before training. You can also pass a list of two scaler instances that perform sklearn-style fit_transform and transform methods for the X and Y, respectively.

normalize_internal: bool, optional (default = False)

If True, the internal variables for estimation of gradients will be normalized.

random_state: int or RandomState, optional (default = 90)

The random state will be directly passed to sklearn.model_selection.KFold or ShuffleSplit. Additional info at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

kwargs

Any argument (except input data) that should be passed to the model’s fit method.

Returns

train_indices: array-like: The training set indices (Python 0-index) from the pool of candidates (U). This is a 1D array.
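The interactive cycle of initialize, deposit, and search can be sketched as a small driver function. It relies only on the method signatures and return values documented on this page; run_active_learning and label_fn (an oracle that returns 2-dimensional labels for a sequence of indices) are hypothetical names introduced for the illustration.

```python
def run_active_learning(al, label_fn, n_rounds=3):
    """Drive initialize -> deposit -> (search -> deposit) for n_rounds rounds.

    al       : an ActiveLearning-like object
    label_fn : hypothetical oracle mapping indices to 2-dimensional labels
    """
    train_idx, test_idx = al.initialize()
    # Labels for the initial training and test sets are needed before searching.
    for idx in (train_idx, test_idx):
        if not al.deposit(idx, label_fn(idx)):
            raise RuntimeError("deposit of initial labels failed")

    for _ in range(n_rounds):
        # search returns the indices it wants labeled next.
        queried = al.search(n_evaluation=3, ensemble="bootstrap", n_ensemble=4)
        # The search cannot be resumed until the requested labels are deposited.
        if not al.deposit(queried, label_fn(queried)):
            raise RuntimeError("deposit of queried labels failed")
    return al.results
```

In practice label_fn is where your experiment or calculation runs, so each round may take much longer than the search itself.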

visualize(Y=None)

This function plots the distribution of labels and the principal components of the features for the last round of the active learning search. Note that this function uses the prediction values in the attribute Y_pred. This attribute is updated after each round of search. Thus, we recommend that you run visualize right after each call of the search method to get a trajectory of the active learning process.

Parameters

Y: array-like, optional (default = None): The 2-dimensional label for all the candidates in the pool (in case you have them!!!). If you have all the labels, we will be able to produce additional cool visualizations.

Returns

list: A list of matplotlib.figure.Figure or tuples. This object contains information about the plot.

class chemml.optimization.GeneticAlgorithm(evaluate, space, fitness=('Max',), pop_size=50, crossover_size=30, mutation_size=20, crossover_type='Blend', fused_cutoff=5, mutation_prob=0.6, algorithm=3, initial_population=None)

A Python implementation of a real-valued genetic algorithm for solving optimization problems.

Parameters

evaluate: function

The objective function that has to be optimized. The first parameter of the objective function is a list of the trial values of the hyper-parameters in the order in which they are declared in the space variable. The objective function should always return a tuple with the metric/metrics for single/multi-objective optimization.

space: tuple

A tuple of dict objects specifying the hyper-parameter space to search in. Each hyper-parameter should be a python dict object with the name of the hyper-parameter as the key. Value is also a dict object with one mandatory key among: ‘uniform’, ‘int’ and ‘choice’ for defining floating point, integer and choice variables respectively. Values for these keys should be a list defining the valid hyper-parameter search space (lower and upper bounds for ‘int’ and ‘uniform’, and all valid choices for ‘choice’). For uniform, a ‘mutation’ key is also required for which the value is [mean, standard deviation] for the gaussian distribution. Example:

({'alpha': {'uniform': [0.001, 1], 'mutation': [0, 1]}},
 {'layers': {'int': [1, 3]}},
 {'neurons': {'choice': range(0, 200, 20)}})
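A matching objective function could be sketched as follows. The space repeats the example above; the scoring formula inside evaluate is a toy stand-in invented for this illustration, where a real objective would train and validate a model with the trial hyper-parameters.

```python
space = (
    {"alpha": {"uniform": [0.001, 1], "mutation": [0, 1]}},
    {"layers": {"int": [1, 3]}},
    {"neurons": {"choice": range(0, 200, 20)}},
)


def evaluate(individual):
    """Toy objective: trial values arrive in the order declared in space."""
    alpha, layers, neurons = individual
    # Hypothetical score to minimize; replace with a real validation metric.
    score = (alpha - 0.1) ** 2 + 0.01 * layers + abs(neurons - 60) / 1000
    return (score,)  # always a tuple, even for a single objective


result = evaluate([0.1, 1, 60])
print(result)
```

Passing evaluate=evaluate, space=space, and fitness=('Min',) to GeneticAlgorithm would then request minimization of this single metric.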

fitness: tuple, optional (default = (‘Max’,))

A tuple of string(s) for Maximizing (Max) or minimizing (Min) the objective function(s).

pop_size: integer, optional (default = 50)

Size of the population

crossover_size: int, optional (default = 30)

Number of individuals to select for crossover.

mutation_size: int, optional (default = 20)

Number of individuals to select for mutation.

crossover_type: string, optional (default = “Blend”)

Type of crossover: SinglePoint, DoublePoint, Blend, Uniform

mutation_prob: float, optional (default = 0.6)

Probability of mutation.

algorithm: int, optional (default = 3)

The algorithm to use for the search. Look at the ‘search’ method for a description of the various algorithms.

Algorithm 1:
Initial population is instantiated. Roulette wheel selection is used for selecting individuals for crossover and mutation. The initial population, crossovered and mutated individuals form the pool of individuals from which the best n members are selected as the initial population for the next generation, where n is the size of population.

Algorithm 2:
Same as algorithm 1 but when selecting individuals for next generation, n members are selected using Roulette wheel selection.

Algorithm 3:
Same as algorithm 1 but when selecting individuals for next generation, best members from each of the three pools (initial population, crossover and mutation) are selected according to the input parameters in the search method.

Algorithm 4:
Same as algorithm 1 but mutation population is selected from the crossover population and not from the parents directly.
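The two selection schemes named in these algorithms can be sketched with the standard library. This is a generic illustration for a maximization problem with non-negative fitness values, not chemml's own implementation; roulette_wheel and best_n are hypothetical helper names.

```python
import random


def roulette_wheel(population, fitness, k, seed=None):
    """Draw k individuals with probability proportional to their fitness."""
    if any(f < 0 for f in fitness):
        raise ValueError("this sketch assumes non-negative fitness values")
    rng = random.Random(seed)
    return rng.choices(population, weights=fitness, k=k)


def best_n(pool, fitness, n):
    """Algorithm 1 style survivor selection: keep the n fittest of the pool."""
    ranked = sorted(zip(fitness, range(len(pool))), reverse=True)
    return [pool[i] for _, i in ranked[:n]]


pool = ["a", "b", "c"]
print(roulette_wheel(pool, [1.0, 3.0, 2.0], k=4, seed=0))
print(best_n(pool, [1.0, 3.0, 2.0], n=2))
```

Roulette wheel selection keeps some chance for weaker individuals to be picked, whereas best_n is purely elitist; algorithm 2 differs from algorithm 1 only in using the former for the next generation.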

initial_population: list, optional (default = None)

The initial population for the algorithm to start with. If not provided, initial population is randomly generated.

search(n_generations=20, early_stopping=10, init_ratio=0.35, crossover_ratio=0.35)

Parameters

n_generations: integer, optional (default = 20): An integer for the number of generations to evolve the population for.
early_stopping: int, optional (default = 10): Integer specifying the maximum number of generations for which the algorithm can select the same best individual, after which the search terminates.
init_ratio: float, optional (default = 0.35): Fraction of initial population to select for next generation. Required only for algorithm 3.
crossover_ratio: float, optional (default = 0.35): Fraction of crossover population to select for next generation. Required only for algorithm 3.
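For algorithm 3, the two ratios determine how the next generation is split across the three pools. The sketch below uses the defaults from the search signature and assumes that whatever the two ratios leave over is filled from the mutation pool; that remainder rule and the helper name pool_sizes are assumptions, not confirmed chemml behaviour.

```python
def pool_sizes(pop_size, init_ratio=0.35, crossover_ratio=0.35):
    """Split the next generation across initial, crossover and mutation pools."""
    n_init = int(init_ratio * pop_size)
    n_cross = int(crossover_ratio * pop_size)
    # Assumption: the rest of the population comes from the mutation pool.
    n_mut = pop_size - n_init - n_cross
    if n_mut < 0:
        raise ValueError("init_ratio + crossover_ratio must not exceed 1")
    return n_init, n_cross, n_mut


print(pool_sizes(50))
```

Under these assumptions a population of 50 would take 17 members from the initial population, 17 from crossover, and 16 from mutation.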

Attributes

population: list: List of individuals from the final generation.
fitness_dict: dict: Dictionary of all individuals evaluated by the algorithm.

Returns

best_ind_df: pandas dataframe: A pandas dataframe of best individuals of each generation.
best_ind: dict: The best individual after the last generation.