{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hyperparameter Optimization using `chemml.optimization.GeneticAlgorithm`\n",
    "\n",
    "We use a sample dataset from ChemML library which has the SMILES codes and Dragon molecular descriptors for 500 small organic molecules with their densities in $kg/m^3$. \n",
    "\n",
    "For more information on Genetic Algorithm, please refer to our [paper](https://doi.org/10.26434/chemrxiv.9782387.v1) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(500, 1) (500, 200)\n"
     ]
    }
   ],
   "source": [
    "from chemml.datasets import load_organic_density\n",
    "_,density,features = load_organic_density()\n",
    "\n",
    "print(density.shape, features.shape)\n",
    "density, features = density.values, features.values\n",
    "\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "scalerx = StandardScaler()\n",
    "features = scalerx.fit_transform(features)\n",
    "density = scalerx.fit_transform(density)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Defining hyperparameter space"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets consider [kernel ridge regression from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html) for training. The hyperparameters of interest are: alpha, kernel and degree.\n",
    "\n",
    "The space variable is a tuple of dictionaries for each hyperparameter. The dictionary is specified as:\n",
    "\n",
    "`{'name' : {'type' : <range>}}`\n",
    "\n",
    "An additional mutation key, with its value as: (mean, standard deviation) of a Gaussian distribution, is also required for the ‘uniform’ hyperparameter type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.kernel_ridge import KernelRidge\n",
    "space = (\n",
    "        {'alpha'   :   {'uniform' : (0.1, 10), 'mutation': (0,1)}},\n",
    "        {'kernels' :   {'choice'  : ['rbf', 'sigmoid', 'polynomial', 'linear']}},\n",
    "        {'degree'  :   {'int'     : (1,5)}} )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Defining objective function\n",
    "\n",
    "The objective function is defined as a function that receives one ‘individual’ of the genetic algorithm’s population that is an ordered list of the hyperparameters defined in the space variable. Within the objective function, the user does all the required calculations and returns the metric (as a tuple) that is supposed to be optimized. If multiple metrics are returned, all the metrics are optimized according to the fitness defined in the initialization of the Genetic Algorithm class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_absolute_error\n",
    "from chemml.utils import regression_metrics\n",
    "def obj(individual):\n",
    "    krr = KernelRidge(alpha=individual[0], kernel=individual[1], degree=individual[2])\n",
    "    krr.fit(features[:400], density[:400])\n",
    "    pred = krr.predict(features[400:])\n",
    "    mae = regression_metrics(density[400:],pred)['MAE'].values[0]\n",
    "    return mae"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Optimize the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from chemml.optimization import GeneticAlgorithm\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "ga = GeneticAlgorithm(evaluate=obj, space=space, fitness=(\"min\", ),\n",
    "                    pop_size = 8, crossover_size=6, mutation_size=2, algorithm=3)\n",
    "fitness_df, final_best_hyperparameters = ga.search(n_generations=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`ga.search` returns:\n",
    "\n",
    "- a dataframe with the best individuals of each generation along with their fitness values and the time taken to evaluate the model\n",
    "\n",
    "- a dictionary containing the best individual "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Best_individual</th>\n",
       "      <th>Fitness_values</th>\n",
       "      <th>Time (hours)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>(2.928571428571429, linear, 2)</td>\n",
       "      <td>0.102514</td>\n",
       "      <td>0.000042</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>(1.9083405267238138, linear, 3)</td>\n",
       "      <td>0.099425</td>\n",
       "      <td>0.000033</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>(1.9083405267238138, linear, 3)</td>\n",
       "      <td>0.099425</td>\n",
       "      <td>0.000039</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>(1.9083405267238138, linear, 3)</td>\n",
       "      <td>0.099425</td>\n",
       "      <td>0.000033</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>(1.9083405267238138, linear, 3)</td>\n",
       "      <td>0.099425</td>\n",
       "      <td>0.000060</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   Best_individual  Fitness_values  Time (hours)\n",
       "0   (2.928571428571429, linear, 2)        0.102514      0.000042\n",
       "1  (1.9083405267238138, linear, 3)        0.099425      0.000033\n",
       "2  (1.9083405267238138, linear, 3)        0.099425      0.000039\n",
       "3  (1.9083405267238138, linear, 3)        0.099425      0.000033\n",
       "4  (1.9083405267238138, linear, 3)        0.099425      0.000060"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fitness_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'alpha': 1.9083405267238138, 'kernels': 'linear', 'degree': 3}\n"
     ]
    }
   ],
   "source": [
    "print(final_best_hyperparameters)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Resume optimization \n",
    "\n",
    "The Genetic Algorithm can resume the search for a combination of the best hyperparameters from the last checkpoint. This feature can be useful when the objective function is computationally expensive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "fitness_df_resume, final_best_hyperparameters_resume = ga.search(n_generations=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Best_individual</th>\n",
       "      <th>Fitness_values</th>\n",
       "      <th>Time (hours)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>(1.9083405267238138, linear, 3)</td>\n",
       "      <td>0.099425</td>\n",
       "      <td>0.000052</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>(1.8043386147076927, linear, 5)</td>\n",
       "      <td>0.098999</td>\n",
       "      <td>0.000036</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>(1.6658103431498519, linear, 1)</td>\n",
       "      <td>0.098386</td>\n",
       "      <td>0.000145</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>(1.6658103431498519, linear, 1)</td>\n",
       "      <td>0.098386</td>\n",
       "      <td>0.000039</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>(1.6658103431498519, linear, 1)</td>\n",
       "      <td>0.098386</td>\n",
       "      <td>0.000072</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   Best_individual  Fitness_values  Time (hours)\n",
       "0  (1.9083405267238138, linear, 3)        0.099425      0.000052\n",
       "1  (1.8043386147076927, linear, 5)        0.098999      0.000036\n",
       "2  (1.6658103431498519, linear, 1)        0.098386      0.000145\n",
       "3  (1.6658103431498519, linear, 1)        0.098386      0.000039\n",
       "4  (1.6658103431498519, linear, 1)        0.098386      0.000072"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fitness_df_resume"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "chemml_dev_env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}