Preprocessing module

The chemml.preprocessing module includes:

MissingValues: MissingValues()
ConstantColumns: ConstantColumns()
Outliers: Outliers()

chemml.preprocessing.ConstantColumns(df)

Remove constant columns, i.e. columns whose value is identical in every row.

Parameters

df: pandas dataframe: input dataframe

Returns

df: pandas dataframe
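
The idea of removing constant columns can be sketched with plain pandas. This is a minimal illustration, not ChemML's implementation; the helper name drop_constant_columns and the sample data are hypothetical.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df without single-valued columns."""
    # nunique(dropna=False) counts NaN as a value, so an all-NaN column
    # is also treated as constant.
    n_distinct = df.nunique(dropna=False)
    return df.loc[:, n_distinct > 1].copy()


# Made-up sample data: column "b" holds the same value in every row.
df = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7], "c": [0.5, 0.1, 0.5]})
cleaned = drop_constant_columns(df)
print(list(cleaned.columns))
```

Column "b" is dropped because it contributes no variation; "a" and "c" are kept.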

chemml.preprocessing.MissingValues(df, strategy='ignore_row', string_as_null=True, inf_as_null=True, missing_values=None)

Find missing values and interpolate, replace, or remove them.

Parameters

df : pandas dataframe

strategy: string, optional (default="ignore_row")

Available strategies:
- interpolate: interpolate based on sorted target values
- zero: replace missing values with zero
- ignore_row: remove the entire row in data and target
- ignore_column: remove the entire column in data and target

string_as_null: boolean, optional (default=True): If True, non-numeric elements are treated as null in computations.
inf_as_null: boolean, optional (default=True): If True, inf and -inf elements are treated as null in computations.
missing_values: list, optional (default=None): additional markers to treat as missing values, given as a list of string, float, or integer values.

Returns

dataframe
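
The parameters and strategies above can be approximated with plain pandas. The sketch below is not ChemML's code: the function handle_missing is hypothetical, and its "interpolate" branch assumes simple linear interpolation along the existing row order rather than the sorted-target behavior described above.

```python
import numpy as np
import pandas as pd


def handle_missing(df, strategy="ignore_row", string_as_null=True,
                   inf_as_null=True, missing_values=None):
    """Hypothetical pandas-only approximation of the documented strategies."""
    out = df.copy()
    if missing_values:
        # Treat user-supplied markers (e.g. "NA", -999) as null.
        out = out.replace(list(missing_values), np.nan)
    if string_as_null:
        # Entries that cannot be parsed as numbers become NaN.
        out = out.apply(pd.to_numeric, errors="coerce")
    if inf_as_null:
        out = out.replace([np.inf, -np.inf], np.nan)

    if strategy == "zero":
        return out.fillna(0)
    if strategy == "ignore_row":
        return out.dropna(axis=0)
    if strategy == "ignore_column":
        return out.dropna(axis=1)
    if strategy == "interpolate":
        # Assumption: linear interpolation in row order, not sorted by target.
        return out.interpolate(method="linear", limit_direction="both")
    raise ValueError(f"unknown strategy: {strategy!r}")


# Made-up sample data with one string and one infinite entry.
df = pd.DataFrame({
    "x": [1.0, "n/a", 3.0, 4.0],
    "y": [10.0, 20.0, np.inf, 40.0],
    "z": [0.1, 0.2, 0.3, 0.4],
})
rows_kept = handle_missing(df, strategy="ignore_row")
cols_kept = handle_missing(df, strategy="ignore_column")
zeroed = handle_missing(df, strategy="zero")
interpolated = handle_missing(df, strategy="interpolate")
```

With this data, ignore_row keeps only the rows that are complete in every column, while ignore_column keeps only "z", the one column without a null entry.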

Notes

The mask is a binary vector with one entry per row (or column) of df. Each entry records whether the row or column at that position has been removed. Keeping track of removals makes it possible to apply the same change to the target data frame or to other input data frames: the mask can later be used in the transform method to change other data frames in the same way.
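
The role of the mask can be illustrated with a boolean pandas Series. This is a sketch of the concept only; the variable names, the sample data, and the convention that True means "kept" are assumptions, not ChemML's documented behavior.

```python
import numpy as np
import pandas as pd

# Made-up feature and target frames that share the same row index.
data = pd.DataFrame({"f1": [0.1, np.nan, 0.3, 0.4],
                     "f2": [1.0, 2.0, 3.0, np.nan]})
target = pd.DataFrame({"y": [10, 20, 30, 40]})

# Assumed convention: True marks a row that is kept, False a removed row.
row_mask = data.notna().all(axis=1)

# Applying the same mask to both frames keeps data and target aligned.
data_clean = data[row_mask]
target_clean = target[row_mask]
```

Because the mask is computed once and reused, any other data frame with the same row index can be filtered in exactly the same way.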

chemml.preprocessing.Outliers(df, m=2.0, strategy='median')

Remove all rows in which the value of a column lies more than a specified number of standard deviations from that column's mean or median.

Parameters

df: pandas dataframe: input dataframe
m: float, optional (default=2.0): the outlier threshold, expressed as a multiple of the standard deviation
strategy: string, optional (default='median'): available options are 'mean' and 'median'. The values of each column are compared to the mean or median of that column.

Returns

dataframe

Notes

We strongly recommend removing constant columns first and then removing outliers.
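
The outlier rule and the recommended order of operations can be sketched with plain pandas. This is an illustration under stated assumptions, not ChemML's implementation: the function drop_outliers is hypothetical, it uses the sample standard deviation, and it keeps a row only if every column passes the test.

```python
import pandas as pd


def drop_outliers(df, m=2.0, strategy="median"):
    """Hypothetical helper: keep rows within m standard deviations of the center."""
    if strategy == "median":
        center = df.median()
    elif strategy == "mean":
        center = df.mean()
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    spread = df.std()  # sample standard deviation of each column
    # Compare each value with its own column's center and spread.
    within = (df - center).abs() <= m * spread
    # A row survives only if all of its columns pass the test.
    return df[within.all(axis=1)]


# Made-up sample data: "a" has one extreme value, "c" is constant.
df = pd.DataFrame({
    "a": [1, 2, 3, 2, 1, 2, 3, 100],
    "b": [10, 11, 10, 12, 11, 10, 12, 11],
    "c": [5, 5, 5, 5, 5, 5, 5, 5],
})

# Recommended order: remove constant columns first, then outliers.
df = df.loc[:, df.nunique() > 1]
filtered = drop_outliers(df, m=2.0, strategy="median")
```

A constant column has zero standard deviation, so it carries no information for the threshold test; dropping it first keeps the outlier step focused on columns that actually vary.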