Chem module

The chemml.chem module includes the following classes and functions:
class chemml.chem.BagofBonds(const=1.0, n_jobs=-1, verbose=True)

The implementation of the bag of bonds version of the Coulomb matrix by Katja Hansen et al., 2015, JPCL.

Parameters

const : float, optional (default = 1.0)

The constant for converting coordinate units to atomic units, applied as const/|Ri-Rj|, where the denominator is the Euclidean distance between two atoms. If const=1.0, returns atomic units; if const=0.529, returns Angstrom.

n_jobs : int, optional (default=-1)

The number of parallel processes. If -1, uses all the available processes.

verbose : bool, optional (default=True)

The verbosity of messages.

Attributes

header_ : list of headers for the bag of bonds data frame

Each header contains one nuclear charge (representing a single atom) or a tuple of two nuclear charges (representing a bond).

Examples

>>> from chemml.datasets import load_xyz_polarizability
>>> from chemml.chem import BagofBonds
>>> coordinates, y = load_xyz_polarizability()
>>> bob = BagofBonds(const=1.0)
>>> features = bob.represent(coordinates)

concat_mol_features(bbs_info)

Concatenates a list of molecule features from a parallel run.

Parameters

bbs_info : list or tuple

The list or tuple of features and keys

Returns

features : data frame

A single dataframe of all features

represent(molecules)

Provides the bag of bonds representation for the input molecules.

Parameters

molecules : chemml.chem.Molecule object or array

If a list, it must be a list of chemml.chem.Molecule objects; otherwise a ValueError is raised. In addition, all the molecule objects must provide the XYZ information. Please make sure the XYZ geometry has been stored or optimized in advance.

Returns

features : pandas data frame, shape: (n_molecules, max_length_of_combinations)

The bag of bond features.
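
As a rough illustration of how a bag of bonds feature vector can be assembled, the pure-Python sketch below groups the pairwise terms const*Zi*Zj/|Ri-Rj| by the pair of nuclear charges, sorts each bag in descending order, and pads each bag with zeros to a fixed length. The helper name bag_of_bonds, the bag_sizes argument, and the water coordinates are assumptions made for this example. It is not ChemML's implementation, and the single-atom bags are left out for brevity.

```python
import math
from itertools import combinations


def bag_of_bonds(charges, coords, bag_sizes, const=1.0):
    """Group pairwise Coulomb terms by element pair, sort, and zero-pad each bag.

    Hypothetical helper for illustration only; not part of chemml.
    """
    bags = {key: [] for key in bag_sizes}
    for i, j in combinations(range(len(charges)), 2):
        key = tuple(sorted((charges[i], charges[j])))
        dist = math.dist(coords[i], coords[j])
        bags[key].append(const * charges[i] * charges[j] / dist)

    features = []
    for key in sorted(bag_sizes):
        values = sorted(bags[key], reverse=True)
        # Pad every bag to its fixed length so all molecules share one layout.
        features.extend(values + [0.0] * (bag_sizes[key] - len(values)))
    return features


# Water: O at the origin and two H atoms (illustrative coordinates).
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (1.8, 0.0, 0.0), (-0.45, 1.74, 0.0)]
bag_sizes = {(1, 1): 3, (1, 8): 4}  # padded bag lengths, chosen arbitrarily
features = bag_of_bonds(charges, coords, bag_sizes)
print(len(features))
```

The length of the feature vector is the sum of the padded bag lengths, which is why every molecule in a data set ends up with the same number of columns.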

class chemml.chem.CoulombMatrix(cm_type='SC', max_n_atoms='auto', nPerm=3, const=1, n_jobs=-1, verbose=True)

The implementation of the Coulomb matrix descriptors by Matthias Rupp et al., 2012, PRL (all 3 different variations).

Parameters

cm_type : str, optional (default=’SC’)
The coulomb matrix type, one of the following types:
  • ‘Unsorted_Matrix’ or ‘UM’

  • ‘Unsorted_Triangular’ or ‘UT’

  • ‘Eigenspectrum’ or ‘E’

  • ‘Sorted_Coulomb’ or ‘SC’

  • ‘Random_Coulomb’ or ‘RC’

max_n_atoms : int or ‘auto’, optional (default = ‘auto’)

Set the maximum number of atoms per molecule (to which all representations will be padded). If ‘auto’, it is determined from all the input molecules.

nPerm : int, optional (default = 3)

The number of permutations of the Coulomb matrix per molecule for the Random_Coulomb (RC) type of representation.

const : float, optional (default = 1)

The constant for converting coordinate units to atomic units, applied as const/|Ri-Rj|, where the denominator is the Euclidean distance between atoms i and j. For example: atomic unit -> const=1, Angstrom -> const=0.529.

n_jobs : int, optional (default=-1)

The number of parallel processes. If -1, uses all the available processes.

verbose : bool, optional (default=True)

The verbosity of messages.

Attributes

n_molecules_ : int

Total number of molecules.

max_n_atoms_ : int

Maximum number of atoms in all molecules.

Examples

>>> from chemml.chem import CoulombMatrix, Molecule
>>> m1 = Molecule('c1ccc1', 'smiles')
>>> m2 = Molecule('CNC', 'smiles')
>>> m3 = Molecule('CC', 'smiles')
>>> m4 = Molecule('CCC', 'smiles')
>>> molecules = [m1, m2, m3, m4]
>>> for mol in molecules:
...     mol.to_xyz(optimizer='UFF')
>>> cm = CoulombMatrix(cm_type='SC', n_jobs=-1)
>>> features = cm.represent(molecules)

static concat_dataframes(mol_tensors_list)

Concatenates a list of molecule tensors

Parameters

mol_tensors_list: list

list of molecule tensors

Returns

features : dataframe

a single feature dataframe

represent(molecules)

Provides the Coulomb matrix representation for the input molecules.

Parameters

molecules : chemml.chem.Molecule object or array

If a list, it must be a list of chemml.chem.Molecule objects; otherwise a ValueError is raised. In addition, all the molecule objects must provide the XYZ information. Please make sure the XYZ geometry has been stored or optimized in advance.

Returns

features : pandas DataFrame

A data frame with the same number of rows as the number of molecules. The exact shape of the data frame depends on the type of Coulomb matrix, as follows:

  • shape of Unsorted_Matrix (UM): (n_molecules, max_n_atoms**2)

  • shape of Unsorted_Triangular (UT): (n_molecules, max_n_atoms*(max_n_atoms+1)/2)

  • shape of Eigenspectrum (E): (n_molecules, max_n_atoms)

  • shape of Sorted_Coulomb (SC): (n_molecules, max_n_atoms*(max_n_atoms+1)/2)

  • shape of Random_Coulomb (RC): (n_molecules, nPerm * max_n_atoms * (max_n_atoms+1)/2)
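
To make the shapes above concrete, here is a minimal pure-Python sketch of a zero-padded Coulomb matrix, using the standard definition from Rupp et al. (0.5*Zi**2.4 on the diagonal, Zi*Zj/|Ri-Rj| off the diagonal). The function name coulomb_matrix, the way const is applied, the choice of the lower triangle, and the water coordinates are assumptions for this example rather than ChemML's actual code.

```python
import math


def coulomb_matrix(charges, coords, max_n_atoms, const=1.0):
    """Return a zero-padded max_n_atoms x max_n_atoms Coulomb matrix (nested lists).

    Hypothetical helper for illustration only; not part of chemml.
    """
    cm = [[0.0] * max_n_atoms for _ in range(max_n_atoms)]
    for i, zi in enumerate(charges):
        for j, zj in enumerate(charges):
            if i == j:
                cm[i][j] = 0.5 * zi ** 2.4  # diagonal term
            else:
                cm[i][j] = const * zi * zj / math.dist(coords[i], coords[j])
    return cm


charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (1.8, 0.0, 0.0), (-0.45, 1.74, 0.0)]
max_n_atoms = 4  # one padding row/column beyond the three real atoms

cm = coulomb_matrix(charges, coords, max_n_atoms)

# Unsorted_Matrix (UM): the full matrix flattened, max_n_atoms**2 values.
um = [value for row in cm for value in row]
# Unsorted_Triangular (UT): one triangle with the diagonal,
# max_n_atoms*(max_n_atoms+1)/2 values.
ut = [cm[i][j] for i in range(max_n_atoms) for j in range(i + 1)]

print(len(um), len(ut))  # 16 10
```

Because the matrix is symmetric, the triangular variants drop the redundant half, which is where the max_n_atoms*(max_n_atoms+1)/2 column count comes from.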

class chemml.chem.Dragon(CheckUpdates=True, SaveLayout=True, PreserveTemporaryProjects=True, ShowWorksheet=False, Decimal_Separator='.', Missing_String='NaN', DefaultMolFormat='1', HelpBrowser='/usr/bin/xdg-open', RejectUnusualValence=False, Add2DHydrogens=False, MaxSRforAllCircuit='19', MaxSR='35', MaxSRDetour='30', MaxAtomWalkPath='2000', LogPathWalk=True, LogEdge=True, Weights=('Mass', 'VdWVolume', 'Electronegativity', 'Polarizability', 'Ionization', 'I-State'), SaveOnlyData=False, SaveLabelsOnSeparateFile=False, SaveFormatBlock='%b-%n.txt', SaveFormatSubBlock='%b-%s-%n-%m.txt', SaveExcludeMisVal=False, SaveExcludeAllMisVal=False, SaveExcludeConst=False, SaveExcludeNearConst=False, SaveExcludeStdDev=False, SaveStdDevThreshold='0.0001', SaveExcludeCorrelated=False, SaveCorrThreshold='0.95', SaveExclusionOptionsToVariables=False, SaveExcludeMisMolecules=False, SaveExcludeRejectedMolecules=False, blocks=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], SaveStdOut=False, SaveProject=False, SaveProjectFile='Dragon_project.drp', SaveFile=True, SaveType='singlefile', SaveFilePath='Dragon_descriptors.txt', logMode='file', logFile='Dragon_log.txt', external=False, fileName=None, delimiter=',', consecutiveDelimiter=False, MissingValue='NaN', RejectDisconnectedStrucuture=False, RetainBiggestFragment=False, DisconnectedCalculationOption='0', RoundCoordinates=True, RoundWeights=True, RoundDescriptorValues=True, knimemode=False)

An interface to Dragon 6 and 7 chemoinformatics software. Dragon is a commercial software and you should provide

Parameters

version : int, optional (default=7)

The version of Dragon available on the user’s system (available versions: 6 or 7).

Weights : list, optional (default=["Mass", "VdWVolume", "Electronegativity", "Polarizability", "Ionization", "I-State"])

A list of weights to be used.

blocks : list, optional (default = list(range(1,31)))

A list of integers as the descriptor blocks’ IDs. There are 29 and 30 blocks available in versions 6 and 7, respectively. This module is not aimed at cherry-picking descriptors within each block. To do so, please use the Script Wizard in the Dragon GUI.

external : boolean, optional (default=False)

If True, include external variables at the end of each saved file.

Notes

The documentation for the rest of parameters can be found in the following links:

Examples

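>>> from chemml.chem import Dragon, Molecule
>>> # mol_list must be a list of chemml.chem.Molecule objects
>>> mol_list = [Molecule('CC', 'smiles'), Molecule('CCO', 'smiles')]
>>> drg = Dragon()
>>> df = drg.represent(mol_list, output_directory='./', dropna=False)

represent(mol_list, output_directory='./', dropna=True)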

Parameters

mol_list: list

list of chemml.chem.Molecule objects

output_directory: str

output directory to save dragon scripts

dropna: bool

Drops all columns with any NaN value.

Returns

class chemml.chem.Molecule(input_mol, input_type, engine='rdkit', **kwargs)

The central class to construct a molecule from different chemical input formats. This module is built on top of the RDKit and OpenBabel Python APIs, combining the strengths of these two cheminformatics libraries for a consistent user experience.

Almost all the molecular descriptors and molecule-based ML models require the chemical information as a Molecule object. Several methods are available in this module to facilitate the manipulation of chemical data.

Parameters

input_mol : str

The representation string or path to a file.

input_type : str

The input type. The available types are listed here:

  • smiles: The input must be SMILES representation of a molecule.

  • smarts: The input must be SMARTS representation of a molecule.

  • inchi: The input must be InChi representation of a molecule.

  • xyz: The input must be the path to an xyz file.

  • mol2: The input must be the path to a mol2 file.

kwargs :
The corresponding RDKit arguments for each of the input types:

Notes

  • The molecule will be created as an RDKit molecule object.

  • The molecule object will be stored and available as rdkit_molecule attribute.

  • If you load a molecule from its SMARTS string, there is a high probability that you can’t convert it to other types due to the abstract description of the molecules by the SMARTS representation.

Attributes

rdkit_molecule : object

The rdkit.Chem.rdchem.Mol object

smiles : str

The SMILES string that you get by running the to_smiles method.

smarts : str

The SMARTS string that you get by running the to_smarts method.

inchi : str

The InChi string that you get by running the to_inchi method.

xyz : instance of <class ‘chemml.chem.molecule.XYZ’>

The class object that stores the 3D info. The available attributes in the class are ‘geometry’, ‘atomic_numbers’, and ‘atomic_symbols’.

Examples

>>> from chemml.chem import Molecule
>>> caffeine_smiles = 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'
>>> caffeine_smarts = '[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]'
>>> caffeine_inchi = 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3'
>>> mol = Molecule(caffeine_smiles, 'smiles')
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060a8a0>,
      creator        : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'),
      smiles         : 'Cn1c(=O)c2c(ncn2C)n(C)c1=O',
      smarts         : None,
      inchi          : None,
      xyz            : None)>
>>> mol.smiles # this is the canonical SMILES by default
'Cn1c(=O)c2c(ncn2C)n(C)c1=O'
>>> mol.creator
('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C')
>>> mol.to_smiles(kekuleSmiles=True)
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060a8a0>,
      creator        : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'),
      smiles         : 'CN1C(=O)C2=C(N=CN2C)N(C)C1=O',
      smarts         : None,
      inchi          : None,
      xyz            : None)>
>>> mol.smiles # the Kekule SMILES is not canonical
'CN1C(=O)C2=C(N=CN2C)N(C)C1=O'
>>> mol.inchi is None
True
>>> mol.to_inchi()
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060a8a0>,
      creator        : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'),
      smiles         : 'CN1C(=O)C2=C(N=CN2C)N(C)C1=O',
      smarts         : None,
      inchi          : 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3',
      xyz            : None)>
>>> mol.inchi
'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3'
>>> mol.to_smarts(isomericSmiles=False)
>>> mol.smarts
'[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]'
>>>
>>> # add hydrogens and recreate smiles and inchi
>>> mol.hydrogens('add')
>>> mol.to_smiles()
>>> mol.to_inchi()
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060ab70>,
      creator        : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'),
      smiles         : '[H]c1nc2c(c(=O)n(C([H])([H])[H])c(=O)n2C([H])([H])[H])n1C([H])([H])[H]',
      smarts         : '[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]',
      inchi          : 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3',
      xyz            : None)>
>>> # note that by addition of hydrogens, the smiles string changed, but not inchi
>>> mol.to_xyz()
ValueError: The conformation has not been built yet. Maybe due to the 2D representation of the creator.
You should set the optimizer value if you wish to embed and optimize the 3D geometries.
>>> mol.to_xyz('MMFF', maxIters=300, mmffVariant='MMFF94s')
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060ab70>,
      creator        : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'),
      smiles         : '[H]c1nc2c(c(=O)n(C([H])([H])[H])c(=O)n2C([H])([H])[H])n1C([H])([H])[H]',
      smarts         : '[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]',
      inchi          : 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3',
      xyz            : <chemml.chem.molecule.XYZ object at 0x1105f34e0>)>
>>> mol.xyz
<chemml.chem.molecule.XYZ object at 0x1105f34e0>
>>> mol.xyz.geometry
array([[-3.13498321,  1.08078307,  0.33372515],
       [-2.15703638,  0.0494926 ,  0.0969075 ],
       [-2.41850776, -1.27323453, -0.14538583],
       [-1.30933543, -1.96359887, -0.32234393],
       [-0.31298208, -1.04463051, -0.1870366 ],
       [-0.80033326,  0.20055608,  0.07079013],
       [ 0.02979071,  1.33923464,  0.25484381],
       [-0.41969338,  2.45755985,  0.48688003],
       [ 1.39083332,  1.04039479,  0.14253256],
       [ 1.93405927, -0.23430839, -0.12320318],
       [ 3.15509616, -0.39892779, -0.20417755],
       [ 1.03516373, -1.28860035, -0.28833489],
       [ 1.51247526, -2.63123373, -0.56438513],
       [ 2.34337825,  2.12037198,  0.31039958],
       [-3.03038469,  1.84426033, -0.44113804],
       [-2.95880047,  1.50667225,  1.32459816],
       [-4.13807215,  0.64857048,  0.29175711],
       [-3.4224011 , -1.6776074 , -0.18199971],
       [ 2.60349515, -2.67375289, -0.61674561],
       [ 1.10339164, -2.96582894, -1.52299068],
       [ 1.17455601, -3.30144289,  0.23239402],
       [ 2.94406381,  2.20916763, -0.60086251],
       [ 1.86156872,  3.07985237,  0.51337935],
       [ 3.01465788,  1.87625024,  1.14039627]])
>>> mol.xyz.atomic_numbers
array([[6],
       [7],
       [6],
       [7],
       [6],
       [6],
       [6],
       [8],
       [7],
       [6],
       [8],
       [7],
       [6],
       [6],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1]])
>>> mol.xyz.atomic_symbols
array([['C'],
       ['N'],
       ['C'],
       ['N'],
       ['C'],
       ['C'],
       ['C'],
       ['O'],
       ['N'],
       ['C'],
       ['O'],
       ['N'],
       ['C'],
       ['C'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H']], dtype='<U1')

hydrogens(action='add', **kwargs)

This function adds/removes hydrogens to/from a prebuilt molecule object.

Parameters

action : str

Either ‘add’ or ‘remove’, to add hydrogens to or remove them from the rdkit molecule.

kwargs :

The arguments that can be passed to the rdkit functions:

  • Chem.AddHs: documentation at http://rdkit.org/docs/source/rdkit.Chem.rdmolops.html?highlight=addhs#rdkit.Chem.rdmolops.AddHs

  • Chem.RemoveHs: documentation at http://rdkit.org/docs/source/rdkit.Chem.rdmolops.html?highlight=addhs#rdkit.Chem.rdmolops.RemoveHs

Notes

  • The rdkit or pybel molecule object must be created in advance.

  • Only rdkit or pybel molecule object will be modified in place.

  • If you remove hydrogens from molecules, the atomic 3D coordinates might not be accurate for the conversion to xyz representation.

to_inchi(**kwargs)

This function creates and stores the InChi string for a pre-built molecule.

Parameters

kwargs :

The arguments that can be passed to the rdkit.Chem.MolToInchi function (will be used only if rdkit molecule is available). The documentation is available at: http://rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolToInchi

Notes

  • The rdkit or pybel molecule object must be created in advance.

  • The molecule will be modified in place.

to_mol2(filename=None)

This function creates and stores the xyz coordinates for a pre-built molecule object.

Parameters

optimizer : None or str, optional (default: None)

If None, the geometries will be extracted from the available source of 3D structure (if any). Otherwise, a force field should be passed to embed and optimize the geometries:

  • For openbabel: one of [‘uff’, ‘mmff94’, ‘mmff94s’, ‘ghemical’].

  • For rdkit: either ‘UFF’ or ‘MMFF’, which use the ‘rdkit.Chem.AllChem.UFFOptimizeMolecule’ or ‘rdkit.Chem.AllChem.MMFFOptimizeMolecule’ methods, respectively.

kwargs :

The arguments that can be passed to the corresponding force fields. The documentation is available at:

Notes

  • The geometry will be stored in the xyz attribute.

  • The molecule object must be created in advance.

  • The hydrogens won’t be added to the molecule automatically. You should add them manually using the hydrogens method.

  • If the molecule object has been built using 2D representations (e.g., SMILES or InChi), the conformer doesn’t exist and you need to set the optimizer parameter to any of the force fields.

  • If the 3D info exists but you still need to run optimization, the 3D structure will be embedded from scratch (i.e., the current atom coordinates will be removed).

to_smarts(**kwargs)

This function creates and stores the SMARTS string for a pre-built molecule.

Parameters

kwargs :

All the arguments that can be passed to the rdkit.Chem.MolToSmarts function. The documentation is available at: http://rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.MolToSmarts

Notes

  • The rdkit or pybel molecule object must be created in advance.

  • If only pybel molecule is available, we create an rdkit molecule using its SMILES representation, and then create the SMARTS string using rdkit arguments.

  • The molecule will be modified in place.

to_smiles(**kwargs)

This function creates and stores the SMILES string for a pre-built molecule.

Parameters

kwargs :

The arguments for the rdkit.Chem.MolToSmiles function. The documentation is available at: http://rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.MolToSmiles

Notes

  • The rdkit or pybel molecule object must be created in advance.

  • If only pybel molecule is available, we create an rdkit molecule using its SMILES representation, and then recreate the SMILES string using rdkit arguments.

  • The molecule will be modified in place.

  • For an rdkit molecule, the SMILES string is canonical by default, unless kekuleSmiles is requested.

to_xyz(optimizer=None, steps=500, filename=None, **kwargs)

This function creates and stores the xyz coordinates for a pre-built molecule object.

Parameters

optimizer : None or str, optional (default: None)

If None, the geometries will be extracted from the available source of 3D structure (if any). Otherwise, a force field should be passed to embed and optimize the geometries:

  • For openbabel: one of [‘uff’, ‘mmff94’, ‘mmff94s’, ‘ghemical’].

  • For rdkit: either ‘UFF’ or ‘MMFF’, which use the ‘rdkit.Chem.AllChem.UFFOptimizeMolecule’ or ‘rdkit.Chem.AllChem.MMFFOptimizeMolecule’ methods, respectively.

kwargs :

The arguments that can be passed to the corresponding force fields. The documentation is available at:

Notes

  • The geometry will be stored in the xyz attribute.

  • The molecule object must be created in advance.

  • The hydrogens won’t be added to the molecule automatically. You should add them manually using the hydrogens method.

  • If the molecule object has been built using 2D representations (e.g., SMILES or InChi), the conformer doesn’t exist and you need to set the optimizer parameter to any of the force fields.

  • If the 3D info exists but you still need to run optimization, the 3D structure will be embedded from scratch (i.e., the current atom coordinates will be removed).

visualize(filename=None, **kwargs)

This function visualizes the molecule. If both rdkit and pybel objects are available, the rdkit object will be used for visualization.

Parameters

filename: str, optional (default = None)

The path to the file in which you want to write the image. Tkinter and Python Imaging Library are required for writing the image.

kwargs:

Any extra parameter that you want to pass to the rdkit or pybel draw tool. Additional information at:

Returns

fig : object

You will be able to display this object, e.g., inside the Jupyter Notebook.

class chemml.chem.Mordred(ignore_3D=True, selected_descriptors=False)

A wrapper class for generating Mordred molecular descriptors.

This class provides an interface to create Mordred molecular descriptors, which are an open-source alternative to Dragon descriptors. It allows for the calculation of various molecular properties and features based on chemical structures.

Attributes:

calc (Calculator): A Mordred Calculator object for descriptor generation.

Parameters:

ignore_3D (bool): If True, ignore 3D descriptor generation. Default is True.

selected_descriptors (bool): If True, generate only selected descriptors. Default is False.

Methods:

represent(mol_list, output_directory=’./’, dropna=True, quiet=True, remove_corr=False):

Generates molecular descriptors for a list of molecules.

Notes:

Examples:

>>> from chemml.chem import Mordred
>>> mol_list = ['CC', 'CCO', 'CCCO']  # SMILES strings or chemml.chem.Molecule objects
>>> mord = Mordred()
>>> df = mord.represent(mol_list)

represent(mol_list, output_directory='./', dropna=True, quiet=True, remove_corr=False)

Generate Mordred molecular descriptors for a list of molecules.

This method calculates Mordred descriptors for the provided molecules and returns them as a pandas DataFrame. It can handle input in the form of SMILES strings or ChemML Molecule objects.

Parameters:

mol_list : list, str, or Molecule

Input molecules. Can be one of the following:

  • List of SMILES strings

  • List of ChemML Molecule objects

  • Single SMILES string

  • Single ChemML Molecule object

output_directory : str, optional

Directory to save generated descriptors. Default is ‘./’.

dropna : bool, optional

If True, drop rows with NaN values. Default is True.

quiet : bool, optional

If True, suppress Mordred’s output messages. Default is True.

remove_corr : bool, optional

If True, remove highly correlated descriptors (correlation > 0.95). Default is False.

Returns:

pandas.DataFrame

A DataFrame containing the calculated Mordred descriptors. Each row represents a molecule, and each column represents a descriptor. The ‘SMILES’ column is added to identify the molecules.

Raises:

Exception

If the input SMILES strings are not in a valid format or if there’s an issue with ChemML Molecule objects.

Notes:

  • The method handles various input formats flexibly, converting them to RDKit molecule objects internally.

  • Infinite values in descriptors are replaced with NaN.

  • If remove_corr is True, highly correlated descriptors (correlation > 0.95) are removed to reduce redundancy.
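
The remove_corr behaviour described above can be pictured with a small pure-Python sketch: compute the Pearson correlation between descriptor columns and keep a column only if it is not correlated above the threshold with any column already kept. The helper names pearson and drop_correlated and the toy descriptor values are assumptions for this example. Which member of a correlated pair ChemML actually drops is not stated here, so keeping the first-seen column is also an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if |r| <= threshold against every column already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Toy descriptor table: one list of values per descriptor column.
descriptors = {
    "n_atoms": [2.0, 3.0, 4.0, 5.0],
    "mass": [30.1, 46.1, 60.1, 74.1],  # almost proportional to n_atoms
    "n_oxygen": [0.0, 1.0, 1.0, 0.0],
}
print(list(drop_correlated(descriptors)))  # ['n_atoms', 'n_oxygen']
```

Here "mass" is dropped because its correlation with "n_atoms" exceeds 0.95, while "n_oxygen" is uncorrelated with "n_atoms" and is kept.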

Examples:

>>> from chemml.chem import Mordred
>>> mord = Mordred()
>>> smiles_list = ['CC', 'CCO', 'CCCO']
>>> df = mord.represent(smiles_list, dropna=True, remove_corr=True)
>>> print(df.shape)

class chemml.chem.PadelDesc

A class for generating molecular descriptors using PaDEL-Descriptor via PaDELPy.

This class provides functionality to calculate a wide range of molecular descriptors for chemical compounds using the PaDEL-Descriptor software through its Python wrapper.

Methods:
represent(mol_list, output_directory=’./’, dropna=True, remove_corr=False):

Generates molecular descriptors for a list of molecules.

Examples:
>>> from chemml.chem import PadelDesc
>>> padel_desc = PadelDesc()
>>> smiles_list = ['CC', 'CCO', 'CCCO']
>>> df = padel_desc.represent(smiles_list)

represent(mol_list, output_directory='./', dropna=True, remove_corr=False)

Generate PaDEL molecular descriptors for a list of molecules.

This method calculates PaDEL descriptors for the provided molecules and returns them as a pandas DataFrame.

Parameters:

mol_list : list or str

Input molecules. Can be one of the following:

  • List of SMILES strings

  • Single SMILES string

output_directory : str, optional

Directory to save generated descriptors. Default is ‘./’.

dropna : bool, optional

If True, drop columns with NaN values. Default is True.

remove_corr : bool, optional

If True, remove highly correlated descriptors (correlation > 0.95). Default is False.

Returns:

pandas.DataFrame

A DataFrame containing the calculated PaDEL descriptors. Each row represents a molecule, and each column represents a descriptor. The ‘SMILES’ column is added to identify the molecules.

Raises:

ValueError

If the input SMILES strings are not in a valid format.

class chemml.chem.RDKDesc

A class for generating molecular descriptors using RDKit.

This class provides functionality to calculate a wide range of molecular descriptors for chemical compounds using the RDKit library.

Attributes:

descriptor_list (list): A list of available descriptor names.

Methods:
represent(mol_list, output_directory=’./’, dropna=True, remove_corr=False):

Generates molecular descriptors for a list of molecules.

Examples:
>>> from chemml.chem import RDKDesc
>>> rdkit_desc = RDKDesc()
>>> smiles_list = ['CC', 'CCO', 'CCCO']
>>> df = rdkit_desc.represent(smiles_list)

represent(mol_list, output_directory='./', dropna=True, remove_corr=False)

Generate RDKit molecular descriptors for a list of molecules.

This method calculates RDKit descriptors for the provided molecules and returns them as a pandas DataFrame.

Parameters:

mol_list : list or str

Input molecules. Can be one of the following:

  • List of SMILES strings

  • Single SMILES string

output_directory : str, optional

Directory to save generated descriptors. Default is ‘./’.

dropna : bool, optional

If True, drop columns with NaN values. Default is True.

remove_corr : bool, optional

If True, remove highly correlated descriptors (correlation > 0.95). Default is False.

Returns:

pandas.DataFrame

A DataFrame containing the calculated RDKit descriptors. Each row represents a molecule, and each column represents a descriptor. The ‘SMILES’ column is added to identify the molecules.

Raises:

ValueError

If the input SMILES strings are not in a valid format.

class chemml.chem.RDKitFingerprint(fingerprint_type='Morgan', vector='bit', n_bits=1024, radius=2, **kwargs)

This is an interface to the available molecular fingerprints in the RDKit package.

Parameters

fingerprint_type : str, optional (default=’Morgan’)
The type of fingerprint. Available fingerprint types:
  • ‘hashed_atom_pair’ or ‘hap’

  • ‘MACCS’ or ‘maccs’

  • ‘morgan’

  • ‘hashed_topological_torsion’ or ‘htt’

  • ‘topological_torsion’ or ‘tt’

vector : str, optional (default = ‘bit’)
Available options for vector:
  • ‘int’ : represents counts for each fragment instead of bits

    It is not available for ‘MACCS’.

  • ‘bit’ : only zeros and ones

    It is not available for ‘Topological_torsion’.

n_bits : int, optional (default = 1024)

It sets the number of elements/bits in the ‘bit’ type of fingerprint vectors. Not available for:

  • ‘MACCS’ - (MACCS keys have a fixed length of 167 bits)

  • ‘Topological_torsion’ - doesn’t return a bit vector at all.

radius : int, optional (default = 2)

Only applicable when calculating the ‘Morgan’ fingerprint.

kwargs :

Any additional argument that should be passed to the rdkit fingerprint function.
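
The difference between the ‘bit’ and ‘int’ vector options, and the role of n_bits, can be pictured with the conceptual sketch below. It folds made-up fragment labels into a fixed-length vector with a stand-in hash (zlib.crc32). The function fold_fragments and the fragment labels are hypothetical, and RDKit's real fragment enumeration and hashing work differently, so this only illustrates the idea of a hashed, fixed-length fingerprint.

```python
import zlib


def fold_fragments(fragments, n_bits=1024, vector="bit"):
    """Fold fragment identifiers into a fixed-length 'bit' or 'int' (count) vector.

    Hypothetical helper for illustration only; not RDKit's or chemml's hashing.
    """
    if vector not in ("bit", "int"):
        raise ValueError("vector must be 'bit' or 'int'")
    fp = [0] * n_bits
    for frag in fragments:
        index = zlib.crc32(frag.encode("utf-8")) % n_bits  # stand-in hash
        if vector == "bit":
            fp[index] = 1  # presence only
        else:
            fp[index] += 1  # number of occurrences
    return fp


fragments = ["C-C", "C-O", "C-C", "O-H"]  # made-up fragment labels
bits = fold_fragments(fragments, n_bits=16, vector="bit")
counts = fold_fragments(fragments, n_bits=16, vector="int")
print(sum(bits), sum(counts))
```

A smaller n_bits makes collisions between different fragments more likely, which is the usual trade-off between vector length and information loss in hashed fingerprints.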

Attributes

n_molecules_ : int

The number of molecules that are received.

fps_ : list

The list of rdkit fingerprint objects.

load_sparse(file)

This function enables you to load a sparse matrix in the .npz format and convert it to a pandas dataframe.

Parameters

file : str

Must be a path to the file with .npz format.

Returns

features : pandas.DataFrame

The dense dataframe of the passed sparse file.

represent(molecules)

The main function to provide fingerprint representation of input molecule(s).

Parameters

molecules : chemml.chem.Molecule object or list

It must be an instance of chemml.chem.Molecule object or a list of those objects, otherwise a ValueError will be raised. If the SMILES representation of the molecule (or the rdkit molecule object) is not available, we convert the molecule to SMILES automatically. However, the automatic conversion may ignore your manual settings, for example removed hydrogens, kekulized, or canonical SMILES.

Returns

features : pandas.DataFrame

A 2-dimensional pandas dataframe of fingerprint features with the same number of rows as the number of molecules.

store_sparse(file, features)

This function helps you store highly sparse fingerprint feature sets in the .npz format for memory efficiency and shorter store/load times. Another method of this class, load_sparse, enables you to load your .npz files and convert them back to a pandas dataframe.

Parameters

file : str

Must be a path to the file with .npz format.

features : pandas DataFrame

Must be the pandas dataframe as you receive it from the represent method.

class chemml.chem.XYZ(geometry, atomic_numbers, atomic_symbols)

This class stores the information that is typically carried by standard XYZ files.

Parameters

geometry : ndarray

The numpy array of shape (number_of_atoms, 3). It stores the xyz coordinates for each atom of the molecule.

atomic_numbers : ndarray

The numpy array of shape (number_of_atoms, 1). It stores the atomic numbers of each atom in the molecule (in the same order as geometry).

atomic_symbols : ndarray

The numpy array of shape (number_of_atoms, 1). It stores the atomic symbols of each atom in the molecule (in the same order as geometry).
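
For readers unfamiliar with the XYZ file layout (atom count on the first line, a free-form comment on the second, then one "symbol x y z" line per atom), the following sketch parses such text into the three pieces of information that this class stores. The function parse_xyz, the small element lookup table, and the water coordinates are assumptions for this example. Plain nested lists stand in for the numpy arrays, with the same (number_of_atoms, 3) and (number_of_atoms, 1) layouts.

```python
def parse_xyz(text):
    """Parse XYZ-format text into geometry, atomic_numbers, and atomic_symbols lists.

    Hypothetical helper for illustration only; not part of chemml.
    """
    # Small lookup table for the example; a real parser would cover all elements.
    atomic_number = {"H": 1, "C": 6, "N": 7, "O": 8}
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])             # line 1: number of atoms
    atom_lines = lines[2:2 + n_atoms]   # line 2 is a free-form comment
    geometry, numbers, symbols = [], [], []
    for line in atom_lines:
        symbol, x, y, z = line.split()[:4]
        geometry.append([float(x), float(y), float(z)])
        numbers.append([atomic_number[symbol]])
        symbols.append([symbol])
    return geometry, numbers, symbols


water = """3
water, illustrative coordinates in Angstrom
O   0.000   0.000   0.000
H   0.957   0.000   0.000
H  -0.240   0.927   0.000
"""
geometry, atomic_numbers, atomic_symbols = parse_xyz(water)
print(len(geometry), atomic_numbers, atomic_symbols)
# 3 [[8], [1], [1]] [['O'], ['H'], ['H']]
```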

chemml.chem.atom_features(atom)

This function encodes the RDKit atom to a binary vector.

Parameters

atom : rdkit.Chem.rdchem.Atom

The atom must be an RDKit Atom object.

Returns

features : array

A binary array that encodes the atom. Its length is the value returned by num_atom_features.

chemml.chem.bond_features(bond)

This function encodes the RDKit bond to a binary vector.

Parameters

bond : rdkit.Chem.rdchem.Bond

The bond must be an RDKit Bond object.

Returns

features : array

A binary array of length 6 that specifies whether the bond is single, double, triple, or aromatic, whether it is conjugated, and whether it belongs to a molecular ring.
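
The six-element encoding described above can be sketched without RDKit by using a small stand-in for the bond object. The SimpleBond namedtuple, the encode_bond helper, and the order of the six flags are assumptions for this example. With a real rdkit.Chem.rdchem.Bond, the corresponding information would come from the bond's type, conjugation, and ring-membership queries.

```python
from collections import namedtuple

# Hypothetical stand-in for an RDKit bond, so the sketch runs without RDKit.
SimpleBond = namedtuple("SimpleBond", "bond_type is_conjugated in_ring")


def encode_bond(bond):
    """Return [single, double, triple, aromatic, conjugated, in_ring] as 0/1 flags."""
    kinds = ("SINGLE", "DOUBLE", "TRIPLE", "AROMATIC")
    one_hot = [int(bond.bond_type == kind) for kind in kinds]
    return one_hot + [int(bond.is_conjugated), int(bond.in_ring)]


benzene_bond = SimpleBond("AROMATIC", True, True)
ethane_bond = SimpleBond("SINGLE", False, False)
print(encode_bond(benzene_bond))  # [0, 0, 0, 1, 1, 1]
print(encode_bond(ethane_bond))   # [1, 0, 0, 0, 0, 0]
```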

chemml.chem.num_atom_features()

This function returns the number of atomic features that are available by this module.

Returns

n_features : int

The length of the atomic feature vector.

chemml.chem.num_bond_features()

This function returns the number of bond features that are available by this module.

Returns

n_features : int

The length of the bond feature vector.

chemml.chem.tensorise_molecules(molecules, max_degree=5, max_atoms=None, n_jobs=-1, batch_size=3000, verbose=True)

Takes a list of molecules and provides a tensor representation of atom and bond features. This representation is based on “Convolutional Networks on Graphs for Learning Molecular Fingerprints” by David Duvenaud et al., NIPS 2015.

Parameters

molecules : chemml.chem.Molecule object or array

If a list, it must be a list of chemml.chem.Molecule objects; otherwise a ValueError is raised. In addition, all the molecule objects must provide the SMILES representation. We try to create the SMILES representation if it’s not available.

max_degree : int, optional (default=5)

The maximum number of neighbours per atom that each molecule can have (to which all molecules will be padded). Use ‘None’ for auto.

max_atoms : int, optional (default=None)

The maximum number of atoms per molecule (to which all molecules will be padded). Use ‘None’ for auto.

n_jobs : int, optional (default=-1)

The number of parallel processes. If -1, uses all the available processes.

batch_size : int, optional (default=3000)

The number of molecules per process. A bigger batch size is preferred, as each process will preallocate NumPy arrays.

verbose : bool, optional (default=True)

The verbosity of messages.

Notes

It is not recommended to set max_degree to None/auto when using NeuralGraph layers. max_degree determines the number of trainable parameters and is essentially a hyperparameter. While models can be rebuilt using different max_atoms, they cannot be rebuilt for different values of max_degree, as the architecture will be different.

For organic molecules, max_degree=5 is a good value (Duvenaud et al., 2015).

Returns

atoms : array

An atom feature array of shape (molecules, max_atoms, atom_features)

bonds : array

A bonds array of shape (molecules, max_atoms, max_degree)

edges : array

A connectivity array of shape (molecules, max_atoms, max_degree, bond_features)
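
The padding to max_atoms and max_degree can be illustrated with a small sketch that turns per-atom neighbour lists into a fixed-size table. The helper pad_neighbours, the fill value of -1 for empty slots, and the toy ethanol graph are assumptions for this example and are not taken from ChemML's implementation.

```python
def pad_neighbours(adjacency, max_atoms, max_degree, fill=-1):
    """Pad per-atom neighbour lists to a (max_atoms, max_degree) table.

    Hypothetical helper for illustration only; not part of chemml.
    """
    if len(adjacency) > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    table = [[fill] * max_degree for _ in range(max_atoms)]
    for atom, neighbours in enumerate(adjacency):
        if len(neighbours) > max_degree:
            raise ValueError("atom %d exceeds max_degree" % atom)
        # Overwrite the leading slots; the remaining ones keep the fill value.
        table[atom][:len(neighbours)] = neighbours
    return table


# Heavy-atom graph of ethanol, C-C-O: atom 1 is bonded to atoms 0 and 2.
ethanol = [[1], [0, 2], [1]]
table = pad_neighbours(ethanol, max_atoms=4, max_degree=5)
print(table[1])  # [0, 2, -1, -1, -1]
```

Every molecule padded this way has the same table size, which is what allows a batch of molecules to be stacked into a single array with a leading molecules dimension.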