Chem module

The chemml.chem module includes:

Molecule: Molecule()
XYZ: XYZ()
CoulombMatrix: CoulombMatrix()
BagofBonds: BagofBonds()
RDKitFingerprint: RDKitFingerprint()
atom_features: atom_features()
bond_features: bond_features()
tensorise_molecules: tensorise_molecules()
Dragon: Dragon()

class chemml.chem.BagofBonds(const=1.0, n_jobs=-1, verbose=True)

An implementation of the bag of bonds version of the Coulomb matrix by Katja Hansen et al., 2015, JPCL.

Parameters

const : float, optional (default = 1.0)

The constant for converting the coordinate units to atomic units. If const=1.0, the result is in atomic units; if const=0.529, it is in Angstrom.

It enters as const/|Ri-Rj|, where the denominator is the Euclidean distance between atoms i and j.

n_jobs : int, optional (default = -1)

The number of parallel processes. If -1, all available processes are used.

verbose : bool, optional (default = True)

The verbosity of messages.

Attributes

header_ : list of headers for the bag of bonds data frame. Each header is either one nuclear charge (a single atom) or a tuple of two nuclear charges (a bond).

Examples

>>> from chemml.datasets import load_xyz_polarizability
>>> from chemml.chem import BagofBonds

>>> coordinates, y = load_xyz_polarizability()
>>> bob = BagofBonds(const= 1.0)
>>> features = bob.represent(coordinates)

concat_mol_features(bbs_info)

Concatenates the molecule features returned by a parallel run.

Parameters

bbs_info : list or tuple: The list or tuple of features and keys.

Returns

features : data frame: A single data frame of all features.

represent(molecules)

Provides the bag of bonds representation for the input molecules.

Parameters

molecules : chemml.chem.Molecule object or array: If a list, it must be a list of chemml.chem.Molecule objects; otherwise a ValueError is raised. In addition, all the molecule objects must provide the XYZ information. Please make sure the XYZ geometry has been stored or optimized in advance.

Returns

features : pandas data frame, shape (n_molecules, max_length_of_combinations): The bag of bonds features.
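The bagging idea can be illustrated with a minimal pure-Python sketch. It is not ChemML's implementation: the helper name `bag_of_bonds` is hypothetical, the diagonal term 0.5 * Z**2.4 is taken from the Coulomb matrix definition of Rupp et al., and the descending sort inside each bag is an assumption based on the bag of bonds paper. Only the const/|Ri-Rj| factor comes from the parameter description above.

```python
import math
from collections import defaultdict
from itertools import combinations


def bag_of_bonds(atomic_numbers, coordinates, const=1.0):
    """Group Coulomb-matrix entries of one molecule into bags (illustrative only)."""
    bags = defaultdict(list)
    # Single-atom bags: the diagonal Coulomb-matrix term 0.5 * Z**2.4 (assumed).
    for z in atomic_numbers:
        bags[z].append(0.5 * z ** 2.4)
    # Pair bags: Zi * Zj * const / |Ri - Rj|, keyed by the sorted pair of charges.
    for i, j in combinations(range(len(atomic_numbers)), 2):
        zi, zj = atomic_numbers[i], atomic_numbers[j]
        distance = math.dist(coordinates[i], coordinates[j])
        key = tuple(sorted((zi, zj), reverse=True))
        bags[key].append(zi * zj * const / distance)
    # Sort each bag in descending order so atom ordering does not matter.
    return {key: sorted(values, reverse=True) for key, values in bags.items()}


# Water with approximate coordinates in atomic units (illustrative values).
numbers = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (1.43, 1.11, 0.0), (-1.43, 1.11, 0.0)]
bags = bag_of_bonds(numbers, coords)
```

The dictionary keys play the role of the header_ entries: an integer for a single-atom bag and a tuple of two nuclear charges for a bond bag.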

class chemml.chem.CoulombMatrix(cm_type='SC', max_n_atoms='auto', nPerm=3, const=1, n_jobs=-1, verbose=True)

An implementation of the Coulomb matrix descriptors by Matthias Rupp et al., 2012, PRL (all 3 different variations).

Parameters

cm_type : str, optional (default=’SC’)

The coulomb matrix type, one of the following types:

‘Unsorted_Matrix’ or ‘UM’
‘Unsorted_Triangular’ or ‘UT’
‘Eigenspectrum’ or ‘E’
‘Sorted_Coulomb’ or ‘SC’
‘Random_Coulomb’ or ‘RC’

max_n_atoms : int or ‘auto’, optional (default = ‘auto’)

The maximum number of atoms per molecule, to which all representations are padded. If ‘auto’, it is determined from the input molecules.

nPerm : int, optional (default = 3)

The number of permutations of the Coulomb matrix per molecule for the Random_Coulomb (RC) type of representation.

const : float, optional (default = 1)

The constant for converting the coordinate units to atomic units, for example const=1 for atomic units and const=0.529 for Angstrom. It enters as const/|Ri-Rj|, where the denominator is the Euclidean distance between atoms i and j.

n_jobs : int, optional (default = -1)

The number of parallel processes. If -1, all available processes are used.

verbose : bool, optional (default = True)

The verbosity of messages.

Attributes

n_molecules_ : int: Total number of molecules.
max_n_atoms_ : int: Maximum number of atoms in all molecules.

Examples

>>> from chemml.chem import CoulombMatrix, Molecule

>>> m1 = Molecule('c1ccc1', 'smiles')
>>> m2 = Molecule('CNC', 'smiles')
>>> m3 = Molecule('CC', 'smiles')
>>> m4 = Molecule('CCC', 'smiles')

>>> molecules = [m1, m2, m3, m4]

>>> for mol in molecules:
...     mol.to_xyz(optimizer='UFF')
>>> cm = CoulombMatrix(cm_type='SC', n_jobs=-1)
>>> features = cm.represent(molecules)
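To make the 'SC' representation more concrete, here is a minimal pure-Python sketch of a zero-padded Coulomb matrix that is sorted by row norm and flattened to its lower triangle. The helper names are hypothetical. The diagonal term 0.5 * Z**2.4 and the sorting by descending row norm follow the Coulomb matrix literature and are assumptions about what ChemML does internally.

```python
import math


def coulomb_matrix(atomic_numbers, coordinates, max_n_atoms, const=1.0):
    """Zero-padded Coulomb matrix as a list of rows (illustrative only)."""
    n = len(atomic_numbers)
    cm = [[0.0] * max_n_atoms for _ in range(max_n_atoms)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cm[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                dist = math.dist(coordinates[i], coordinates[j])
                cm[i][j] = atomic_numbers[i] * atomic_numbers[j] * const / dist
    return cm


def sorted_lower_triangle(cm):
    """Sort rows and columns by descending row norm, then flatten the lower triangle."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in cm]
    order = sorted(range(len(cm)), key=lambda k: norms[k], reverse=True)
    return [cm[order[i]][order[j]] for i in range(len(cm)) for j in range(i + 1)]


# Water (H, O, H) with approximate coordinates in atomic units, padded to 4 atoms.
numbers = [1, 8, 1]
coords = [(1.43, 1.11, 0.0), (0.0, 0.0, 0.0), (-1.43, 1.11, 0.0)]
features = sorted_lower_triangle(coulomb_matrix(numbers, coords, max_n_atoms=4))
```

With max_n_atoms=4 the flattened vector has 4*(4+1)/2 = 10 entries. The oxygen row has the largest norm and therefore comes first, while the padding row ends up last.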

static concat_dataframes(mol_tensors_list)

Concatenates a list of molecule tensors

Parameters

mol_tensors_list: list
list of molecule tensors

Returns

features : dataframe: A single feature dataframe.

represent(molecules)

Provides the Coulomb matrix representation for the input molecules.

Parameters

molecules : chemml.chem.Molecule object or array: If a list, it must be a list of chemml.chem.Molecule objects; otherwise a ValueError is raised. In addition, all the molecule objects must provide the XYZ information. Please make sure the XYZ geometry has been stored or optimized in advance.

Returns

features : pandas DataFrame

A data frame with same number of rows as number of molecules will be returned. The exact shape of the dataframe depends on the type of CM as follows:

shape of Unsorted_Matrix (UM): (n_molecules, max_n_atoms**2)

shape of Unsorted_Triangular (UT): (n_molecules, max_n_atoms*(max_n_atoms+1)/2)

shape of Eigenspectrum (E): (n_molecules, max_n_atoms)

shape of Sorted_Coulomb (SC): (n_molecules, max_n_atoms*(max_n_atoms+1)/2)

shape of Random_Coulomb (RC): (n_molecules, nPerm * max_n_atoms * (max_n_atoms+1)/2)
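The column counts listed above can be checked with a small helper. The function name `n_features` is hypothetical and is not part of ChemML; it only restates the shape formulas.

```python
def n_features(cm_type, max_n_atoms, n_perm=3):
    """Number of feature columns for each Coulomb matrix type listed above."""
    triangle = max_n_atoms * (max_n_atoms + 1) // 2
    sizes = {
        'UM': max_n_atoms ** 2,
        'UT': triangle,
        'E': max_n_atoms,
        'SC': triangle,
        'RC': n_perm * triangle,
    }
    return sizes[cm_type]


# Column counts for molecules padded to 9 atoms and the default nPerm=3.
widths = {cm_type: n_features(cm_type, max_n_atoms=9)
          for cm_type in ('UM', 'UT', 'E', 'SC', 'RC')}
```

For max_n_atoms=9 this gives 81 columns for UM, 45 for UT and SC, 9 for E, and 135 for RC.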

class chemml.chem.Dragon(CheckUpdates=True, SaveLayout=True, PreserveTemporaryProjects=True, ShowWorksheet=False, Decimal_Separator='.', Missing_String='NaN', DefaultMolFormat='1', HelpBrowser='/usr/bin/xdg-open', RejectUnusualValence=False, Add2DHydrogens=False, MaxSRforAllCircuit='19', MaxSR='35', MaxSRDetour='30', MaxAtomWalkPath='2000', LogPathWalk=True, LogEdge=True, Weights=('Mass', 'VdWVolume', 'Electronegativity', 'Polarizability', 'Ionization', 'I-State'), SaveOnlyData=False, SaveLabelsOnSeparateFile=False, SaveFormatBlock='%b-%n.txt', SaveFormatSubBlock='%b-%s-%n-%m.txt', SaveExcludeMisVal=False, SaveExcludeAllMisVal=False, SaveExcludeConst=False, SaveExcludeNearConst=False, SaveExcludeStdDev=False, SaveStdDevThreshold='0.0001', SaveExcludeCorrelated=False, SaveCorrThreshold='0.95', SaveExclusionOptionsToVariables=False, SaveExcludeMisMolecules=False, SaveExcludeRejectedMolecules=False, blocks=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], SaveStdOut=False, SaveProject=False, SaveProjectFile='Dragon_project.drp', SaveFile=True, SaveType='singlefile', SaveFilePath='Dragon_descriptors.txt', logMode='file', logFile='Dragon_log.txt', external=False, fileName=None, delimiter=',', consecutiveDelimiter=False, MissingValue='NaN', RejectDisconnectedStrucuture=False, RetainBiggestFragment=False, DisconnectedCalculationOption='0', RoundCoordinates=True, RoundWeights=True, RoundDescriptorValues=True, knimemode=False)

An interface to Dragon 6 and 7 chemoinformatics software. Dragon is a commercial software and you should provide

Parameters

version : int, optional (default=7): The version of Dragon available on the user’s system (available versions: 6 or 7).
Weights : list, optional (default=["Mass", "VdWVolume", "Electronegativity", "Polarizability", "Ionization", "I-State"]): A list of weights to be used.
blocks : list, optional (default=list(range(1,31))): A list of integers as descriptor block IDs. There are 29 blocks available in version 6 and 30 in version 7. This module is not designed to cherry-pick descriptors within a block; to do so, please use the Script Wizard in the Dragon GUI.
external : boolean, optional (default=False): If True, include external variables at the end of each saved file.

Notes

The documentation for the rest of parameters can be found in the following links:

http://www.talete.mi.it/help/dragon_help/index.html

https://chm.kode-solutions.net/products_dragon_tutorial.php

Examples

>>> import pandas as pd
>>> from chemml.chem import Dragon
>>> drg = Dragon()
>>> df = drg.represent(mol_list, output_directory='./', dropna=False)

represent(mol_list, output_directory='./', dropna=True)

Parameters

mol_list: list: list of chemml.chem.Molecule objects
output_directory: str: output directory to save dragon scripts
dropna: bool: Drops all columns with any NaN value.

Returns

class chemml.chem.Molecule(input_mol, input_type, engine='rdkit', **kwargs)

The central class to construct a molecule from different chemical input formats. This module is built on top of RDKit and OpenBabel python API. We join the forces and strength of these two cheminformatic libraries for a consistent user experience.

Almost all the molecular descriptors and molecule-based ML models require the chemical information as a Molecule object. Several methods are available in this module to facilitate the manipulation of chemical data.

Parameters

input_mol : str

The representation string or path to a file.

input_type : str

The input type. The available types are listed here:

smiles: The input must be SMILES representation of a molecule.

smarts: The input must be SMARTS representation of a molecule.

inchi: The input must be InChi representation of a molecule.

xyz: The input must be the path to an xyz file.

mol2: The input must be the path to a mol2 file.

kwargs :

The corresponding RDKit arguments for each of the input types:

Notes

The molecule will be created as an RDKit molecule object.

The molecule object will be stored and available as rdkit_molecule attribute.

If you load a molecule from its SMARTS string, there is a high probability that you can’t convert it to other types, due to the abstract description of molecules by the SMARTS representation.

Attributes

rdkit_molecule : object: The rdkit.Chem.rdchem.Mol object.
smiles : str: The SMILES string that you get by running the to_smiles method.
smarts : str: The SMARTS string that you get by running the to_smarts method.
inchi : str: The InChi string that you get by running the to_inchi method.
xyz : instance of <class ‘chemml.chem.molecule.XYZ’>: The class object that stores the 3D info. The available attributes in the class are ‘geometry’, ‘atomic_numbers’, and ‘atomic_symbols’.

Examples

>>> from chemml.chem import Molecule
>>> caffeine_smiles = 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'
>>> caffeine_smarts = '[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]'
>>> caffeine_inchi = 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3'
>>> mol = Molecule(caffeine_smiles, 'smiles')
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060a8a0>,
      creator        : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'),
      smiles         : 'Cn1c(=O)c2c(ncn2C)n(C)c1=O',
      smarts         : None,
      inchi          : None,
      xyz            : None)>
>>> mol.smiles # this is the canonical SMILES by default
'Cn1c(=O)c2c(ncn2C)n(C)c1=O'
>>> mol.creator
('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C')
>>> mol.to_smiles(kekuleSmiles=True)
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060a8a0>,
      creator        : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'),
      smiles         : 'CN1C(=O)C2=C(N=CN2C)N(C)C1=O',
      smarts         : None,
      inchi          : None,
      xyz            : None)>
>>> mol.smiles # the Kekule SMILES is not canonical
'CN1C(=O)C2=C(N=CN2C)N(C)C1=O'
>>> mol.inchi is None
True
>>> mol.to_inchi()
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060a8a0>,
      creator        : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'),
      smiles         : 'CN1C(=O)C2=C(N=CN2C)N(C)C1=O',
      smarts         : None,
      inchi          : 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3',
      xyz            : None)>
>>> mol.inchi
'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3'
>>> mol.to_smarts(isomericSmiles=False)
>>> mol.smarts
'[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]'
>>>
>>> # add hydrogens and recreate smiles and inchi
>>> mol.hydrogens('add')
>>> mol.to_smiles()
>>> mol.to_inchi()
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060ab70>,
      creator        : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'),
      smiles         : '[H]c1nc2c(c(=O)n(C([H])([H])[H])c(=O)n2C([H])([H])[H])n1C([H])([H])[H]',
      smarts         : '[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]',
      inchi          : 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3',
      xyz            : None)>
>>> # note that by addition of hydrogens, the smiles string changed, but not inchi
>>> mol.to_xyz()
ValueError: The conformation has not been built yet. Maybe due to the 2D representation of the creator.
You should set the optimizer value if you wish to embed and optimize the 3D geometries.
>>> mol.to_xyz('MMFF', maxIters=300, mmffVariant='MMFF94s')
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060ab70>,
      creator        : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'),
      smiles         : '[H]c1nc2c(c(=O)n(C([H])([H])[H])c(=O)n2C([H])([H])[H])n1C([H])([H])[H]',
      smarts         : '[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]',
      inchi          : 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3',
      xyz            : <chemml.chem.molecule.XYZ object at 0x1105f34e0>)>
>>> mol.xyz
<chemml.chem.molecule.XYZ object at 0x1105f34e0>
>>> mol.xyz.geometry
array([[-3.13498321,  1.08078307,  0.33372515],
   [-2.15703638,  0.0494926 ,  0.0969075 ],
   [-2.41850776, -1.27323453, -0.14538583],
   [-1.30933543, -1.96359887, -0.32234393],
   [-0.31298208, -1.04463051, -0.1870366 ],
   [-0.80033326,  0.20055608,  0.07079013],
   [ 0.02979071,  1.33923464,  0.25484381],
   [-0.41969338,  2.45755985,  0.48688003],
   [ 1.39083332,  1.04039479,  0.14253256],
   [ 1.93405927, -0.23430839, -0.12320318],
   [ 3.15509616, -0.39892779, -0.20417755],
   [ 1.03516373, -1.28860035, -0.28833489],
   [ 1.51247526, -2.63123373, -0.56438513],
   [ 2.34337825,  2.12037198,  0.31039958],
   [-3.03038469,  1.84426033, -0.44113804],
   [-2.95880047,  1.50667225,  1.32459816],
   [-4.13807215,  0.64857048,  0.29175711],
   [-3.4224011 , -1.6776074 , -0.18199971],
   [ 2.60349515, -2.67375289, -0.61674561],
   [ 1.10339164, -2.96582894, -1.52299068],
   [ 1.17455601, -3.30144289,  0.23239402],
   [ 2.94406381,  2.20916763, -0.60086251],
   [ 1.86156872,  3.07985237,  0.51337935],
   [ 3.01465788,  1.87625024,  1.14039627]])
>>> mol.xyz.atomic_numbers
array([[6],
   [7],
   [6],
   [7],
   [6],
   [6],
   [6],
   [8],
   [7],
   [6],
   [8],
   [7],
   [6],
   [6],
   [1],
   [1],
   [1],
   [1],
   [1],
   [1],
   [1],
   [1],
   [1],
   [1]])
>>> mol.xyz.atomic_symbols
array([['C'],
       ['N'],
       ['C'],
       ['N'],
       ['C'],
       ['C'],
       ['C'],
       ['O'],
       ['N'],
       ['C'],
       ['O'],
       ['N'],
       ['C'],
       ['C'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H'],
       ['H']], dtype='<U1')

hydrogens(action='add', **kwargs)

This function adds/removes hydrogens to/from a prebuilt molecule object.

Parameters

action : str: Either ‘add’ or ‘remove’, to add hydrogens to or remove them from the rdkit molecule.
kwargs : The arguments that can be passed to the rdkit functions:
- Chem.AddHs: documentation at http://rdkit.org/docs/source/rdkit.Chem.rdmolops.html?highlight=addhs#rdkit.Chem.rdmolops.AddHs
- Chem.RemoveHs: documentation at http://rdkit.org/docs/source/rdkit.Chem.rdmolops.html?highlight=addhs#rdkit.Chem.rdmolops.RemoveHs

Notes

The rdkit or pybel molecule object must be created in advance.

Only rdkit or pybel molecule object will be modified in place.

If you remove hydrogens from molecules, the atomic 3D coordinates might not be accurate for the conversion to xyz representation.

to_inchi(**kwargs)

This function creates and stores the InChi string for a pre-built molecule.

Parameters

kwargs :: The arguments that can be passed to the rdkit.Chem.MolToInchi function (will be used only if rdkit molecule is available). The documentation is available at: http://rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolToInchi

Notes

The rdkit or pybel molecule object must be created in advance.

The molecule will be modified in place.

to_mol2(filename=None)

This function creates and stores the xyz coordinates for a pre-built molecule object.

Parameters

optimizer : None or str, optional (default: None)

The geometries will be extracted from the available source of 3D structure (if any). For openbabel:

[‘uff’, ‘mmff94’, ‘mmff94s’, ‘ghemical’]

For rdkit: otherwise, either the ‘UFF’ or the ‘MMFF’ force field should be passed to embed and optimize geometries using the ‘rdkit.Chem.AllChem.UFFOptimizeMolecule’ or ‘rdkit.Chem.AllChem.MMFFOptimizeMolecule’ methods, respectively.

kwargs :

The arguments that can be passed to the corresponding force fields. The documentation is available at:

UFFOptimizeMolecule: http://rdkit.org/docs/source/rdkit.Chem.rdForceFieldHelpers.html?highlight=mmff#rdkit.Chem.rdForceFieldHelpers.UFFOptimizeMolecule

MMFFOptimizeMolecule: http://rdkit.org/docs/source/rdkit.Chem.rdForceFieldHelpers.html?highlight=mmff#rdkit.Chem.rdForceFieldHelpers.MMFFOptimizeMolecule

Notes

The geometry will be stored in the xyz attribute.

The molecule object must be created in advance.

The hydrogens won’t be added to the molecule automatically. You should add it manually using hydrogens method.

If the molecule object has been built using 2D representations (e.g., SMILES or InChi), the conformer doesn’t exist and you need to set the optimizer parameter to any of the force fields.

If the 3D info exists but you still need to run optimization, the 3D structure will be embedded from scratch (i.e., the current atom coordinates will be removed).

to_smarts(**kwargs)

This function creates and stores the SMARTS string for a pre-built molecule.

Parameters

kwargs :: All the arguments that can be passed to the rdkit.Chem.MolToSmarts function. The documentation is available at: http://rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.MolToSmarts

Notes

The rdkit or pybel molecule object must be created in advance.

If only pybel molecule is available, we create an rdkit molecule using its SMILES representation, and then create the SMARTS string using rdkit arguments.

The molecule will be modified in place.

to_smiles(**kwargs)

This function creates and stores the SMILES string for a pre-built molecule.

Parameters

kwargs :: The arguments for the rdkit.Chem.MolToSmiles function. The documentation is available at: http://rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.MolToSmiles

Notes

The rdkit or pybel molecule object must be created in advance.

If only pybel molecule is available, we create an rdkit molecule using its SMILES representation, and then recreate the SMILES string using rdkit arguments.

The molecule will be modified in place.

For an rdkit molecule the SMILES string is canonical by default, unless kekuleSmiles is requested.

to_xyz(optimizer=None, steps=500, filename=None, **kwargs)

This function creates and stores the xyz coordinates for a pre-built molecule object.

Parameters

optimizer : None or str, optional (default: None)

The geometries will be extracted from the available source of 3D structure (if any). For openbabel:

[‘uff’, ‘mmff94’, ‘mmff94s’, ‘ghemical’]

For rdkit: otherwise, either the ‘UFF’ or the ‘MMFF’ force field should be passed to embed and optimize geometries using the ‘rdkit.Chem.AllChem.UFFOptimizeMolecule’ or ‘rdkit.Chem.AllChem.MMFFOptimizeMolecule’ methods, respectively.

kwargs :

The arguments that can be passed to the corresponding force fields. The documentation is available at:

UFFOptimizeMolecule: http://rdkit.org/docs/source/rdkit.Chem.rdForceFieldHelpers.html?highlight=mmff#rdkit.Chem.rdForceFieldHelpers.UFFOptimizeMolecule

MMFFOptimizeMolecule: http://rdkit.org/docs/source/rdkit.Chem.rdForceFieldHelpers.html?highlight=mmff#rdkit.Chem.rdForceFieldHelpers.MMFFOptimizeMolecule

Notes

The geometry will be stored in the xyz attribute.

The molecule object must be created in advance.

The hydrogens won’t be added to the molecule automatically. You should add it manually using hydrogens method.

If the molecule object has been built using 2D representations (e.g., SMILES or InChi), the conformer doesn’t exist and you need to set the optimizer parameter to any of the force fields.

If the 3D info exists but you still need to run optimization, the 3D structure will be embedded from scratch (i.e., the current atom coordinates will be removed).
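The notes above amount to a small decision table. The sketch below restates it in plain Python. `to_xyz_action` is a hypothetical helper written for illustration, not a ChemML function, and the returned strings are only descriptions of the documented behaviour.

```python
def to_xyz_action(has_conformer, optimizer=None):
    """Describe what to_xyz does for a given state (hypothetical helper)."""
    if optimizer is None:
        if not has_conformer:
            # Mirrors the ValueError shown in the Molecule examples.
            raise ValueError('The conformation has not been built yet.')
        return 'extract existing 3D coordinates'
    # Any optimizer triggers a fresh embedding; current coordinates are discarded.
    return 'embed from scratch and optimize with ' + optimizer


from_file = to_xyz_action(has_conformer=True)
from_smiles = to_xyz_action(has_conformer=False, optimizer='UFF')
reoptimized = to_xyz_action(has_conformer=True, optimizer='MMFF')
```

In practice this means that a molecule created from SMILES or InChi needs hydrogens added first (if desired) and then an explicit optimizer, while a molecule that already carries 3D coordinates can call to_xyz without arguments.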

visualize(filename=None, **kwargs)

This function visualizes the molecule. If both rdkit and pybel objects are available, the rdkit object will be used for visualization.

Parameters

filename: str, optional (default = None)

This is the path to the file in which you want to write the image. Tkinter and Python Imaging Library are required for writing the image.

kwargs:

any extra parameter that you want to pass to the rdkit or pybel draw tool. Additional information at:

https://www.rdkit.org/docs/source/rdkit.Chem.Draw.html

http://openbabel.org/docs/dev/UseTheLibrary/Python_PybelAPI.html#pybel.Molecule.draw

Returns

fig : object: You will be able to display this object, e.g., inside a Jupyter Notebook.

class chemml.chem.Mordred(ignore_3D=True, selected_descriptors=False)

A wrapper class for generating Mordred molecular descriptors.

This class provides an interface to create Mordred molecular descriptors, which are an open-source alternative to Dragon descriptors. It allows for the calculation of various molecular properties and features based on chemical structures.

Attributes:

calc (Calculator): A Mordred Calculator object for descriptor generation.

Parameters:

ignore_3D (bool): If True, ignore 3D descriptor generation. Default is True.
selected_descriptors (bool): If True, generate only selected descriptors. Default is False.

Methods:

represent(mol_list, output_directory=’./’, dropna=True, quiet=True, remove_corr=False):
Generates molecular descriptors for a list of molecules.

Notes:

Requires installation of Mordred descriptors as described in: https://github.com/mordred-descriptor/mordred

By default, all available descriptors are generated.

Examples:

>>> import pandas as pd
>>> from chemml.chem import Mordred
>>> mord = Mordred()
>>> df = mord.represent(mol_list)

represent(mol_list, output_directory='./', dropna=True, quiet=True, remove_corr=False)

Generate Mordred molecular descriptors for a list of molecules.

This method calculates Mordred descriptors for the provided molecules and returns them as a pandas DataFrame. It can handle input in the form of SMILES strings or ChemML Molecule objects.

Parameters:

mol_list : list, str, or Molecule: Input molecules. Can be one of the following:
- List of SMILES strings
- List of ChemML Molecule objects
- Single SMILES string
- Single ChemML Molecule object
output_directory : str, optional: Directory to save generated descriptors. Default is ‘./’.
dropna : bool, optional: If True, drop rows with NaN values. Default is True.
quiet : bool, optional: If True, suppress Mordred’s output messages. Default is True.
remove_corr : bool, optional: If True, remove highly correlated descriptors (correlation > 0.95). Default is False.

Returns:

pandas.DataFrame: A DataFrame containing the calculated Mordred descriptors. Each row represents a molecule, and each column represents a descriptor. The ‘SMILES’ column is added to identify the molecules.

Raises:

Exception: If the input SMILES strings are not in a valid format or if there’s an issue with ChemML Molecule objects.

Notes:

The method handles various input formats flexibly, converting them to RDKit molecule objects internally.
Infinite values in descriptors are replaced with NaN.
If remove_corr is True, highly correlated descriptors (correlation > 0.95) are removed to reduce redundancy.

Examples:

>>> mord = Mordred()
>>> smiles_list = ['CC', 'CCO', 'CCCO']
>>> df = mord.represent(smiles_list, dropna=True, remove_corr=True)
>>> print(df.shape)
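The effect of remove_corr can be illustrated with a minimal pure-Python sketch. The documentation states only the 0.95 threshold, so the rule used here (keep the first column of a highly correlated pair, compare absolute Pearson correlation) is an assumption, and the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with an already kept one."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Made-up descriptor values for four molecules.
descriptors = {
    'n_atoms': [2.0, 3.0, 4.0, 5.0],
    'mass': [30.0, 46.0, 60.0, 74.0],   # nearly proportional to n_atoms
    'n_rings': [0.0, 1.0, 0.0, 1.0],
}
reduced = drop_correlated(descriptors)
```

Here the mass column is dropped because its correlation with n_atoms exceeds 0.95, while n_rings is kept.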

class chemml.chem.PadelDesc

A class for generating molecular descriptors using PaDEL-Descriptor via PaDELPy.

This class provides functionality to calculate a wide range of molecular descriptors for chemical compounds using the PaDEL-Descriptor software through its Python wrapper.

Methods:

represent(mol_list, output_directory=’./’, dropna=True, remove_corr=False): Generates molecular descriptors for a list of molecules.

Examples:

>>> from chemml.chem import PadelDesc
>>> padel_desc = PadelDesc()
>>> smiles_list = ['CC', 'CCO', 'CCCO']
>>> df = padel_desc.represent(smiles_list)

represent(mol_list, output_directory='./', dropna=True, remove_corr=False)

Generate PaDEL molecular descriptors for a list of molecules.

This method calculates PaDEL descriptors for the provided molecules and returns them as a pandas DataFrame.

Parameters:

mol_list : list or str: Input molecules. Can be one of the following:
- List of SMILES strings
- Single SMILES string
output_directory : str, optional: Directory to save generated descriptors. Default is ‘./’.
dropna : bool, optional: If True, drop columns with NaN values. Default is True.
remove_corr : bool, optional: If True, remove highly correlated descriptors (correlation > 0.95). Default is False.

Returns:

pandas.DataFrame: A DataFrame containing the calculated PaDEL descriptors. Each row represents a molecule, and each column represents a descriptor. The ‘SMILES’ column is added to identify the molecules.

Raises:

ValueError: If the input SMILES strings are not in a valid format.

class chemml.chem.RDKDesc

A class for generating molecular descriptors using RDKit.

This class provides functionality to calculate a wide range of molecular descriptors for chemical compounds using the RDKit library.

Attributes:

descriptor_list (list): A list of available descriptor names.

Methods:

represent(mol_list, output_directory=’./’, dropna=True, remove_corr=False): Generates molecular descriptors for a list of molecules.

Examples:

>>> from chemml.chem import RDKDesc
>>> rdkit_desc = RDKDesc()
>>> smiles_list = ['CC', 'CCO', 'CCCO']
>>> df = rdkit_desc.represent(smiles_list)

represent(mol_list, output_directory='./', dropna=True, remove_corr=False)

Generate RDKit molecular descriptors for a list of molecules.

This method calculates RDKit descriptors for the provided molecules and returns them as a pandas DataFrame.

Parameters:

mol_list : list or str: Input molecules. Can be one of the following:
- List of SMILES strings
- Single SMILES string
output_directory : str, optional: Directory to save generated descriptors. Default is ‘./’.
dropna : bool, optional: If True, drop columns with NaN values. Default is True.
remove_corr : bool, optional: If True, remove highly correlated descriptors (correlation > 0.95). Default is False.

Returns:

pandas.DataFrame: A DataFrame containing the calculated RDKit descriptors. Each row represents a molecule, and each column represents a descriptor. The ‘SMILES’ column is added to identify the molecules.

Raises:

ValueError: If the input SMILES strings are not in a valid format.

class chemml.chem.RDKitFingerprint(fingerprint_type='Morgan', vector='bit', n_bits=1024, radius=2, **kwargs)

This is an interface to the available molecular fingerprints in the RDKit package.

Parameters

fingerprint_type : str, optional (default=’Morgan’)

The type of fingerprint. Available fingerprint types:

‘hashed_atom_pair’ or ‘hap’
‘MACCS’ or ‘maccs’
‘morgan’
‘hashed_topological_torsion’ or ‘htt’
‘topological_torsion’ or ‘tt’

vector : str, optional (default = ‘bit’)

Available options for vector:

‘int’: represents counts for each fragment instead of bits. It is not available for ‘MACCS’.
‘bit’: only zeros and ones. It is not available for ‘Topological_torsion’.

n_bits : int, optional (default = 1024)

The number of elements/bits in the ‘bit’ type of fingerprint vectors. Not available for:

‘MACCS’ - (MACCS keys have a fixed length of 167 bits)

‘Topological_torsion’ - doesn’t return a bit vector at all.

radius : int, optional (default = 2)

Only applicable when calculating the ‘Morgan’ fingerprint.

kwargs :

Any additional argument that should be passed to the rdkit fingerprint function.

Attributes

n_molecules_ : int: The number of molecules that are received.
fps_ : list: The list of rdkit fingerprint objects.
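The difference between the ‘bit’ and ‘int’ vector options can be shown with a toy example. RDKit's real fingerprint hashing is more involved; the modulo folding and the helper name `fold_fragments` below are simplifying assumptions used only to show presence flags versus counts in a vector of n_bits elements.

```python
def fold_fragments(fragment_ids, n_bits=1024, vector='bit'):
    """Fold integer fragment identifiers into a fixed-length vector (illustrative only)."""
    features = [0] * n_bits
    for fragment_id in fragment_ids:
        position = fragment_id % n_bits
        if vector == 'int':
            features[position] += 1   # count every occurrence
        else:
            features[position] = 1    # only record presence
    return features


# 1029 % 1024 == 5, so the last identifier collides with the fragments at position 5.
fragments = [5, 5, 13, 1029]
bits = fold_fragments(fragments, vector='bit')
counts = fold_fragments(fragments, vector='int')
```

The bit vector only records that positions 5 and 13 are occupied, whereas the count vector keeps the value 3 at position 5. A smaller n_bits increases the chance of such collisions.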

load_sparse(file)

This function enables you to load sparse matrix with the .npz format and convert it to a pandas dataframe.

Parameters

file : str: Must be a path to the file with .npz format.

Returns

features : pandas.DataFrame: The dense dataframe of the passed sparse file.

represent(molecules)

The main function to provide fingerprint representation of input molecule(s).

Parameters

molecules : chemml.chem.Molecule object or list: It must be an instance of chemml.chem.Molecule or a list of those objects; otherwise a ValueError will be raised. If the SMILES representation of the molecule (or the rdkit molecule object) is not available, the molecule is converted to SMILES automatically. However, the automatic conversion may ignore your manual settings, for example removed hydrogens, kekulized, or canonical SMILES.

Returns

features : pandas.DataFrame: A 2-dimensional pandas dataframe of fingerprint features with the same number of rows as the number of molecules.

store_sparse(file, features)

This function helps you to store highly sparse fingerprint feature sets using the .npz format for memory efficiency and shorter store/load times. Another method of this class, load_sparse, enables you to load your .npz files and convert them back to a pandas dataframe.

Parameters

file : str: Must be a path to the file with .npz format.
features : pandas DataFrame: Must be the pandas dataframe as you receive it from the represent method.
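Why sparse storage pays off for fingerprints can be seen in a small pure-Python round trip. This sketch shows only the idea of keeping the nonzero entries. It does not reproduce the .npz layout, and the helper names are hypothetical.

```python
def to_sparse(rows):
    """Return (shape, triples) holding only the nonzero entries of a dense table."""
    triples = [(i, j, value)
               for i, row in enumerate(rows)
               for j, value in enumerate(row) if value != 0]
    return (len(rows), len(rows[0])), triples


def to_dense(shape, triples):
    """Rebuild the dense table from the stored nonzero entries."""
    rows = [[0] * shape[1] for _ in range(shape[0])]
    for i, j, value in triples:
        rows[i][j] = value
    return rows


# Two short fingerprint rows with one nonzero entry each.
dense = [[0, 0, 1, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 2, 0, 0]]
shape, triples = to_sparse(dense)
restored = to_dense(shape, triples)
```

Only two triples are stored instead of sixteen values, and the dense table is recovered exactly, which is the same contract that store_sparse and load_sparse describe for dataframes.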

class chemml.chem.XYZ(geometry, atomic_numbers, atomic_symbols)

This class stores the information that is typically carried by standard XYZ files.

Parameters

geometry : ndarray: The numpy array of shape (number_of_atoms, 3). It stores the xyz coordinates for each atom of the molecule.
atomic_numbers : ndarray: The numpy array of shape (number_of_atoms, 1). It stores the atomic number of each atom in the molecule (in the same order as geometry).
atomic_symbols : ndarray: The numpy array of shape (number_of_atoms, 1). It stores the atomic symbol of each atom in the molecule (in the same order as geometry).
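As an illustration of the three fields, the following sketch parses the text of a standard XYZ file into nested lists with the same layout. The parser and the small symbol table are hypothetical helpers, not part of ChemML, and plain lists stand in for the NumPy arrays that the class expects.

```python
SYMBOL_TO_NUMBER = {'H': 1, 'C': 6, 'N': 7, 'O': 8}   # small subset for the example


def parse_xyz(text):
    """Split XYZ file text into geometry, atomic_numbers and atomic_symbols lists."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    geometry, atomic_numbers, atomic_symbols = [], [], []
    # Line 1 is the atom count and line 2 a comment; atoms start on line 3.
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        geometry.append([float(x), float(y), float(z)])
        atomic_numbers.append([SYMBOL_TO_NUMBER[symbol]])
        atomic_symbols.append([symbol])
    return geometry, atomic_numbers, atomic_symbols


water = """3
water, coordinates in Angstrom
O  0.000  0.000  0.000
H  0.757  0.586  0.000
H -0.757  0.586  0.000
"""
geometry, atomic_numbers, atomic_symbols = parse_xyz(water)
```

Each atom contributes one row to every field, so the lists correspond to shapes (3, 3), (3, 1) and (3, 1) for water.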

chemml.chem.atom_features(atom)

This function encodes the RDKit atom to a binary vector.

Parameters

atom : rdkit.Chem.rdchem.Atom: The atom must be an RDKit Atom object.

Returns

features : array: A binary array that encodes the atom. Its length is given by num_atom_features().

chemml.chem.bond_features(bond)

This function encodes the RDKit bond to a binary vector.

Parameters

bond : rdkit.Chem.rdchem.Bond: The bond must be an RDKit Bond object.

Returns

features : array: A binary array of length 6 that specifies the type of bond: whether it is a single/double/triple/aromatic bond, a conjugated bond, or belongs to a molecular ring.
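The six flags can be pictured with a small stand-alone encoder. The real function takes an RDKit Bond object. The helper below takes plain values instead, and the order of the flags is an assumption made for illustration.

```python
BOND_TYPES = ('single', 'double', 'triple', 'aromatic')


def encode_bond(bond_type, is_conjugated, is_in_ring):
    """Encode a bond as six binary flags: four bond types, conjugation, ring membership."""
    one_hot = [int(bond_type == name) for name in BOND_TYPES]
    return one_hot + [int(is_conjugated), int(is_in_ring)]


# A bond in benzene versus the C-C bond in ethane.
benzene_bond = encode_bond('aromatic', is_conjugated=True, is_in_ring=True)
ethane_bond = encode_bond('single', is_conjugated=False, is_in_ring=False)
```

Under these assumptions a benzene bond is encoded as [0, 0, 0, 1, 1, 1] and the ethane bond as [1, 0, 0, 0, 0, 0].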

chemml.chem.num_atom_features()

This function returns the number of atomic features that are available by this module.

Returns

n_features : int: Length of the atomic feature vector.

chemml.chem.num_bond_features()

This function returns the number of bond features that are available by this module.

Returns

n_features : int: Length of the bond feature vector.

chemml.chem.tensorise_molecules(molecules, max_degree=5, max_atoms=None, n_jobs=-1, batch_size=3000, verbose=True)

Takes a list of molecules and provides tensor representation of atom and bond features. This representation is based on the “convolutional networks on graphs for learning molecular fingerprints” by David Duvenaud et al., NIPS 2015.

Parameters

molecules : chemml.chem.Molecule object or array: If a list, it must be a list of chemml.chem.Molecule objects; otherwise a ValueError is raised. In addition, all the molecule objects must provide the SMILES representation. We try to create the SMILES representation if it’s not available.
max_degree : int, optional (default=5): The maximum number of neighbours per atom that each molecule can have (to which all molecules will be padded); use ‘None’ for auto.
max_atoms : int, optional (default=None): The maximum number of atoms per molecule (to which all molecules will be padded); use ‘None’ for auto.
n_jobs : int, optional (default=-1): The number of parallel processes. If -1, all available processes are used.
batch_size : int, optional (default=3000): The number of molecules per process. A bigger chunk size is preferred, as each process will preallocate np.arrays.
verbose : bool, optional (default=True): The verbosity of messages.

Notes

It is not recommended to set max_degree to None/auto when using NeuralGraph layers. max_degree determines the number of trainable parameters and is essentially a hyperparameter. While models can be rebuilt using different max_atoms, they cannot be rebuilt for different values of max_degree, as the architecture will be different.

For organic molecules max_degree=5 is a good value (Duvenaud et al., 2015).

Returns

atoms : array
An atom feature array of shape (molecules, max_atoms, atom_features)

bonds : array
A bonds array of shape (molecules, max_atoms, max_degree)

edges : array
A connectivity array of shape (molecules, max_atoms, max_degree, bond_features)
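The padding to max_atoms and max_degree can be sketched for a single molecule as follows. The helper `pad_neighbours` is hypothetical, and the use of -1 as the filler for missing atoms and neighbours is an assumption. The sketch only shows how molecules of different sizes end up in arrays of one fixed shape.

```python
def pad_neighbours(adjacency, max_atoms, max_degree, fill=-1):
    """Pad per-atom neighbour lists to a fixed (max_atoms, max_degree) grid."""
    if len(adjacency) > max_atoms:
        raise ValueError('molecule has more atoms than max_atoms')
    grid = [[fill] * max_degree for _ in range(max_atoms)]
    for atom, neighbours in enumerate(adjacency):
        if len(neighbours) > max_degree:
            raise ValueError('atom has more neighbours than max_degree')
        grid[atom][:len(neighbours)] = neighbours
    return grid


# Ethanol heavy atoms, C-C-O: atom 1 is bonded to atoms 0 and 2.
ethanol = [[1], [0, 2], [1]]
grid = pad_neighbours(ethanol, max_atoms=4, max_degree=5)
```

Every molecule then contributes a grid of the same size, so a batch can be stacked along a leading molecules axis. This is also why max_degree fixes part of the model architecture while max_atoms only affects padding.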