Chem module
- The chemml.chem module includes:
Molecule:
Molecule()
XYZ:
XYZ()
CoulombMatrix:
CoulombMatrix()
BagofBonds:
BagofBonds()
RDKitFingerprint:
RDKitFingerprint()
atom_features:
atom_features()
bond_features:
bond_features()
tensorise_molecules:
tensorise_molecules()
Dragon:
Dragon()
- class chemml.chem.BagofBonds(const=1.0, n_jobs=-1, verbose=True)
The implementation of the bag of bonds version of the Coulomb matrix by Katja Hansen et al., 2015, JPCL.
- const : float, optional (default = 1.0)
The constant used to convert the coordinate units in the term const/|Ri-Rj|, whose denominator is the Euclidean distance between atoms i and j. If const=1.0, the result is in atomic units; if const=0.529, it is in Angstrom.
- n_jobs : int, optional (default = -1)
The number of parallel processes. If -1, uses all the available processes.
- verbose : bool, optional (default = True)
The verbosity of messages.
- header_ : list of headers for the bag of bonds data frame
Each header contains one nuclear charge (representing a single atom) or a tuple of two nuclear charges (representing a bond).
>>> from chemml.datasets import load_xyz_polarizability
>>> from chemml.chem import BagofBonds
>>> coordinates, y = load_xyz_polarizability()
>>> bob = BagofBonds(const=1.0)
>>> features = bob.represent(coordinates)
- concat_mol_features(bbs_info)
This function concatenates a list of molecule features from a parallel run.
- bbs_info : list or tuple
The list or tuple of features and keys.
- features : data frame
A single dataframe of all features.
- represent(molecules)
Provides the bag of bonds representation for input molecules.
- molecules : chemml.chem.Molecule object or array
If list, it must be a list of chemml.chem.Molecule objects, otherwise we raise a ValueError. In addition, all the molecule objects must provide the XYZ information. Please make sure the XYZ geometry has been stored or optimized in advance.
- features : pandas data frame, shape: (n_molecules, max_length_of_combinations)
The bag of bonds features.
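The grouping idea behind the bag of bonds representation can be sketched in plain Python. This is an illustrative sketch only, not the ChemML implementation: the function name bag_of_bonds, the water coordinates, and the way const multiplies the pair term are assumptions made for the example. The single-atom term 0.5 * Z**2.4 and the pair term Z_i * Z_j / |Ri-Rj| follow the Coulomb matrix definition.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def bag_of_bonds(atomic_numbers, coordinates, const=1.0):
    """Group the Coulomb terms of one molecule into bags keyed by nuclear charges.

    Hypothetical helper for illustration; not part of chemml.
    """
    bags = defaultdict(list)
    # Single-atom bags hold the Coulomb matrix diagonal term 0.5 * Z**2.4.
    for z in atomic_numbers:
        bags[(z,)].append(0.5 * z ** 2.4)
    # Pair bags hold Z_i * Z_j * const / |R_i - R_j| for every atom pair.
    # (Multiplying by const is an assumption based on the const/|Ri-Rj| note.)
    atoms = list(zip(atomic_numbers, coordinates))
    for (zi, ri), (zj, rj) in combinations(atoms, 2):
        key = tuple(sorted((zi, zj)))
        bags[key].append(zi * zj * const / dist(ri, rj))
    # Sorting each bag makes the result independent of the atom ordering.
    return {key: sorted(values, reverse=True) for key, values in bags.items()}


# Water, with made-up coordinates in atomic units.
numbers = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (1.43, 1.11, 0.0), (-1.43, 1.11, 0.0)]
bags = bag_of_bonds(numbers, coords)
```

In this sketch the keys play the role of the header_ entries: a one-element tuple for a single atom and a two-element tuple for a bond. A full feature vector would concatenate the bags in a fixed key order and zero-pad each bag to the longest bag of that type across all molecules.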
- class chemml.chem.CoulombMatrix(cm_type='SC', max_n_atoms='auto', nPerm=3, const=1, n_jobs=-1, verbose=True)
The implementation of Coulomb matrix descriptors by Matthias Rupp et al., 2012, PRL (all 3 different variations).
- cm_type : str, optional (default = ‘SC’)
- The coulomb matrix type, one of the following types:
‘Unsorted_Matrix’ or ‘UM’
‘Unsorted_Triangular’ or ‘UT’
‘Eigenspectrum’ or ‘E’
‘Sorted_Coulomb’ or ‘SC’
‘Random_Coulomb’ or ‘RC’
- max_n_atoms : int or ‘auto’, optional (default = ‘auto’)
Set the maximum number of atoms per molecule (to which all representations will be padded). If ‘auto’, we find it based on all input molecules.
- nPerm : int, optional (default = 3)
Number of permutations of the Coulomb matrix per molecule for the Random_Coulomb (RC) type of representation.
- const : float, optional (default = 1)
The constant used to convert the coordinate units in the term const/|Ri-Rj|, whose denominator is the Euclidean distance between atoms i and j. Example: atomic unit -> const=1, Angstrom -> const=0.529.
- n_jobs : int, optional (default = -1)
The number of parallel processes. If -1, uses all the available processes.
- verbose : bool, optional (default = True)
The verbosity of messages.
- n_molecules_ : int
Total number of molecules.
- max_n_atoms_ : int
Maximum number of atoms in all molecules.
>>> from chemml.chem import CoulombMatrix, Molecule
>>> m1 = Molecule('c1ccc1', 'smiles')
>>> m2 = Molecule('CNC', 'smiles')
>>> m3 = Molecule('CC', 'smiles')
>>> m4 = Molecule('CCC', 'smiles')
>>> molecules = [m1, m2, m3, m4]
>>> for mol in molecules:
...     mol.to_xyz(optimizer='UFF')
>>> cm = CoulombMatrix(cm_type='SC', n_jobs=-1)
>>> features = cm.represent(molecules)
- static concat_dataframes(mol_tensors_list)
Concatenates a list of molecule tensors.
- mol_tensors_list: list
List of molecule tensors.
- features : dataframe
A single feature dataframe.
- represent(molecules)
Provides the Coulomb matrix representation for input molecules.
- molecules : chemml.chem.Molecule object or array
If list, it must be a list of chemml.chem.Molecule objects, otherwise we raise a ValueError. In addition, all the molecule objects must provide the XYZ information. Please make sure the XYZ geometry has been stored or optimized in advance.
- features : Pandas DataFrame
A data frame with the same number of rows as the number of molecules will be returned. The exact shape of the dataframe depends on the type of CM as follows:
shape of Unsorted_Matrix (UM): (n_molecules, max_n_atoms**2)
shape of Unsorted_Triangular (UT): (n_molecules, max_n_atoms*(max_n_atoms+1)/2)
shape of Eigenspectrum (E): (n_molecules, max_n_atoms)
shape of Sorted_Coulomb (SC): (n_molecules, max_n_atoms*(max_n_atoms+1)/2)
shape of Random_Coulomb (RC): (n_molecules, nPerm * max_n_atoms * (max_n_atoms+1)/2)
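To make the shapes above concrete, here is a minimal pure-Python sketch of a zero-padded Coulomb matrix and its sorted, triangular flattening. It is not the ChemML code: the helper names coulomb_matrix and sorted_triangular, the water coordinates, and the way const multiplies the off-diagonal term are assumptions for illustration. The matrix elements follow the Rupp et al. definition (0.5 * Z**2.4 on the diagonal, Z_i * Z_j / |Ri-Rj| off the diagonal).

```python
from math import dist, sqrt


def coulomb_matrix(atomic_numbers, coordinates, max_n_atoms, const=1.0):
    """Return a zero-padded (max_n_atoms x max_n_atoms) Coulomb matrix as nested lists."""
    n = len(atomic_numbers)
    cm = [[0.0] * max_n_atoms for _ in range(max_n_atoms)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cm[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                r = dist(coordinates[i], coordinates[j])
                # Assumption: const rescales the distance term as const/|Ri-Rj|.
                cm[i][j] = atomic_numbers[i] * atomic_numbers[j] * const / r
    return cm


def sorted_triangular(cm):
    """Order rows/columns by descending row norm, then flatten the lower triangle."""
    size = len(cm)
    norms = [sqrt(sum(v * v for v in row)) for row in cm]
    order = sorted(range(size), key=lambda k: norms[k], reverse=True)
    return [cm[order[i]][order[j]] for i in range(size) for j in range(i + 1)]


# Water padded to four atoms, with made-up coordinates in atomic units.
numbers = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (1.43, 1.11, 0.0), (-1.43, 1.11, 0.0)]
max_n_atoms = 4
features = sorted_triangular(coulomb_matrix(numbers, coords, max_n_atoms))
```

The flattened vector has max_n_atoms*(max_n_atoms+1)/2 entries, which matches the per-molecule length listed for the SC type. The padded row has zero norm, so it is always sorted to the end.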
- class chemml.chem.Dragon(CheckUpdates=True, SaveLayout=True, PreserveTemporaryProjects=True, ShowWorksheet=False, Decimal_Separator='.', Missing_String='NaN', DefaultMolFormat='1', HelpBrowser='/usr/bin/xdg-open', RejectUnusualValence=False, Add2DHydrogens=False, MaxSRforAllCircuit='19', MaxSR='35', MaxSRDetour='30', MaxAtomWalkPath='2000', LogPathWalk=True, LogEdge=True, Weights=('Mass', 'VdWVolume', 'Electronegativity', 'Polarizability', 'Ionization', 'I-State'), SaveOnlyData=False, SaveLabelsOnSeparateFile=False, SaveFormatBlock='%b-%n.txt', SaveFormatSubBlock='%b-%s-%n-%m.txt', SaveExcludeMisVal=False, SaveExcludeAllMisVal=False, SaveExcludeConst=False, SaveExcludeNearConst=False, SaveExcludeStdDev=False, SaveStdDevThreshold='0.0001', SaveExcludeCorrelated=False, SaveCorrThreshold='0.95', SaveExclusionOptionsToVariables=False, SaveExcludeMisMolecules=False, SaveExcludeRejectedMolecules=False, blocks=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], SaveStdOut=False, SaveProject=False, SaveProjectFile='Dragon_project.drp', SaveFile=True, SaveType='singlefile', SaveFilePath='Dragon_descriptors.txt', logMode='file', logFile='Dragon_log.txt', external=False, fileName=None, delimiter=',', consecutiveDelimiter=False, MissingValue='NaN', RejectDisconnectedStrucuture=False, RetainBiggestFragment=False, DisconnectedCalculationOption='0', RoundCoordinates=True, RoundWeights=True, RoundDescriptorValues=True, knimemode=False)
An interface to Dragon 6 and 7 chemoinformatics software. Dragon is a commercial software and you should provide
- version : int, optional (default = 7)
The version of Dragon available on the user’s system (available versions: 6 or 7).
- Weights : list, optional (default = [“Mass”, “VdWVolume”, “Electronegativity”, “Polarizability”, “Ionization”, “I-State”])
A list of weights to be used.
- blocks : list, optional (default = list(range(1,31)))
A list of integers as descriptor block IDs. There are 29 and 30 blocks available in versions 6 and 7, respectively. This module is not designed to cherry-pick descriptors within each block. To do so, please use the Script Wizard in the Dragon GUI.
- external : boolean, optional (default = False)
If True, include external variables at the end of each saved file.
- The documentation for the rest of the parameters can be found in the following links:
>>> import pandas as pd
>>> from chemml.chem import Dragon
>>> drg = Dragon()
>>> df = drg.represent(mol_list, output_directory='./', dropna=False)
- represent(mol_list, output_directory='./', dropna=True)
- mol_list: list
list of chemml.chem.Molecule objects
- output_directory: str
output directory to save dragon scripts
- dropna: bool
Drops all columns with any NaN value.
- class chemml.chem.Molecule(input_mol, input_type, engine='rdkit', **kwargs)
The central class to construct a molecule from different chemical input formats. This module is built on top of the RDKit and OpenBabel Python APIs. We combine the strengths of these two cheminformatics libraries for a consistent user experience.
Almost all the molecular descriptors and molecule-based ML models require the chemical information as a Molecule object. Several methods are available in this module to facilitate the manipulation of chemical data.
- input : str
The representation string or path to a file.
- input_type : str
The input type. The available types are listed here:
smiles: The input must be SMILES representation of a molecule.
smarts: The input must be SMARTS representation of a molecule.
inchi: The input must be InChi representation of a molecule.
xyz: The input must be the path to an xyz file.
mol2: The input must be the path to a mol2 file.
- kwargs :
- The corresponding RDKit arguments for each of the input types:
smiles: http://rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.MolFromSmiles
smarts: http://rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.MolFromSmarts
inchi: http://rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolFromInchi
The molecule will be created as an RDKit molecule object.
The molecule object will be stored and available as the rdkit_molecule attribute.
- If you load a molecule from its SMARTS string, there is a high probability that you can’t convert it to other types due to the abstract description of the molecules by SMARTS representation.
- rdkit_molecule : object
The rdkit.Chem.rdchem.Mol object
- smiles : str
The SMILES string that you get by running the to_smiles method.
- smarts : str
The SMARTS string that you get by running the to_smarts method.
- inchi : str
The InChi string that you get by running the to_inchi method.
- xyz : instance of <class ‘chemml.chem.molecule.XYZ’>
The class object that stores the 3D info. The available attributes in the class are ‘geometry’, ‘atomic_numbers’, and ‘atomic_symbols’.
>>> from chemml.chem import Molecule
>>> caffeine_smiles = 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'
>>> caffeine_smarts = '[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]'
>>> caffeine_inchi = 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3'
>>> mol = Molecule(caffeine_smiles, 'smiles')
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060a8a0>, creator : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'), smiles : 'Cn1c(=O)c2c(ncn2C)n(C)c1=O', smarts : None, inchi : None, xyz : None)>
>>> mol.smiles # this is the canonical SMILES by default
'Cn1c(=O)c2c(ncn2C)n(C)c1=O'
>>> mol.creator
('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C')
>>> mol.to_smiles(kekuleSmiles=True)
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060a8a0>, creator : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'), smiles : 'CN1C(=O)C2=C(N=CN2C)N(C)C1=O', smarts : None, inchi : None, xyz : None)>
>>> mol.smiles # the kekule SMILES is not canonical
'CN1C(=O)C2=C(N=CN2C)N(C)C1=O'
>>> mol.inchi is None
True
>>> mol.to_inchi()
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060a8a0>, creator : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'), smiles : 'CN1C(=O)C2=C(N=CN2C)N(C)C1=O', smarts : None, inchi : 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3', xyz : None)>
>>> mol.inchi
'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3'
>>> mol.to_smarts(isomericSmiles=False)
>>> mol.smarts
'[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]'
>>>
>>> # add hydrogens and recreate smiles and inchi
>>> mol.hydrogens('add')
>>> mol.to_smiles()
>>> mol.to_inchi()
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060ab70>, creator : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'), smiles : '[H]c1nc2c(c(=O)n(C([H])([H])[H])c(=O)n2C([H])([H])[H])n1C([H])([H])[H]', smarts : '[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]', inchi : 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3', xyz : None)>
>>> # note that by addition of hydrogens, the smiles string changed, but not inchi
>>> mol.to_xyz()
ValueError: The conformation has not been built yet. Maybe due to the 2D representation of the creator. You should set the optimizer value if you wish to embed and optimize the 3D geometries.
>>> mol.to_xyz('MMFF', maxIters=300, mmffVariant='MMFF94s')
>>> mol
<Molecule(rdkit_molecule : <rdkit.Chem.rdchem.Mol object at 0x11060ab70>, creator : ('SMILES', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'), smiles : '[H]c1nc2c(c(=O)n(C([H])([H])[H])c(=O)n2C([H])([H])[H])n1C([H])([H])[H]', smarts : '[#6]-[#7]1:[#6]:[#7]:[#6]2:[#6]:1:[#6](=[#8]):[#7](:[#6](=[#8]):[#7]:2-[#6])-[#6]', inchi : 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3', xyz : <chemml.chem.molecule.XYZ object at 0x1105f34e0>)>
>>> mol.xyz
<chemml.chem.molecule.XYZ object at 0x1105f34e0>
>>> mol.xyz.geometry
array([[-3.13498321,  1.08078307,  0.33372515],
       [-2.15703638,  0.0494926 ,  0.0969075 ],
       [-2.41850776, -1.27323453, -0.14538583],
       [-1.30933543, -1.96359887, -0.32234393],
       [-0.31298208, -1.04463051, -0.1870366 ],
       [-0.80033326,  0.20055608,  0.07079013],
       [ 0.02979071,  1.33923464,  0.25484381],
       [-0.41969338,  2.45755985,  0.48688003],
       [ 1.39083332,  1.04039479,  0.14253256],
       [ 1.93405927, -0.23430839, -0.12320318],
       [ 3.15509616, -0.39892779, -0.20417755],
       [ 1.03516373, -1.28860035, -0.28833489],
       [ 1.51247526, -2.63123373, -0.56438513],
       [ 2.34337825,  2.12037198,  0.31039958],
       [-3.03038469,  1.84426033, -0.44113804],
       [-2.95880047,  1.50667225,  1.32459816],
       [-4.13807215,  0.64857048,  0.29175711],
       [-3.4224011 , -1.6776074 , -0.18199971],
       [ 2.60349515, -2.67375289, -0.61674561],
       [ 1.10339164, -2.96582894, -1.52299068],
       [ 1.17455601, -3.30144289,  0.23239402],
       [ 2.94406381,  2.20916763, -0.60086251],
       [ 1.86156872,  3.07985237,  0.51337935],
       [ 3.01465788,  1.87625024,  1.14039627]])
>>> mol.xyz.atomic_numbers
array([[6], [7], [6], [7], [6], [6], [6], [8], [7], [6], [8], [7], [6], [6], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1]])
>>> mol.xyz.atomic_symbols
array([['C'], ['N'], ['C'], ['N'], ['C'], ['C'], ['C'], ['O'], ['N'], ['C'], ['O'], ['N'], ['C'], ['C'], ['H'], ['H'], ['H'], ['H'], ['H'], ['H'], ['H'], ['H'], ['H'], ['H']], dtype='<U1')
- hydrogens(action='add', **kwargs)
This function adds/removes hydrogens to/from a prebuilt molecule object.
- action : str
Either ‘add’ or ‘remove’, to add hydrogens or remove them from the rdkit molecule.
- kwargs :
The arguments that can be passed to the rdkit functions:
Chem.AddHs: documentation at http://rdkit.org/docs/source/rdkit.Chem.rdmolops.html?highlight=addhs#rdkit.Chem.rdmolops.AddHs
Chem.RemoveHs: documentation at http://rdkit.org/docs/source/rdkit.Chem.rdmolops.html?highlight=addhs#rdkit.Chem.rdmolops.RemoveHs
The rdkit or pybel molecule object must be created in advance.
Only rdkit or pybel molecule object will be modified in place.
If you remove hydrogens from molecules, the atomic 3D coordinates might not be accurate for the conversion to xyz representation.
- to_inchi(**kwargs)
This function creates and stores the InChi string for a pre-built molecule.
- kwargs :
The arguments that can be passed to the rdkit.Chem.MolToInchi function (will be used only if rdkit molecule is available). The documentation is available at: http://rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolToInchi
The rdkit or pybel molecule object must be created in advance.
The molecule will be modified in place.
- to_mol2(filename=None)
This function creates and stores the xyz coordinates for a pre-built molecule object.
- optimizer : None or str, optional (default: None)
The geometries will be extracted from the available source of 3D structure (if any). For openbabel:
[‘uff’, ‘mmff94’, ‘mmff94s’, ‘ghemical’]
- For rdkit:
Otherwise, any of the ‘UFF’ or ‘MMFF’ force fields should be passed to embed and optimize geometries using ‘rdkit.Chem.AllChem.UFFOptimizeMolecule’ or ‘rdkit.Chem.AllChem.MMFFOptimizeMolecule’ methods, respectively.
- kwargs :
The arguments that can be passed to the corresponding force fields. The documentation is available at:
UFFOptimizeMolecule: http://rdkit.org/docs/source/rdkit.Chem.rdForceFieldHelpers.html?highlight=mmff#rdkit.Chem.rdForceFieldHelpers.UFFOptimizeMolecule
MMFFOptimizeMolecule: http://rdkit.org/docs/source/rdkit.Chem.rdForceFieldHelpers.html?highlight=mmff#rdkit.Chem.rdForceFieldHelpers.MMFFOptimizeMolecule
The geometry will be stored in the xyz attribute.
The molecule object must be created in advance.
The hydrogens won’t be added to the molecule automatically. You should add them manually using the hydrogens method.
If the molecule object has been built using 2D representations (e.g., SMILES or InChi), the conformer doesn’t exist and you need to set the optimizer parameter to any of the force fields.
If the 3D info exists but you still need to run optimization, the 3D structure will be embedded from scratch (i.e., the current atom coordinates will be removed).
- to_smarts(**kwargs)
This function creates and stores the SMARTS string for a pre-built molecule.
- kwargs :
All the arguments that can be passed to the rdkit.Chem.MolToSmarts function. The documentation is available at: http://rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.MolToSmarts
The rdkit or pybel molecule object must be created in advance.
If only pybel molecule is available, we create an rdkit molecule using its SMILES representation, and then create the SMARTS string using rdkit arguments.
The molecule will be modified in place.
- to_smiles(**kwargs)
This function creates and stores the SMILES string for a pre-built molecule.
- kwargs :
The arguments for the rdkit.Chem.MolToSmiles function. The documentation is available at: http://rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.MolToSmiles
The rdkit or pybel molecule object must be created in advance.
If only pybel molecule is available, we create an rdkit molecule using its SMILES representation, and then recreate the SMILES string using rdkit arguments.
The molecule will be modified in place.
For an rdkit molecule the SMILES string is canonical by default, unless kekuleSmiles is requested.
- to_xyz(optimizer=None, steps=500, filename=None, **kwargs)
This function creates and stores the xyz coordinates for a pre-built molecule object.
- optimizer : None or str, optional (default: None)
The geometries will be extracted from the available source of 3D structure (if any). For openbabel:
[‘uff’, ‘mmff94’, ‘mmff94s’, ‘ghemical’]
- For rdkit:
Otherwise, any of the ‘UFF’ or ‘MMFF’ force fields should be passed to embed and optimize geometries using ‘rdkit.Chem.AllChem.UFFOptimizeMolecule’ or ‘rdkit.Chem.AllChem.MMFFOptimizeMolecule’ methods, respectively.
- kwargs :
The arguments that can be passed to the corresponding force fields. The documentation is available at:
UFFOptimizeMolecule: http://rdkit.org/docs/source/rdkit.Chem.rdForceFieldHelpers.html?highlight=mmff#rdkit.Chem.rdForceFieldHelpers.UFFOptimizeMolecule
MMFFOptimizeMolecule: http://rdkit.org/docs/source/rdkit.Chem.rdForceFieldHelpers.html?highlight=mmff#rdkit.Chem.rdForceFieldHelpers.MMFFOptimizeMolecule
The geometry will be stored in the xyz attribute.
The molecule object must be created in advance.
The hydrogens won’t be added to the molecule automatically. You should add them manually using the hydrogens method.
If the molecule object has been built using 2D representations (e.g., SMILES or InChi), the conformer doesn’t exist and you need to set the optimizer parameter to any of the force fields.
If the 3D info exists but you still need to run optimization, the 3D structure will be embedded from scratch (i.e., the current atom coordinates will be removed).
- visualize(filename=None, **kwargs)
This function visualizes the molecule. If both rdkit and pybel objects are available, the rdkit object will be used for visualization.
- filename: str, optional (default = None)
The path to the file in which you want to write the image. Tkinter and Python Imaging Library are required for writing the image.
- kwargs:
Any extra parameter that you want to pass to the rdkit or pybel draw tool. Additional information at:
- fig : object
You will be able to display this object, e.g., inside the Jupyter Notebook.
- class chemml.chem.RDKitFingerprint(fingerprint_type='Morgan', vector='bit', n_bits=1024, radius=2, **kwargs)
This is an interface to the available molecular fingerprints in the RDKit package.
- fingerprint_type : str, optional (default = ‘Morgan’)
- The type of fingerprint. Available fingerprint types:
‘hashed_atom_pair’ or ‘hap’
‘MACCS’ or ‘maccs’
‘morgan’
‘hashed_topological_torsion’ or ‘htt’
‘topological_torsion’ or ‘tt’
- vector : str, optional (default = ‘bit’)
- Available options for vector:
- ‘int’ : represents counts for each fragment instead of bits
It is not available for ‘MACCS’.
- ‘bit’ : only zeros and ones
It is not available for ‘Topological_torsion’.
- n_bits : int, optional (default = 1024)
It sets the number of elements/bits in the ‘bit’ type of fingerprint vectors. Not available for:
‘MACCS’ - (MACCS keys have a fixed length of 167 bits)
‘Topological_torsion’ - doesn’t return a bit vector at all.
- radius : int, optional (default = 2)
Only applicable if calculating the ‘Morgan’ fingerprint.
- kwargs :
Any additional argument that should be passed to the rdkit fingerprint function.
- n_molecules_ : int
The number of molecules that are received.
- fps_ : list
The list of rdkit fingerprint objects.
- load_sparse(file)
This function enables you to load a sparse matrix in the .npz format and convert it to a pandas dataframe.
- file : str
Must be a path to the file with .npz format.
- features : pandas.DataFrame
The dense dataframe of the passed sparse file.
- represent(molecules)
The main function to provide the fingerprint representation of input molecule(s).
- molecules : chemml.chem.Molecule object or list
It must be an instance of chemml.chem.Molecule object or a list of those objects, otherwise a ValueError will be raised. If the SMILES representation of the molecule (or rdkit molecule object) is not available, we convert the molecule to SMILES automatically. However, the automatic conversion may ignore your manual settings, for example removed hydrogens, kekulized, or canonical SMILES.
- features : pandas.DataFrame
A 2-dimensional pandas dataframe of fingerprint features with the same number of rows as the number of molecules.
- store_sparse(file, features)
This function helps you to store highly sparse fingerprint feature sets using the .npz format for memory efficiency and shorter store/load times. Another method of this class, load_sparse, enables you to load your .npz files and convert them back to a pandas dataframe.
- file : str
Must be a path to the file with .npz format.
- features : pandas DataFrame
Must be the pandas dataframe as you receive it from the represent method.
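The difference between the ‘bit’ and ‘int’ vector options, and the role of n_bits, can be illustrated with a small folding sketch. This is a conceptual illustration only and does not reproduce how RDKit hashes fragments: the function fold_features and the integer feature identifiers are hypothetical stand-ins for fragment hashes.

```python
def fold_features(feature_ids, n_bits=1024, vector="bit"):
    """Fold integer feature identifiers into a fixed-length fingerprint vector.

    Hypothetical helper for illustration; RDKit uses its own hashing scheme.
    """
    if vector not in ("bit", "int"):
        raise ValueError("vector must be 'bit' or 'int'")
    fingerprint = [0] * n_bits
    for feature_id in feature_ids:
        position = feature_id % n_bits  # fold the identifier into the vector length
        if vector == "int":
            fingerprint[position] += 1  # count every occurrence
        else:
            fingerprint[position] = 1   # only record presence
    return fingerprint


# Made-up fragment identifiers; 5 and 1029 collide when n_bits=1024.
feature_ids = [5, 5, 1029, 77]
bit_fp = fold_features(feature_ids, n_bits=1024, vector="bit")
int_fp = fold_features(feature_ids, n_bits=1024, vector="int")
```

Both vectors have n_bits elements. The ‘int’ variant keeps the number of times each position was hit, while the ‘bit’ variant only records whether it was hit at all. Such vectors are mostly zeros, which is why storing them in a sparse format saves space.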
- class chemml.chem.XYZ(geometry, atomic_numbers, atomic_symbols)
This class stores the information that is typically carried by standard XYZ files.
- geometry : ndarray
The numpy array of shape (number_of_atoms, 3). It stores the xyz coordinates for each atom of the molecule.
- atomic_numbers : ndarray
The numpy array of shape (number_of_atoms, 1). It stores the atomic numbers of each atom in the molecule (in the same order as geometry).
- atomic_symbols : ndarray
The numpy array of shape (number_of_atoms, 1). It stores the atomic symbols of each atom in the molecule (in the same order as geometry).
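As a rough illustration of how these three attributes map onto a standard XYZ file (atom count, comment line, then one "symbol x y z" line per atom), consider the sketch below. The XYZData class and the to_xyz_text function are hypothetical stand-ins written for this example, with nested lists in place of numpy arrays; they are not part of chemml, and the water coordinates are made up.

```python
from dataclasses import dataclass


@dataclass
class XYZData:
    """Hypothetical stand-in that mirrors the three documented attributes."""
    geometry: list        # shape (number_of_atoms, 3)
    atomic_numbers: list  # shape (number_of_atoms, 1)
    atomic_symbols: list  # shape (number_of_atoms, 1)


def to_xyz_text(xyz, comment=""):
    """Format the stored data as the text of a standard XYZ file."""
    lines = [str(len(xyz.geometry)), comment]
    # Each symbol is wrapped in a one-element row, matching the (n, 1) shape.
    for (symbol,), (x, y, z) in zip(xyz.atomic_symbols, xyz.geometry):
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


water = XYZData(
    geometry=[[0.0, 0.0, 0.117], [0.0, 0.757, -0.469], [0.0, -0.757, -0.469]],
    atomic_numbers=[[8], [1], [1]],
    atomic_symbols=[["O"], ["H"], ["H"]],
)
text = to_xyz_text(water, comment="water")
```

The one-element unpacking of each symbol reflects the (number_of_atoms, 1) shape described above; with a flat list of symbols the loop would need to change accordingly.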
- chemml.chem.atom_features(atom)
This function encodes the RDKit atom to a binary vector.
- atom : rdkit.Chem.rdchem.Atom
The atom must be an RDKit Atom object.
- features : array
A binary array that encodes the features of the atom. Its length is given by num_atom_features().
- chemml.chem.bond_features(bond)
This function encodes the RDKit bond to a binary vector.
- bond : rdkit.Chem.rdchem.Bond
The bond must be an RDKit Bond object.
- features : array
A binary array of length 6 that specifies the type of bond (single, double, triple, or aromatic), whether it is a conjugated bond, and whether it belongs to a molecular ring.
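The six-element encoding described above can be sketched without RDKit as follows. This is an illustrative stand-in, not the chemml function: encode_bond is a hypothetical helper that takes the bond type as a string plus two booleans, and the order of the six positions is an assumption based on the order in which the description lists them.

```python
BOND_TYPES = ("SINGLE", "DOUBLE", "TRIPLE", "AROMATIC")


def encode_bond(bond_type, is_conjugated, in_ring):
    """Encode a bond as a binary vector of length 6.

    Positions 0-3 one-hot encode the bond type, position 4 flags conjugation,
    and position 5 flags ring membership (assumed ordering).
    """
    if bond_type not in BOND_TYPES:
        raise ValueError(f"unsupported bond type: {bond_type}")
    one_hot = [int(bond_type == name) for name in BOND_TYPES]
    return one_hot + [int(is_conjugated), int(in_ring)]


benzene_bond = encode_bond("AROMATIC", is_conjugated=True, in_ring=True)
ethane_bond = encode_bond("SINGLE", is_conjugated=False, in_ring=False)
```

With an actual RDKit Bond object, the three inputs would typically come from bond.GetBondType(), bond.GetIsConjugated(), and bond.IsInRing().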
- chemml.chem.num_atom_features()
This function returns the number of atomic features that are available by this module.
- n_features : int
Length of the atomic feature vector.
- chemml.chem.num_bond_features()
This function returns the number of bond features that are available by this module.
- n_features : int
Length of the bond feature vector.
- chemml.chem.tensorise_molecules(molecules, max_degree=5, max_atoms=None, n_jobs=-1, batch_size=3000, verbose=True)
Takes a list of molecules and provides a tensor representation of atom and bond features. This representation is based on “Convolutional Networks on Graphs for Learning Molecular Fingerprints” by David Duvenaud et al., NIPS 2015.
- molecules : chemml.chem.Molecule object or array
If list, it must be a list of chemml.chem.Molecule objects, otherwise we raise a ValueError. In addition, all the molecule objects must provide the SMILES representation. We try to create the SMILES representation if it’s not available.
- max_degree : int, optional (default = 5)
The maximum number of neighbours per atom that each molecule can have (to which all molecules will be padded). Use ‘None’ for auto.
- max_atoms : int, optional (default = None)
The maximum number of atoms per molecule (to which all molecules will be padded). Use ‘None’ for auto.
- n_jobs : int, optional (default = -1)
The number of parallel processes. If -1, uses all the available processes.
- batch_size : int, optional (default = 3000)
The number of molecules per process. A bigger chunk size is preferred, as each process will preallocate np.arrays.
- verbose : bool, optional (default = True)
The verbosity of messages.
It is not recommended to set max_degree to None/auto when using NeuralGraph layers. max_degree determines the number of trainable parameters and is essentially a hyperparameter. While models can be rebuilt using different max_atoms, they cannot be rebuilt for different values of max_degree, as the architecture will be different.
For organic molecules max_degree=5 is a good value (Duvenaud et al., 2015).
- atoms : array
An atom feature array of shape (molecules, max_atoms, atom_features)
- bonds : array
A bonds array of shape (molecules, max_atoms, max_degree)
- edges : array
A connectivity array of shape (molecules, max_atoms, max_degree, bond_features)
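The padding to max_atoms and max_degree can be pictured with a small pure-Python sketch that builds a neighbour-index tensor of shape (molecules, max_atoms, max_degree). It is not the chemml implementation: the function pad_neighbours, the use of -1 as the fill value, and the hand-written neighbour lists are all assumptions made for illustration.

```python
def pad_neighbours(neighbour_lists, max_atoms=None, max_degree=None, fill=-1):
    """Pad per-atom neighbour index lists to (molecules, max_atoms, max_degree).

    neighbour_lists holds, for each molecule, one list of neighbour indices per atom.
    Passing None for max_atoms or max_degree infers the value from the data ("auto").
    """
    if max_atoms is None:
        max_atoms = max(len(mol) for mol in neighbour_lists)
    if max_degree is None:
        max_degree = max(len(nb) for mol in neighbour_lists for nb in mol)
    tensor = []
    for mol in neighbour_lists:
        if len(mol) > max_atoms:
            raise ValueError("molecule has more atoms than max_atoms")
        rows = []
        for neighbours in mol:
            if len(neighbours) > max_degree:
                raise ValueError("atom has more neighbours than max_degree")
            rows.append(list(neighbours) + [fill] * (max_degree - len(neighbours)))
        # Missing atoms are represented by rows made entirely of the fill value.
        rows.extend([[fill] * max_degree for _ in range(max_atoms - len(mol))])
        tensor.append(rows)
    return tensor


# Heavy-atom connectivity written by hand: ethane (C-C) and propane (C-C-C).
ethane = [[1], [0]]
propane = [[1], [0, 2], [1]]
padded = pad_neighbours([ethane, propane], max_degree=5)
```

Fixing max_degree=5 while leaving max_atoms on auto mirrors the recommendation above: the degree dimension stays constant across datasets, and only the atom dimension adapts to the input molecules.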