Initialization module

The chemml.initialization module includes the following classes:
class chemml.initialization.SaveFile(filename, output_directory=None, record_time=False, format='csv', index=False, header=True)

Write a DataFrame to a comma-separated values (csv) file.

output_directory: string

The output directory in which to save output files.

fit(X, main_directory='.')

Write a DataFrame to a comma-separated values (csv) file.

X: pandas DataFrame

main_directory: string

The main directory of the entire chemml project, if there is one.
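The behaviour described above can be approximated with plain pandas. The following is a minimal sketch, not ChemML's implementation: the helper name `save_dataframe` is hypothetical, and resolving `output_directory` relative to `main_directory` is an assumption about how the two paths combine.

```python
import os
import tempfile

import pandas as pd


def save_dataframe(X, filename, output_directory=None, main_directory=".",
                   index=False, header=True):
    """Hypothetical helper: write X as a csv file and return the file path."""
    # Assumption: the output directory is resolved relative to the main directory.
    target_dir = main_directory
    if output_directory is not None:
        target_dir = os.path.join(main_directory, output_directory)
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, filename + ".csv")
    X.to_csv(path, index=index, header=header)
    return path


df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_dataframe(df, "features", output_directory="out",
                                main_directory=tmp)
    reloaded = pd.read_csv(saved_path)
```

Reading the file back is only a quick check that the header row and the data survive the round trip.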

class chemml.initialization.Split(selection=1)

Split a dataframe by columns.

selection: int or list, optional (default = 1)

If an integer, it is the number of columns to select from the left side of the data; these are returned as the first dataframe (X1). If a list, it must contain the exact names of the columns to select and return as the first dataframe (X1). In both cases, the remaining columns are returned as the second dataframe (X2).

fit(X)

The main function to fit the split task to the input data frame X.

X: pandas dataframe

The input dataframe

pandas dataframe

X1: the dataframe of columns chosen by the selection parameter

pandas dataframe

X2: the dataframe of columns not selected as X1
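The two selection modes can be illustrated with plain pandas. This is a sketch of the documented semantics only, not the ChemML source; the function name `split_columns` is hypothetical.

```python
import pandas as pd


def split_columns(X, selection=1):
    """Hypothetical helper mirroring the documented selection semantics."""
    if isinstance(selection, int):
        # Integer: take that many columns from the left side.
        X1 = X.iloc[:, :selection]
        X2 = X.iloc[:, selection:]
    else:
        # List: take exactly the named columns; the rest go to X2.
        X1 = X.loc[:, list(selection)]
        X2 = X.drop(columns=list(selection))
    return X1, X2


df = pd.DataFrame({"id": [1, 2], "x": [0.1, 0.2], "y": [5, 6]})
left, right = split_columns(df, 1)            # first column vs. the rest
named, rest = split_columns(df, ["x", "y"])   # named columns vs. the rest
```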

class chemml.initialization.XYZreader(path_pattern, path_root=None, Z={'Ac': 89.0, 'Ag': 47.0, 'Al': 13.0, 'Am': 95.0, 'Ar': 18.0, 'As': 33.0, 'At': 85.0, 'Au': 79.0, 'B': 5.0, 'Ba': 56.0, 'Be': 4.0, 'Bh': 107.0, 'Bi': 83.0, 'Bk': 97.0, 'Br': 35.0, 'C': 6.0, 'Ca': 20.0, 'Cd': 48.0, 'Ce': 58.0, 'Cf': 98.0, 'Cl': 17.0, 'Cm': 96.0, 'Cn': 112.0, 'Co': 27.0, 'Cr': 24.0, 'Cs': 55.0, 'Cu': 29.0, 'Db': 105.0, 'Ds': 110.0, 'Dy': 66.0, 'Er': 68.0, 'Es': 99.0, 'Eu': 63.0, 'F': 9.0, 'Fe': 26.0, 'Fl': 114.0, 'Fm': 100.0, 'Fr': 87.0, 'Ga': 31.0, 'Gd': 64.0, 'Ge': 32.0, 'H': 1.0, 'He': 2.0, 'Hf': 72.0, 'Hg': 80.0, 'Ho': 67.0, 'Hs': 108.0, 'I': 53.0, 'In': 49.0, 'Ir': 77.0, 'K': 19.0, 'Kr': 36.0, 'La': 57.0, 'Li': 3.0, 'Lr': 103.0, 'Lu': 71.0, 'Lv': 116.0, 'Md': 101.0, 'Mg': 12.0, 'Mn': 25.0, 'Mo': 42.0, 'Mt': 109.0, 'N': 7.0, 'Na': 11.0, 'Nb': 41.0, 'Nd': 60.0, 'Ne': 10.0, 'Ni': 28.0, 'No': 102.0, 'Np': 93.0, 'O': 8.0, 'Os': 76.0, 'P': 15.0, 'Pa': 91.0, 'Pb': 82.0, 'Pd': 46.0, 'Pm': 61.0, 'Po': 84.0, 'Pr': 59.0, 'Pt': 78.0, 'Pu': 94.0, 'Ra': 88.0, 'Rb': 37.0, 'Re': 75.0, 'Rf': 104.0, 'Rg': 111.0, 'Rh': 45.0, 'Rn': 86.0, 'Ru': 44.0, 'S': 16.0, 'Sb': 51.0, 'Sc': 21.0, 'Se': 34.0, 'Sg': 106.0, 'Si': 14.0, 'Sm': 62.0, 'Sn': 50.0, 'Sr': 38.0, 'Ta': 73.0, 'Tb': 65.0, 'Tc': 43.0, 'Te': 52.0, 'Th': 90.0, 'Ti': 22.0, 'Tl': 81.0, 'Tm': 69.0, 'U': 92.0, 'Uuo': 118.0, 'Uup': 115.0, 'Uus': 117.0, 'Uut': 113.0, 'V': 23.0, 'W': 74.0, 'Xe': 54.0, 'Y': 39.0, 'Yb': 70.0, 'Zn': 30.0, 'Zr': 40.0}, reader='auto', skip_lines=[2, 0], path_only=False)

(XYZreader) Read molecules’ geometry (cartesian coordinates) from one or more XYZ files.

path_pattern: string or list of string

A pattern or a list of patterns. Each pattern consists of file name and path. The pattern can contain any special character in the following list:

* : matches everything (zero or more characters)

? : matches any single character

[seq] : matches any character in seq

[!seq] : matches any character not in seq

/ : filename separator

Note: seq can indicate a range of characters by giving two characters separated by a ‘-’.

However, a range covers only a single character position. That is why this parameter also accepts a list of patterns, one for each number of digits. For example, range(1,30) requires ‘[1-9]’ and ‘[1-2][0-9]’.

The pattern matching uses the fnmatch library (Unix filename pattern matching). Note: The pattern must include the file extension at the end. The only acceptable extension is ‘.xyz’.
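The single-character limit of a bracket range is easy to check with the standard-library fnmatch module directly. The sketch below is illustrative only: the file names and the helper `matches_any` are made up, and it shows why two patterns are needed to match file names numbered 1 to 29.

```python
from fnmatch import fnmatch

# Each bracket expression matches exactly one character, so one-digit and
# two-digit numbers need separate patterns.
patterns = ["[1-9]_opt.xyz", "[1-2][0-9]_opt.xyz"]


def matches_any(name, patterns):
    """Return True if the file name matches at least one pattern."""
    return any(fnmatch(name, p) for p in patterns)


# Candidate names 0_opt.xyz ... 31_opt.xyz (hypothetical files).
names = ["%d_opt.xyz" % i for i in range(0, 32)]
matched = [n for n in names if matches_any(n, patterns)]
```

Here `matched` keeps the names numbered 1 through 29 and drops 0, 30, and 31.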

path_root: string, optional (default = None)

The fixed part of the path (with no special characters). If path_root is None, this function tries to open path_pattern as a single file (without any pattern matching). It is a fixed path to subdirectories or files, so none of the above special characters except ‘/’ can be used. If not None, it determines the path that this function walks through, looking for every file that matches the pattern. To start walking from the current directory, the path should be ‘.’
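The interplay between a fixed root and a pattern can be sketched with os.walk and fnmatch. This is not the ChemML implementation: the helper `find_files` is hypothetical, and it assumes the pattern is compared against each file's path relative to path_root.

```python
import os
import tempfile
from fnmatch import fnmatch


def find_files(path_pattern, path_root=None):
    """Hypothetical helper: list files under path_root matching path_pattern."""
    if path_root is None:
        # No root: treat the pattern as the path of a single file.
        return [path_pattern] if os.path.isfile(path_pattern) else []
    found = []
    for dirpath, _dirnames, filenames in os.walk(path_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Assumption: match on the path relative to the root, with '/'.
            rel = os.path.relpath(full, path_root).replace(os.sep, "/")
            if fnmatch(rel, path_pattern):
                found.append(full)
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    # Build a small throwaway directory tree with empty files.
    for rel in ["1f/1_opt.xyz", "2c/2_opt.xyz", "2c/notes.txt", "5a/5_opt.xyz"]:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    hits = find_files("[1,2,3,4]?/*_opt.xyz", path_root=root)
    relative_hits = [os.path.relpath(h, root).replace(os.sep, "/") for h in hits]
```

Only the files in the `1f` and `2c` directories that end in `_opt.xyz` are returned; `5a` is excluded by the leading bracket expression.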

Z: dictionary, optional (default = {‘Ru’: 44.0, ‘Re’: 75.0, ‘Rf’: 104.0, ‘Rg’: 111.0, ‘Ra’: 88.0, ‘Rb’: 37.0, ‘Rn’: 86.0, ‘Rh’: 45.0, ‘Be’: 4.0, ‘Ba’: 56.0, ‘Bh’: 107.0, ‘Bi’: 83.0, ‘Bk’: 97.0, ‘Br’: 35.0, ‘H’: 1.0, ‘P’: 15.0, ‘Os’: 76.0, ‘Es’: 99.0, ‘Hg’: 80.0, ‘Ge’: 32.0, ‘Gd’: 64.0, ‘Ga’: 31.0, ‘Pr’: 59.0, ‘Pt’: 78.0, ‘Pu’: 94.0, ‘C’: 6.0, ‘Pb’: 82.0, ‘Pa’: 91.0, ‘Pd’: 46.0, ‘Cd’: 48.0, ‘Po’: 84.0, ‘Pm’: 61.0, ‘Hs’: 108.0, ‘Uup’: 115.0, ‘Uus’: 117.0, ‘Uuo’: 118.0, ‘Ho’: 67.0, ‘Hf’: 72.0, ‘K’: 19.0, ‘He’: 2.0, ‘Md’: 101.0, ‘Mg’: 12.0, ‘Mo’: 42.0, ‘Mn’: 25.0, ‘O’: 8.0, ‘Mt’: 109.0, ‘S’: 16.0, ‘W’: 74.0, ‘Zn’: 30.0, ‘Eu’: 63.0, ‘Zr’: 40.0, ‘Er’: 68.0, ‘Ni’: 28.0, ‘No’: 102.0, ‘Na’: 11.0, ‘Nb’: 41.0, ‘Nd’: 60.0, ‘Ne’: 10.0, ‘Np’: 93.0, ‘Fr’: 87.0, ‘Fe’: 26.0, ‘Fl’: 114.0, ‘Fm’: 100.0, ‘B’: 5.0, ‘F’: 9.0, ‘Sr’: 38.0, ‘N’: 7.0, ‘Kr’: 36.0, ‘Si’: 14.0, ‘Sn’: 50.0, ‘Sm’: 62.0, ‘V’: 23.0, ‘Sc’: 21.0, ‘Sb’: 51.0, ‘Sg’: 106.0, ‘Se’: 34.0, ‘Co’: 27.0, ‘Cn’: 112.0, ‘Cm’: 96.0, ‘Cl’: 17.0, ‘Ca’: 20.0, ‘Cf’: 98.0, ‘Ce’: 58.0, ‘Xe’: 54.0, ‘Lu’: 71.0, ‘Cs’: 55.0, ‘Cr’: 24.0, ‘Cu’: 29.0, ‘La’: 57.0, ‘Li’: 3.0, ‘Lv’: 116.0, ‘Tl’: 81.0, ‘Tm’: 69.0, ‘Lr’: 103.0, ‘Th’: 90.0, ‘Ti’: 22.0, ‘Te’: 52.0, ‘Tb’: 65.0, ‘Tc’: 43.0, ‘Ta’: 73.0, ‘Yb’: 70.0, ‘Db’: 105.0, ‘Dy’: 66.0, ‘Ds’: 110.0, ‘I’: 53.0, ‘U’: 92.0, ‘Y’: 39.0, ‘Ac’: 89.0, ‘Ag’: 47.0, ‘Uut’: 113.0, ‘Ir’: 77.0, ‘Am’: 95.0, ‘Al’: 13.0, ‘As’: 33.0, ‘Ar’: 18.0, ‘Au’: 79.0, ‘At’: 85.0, ‘In’: 49.0})

A dictionary of nuclear charges with respect to the chemical symbols of all atom types in the xyz files.

reader: string, optional (default = ‘auto’)

Available options: ‘auto’ and ‘manual’. If ‘auto’, the openbabel readstring function creates the molecule object. The file type for the openbabel class is set to ‘xyz’, so the file must follow the typical xyz format. With the ‘manual’ reader, however, you can skip some lines from the top or bottom of the xyz files.

skip_lines: list of two integers, optional (default = [2,0])

Number of lines to skip (int) from the top and bottom of the xyz files, respectively. In the original xyz format, only 2 lines from the top and none from the bottom of a file can be skipped; the number of atoms on the first line is therefore also ignored. Only available for the ‘manual’ reader.
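To make the role of skip_lines and Z concrete, here is a minimal sketch of a manual XYZ parser. It is an illustration under stated assumptions, not the ChemML reader: `parse_xyz` is a hypothetical function, the Z table is a three-element subset, and each atom line is assumed to hold a symbol followed by three coordinates.

```python
# Small subset of the nuclear-charge table, for illustration only.
Z = {"H": 1.0, "C": 6.0, "O": 8.0}


def parse_xyz(text, skip_lines=(2, 0), Z=Z):
    """Hypothetical parser: return (nuclear_charges, coordinates) from XYZ text."""
    lines = text.strip().splitlines()
    top, bottom = skip_lines
    # Drop the requested number of lines from the top and the bottom.
    body = lines[top:len(lines) - bottom]
    charges, coords = [], []
    for line in body:
        symbol, x, y, z = line.split()[:4]
        charges.append(Z[symbol])  # map chemical symbol to nuclear charge
        coords.append((float(x), float(y), float(z)))
    return charges, coords


water = """3
water molecule
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""
charges, coords = parse_xyz(water)
```

With the default skip_lines of (2, 0), the atom-count line and the comment line are discarded and the three atom lines are parsed.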

max_n_atoms_: int

Maximum number of atoms in the pool of molecules. This can be useful if you want to set this parameter in the feature representation methods, e.g. Coulomb_Matrix.

Some pattern examples:

(1) path_pattern: ‘Mydir/1f/1_opt.xyz’
    path_root: None
    one file will be read: ‘Mydir/1f/1_opt.xyz’

(2) path_pattern: ‘[1,2,3,4]?/*_opt.xyz’
    path_root: ‘Mydir’
    sample files to be read: ‘Mydir/1f/1_opt.xyz’, ‘Mydir/2c/2_opt.xyz’, …

(3) path_pattern: ‘[!1,2]?/*_opt.xyz’
    path_root: ‘.’
    sample files to be read: ‘./3f/3_opt.xyz’, ‘Mydir/7c/7_opt.xyz’, …

(4) path_pattern: ‘*[‘f’,’c’]/*_opt.xyz’
    path_root: ‘Mydir’
    sample files to be read: ‘Mydir/1f/1_opt.xyz’, ‘Mydir/2c/2_opt.xyz’, …

(5) path_pattern: [‘[2-5]_opt.xyz’, ‘[1-9][2-5]_opt.xyz’]
    path_root: ‘Mydir/all’
    sample files to be read: ‘Mydir/all/1f/1_opt.xyz’, ‘Mydir/all/2c/2_opt.xyz’, …

read()

Read the XYZ files based on the path_pattern and path_root parameters and create a list of chemml.chem.Molecule objects.

molecules: list

A list of chemml.chem.Molecule objects