Add automatic data type inference to PandasTools.LoadSDF #7348

ichxw · 2024-04-10T21:05:15Z

The RDKit library's PandasTools.LoadSDF function currently lacks the ability to automatically detect the data types of columns when loading data from an SDF file into a Pandas DataFrame. Users have to manually specify the data types, which can be time-consuming and error-prone.

I propose adding an optional dtype parameter to PandasTools.LoadSDF, similar to the pd.read_csv() function in Pandas. This would allow the function to automatically infer the data types of the columns, reducing the manual effort required by the user.

It would have a few benefits:
Improved user experience and reduced errors
Increased efficiency when working with large or complex SDF files
Consistency with Pandas' pd.read_csv() function
This feature would be a valuable addition to the RDKit library, benefiting users who work with SDF data and Pandas DataFrames.
Thanks for your consideration.
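
To make the request concrete, here is a minimal sketch of how an explicit dtype mapping could behave, modeled on the dtype argument of pd.read_csv(). The names apply_dtype and load_sdf_with_dtype are hypothetical helpers, not part of RDKit, and LoadSDF itself has no dtype parameter today; the sketch simply post-processes the DataFrame that LoadSDF returns.

```python
import pandas as pd


def apply_dtype(frame, dtype=None):
    """Cast selected columns, mirroring the dtype argument of pd.read_csv().

    Hypothetical helper, not part of RDKit. `dtype` maps column names to
    pandas/NumPy dtypes; columns absent from the frame are ignored.
    """
    if dtype is None:
        return frame
    known = {col: typ for col, typ in dtype.items() if col in frame.columns}
    # astype returns a new DataFrame; the input frame is left unchanged.
    return frame.astype(known)


def load_sdf_with_dtype(path, dtype=None, **kwargs):
    """Hypothetical wrapper: load an SDF file, then cast the property columns."""
    # Imported lazily so that apply_dtype can be used without RDKit installed.
    from rdkit.Chem import PandasTools

    return apply_dtype(PandasTools.LoadSDF(path, **kwargs), dtype)


# Stand-in for a LoadSDF result: every property arrives as a Python string.
props = pd.DataFrame(
    {"AMW": ["180.16", "46.07"], "NUM_RINGS": ["1", "0"], "SMILES": ["CCO", "CO"]},
    dtype=object,
)
typed = apply_dtype(props, {"AMW": "float64", "NUM_RINGS": "int64"})
```

One caveat with this approach: casting to a plain integer dtype fails if a molecule is missing that property, so columns with gaps would need float64 or the nullable "Int64" dtype instead.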

joseerlang · 2024-04-18T16:50:27Z

Hi there,
Would something like the following make sense to you?

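import os

from rdkit.Chem import RDConfig, PandasTools
from dtype_diet import report_on_dataframe, optimize_dtypes

# Load the sample SDF that ships with RDKit; every property column comes back as object.
sdfFile = os.path.join(RDConfig.RDDataDir, 'NCI/first_200.props.sdf')
frame = PandasTools.LoadSDF(sdfFile, smilesName='SMILES', molColName='Molecule', includeFingerprints=True)

# Ask dtype_diet for a proposed dtype per column, then apply the proposal.
proposed_df = report_on_dataframe(frame)
new_df = optimize_dtypes(frame, proposed_df)

# Let pandas pick the best nullable extension dtypes for what remains.
new_df = new_df.convert_dtypes(infer_objects=True)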

ichxw · 2024-04-22T02:32:55Z

Thanks, but it only converted the data types from object to string or category. Below are the test results.
Data types before converting:

AMW                       object
CLOGP                     object
CP                        object
CR                        object
DAYLIGHT.FPG              object
DAYLIGHT_CLOGP            object
FP                        object
ISM                       object
LIPINSKI_VIOLATIONS       object
NUM_HACCEPTORS            object
NUM_HDONORS               object
NUM_HETEROATOMS           object
NUM_LIPINSKIHACCEPTORS    object
NUM_LIPINSKIHDONORS       object
NUM_RINGS                 object
NUM_ROTATABLEBONDS        object
NUM_ROTATABLEBONDS_O      object
P1                        object
SMILES                    object
ID                        object
Molecule                  object
dtype: object

Data types after converting:

AMW                       string[python]
CLOGP                     string[python]
CP                        string[python]
CR                        string[python]
DAYLIGHT.FPG                    category
DAYLIGHT_CLOGP            string[python]
FP                        string[python]
ISM                       string[python]
LIPINSKI_VIOLATIONS             category
NUM_HACCEPTORS                  category
NUM_HDONORS                     category
NUM_HETEROATOMS                 category
NUM_LIPINSKIHACCEPTORS          category
NUM_LIPINSKIHDONORS             category
NUM_RINGS                       category
NUM_ROTATABLEBONDS              category
NUM_ROTATABLEBONDS_O            category
P1                              category
SMILES                    string[python]
ID                              category
Molecule                          object
dtype: object

In this case, I would prefer converting columns such as AMW and CLOGP to float, and columns such as LIPINSKI_VIOLATIONS and NUM_HACCEPTORS to integer.
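
The numeric conversion described here can be sketched with pd.to_numeric applied column by column. This is only an illustration under stated assumptions: infer_numeric_columns is a hypothetical helper, not an RDKit or dtype_diet function, and the small DataFrame below stands in for a LoadSDF result with made-up values.

```python
import pandas as pd


def infer_numeric_columns(frame, skip=("Molecule",)):
    """Return a copy of `frame` with fully numeric columns converted.

    Hypothetical helper, not part of RDKit. Columns listed in `skip`
    (for example the molecule column) are never touched, and a column
    containing any value that does not parse as a number is left as it was.
    """
    out = frame.copy()
    for col in out.columns:
        if col in skip:
            continue
        try:
            # Raises if any value cannot be parsed as a number.
            converted = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            continue
        out[col] = converted
    return out


# Stand-in for a LoadSDF result: property values arrive as strings.
raw = pd.DataFrame(
    {
        "AMW": ["180.16", "46.07"],
        "NUM_HACCEPTORS": ["4", "1"],
        "SMILES": ["CC(=O)Oc1ccccc1C(=O)O", "CCO"],
    },
    dtype=object,
)
converted = infer_numeric_columns(raw)
```

With this approach any column made up entirely of digits, such as a numeric ID, would also become an integer column, so identifier columns may need to be added to skip.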

ichxw added the enhancement label Apr 10, 2024
