Add automatic data type inference to PandasTools.LoadSDF #7348

Open
ichxw opened this issue Apr 10, 2024 · 2 comments

ichxw commented Apr 10, 2024

The RDKit library's PandasTools.LoadSDF function currently lacks the ability to automatically detect the data types of columns when loading data from an SDF file into a Pandas DataFrame. Users have to manually specify the data types, which can be time-consuming and error-prone.

I propose adding an optional dtype parameter to PandasTools.LoadSDF, similar to the pd.read_csv() function in Pandas. This would allow the function to automatically infer the data types of the columns, reducing the manual effort required by the user.

This would have a few benefits:
Improved user experience and reduced errors
Increased efficiency when working with large or complex SDF files
Consistency with Pandas' pd.read_csv() function
This feature would be a valuable addition to the RDKit library, benefiting users who work with SDF data and Pandas DataFrames.
Thanks for your consideration.
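
To make the request concrete, here is a minimal sketch of the contrast being described. It uses a small synthetic DataFrame of string values as a stand-in for what LoadSDF returns (the column names are borrowed from the sample file discussed in this thread, the rows are illustrative only), and compares a manual `astype` mapping with the inference that `pd.read_csv` performs on the same text. It is not RDKit code and does not reflect any existing LoadSDF option.

```python
import io

import pandas as pd

# Synthetic stand-in for SDF properties: every value arrives as a string.
# Rows are illustrative only, not taken from a real SDF file.
raw = pd.DataFrame({
    "AMW": ["180.16", "46.07"],
    "NUM_RINGS": ["1", "0"],
    "SMILES": ["CC(=O)Oc1ccccc1C(=O)O", "CCO"],
})

# Manual approach: the caller has to know and spell out each column's type.
manual = raw.astype({"AMW": "float64", "NUM_RINGS": "int64"})

# For comparison, pd.read_csv infers numeric columns from the same text.
csv_text = raw.to_csv(index=False)
inferred = pd.read_csv(io.StringIO(csv_text))

print(manual.dtypes)
print(inferred.dtypes)
```

With many property columns, the manual mapping has to be written and maintained per file, which is the effort the proposal aims to remove.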

@joseerlang

Hi there,
Would something like the following make sense to you?

import os

from dtype_diet import optimize_dtypes, report_on_dataframe
from rdkit import RDConfig
from rdkit.Chem import PandasTools

# Load the sample SDF that ships with RDKit into a DataFrame
sdfFile = os.path.join(RDConfig.RDDataDir, 'NCI/first_200.props.sdf')
frame = PandasTools.LoadSDF(sdfFile, smilesName='SMILES', molColName='Molecule', includeFingerprints=True)

# Let dtype_diet propose dtypes, apply them, then let pandas refine the rest
proposed_df = report_on_dataframe(frame)
new_df = optimize_dtypes(frame, proposed_df)
new_df = new_df.convert_dtypes(infer_objects=True)

ichxw commented Apr 22, 2024

> Hi there,
> Would something like the following make sense to you?
>
> import rdkit
> from rdkit.Chem import RDConfig,PandasTools
> import os
> from dtype_diet import report_on_dataframe, optimize_dtypes
> sdfFile = os.path.join(RDConfig.RDDataDir,'NCI/first_200.props.sdf')
> frame = PandasTools.LoadSDF(sdfFile,smilesName='SMILES',molColName='Molecule', includeFingerprints=True)
> proposed_df = report_on_dataframe(frame)
> new_df = optimize_dtypes(frame, proposed_df)
> new_df =new_df.convert_dtypes(infer_objects=True)

Thanks, but it only converted the data types from object to string or category. Here are the test results.
Data types before converting:

AMW                       object
CLOGP                     object
CP                        object
CR                        object
DAYLIGHT.FPG              object
DAYLIGHT_CLOGP            object
FP                        object
ISM                       object
LIPINSKI_VIOLATIONS       object
NUM_HACCEPTORS            object
NUM_HDONORS               object
NUM_HETEROATOMS           object
NUM_LIPINSKIHACCEPTORS    object
NUM_LIPINSKIHDONORS       object
NUM_RINGS                 object
NUM_ROTATABLEBONDS        object
NUM_ROTATABLEBONDS_O      object
P1                        object
SMILES                    object
ID                        object
Molecule                  object
dtype: object

Data types after converting:

AMW                       string[python]
CLOGP                     string[python]
CP                        string[python]
CR                        string[python]
DAYLIGHT.FPG                    category
DAYLIGHT_CLOGP            string[python]
FP                        string[python]
ISM                       string[python]
LIPINSKI_VIOLATIONS             category
NUM_HACCEPTORS                  category
NUM_HDONORS                     category
NUM_HETEROATOMS                 category
NUM_LIPINSKIHACCEPTORS          category
NUM_LIPINSKIHDONORS             category
NUM_RINGS                       category
NUM_ROTATABLEBONDS              category
NUM_ROTATABLEBONDS_O            category
P1                              category
SMILES                    string[python]
ID                              category
Molecule                          object
dtype: object

In this case, I would prefer columns such as AMW and CLOGP to be converted to float, and columns such as LIPINSKI_VIOLATIONS and NUM_HACCEPTORS to integer.
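
One possible way to get that result, sketched under assumptions: `convert_dtypes` picks a dtype for the values as they already are and does not parse strings, so string properties stay strings. Parsing them with `pd.to_numeric`, column by column, and keeping any column that fails to parse should give int or float where every value is numeric. The helper name `infer_numeric_dtypes` and its `skip` parameter are hypothetical, not part of RDKit or pandas, and the demo frame is synthetic rather than real LoadSDF output.

```python
import pandas as pd


def infer_numeric_dtypes(df, skip=("Molecule",)):
    """Return a copy of df with columns parsed as int or float where every value allows it.

    Hypothetical helper for illustration; `skip` lists columns to leave untouched.
    """
    out = df.copy()
    for col in out.columns:
        if col in skip or pd.api.types.is_numeric_dtype(out[col]):
            continue
        try:
            # With the default errors="raise", this fails if any value is not a number
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # Leave non-numeric columns (e.g. SMILES, molecule objects) unchanged
            pass
    return out


# Synthetic stand-in for a LoadSDF result: object columns holding strings.
frame = pd.DataFrame(
    {
        "AMW": ["180.16", "46.07"],
        "LIPINSKI_VIOLATIONS": ["0", "0"],
        "SMILES": ["CC(=O)Oc1ccccc1C(=O)O", "CCO"],
    },
    dtype=object,
)

converted = infer_numeric_dtypes(frame)
print(converted.dtypes)
```

Two caveats with this approach: an integer-looking column that has missing values comes back as float, and identifier-like columns made only of digits (for example ID) would also be converted unless they are added to `skip`.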
