Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: using docstring extensions causes several incorrect behaviors because DataFrame, Series, and BasePandasDataset cannot recognize each other #7138

Open
3 tasks done
mvashishtha opened this issue Mar 28, 2024 · 1 comment
Labels
bug 🦗 Something isn't working Enable plugin Fixes needed to enable external plugins P3 Very minor bugs, or features we can hopefully add some day.

Comments

@mvashishtha
Copy link
Collaborator

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import pandas
import numpy as np

import modin.pandas as pd
from modin.config import DocModule

assert pd.DataFrame([True]).all(axis=None) is np.bool_(True)

DocModule.put("modin.config.test.docs_module")

# Test for override
assert (
    pd.DataFrame.apply.__doc__
    == "This is a test of the documentation module for DataFrame."
)
# Test for pandas doc when method is not defined on the plugin module
assert pandas.DataFrame.isna.__doc__ in pd.DataFrame.isna.__doc__
assert pandas.DataFrame.isnull.__doc__ in pd.DataFrame.isnull.__doc__
# Test for override
assert (
    pd.Series.isna.__doc__
    == "This is a test of the documentation module for Series."
)
# Test for pandas doc when method is not defined on the plugin module
assert pandas.Series.isnull.__doc__ in pd.Series.isnull.__doc__
assert pandas.Series.apply.__doc__ in pd.Series.apply.__doc__
# Test for override
assert pd.read_csv.__doc__ == "Test override for functions on the module."
# Test for pandas doc when function is not defined on module.
assert pandas.read_table.__doc__ in pd.read_table.__doc__

assert pd.DataFrame([True]).all(axis=None) is np.bool_(True)

Issue Description

The reproducible example is a modified version of modin/config/test/test_envvars.py::test_doc_module at modin commit a616e4c.

The cause is that we redefine BasePandasDataset, Series, and DataFrame twice each:

  1. define BasePandasDataset, then Series, then DataFrame when we first load modin.pandas
  2. redefine BasePandasDataset when we do importlib.reload(pd.base) here because we're adding the doc module.
  3. redefine DataFrame when we do importlib.reload(pd.dataframe) here because we're adding the doc module.
  4. redefine Series when we do importlib.reload(pd.series) here because we're adding the doc module.

When we execute all, BasePandasDataset here, BasePandasDataset is the second BasePandasDataset class, but result is an instance of the first BasePandasDataset class because dataframe imported Series before the series module was reloaded. So result should be an instance of BasePandasDataset, but it's not, so we skip the extra reduction and end up with a Series result instead of a boolean.

There's no good order to do the reloads in because all three classes need to reference each other. We could do import modin.pandas.base as modin_base and always do base.BasePandasDataset instead of from modin.pandas.base import BsaePandasDataset, but that's hadr to read.

Proposed solution: instead of doing importlib.reload, we should update docstrings in place when we do DocModule.put().

Expected Behavior

all(axis=None) should always return a boolean. In general, BasePandasDataset, Series, and DataFrame should all be able to reference each other.

Error Logs

___________________________________________________________________ test_doc_module ____________________________________________________________________

    def test_doc_module():
        import pandas
        import numpy as np

        import modin.pandas as pd
        from modin.config import DocModule

        assert pd.DataFrame([True]).all(axis=None) is np.bool_(True)

        DocModule.put("modin.config.test.docs_module")

        # Test for override
        assert (
            pd.DataFrame.apply.__doc__
            == "This is a test of the documentation module for DataFrame."
        )
        # Test for pandas doc when method is not defined on the plugin module
        assert pandas.DataFrame.isna.__doc__ in pd.DataFrame.isna.__doc__
        assert pandas.DataFrame.isnull.__doc__ in pd.DataFrame.isnull.__doc__
        # Test for override
        assert (
            pd.Series.isna.__doc__
            == "This is a test of the documentation module for Series."
        )
        # Test for pandas doc when method is not defined on the plugin module
        assert pandas.Series.isnull.__doc__ in pd.Series.isnull.__doc__
        assert pandas.Series.apply.__doc__ in pd.Series.apply.__doc__
        # Test for override
        assert pd.read_csv.__doc__ == "Test override for functions on the module."
        # Test for pandas doc when function is not defined on module.
        assert pandas.read_table.__doc__ in pd.read_table.__doc__

>       assert pd.DataFrame([True]).all(axis=None) is np.bool_(True)
E       AssertionError: assert 0    True\ndtype: bool is True
E        +  where 0    True\ndtype: bool = <bound method BasePandasDataset.all of       0\n0  True>(axis=None)
E        +    where <bound method BasePandasDataset.all of       0\n0  True> =       0\n0  True.all
E        +      where       0\n0  True = <class 'modin.pandas.dataframe.DataFrame'>([True])
E        +        where <class 'modin.pandas.dataframe.DataFrame'> = <module 'modin.pandas' from '/Users/mvashishtha/sources/modin/modin/pandas/__init__.py'>.DataFrame
E        +  and   True = <class 'numpy.bool_'>(True)
E        +    where <class 'numpy.bool_'> = <module 'numpy' from '/Users/mvashishtha/miniconda3/envs/modin-latest/lib/python3.9/site-packages/numpy/__init__.py'>.bool_

modin/config/test/test_envvars.py:116: AssertionErro

Installed Versions

INSTALLED VERSIONS

commit : a616e4c
python : 3.9.18.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Wed Feb 21 21:45:49 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.28.0+35.ga616e4cc.dirty
ray : 2.8.0
dask : 2024.3.1
distributed : 2024.3.1
hdk : None

pandas dependencies

pandas : 2.2.1
numpy : 1.26.1
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : None
pytest : 8.1.1
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.0
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 8.17.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.8.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.22.0
pyarrow : 14.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : 1.11.3
sqlalchemy : 2.0.29
tables : 3.9.2
tabulate : None
xarray : 2024.2.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@mvashishtha mvashishtha added bug 🦗 Something isn't working P3 Very minor bugs, or features we can hopefully add some day. Enable plugin Fixes needed to enable external plugins labels Mar 28, 2024
@mvashishtha mvashishtha self-assigned this Apr 1, 2024
mvashishtha pushed a commit that referenced this issue Apr 1, 2024
Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
@mvashishtha mvashishtha removed their assignment Apr 1, 2024
@mvashishtha
Copy link
Collaborator Author

I have a fix for this on my fork of modin, but I can't make a PR at the moment: https://github.com/mvashishtha/modin/tree/7138-update-docstrings-in-place

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🦗 Something isn't working Enable plugin Fixes needed to enable external plugins P3 Very minor bugs, or features we can hopefully add some day.
Projects
None yet
Development

No branches or pull requests

1 participant