FEAT-#4605: Adding small query compiler #7259

Draft
wants to merge 10 commits into
base: main
from

Conversation

arunjose696
Collaborator

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Handle Empty/Small Data DataFrames as a separate case #4605
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

need_columns_reindex=need_columns_reindex,
)
else:
broadcasted_items = broadcasted_items

Check failure

Code scanning / CodeQL

Redundant assignment Error

This assignment assigns a variable to itself.
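
One way to clear an alert like this is to drop the `else` branch entirely, since a name that is rebound in only one branch keeps its previous value otherwise. A minimal sketch, where `transform` and the `needs_transform` flag are hypothetical stand-ins for the surrounding code in the PR:

```python
def transform(items):
    # Hypothetical stand-in for the real broadcasting step.
    return [item * 2 for item in items]


def prepare(broadcasted_items, needs_transform):
    # Rebind only when work is needed; no `else: x = x` branch is required.
    if needs_transform:
        broadcasted_items = transform(broadcasted_items)
    return broadcasted_items
```

When the flag is false, the argument is returned unchanged, which is what the self-assignment was trying to express.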
Comment on lines +439 to +445
# if len(df.columns) == 1 and df.columns[0] == "__reduced__":
# df = df["__reduced__"]

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
modin/pandas/dataframe.py: fixed
modin/pandas/io.py: fixed
modin/pandas/series.py: fixed
@arunjose696 changed the title from "Adding small query compiler" to "FEAT-#4605: Adding small query compiler" on May 13, 2024
@arunjose696 force-pushed the arun-sqc branch 3 times, most recently from f80e353 to 41bab97 (May 16, 2024 11:45)
modin/pandas/base.py: fixed
Comment on lines +623 to +626
if hasattr(pandas_frame, "_to_pandas"):
pandas_frame = pandas_frame._to_pandas()
if is_scalar(pandas_frame):
pandas_frame = pandas.DataFrame([pandas_frame])
elif not isinstance(pandas_frame, pandas.DataFrame):
pandas_frame = pandas.DataFrame(pandas_frame)
Collaborator

Why so many conditions? Don't we always pass a pandas DataFrame into this constructor?

Comment on lines +69 to +71
from modin.experimental.core.storage_formats.pandas.small_query_compiler import (
PlainPandasQueryCompiler,
)

Check notice

Code scanning / CodeQL

Cyclic import Note

Import of module
modin.experimental.core.storage_formats.pandas.small_query_compiler
begins an import cycle.
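
A common way to break a cycle like this is to defer the import to the point of use, so the module is resolved when the function runs rather than when the importing module loads. The sketch below uses the standard-library `json` module as a stand-in for the compiler module, and both function names are hypothetical:

```python
def get_compiler_module():
    # Import inside the function body: this statement runs at call time,
    # after both modules in the cycle have finished loading.
    import json  # stand-in for the module that would otherwise form a cycle

    return json


def serialize(value):
    # Callers go through the accessor instead of a module-level import.
    return get_compiler_module().dumps(value)
```

Whether this is preferable to restructuring the modules depends on how often the import is needed; it is shown here only as one option.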
modin/pandas/dataframe.py: fixed
Comment on lines +70 to +71
from modin.experimental.core.storage_formats.pandas.small_query_compiler import (
PlainPandasQueryCompiler,
)

Check notice

Code scanning / CodeQL

Cyclic import Note

Import of module
modin.experimental.core.storage_formats.pandas.small_query_compiler
begins an import cycle.
modin/pandas/series.py: fixed
from pandas.core.dtypes.common import is_list_like, is_scalar

from modin.config.envvars import UsePlainPandasQueryCompiler
from modin.core.storage_formats.base.query_compiler import BaseQueryCompiler

Check notice

Code scanning / CodeQL

Cyclic import Note

Import of module
modin.core.storage_formats.base.query_compiler
begins an import cycle.
docs/conf.py: outdated, resolved
modin/config/envvars.py: resolved
@@ -47,7 +47,6 @@ def bin_ops_wrapper(df, other, *args, **kwargs):
"squeeze_other", False
)
squeeze_self = kwargs.pop("squeeze_self", False)

Collaborator

Should be reverted.

Comment on lines +289 to +290
if isinstance(self._query_compiler, PlainPandasQueryCompiler):
return self._query_compiler.to_pandas().iloc[indexer]
Collaborator

Why is this change needed? Isn't the line below sufficient?

modin/pandas/dataframe.py: outdated, resolved
if UsePlainPandasQueryCompiler.get():
return ModinObjects.DataFrame(query_compiler=PlainPandasQueryCompiler(df))
Collaborator

Maybe we should move this to BaseFactory._to_pandas?

Collaborator

Maybe you mean from_pandas?
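
The suggestion above amounts to choosing the query compiler in one factory entry point rather than in each caller. A rough sketch of that shape, using hypothetical stand-in classes and a plain settings object instead of Modin's real factory and config classes:

```python
class DistributedCompiler:
    """Hypothetical stand-in for the partitioned query compiler."""

    def __init__(self, data):
        self.data = data


class PlainCompiler:
    """Hypothetical stand-in for the pandas-backed query compiler."""

    def __init__(self, data):
        self.data = data


class Settings:
    # Assumption: a simple class attribute replaces the real config object.
    use_plain_compiler = False


class Factory:
    @classmethod
    def from_pandas(cls, data):
        # The switch lives here once, so callers never check the flag.
        if Settings.use_plain_compiler:
            return PlainCompiler(data)
        return DistributedCompiler(data)
```

With this layout, the DataFrame and Series constructors would call `Factory.from_pandas` and stay unaware of which compiler they received.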

modin/pandas/dataframe.py: outdated, resolved
modin/pandas/series.py: outdated, resolved
setup.cfg: outdated, resolved
arunjose696 and others added 6 commits May 22, 2024 08:00
Signed-off-by: arunjose696 <arunjose696@gmail.com>
Signed-off-by: arunjose696 <arunjose696@gmail.com>
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
modin/pandas/utils.py: fixed
Collaborator

@devin-petersohn left a comment

Great start on solving this problem! Is it possible to avoid so many of the test changes?

@@ -851,4 +851,11 @@ def _check_vars() -> None:
)


class UsePlainPandasQueryCompiler(EnvironmentVariable, type=bool):
Collaborator

This name is probably a little confusing for users. I suggest something like SmallDataframeMode. This can be set to None by default, and users can set it to "pandas" or some other option in the future (we may have some other single node options coming).

Collaborator

@devin-petersohn, do you think VanillaPandasMode is a good option? Also, why do you think we should make this config of string type to have choices None/pandas/etc.? Wouldn't it be sufficient to have this config boolean - enable/disable?

Collaborator

In the future we may add polars mode. If this happens, we might also want to have an option for that. Making it a string keeps it open to other options. If we have pandas in the name, we can only use that mode for pandas execution. I'm open to other names, but I think we don't want to keep adding more and more configs if we have more options later.
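
To make the trade-off concrete, here is a sketch of a string-valued option with a fixed set of choices. The class layout, the environment variable name `EXAMPLE_SMALL_DATAFRAME_MODE`, and the `choices` tuple are all hypothetical; this does not use Modin's `EnvironmentVariable` base class.

```python
import os


class SmallDataframeMode:
    """Hypothetical string-valued option; unset means the feature is off."""

    varname = "EXAMPLE_SMALL_DATAFRAME_MODE"  # assumed name, for illustration
    choices = ("pandas",)  # new single-node modes would be appended here

    @classmethod
    def get(cls, environ=None):
        # Read from the given mapping, defaulting to the process environment.
        environ = os.environ if environ is None else environ
        raw = environ.get(cls.varname)
        if raw is None or raw == "":
            return None
        value = raw.strip().lower()
        if value not in cls.choices:
            raise ValueError(
                f"{cls.varname} must be one of {cls.choices}, got {raw!r}"
            )
        return value
```

A boolean option would need a second config to add another mode later, which is the concern raised in the comment above.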

Collaborator

Doesn't this sound like we may have multiple storage formats for a single execution? Do we really want to support this in future?

Collaborator

Potentially, yes I think this is something we could support in the future.

Collaborator

@devin-petersohn, do you think we could support automatic initialization with small qc depending on a data size threshold in future?
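
A size-based switch of the kind asked about here could look roughly like the following. The threshold value, the function name, and the backend labels are assumptions for illustration; nothing in this PR implements it.

```python
SMALL_DATA_THRESHOLD = 10_000  # assumed cutoff in cells, purely illustrative


def pick_backend(n_rows, n_cols, threshold=SMALL_DATA_THRESHOLD):
    # Estimate size as the cell count and route small frames to the
    # single-process path, everything else to the partitioned one.
    if n_rows * n_cols < threshold:
        return "native"
    return "distributed"
```

A real heuristic would probably need to account for dtypes and memory footprint as well as shape.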

Collaborator

I propose renaming UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler, by analogy with the HdkOnNative we had previously.

Collaborator

At a minimum, a more complete definition of this class in the docstring is required.

@arunjose696
Collaborator Author

Great start on solving this problem! Is it possible to avoid so many of the test changes?

Most of the test changes disable a few checks, either because they can't be supported without partitions or because the current changes don't yet support IO such as pd.read_csv(). Is there something specific that should be avoided?

@devin-petersohn
Collaborator

is there something specific that should be avoided?

Nothing specific, I was just trying to understand context. Thanks!

…rialized_dtypes to query compiler layer as in the code in multiple places the methods of private _modin_frame were used
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
modin/pandas/dataframe.py: fixed
modin/pandas/dataframe.py: fixed
@arunjose696 force-pushed the arun-sqc branch 2 times, most recently from 631dbf2 to 7b2b723 (May 23, 2024 20:52)
@@ -365,6 +365,37 @@ def copy(self):

# END Copy

def has_materialized_dtypes(self):
Collaborator

Suggested change
- def has_materialized_dtypes(self):
+ @property
+ def has_materialized_dtypes(self):

Collaborator

It's also a good idea to change query compiler API in a separate pull request.

"""
self._modin_frame.set_dtypes_cache(dtypes)

def has_dtypes_cache(self) -> bool:
Collaborator

Suggested change
- def has_dtypes_cache(self) -> bool:
+ @property
+ def has_dtypes_cache(self) -> bool:
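
The difference shows up at call sites: a property is read without parentheses, while a bound method is always truthy if the parentheses are forgotten. A small sketch with a hypothetical class (not the PR's query compiler):

```python
class FrameInfo:
    """Hypothetical holder for an optional dtypes cache."""

    def __init__(self, dtypes=None):
        self._dtypes = dtypes

    @property
    def has_dtypes_cache(self) -> bool:
        # Exposed as an attribute, so callers write `info.has_dtypes_cache`.
        return self._dtypes is not None

    def has_dtypes_cache_method(self) -> bool:
        # Same check as a plain method; callers must remember the parentheses.
        return self._dtypes is not None
```

With the method form, `if info.has_dtypes_cache_method:` silently takes the true branch even when no cache exists, which is one argument for the property.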

@@ -71,7 +71,7 @@ def corr_method(
np.repeat(pandas.api.types.pandas_dtype("float"), len(new_columns)),
index=new_columns,
)
- elif numeric_only and qc._modin_frame.has_materialized_dtypes:
+ elif numeric_only and qc.has_materialized_dtypes():
old_dtypes = qc._modin_frame.dtypes
Collaborator

Suggested change
- old_dtypes = qc._modin_frame.dtypes
+ old_dtypes = qc.dtypes

@@ -2650,7 +2678,7 @@ def fillna(df):
}
return df.fillna(value=func_dict, **kwargs)

- if self._modin_frame.has_materialized_dtypes:
+ if self.has_materialized_dtypes():
dtypes = self._modin_frame.dtypes
Collaborator

Suggested change
- dtypes = self._modin_frame.dtypes
+ dtypes = self.dtypes
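
These suggestions keep callers on the query compiler's public surface instead of its private `_modin_frame`. A sketch of that delegation, with hypothetical stand-in classes rather than Modin's real ones:

```python
class Frame:
    """Hypothetical stand-in for the internal frame object."""

    def __init__(self, dtypes):
        self.dtypes = dtypes
        self.has_materialized_dtypes = dtypes is not None


class QueryCompiler:
    """Hypothetical compiler that forwards dtype queries to its frame."""

    def __init__(self, frame):
        self._modin_frame = frame  # private: callers should not touch this

    @property
    def dtypes(self):
        return self._modin_frame.dtypes

    def has_materialized_dtypes(self):
        return self._modin_frame.has_materialized_dtypes


def numeric_columns(qc):
    # Uses only the compiler's public attributes, so a compiler without
    # a `_modin_frame` could implement the same interface.
    if not qc.has_materialized_dtypes():
        return None
    return [
        name for name, dtype in qc.dtypes.items() if dtype in ("int64", "float64")
    ]
```

A compiler that wraps a plain pandas object could then provide `dtypes` and `has_materialized_dtypes` directly, and code like `numeric_columns` would not need to change.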

Development

Successfully merging this pull request may close these issues.

Handle Empty/Small Data DataFrames as a separate case
4 participants