FEAT-#4605: Adding small query compiler #7259
base: main
    need_columns_reindex=need_columns_reindex,
)
else:
    broadcasted_items = broadcasted_items
Check failure
Code scanning / CodeQL
Redundant assignment Error
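A minimal sketch of why CodeQL flags this pattern, using a hypothetical helper (`prepare_items` is not a Modin function): when the `else` branch only assigns a name to itself, the branch can be dropped without changing behavior.

```python
def prepare_items(items, transform=None):
    """Hypothetical helper: apply `transform` to each item when one is given."""
    if transform is not None:
        items = [transform(item) for item in items]
    # No `else: items = items` branch is needed: the name already holds
    # the untouched value when the condition is false.
    return items


doubled = prepare_items([1, 2, 3], transform=lambda x: x * 2)
unchanged = prepare_items([1, 2, 3])
```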
# if len(df.columns) == 1 and df.columns[0] == "__reduced__":
#     df = df["__reduced__"]
Check notice
Code scanning / CodeQL
Commented-out code Note
Force-pushed from f80e353 to 41bab97
if hasattr(pandas_frame, "_to_pandas"):
    pandas_frame = pandas_frame._to_pandas()
if is_scalar(pandas_frame):
    pandas_frame = pandas.DataFrame([pandas_frame])
elif not isinstance(pandas_frame, pandas.DataFrame):
    pandas_frame = pandas.DataFrame(pandas_frame)
Why so many conditions? Don't we always pass a pandas DataFrame into this constructor?
from modin.experimental.core.storage_formats.pandas.small_query_compiler import (
    PlainPandasQueryCompiler,
)
Check notice
Code scanning / CodeQL
Cyclic import Note
modin.experimental.core.storage_formats.pandas.small_query_compiler
from pandas.core.dtypes.common import is_list_like, is_scalar

from modin.config.envvars import UsePlainPandasQueryCompiler
from modin.core.storage_formats.base.query_compiler import BaseQueryCompiler
Check notice
Code scanning / CodeQL
Cyclic import Note
modin.core.storage_formats.base.query_compiler
@@ -47,7 +47,6 @@ def bin_ops_wrapper(df, other, *args, **kwargs):
    "squeeze_other", False
)
squeeze_self = kwargs.pop("squeeze_self", False)

Should be reverted.
if isinstance(self._query_compiler, PlainPandasQueryCompiler):
    return self._query_compiler.to_pandas().iloc[indexer]
Why is this change needed? Isn't the line below sufficient?
if UsePlainPandasQueryCompiler.get():
    return ModinObjects.DataFrame(query_compiler=PlainPandasQueryCompiler(df))
Maybe we should move this to BaseFactory._to_pandas?
Maybe you mean from_pandas?
Signed-off-by: arunjose696 <arunjose696@gmail.com>
Signed-off-by: arunjose696 <arunjose696@gmail.com>
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
Great start on solving this problem! Is it possible to avoid so many of the test changes?
@@ -851,4 +851,11 @@ def _check_vars() -> None:
)


class UsePlainPandasQueryCompiler(EnvironmentVariable, type=bool):
This name is probably a little confusing for users. I suggest something like SmallDataframeMode. This can be set to None by default, and users can set it to "pandas" or some other option in the future (we may have some other single-node options coming).
@devin-petersohn, do you think VanillaPandasMode is a good option? Also, why do you think this config should be of string type, with choices None/pandas/etc.? Wouldn't a boolean config (enable/disable) be sufficient?
In the future we may add a polars mode. If this happens, we might also want to have an option for that. Making it a string keeps it open to other options. If we have pandas in the name, we can only use that mode for pandas execution. I'm open to other names, but I don't think we want to keep adding more and more configs if we have more options later.
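To make the string-versus-boolean trade-off concrete, here is a minimal sketch of a string-typed setting that is disabled by default and validated against a list of choices. Everything in it is an assumption for illustration: the variable name `EXAMPLE_SMALL_DATAFRAME_MODE`, the helper `get_small_dataframe_mode`, and the choice list are hypothetical and do not use Modin's `EnvironmentVariable` machinery.

```python
import os
from typing import Optional

# Hypothetical names throughout: this is not Modin's config API.
_ENV_NAME = "EXAMPLE_SMALL_DATAFRAME_MODE"
_CHOICES = ("pandas",)  # further single-node backends could be appended later


def get_small_dataframe_mode(environ=os.environ) -> Optional[str]:
    """Return the selected mode, or None when the feature is disabled."""
    raw = environ.get(_ENV_NAME)
    # Unset or empty means "disabled", which mirrors a default of None.
    if raw is None or raw.strip() == "":
        return None
    value = raw.strip().lower()
    if value not in _CHOICES:
        raise ValueError(f"{_ENV_NAME} must be one of {_CHOICES}, got {raw!r}")
    return value
```

With this shape, supporting another backend would mean extending `_CHOICES` rather than introducing a second boolean flag.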
Doesn't this sound like we may have multiple storage formats for a single execution? Do we really want to support this in future?
Potentially, yes. I think this is something we could support in the future.
@devin-petersohn, do you think we could support automatic initialization with small qc depending on a data size threshold in future?
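As a rough illustration of what size-based selection could look like, the sketch below picks an execution kind from a row count. The threshold value, the function name `pick_execution_kind`, and the labels "native" and "partitioned" are all assumptions made for this example; nothing here reflects an existing or planned Modin interface.

```python
# Hypothetical sketch; the threshold and all names here are assumptions.
DEFAULT_THRESHOLD_ROWS = 100_000


def pick_execution_kind(n_rows: int, threshold: int = DEFAULT_THRESHOLD_ROWS) -> str:
    """Return "native" for small inputs and "partitioned" otherwise."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    return "native" if n_rows < threshold else "partitioned"
```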
I propose renaming UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler, by analogy with the HdkOnNative we had previously.
At a minimum, a more complete definition of this class in the docstring is required.
Most of the test changes disable a few checks, either because they cannot be supported without partitions or because the current changes do not yet support IO such as pd.read_csv(). Is there something specific that should be avoided?
Nothing specific, I was just trying to understand context. Thanks!
…rialized_dtypes to query compiler layer as in the code in multiple places the methods of private _modin_frame were used
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
Force-pushed from 631dbf2 to 7b2b723
@@ -365,6 +365,37 @@ def copy(self):

# END Copy

def has_materialized_dtypes(self):
- def has_materialized_dtypes(self):
+ @property
+ def has_materialized_dtypes(self):
It's also a good idea to change query compiler API in a separate pull request.
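One practical reason the method-versus-property choice matters: the frame-level attribute in the diffs is read without parentheses, so a plain method with the same name is easy to misuse. The sketch below uses two hypothetical stand-in classes (`WithMethod` and `WithProperty`, not Modin classes) to show that a bound method is always truthy when the call is forgotten.

```python
class WithMethod:
    """Stand-in class exposing the check as a regular method."""

    def __init__(self, dtypes=None):
        self._dtypes = dtypes

    def has_materialized_dtypes(self):
        return self._dtypes is not None


class WithProperty:
    """Stand-in class exposing the same check as a property."""

    def __init__(self, dtypes=None):
        self._dtypes = dtypes

    @property
    def has_materialized_dtypes(self):
        return self._dtypes is not None


method_style = WithMethod()      # no dtypes available
property_style = WithProperty()  # no dtypes available

# A bound method object is always truthy, so omitting the parentheses
# silently takes the "dtypes are materialized" branch.
forgot_call = bool(method_style.has_materialized_dtypes)
called = method_style.has_materialized_dtypes()
as_property = property_style.has_materialized_dtypes
```

Here `forgot_call` evaluates to `True` even though no dtypes exist, while `called` and `as_property` both evaluate to `False`.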
"""
self._modin_frame.set_dtypes_cache(dtypes)

def has_dtypes_cache(self) -> bool:
- def has_dtypes_cache(self) -> bool:
+ @property
+ def has_dtypes_cache(self) -> bool:
@@ -71,7 +71,7 @@ def corr_method(
      np.repeat(pandas.api.types.pandas_dtype("float"), len(new_columns)),
      index=new_columns,
  )
- elif numeric_only and qc._modin_frame.has_materialized_dtypes:
+ elif numeric_only and qc.has_materialized_dtypes():
      old_dtypes = qc._modin_frame.dtypes
- old_dtypes = qc._modin_frame.dtypes
+ old_dtypes = qc.dtypes
@@ -2650,7 +2678,7 @@ def fillna(df):
      }
      return df.fillna(value=func_dict, **kwargs)

- if self._modin_frame.has_materialized_dtypes:
+ if self.has_materialized_dtypes():
      dtypes = self._modin_frame.dtypes
- dtypes = self._modin_frame.dtypes
+ dtypes = self.dtypes
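The two suggestions above share one idea: callers should go through accessors on the query compiler rather than reach into the private `_modin_frame`. A minimal sketch of that delegation follows, using hypothetical stand-ins (`FakeFrame`, `FakeQueryCompiler`, `numeric_columns`) and plain strings in place of real dtypes; it is an illustration of the pattern, not the PR's implementation.

```python
class FakeFrame:
    """Stand-in for the private frame object (hypothetical)."""

    def __init__(self, dtypes=None):
        self._dtypes = dtypes

    @property
    def has_materialized_dtypes(self):
        return self._dtypes is not None

    @property
    def dtypes(self):
        return self._dtypes


class FakeQueryCompiler:
    """Stand-in query compiler that forwards dtype queries to its frame."""

    def __init__(self, frame):
        self._modin_frame = frame

    @property
    def dtypes(self):
        return self._modin_frame.dtypes

    def has_materialized_dtypes(self):
        return self._modin_frame.has_materialized_dtypes


def numeric_columns(qc):
    """Caller code touches only the query compiler's own accessors."""
    if not qc.has_materialized_dtypes():
        return None
    return [name for name, dtype in qc.dtypes.items() if dtype in ("int64", "float64")]
```

A query compiler without a partitioned frame could then supply its own `dtypes` and `has_materialized_dtypes` without any caller changes.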
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst is up-to-date