FEAT-#4605: Adding small query compiler #7259

arunjose696 · 2024-05-13T18:53:19Z

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves #4605 ("Handle Empty/Small Data DataFrames as a separate case")
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

modin/experimental/core/storage_formats/pandas/small_query_compiler.py

+            need_columns_reindex=need_columns_reindex,
+        )
+    else:
+        broadcasted_items = broadcasted_items


modin/experimental/core/storage_formats/pandas/small_query_compiler.py

+    # if len(df.columns) == 1 and df.columns[0] == "__reduced__":
+    #     df = df["__reduced__"]


modin/pandas/dataframe.py

modin/pandas/io.py

modin/pandas/series.py

modin/experimental/core/storage_formats/pandas/small_query_compiler.py

modin/pandas/base.py

YarShev · 2024-05-16T14:17:03Z

modin/experimental/core/storage_formats/pandas/small_query_compiler.py

+        if hasattr(pandas_frame, "_to_pandas"):
+            pandas_frame = pandas_frame._to_pandas()
+        if is_scalar(pandas_frame):
+            pandas_frame = pandas.DataFrame([pandas_frame])
+        elif not isinstance(pandas_frame, pandas.DataFrame):
+            pandas_frame = pandas.DataFrame(pandas_frame)


Why so many conditions? Don't we always pass a pandas DataFrame into this constructor?
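For context, the branches in the snippet above can be read as an input-normalization step. Below is a minimal sketch of that reading, not the PR's actual constructor: `to_pandas_frame` and `HasToPandas` are hypothetical names introduced only for illustration, and the branch order simply mirrors the diff.

```python
import pandas
from pandas.api.types import is_scalar


class HasToPandas:
    """Hypothetical stand-in for any object that exposes ``_to_pandas()``."""

    def __init__(self, frame):
        self._frame = frame

    def _to_pandas(self):
        return self._frame


def to_pandas_frame(obj):
    """Coerce ``obj`` to a pandas.DataFrame, mirroring the branches under review."""
    # Objects that expose _to_pandas() are unwrapped first.
    if hasattr(obj, "_to_pandas"):
        obj = obj._to_pandas()
    # A scalar becomes a 1x1 frame.
    if is_scalar(obj):
        return pandas.DataFrame([obj])
    # Anything else that is not already a DataFrame (Series, dict, list) is wrapped.
    if not isinstance(obj, pandas.DataFrame):
        return pandas.DataFrame(obj)
    # A DataFrame passes through unchanged.
    return obj
```

If the constructor only ever receives a pandas DataFrame, every branch except the final pass-through is dead code, which is what the question above is getting at.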

modin/pandas/base.py

+from modin.experimental.core.storage_formats.pandas.small_query_compiler import (
+    PlainPandasQueryCompiler,
+)


modin/pandas/dataframe.py

modin/pandas/io.py

+from modin.experimental.core.storage_formats.pandas.small_query_compiler import (
+    PlainPandasQueryCompiler,
+)


modin/pandas/series.py

modin/experimental/core/storage_formats/pandas/small_query_compiler.py

+from pandas.core.dtypes.common import is_list_like, is_scalar
+
+from modin.config.envvars import UsePlainPandasQueryCompiler
+from modin.core.storage_formats.base.query_compiler import BaseQueryCompiler


docs/conf.py

modin/config/envvars.py

YarShev · 2024-05-16T17:10:15Z

modin/core/dataframe/algebra/default2pandas/binary.py

@@ -47,7 +47,6 @@ def bin_ops_wrapper(df, other, *args, **kwargs):
                "squeeze_other", False
            )
            squeeze_self = kwargs.pop("squeeze_self", False)
-


Should be reverted.

YarShev · 2024-05-16T17:11:30Z

modin/pandas/base.py

+        if isinstance(self._query_compiler, PlainPandasQueryCompiler):
+            return self._query_compiler.to_pandas().iloc[indexer]


Why is this change needed? Isn't the line below sufficient?

modin/pandas/dataframe.py

YarShev · 2024-05-16T17:20:49Z

modin/pandas/io.py

+    if UsePlainPandasQueryCompiler.get():
+        return ModinObjects.DataFrame(query_compiler=PlainPandasQueryCompiler(df))


Maybe we should move this to BaseFactory._to_pandas?

Maybe you mean from_pandas?

modin/pandas/dataframe.py

modin/pandas/series.py

setup.cfg


modin/pandas/utils.py

devin-petersohn

Great start on solving this problem! Is it possible to avoid so many of the test changes?

devin-petersohn · 2024-05-22T15:34:27Z

modin/config/envvars.py

@@ -851,4 +851,11 @@ def _check_vars() -> None:
        )


+class UsePlainPandasQueryCompiler(EnvironmentVariable, type=bool):


This name is probably a little confusing for users. I suggest something like SmallDataframeMode. This can be set to None by default, and users can set it to "pandas" or some other option in the future (we may have some other single node options coming).

@devin-petersohn, do you think VanillaPandasMode is a good option? Also, why do you think we should make this config of string type to have choices None/pandas/etc.? Wouldn't it be sufficient to have this config boolean - enable/disable?

In the future we may add polars mode. If this happens, we might also want to have an option for that. Making it a string keeps it open to other options. If we have pandas in the name, we can only use that mode for pandas execution. I'm open to other names, but I think we don't want to keep adding more and more configs if we have more options later.

Doesn't this sound like we may have multiple storage formats for a single execution? Do we really want to support this in future?

Potentially, yes I think this is something we could support in the future.

@devin-petersohn, do you think we could support automatic initialization with small qc depending on a data size threshold in future?

I propose renaming UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler, by rough analogy with the HdkOnNative we had previously.

At a minimum, a more complete definition of this class in the docstring is required.
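To make the boolean-versus-string trade-off in this thread concrete, here is a minimal standalone sketch of a string-valued setting with a fixed set of choices. It deliberately does not use Modin's `EnvironmentVariable` machinery; the variable name `EXAMPLE_SMALL_DATAFRAME_MODE`, the `CHOICES` tuple, and the helper `get_small_dataframe_mode` are all hypothetical.

```python
from typing import Mapping, Optional

# Hypothetical environment variable name, used only for illustration.
VARNAME = "EXAMPLE_SMALL_DATAFRAME_MODE"
# Allowed values; a string setting lets this grow later (e.g. "polars")
# without introducing another config.
CHOICES = ("pandas",)


def get_small_dataframe_mode(environ: Mapping[str, str]) -> Optional[str]:
    """Return the selected mode, or None when the setting is unset or empty."""
    raw = environ.get(VARNAME)
    if raw is None or raw.strip() == "":
        return None
    value = raw.strip().lower()
    if value not in CHOICES:
        raise ValueError(f"{VARNAME} must be one of {CHOICES}, got {raw!r}")
    return value
```

Called as `get_small_dataframe_mode(os.environ)`, an unset variable keeps the default behaviour, while a boolean flag would need a second config as soon as a non-pandas option appears.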

arunjose696 · 2024-05-22T16:31:33Z

Great start on solving this problem! Is it possible to avoid so many of the test changes?

Most of the test changes disable a few checks that can't be supported without partitions, and the current changes don't yet support IO such as pd.read_csv(). Is there something specific that should be avoided?

devin-petersohn · 2024-05-22T16:45:06Z

is there something specific that should be avoided?

Nothing specific, I was just trying to understand context. Thanks!


modin/pandas/dataframe.py

modin/experimental/core/storage_formats/pandas/small_query_compiler.py

anmyachev · 2024-05-27T12:55:40Z

modin/core/storage_formats/pandas/query_compiler.py

@@ -365,6 +365,37 @@ def copy(self):

    # END Copy

+    def has_materialized_dtypes(self):


Suggested change

-    def has_materialized_dtypes(self):
+    @property
+    def has_materialized_dtypes(self):

It's also a good idea to change query compiler API in a separate pull request.
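One practical consequence of switching between a method and a property is that every call site has to change with it. The toy classes below (`MethodStyle` and `PropertyStyle`, both hypothetical and unrelated to Modin's real query compiler) sketch the two pitfalls.

```python
class MethodStyle:
    """Toy class exposing the check as a regular method."""

    def __init__(self, dtypes=None):
        self._dtypes = dtypes

    def has_materialized_dtypes(self):
        return self._dtypes is not None


class PropertyStyle:
    """Toy class exposing the same check as a property."""

    def __init__(self, dtypes=None):
        self._dtypes = dtypes

    @property
    def has_materialized_dtypes(self):
        return self._dtypes is not None


# Method style: forgetting the parentheses tests the bound method object,
# which is always truthy, so the check silently passes.
method_obj = MethodStyle()
silently_truthy = bool(method_obj.has_materialized_dtypes)

# Property style: keeping the parentheses calls the returned bool,
# which raises TypeError.
prop_obj = PropertyStyle()
try:
    prop_obj.has_materialized_dtypes()
    call_failed = False
except TypeError:
    call_failed = True
```

So call sites written as `qc.has_materialized_dtypes()` would need the parentheses dropped if the suggestion is applied.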

anmyachev · 2024-05-27T12:55:54Z

modin/core/storage_formats/pandas/query_compiler.py

+        """
+        self._modin_frame.set_dtypes_cache(dtypes)
+
+    def has_dtypes_cache(self) -> bool:


Suggested change

-    def has_dtypes_cache(self) -> bool:
+    @property
+    def has_dtypes_cache(self) -> bool:

anmyachev · 2024-05-27T13:01:08Z

modin/core/storage_formats/pandas/aggregations.py

@@ -71,7 +71,7 @@ def corr_method(
                    np.repeat(pandas.api.types.pandas_dtype("float"), len(new_columns)),
                    index=new_columns,
                )
-            elif numeric_only and qc._modin_frame.has_materialized_dtypes:
+            elif numeric_only and qc.has_materialized_dtypes():
                old_dtypes = qc._modin_frame.dtypes


Suggested change

-                old_dtypes = qc._modin_frame.dtypes
+                old_dtypes = qc.dtypes

anmyachev · 2024-05-27T13:01:20Z

modin/core/storage_formats/pandas/query_compiler.py

@@ -2650,7 +2678,7 @@ def fillna(df):
                    }
                    return df.fillna(value=func_dict, **kwargs)

-                if self._modin_frame.has_materialized_dtypes:
+                if self.has_materialized_dtypes():
                    dtypes = self._modin_frame.dtypes


Suggested change

-                    dtypes = self._modin_frame.dtypes
+                    dtypes = self.dtypes
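The reason for preferring the public `dtypes` accessor over `_modin_frame.dtypes` can be sketched with two toy compilers. Everything here is hypothetical (`FrameBackedCompiler`, `PlainCompiler`, `numeric_columns`); the only assumption taken from the PR is that a pandas-backed compiler may have no `_modin_frame` at all.

```python
import pandas
from pandas.api.types import is_numeric_dtype


class _ToyFrame:
    """Toy stand-in for an internal frame object that carries dtypes."""

    def __init__(self, df):
        self.dtypes = df.dtypes


class FrameBackedCompiler:
    """Toy compiler that stores its data on a private ``_modin_frame``."""

    def __init__(self, df):
        self._modin_frame = _ToyFrame(df)

    @property
    def dtypes(self):
        return self._modin_frame.dtypes


class PlainCompiler:
    """Toy compiler wrapping a pandas frame directly; no ``_modin_frame``."""

    def __init__(self, df):
        self._df = df

    @property
    def dtypes(self):
        return self._df.dtypes


def numeric_columns(qc):
    """Works for either compiler because it only touches the public accessor."""
    return [col for col, dtype in qc.dtypes.items() if is_numeric_dtype(dtype)]


example = pandas.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [1.5, 2.5]})
```

Code written against `qc._modin_frame.dtypes` would raise `AttributeError` for the second compiler, whereas `numeric_columns` works for both.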

anmyachev · 2024-05-27T13:03:59Z

modin/config/envvars.py

@@ -851,4 +851,11 @@ def _check_vars() -> None:
        )


+class UsePlainPandasQueryCompiler(EnvironmentVariable, type=bool):


At a minimum, a more complete definition of this class in the docstring is required.

arunjose696 requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev, dchigarev and a team as code owners May 13, 2024 18:53

github-advanced-security bot found potential problems May 13, 2024

arunjose696 changed the title from "Adding small query compiler" to "FEAT-#4605: Adding small query compiler" May 13, 2024

arunjose696 force-pushed the arun-sqc branch 3 times, most recently from f80e353 to 41bab97 Compare May 16, 2024 11:45

YarShev force-pushed the arun-sqc branch from 41bab97 to b6dc27c Compare May 16, 2024 12:33

github-advanced-security bot found potential problems May 16, 2024

modin/pandas/base.py: fixed

YarShev reviewed May 16, 2024

YarShev force-pushed the arun-sqc branch from 8c6544e to 165360f Compare May 16, 2024 15:17

github-advanced-security bot found potential problems May 16, 2024

YarShev reviewed May 16, 2024

arunjose696 and others added 6 commits May 22, 2024 08:00

FEAT-modin-project#4605: Add small query compiler

fdd0b1a

fixing tests

6d23e9a

removing additional parameter from try_cast_to_pandas

0b55cfb

Signed-off-by: arunjose696 <arunjose696@gmail.com>

test_iter passing

9b58c4e

fixing isin unique and clip

f707c0d

Signed-off-by: arunjose696 <arunjose696@gmail.com>

Enable test_default.py and test_join_sort.py

3aa554c

Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>

arunjose696 force-pushed the arun-sqc branch from b9f1dc3 to df6b6dc Compare May 22, 2024 13:11

github-advanced-security bot found potential problems May 22, 2024

modin/pandas/utils.py: fixed

arunjose696 force-pushed the arun-sqc branch from df6b6dc to 1cd75e2 Compare May 22, 2024 13:15

devin-petersohn reviewed May 22, 2024

arunjose696 marked this pull request as draft May 22, 2024 19:49

arunjose696 force-pushed the arun-sqc branch from 1cd75e2 to e6b035f Compare May 23, 2024 08:18

fixed test_map_metadata by adding set_frame_dtypes_cache and has_materialized_dtypes to query compiler layer as in the code in multiple places the methods of private _modin_frame were used

d406414

arunjose696 force-pushed the arun-sqc branch from e6b035f to d406414 Compare May 23, 2024 11:08

Fix test_dot

8e65ecb

Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>

github-advanced-security bot found potential problems May 23, 2024

modin/pandas/dataframe.py: fixed

modin/pandas/dataframe.py: fixed

test_udf passing

04d6626

arunjose696 force-pushed the arun-sqc branch 2 times, most recently from 631dbf2 to 7b2b723 Compare May 23, 2024 20:52

github-advanced-security bot found potential problems May 23, 2024

modin/experimental/core/storage_formats/pandas/small_query_compiler.py: fixed

All tests except one passing in modin/tests/pandas/dataframe

ff58de3

arunjose696 force-pushed the arun-sqc branch from 7b2b723 to ff58de3 Compare May 27, 2024 09:29

anmyachev reviewed May 27, 2024
