Hi @bgalvao, very good question! Many of the checks in the package don't actually need to run on all samples, so the first thing they do is sample the pandas DataFrame contained in the Dataset. I expect that to work with Dask, but of course it's better to test it firsthand. Other checks (mainly those that test data integrity or compare datasets) use the whole dataset and can fail if it is too large. It may indeed be possible to seamlessly switch from pandas to Dask (or Modin, for the option to use Ray or Dask with the same API).
Please let us know if that's something you're willing to tackle. If so, we can open an issue from this discussion.
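Just to illustrate the sampling step I mentioned: here's a rough sketch in plain pandas of how a check might downsample before running. Note that `sample_for_check` and `null_fraction_check` are hypothetical helpers for illustration, not the package's actual API.

```python
import pandas as pd

def sample_for_check(df: pd.DataFrame, n_samples: int = 10_000,
                     random_state: int = 42) -> pd.DataFrame:
    """Downsample a DataFrame before running a check (hypothetical helper)."""
    if len(df) <= n_samples:
        return df
    return df.sample(n=n_samples, random_state=random_state)

def null_fraction_check(df: pd.DataFrame) -> float:
    # Toy "integrity" check: fraction of missing values across the whole frame.
    return float(df.isna().sum().sum()) / df.size

# A frame too large to check in full; column "b" is entirely missing.
big = pd.DataFrame({"a": range(100_000), "b": [None] * 100_000})
small = sample_for_check(big, n_samples=1_000)
print(len(small))                   # 1000
print(null_fraction_check(small))   # 0.5
```

With Dask the same pattern could apply by sampling on the Dask side (e.g. `ddf.sample(frac=...)`) and calling `.compute()` to hand a small pandas frame to the check, so only the whole-data checks would need real changes.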
Greetings! This project is amazing, and I'm already experimenting with writing tests for a CI/CD pipeline.
I'm starting to wonder: what happens if the dataset is too large for pandas? Would it be seamless to switch to Dask or Polars?
I would love to work on a pull request if it turns out to be necessary 😁