Add support for dropping invalid rows for pyspark backend #1639

Draft · wants to merge 5 commits into main
Conversation

@nk4456542 commented on May 12, 2024

Solves issue #1540.

Tasks to be completed as per this comment:

  • Introduce a PANDERA_FULL_TABLE_VALIDATION configuration option. By default it should be None and should be set depending on the validation backend: True for the pandas check backend but False for the pyspark backend.
  • Modify all of the pyspark builtin checks to have two execution modes:
    • PANDERA_FULL_TABLE_VALIDATION=False preserves the current behavior.
    • PANDERA_FULL_TABLE_VALIDATION=True returns a boolean column indicating which elements in the column passed the check.
  • Make any additional changes to the pyspark backend needed to support a boolean column as the output of a check (we can take inspiration from the polars check backend on how to do this).
  • Add support for the drop_invalid_rows option (see the usage sketch after this list).
  • Add info logging at validation time to let the user know whether full table validation is happening.
  • Add documentation discussing the performance implications of turning on full table validation.
  • Add unit test cases to the testing pipeline covering the PANDERA_FULL_TABLE_VALIDATION config and the drop_invalid_rows option.
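
As a rough, user-facing sketch of how these options might fit together once implemented (the env var name comes from the task list above; the schema and data are purely illustrative, and pyspark support for drop_invalid_rows is exactly what this PR proposes):

```python
# Hypothetical usage under this proposal; this does not work on main yet.
import os

# Proposed config: opt in to element-wise (full table) validation.
# Set before pandera is imported so the config picks it up.
os.environ["PANDERA_FULL_TABLE_VALIDATION"] = "True"

import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


class Schema(pa.DataFrameModel):
    price: T.IntegerType() = pa.Field(gt=0)

    class Config:
        # Proposed in this PR: drop rows that fail element-wise checks.
        drop_invalid_rows = True


df = spark.createDataFrame([(5,), (-1,)], ["price"])
# With full table validation the backend knows *which* rows failed,
# so the invalid (-1) row could be dropped rather than just reported.
validated = Schema.validate(df)
```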

PS: I'm new to the repo 😄, so please call out if I am not following repo guidelines or code style. I appreciate your help!

- Add full table validation support for pyspark backend
  Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
- …alidation
  Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
```python
def equal_to(
    data: PysparkDataframeColumnObject,
    value: Any,
    should_validate_full_table: bool,
```
@cosmicBboy (Collaborator) commented:
instead of passing this in as an argument, you can use pandera.config.get_config_context to get the full_table_validation configuration value. This is so that the API for each check is consistent across the different backends.
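
For instance, a minimal sketch of that suggestion (not the final code: full_table_validation is the config field this PR proposes, and the aggregate branch mirrors the filter/limit/count pattern the existing pyspark builtin checks use):

```python
from typing import Any

import pyspark.sql.functions as F

from pandera.api.pyspark.types import PysparkDataframeColumnObject
from pandera.config import get_config_context


def equal_to(data: PysparkDataframeColumnObject, value: Any):
    """Sketch: read the execution mode from config, not a new argument."""
    if get_config_context().full_table_validation:  # proposed config field
        # Element-wise mode: a boolean column marking passing elements.
        return data.dataframe.select(
            (F.col(data.column_name) == value).alias("check_passed")
        )
    # Aggregate mode (current behavior): a single pass/fail boolean.
    return (
        data.dataframe.filter(F.col(data.column_name) != value)
        .limit(1)
        .count()
        == 0
    )
```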

@nk4456542 (Author) replied:
Thanks @cosmicBboy, I will make the recommended change.

Also, can you suggest a way to keep the PANDERA_FULL_TABLE_VALIDATION config value False when the backend is pyspark and True when the backend is pandas? I did not find a good way to do this, hence asking for a suggestion 😅.

@cosmicBboy (Collaborator) replied:
You can use the config_context context manager in the validate methods for each backend to control this behavior: https://github.com/unionai-oss/pandera/blob/main/pandera/config.py#L71

For example, this is used in the polars backend:

```python
with config_context(validation_depth=get_validation_depth(check_obj)):
```
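
Applied to this PR, the pyspark backend's validate method could pin the proposed default the same way. A sketch under those assumptions (`_run_checks` is a hypothetical helper, and the `full_table_validation` keyword is the config field this PR introduces):

```python
from pandera.config import config_context, get_config_context


def validate(self, check_obj, schema, **kwargs):
    config = get_config_context()
    # None means "unset by the user": default to False for pyspark
    # (the pandas backend would default to True instead).
    full_table = (
        config.full_table_validation
        if config.full_table_validation is not None
        else False
    )
    with config_context(full_table_validation=full_table):
        return self._run_checks(check_obj, schema, **kwargs)  # hypothetical
```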

@cosmicBboy (Collaborator) commented:

thanks @nk4456542, this is awesome!

Looks like some of the tests are broken:

```
=========================== short test summary info ============================
FAILED tests/core/test_pandas_config.py::TestPandasDataFrameConfig::test_disable_validation - AssertionError: assert {'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>, 'cache_dataframe': False, 'keep_cached_dataframe': False, 'full_table_validation': None} == {'cache_dataframe': False, 'keep_cached_dataframe': False, 'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>}
```

see https://github.com/unionai-oss/pandera/actions/runs/9054025532/job/24909236981?pr=1639.

You can run these tests locally with `pytest tests/pyspark`.
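
Judging from the assertion diff, the failure is just that the serialized config now carries the new key. A plausible fix (a sketch; the real test body may differ) is to add the key to the expected dict in tests/core/test_pandas_config.py:

```python
from pandera.config import ValidationDepth

# Expected config dict, extended with the key added by this PR.
expected = {
    "validation_enabled": False,
    "validation_depth": ValidationDepth.SCHEMA_AND_DATA,
    "cache_dataframe": False,
    "keep_cached_dataframe": False,
    "full_table_validation": None,  # new key introduced by this PR
}
```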

- Remove unused decorators
  Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
- Will help to use the flag in backend validate functions
  Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
- More tests to come for full_table_validation config for built_in_checks after adding support in pyspark backend
  Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>