Add support for dropping invalid rows for pyspark backend #1639

Draft · wants to merge 5 commits into main
Conversation

@nk4456542 commented on May 12, 2024

Solves issue #1540.

Tasks to be completed as per this comment:

  • Introduce a PANDERA_FULL_TABLE_VALIDATION configuration option. By default it should be None and should be set depending on the validation backend: True for the pandas check backend but False for the pyspark backend.
  • Modify all of the pyspark builtin checks to have two execution modes:
    • PANDERA_FULL_TABLE_VALIDATION=False preserves the current behavior.
    • PANDERA_FULL_TABLE_VALIDATION=True returns a boolean column indicating which elements in the column passed the check.
  • Make any additional changes to the pyspark backend needed to support a boolean column as the output of a check (we can take inspiration from the polars check backend on how to do this).
  • Add support for the drop_invalid_rows option (see the usage sketch after this list).
  • Add info logging at validation time to let the user know whether full table validation is happening.
  • Add documentation discussing the performance implications of turning on full table validation.
  • Add unit test cases to the testing pipeline covering the PANDERA_FULL_TABLE_VALIDATION config and the drop_invalid_rows option.
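
As a rough, user-facing sketch of how these options might fit together once implemented (the env var name comes from the task list above; the schema and data are purely illustrative, and pyspark support for drop_invalid_rows is exactly what this PR proposes):

```python
# Hypothetical usage under this proposal; this does not work on main yet.
import os

# Proposed config: opt in to element-wise (full table) validation.
# Set before pandera is imported so the config picks it up.
os.environ["PANDERA_FULL_TABLE_VALIDATION"] = "True"

import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


class Schema(pa.DataFrameModel):
    price: T.IntegerType() = pa.Field(gt=0)

    class Config:
        # Proposed in this PR: drop rows that fail element-wise checks.
        drop_invalid_rows = True


df = spark.createDataFrame([(5,), (-1,)], ["price"])
# With full table validation the backend knows *which* rows failed,
# so the invalid (-1) row could be dropped rather than just reported.
validated = Schema.validate(df)
```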

PS: I'm new to the repo 😄, so please call out if I am not following repo guidelines or code style. I appreciate your help!

- Add full table validation support for pyspark backend
  Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
- …alidation
  Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
```python
def equal_to(
    data: PysparkDataframeColumnObject,
    value: Any,
    should_validate_full_table: bool,
```
@cosmicBboy (Collaborator) commented:
instead of passing this in as an argument, you can use pandera.config.get_config_context to get the full_table_validation configuration value. This is so that the API for each check is consistent across the different backends.
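
For instance, a minimal sketch of that suggestion (not the final code: full_table_validation is the config field this PR proposes, and the aggregate branch mirrors the filter/limit/count pattern the existing pyspark builtin checks use):

```python
from typing import Any

import pyspark.sql.functions as F

from pandera.api.pyspark.types import PysparkDataframeColumnObject
from pandera.config import get_config_context


def equal_to(data: PysparkDataframeColumnObject, value: Any):
    """Sketch: read the execution mode from config, not a new argument."""
    if get_config_context().full_table_validation:  # proposed config field
        # Element-wise mode: a boolean column marking passing elements.
        return data.dataframe.select(
            (F.col(data.column_name) == value).alias("check_passed")
        )
    # Aggregate mode (current behavior): a single pass/fail boolean.
    return (
        data.dataframe.filter(F.col(data.column_name) != value)
        .limit(1)
        .count()
        == 0
    )
```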

@nk4456542 (Author) replied:
Thanks @cosmicBboy, I will make the recommended change.

Also, can you suggest a way to keep the PANDERA_FULL_TABLE_VALIDATION config value False when the backend is pyspark and True when the backend is pandas? I did not find a good way to do this, hence asking for a suggestion 😅.

@cosmicBboy (Collaborator) replied:
You can use the config_context context manager in the validate methods for each backend to control this behavior: https://github.com/unionai-oss/pandera/blob/main/pandera/config.py#L71

For example, this is used in the polars backend:

```python
with config_context(validation_depth=get_validation_depth(check_obj)):
```
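
Applied to this PR, the pyspark backend's validate method could pin the proposed default the same way. A sketch under those assumptions (`_run_checks` is a hypothetical helper, and the `full_table_validation` keyword is the config field this PR introduces):

```python
from pandera.config import config_context, get_config_context


def validate(self, check_obj, schema, **kwargs):
    config = get_config_context()
    # None means "unset by the user": default to False for pyspark
    # (the pandas backend would default to True instead).
    full_table = (
        config.full_table_validation
        if config.full_table_validation is not None
        else False
    )
    with config_context(full_table_validation=full_table):
        return self._run_checks(check_obj, schema, **kwargs)  # hypothetical
```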

@cosmicBboy (Collaborator) commented:

thanks @nk4456542, this is awesome!

Looks like some of the tests are broken:

```
=========================== short test summary info ============================
FAILED tests/core/test_pandas_config.py::TestPandasDataFrameConfig::test_disable_validation - AssertionError: assert {'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>, 'cache_dataframe': False, 'keep_cached_dataframe': False, 'full_table_validation': None} == {'cache_dataframe': False, 'keep_cached_dataframe': False, 'validation_enabled': False, 'validation_depth': <ValidationDepth.SCHEMA_AND_DATA: 'SCHEMA_AND_DATA'>}
```

see https://github.com/unionai-oss/pandera/actions/runs/9054025532/job/24909236981?pr=1639.

You can run these tests locally with `pytest tests/pyspark`.
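
Judging from the assertion diff, the failure is just that the serialized config now carries the new key. A plausible fix (a sketch; the real test body may differ) is to add the key to the expected dict in tests/core/test_pandas_config.py:

```python
from pandera.config import ValidationDepth

# Expected config dict, extended with the key added by this PR.
expected = {
    "validation_enabled": False,
    "validation_depth": ValidationDepth.SCHEMA_AND_DATA,
    "cache_dataframe": False,
    "keep_cached_dataframe": False,
    "full_table_validation": None,  # new key introduced by this PR
}
```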

- Remove unused decorators
  Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
- Will help to use the flag in backend validate functions
  Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
- More tests to come for full_table_validation config for built_in_checks after adding support in pyspark backend
  Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>