
Feature Request #1536

Open
liyaskerj opened this issue Feb 7, 2024 · 1 comment

Comments

@liyaskerj

Missing functionality

As a frequent user of ydata_profiling, I am encountering the issue below.

In the given dataset, empty cells in numeric columns must be excluded while calculating 'sum'. When the empty cells are included, the sum comes out as 'NaN'. On our side, if we replace empty cells with '0' it distorts the 'min' value, and if we replace them with some other value it changes the data type of the corresponding column.
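The trade-off described above can be shown with a minimal pure-Python sketch (the values are hypothetical, standing in for a numeric column whose empty cells are read as NaN):

```python
import math

values = [4.0, float("nan"), 2.5]  # hypothetical column; empty cell read as NaN

# A plain sum propagates NaN
naive_sum = sum(values)
assert math.isnan(naive_sum)

# Filling empties with 0 fixes the sum but distorts the min (true min is 2.5)
zero_filled = [0.0 if math.isnan(v) else v for v in values]
assert sum(zero_filled) == 6.5
assert min(zero_filled) == 0.0

# Excluding NaN entirely gives both the desired sum and the correct min
clean = [v for v in values if not math.isnan(v)]
assert sum(clean) == 6.5
assert min(clean) == 2.5
```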

Proposed feature

Empty or null cells in numeric columns should be excluded while calculating 'sum'; when they are included, the sum comes out as 'NaN'.

Alternatives considered

The logic below in describe_numeric_spark.py is the place where 'sum' is calculated. Please correct me if I am wrong.

```python
@describe_numeric_1d.register
def describe_numeric_1d_spark(
    config: Settings, df: DataFrame, summary: dict
) -> Tuple[Settings, DataFrame, dict]:
    """Describe a numeric series.

    Args:
        config: report Settings object
        df: the Spark DataFrame (single numeric column) to describe
        summary: the dict containing the series description so far

    Returns:
        The config, the DataFrame, and the updated summary dict.
    """
    stats = numeric_stats_spark(df, summary)
    summary["min"] = stats["min"]
    summary["max"] = stats["max"]
    summary["mean"] = stats["mean"]
    summary["std"] = stats["std"]
    summary["variance"] = stats["variance"]
    summary["skewness"] = stats["skewness"]
    summary["kurtosis"] = stats["kurtosis"]
    summary["sum"] = stats["sum"]
```
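For comparison, the pandas backend gets the requested behavior for free: pandas' `Series.sum` skips NaN by default (`skipna=True`), which is presumably what the Spark path should mirror. A minimal illustration with hypothetical values:

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 2.5])  # hypothetical column with one empty cell

assert s.sum() == 6.5                  # skipna=True is the default; NaN excluded
assert np.isnan(s.sum(skipna=False))   # opting out reproduces the reported NaN
assert s.min() == 2.5                  # min also skips NaN, so no 0-fill distortion
```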

Additional context

We are building a wheel file from our code, installing it on a Databricks cluster, and trying to do exploratory data analysis of the given source dataset in CSV file format.

@liyaskerj
Author

@azory-ydata @gliptak @akx @mattf Please let me know whether it is possible to implement this feature from your end. Thank you.
