Discrepancy in seaborn.objects.Dodge groupby order #3593

Open
tiamilani opened this issue Dec 14, 2023 · 6 comments

@tiamilani

Hi, I would like to report strange behavior in the Dodge move.

Seaborn version: 0.13.0
Matplotlib version: 3.8.2

Everything started because I wanted to play around with the objects namespace.
As a dataset I use the penguins dataset. I drop the NaN rows and all the values that I do not consider outliers, i.e. everything between the 0.05 and 0.95 quantiles (for each combination of species and sex).
Here is the code to replicate the dataset:

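import pandas as pd
import seaborn as sns
import seaborn.objects as so

def get_quantile_df(df, val_col, lower, upper) -> tuple[pd.DataFrame, float, float]:
    # Keep only the rows in the two tails, at or beyond the given quantiles
    lower_limit = df[val_col].quantile(lower)
    upper_limit = df[val_col].quantile(upper)
    sub_df = df[(df[val_col] <= lower_limit) | (df[val_col] >= upper_limit)]
    return sub_df, lower_limit, upper_limit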

def main():
    penguins = sns.load_dataset("penguins")
    penguins = penguins.dropna(how="any")

    category = "species"
    value="body_mass_g"
    hue="sex"

    lower_out_qt = 0.05
    upper_out_qt = 0.95

    outliers = None
    for c in penguins[category].unique():
        c_df = penguins[penguins[category] == c].copy()
        for h in c_df[hue].unique():
            hue_df = c_df[c_df[hue] == h].copy()
            sub_df, l, u = get_quantile_df(hue_df, value, lower_out_qt, upper_out_qt)

            outliers = sub_df.copy() if outliers is None else pd.concat([outliers, sub_df])
            print(f"{c}-{h}: [{l}-{u}] -> {len(sub_df)}")

The print in the nested for loop is just there to give an idea of how many points to expect.
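For reference, the same per-group tail selection can be sketched without the nested loops. This is only an illustration under assumptions: the helper name `tail_rows` and the synthetic frame are hypothetical and not part of the original report. Unlike the loop with `pd.concat`, this version keeps the rows in their original order.

```python
import pandas as pd

def tail_rows(df, value, groups, lower=0.05, upper=0.95):
    """Keep rows at or beyond the per-group lower/upper quantiles of `value`."""
    grouped = df.groupby(groups)[value]
    # transform broadcasts each group's quantile back onto that group's rows
    lo = grouped.transform(lambda s: s.quantile(lower))
    hi = grouped.transform(lambda s: s.quantile(upper))
    return df[(df[value] <= lo) | (df[value] >= hi)]

# Small synthetic stand-in for the penguins data (hypothetical values)
demo = pd.DataFrame({
    "species": ["A"] * 5 + ["B"] * 5,
    "sex": ["M"] * 10,
    "mass": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})
demo_tails = tail_rows(demo, "mass", ["species", "sex"])
```

With the real data, the call would presumably be `tail_rows(penguins, "body_mass_g", ["species", "sex"])`.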

Then I try to plot these outliers with species on the x axis and body_mass_g on the y axis, as follows:

    (
        so.Plot(penguins, x="species", y="body_mass_g", color="sex")
        .add(so.Dot(marker="x"), so.Dodge(), data=outliers)
        .save("test_figure_outliers1.png")
    )

    (
        so.Plot(outliers, x="species", y="body_mass_g", color="sex")
        .add(so.Dot(marker="x"), so.Dodge())
        .save("test_figure_outliers2.png")
    )

As you can see the only difference is that in the first plot I pass the outliers dataset in the add layer, while in the second plot I use directly the outliers dataset at plot level.

I expected the two resulting plots to be identical, but I get two different results, and I think this is caused by the order of the groupby operation in the Dodge move.

Attached are the two plots, in the same order as the code:
test_figure_rename1
test_figure_outliers2

I found a way to make the two plots identical: in the code for the first plot, also pass the groupby argument, as follows:

    (
        so.Plot(penguins, x="species", y="body_mass_g", color="sex")
        .add(so.Dot(marker="x"), so.Dodge(), data=outliers, groupby="sex")
        .save("test_figure_outliers1.png")
    )

The cause of the problem is that the first row of the outliers is a "female" penguin, while in the original dataset it is a "male" penguin.
I can see that, when nothing else is specified, the groupby operation is executed on the new dataset, which can produce a different order.
But then I don't see why, when I specify the groupby variable at that level, the groupby order is computed on the penguins dataset.

As a solution I would suggest the following:

  1. If nothing except the new dataset is specified and the new dataset contains the same color column, the groupby order is inherited from the Plot level (if specified, otherwise recomputed on the original dataset).
  2. If the groupby argument is specified at that level and a new dataset is also provided, the groupby should be executed on the new dataset rather than the original one.

This way, with the first option, the order of the components would be the same across all layers, even when a different sub-dataset is provided at each level.
Otherwise, the user has the second option of redefining the groupby operation per level.

@thuiop
Contributor

thuiop commented Dec 14, 2023

Duplicate of #3015 I believe?

@mwaskom
Owner

mwaskom commented Dec 15, 2023

Yeah the problem presents a little bit differently but I think it is the same underlying issue.

@tiamilani
Author

tiamilani commented Dec 15, 2023 via email

@mwaskom
Owner

mwaskom commented Dec 15, 2023

Actually sorry, I want to revise what I said here a bit.

First I didn't look at the linked issue closely enough — this did sound like a duplicate, but the more relevant issue is #3556 which is a bit different and more fundamental.

But on the other hand I'm not totally convinced that there's a well-defined "correct behavior" here since you actually are passing different datasets, and the default ordering rule is to use categories in the order that they are encountered in the data. Unless I am missing something, I think that the consistent seaborn behavior would be to assign different default orderings.
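To illustrate the ordering rule described above, here is a simplified pure-Python sketch. It is not seaborn's actual implementation; the function names and the toy data are hypothetical. It only shows how an "order of first appearance" rule assigns different dodge slots when the subset starts with a different category than the full data.

```python
def order_of_appearance(values):
    """Return the unique values in the order they are first encountered."""
    return list(dict.fromkeys(values))

def dodge_slots(values):
    """Map each category to a slot index (0 = leftmost) by first appearance."""
    return {level: i for i, level in enumerate(order_of_appearance(values))}

# Toy stand-ins: the full data starts with "Male", the subset with "Female"
full_sex = ["Male", "Female", "Female", "Male"]
subset_sex = ["Female", "Male", "Female"]

full_slots = dodge_slots(full_sex)      # "Male" gets slot 0
subset_slots = dodge_slots(subset_sex)  # "Female" gets slot 0
```

Under this rule the two layers place the same category on opposite sides, which matches the symptom in the report.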

In any case, if not obvious, the existing way to force a specific ordering would be .scale(color=so.Nominal(order=...)). If trying that doesn't get you consistently-dodged plots, then there is a more serious issue.

@tiamilani
Author

It's true that I'm passing a different dataset, but the reason is that I want to produce a boxplot using seaborn objects and (up to now) I didn't find a function/argument to plot only the outliers from the original dataset.

I also agree that the current behavior is consistent: when a new dataset is passed, the groupby operation should be repeated and the categories should be placed in order of appearance. But may I ask whether there is already a plan to provide dataset operations inside seaborn objects, or a BoxPlot object?

The use cases that I can imagine at the moment are the following ones:

  • Use the combination of Range and Dot to produce a boxplot
  • Plot a line with PolyFit and, underneath, points randomly sampled from the original dataset instead of all the points (in case of big datasets)

By the way, thanks for the reference to .scale(color=so.Nominal(order=...)), I didn't know about it.

@mwaskom
Owner

mwaskom commented Dec 19, 2023

Yes I've thought about adding an Outliers stat and having a Sample stat would make sense too! I also would like to support making boxplots without having to manually combine a bunch of different marks and stats, although that requires some compromises on the library design (essentially a BoxPlot mark is going to push a lot of statistical operations into the "mark" concept and I want to make sure that isn't done capriciously).
