Discrepancy in seaborn.objects.Dodge groupby order #3593

tiamilani · 2023-12-14T15:38:57Z

Hi, I would like to report a strange behavior in the Move object Dodge.

Seaborn version: 0.13.0
Matplotlib version: 3.8.2

It all started because I wanted to play around with the objects namespace.
I use the penguins dataset: I drop the NaN rows and then all the values that I do not consider outliers, i.e. everything between the 0.05 and 0.95 quantiles (computed for each combination of species and sex).
Here is the code to reproduce the dataset:

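import pandas as pd
import seaborn as sns
import seaborn.objects as so

def get_quantile_df(df, val_col, lower, upper) -> tuple[pd.DataFrame, float, float]:
    # Keep only the rows at or beyond the lower/upper quantiles of val_col
    lower_limit = df[val_col].quantile(lower)
    upper_limit = df[val_col].quantile(upper)
    sub_df = df[(df[val_col] <= lower_limit) | (df[val_col] >= upper_limit)]
    return sub_df, lower_limit, upper_limit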

def main():
    penguins = sns.load_dataset("penguins")
    penguins = penguins.dropna(how="any")

    category = "species"
    value = "body_mass_g"
    hue = "sex"

    lower_out_qt = 0.05
    upper_out_qt = 0.95

    outliers = None
    for c in penguins[category].unique():
        c_df = penguins[penguins[category] == c].copy()
        for h in c_df[hue].unique():
            hue_df = c_df[c_df[hue] == h].copy()
            sub_df, l, u = get_quantile_df(hue_df, value, lower_out_qt, upper_out_qt)

            outliers = sub_df.copy() if outliers is None else pd.concat([outliers, sub_df])
            print(f"{c}-{h}: [{l}-{u}] -> {len(sub_df)}")

The print in the nested for loop is just there to give an idea of how many points to expect.
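For reference, the same per-group filter can be sketched without the nested loops. This is only an illustration under assumptions: the helper name `quantile_outliers` and the small synthetic frame are hypothetical and not part of the report; only the column names mirror the penguins dataset.

```python
import pandas as pd


def quantile_outliers(df, group_cols, val_col, lower=0.05, upper=0.95):
    """Keep rows at or beyond the per-group lower/upper quantiles of val_col."""
    grouped = df.groupby(group_cols, observed=True)[val_col]
    # transform() broadcasts each group's quantile back onto that group's rows
    low = grouped.transform(lambda s: s.quantile(lower))
    high = grouped.transform(lambda s: s.quantile(upper))
    return df[(df[val_col] <= low) | (df[val_col] >= high)]


# Hypothetical stand-in for the penguins data (avoids a download)
demo = pd.DataFrame({
    "species": ["Adelie"] * 6 + ["Gentoo"] * 6,
    "sex": ["Male", "Female"] * 6,
    "body_mass_g": [3700, 3400, 3900, 3300, 4100, 3500,
                    5500, 4700, 5700, 4600, 5900, 4900],
})
outliers_demo = quantile_outliers(demo, ["species", "sex"], "body_mass_g")
```

Because the boolean mask keeps the original row order, which category shows up first in the filtered frame depends on which rows survive the filter.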

Then I try to plot these outliers with species on the x axis and body_mass_g on the y axis, as follows:

    (
        so.Plot(penguins, x="species", y="body_mass_g", color="sex")
        .add(so.Dot(marker="x"), so.Dodge(), data=outliers)
        .save("test_figure_outliers1.png")
    )

    (
        so.Plot(outliers, x="species", y="body_mass_g", color="sex")
        .add(so.Dot(marker="x"), so.Dodge())
        .save("test_figure_outliers2.png")
    )

As you can see, the only difference is that in the first plot I pass the outliers dataset in the add layer, while in the second plot I pass the outliers dataset directly at the Plot level.

I expected the two resulting plots to be identical, but I get two different results, and I think this is caused by the order of the groupby operation in the Dodge move.

The two plots are attached, in the same order as the code:

I found a way to make the two plots identical: in the code for the first plot, also pass the groupby argument, as follows:

    (
        so.Plot(penguins, x="species", y="body_mass_g", color="sex")
        .add(so.Dot(marker="x"), so.Dodge(), data=outliers, groupby="sex")
        .save("test_figure_outliers1.png")
    )

The cause of the problem is that the first row of the outliers dataset is a "female" penguin, while in the original dataset it is a "male" penguin.
I can see that, when nothing else is specified, the groupby operation is executed on the new dataset, which can produce a different order.
But then I don't see why, when I specify the groupby variable at the layer level, the groupby order is computed on the penguins dataset.

As a solution I would suggest the following:

1. If nothing except the new dataset is specified and the new dataset contains the same color column, the groupby order is inherited from the Plot level (if specified there, otherwise recomputed on the original dataset).
2. If the groupby variable is specified at the layer level and a new dataset is also provided, the groupby should be executed on the new dataset rather than the original one.

With the first option, the order of the components would be the same across all layers, even when a different sub-dataset is provided for each layer.
Otherwise, the user has the second option of redefining the groupby operation per layer.


thuiop · 2023-12-14T20:47:40Z

Duplicate of #3015 I believe?

mwaskom · 2023-12-15T12:51:48Z

Yeah the problem presents a little bit differently but I think it is the same underlying issue.

tiamilani · 2023-12-15T14:39:43Z

Agreed, it seems the root cause is the same; sorry, I didn't notice the issue you mentioned. By the way, if nobody else is already taking care of it and you think it could be a good first issue, I would like to give it a deeper look.

mwaskom · 2023-12-15T16:44:40Z

Actually sorry, I want to revise what I said here a bit.

First I didn't look at the linked issue closely enough — this did sound like a duplicate, but the more relevant issue is #3556 which is a bit different and more fundamental.

But on the other hand I'm not totally convinced that there's a well-defined "correct behavior" here since you actually are passing different datasets, and the default ordering rule is to use categories in the order that they are encountered in the data. Unless I am missing something, I think that the consistent seaborn behavior would be to assign different default orderings.
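To make the "order encountered in the data" rule concrete, here is a small pandas-only sketch. It does not call seaborn; it simply assumes, as described above, that the default order follows first appearance. The two frames are made up for illustration.

```python
import pandas as pd

full = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "body_mass_g": [3750, 3800, 3250, 4675],
})
# A subset whose first remaining row happens to be a female penguin
subset = full.iloc[1:]

# Series.unique() returns values in order of first appearance
full_order = list(full["sex"].unique())
subset_order = list(subset["sex"].unique())
print(full_order, subset_order)
```

The two lists contain the same labels in opposite order, which is the situation in the report: the layer-specific data starts with a different category than the Plot-level data.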

In any case, if not obvious, the existing way to force a specific ordering would be .scale(color=so.Nominal(order=...)). If trying that doesn't get you consistently-dodged plots, then there is a more serious issue.

tiamilani · 2023-12-18T08:58:56Z

It's true that I'm passing a different dataset, but the reason is that I want to produce a boxplot using seaborn objects and so far I haven't found a function or argument to plot only the outliers from the original dataset.

I also agree that the current behavior is consistent: a new dataset is passed, so the groupby operation should be repeated and the categories should be placed in the order of appearance. But may I ask if there is already a plan to provide dataset operations inside seaborn objects? Or a BoxPlot object?

The use cases that I can imagine at the moment are the following ones:

- Use the combination of Range and Dot to produce a boxplot
- Plot a line with PolyFit and, underneath, points randomly sampled from the original dataset instead of all the points (in the case of big datasets)

By the way, thanks for the reference to .scale(color=so.Nominal(order=...)); I didn't know about it.

mwaskom · 2023-12-19T12:05:21Z

Yes I've thought about adding an Outliers stat and having a Sample stat would make sense too! I also would like to support making boxplots without having to manually combine a bunch of different marks and stats, although that requires some compromises on the library design (essentially a BoxPlot mark is going to push a lot of statistical operations into the "mark" concept and I want to make sure that isn't done capriciously).
