objects.Norm : Normalize among a group? #3663

DNGros · 2024-03-25T21:14:28Z

The problem

I am trying to achieve an effect where the values within a group are normalized. Consider

# Make some sample data
import pandas as pd
import seaborn.objects as so
data = [
    # 2020
    {'year': "2020", 'category': 'happy'},
    {'year': "2020", 'category': 'happy'},
    {'year': "2020", 'category': 'sad'},
    # 2021
    {'year': "2021", 'category': 'happy'},
    {'year': "2021", 'category': 'happy'},
    {'year': "2021", 'category': 'happy'},
    {'year': "2021", 'category': 'sad'},
    # 2022
    {'year': "2022", 'category': 'happy'},
    {'year': "2022", 'category': 'happy'},
    {'year': "2022", 'category': 'mad'},
]
df = pd.DataFrame(data)

With the current interface we can create something like

    (
        so.Plot(df, x='year', color='category')
        .add(so.Line(), so.Count(), so.Norm())
    ).show()

However, I actually want something like

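import matplotlib.pyplot as plt

# Fraction of each category within each year; absent pairs are filled with 0
fracs = (
    df.groupby('year')
    .apply(lambda x: x['category'].value_counts(normalize=True))
    .unstack()
    .fillna(0)
)
fracs.plot()
plt.xlabel('Year')
plt.ylabel('Fraction of Category within Year')
plt.show()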

Question

As far as I can tell there isn't a way to create this style of plot. Am I missing something in so.Norm that enables this (I found the documentation of so.Norm arguments somewhat confusing)? There may also be some other kind of functionality/plot already in Seaborn to enable this kind of analysis without using so.Norm?

Otherwise, is there interest in adding something like a within_groups arg (or some other name; I'm not sure what makes the most sense) to so.Norm to enable this?


mwaskom · 2024-03-25T21:18:56Z

I think the main operation here is supported

(
    so.Plot(df, x='year', color='category')
    .add(so.Line(), so.Count(), so.Norm(func="sum", by=["x"]))
)

But that's a little different than your manual example because your observations are incomplete and Norm doesn't do the fillna bit. That feels a little bit out of scope for the Norm operation?
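As a rough illustration of what the Count and Norm(func="sum", by=["x"]) steps work out to on this data, the same numbers can be computed by hand in pandas. This is only a sketch of the arithmetic, not seaborn's implementation; the names counts and fraction are made up for the example.

```python
import pandas as pd

# Same sample data as in the original post, written compactly.
df = pd.DataFrame({
    "year": ["2020"] * 3 + ["2021"] * 4 + ["2022"] * 3,
    "category": [
        "happy", "happy", "sad",
        "happy", "happy", "happy", "sad",
        "happy", "happy", "mad",
    ],
})

# Step 1: count observations, one row per observed (year, category) pair.
counts = (
    df.groupby(["year", "category"])
    .size()
    .rename("count")
    .reset_index()
)

# Step 2: divide each count by the total within its year. This is the
# arithmetic that func="sum" with by=["x"] is being used for here.
year_totals = counts.groupby("year")["count"].transform("sum")
counts["fraction"] = counts["count"] / year_totals

print(counts)
```

Only six rows come out, because pairs that never occur (such as "mad" in 2020) are not created. That is the difference from the manual example with fillna(0).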

DNGros · 2024-03-25T22:11:27Z

Awesome! Thanks for your reply! Yes, that does the desired effect. I agree that the fillna is unnecessary and out of scope.

For others reading this, I'll note that adding a marker ensures that a category does not disappear entirely when it has only a single point and therefore no line segment to draw.

(
    so.Plot(df, x='year', color='category')
    .add(so.Line(marker="o"), so.Count(), so.Norm(func="sum", by=["x"]))
).show()
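If a complete line per category is preferred over markers, one option outside of Norm is to fill in the missing pairs before plotting. The following is a minimal sketch using pandas.crosstab; the names wide and fracs are made up for the example.

```python
import pandas as pd

# Same sample data as in the original post, written compactly.
df = pd.DataFrame({
    "year": ["2020"] * 3 + ["2021"] * 4 + ["2022"] * 3,
    "category": [
        "happy", "happy", "sad",
        "happy", "happy", "happy", "sad",
        "happy", "happy", "mad",
    ],
})

# Rows are years, columns are categories. normalize="index" makes each row
# sum to 1, and pairs that never occur come out as 0.0 instead of missing.
wide = pd.crosstab(df["year"], df["category"], normalize="index")

# Back to long form: one row per (year, category) pair, zeros included.
fracs = wide.reset_index().melt(id_vars="year", value_name="fraction")

print(fracs)
```

The long-form table already holds the y values, so it could be passed to so.Plot(fracs, x="year", y="fraction", color="category") with a plain so.Line() and no stat or move.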

Is there interest in adding a similar example to the objects.Norm.html docs? There currently aren't any examples that use the func or the by args, and it might be good to find ways to explain these args.

(
    so.Plot(df, x="Year", y="Spending_USD", color="Country")
    .add(so.Line(), so.Norm(func="sum", by=["x"]))
    .label(y="Fraction of Year's Spending")
).show()

(this render has the legend positioned differently than the other examples on the objects.Norm.html page. I'm not sure what specifically generates that page, and why it looks different. I found this notebook, but it doesn't seem to do anything like that.)

It's up to you whether it seems like a useful add or just makes the page too busy. Feel free to mark this issue as closed.

Thanks again for your response, and for your fantastic work on Seaborn!

mwaskom · 2024-03-25T22:23:44Z

> I found this notebook, but it doesn't seem to do anything like that.

You have indeed found the source for the examples that you're looking at. The seaborn docs build system is pretty complicated though.

Demonstrating by and func in the docs makes sense. I think I was originally not totally sold on this API as it's indeed a little confusing. But it's probably been around for too long now.

By the way, there's no need to put .show() outside of the parens you're grouping the Plot methods with (and there's no reason to use it at all if you're in a notebook; in fact it will produce worse results, since you'll lose the default retina mode that Plot's Jupyter integration offers).

One other comment I would make is that you are sort of making a histogram here :)

(
    so.Plot(df, x='year', color='category')
    .add(so.Line(), so.Hist(stat="proportion", common_norm=["x"]))
)

DNGros · 2024-03-26T00:17:56Z

> You have indeed found the source for the examples that you're looking at. The seaborn docs build system is pretty complicated though.

Ok, thanks. I don't think I have a good enough grasp of how the text between the code is created to make any PRs here. I'll leave it as-is for now.

> By the way, there's no need to put .show() outside of the parens you're grouping the Plot methods with

Thank you for noting. I was doing these plots in a script run in PyCharm rather than a notebook, so I needed .show() to get it to actually render.

> One other comment I would make is that you are sort of making a histogram here :)

This is a good point. I looked at the so.Hist previously, but I didn't make the connection putting these args together. Thanks for explaining.

Ideas for improving so.Hist docs

Looking at it again, I'll note that the example and docs for common_norm are a bit confusing. My initial reading was that you should pass in DataFrame columns like "island"/"sex"/etc., as you typically do for plot args. This is not the case, and until you realize what is going on, the "col" comes across as somewhat magic and disconnected from prior context. I have a few suggestions to consider.

Give an example within the param docstring

Currently (proposed additions marked with +)

"""
@dataclass
class Hist(Stat):
    ...
    common_norm : bool or list of variables
        When not `False`, the normalization is applied across groups. Use
        `True` to normalize across all groups, or pass variable name(s) that
        define normalization groups.
+       For example, passing in ["x"] would normalize within each x value.
+       If you have a facet, passing in ["col"] would normalize within
+       the facet column.
"""

Improve each example by adding a label

Generally, it might be useful to add labels to each example on the objects.Hist page.

p = p.facet("island")
(
    p.add(so.Bars(), so.Hist(stat="proportion"))
    .label(y="Proportion of all Penguins")
)

...

(
    p.add(so.Bars(), so.Hist(stat="proportion", common_norm=False))
    .label(y="Proportion of all Penguins on the Island")
)

This makes it clear what is changing between the output plots.

Add an explanation of the use of "col"

In the text of objects.Hist

# Current text before plot in the docs
- Or, with more than one grouping varible, specify a subset to normalize within:
# Proposal
+ It might be the case that we have multiple grouping variables.
+ For example, we are already grouping on "island", which appears in the "col" of the facet.
+ If we also use the "color" variable to group by "sex", we can specify which variable to normalize on.

(
    p.add(so.Bars(), so.Hist(stat="proportion", common_norm=["col"]), color="sex")
    .label(y="Proportion of all Penguins on the Island")
)
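To make the three common_norm modes above concrete (the default True, False, and a list such as ["col"]), here is a minimal sketch of the arithmetic on a table of per-bin counts. The counts are hypothetical, and the helper proportion is made up for illustration; it is not part of seaborn. The point it is meant to show is that the names passed to common_norm are plot variables ("x", "col", "color"), not DataFrame columns such as "island" or "sex".

```python
import pandas as pd

# Hypothetical per-bin counts, keyed by plot variables rather than data columns:
# "col" is the facet, "color" is the hue group, "x" is the bin center.
counts = pd.DataFrame({
    "col":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "color": ["f", "f", "m", "m", "f", "f", "m", "m"],
    "x":     [180, 200, 180, 200, 180, 200, 180, 200],
    "count": [1, 3, 2, 2, 4, 4, 6, 2],
})


def proportion(counts, common_norm):
    """Illustrative helper: turn counts into proportions for one common_norm mode."""
    grouping_vars = ["col", "color"]
    if common_norm is True:
        # One denominator for everything: proportions sum to 1 over the whole plot.
        denom = counts["count"].sum()
    else:
        # False: every (col, color) group sums to 1 on its own.
        # A list: each distinct combination of the named variables sums to 1.
        keys = grouping_vars if common_norm is False else list(common_norm)
        denom = counts.groupby(keys)["count"].transform("sum")
    return counts["count"] / denom


counts["all"] = proportion(counts, True)
counts["per_group"] = proportion(counts, False)
counts["per_col"] = proportion(counts, ["col"])
counts["per_x"] = proportion(counts, ["x"])

print(counts)
```

With ["col"], the proportions within each facet sum to 1 across both color groups, which matches the "Proportion of all Penguins on the Island" label. With ["x"], each bin sums to 1 across facets and colors.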

Consider whether to add an example with common_norm=["x"] or common_norm=["y"]

Additionally, there is currently no example demonstrating how an axis variable can go in common_norm. For example:

(
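    # Assumes penguins is the DataFrame from seaborn.load_dataset("penguins"), not the Plot object p
    so.Plot(penguins, x="flipper_length_mm", color="sex")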
    .add(
        so.Bars(),
        so.Hist(stat="proportion", common_norm=["x"]),
        so.Stack(),
    )
    .label(title="Sex Distribution by Flipper Length", y="Sex Distribution")
)

I think such a plot is interesting for certain research questions, and it demonstrates more of the API. However, without any ability to add error bars here, it might be misused to draw incorrect conclusions. So it would be a mixed addition, and a different example might be better.

These are just ideas to consider for improving the docs.

Thanks, this definitely helped improve my understanding of the new objects interface. Appreciate it!
