objects.Norm : Normalize among a group? #3663

Open
DNGros opened this issue Mar 25, 2024 · 4 comments

@DNGros

DNGros commented Mar 25, 2024

The problem

I am trying to achieve an effect where the values within a group are normalized. Consider:

# Make some sample data
import pandas as pd
import seaborn.objects as so
data = [
    # 2020
    {'year': "2020", 'category': 'happy'},
    {'year': "2020", 'category': 'happy'},
    {'year': "2020", 'category': 'sad'},
    # 2021
    {'year': "2021", 'category': 'happy'},
    {'year': "2021", 'category': 'happy'},
    {'year': "2021", 'category': 'happy'},
    {'year': "2021", 'category': 'sad'},
    # 2022
    {'year': "2022", 'category': 'happy'},
    {'year': "2022", 'category': 'happy'},
    {'year': "2022", 'category': 'mad'},
]
df = pd.DataFrame(data)

With the current interface we can create something like

    (
        so.Plot(df, x='year', color='category')
        .add(so.Line(), so.Count(), so.Norm())
    ).show()

image

However, I actually want something like

import matplotlib.pyplot as plt

# Fraction of each category within each year; missing pairs become 0
fracs = (
    df.groupby('year')
    .apply(lambda x: x['category'].value_counts(normalize=True))
    .unstack()
    .fillna(0)
)
fracs.plot()
plt.xlabel('Year')
plt.ylabel('Fraction of Category within Year')
plt.show()

image

Question

As far as I can tell there isn't a way to create this style of plot. Am I missing something in so.Norm that enables this? (I found the documentation of the so.Norm arguments somewhat confusing.) Or is there some other functionality or plot type already in seaborn that enables this kind of analysis without using so.Norm?

Otherwise, is there interest in adding something like a within_groups arg to so.Norm to enable this? (The name is only a suggestion; I'm not sure what makes the most sense.)

@mwaskom
Owner

mwaskom commented Mar 25, 2024

I think the main operation here is supported

(
    so.Plot(df, x='year', color='category')
    .add(so.Line(), so.Count(), so.Norm(func="sum", by=["x"]))
)

image

But that's a little different than your manual example because your observations are incomplete and Norm doesn't do the fillna bit. That feels a little bit out of scope for the Norm operation?
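For readers who think in pandas terms, here is a rough sketch of what the Count-then-Norm pipeline above appears to compute, assuming that `by=["x"]` with `func="sum"` divides each count by the total for its x value. The final pivot step is not something Norm does; it only shows how the `fillna` part of the manual example could be reproduced outside the plot. The variable names (`rows`, `counts`, `full`) are illustrative.

```python
import pandas as pd

# Same sample data as in the original post, written compactly.
rows = (
    [("2020", "happy")] * 2 + [("2020", "sad")]
    + [("2021", "happy")] * 3 + [("2021", "sad")]
    + [("2022", "happy")] * 2 + [("2022", "mad")]
)
df = pd.DataFrame(rows, columns=["year", "category"])

# Step 1: count rows per (year, category), roughly what so.Count() yields.
counts = df.groupby(["year", "category"]).size().reset_index(name="count")

# Step 2: divide each count by the total of its year, i.e. sum-normalize
# within each x value (the assumed effect of func="sum", by=["x"]).
year_totals = counts.groupby("year")["count"].transform("sum")
counts["fraction"] = counts["count"] / year_totals

# Optional step that Norm does not perform: add explicit zeros for
# year/category pairs that never occur, like fillna(0) in the manual example.
full = counts.pivot(index="year", columns="category", values="fraction").fillna(0.0)
print(full)
```

Without the last step, a category that is absent in a given year simply has no point there, which is why the seaborn line plot differs from the manual pandas plot.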

@DNGros
Author

DNGros commented Mar 25, 2024

Awesome! Thanks for your reply! Yes, that has the desired effect. I agree that the fillna is unnecessary and out of scope.

For others reading this, I'll note that in this case adding a marker makes sure data does not completely disappear when a line has only a single point.

(
    so.Plot(df, x='year', color='category')
    .add(so.Line(marker="o"), so.Count(), so.Norm(func="sum", by=["x"]))
).show()

image

Is there interest in adding a similar example to the objects.Norm.html docs? There currently aren't any examples that use the func or by args, and it might be good to find ways to explain them.

(
    so.Plot(df, x="Year", y="Spending_USD", color="Country")
    .add(so.Line(), so.Norm(func="sum", by=["x"]))
    .label(y="Fraction of Year's Spending")
).show()

image

(This render has the legend positioned differently than the other examples on the objects.Norm.html page. I'm not sure what specifically generates that page, or why it looks different. I found this notebook, but it doesn't seem to do anything like that.)

It's up to you whether it seems like a useful add or just makes the page too busy. Feel free to mark this issue as closed.

Thanks again for your response, and for your fantastic work on Seaborn!

@mwaskom
Owner

mwaskom commented Mar 25, 2024

> I found this notebook, but it doesn't seem to do anything like that.)

You have indeed found the source for the examples that you're looking at. The seaborn docs build system is pretty complicated though.

Demonstrating by and func in the docs makes sense. I think I was originally not totally sold on this API as it's indeed a little confusing. But it's probably been around for too long now.

By the way, there's no need to put .show() outside of the parens you're grouping the Plot methods with (and there's no reason to use it at all if you're in a notebook; in fact it will produce worse results, since you'll lose the default retina mode that Plot's Jupyter integration offers).

One other comment I would make is that you are sort of making a histogram here :)

(
    so.Plot(df, x='year', color='category')
    .add(so.Line(), so.Hist(stat="proportion", common_norm=["x"]))
)

image

@DNGros
Author

DNGros commented Mar 26, 2024

> You have indeed found the source for the examples that you're looking at. The seaborn docs build system is pretty complicated though.

Ok, thanks. I don't think I have a good enough grasp of how the text between the code blocks is created to make any PRs here, so I'll leave it as-is for now.

> By the way, there's no need to put .show() outside of the parens you're grouping the Plot methods with

Thank you for noting. I was making these plots in a script run in PyCharm rather than in a notebook, so I needed .show() to get them to actually render.

> One other comment I would make is that you are sort of making a histogram here :)

This is a good point. I looked at so.Hist previously, but I didn't make the connection of putting these args together. Thanks for explaining.

Ideas for improving so.Hist docs

Looking at it again, I'll note that the example and docs for common_norm are a bit confusing. My initial reading was that you should pass in dataframe columns like "island"/"sex"/etc., as you typically do for plot args. This is not the case, and until you realize what is going on, the "col" comes across as somewhat magic and disconnected from prior context. I have a few suggestions to consider.

Give an example within the param docstring

Currently, with proposed additions marked by +:

"""
@dataclass
class Hist(Stat):
    ...
    common_norm : bool or list of variables
        When not `False`, the normalization is applied across groups. Use
        `True` to normalize across all groups, or pass variable name(s) that
        define normalization groups.
+       For example, passing in ["x"] would normalize within each x value.
+       If you have a facet, passing in ["col"] would normalize within
+       the facet column.
"""
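To make the three settings concrete, here is a small pandas sketch of the normalization groups that common_norm appears to define, based on the docstring above. The counts are hypothetical toy numbers (not the real penguins data), and `proportion` is an illustrative helper, not part of seaborn. "col" and "color" stand for the plot variables, not dataframe columns from the original dataset.

```python
import pandas as pd

# Hypothetical toy counts: one row per combination of the facet
# variable ("col") and the hue variable ("color").
counts = pd.DataFrame({
    "col": ["Biscoe", "Biscoe", "Dream", "Dream"],
    "color": ["Female", "Male", "Female", "Male"],
    "count": [30, 10, 20, 40],
})

def proportion(table, norm_vars):
    """Divide each count by the total of its normalization group.

    Rows that share the same values of norm_vars are normalized together.
    An empty list means a single group that contains every row.
    """
    if not norm_vars:
        return table["count"] / table["count"].sum()
    totals = table.groupby(norm_vars)["count"].transform("sum")
    return table["count"] / totals

# Like common_norm=True: every value is a share of the whole dataset.
counts["all"] = proportion(counts, [])
# Like common_norm=["col"]: shares sum to 1 within each facet column.
counts["within_col"] = proportion(counts, ["col"])
# Like common_norm=False: each (col, color) group is normalized on its own.
counts["separate"] = proportion(counts, ["col", "color"])
print(counts)
```

With only one row per group, the "separate" column is trivially 1.0 everywhere; in a real histogram each group has several bins, and it is those bins that sum to 1.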

Improve each example by adding a label

More generally, it might be useful to add axis labels to each example on the objects.Hist page.

p = p.facet("island")
(
    p.add(so.Bars(), so.Hist(stat="proportion"))
    .label(y="Proportion of all Penguins")
)

...

(
    p.add(so.Bars(), so.Hist(stat="proportion", common_norm=False))
    .label(y="Proportion of all Penguins on the Island")
)

This makes it clear what is changing between the output plots.

Add an explanation of the use of "col"

In the text of the objects.Hist page:

# Current text before plot in the docs
- Or, with more than one grouping varible, specify a subset to normalize within:
# Proposal
+ It might be the case we have multiple grouping variables. 
+ For example, we already are grouping on "island", which appears in the "col" of the facet. 
+ If we also use the "color" variable to group by "sex", we can specify which variable to normalize on.
(
    p.add(so.Bars(), so.Hist(stat="proportion", common_norm=["col"]), color="sex")
    .label(y="Proportion of all Penguins on the Island")
)

Consider whether to add an example with common_norm=["x"] or common_norm=["y"]

Additionally, there is currently no example demonstrating how an axis variable can be passed to common_norm. For example:

(
    # `penguins` is assumed to be the penguins DataFrame used in the docs examples
    so.Plot(penguins, x="flipper_length_mm", color="sex")
    .add(
        so.Bars(),
        so.Hist(stat="proportion", common_norm=["x"]),
        so.Stack(),
    )
    .label(title="Sex Distribution by Flipper Length", y="Sex Distribution")
)

image

I think such a plot is interesting for certain research questions, and it demonstrates more of the API. However, without any ability to add error bars here, it might be misused to draw incorrect conclusions. So it is a mixed addition, and a different example might be better.

These are just ideas to consider for improving the docs.

Thanks, this definitely helped improve my understanding of the new objects interface. Appreciate it!


2 participants