Mean and Standard Deviation is Different on Point Plot Log Scale #3661

Open
gil2rok opened this issue Mar 22, 2024 · 8 comments

@gil2rok
Contributor

gil2rok commented Mar 22, 2024

Problem Description: I have multiple measurements of some cost. Most values are quite small, but there are some enormous outliers: my mean is $1.2$, my standard deviation is just under $10$, and my median is $0.003$.

When I plot the mean and std with a point-plot, my error bar correctly ranges from approximately $(-10, 10)$ with a mean of $1$.

image

But when I use the log scale, the standard deviation and mean shift!

The mean is located near $10^{-2} = 0.01$ instead of $1$. The standard deviation error bars range from $(10^{-3}, 10^{-1}) = (0.001, 0.1)$ instead of $(-10, 10)$.

image

Question: Why is this? Are these statistics computed differently in log space? A big red flag is that the standard deviation error bars are symmetric in log space. Another red flag is that the error bars no longer go past zero when they absolutely should.

Code: Here is the code I used to generate these two plots. The only difference between the two calls is the log_scale parameter, which is False in the first and True in the second.

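import numpy as np
import seaborn as sns

# tmp is my DataFrame with "sampler_type" and "cost" columns (not shown here).

# Linear scale: mean and sd are computed on the raw cost values.
fig = sns.catplot(
    data=tmp,
    kind="point",
    x="sampler_type",
    y="cost",
    estimator=np.mean,
    errorbar="sd",
    aspect=1.5,
    log_scale=False,
)

# Log scale: identical call except for log_scale=True.
fig = sns.catplot(
    data=tmp,
    kind="point",
    x="sampler_type",
    y="cost",
    estimator=np.mean,
    errorbar="sd",
    aspect=1.5,
    log_scale=True,
)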
gil2rok changed the title from "Mean and Standard Deviation is Different on Log Scale" to "Mean and Standard Deviation is Different on Point Plot Log Scale" on Mar 22, 2024
@mwaskom
Owner

mwaskom commented Mar 22, 2024

Hi, yes the statistics are computed in log space when you have log_scale=True.

@gil2rok
Contributor Author

gil2rok commented Mar 22, 2024

Thank you so much for the fast response. And I love the seaborn library!

How precisely does this change the computation? Can you please point me to the file where this is done?

I'm struggling to understand mathematically what is different when computing mean and std in log space.

In particular, I am not sure why the mean would change. I am actually measuring the squared cost so all my data lies on $[0, \infty)$. I have no negative values that would mess up the log computation, as far as I can tell.

@mwaskom
Owner

mwaskom commented Mar 22, 2024

Probably the best way to think about it is that you should get the same result as if you passed seaborn the log of your data and then modified the tick labels. Your error bars are symmetric around the mean because they are being drawn from mean(y) - sd(y) to mean(y) + sd(y).
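
To make the difference concrete, here is a minimal sketch using only the Python standard library. The data values are made up for illustration, and the choice of base-10 logs and the sample standard deviation are assumptions made for this sketch; this is not seaborn's actual code path. It shows that averaging in log space and mapping back gives the geometric mean of the data, and that the back-transformed interval is always positive and symmetric only in the multiplicative sense.

```python
import math
import statistics

# Hypothetical data: several small costs plus one large outlier.
x = [0.001, 0.002, 0.003, 0.004, 0.005, 100.0]

# Statistics on the raw values (the linear-scale case).
mean_x = statistics.fmean(x)
sd_x = statistics.stdev(x)

# Statistics on the log-transformed values y = log10(x).
y = [math.log10(v) for v in x]
mean_y = statistics.fmean(y)
sd_y = statistics.stdev(y)

# Map the log-space summary back to data units.
center = 10 ** mean_y           # equals the geometric mean of x
lower = 10 ** (mean_y - sd_y)   # always > 0
upper = 10 ** (mean_y + sd_y)

print(f"raw:       mean={mean_x:.4g}, interval=({mean_x - sd_x:.4g}, {mean_x + sd_x:.4g})")
print(f"log space: center={center:.4g}, interval=({lower:.4g}, {upper:.4g})")
```

The log-space center sits far below the arithmetic mean because the outlier contributes only its logarithm to the average, which is why the point moves when the statistics are computed on log(x).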

@gil2rok
Contributor Author

gil2rok commented Mar 22, 2024

Some want to first compute summary statistics and then transform them to the log scale.

Others want to first transform data to the log scale and then compute summary statistics. Seaborn appears to do the latter.

> Probably the best way to think about it is that you should get the same result as if you passed seaborn the log of your data and then modified the tick labels. Your error bars are symmetric around the mean because they are being drawn from mean(y) - sd(y) to mean(y) + sd(y).

In your example, y = log(x), where x is the original data: we first transform x to the log scale and then compute the mean and std of y.

If one were interested in the former, should they plot without the log scale parameter and afterwards manually set the axis to be logarithmic?

Potentially relevant stack exchange post here.

@gil2rok
Contributor Author

gil2rok commented Mar 22, 2024

Lastly, it may be helpful for this to appear somewhere in the docs. It was quite tricky for me to understand and I may not be the only one.

Perhaps on the tutorial page for statistical estimation and error bars here? I would consider making a pull request if you're interested. Need to confirm I have time for it though.

@mwaskom
Owner

mwaskom commented Mar 22, 2024

> If one were interested in the former, should they plot without the log scale parameter and afterwards manually set the axis to be logarithmic?

Yes
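
A minimal sketch of that approach, assuming a synthetic stand-in for the `tmp` DataFrame from the original post (the column values here are randomly generated, not the real data). The statistics are computed on the raw values because `log_scale` is not passed, and only the axis is made logarithmic afterwards.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless

import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic placeholder for the poster's data (assumption, for illustration only).
rng = np.random.default_rng(0)
tmp = pd.DataFrame({
    "sampler_type": np.repeat(["a", "b"], 200),
    "cost": rng.lognormal(mean=-5.0, sigma=2.0, size=400),
})

# No log_scale argument: mean and sd are computed on the raw cost values.
g = sns.catplot(
    data=tmp,
    kind="point",
    x="sampler_type",
    y="cost",
    estimator=np.mean,
    errorbar="sd",
    aspect=1.5,
)

# Change only how the axis is displayed; the plotted statistics are unchanged.
g.set(yscale="log")
```

One caveat: wherever mean - sd is zero or negative, the lower end of the error bar has no position on a logarithmic axis, so it cannot be shown in full.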

@gil2rok
Contributor Author

gil2rok commented Mar 22, 2024

> Lastly, it may be helpful for this to appear somewhere in the docs. It was quite tricky for me to understand and I may not be the only one.
>
> Perhaps on the tutorial page for statistical estimation and error bars here? I would consider making a pull request if you're interested. Need to confirm I have time for it though.

@mwaskom Just wanted to bump this in case you didn't see. If you're not interested, no worries!

@mwaskom
Owner

mwaskom commented Mar 22, 2024

I could have sworn the docs already said that somewhere, though maybe just in the seaborn.objects documentation. This is a very general thing in seaborn: statistics are computed in the transformed space, so it also applies to e.g. boxplots, KDEs, histograms, etc.
