[DOCUMENTATION] Boxplot example wrongly computes whiskers #13779

Open
its-DomeE opened this issue Mar 25, 2024 · 2 comments

@its-DomeE

Software versions

Python version : 3.8.17 (default, Aug 10 2023, 12:50:17)
IPython version : 8.12.3
Tornado version : 6.4
Bokeh version : 3.1.1

Browser name and version

No response

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

The boxplot example in the documentation (examples/topics/stats/boxplot.py) should compute the whiskers by:

  • finding the largest data value between the 75% quantile and 75% quantile + 1.5 * IQR, and
  • finding the smallest data value between 25% quantile - 1.5 * IQR and the 25% quantile
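
The rule in the list above can be sketched with the standard library alone. This is only an illustration: the function name `tukey_whiskers` and the sample data are hypothetical, and `method="inclusive"` is chosen because it interpolates linearly, like the pandas default used in the example.

```python
from statistics import quantiles

def tukey_whiskers(values, k=1.5):
    """Return (lower, upper) whisker ends for a sequence of numbers."""
    data = sorted(values)
    # "inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    # If no observation lies between a quartile and its fence, the whisker
    # collapses onto the box edge.
    lower = min(q1, min(v for v in data if v >= lower_fence))
    upper = max(q3, max(v for v in data if v <= upper_fence))
    return lower, upper

# Hypothetical sample with one obvious outlier (30)
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(tukey_whiskers(sample))  # (1, 9); the fences themselves are -3.5 and 14.5
```

Here the whiskers stop at 1 and 9, the last observations inside the fences, while 30 would be drawn as an outlier.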

Observed behavior

The example computes the whiskers simply as:

  • 75% quantile + 1.5 * IQR and
  • 25% quantile - 1.5 * IQR

As a result, the whiskers are fully determined by the box and provide no additional information about the data.

Example code

import pandas as pd

from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg2 import autompg2
from bokeh.transform import factor_cmap

df = autompg2[["class", "hwy"]].rename(columns={"class": "kind"})

kinds = df.kind.unique()

# compute quantiles
qs = df.groupby("kind").hwy.quantile([0.25, 0.5, 0.75])
qs = qs.unstack().reset_index()
qs.columns = ["kind", "q1", "q2", "q3"]
df = pd.merge(df, qs, on="kind", how="left")

# compute IQR outlier bounds
iqr = df.q3 - df.q1
df["upper"] = df.q3 + 1.5*iqr
df["lower"] = df.q1 - 1.5*iqr

source = ColumnDataSource(df)

p = figure(x_range=kinds, tools="", toolbar_location=None,
           title="Highway MPG distribution by vehicle class",
           background_fill_color="#eaefef", y_axis_label="MPG")

# outlier range
whisker = Whisker(base="kind", upper="upper", lower="lower", source=source)
whisker.upper_head.size = whisker.lower_head.size = 20
p.add_layout(whisker)

# quantile boxes
cmap = factor_cmap("kind", "TolRainbow7", kinds)
p.vbar("kind", 0.7, "q2", "q3", source=source, color=cmap, line_color="black")
p.vbar("kind", 0.7, "q1", "q2", source=source, color=cmap, line_color="black")

# outliers
outliers = df[~df.hwy.between(df.lower, df.upper)]
p.scatter("kind", "hwy", source=outliers, size=6, color="black", alpha=0.3)

p.xgrid.grid_line_color = None
p.axis.major_label_text_font_size="14px"
p.axis.axis_label_text_font_size="12px"

show(p)
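
One possible way to adapt the whisker computation in the example is sketched below. It is not a tested patch for the Bokeh docs: the small DataFrame is a hypothetical stand-in for the `autompg2` data, and the column names `upper_fence` and `lower_fence` are assumptions introduced here so that `upper` and `lower` can hold the actual whisker ends.

```python
import pandas as pd

# Hypothetical stand-in for autompg2[["class", "hwy"]]
df = pd.DataFrame({
    "kind": ["a"] * 6 + ["b"] * 5,
    "hwy": [20, 22, 23, 24, 26, 44, 15, 16, 17, 18, 19],
})

# compute quantiles (same as in the example)
qs = df.groupby("kind").hwy.quantile([0.25, 0.5, 0.75])
qs = qs.unstack().reset_index()
qs.columns = ["kind", "q1", "q2", "q3"]
df = pd.merge(df, qs, on="kind", how="left")

# compute IQR fences; these are limits, not whisker positions
iqr = df.q3 - df.q1
df["upper_fence"] = df.q3 + 1.5 * iqr
df["lower_fence"] = df.q1 - 1.5 * iqr

# whiskers end at the most extreme observations inside the fences
inside = df[df.hwy.between(df.lower_fence, df.upper_fence)]
whiskers = inside.groupby("kind").hwy.agg(lower="min", upper="max").reset_index()
df = pd.merge(df, whiskers, on="kind", how="left")

# outliers are the observations outside the fences
outliers = df[~df.hwy.between(df.lower_fence, df.upper_fence)]
print(whiskers)
print(outliers[["kind", "hwy"]])
```

With `upper` and `lower` defined this way, the `Whisker(...)` call in the example could stay unchanged. The sketch does not handle the edge case where no observation lies between a quartile and its fence, in which case the whisker would end inside the box.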

Stack traceback or browser console output

No response

Screenshots

No response

@its-DomeE
Author

I could provide a PR in the next few days if desired.

@dinya
Contributor

dinya commented Apr 3, 2024

@its-DomeE do you mean a difference in the whiskers like this?

[Images: current boxplot example | [McGill1978] approach]

In my library I use code backported (adapted) from matplotlib.cbook.boxplot_stats(). The function itself follows the [McGill1978] approach.

(This code is used by seaborn.boxplot() too, since seaborn is a high-level frontend for matplotlib.)

[McGill1978] McGill, R., Tukey, J.W., and Larsen, W.A. (1978) "Variations of Boxplots", The American Statistician, 32:12-16.
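
For reference, a minimal sketch of what `matplotlib.cbook.boxplot_stats()` returns for a single sample. The data is hypothetical, and this shows only the matplotlib function itself, not the adapted code mentioned above.

```python
from matplotlib.cbook import boxplot_stats

# Hypothetical sample with one obvious outlier (30)
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# boxplot_stats returns one dict per dataset; whis=1.5 is the default
stats = boxplot_stats(sample, whis=1.5)[0]

print(stats["q1"], stats["q3"])          # box edges
print(stats["whislo"], stats["whishi"])  # whisker ends (actual data values)
print(stats["fliers"])                   # points beyond the whiskers
```

The `whislo` and `whishi` entries are the most extreme observations within 1.5 * IQR of the box, which corresponds to the expected behavior described in this issue.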
