Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plotting normality of residuals: Scaling issues / differences #335

Open
mutlusun opened this issue Mar 27, 2024 · 14 comments
Open

Plotting normality of residuals: Scaling issues / differences #335

mutlusun opened this issue Mar 27, 2024 · 14 comments

Comments

@mutlusun
Copy link

Dear package authors,

Thank you very much for all your work and effort for the easystats packages. I like and use them a lot and they make my life definitely easier.

Recently, I stumbled upon a scaling "issue" with a data set consisting of two time points. Here is a small example for examination:

set.seed(1234)
x <- rnorm(100)
y <- rnorm(100, mean = x, sd = .5)
myd <- data.frame(x, y)

When fitting a regression model to this data and plotting the residuals, I get the impression that something is off here:

fit <- lm(y ~ x, data = myd)
tmp <- performance::check_normality(fit)
plot(tmp, type = "qq")

see-wo-qqplotr-001

However, plotting in a different way doesn't show such a strong deviation:

plot(tmp, type = "qq", detrend = FALSE)

see-wo-qqplotr-002

First of all, I'm not sure whether these differences in visualization are an "issue" of this package. When looking at the y-axis in the first plot, it is clear that the deviations don't have a strong magnitude. However, I still find it a bit strange that the same function suggests different interpretations.

Would it be an option to set certain minimal limits for the plot in the case of detrend = TRUE? I'm happy to implement this but wanted to ask for your thoughts before.

Note: This is not such an issue with qqplotr installed as the bands are often very wide and the interval wider.

@bwiernik
Copy link
Contributor

The second plot shows a very strong deviation. Those tails are much wider than expected under a normal distribution. The first plot better shows the amount of the deviation by removing the angular axis.

@strengejacke
Copy link
Member

strengejacke commented Mar 27, 2024

This issue is related to easystats/performance#643 (comment).
When you look at the examples (https://easystats.github.io/see/articles/performance.html#qq-plot), you see that detrend = TRUE often works well. But sometimes, due to the different limits of the y-axis, the plots look rather different (although they are not).

I still find it a bit strange that the same function suggests different interpretations.

It does not. It's just the scale that "zooms in" for detrend = TRUE, and actually, for detrend = TRUE, the deviation is usually easier to see.

That said, it may be surprising to users to see that large differences. Maybe we find a way to adjust the y-axis, so the differences are not that large.

@bwiernik
Copy link
Contributor

We set the non-detrended y axis limits to fixed values if I recall correctly. We should do the same with appropriate values for the detrended y axis.

@strengejacke
Copy link
Member

strengejacke commented Mar 27, 2024

No, I think for both plots, we don't set limits, thus ggplot2 decides which range to use, based on the range of the data.

.plot_diag_qq <- function(x,

The question is if it's possible to find reasonable y-limits for the detrended plot?

@mutlusun
Copy link
Author

Thank you for your replies!

Sorry, for being sloppy in my description above. As both of you pointed out, the tails show a strong deviation and both plots show the same thing but the scaling is different. I am aware of that. I was, for example, more concerned with the range of -1 to 1 of the standard normal distribution quantiles which could be interpreted differently between both plots (when not paying attention to the limits of the y-axes).

The question is if it's possible to find reasonable y-limits for the detrended plot?

I have thought about that a while but I think it is hard to come up with reasonable y-limits. Is there a good rule of thumb for deciding when there is a deviation? (This way minimum y-limits could be set. But, I didn't find one and it is probably misleading to decide for one.)

As an alternative: Would it be possible to set qqplotr as a hard dependency for this plot? This could be an easy way to increase the y-limits a bit.

@ClaudioZandonella
Copy link

ClaudioZandonella commented May 22, 2024

Adding on this topic, I found these results interesting.

qqplot with detrend = TRUE

Screenshot 2024-05-22 at 11 04 36

qqplot with detrend = FALSE

Screenshot 2024-05-22 at 11 04 23

They provide completely opposite results 🤯. This makes interpretation pretty troublesome.
Running shapiro.test() on the residuals I obtain p-value = 0.8866.

Do you have any idea on how to interpret these opposite results?

This is not just a toy example, I initially obtained these results from a real study. I am not sure if these strange results are related to the characteristics of the study: small fixed effect (Marginal R2 ~ 12%), but very large random effect (conditional R2 ~ 90%), so the remaining residual variance is really small.

Below is a minimal working example to simulate data and reproduce the plots.

Click me - Snippet Code
library('tidyverse')
library('ggplot2')
library('lme4')
set.seed(2024)

# create data
df <- expand_grid(
  id = 1:20,
  trial = 1:20,
  condition = 0:1
  ) %>% 
  mutate(
    across(c(id, trial, condition), factor)
  )

contrasts(df$condition) <- contr.sum(2)

# generate random effect
ran_ef <- rnorm(20, mean = 0, sd =.45)

# get model matrix random effects
my_model <- "y ~ condition + (1 | id)"
df$y <- 1 # lFormula need y
foo <- lFormula(eval(my_model), df)
Z <- t(as.matrix(foo$reTrms$Zt))

# get model matrix fixed effects
V <- model.matrix(~ condition, data = df)

# simulate y
df$y <- as.vector(V %*% c(1.4, .15) + Z %*% ran_ef + rnorm(nrow(df), mean = 0, sd = .15))

# plot data
ggplot(df) +
  geom_point(aes(x = id, y = y, color = condition), 
             alpha = .6, position = position_jitter(width = .2))

# fit model
my_fit <- lmer(y ~ condition + (1 | id), data = df)
performance::check_model(my_fit)
performance::r2(my_fit)

performance::check_normality(my_fit)
shapiro.test(residuals(my_fit))
qqnorm(residuals(my_fit))

tmp <- performance::check_normality(my_fit)
plot(tmp, type = "qq")
plot(tmp, type = "qq", detrend = FALSE)

By the way, thank you for the amazing package❤️❤️

@strengejacke
Copy link
Member

It seems like qqplotr is not installed? I get completely different plots:

library(easystats)
#> # Attaching packages: easystats 0.7.1.2
#> ✔ bayestestR  0.13.2.1    ✔ correlation 0.8.4.2  
#> ✔ datawizard  0.10.0.4    ✔ effectsize  0.8.8    
#> ✔ insight     0.19.11.2   ✔ modelbased  0.8.7    
#> ✔ performance 0.11.0.9    ✔ parameters  0.21.7.1 
#> ✔ report      0.5.8.2     ✔ see         0.8.4.1
library(lme4)
#> Loading required package: Matrix
set.seed(2024)

# create data
df <- expand.grid(
  id = 1:20,
  trial = 1:20,
  condition = 0:1
) |>
  to_factor()

contrasts(df$condition) <- contr.sum(2)

# generate random effect
ran_ef <- rnorm(20, mean = 0, sd = .45)

# get model matrix random effects
my_model <- "y ~ condition + (1 | id)"
df$y <- 1 # lFormula need y
foo <- lFormula(eval(my_model), df)
Z <- t(as.matrix(foo$reTrms$Zt))

# get model matrix fixed effects
V <- model.matrix(~condition, data = df)

# simulate y
df$y <- as.vector(V %*% c(1.4, .15) + Z %*% ran_ef + rnorm(nrow(df), mean = 0, sd = .15))

# fit model
my_fit <- lmer(y ~ condition + (1 | id), data = df)
tmp <- performance::check_normality(my_fit)
plot(tmp, type = "qq")

plot(tmp, type = "qq", detrend = FALSE)

Created on 2024-05-22 with reprex v2.1.0

@ClaudioZandonella
Copy link

Yes, thank you for the clarification!

After installing qqplotr I obtain more reasonable (normal😅) results

Before I obtained the message

For confidence bands, please install 'qqplotr'.

I thought it was only an "additional" feature, not a requirement.

Thank you for the immediate help!

@strengejacke
Copy link
Member

strengejacke commented May 22, 2024

I thought it was only an "additional" feature, not a requirement.

Yes, usually, it should indeed only be for CIs. But qqplotr has a detrend-argument that does the transformation of the plot, and w/o that package, we have written our own detrend-functions, which obviously doesn't work well in all situations.

@mattansb any ideas?

@mattansb
Copy link
Member

The CIs are an additional feature enabled by qqplotr, but they really do make possible the interpretation of the the detrended QQ plot (and the non-detrended plot as well!)

@strengejacke
Copy link
Member

strengejacke commented May 22, 2024

I meant the own detrend-code:

qq_stuff <- list(
if (detrend) {
ggplot2::geom_hline(
yintercept = 0,
linewidth = size_line,
colour = colors[1],
na.rm = TRUE
)
} else {
ggplot2::geom_qq_line(
linewidth = size_line,
colour = colors[1],
na.rm = TRUE
)
},
ggplot2::geom_qq(
mapping = if (detrend) ggplot2::aes(y = ggplot2::after_stat(.data$sample) - ggplot2::after_stat(.data$theoretical)),
shape = 16,
na.rm = TRUE,
stroke = 0,
size = size_point,
colour = colors[2] # "#2c3e50"
)
)

Any ideas, why the above plot (#335 (comment)) deviates that much when qqplotr is not installed?

@mattansb
Copy link
Member

Hmmmmm.... there's some scaling issue here. I'll look into it.

@mattansb
Copy link
Member

This isn't 100%, but it's very close.........

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.3.3
g <- rnorm(20, sd = 15)

old

ggplot() + 
  stat_qq(aes(sample = g,
              y = after_stat(sample - theoretical),
              x = after_stat(theoretical)))

fix?

# Re scale?
N <- length(g)
SD <- sd(g) * sqrt((N - 1) / N)

ggplot() + 
  stat_qq(aes(sample = g,
              y = after_stat(sample - theoretical * SD),
              x = after_stat(theoretical * SD)))

similiar to qqplotr

ggplot() + 
  qqplotr::stat_qq_point(aes(sample = g), detrend = TRUE)

Created on 2024-05-22 with reprex v2.1.0

@strengejacke
Copy link
Member

strengejacke commented May 22, 2024

Thanks, I'll revise the code in see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants