get_data.coxph returns data without labels #790

Open
iago-pssjd opened this issue Jul 8, 2023 · 7 comments
Labels
3 investigators ❔❓ Need to look further into this issue

Comments

@iago-pssjd

iago-pssjd commented Jul 8, 2023

get_data.coxph returns data without labels. As a consequence, when used for parameters::parameters, the attribute pretty_labels is not useful at all.

Indeed, in the function

insight/R/get_data.R

Lines 1827 to 1852 in a95325c

get_data.coxph <- function(x, source = "environment", verbose = TRUE, ...) {
  # try to recover data from environment
  model_data <- .get_data_from_environment(x, source = source, verbose = verbose, ...)
  if (!is.null(model_data)) {
    return(model_data)
  }
  # fall back to extract data from model frame
  # first try, parent frame
  dat <- tryCatch(
    {
      mf <- .recover_data_from_environment(x)
      mf <- .prepare_get_data(x, stats::na.omit(mf), verbose = FALSE)
    },
    error = function(x) NULL
  )
  # second try, default extractor. Less good because of coercion to other types
  if (is.null(dat)) {
    # second try, global env
    dat <- get_data.default(x, source = source, verbose = verbose, ...)
  }
  dat
}

the issue happens with .prepare_get_data, where labels are removed from variables.

@strengejacke
Member

Labels are not removed in general inside .prepare_get_data(); maybe there's a specific issue with coxph models. Will look into this.

library(easystats)
#> # Attaching packages: easystats 0.6.0.10
#> ✔ bayestestR  0.13.1.2   ✔ correlation 0.8.4   
#> ✔ datawizard  0.8.0.3    ✔ effectsize  0.8.3.11
#> ✔ insight     0.19.3     ✔ modelbased  0.8.6.3 
#> ✔ performance 0.10.4.1   ✔ parameters  0.21.1.2
#> ✔ report      0.5.7.9    ✔ see         0.8.0.2
data(efc)
m <- lm(neg_c_7 ~ e42dep, data = efc)
str(get_data(m))
#> 'data.frame':    94 obs. of  2 variables:
#>  $ neg_c_7: num  12 20 11 12 19 15 11 15 10 28 ...
#>   ..- attr(*, "label")= chr "Negative impact with 7 items"
#>  $ e42dep : Factor w/ 4 levels "1","2","3","4": 3 3 3 4 4 4 4 4 4 4 ...
#>   ..- attr(*, "label")= chr "elder's dependency"
#>   ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
#>   .. ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
str(get_data(m, source = "mf"))
#> 'data.frame':    94 obs. of  2 variables:
#>  $ neg_c_7: num  12 20 11 12 19 15 11 15 10 28 ...
#>   ..- attr(*, "label")= chr "Negative impact with 7 items"
#>  $ e42dep : Factor w/ 4 levels "1","2","3","4": 3 3 3 4 4 4 4 4 4 4 ...
#>   ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
#>   .. ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
#>   ..- attr(*, "label")= chr "elder's dependency"
#>  - attr(*, "terms")=Classes 'terms', 'formula'  language neg_c_7 ~ e42dep
#>   .. ..- attr(*, "variables")= language list(neg_c_7, e42dep)
#>   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. ..$ : chr [1:2] "neg_c_7" "e42dep"
#>   .. .. .. ..$ : chr "e42dep"
#>   .. ..- attr(*, "term.labels")= chr "e42dep"
#>   .. ..- attr(*, "order")= int 1
#>   .. ..- attr(*, "intercept")= int 1
#>   .. ..- attr(*, "response")= int 1
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. ..- attr(*, "predvars")= language list(neg_c_7, e42dep)
#>   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
#>   .. .. ..- attr(*, "names")= chr [1:2] "neg_c_7" "e42dep"
#>  - attr(*, "na.action")= 'omit' Named int [1:6] 4 27 33 46 58 97
#>   ..- attr(*, "names")= chr [1:6] "4" "27" "33" "46" ...
#>  - attr(*, "is_subset")= logi FALSE

Created on 2023-07-09 with reprex v2.0.2

@strengejacke strengejacke added the 3 investigators ❔❓ Need to look further into this issue label Jul 9, 2023
@iago-pssjd
Author

iago-pssjd commented Jul 16, 2023

I should point out that get_data is called with source = "mf", since that is how it is called here:

https://github.com/easystats/parameters/blob/71a5271a3f90c4707f67e5d2b5b07bd458ffe94b/R/format_parameters.R#L364-L373

(called by parameters:::.add_model_parameters_attributes, which is called in https://github.com/easystats/parameters/blob/71a5271a3f90c4707f67e5d2b5b07bd458ffe94b/R/1_model_parameters.R#L616-L631)

For a minimal example:

library(survival)
library(insight) # provides get_data()
dat_regression_test <- data.frame(
    time = c(4, 3, 1, 1, 2, 2, 3),
    status = c(1, 1, 1, 0, 1, 1, 0),
    x = c(0, 2, 1, 1, 1, 0, 0),
    sex = c(0, 0, 0, 0, 1, 1, 1)
)
attr(dat_regression_test$x, "label") <- "Pred"
mod <- survival::coxph(Surv(time, status) ~ x + strata(sex),
                       data = dat_regression_test,
                       ties = "breslow"
)

str(get_data(mod, source = "mf"))
'data.frame':	7 obs. of  4 variables:
 $ time  : num  4 3 1 1 2 2 3
 $ status: num  1 1 1 0 1 1 0
 $ x     : num  0 2 1 1 1 0 0
 $ sex   : num  0 0 0 0 1 1 1
 - attr(*, "is_subset")= logi FALSE

For your example, str(parameters(m)) includes

 - attr(*, "pretty_labels")= Named chr [1:4] "(Intercept)" "elder's dependency [slightly dependent]" "elder's dependency [moderately dependent]" "elder's dependency [severely dependent]"

However, for str(parameters(mod))

- attr(*, "pretty_labels")= Named chr "x"
  ..- attr(*, "names")= chr "x"

@iago-pssjd
Author

Maybe the issue is that, in get_data.coxph, .prepare_get_data is called on the result of stats::na.omit (line 1840), which removes all labels.
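
The effect can be checked in base R without fitting a model. This is only a minimal sketch, assuming a plain numeric column that carries a "label" attribute as in the example above; the data frame `d` is made up for illustration and is not taken from insight.

```r
# Illustrative data (hypothetical): one labelled column, one missing value
d <- data.frame(x = c(0, 2, 1), y = c(1, NA, 0))
attr(d$x, "label") <- "Pred"

# na.omit() subsets the rows with `[`, and subsetting a plain vector
# drops attributes such as "label"
cleaned <- stats::na.omit(d)

attr(d$x, "label")        # label on the original column
attr(cleaned$x, "label")  # label after na.omit()
```

Comparing the last two calls shows whether the label survives the row subsetting done by na.omit().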

@iago-pssjd
Author

So, @strengejacke, why do some of the get_data methods call stats::na.omit inside .prepare_get_data while others do not? Is there an alternative?

@strengejacke
Member

The original idea of get_data() was to retrieve the data that was used to fit the model, matching the same number of observations (i.e. NA removed). Meanwhile, since there are so many edge cases, and because for updating the model or calculating predictions it's not necessary to remove missings, the default now is to retrieve the data from the environment, i.e. the original data. When this doesn't work, get_data() falls back to retrieving data from the model frame.

@strengejacke
Member

However, for str(parameters(mod))

Yes, but that data isn't labelled, so no surprise here?

@iago-pssjd
Author

iago-pssjd commented Aug 14, 2023

@strengejacke The issue is that stats::na.omit removes the labels. Replacing it with tidyr::drop_na solves the issue, but I know you avoid additional dependencies, and I did not find any other base R way to remove the missing values while keeping the labels (beyond copying the labels and restoring them after removing the missing values).
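
The copy-and-restore approach mentioned above could look roughly like this in base R. It is a sketch only: `na_omit_keep_labels()` is a hypothetical helper name, not a function in insight, and it restores just the "label" attribute (value labels stored in "labels" would need the same treatment).

```r
# Hypothetical helper, not part of insight: drop incomplete rows, then
# restore the "label" attribute that row subsetting discards.
na_omit_keep_labels <- function(data) {
  # remember each column's label (NULL for unlabelled columns)
  labels <- lapply(data, attr, which = "label", exact = TRUE)
  out <- stats::na.omit(data)
  for (nm in names(out)) {
    if (!is.null(labels[[nm]])) {
      attr(out[[nm]], "label") <- labels[[nm]]
    }
  }
  out
}

# Illustrative data (hypothetical), labelled the same way as in the example
d <- data.frame(x = c(0, 2, NA, 1), y = c(1, 0, 1, 1))
attr(d$x, "label") <- "Pred"

attr(stats::na.omit(d)$x, "label")       # plain na.omit()
attr(na_omit_keep_labels(d)$x, "label")  # helper restores the label
```

The helper leaves the "na.action" attribute set by na.omit() untouched, so the record of the dropped rows is still available.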

Yes, but that data isn't labelled, so no surprise here?

Wrong, it is labelled, since previously I had done

attr(dat_regression_test$x, "label") <- "Pred"
