dplyr::case_when() within dplyr::mutate() loses qualtRics variable labels #323

Open
rempsyc opened this issue Aug 8, 2023 · 7 comments

@rempsyc

rempsyc commented Aug 8, 2023

Hi, continuing the discussion with @jmobrien over from tidyverse/dplyr#6857 :)

Summary of issue:

dplyr::case_when() within dplyr::mutate() loses qualtRics variable labels because they are created through sjlabelled. DavisVaughan from the dplyr team suggests using haven instead of sjlabelled. jmobrien specifies that there may already exist workflows compatible with dplyr::case_when() or an equivalent. Reprexes available in original issue.

Are there already open or closed issues about this or online documentation I could refer to regarding those alternative workflows?

@jmobrien
Collaborator

jmobrien commented Aug 8, 2023

So, what I "know" I've mostly cobbled together from working with this package over the past couple years + my own use of it for data cleaning. My interaction with it doesn't overlap with the original owner. But basically, I think the situation was/is:

  • Qualtrics provided some extra metadata in their response downloads that was worth preserving in qualtRics, specifically question text
  • As far as storing this sort of metadata goes, standard R practice was to use an attribute called "label"
  • Package sjlabelled has tended to be the primary (only?) toolkit specifically for interacting with label/labels attributes.
  • The actual paradigm for using the "label" attribute comes from haven, where it's somewhat bound up in a more complex idea of a "labelled" class including other things like an attribute for response options ("labels") and more complex missing data codes. However, sjlabelled mostly just focuses on working with the label/labels attributes, AS attributes.
  • For a couple of reasons I'm surmising, going with the simpler sjlabelled approach rather than haven's was likely viewed as the best approach for this package:
    • haven itself has always presented the "labelled" class as more of a transitional tool to help with importing/exporting from other statistical suites, and not a "proper" class you'd actually want to use natively in R.
    • Qualtrics's response downloads don't include metadata about questions' response options (nor info about missing data, save for some partial info added relatively recently). So, we only had the content for populating the main "label" attribute.
    • Having the extra "labelled" class attribute could sometimes create errors with generic functions that weren't expecting them (including some really critical stuff like the common modeling functions lm(), lmer(), etc.)
    • The tidyverse was much less mature & dominant then, and neither it nor the more base-R approaches to data manipulation could be relied upon to preserve class and/or attributes effectively by default. So, you were probably going to need to bake in label preservation to your workflow regardless.
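
As a small illustration of that last point, here is a minimal base-R sketch (the vector and label text are made up). It only shows that a plain "label" attribute does not survive ordinary subsetting or c(); it is not how qualtRics itself attaches labels.

```r
# A plain numeric vector carrying a "label" attribute (hypothetical example data)
x <- structure(c(1, 2, 50), label = "a label")
attr(x, "label")

# Subsetting and concatenation return new vectors without the attribute
x_sub <- x[x != 50]
x_cat <- c(x, 3)
is.null(attr(x_sub, "label"))
is.null(attr(x_cat, "label"))

# Putting the label back has to be done by hand
attr(x_sub, "label") <- attr(x, "label")
```

Anything built on these operations needs an explicit restore step like the last line, which is the kind of workflow-level label preservation described above.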

Now, even if I'm right about the above, I kind of think that a lot of that might be viewed differently today. IMO the current, post-vctrs incarnation of haven and the "labelled" class offers a lot that might warrant it being seen as a "real" class. Perhaps we should consider using it if we're going to continue incorporating label metadata (and we definitely will continue). For a number of this-is-too-long-already reasons I'm not quite convinced that's the right choice, but I'm writing this up to at least put the idea in the water.

Meanwhile, if you want to preserve things in your own workflows you're going to need some options. The obvious option is to convert to the "labelled" class, though there are other approaches:

    require(tidyverse, quietly = TRUE)
    require(haven, quietly = TRUE)
    require(sjlabelled, quietly = TRUE)
    
    # Function for converting to the labelled class:
    make_labelled <- 
      \(x){
        haven::labelled(x = x, 
                        label = attr(x, "label"),
                        labels = attr(x, "labels")
        )
      }
    
    # Example data frame:
    test <- 
      tibble(
        a = sample(c(1,2,50), 15, replace = TRUE) |> 
          structure(label = "a label"),
      )
    
    test |> get_label()
#>         a 
#> "a label"
    
    test <- 
      test |> 
      mutate(
        # This approach loses label/labels attributes:
        a_conv = 
          a |> 
          case_match(50 ~ 3, .default = a),
        # But you can convert first: 
        a_lab = 
          a |> 
          make_labelled(),
        # then the standard dplyr tools will preserve attributes (if used properly):
        a_lab_conv = 
          a_lab |> 
          case_match(50 ~ 3, .default = a_lab),
        # If you want to preserve attributes but don't want to end up with 
        # labelled vars, you can do it in place (this requires magrittr's %>%):
        a_conv2 = 
          a %>%
          make_labelled() %>%
          case_match(.x = . ,50 ~ 3, .default = .) |> 
          sjlabelled::unlabel(),
        # or, an even simpler manual approach:
        a_conv3 = 
          a |> 
          case_match(50 ~ 3, .default = a) |> 
          structure(label = attr(a, "label"))
      )

# Some give you a "labelled" class, some don't:     
test |> purrr::map(class)  
#> $a
#> [1] "numeric"
#> 
#> $a_conv
#> [1] "numeric"
#> 
#> $a_lab
#> [1] "haven_labelled" "vctrs_vctr"     "double"        
#> 
#> $a_lab_conv
#> [1] "haven_labelled" "vctrs_vctr"     "double"        
#> 
#> $a_conv2
#> [1] "numeric"
#> 
#> $a_conv3
#> [1] "numeric"
# But all of them except a_conv preserve the "label" attribute
test |> sjlabelled::get_label()
#>          a     a_conv      a_lab a_lab_conv    a_conv2    a_conv3 
#>  "a label"         ""  "a label"  "a label"  "a label"  "a label"

Created on 2023-08-08 with reprex v2.0.2

@juliasilge
Collaborator

> I kind of think that a lot of that might be viewed differently today

This is spot-on IMO; the decisions around sjlabelled were made quite a long time ago, before some newer and better options existed. I do think these attributes are worth revisiting so folks have data that works better with current tools. I would be open to dropping these kinds of attributes altogether in favor of nicer tools for dealing with the labels and other metadata, but if that would be too much of a change, we can think through how this should be updated, maybe using haven's infrastructure instead of sjlabelled.

@jmobrien
Collaborator

What would you say are the current newer & better options? On its face I don't love the label/labels attribute approach either, but I'm not up-to-date on what alternatives might be emerging as best practice.

I will say that one case for sticking with the attribute-centric approach is haven's exporting tools. AFAIK haven provides the only reasonably up-to-date approach for making datasets available in Stata, SAS, etc., which can sometimes be valuable, esp. in academia/gov't. Exporting can include the metadata (and other things), but that does require following their conventions.

@juliasilge
Collaborator

Ah sorry, I may not have been clear.

  • I think that current (modern) haven may be easier to work with than current sjlabelled, and this may be the way to go for better handling of, for example, question text.
  • If we have good enough tools for getting the kind of metadata that exists in the labels in a different way, like a dataframe of metadata with the question text, part of me wants to get rid of all the attributes and "labelled" class business altogether. I do not come from the SPSS or SAS world, though, so this may be too extreme of an option.
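
For concreteness, the "dataframe of metadata" idea in the second bullet could look roughly like the base-R sketch below. The data, the question text, and the get_one_label() helper are all made up for illustration; this is not an existing qualtRics feature.

```r
# Hypothetical survey data with question text stored in "label" attributes
responses <- data.frame(Q1 = c(1, 2, 5), Q2 = c(3, 3, 4))
attr(responses$Q1, "label") <- "How satisfied are you?"
attr(responses$Q2, "label") <- "How likely are you to return?"

# Hypothetical helper: one label per column, NA where a column has none
get_one_label <- function(col) {
  lab <- attr(col, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else as.character(lab)
}

# Collect the labels into a separate lookup table (one row per column)
question_text <- data.frame(
  variable = names(responses),
  question = vapply(responses, get_one_label, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Strip the attributes so the columns are plain vectors
responses[] <- lapply(responses, function(col) {
  attr(col, "label") <- NULL
  col
})
```

With the question text held in question_text, nothing is lost when later steps drop attributes, at the cost of keeping two objects in sync.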

@rempsyc
Author

rempsyc commented Aug 14, 2023

Thanks for the workaround @jmobrien. Just for the sake of completeness, here is the workaround I was using (basically saving labels and manually adding them back after to avoid relying on another package):

suppressWarnings(suppressPackageStartupMessages(library(qualtRics)))
suppressWarnings(suppressPackageStartupMessages(library(sjlabelled)))
suppressWarnings(suppressPackageStartupMessages(library(dplyr)))

# Extract all surveys
surveys <- all_surveys()

# Identify the right survey
survey1.id <- surveys$id[
  which("Projet priming-aggression (Part 1)_Study 3" == surveys$name)]

# Fetch the right survey
data <- suppressMessages(fetch_survey(surveyID = survey1.id, verbose = FALSE))

# sjlabelled works
get_label(data$Status)
#>          Status 
#> "Response Type"

# Save question labels
labels.data <- data |>
  get_label() |>
  bind_rows()

# case_when
data <- data %>% 
  mutate(Status = case_when(Status == 50 ~ 1),
         Progress = case_when(Progress == 100 ~ 1))

# Labels lost
get_label(data$Status)
#> NULL

# Repair labels
data <- data %>% 
  mutate(Status = set_label(Status, labels.data$Status))

# Labels recovered
get_label(data$Status)
#> [1] "Response Type"

# Problem: needs to be done for each variable
get_label(data$Progress)
#> NULL

Created on 2023-08-14 with reprex v2.0.2

There would probably be a way to automate this process more efficiently through a function for all relevant variables though...

@jmobrien
Collaborator

Great, that works too. For automation across multiple variables, in one of my cases I ended up creating a set_attributes() function that worked analogously to set_names() for inline restoration of attributes. I just set key attributes aside en masse for (a) dataframe(s), then used across() to reapply attributes when needed. Don't have the code in front of me but it wasn't too complex.
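
A rough sketch of that save-then-restore idea follows. It is not the original set_attributes() code (which is not shown in this thread); it uses base R instead of across(), and all helper names and data are hypothetical.

```r
# Hypothetical helper, analogous to setNames(): returns x with a "label" attribute
set_label_attr <- function(x, label) {
  attr(x, "label") <- label
  x
}

# Save the label of every column up front (NULL where a column has none)
save_labels <- function(data) {
  lapply(data, function(col) attr(col, "label", exact = TRUE))
}

# Reapply saved labels to the columns that still exist in `data`
restore_labels <- function(data, saved) {
  for (nm in intersect(names(data), names(saved))) {
    if (!is.null(saved[[nm]])) {
      data[[nm]] <- set_label_attr(data[[nm]], saved[[nm]])
    }
  }
  data
}

# Made-up data standing in for a fetched survey
dat <- data.frame(Status = c(0, 50), Progress = c(100, 40))
dat$Status <- set_label_attr(dat$Status, "Response Type")
dat$Progress <- set_label_attr(dat$Progress, "Progress")

saved <- save_labels(dat)

# as.numeric() returns a bare vector, so the labels are gone after recoding
dat$Status <- as.numeric(dat$Status == 50)
dat$Progress <- as.numeric(dat$Progress == 100)

# One call puts them all back
dat <- restore_labels(dat, saved)
```

Because set_label_attr() returns its input, it should also be usable inline inside a pipeline or a mutate() call, in the same way set_names() is.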

@jmobrien
Collaborator

Expanding on your response @juliasilge, yes, this runs up against where we're already using a dual-approach model wherein question text metadata can be embedded at the variable level via labels, at the dataframe level via the attached column map (attribute), or both. We could definitely move more specifically in either direction if we saw fit.

Also, I suppose an alternative approach would be to write some helper functions that can add/restore labels from the column map as needed.
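
Such a helper might look something like the sketch below. It assumes the column map is available as a data frame with one row per column; the column names qname and description, the helper labels_from_map(), and the data are assumptions for illustration and should be checked against the actual column map in your qualtRics version.

```r
# Hypothetical helper: copy question text from a column-map-style lookup
# table back onto the matching columns as "label" attributes
labels_from_map <- function(data, colmap,
                            name_col = "qname", text_col = "description") {
  idx <- match(names(data), colmap[[name_col]])
  for (i in seq_along(data)) {
    # Columns without an entry in the map are left untouched
    if (!is.na(idx[i])) {
      attr(data[[i]], "label") <- colmap[[text_col]][idx[i]]
    }
  }
  data
}

# Made-up stand-ins for survey responses and their column map
resp <- data.frame(Q1 = c(1, 2), Q2 = c(5, 4), extra = c("a", "b"))
colmap <- data.frame(
  qname = c("Q1", "Q2"),
  description = c("How satisfied are you?", "How likely are you to return?"),
  stringsAsFactors = FALSE
)

resp <- labels_from_map(resp, colmap)
attr(resp$Q2, "label")
```

Since the column map lives at the data frame level, a helper like this could be rerun after any step that strips variable-level labels, provided the data frame attribute itself has been kept or saved beforehand.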

3 participants