Context:
I often work with data that are censored. The approach for censored data implemented in brms is really useful, but I have found it to be slow, especially when a large part of the data is censored and for distributions where integrating out the censored data is costly.
The approach taken by brms is explained in the Stan User's Guide: https://mc-stan.org/docs/stan-users-guide/truncation-censoring.html#integrating-out-censored-values
This involves integrating the PDF between the censoring bounds and therefore requires the CDF and/or CCDF functions, depending on whether the censoring is left, right or interval.
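To make the cost concrete, here is a minimal Python sketch of the integrate-out likelihood for a beta distribution. It is not the code brms generates: the function names and the string censoring labels are hypothetical, interval censoring is left out, and the CDF is obtained by Simpson integration of the PDF (assuming both shape parameters are greater than 1) purely to show that every censored observation costs an integral, whereas an observed value costs a single PDF evaluation.

```python
import math


def beta_logpdf(y, a, b):
    """Log density of Beta(a, b) at y, for 0 < y < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(y) + (b - 1) * math.log1p(-y)


def beta_cdf(y, a, b, n_steps=2000):
    """CDF of Beta(a, b) at y by Simpson's rule (n_steps must be even).

    Assumption: a > 1 and b > 1, so the density is 0 at both endpoints.
    """
    def pdf(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0  # only valid under the a > 1, b > 1 assumption
        return math.exp(beta_logpdf(t, a, b))

    h = y / n_steps
    total = pdf(0.0) + pdf(y)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3


def censored_loglik(ys, cens, a, b):
    """Log-likelihood with the censored values integrated out.

    cens holds hypothetical labels: "none", "left" or "right".
    """
    total = 0.0
    for y, c in zip(ys, cens):
        if c == "none":
            total += beta_logpdf(y, a, b)                # observed: PDF only
        elif c == "left":
            total += math.log(beta_cdf(y, a, b))         # true value below y: CDF
        elif c == "right":
            total += math.log(1.0 - beta_cdf(y, a, b))   # true value above y: CCDF
        else:
            raise ValueError(f"unknown censoring label: {c}")
    return total
```

A call such as `censored_loglik([0.5, 0.1], ["none", "left"], 2, 5)` evaluates the PDF once for the first value and a full integral for the second, which is where the time goes when many observations are censored.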
Alternative:
The Stan User's Guide also mentions an alternative approach: https://mc-stan.org/docs/stan-users-guide/truncation-censoring.html#estimating-censored-values
I have found this approach to be much faster (about 30 times faster in the reprex below), and it recovers the same parameter estimates. In this approach, the censored data are declared as parameters to be estimated. I think this is called a data augmentation (imputation) approach and is discussed in https://academic.oup.com/jrsssb/article-abstract/55/1/3/7035892 (section 6.3). It only requires the PDF function.
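The idea that imputing the censored values gives the same posterior as integrating them out can be checked on a toy model. The sketch below is an illustration only and not the Stan implementation: it swaps in an exponential likelihood with a Gamma prior on the rate, because that model has closed-form conditionals, and it uses a Gibbs sampler where Stan would sample the latent values jointly with the other parameters by HMC. All names, priors and data values are made up for the example.

```python
import random


def gibbs_censored_exponential(observed, n_censored, bound,
                               a0=1.0, b0=1.0, n_iter=20000, seed=1):
    """Data augmentation for right-censored exponential data.

    Prior: rate ~ Gamma(shape=a0, rate=b0). The censored values are
    treated as unknowns constrained to lie above `bound`.
    """
    rng = random.Random(seed)
    n_total = len(observed) + n_censored
    sum_observed = sum(observed)
    rate = 1.0  # arbitrary starting value
    draws = []
    for _ in range(n_iter):
        # Impute each censored value from its conditional distribution:
        # an exponential truncated to (bound, inf), i.e. bound + Exp(rate).
        latent = [bound + rng.expovariate(rate) for _ in range(n_censored)]
        # Update the rate given the completed data (conjugate Gamma update).
        # random.gammavariate takes a scale, hence the reciprocal.
        shape = a0 + n_total
        total = b0 + sum_observed + sum(latent)
        rate = rng.gammavariate(shape, 1.0 / total)
        draws.append(rate)
    return draws


def marginal_posterior_mean(observed, n_censored, bound, a0=1.0, b0=1.0):
    """Posterior mean of the rate with the censored values integrated out.

    Each censored point contributes the CCDF exp(-rate * bound), so the
    posterior is Gamma(a0 + n_observed, b0 + sum_observed + n_censored * bound).
    """
    return (a0 + len(observed)) / (b0 + sum(observed) + n_censored * bound)
```

With, say, `observed = [0.2, 0.5, 0.9, 1.1, 0.3]`, three values censored at `1.5`, and the first 1000 draws discarded, the mean of the Gibbs draws should agree with `marginal_posterior_mean` up to Monte Carlo error. Only the PDF of the completed data enters the sampler; the CCDF appears only in the closed-form check.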
Notes:
The reprex below is for a beta distribution. I extracted the brms-generated Stan code and changed it according to the User's Guide, but without changing how brms requires the data to be formatted (so standata() is the same in both implementations). I am not very familiar with programming in Stan, so some further efficiency gains are probably possible.
I have also tested this with a Gamma distribution and it worked too, but the speed-up was less spectacular. I can imagine that for a Gaussian distribution there would be no speed-up.
Feature request:
Would you be willing to consider implementing this approach in brms? Maybe not as a replacement for the current implementation, but as an alternative. Perhaps an extra argument in the cens() function to switch between both alternatives?
I think this approach might not work for discrete distributions because Stan cannot sample discrete parameters. However, I have a feeling this might be related: https://users.aalto.fi/~johnsoa2/notebooks/CopulaIntro.html#bayesian-estimation-data-augmentation (I haven't thought this through).
Reproducible example:
Created on 2024-05-19 with reprex v2.1.0