Add mean imputation function #892

tszfungc · 2022-08-17T04:23:06Z

Add mean impute function for call_dosage, call_genotype, and call_genotype_probability

tomwhite

Thanks for the contribution @tszfungc!

The overall structure looks fine to me. Hoping @jeromekelleher and @timothymillar can take a look too.

tomwhite · 2022-08-17T14:08:44Z

sgkit/stats/preprocessing.py

+        Dataset containing the variable to be imputed.
+    variable
+        Input variable name
+        ``f"{variable}"`` and ``f"{variable}_masked"`` must be present in ``ds``.


Don't think f-strings work here?

timothymillar · 2022-08-18T09:11:54Z

Thanks for looking into this @tszfungc! I think this could be a great approach for imputing call_dosage and call_genotype_probability. However, I don't think it will produce the desired result for call_genotype.

The values in call_genotype are (potentially unsorted) alleles whose order along the ploidy dimension doesn't have any particular meaning. So, as far as I can tell, the mean of those alleles can't really be used for anything.

tszfungc · 2022-08-18T19:29:02Z

Thanks for the review @tomwhite @timothymillar. I agree that the allele order doesn't have a particular meaning. The order along ploidy could be ignored by computing the mean along dim=['samples', 'ploidy'], but this also seems like an unusual use to me.
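
To illustrate the point about allele order, here is a minimal NumPy sketch. The toy array and variable names are made up for illustration and are not taken from the PR. It assumes biallelic calls with shape (variants, samples, ploidy) and suggests that a mean over samples alone changes when the alleles within a call are reordered, whereas a mean over both samples and ploidy does not.

```python
import numpy as np

# Toy genotype calls, shape (variants=1, samples=3, ploidy=2).
# Values are allele indices; this data is invented for illustration.
gt = np.array([[[0, 1], [1, 0], [0, 1]]])

# The same genotypes with the alleles of each call sorted along ploidy.
gt_sorted = np.sort(gt, axis=-1)

# Mean over samples only: one value per ploidy slot, so it depends on
# the arbitrary order in which alleles were stored.
mean_samples = gt.mean(axis=1)
mean_samples_sorted = gt_sorted.mean(axis=1)

# Mean over samples and ploidy: one value per variant, which is
# unaffected by reordering alleles within a call.
mean_both = gt.mean(axis=(1, 2))
mean_both_sorted = gt_sorted.mean(axis=(1, 2))
```

For biallelic data the second quantity is just the alternate allele frequency, which is well defined but is not a meaningful fill value for an individual allele call.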

jeromekelleher

Approach basically looks good to me, but I'm not convinced about the general approach of creating new _imputed variables. It would be simpler/better to just replace the missing data and reset the missingness mask in the returned dataset, I think.

jeromekelleher · 2022-08-25T08:44:31Z

sgkit/stats/preprocessing.py

+    dim: Union[Hashable, Sequence[Hashable]] = "samples",
+    merge: bool = True,
+) -> Dataset:
+    """Mean impute a masked variable


It would be helpful to give a more descriptive follow-up sentence here, for example:

This replaces missing data for the specified variable with the mean of the non-missing values.
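
For readers who want to see what that sentence means in practice, here is a minimal sketch in plain NumPy. It is not the implementation in this PR: the helper name `mean_impute_array`, its signature, and the toy data are hypothetical, and the mask is assumed to be a boolean array that is True where data is missing.

```python
import numpy as np


def mean_impute_array(values, mask, axis):
    """Hypothetical helper: fill masked entries with the mean of the
    unmasked entries along `axis`."""
    values = np.asarray(values, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    # Hide missing entries so they do not contribute to the mean.
    observed = np.where(mask, np.nan, values)
    mean = np.nanmean(observed, axis=axis, keepdims=True)
    # Fill only the masked positions; observed values are left unchanged.
    return np.where(mask, mean, values)


# Toy dosages, shape (variants=2, samples=3); -1 marks a missing call here.
dosage = np.array([[0.0, 2.0, -1.0], [1.0, -1.0, 1.0]])
dosage_mask = dosage < 0
imputed = mean_impute_array(dosage, dosage_mask, axis=1)
```

Here `axis=1` plays the role of `dim="samples"`, so each missing call is filled with the mean over the non-missing samples of the same variant.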

jeromekelleher · 2022-08-25T08:48:40Z

sgkit/variables.py

@@ -214,6 +214,15 @@ def _check_field(
    )
 )

+call_dosage_imputed, call_dosage_imputed_spec = SgkitVariables.register_variable(


I'm not sure we want to create a whole new bunch of variables here. Wouldn't it be simpler if we returned a copy of the original dataset in which all the missing data for the variable in question was replaced with the mean, and the mask was unset?

This would be more useful for downstream work, wouldn't it? We'd surely want to use the (say) imputed call_dosage in downstream analyses, and we wouldn't want to need to change variable names in order to do this.

timothymillar · 2022-09-01T21:45:46Z

@jeromekelleher the trade-off between returning new variables or replacing existing variables was previously discussed in https://github.com/pystatgen/sgkit/pull/308#issuecomment-705706571. I personally have a slight preference for replacing existing variables but there are some good points raised in that discussion. The primary concern seems to be that replacing existing variables is effectively a mutate operation, which goes against the general pattern of treating arrays as immutable.
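
As a rough sketch of the non-mutating pattern under discussion, the following uses a plain dict of NumPy arrays as a stand-in for a Dataset. The helper `with_imputed_variable` is hypothetical, the `_mask` suffix is an assumption about the naming convention, and a single global mean is used only to keep the example short.

```python
import numpy as np


def with_imputed_variable(ds, name):
    """Hypothetical helper: return a new mapping that keeps `name`
    and adds f"{name}_imputed" alongside it."""
    values, mask = ds[name], ds[f"{name}_mask"]
    fill = values[~mask].mean()  # mean of the non-missing values
    out = dict(ds)  # shallow copy; the input mapping is left untouched
    out[f"{name}_imputed"] = np.where(mask, fill, values)
    return out


ds = {
    "call_dosage": np.array([0.0, 2.0, -1.0]),
    "call_dosage_mask": np.array([False, False, True]),
}
new_ds = with_imputed_variable(ds, "call_dosage")
```

The original `call_dosage` stays available in both mappings, at the cost that downstream code has to ask for `call_dosage_imputed` by name.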

jeromekelleher · 2022-09-13T08:47:34Z

I see, thanks. Hmm, not much choice other than to create a bunch of new variables then.

mergify · 2023-03-29T13:17:50Z

This PR has conflicts, @tszfungc please rebase and push updated version 🙏

mergify · 2023-09-05T12:55:57Z

This PR has conflicts, @tszfungc please rebase and push updated version 🙏

mergify · 2023-11-13T14:12:54Z

This PR has conflicts, @tszfungc please rebase and push updated version 🙏

mergify · 2024-02-05T16:12:02Z

This PR has conflicts, @tszfungc please rebase and push updated version 🙏

tszfungc and others added 3 commits August 16, 2022 18:59

Add mean_impute

2c903f2

Merge branch 'pystatgen:main' into main

259c1e9

Update docstrings

ed6eaa8

tomwhite reviewed Aug 17, 2022

tomwhite requested review from timothymillar and jeromekelleher August 17, 2022 14:11

Replace | operator with Union in typing

1264861

Remove f-string inside mean_impute.__doc__

3c5453e

tszfungc and others added 3 commits August 18, 2022 12:32

Merge branch 'main' of https://github.com/tszfungc/sgkit into main

f636587

Merge branch 'pystatgen:main' into main

4f66de3

Merge branch 'main' of https://github.com/tszfungc/sgkit into main

4b0e01f

jeromekelleher reviewed Aug 25, 2022

mergify bot added the conflict PR conflict label Mar 29, 2023

mergify bot removed the conflict PR conflict label Sep 5, 2023

mergify bot added the conflict PR conflict label Sep 5, 2023

mergify bot removed the conflict PR conflict label Nov 13, 2023

mergify bot added the conflict PR conflict label Nov 13, 2023

mergify bot removed the conflict PR conflict label Feb 5, 2024

mergify bot added the conflict PR conflict label Feb 5, 2024
