Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

radmeth-adjust replaces significant p-values in scientific notation with '1' and thus hides best DMC #31

Open
iromeo opened this issue Feb 17, 2023 · 6 comments

Comments

@iromeo
Copy link
Contributor

iromeo commented Feb 17, 2023

'methpipe' to 'dnmtools' code refactoring introduced a critical bug in radmeth-adjust. In 1.2.1 version & current master branch the adjust step replaces significant p-values written in scientific notation with '1.0' and thus hides best DMC and mark them as totally insignificant.

Example:
Fragment of DMC table after regression step:

chr5    167087304       +       CpG     0.944042        2100    390     2421    453
chr5    167087308       +       CpG     0.000236524     2109    769     2432    724
chr5    167087366       +       CpG     0.00873441      2289    238     2633    211
chr5    167087627       +       CpG     2.22045e-16     2015    473     2366    1151
chr5    167087636       +       CpG     2.22045e-16     2000    1087    2359    1797
chr5    167087645       +       CpG     1.9873e-14      1949    964     2330    1619
chr5    167087722       +       CpG     3.29441e-11     1781    1102    2106    1587
chr5    167087880       +       CpG     0.464406        1605    1237    1864    1459
chr5    167087934       +       CpG     0.779924        1723    1609    1972    1846
chr5    167087938       +       CpG     0.743816        1732    1644    1991    1885

After adjust step:

chr5    167087304       +       CpG     0.944042        0.000158571     0.0199993       2100    390     2421    453
chr5    167087308       +       CpG     0.000236524     0.000158571     0.0199993       2109    769     2432    724
chr5    167087366       +       CpG     0.00873441      0.000158571     0.0199993       2289    238     2633    211
chr5    167087627       +       CpG     1       1       1       2015    473     2366    1151
chr5    167087636       +       CpG     1       1       1       2000    1087    2359    1797
chr5    167087645       +       CpG     1       1       1       1949    964     2330    1619
chr5    167087722       +       CpG     1       1       1       1781    1102    2106    1587
chr5    167087880       +       CpG     0.464406        0.794533        0.910647        1605    1237    1864    1459
chr5    167087934       +       CpG     0.779924        0.794533        0.910647        1723    1609    1972    1846
chr5    167087938       +       CpG     0.743816        0.794533        0.910647        1732    1644    1991    1885

the bug is in is_number function, it isn't aware that number could have '-' and 'e' chars as it is scientific notation:

static bool
is_number(const string& str) {
  for (const char &c : str)
    if (c != '.' && !std::isdigit(c)) return false;
  return true;
}

Example DMC table attached.
dmc.txt

I've implemented & tested a fix, will attach it as pull request

@iromeo iromeo changed the title radmeth-adjust replace significant p-values in scientific notation with '1' and thus hides best DMC radmeth-adjust replaces significant p-values in scientific notation with '1' and thus hides best DMC Feb 17, 2023
iromeo added a commit to iromeo/dnmtools that referenced this issue Feb 17, 2023
…in scientific notation with '1' and thus hides best DMC smithlabcode#31
@andrewdavidsmith
Copy link
Collaborator

@iromeo Could this relate to speed of the code? I looked at your solution and I think it looks fine (at least it would be no worse), but I wonder if it would be more robust to just use istringstream and parse the number without any worries? It would incur the overhead of creating the object, copying the string, but the code already looks quite slow.

@andrewdavidsmith
Copy link
Collaborator

@iromeo I'll just pull and we can enhance this later if you think it's warranted.

@iromeo
Copy link
Contributor Author

iromeo commented Feb 18, 2023

@andrewdavidsmith I agree that is_number is not optimal, but I believe that its optimizing will be minor optimization that will not affect the whole run time. I suppose that the rest part of the adjustment algorithm works much slower compared to is_number (I mean cross correlation and liptak-stouffer test)

istringstream

AFAIU the is_number check was introduced only in DNMTOOLS version and is missing in previous methpipe radmeth impl. The check is required because regression step could return NA instead of numbers. Thus we could just check the first char of string, if it is digit, then it is likely a valid number, otherwise some NA like string when we were unable to calculate p-value.

At the moment the whole adjustment step works pretty fast, compared to regression step or reads alignment. So as a user I don't see much sense in adjustment step optimization.

P.S: I could compare performance of several approaches on my WGBS samples.

@andrewdavidsmith
Copy link
Collaborator

@iromeo is there a reason this is still open or did we just forget to close it? I recall the problem was fixed, right?

@iromeo
Copy link
Contributor Author

iromeo commented Jul 12, 2023

I think it is ok to close it. I use methpipe with this fix (from my fork) and everything is ok.

As for performance issue - as I mentioned before the adjustment step is much faster compared to very slow regression step, so I don't think that anything is needed to be optimised here.

@andrewdavidsmith
Copy link
Collaborator

andrewdavidsmith commented Jul 12, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants