-
-
Notifications
You must be signed in to change notification settings - Fork 9.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Decrease wall time of ma.cov
and ma.corrcoef
#26285
Conversation
Is this well tested? If there are doubts, maybe the astropy test suite has some tests for this? |
The benchmarks should be added to the benchmark suite if there are no benchmarks currently for ma.cov and ma.corrcoef |
I have added benchmarks for
|
The changes are well tested. The refactored functions pass all of the previous tests and I have added a new test for |
Lets put this in. The massive performance improvement for |
Thank you for taking the time to review and approve my PR. I can see that one check for the Accelerate LP64 and ILP64 interfaces on macOS ARM64 is failing, but it seems unrelated to the PR and was successful on my previous commit, after which I only updated the documentation. https://github.com/numpy/numpy/actions/runs/9214520468/job/25350855711?pr=26285#step:7:143 Apologies, but as this is my first contribution here, I am unsure about the approval process. Is there anything else that I need to do to wrap up the PR? |
Thanks @christopher-titchen. I wanted to wait a day or two to see if anyone else would take a look. |
This PR significantly decreases the wall time of the
ma.cov
andma.corrcoef
functions which are currently performing slowly. The estimation of covariance and correlation matrices, particularly from data containing missing observations, is important in a number of fields like forecasting.This improvement is achieved by:
ma.extras._covhelper
from its currentnp.int64
tonp.float32
if the integers in thenp.float64
, resulting in faster computation of the denominator in both cases.ma.dot
inma.cov
andma.corrcoef
when estimatingnp.dot
. This results in one less dot product, asma.dot
contains an additional dot product to compute the resultant mask, which is obsolete inma.cov
andma.corrcoef
as we can simply use the factor to determine it instead by identifying locations wherema.corrcoef
that estimates the normalisation denominator,ma.extras._covhelper
, the estimation ofstats::cov
in R). This makes the assumption thatma.cov
andma.corrcoef
return a naïve estimation of the covariance structure and correlation coefficients respectively in the presence of masked values. Note:ma.cov
returns exactly the same result in this PR's implementation as the current implementation.Further work should be done in future PR to incorporate an approach with a stable single-pass algorithm in C, like Welford's online algorithm adapted for the presence of missing values and including the calculation of the sum of squared deviations from the means, for estimating the covariance and correlation coefficients, to ensure alignment with other popular implementations. As mentioned previously,
stats::cov
in R has a great implementation. If anybody is interested, I am happy to send them my parallelised implementation in Numba (LLVM). However, I do believe these naïve approaches have their place as extremely fast estimators of the covariance structure and correlation coefficients, as in the majority of cases they return similar results within a small tolerance in a fraction of the time.A test has also been added for
ma.extras._covhelper
to reflect the changes.The benchmarks and code to generate them are below.
ma.cov
ma.corrcoef
* The function was still evaluating after an hour, and would probably take between 5–8 hours to finish based on the previous two results.