
ENH: optimize.approx_fprime: avoid quadratic memory usage #20645

Merged
merged 2 commits into scipy:main on May 6, 2024

Conversation

@nickodell (Contributor) commented May 5, 2024

Currently, the code generates all of the perturbation vectors for x0 at once. When optimizing a function of many variables, this uses a quadratic amount of memory, which is wasteful. This change reduces memory use by computing those vectors one at a time.
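For illustration, here is a minimal sketch of the two shapes (hypothetical function names; not the actual SciPy code, which also supports 3-point and complex-step differences):

import numpy as np

def fprime_all_at_once(f, x0, h):
    # Old shape: materialize an (N, N) matrix of perturbation vectors,
    # one row per coordinate -- N * N * 8 bytes for float64.
    h_vecs = np.diag(h)
    f0 = f(x0)
    return np.array([(f(x0 + h_vecs[i]) - f0) / h[i] for i in range(x0.size)])

def fprime_one_at_a_time(f, x0, h):
    # New shape: a single reusable work vector, so memory stays O(N).
    f0 = f(x0)
    grad = np.empty_like(x0)
    x1 = x0.copy()
    for i in range(x0.size):
        x1[i] += h[i]
        grad[i] = (f(x1) - f0) / h[i]
        x1[i] = x0[i]  # reset this coordinate before the next iteration
    return grad

Both compute the same forward-difference gradient; only the memory footprint differs.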

This is my first change to SciPy; I appreciate any feedback.

Reference issue

There is no previously filed issue for this problem.

What does this implement/fix?

Here are two examples from Stack Overflow of users running into this problem:

https://stackoverflow.com/questions/77867437/scipy-minimize-returns-an-array-too-big

https://stackoverflow.com/questions/78399838/when-using-scipy-minimize-a-large-array-appears-which-was-not-originally-there

To summarize, both of these users are trying to minimize a function of ~90000 variables, and SciPy then tries to allocate an array of N ** 2 * 8 bytes to compute the derivative. That seems strange, since the gradient can be computed with a linear amount of memory. There is also a solver, L-BFGS-B, that is capable of minimizing a function of this size, so doing this in linear memory is of more than academic interest.
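A quick back-of-the-envelope check of the scale, using the N from those questions:

N = 90_000
print(f"{N**2 * 8 / 2**30:.1f} GiB")  # ~60.3 GiB for the dense matrix of perturbation vectors
print(f"{N * 8 / 2**10:.1f} KiB")     # ~703.1 KiB for a single perturbation vector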

Memory usage benchmarks

I have two goals for this change:

  • Reduce memory use
  • Avoid a regression in CPU usage

Here is a benchmark showing the difference in memory usage:

import numpy as np
from scipy.optimize import minimize
import resource
import time


def get_peak_mem():
    # ru_maxrss is the peak resident set size, reported in KiB on Linux
    kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    mb = kb / 1024
    return mb

def func(x, target):
    return np.mean((target - x)**2)


x0 = np.zeros(10000)
target = np.ones_like(x0)
start = time.time()
res = minimize(
    func,
    x0,
    method='L-BFGS-B',
    options=dict(maxfun=np.inf),
    args=(target,),  # extra arguments are documented to be a tuple
)
end = time.time()
print(res)
print(f"Usage: {get_peak_mem():.2f} MiB")
print(f"Duration: {end - start:.2f}s")

(Warning: memory usage will be wrong on OS X.)
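If you want usable numbers on macOS too, here is a drop-in replacement for get_peak_mem, relying on the documented platform difference that ru_maxrss is reported in bytes on macOS but in kilobytes on Linux:

import sys

def get_peak_mem():
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        rss /= 1024  # macOS reports ru_maxrss in bytes
    return rss / 1024  # Linux reports KiB, so this yields MiB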

Result, running on this branch

  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 2.5149038026498236e-10
        x: [ 1.000e+00  1.000e+00 ...  1.000e+00  1.000e+00]
      nit: 2
      jac: [-3.171e-09 -3.171e-09 ... -3.171e-09 -3.171e-09]
     nfev: 50005
     njev: 5
 hess_inv: <10000x10000 LbfgsInvHessProduct with dtype=float64>
Usage: 81.21 MiB
Duration: 2.00s

Result, running on current main (4a537b7)

  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 2.5149038026498236e-10
        x: [ 1.000e+00  1.000e+00 ...  1.000e+00  1.000e+00]
      nit: 2
      jac: [-3.171e-09 -3.171e-09 ... -3.171e-09 -3.171e-09]
     nfev: 50005
     njev: 5
 hess_inv: <10000x10000 LbfgsInvHessProduct with dtype=float64>
Usage: 120.34 MiB
Duration: 3.67s

(Interestingly, this is less memory than the formula N ** 2 * 8 bytes predicts: for N = 10000 that would be ~800 MB. I believe this is because Linux allocates pages lazily. Each row of the matrix spans roughly twenty 4 KiB pages, but only the page containing the row's single non-zero value is ever written, so only ~10000 pages (~40 MB) are actually backed by physical memory. That matches the observed difference of about 39 MiB between the two runs.)
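One way to observe the lazy-allocation effect directly (a Linux-specific sketch; exact numbers will vary):

import numpy as np
import resource

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
a = np.zeros((10000, 10000))   # 800 MB of virtual address space, no pages touched yet
np.fill_diagonal(a, 1.0)       # writes one element per row, touching ~one 4 KiB page each
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS grew by ~{(after - before) / 1024:.0f} MiB, not 800 MiB")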

CPU Benchmarks

I also attempted to benchmark this with the built-in benchmarks, but could not figure out how to use --compare (see airspeed-velocity/asv#1401). Benchmarking without --compare, on the benchmarks where no derivative is supplied, this change is either faster or slower by only ~1%.

Optimize benchmarks, my branch:

               rosenbrock_nograd        COBYLA       mean_nfev            1000.0         
               rosenbrock_nograd        COBYLA       mean_time     0.01873011589050293   
               rosenbrock_nograd        Powell       mean_nfev            887.2          
               rosenbrock_nograd        Powell       mean_time     0.01709756851196289   
               rosenbrock_nograd     nelder-mead     mean_nfev            310.2          
               rosenbrock_nograd     nelder-mead     mean_time    0.0070003032684326175  
               rosenbrock_nograd       L-BFGS-B      mean_nfev            148.4          
               rosenbrock_nograd       L-BFGS-B      mean_time     0.00667574405670166   
               rosenbrock_nograd         BFGS        mean_nfev            176.8          
               rosenbrock_nograd         BFGS        mean_time      0.0107954740524292   
               rosenbrock_nograd          CG         mean_nfev             n/a           
               rosenbrock_nograd          CG         mean_time             n/a           
               rosenbrock_nograd         TNC         mean_nfev             n/a           
               rosenbrock_nograd         TNC         mean_time             n/a           
               rosenbrock_nograd        SLSQP        mean_nfev            155.1          
               rosenbrock_nograd        SLSQP        mean_time     0.008312082290649414  
               rosenbrock_nograd      Newton-CG      mean_nfev             n/a           
               rosenbrock_nograd      Newton-CG      mean_time             n/a           
               rosenbrock_nograd        dogleg       mean_nfev             n/a           
               rosenbrock_nograd        dogleg       mean_time             n/a           
               rosenbrock_nograd      trust-ncg      mean_nfev             n/a           
               rosenbrock_nograd      trust-ncg      mean_time             n/a           
               rosenbrock_nograd     trust-exact     mean_nfev             n/a           
               rosenbrock_nograd     trust-exact     mean_time             n/a           
               rosenbrock_nograd     trust-krylov    mean_nfev             n/a           
               rosenbrock_nograd     trust-krylov    mean_time             n/a           
               rosenbrock_nograd     trust-constr    mean_nfev            216.0          
               rosenbrock_nograd     trust-constr    mean_time     0.07394084930419922   

Main branch

               rosenbrock_nograd        COBYLA       mean_nfev            1000.0         
               rosenbrock_nograd        COBYLA       mean_time     0.01912097930908203   
               rosenbrock_nograd        Powell       mean_nfev            887.2          
               rosenbrock_nograd        Powell       mean_time     0.018049192428588868  
               rosenbrock_nograd     nelder-mead     mean_nfev            310.2          
               rosenbrock_nograd     nelder-mead     mean_time     0.007235574722290039  
               rosenbrock_nograd       L-BFGS-B      mean_nfev            148.4          
               rosenbrock_nograd       L-BFGS-B      mean_time     0.00679473876953125   
               rosenbrock_nograd         BFGS        mean_nfev            176.8          
               rosenbrock_nograd         BFGS        mean_time     0.010858583450317382  
               rosenbrock_nograd          CG         mean_nfev             n/a           
               rosenbrock_nograd          CG         mean_time             n/a           
               rosenbrock_nograd         TNC         mean_nfev             n/a           
               rosenbrock_nograd         TNC         mean_time             n/a           
               rosenbrock_nograd        SLSQP        mean_nfev            155.1          
               rosenbrock_nograd        SLSQP        mean_time     0.009046149253845216  
               rosenbrock_nograd      Newton-CG      mean_nfev             n/a           
               rosenbrock_nograd      Newton-CG      mean_time             n/a           
               rosenbrock_nograd        dogleg       mean_nfev             n/a           
               rosenbrock_nograd        dogleg       mean_time             n/a           
               rosenbrock_nograd      trust-ncg      mean_nfev             n/a           
               rosenbrock_nograd      trust-ncg      mean_time             n/a           
               rosenbrock_nograd     trust-exact     mean_nfev             n/a           
               rosenbrock_nograd     trust-exact     mean_time             n/a           
               rosenbrock_nograd     trust-krylov    mean_nfev             n/a           
               rosenbrock_nograd     trust-krylov    mean_time             n/a           
               rosenbrock_nograd     trust-constr    mean_nfev            216.0          
               rosenbrock_nograd     trust-constr    mean_time     0.07378766536712647   

@andyfaff (Contributor) left a comment


I like the changes in here. If there's one nit I have, it's that there are several copies.
I wonder if it's better to:

  • create x1 = x0.copy() and x2 = x0.copy() before the loop, and
  • at the end of the loop, set x1[i] = x2[i] = x0[i] to reset the values.

For large values of h.size this avoids h.size - 2 copies.

then do h_vecs *= 0
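For the 3-point case, the two bullets above amount to something like this (a hedged sketch, not the actual diff; J, f, and h are stand-ins):

x1 = x0.copy()
x2 = x0.copy()
for i in range(h.size):
    x1[i] += h[i]              # x0 + h[i] * e_i
    x2[i] -= h[i]              # x0 - h[i] * e_i
    J[i] = (f(x1) - f(x2)) / (x1[i] - x2[i])
    x1[i] = x2[i] = x0[i]      # reset in place instead of re-copying x0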

@lucascolley added the enhancement label May 5, 2024
@lucascolley changed the title from "Avoid quadratic memory usage during approx_derivative" to "ENH: optimize.approx_derivative: avoid quadratic memory usage" May 5, 2024
@lucascolley changed the title to "ENH: optimize.approx_fprime: avoid quadratic memory usage" May 5, 2024
@nickodell (Contributor, Author) commented:

I like the changes in here. If there's one nit I have, it's that there are several copies. I wonder if it's better to:

  • create x1 = x0.copy() and x2 = x0.copy() before the loop, and

  • at the end of the loop, set x1[i] = x2[i] = x0[i] to reset the values.

I wrote two versions of this idea.

The first version avoids copying the vectors inside the loop by setting up three vectors (x1, x2, and a complex copy of x1) once at the start and reusing them on every iteration. This is branch njo-numdiff-sep-vars.

I also tried a version that moves the if method == ...: test out of the loop, so that only the vectors which are actually needed get set up. This is branch njo-numdiff-loop-exchange. However, it's a longer and more complex change (a compressed sketch follows).
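For contrast, the loop-exchange shape looks roughly like this (hypothetical; the real code covers more methods and handles bounds). The method test runs once instead of N times, and each branch sets up only the vectors it needs:

if method == '2-point':
    x1 = x0.copy()
    for i in range(h.size):
        x1[i] += h[i]
        J[i] = (f(x1) - f0) / (x1[i] - x0[i])
        x1[i] = x0[i]
elif method == '3-point':
    # analogous loop reusing x1 and x2, as sketched earlier
    ...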

I benchmarked the original approach and both alternatives using the call

from scipy.optimize._numdiff import approx_derivative
approx_derivative(np.linalg.norm, np.ones(N), method=method)

for each branch, method, and number of dimensions N. I ran the call between 5 and 50000 times, depending on N, and measured the average time; I then repeated the whole benchmark 20 times to capture run-to-run variation.
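A minimal version of such a timing harness might look like this (hypothetical; the author's actual scripts are linked under "Raw results" below). Note that approx_derivative lives in the private module scipy.optimize._numdiff:

import time
import numpy as np
from scipy.optimize._numdiff import approx_derivative

def bench(method, N, reps):
    x0 = np.ones(N)
    start = time.perf_counter()
    for _ in range(reps):
        approx_derivative(np.linalg.norm, x0, method=method)
    return (time.perf_counter() - start) / reps  # mean seconds per call

for method in ('2-point', '3-point', 'cs'):
    print(method, bench(method, N=1000, reps=100))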

Plots

[Timing plots for N = 1, 10, 100, 1000, and 10000, plus a version of the 10000-dimension plot excluding main; images not reproduced here.]

Conclusion

Based on these plots, I find the following:

  • At all problem sizes, njo-numdiff-loop-exchange beats njo-numdiff-sep-vars.
  • For some small problems, the original njo-numdiff-mem-usage beats the other two approaches I coded.
  • For larger problem sizes, njo-numdiff-loop-exchange and njo-numdiff-sep-vars are much faster than njo-numdiff-mem-usage.

So, there's a tradeoff here between algorithm complexity and speed. Which approach would you prefer?

  1. Copy variables in each loop. (njo-numdiff-mem-usage)
  2. Set up all variables in advance. (njo-numdiff-sep-vars)
  3. Move if outside loop. (njo-numdiff-loop-exchange)
  4. Something else.

Raw results

The benchmark scripts and CSV of raw timings are available here, if you want to do your own analysis.

then do h_vecs *= 0

I don't understand this part of the feedback. I removed this variable. Can you explain?

@andyfaff (Contributor) commented May 6, 2024

From those graphs it looks like any of the three options is performant enough. That means the best solution is the one that's easiest to read and maintain; for me, that's njo-numdiff-sep-vars.

Sorry about the h_vecs comment, that's a bit of cruft from a draft that I didn't delete. So just ignore that.

@andyfaff (Contributor) commented May 6, 2024

This looks good to me. Thank you very much for the PR; is it your first? Congratulations.

I lurk on Stack Overflow and notice that you answer a lot of questions. Thank you for doing that; it's helpful for the user community, and helpful for the project.

@andyfaff merged commit ffa78f3 into scipy:main on May 6, 2024
31 checks passed
@nickodell (Contributor, Author) commented:

This looks good to me. Thank you very much for the PR; is it your first? Congratulations.

It's my first in SciPy, but I've made PRs for other open source projects before.

I lurk on Stack Overflow and notice that you answer a lot of questions. Thank you for doing that; it's helpful for the user community, and helpful for the project.

Glad to help! :)

@j-bowhay added this to the 1.14.0 milestone May 6, 2024