
ENH: optimize.approx_fprime: avoid quadratic memory usage #20645

Merged
merged 2 commits into scipy:main on May 6, 2024

Conversation

@nickodell (Contributor) commented May 5, 2024

Currently, the code generates all of the perturbation vectors for x0 at once. When optimizing a function of many variables, this uses a quadratic amount of memory, which is wasteful. This change reduces memory use by computing those vectors one at a time.
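For illustration, here is a minimal sketch of the two shapes (hypothetical function names; not the actual SciPy code, which also supports 3-point and complex-step differences):

import numpy as np

def fprime_all_at_once(f, x0, h):
    # Old shape: materialize an (N, N) matrix of perturbation vectors,
    # one row per coordinate -- N * N * 8 bytes for float64.
    h_vecs = np.diag(h)
    f0 = f(x0)
    return np.array([(f(x0 + h_vecs[i]) - f0) / h[i] for i in range(x0.size)])

def fprime_one_at_a_time(f, x0, h):
    # New shape: a single reusable work vector, so memory stays O(N).
    f0 = f(x0)
    grad = np.empty_like(x0)
    x1 = x0.copy()
    for i in range(x0.size):
        x1[i] += h[i]
        grad[i] = (f(x1) - f0) / h[i]
        x1[i] = x0[i]  # reset this coordinate before the next iteration
    return grad

Both compute the same forward-difference gradient; only the memory footprint differs.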

This is my first change to SciPy; I appreciate any feedback.

Reference issue

There is no previously filed issue for this problem.

What does this implement/fix?

Here are two examples from Stack Overflow of users running into this problem:

https://stackoverflow.com/questions/77867437/scipy-minimize-returns-an-array-too-big

https://stackoverflow.com/questions/78399838/when-using-scipy-minimize-a-large-array-appears-which-was-not-originally-there

To summarize, both of these users are trying to minimize a function of ~90000 variables, and SciPy then tries to allocate an array of N ** 2 * 8 bytes to compute the derivative. That seems strange, since the gradient can be computed with a linear amount of memory. There is also a solver, L-BFGS-B, that is capable of minimizing a function of this size, so doing this in linear memory is of more than academic interest.
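A quick back-of-the-envelope check of the scale, using the N from those questions:

N = 90_000
print(f"{N**2 * 8 / 2**30:.1f} GiB")  # ~60.3 GiB for the dense matrix of perturbation vectors
print(f"{N * 8 / 2**10:.1f} KiB")     # ~703.1 KiB for a single perturbation vector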

Memory usage benchmarks

I have two goals for this change:

  • Reduce memory use
  • Avoid a regression in CPU usage

Here is a benchmark showing the difference in memory usage:

import numpy as np
from scipy.optimize import minimize
import resource
import time


def get_peak_mem():
    # ru_maxrss is the peak resident set size, reported in KiB on Linux
    kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    mb = kb / 1024
    return mb

def func(x, target):
    return np.mean((target - x)**2)


x0 = np.zeros(10000)
target = np.ones_like(x0)
start = time.time()
res = minimize(
    func,
    x0,
    method='L-BFGS-B',
    options=dict(maxfun=np.inf),
    args=(target,),  # extra arguments are documented to be a tuple
)
end = time.time()
print(res)
print(f"Usage: {get_peak_mem():.2f} MiB")
print(f"Duration: {end - start:.2f}s")

(Warning: memory usage will be wrong on OS X.)
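If you want usable numbers on macOS too, here is a drop-in replacement for get_peak_mem, relying on the documented platform difference that ru_maxrss is reported in bytes on macOS but in kilobytes on Linux:

import sys

def get_peak_mem():
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        rss /= 1024  # macOS reports ru_maxrss in bytes
    return rss / 1024  # Linux reports KiB, so this yields MiB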

Result, running on this branch

  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 2.5149038026498236e-10
        x: [ 1.000e+00  1.000e+00 ...  1.000e+00  1.000e+00]
      nit: 2
      jac: [-3.171e-09 -3.171e-09 ... -3.171e-09 -3.171e-09]
     nfev: 50005
     njev: 5
 hess_inv: <10000x10000 LbfgsInvHessProduct with dtype=float64>
Usage: 81.21 MiB
Duration: 2.00s

Result, running on current main (4a537b7)

  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 2.5149038026498236e-10
        x: [ 1.000e+00  1.000e+00 ...  1.000e+00  1.000e+00]
      nit: 2
      jac: [-3.171e-09 -3.171e-09 ... -3.171e-09 -3.171e-09]
     nfev: 50005
     njev: 5
 hess_inv: <10000x10000 LbfgsInvHessProduct with dtype=float64>
Usage: 120.34 MiB
Duration: 3.67s

(Interestingly, this is less memory than the formula N ** 2 * 8 bytes predicts: for N = 10000 that would be ~800 MB. I believe this is because Linux allocates pages lazily. Each row of the matrix spans roughly twenty 4 KiB pages, but only the page containing the row's single non-zero value is ever written, so only ~10000 pages (~40 MB) are actually backed by physical memory. That matches the observed difference of about 39 MiB between the two runs.)
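One way to observe the lazy-allocation effect directly (a Linux-specific sketch; exact numbers will vary):

import numpy as np
import resource

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
a = np.zeros((10000, 10000))   # 800 MB of virtual address space, no pages touched yet
np.fill_diagonal(a, 1.0)       # writes one element per row, touching ~one 4 KiB page each
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS grew by ~{(after - before) / 1024:.0f} MiB, not 800 MiB")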

CPU Benchmarks

I also attempted to benchmark this with the built-in benchmarks, but could not figure out how to use --compare (see airspeed-velocity/asv#1401). Benchmarking without --compare, on the benchmarks where no derivative is supplied, this change is either faster or slower by only ~1%.

Optimize benchmarks, my branch:

               rosenbrock_nograd        COBYLA       mean_nfev            1000.0         
               rosenbrock_nograd        COBYLA       mean_time     0.01873011589050293   
               rosenbrock_nograd        Powell       mean_nfev            887.2          
               rosenbrock_nograd        Powell       mean_time     0.01709756851196289   
               rosenbrock_nograd     nelder-mead     mean_nfev            310.2          
               rosenbrock_nograd     nelder-mead     mean_time    0.0070003032684326175  
               rosenbrock_nograd       L-BFGS-B      mean_nfev            148.4          
               rosenbrock_nograd       L-BFGS-B      mean_time     0.00667574405670166   
               rosenbrock_nograd         BFGS        mean_nfev            176.8          
               rosenbrock_nograd         BFGS        mean_time      0.0107954740524292   
               rosenbrock_nograd          CG         mean_nfev             n/a           
               rosenbrock_nograd          CG         mean_time             n/a           
               rosenbrock_nograd         TNC         mean_nfev             n/a           
               rosenbrock_nograd         TNC         mean_time             n/a           
               rosenbrock_nograd        SLSQP        mean_nfev            155.1          
               rosenbrock_nograd        SLSQP        mean_time     0.008312082290649414  
               rosenbrock_nograd      Newton-CG      mean_nfev             n/a           
               rosenbrock_nograd      Newton-CG      mean_time             n/a           
               rosenbrock_nograd        dogleg       mean_nfev             n/a           
               rosenbrock_nograd        dogleg       mean_time             n/a           
               rosenbrock_nograd      trust-ncg      mean_nfev             n/a           
               rosenbrock_nograd      trust-ncg      mean_time             n/a           
               rosenbrock_nograd     trust-exact     mean_nfev             n/a           
               rosenbrock_nograd     trust-exact     mean_time             n/a           
               rosenbrock_nograd     trust-krylov    mean_nfev             n/a           
               rosenbrock_nograd     trust-krylov    mean_time             n/a           
               rosenbrock_nograd     trust-constr    mean_nfev            216.0          
               rosenbrock_nograd     trust-constr    mean_time     0.07394084930419922   

Main branch

               rosenbrock_nograd        COBYLA       mean_nfev            1000.0         
               rosenbrock_nograd        COBYLA       mean_time     0.01912097930908203   
               rosenbrock_nograd        Powell       mean_nfev            887.2          
               rosenbrock_nograd        Powell       mean_time     0.018049192428588868  
               rosenbrock_nograd     nelder-mead     mean_nfev            310.2          
               rosenbrock_nograd     nelder-mead     mean_time     0.007235574722290039  
               rosenbrock_nograd       L-BFGS-B      mean_nfev            148.4          
               rosenbrock_nograd       L-BFGS-B      mean_time     0.00679473876953125   
               rosenbrock_nograd         BFGS        mean_nfev            176.8          
               rosenbrock_nograd         BFGS        mean_time     0.010858583450317382  
               rosenbrock_nograd          CG         mean_nfev             n/a           
               rosenbrock_nograd          CG         mean_time             n/a           
               rosenbrock_nograd         TNC         mean_nfev             n/a           
               rosenbrock_nograd         TNC         mean_time             n/a           
               rosenbrock_nograd        SLSQP        mean_nfev            155.1          
               rosenbrock_nograd        SLSQP        mean_time     0.009046149253845216  
               rosenbrock_nograd      Newton-CG      mean_nfev             n/a           
               rosenbrock_nograd      Newton-CG      mean_time             n/a           
               rosenbrock_nograd        dogleg       mean_nfev             n/a           
               rosenbrock_nograd        dogleg       mean_time             n/a           
               rosenbrock_nograd      trust-ncg      mean_nfev             n/a           
               rosenbrock_nograd      trust-ncg      mean_time             n/a           
               rosenbrock_nograd     trust-exact     mean_nfev             n/a           
               rosenbrock_nograd     trust-exact     mean_time             n/a           
               rosenbrock_nograd     trust-krylov    mean_nfev             n/a           
               rosenbrock_nograd     trust-krylov    mean_time             n/a           
               rosenbrock_nograd     trust-constr    mean_nfev            216.0          
               rosenbrock_nograd     trust-constr    mean_time     0.07378766536712647   

@andyfaff (Contributor) left a comment


I like the changes in here. If there's one nit I have, it's that there are several copies.
I wonder if it's better to:

  • create x1 = x0.copy() and x2 = x0.copy() before the loop, and
  • at the end of the loop, set x1[i] = x2[i] = x0[i] to reset the values.

For large values of h.size this avoids h.size - 2 copies.

then do h_vecs *= 0
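For the 3-point case, the two bullets above amount to something like this (a hedged sketch, not the actual diff; J, f, and h are stand-ins):

x1 = x0.copy()
x2 = x0.copy()
for i in range(h.size):
    x1[i] += h[i]              # x0 + h[i] * e_i
    x2[i] -= h[i]              # x0 - h[i] * e_i
    J[i] = (f(x1) - f(x2)) / (x1[i] - x2[i])
    x1[i] = x2[i] = x0[i]      # reset in place instead of re-copying x0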

@lucascolley added the enhancement label May 5, 2024
@lucascolley changed the title from "Avoid quadratic memory usage during approx_derivative" to "ENH: optimize.approx_derivative: avoid quadratic memory usage" May 5, 2024
@lucascolley changed the title to "ENH: optimize.approx_fprime: avoid quadratic memory usage" May 5, 2024
@nickodell (Contributor, Author) commented:

I like the changes in here. If there's one nit I have, it's that there are several copies. I wonder if it's better to:

  • create x1 = x0.copy() and x2 = x0.copy() before the loop, and

  • at the end of the loop, set x1[i] = x2[i] = x0[i] to reset the values.

I wrote two versions of this idea.

The first version avoids copying the vectors inside the loop by setting up three vectors (x1, x2, and a complex copy of x1) once at the start and reusing them on every iteration. This is branch njo-numdiff-sep-vars.

I also tried a version that moves the if method == ...: test out of the loop, so that only the vectors which are actually needed get set up. This is branch njo-numdiff-loop-exchange. However, it's a longer and more complex change (a compressed sketch follows).
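For contrast, the loop-exchange shape looks roughly like this (hypothetical; the real code covers more methods and handles bounds). The method test runs once instead of N times, and each branch sets up only the vectors it needs:

if method == '2-point':
    x1 = x0.copy()
    for i in range(h.size):
        x1[i] += h[i]
        J[i] = (f(x1) - f0) / (x1[i] - x0[i])
        x1[i] = x0[i]
elif method == '3-point':
    # analogous loop reusing x1 and x2, as sketched earlier
    ...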

I benchmarked the original approach and both alternatives using the call

from scipy.optimize._numdiff import approx_derivative
approx_derivative(np.linalg.norm, np.ones(N), method=method)

for each branch, method, and number of dimensions N. I ran the call between 5 and 50000 times, depending on N, and measured the average time; I then repeated the whole benchmark 20 times to capture run-to-run variation.
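A minimal version of such a timing harness might look like this (hypothetical; the author's actual scripts are linked under "Raw results" below). Note that approx_derivative lives in the private module scipy.optimize._numdiff:

import time
import numpy as np
from scipy.optimize._numdiff import approx_derivative

def bench(method, N, reps):
    x0 = np.ones(N)
    start = time.perf_counter()
    for _ in range(reps):
        approx_derivative(np.linalg.norm, x0, method=method)
    return (time.perf_counter() - start) / reps  # mean seconds per call

for method in ('2-point', '3-point', 'cs'):
    print(method, bench(method, N=1000, reps=100))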

Plots

[Timing plots for N = 1, 10, 100, 1000, and 10000, plus a version of the 10000-dimension plot excluding main; images not reproduced here.]

Conclusion

Based on these plots, I find the following:

  • At all problem sizes, njo-numdiff-loop-exchange beats njo-numdiff-sep-vars.
  • For some small problems, the original njo-numdiff-mem-usage beats the other two approaches I coded.
  • For larger problem sizes, njo-numdiff-loop-exchange and njo-numdiff-sep-vars are much faster than njo-numdiff-mem-usage.

So, there's a tradeoff here between algorithm complexity and speed. Which approach would you prefer?

  1. Copy variables in each loop. (njo-numdiff-mem-usage)
  2. Set up all variables in advance. (njo-numdiff-sep-vars)
  3. Move if outside loop. (njo-numdiff-loop-exchange)
  4. Something else.

Raw results

The benchmark scripts and CSV of raw timings are available here, if you want to do your own analysis.

then do h_vecs *= 0

I don't understand this part of the feedback. I removed this variable. Can you explain?

@andyfaff (Contributor) commented May 6, 2024

From those graphs it looks like any of the three options is performant enough. That means the best solution is the one that's easiest to read and maintain; for me, that's njo-numdiff-sep-vars.

Sorry about the h_vecs comment, that's a bit of cruft from a draft that I didn't delete. So just ignore that.

@andyfaff (Contributor) commented May 6, 2024

This looks good to me. Thank you very much for the PR; is it your first? Congratulations.

I lurk on Stack Overflow and notice that you answer a lot of questions. Thank you for doing that; it's helpful for the user community, and helpful for the project.

@andyfaff merged commit ffa78f3 into scipy:main on May 6, 2024
31 checks passed
@nickodell (Contributor, Author) commented:

This looks good to me. Thank you very much for the PR; is it your first? Congratulations.

It's my first in SciPy, but I've made PRs for other open source projects before.

I lurk on Stack Overflow and notice that you answer a lot of questions. Thank you for doing that; it's helpful for the user community, and helpful for the project.

Glad to help! :)

@j-bowhay added this to the 1.14.0 milestone May 6, 2024