
Memory not being returned to OS on calling C.LGBM_BoosterFree #6421

Open
alzio2607 opened this issue Apr 17, 2024 · 15 comments

@alzio2607 commented Apr 17, 2024

How we are using LightGBM:

We are using the LightGBM c_api in our model hosting service, which is written in Golang. We've written a CGO wrapper around the c_api and use the "lib_lightgbm.so" library file provided on GitHub.

Version Used :
3.1.1

Environment info:

Operating System: Observed on both Linux and MacOS

Architecture: x86_64

CPU model: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz

C++ compiler version: gcc version 7.2.0 (Debian 7.2.0-1)

CMake version: 3.9.0

GLIBC version: ldd (Debian GLIBC 2.28-10) 2.28

Context:

We load a few LightGBM models in our model hosting service and refresh them as soon as new ones are available. The models are loaded via the LGBM_BoosterLoadModelFromString method provided by the API, and the older models are released with LGBM_BoosterFree.
We are hosting this service on GKE pods which have a fixed amount of memory.
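
For reference, this is roughly what the wrapper's load and release calls look like. This is a simplified sketch with illustrative names, not the exact code from the linked repo; the cgo flags and include paths depend on where LightGBM is installed.

package lgbm

/*
#cgo LDFLAGS: -L${SRCDIR} -l_lightgbm
// Adjust the -L/-I flags for wherever lib_lightgbm.so and c_api.h live on your system.
#include <stdlib.h>
#include "LightGBM/c_api.h"
*/
import "C"

import "unsafe"

var predictor C.BoosterHandle

// LoadModel creates a booster from an in-memory model string via the C API.
func LoadModel(modelString string) {
	cstr := C.CString(modelString)
	defer C.free(unsafe.Pointer(cstr)) // free the C copy of the string; LightGBM parses it during the call
	var numIterations C.int
	if int(C.LGBM_BoosterLoadModelFromString(cstr, &numIterations, &predictor)) == -1 {
		panic("LightGBM C_API : Failed to load the LightGBM model")
	}
}

// ReleaseModel frees the booster of the previously loaded model.
func ReleaseModel() {
	if int(C.LGBM_BoosterFree(predictor)) == -1 {
		panic("LightGBM C_API : Failed to release the memory of the LightGBM model")
	}
}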

Issue:

We're seeing a gradual uptick in the RSS (Resident Set Size) of the service whenever a model is refreshed. We measure RSS via Prometheus, which exposes process_resident_memory_bytes. This indicates
that not all of the memory is freed when LGBM_BoosterFree is called. It is causing our service pods to go down due to OOM, and the lifetime of the pods, and in turn service health, has taken a hit.

To rule out the suspicion that the Golang part of the code is contributing to the RSS growth, we looked at the heap, and the heap does come back to its pre-load value after a model is released.

To further confirm that Go is not the issue, we ran an experiment where we continuously free and reload the same model over and over. We do not load anything else in the service except the model file (no metadata or anything of that sort).
We observed a staircase pattern in the RSS metric:

[screenshot: RSS metric showing a staircase pattern]

For this experiment, the heap looked like this:

[screenshot: heap profile for the same experiment]

This confirms that something is going on in the C API when it frees the memory it takes for the model. I started digging into the code, but the memory seems to be managed appropriately using unique pointers and destructors.

Then I stripped the situation down to its bare minimum: I load a model string (which is on disk), record the RSS, release the model, and record the RSS again. I do this multiple times, forcing the GC to run after every action and waiting 20 seconds for the RSS to settle so I get exact values. The results are a bit weird: RSS comes back down to the earlier value in some iterations but not in others, and the net result is a gradual increase.

Initial: 1524
C_API Model Loaded: 4576
Model Released: 1988

Initial: 1988
C_API Model Loaded: 5659
Model Released: 4012

Initial: 4012
C_API Model Loaded: 7263
Model Released: 5509

Initial: 5509
C_API Model Loaded: 9209
Model Released: 7561

Initial: 7561
C_API Model Loaded: 9281
Model Released: 7603

Initial: 7603
C_API Model Loaded: 9336
Model Released: 7658

Initial: 7658
C_API Model Loaded: 11329
Model Released: 9681

Initial: 9681
C_API Model Loaded: 11372
Model Released: 9724

Initial: 9724
C_API Model Loaded: 13395
Model Released: 11748

Initial: 11748
C_API Model Loaded: 13476
Model Released: 11828
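
For clarity, the loop producing these numbers looks roughly like this. It is only a sketch: LoadModel/ReleaseModel are the wrapper calls sketched earlier, getModelAsString reads the model file from disk, and readRSS is an illustrative Linux-only helper; the real code is in the linked repo.

package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
	"time"
)

// readRSS returns the process resident set size in KiB by parsing /proc/self/statm.
func readRSS() int {
	data, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		return -1
	}
	// the second field of statm is the number of resident pages
	pages, _ := strconv.Atoi(strings.Fields(string(data))[1])
	return pages * os.Getpagesize() / 1024
}

func main() {
	modelString := getModelAsString() // read the model file from disk once, at initialization

	for i := 0; i < 10; i++ {
		fmt.Println("Initial:", readRSS())

		LoadModel(modelString)
		runtime.GC()                 // force a GC cycle after every action
		time.Sleep(20 * time.Second) // wait for the RSS to settle
		fmt.Println("C_API Model Loaded:", readRSS())

		ReleaseModel()
		runtime.GC()
		time.Sleep(20 * time.Second)
		fmt.Println("Model Released:", readRSS())
	}
}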

I am at a loss on how to figure this out. Is it something about Go that I am missing?

Minimal Reproducible Example:

Attaching a minimal reproducible example. Since this example cannot be a simple snippet of code, I have linked it in my Git repo.

@jameslamb (Collaborator)

Thanks for using LightGBM.

Is it possible to upgrade to a newer version of LightGBM? v3.1.1 is 3.5 years old at this point (link), so you're missing out on 3.5 years of fixes and changes, including some related to memory efficiency and fixed memory leaks.

@alzio2607 (Author)

Thank you for the prompt reply @jameslamb. I would really like to upgrade the LightGBM version, but working in a production environment, the switch is not easy for us: our training backend uses the same LightGBM version, and we'd need to do a whole lot of testing to make the switch. Moreover, the same version works fine in a Java environment, so I would really like to figure out why Golang is causing an issue.

However, I'll try using the latest version and report back the results.

@jameslamb (Collaborator)

Thanks! Excellent report by the way, this was really interesting to read.

I have one other theory. I know very little about Go, but looking at the code you provided, it seems like you are loading the model into a string and then passing that string into LightGBM... but never freeing the string.

modelString = getModelAsString() 

(code link)

Your ReleaseMemory() method leaves that modelString unmodified.

func ReleaseMemory() {
	res := int(C.LGBM_BoosterFree(predictor))

	if res == -1 {
		panic("LightGBM C_API : Failed to release the memory of the LightGBM model")
	}
}

Could you try freeing the string?
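
For example, something like this (just a sketch; I don't know your wrapper's exact structure):

func ReleaseMemory() {
	res := int(C.LGBM_BoosterFree(predictor))

	if res == -1 {
		panic("LightGBM C_API : Failed to release the memory of the LightGBM model")
	}

	modelString = "" // drop the Go-side reference so the garbage collector can reclaim the string
}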

Alternatively, if the model file is available in the container's local filesystem, you can skip the step of reading it into a string and load it directly from the file with LGBM_BoosterCreateFromModelFile(). That was available in v3.1.1.

https://github.com/microsoft/LightGBM/blob/v3.1.1/include/LightGBM/c_api.h#L421-L423
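
A rough sketch of what that could look like in a wrapper like yours (illustrative only; numIterations/predictor follow the naming of your existing code):

// Load the booster directly from a model file on disk.
func LoadModelFromFile(path string) {
	cPath := C.CString(path)
	defer C.free(unsafe.Pointer(cPath))

	var numIterations C.int
	if int(C.LGBM_BoosterCreateFromModelFile(cPath, &numIterations, &predictor)) == -1 {
		panic("LightGBM C_API : Failed to load the LightGBM model from file")
	}
}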

@alzio2607 commented Apr 17, 2024

@jameslamb Yes, I am loading the model string into a global variable, but I do that once at initialization, and my base (Initial) RSS is measured after that point. So in the output, Initial: 1524 refers to the RSS value after the model string has been loaded once. I knew about the LGBM_BoosterCreateFromModelFile() method, but I loaded from a string to emulate exactly what we do in production.

Also, I just tried with the latest version and I am seeing the same behavior. So it could possibly be how things are handled on the Golang side. Is there any documentation that allows one to learn how to properly write a C Wrapper for the LightGBM C API?

@jameslamb (Collaborator)

Is there any documentation that allows one to learn how to properly write a C Wrapper for the LightGBM C API?

There is not really, sorry. That feature request is tracked in #6261.

There are some suggestions in this comment: #6261 (comment).

Also, I just tried with the latest version and I am seeing the same. So, it could possibly be how things are handled at Golang side.

That's helpful, thank you!

Every release of LightGBM is tested with valgrind and leak sanitizer, and I'm fairly sure that those tests cover code paths that use LGBM_BoosterLoadModelFromString(). I'd expect those tests to catch any significant memory leaks.

So yes, based on what you've shared so far, I strongly suspect this is about some interaction between Go's memory management and how your Go code is calling into LightGBM, not a leak from LightGBM itself. But I don't know that for sure.

We can leave this open for a bit, maybe someone else will be able to provide other theories.

@alzio2607 commented Apr 17, 2024

Okay, sure. It will be really helpful if someone can shed light on the Go side of the implementation.

@jameslamb (Collaborator)

We don't have any Go code in this repo or anyone doing active Go development (as far as I know).

Maybe https://github.com/dmitryikh/leaves can help.

If you do find a solution, please do come back and post it here. And thanks again for the EXCELLENT write-up and reproducible example!

@alzio2607 (Author)

You're welcome! From the write-up of leaves, it looks like it focuses on the prediction part. I'll look into it, though. Thanks!

@jameslamb (Collaborator)

I see you've posted a related question over on Stack Overflow: https://stackoverflow.com/questions/78347190/is-this-the-correct-way-to-use-cgo-to-call-c-code-via-go

Totally fine, since it's not exactly the same as the one you posted here.

I'm linking it here so anyone finding this GitHub issue from search will also find that related Stack Overflow question. I recommend you link back to this issue from your Stack Overflow post, for the same reason.

@jameslamb (Collaborator)

Oh actually it looks like you posted two separate Stack Overflow posts about this same topic?

https://stackoverflow.com/questions/78347190/is-this-the-correct-way-to-use-cgo-to-call-c-code-via-go

https://stackoverflow.com/questions/78343407/how-to-use-cgo-properly-to-avoid-memory-leaks-working-with-lightgbm-c-api-from

Please don't do this. Spreading the same conversation across so many places, especially without linking between them, means that different people might spend time answering your question in different places without knowing that others are discussing it elsewhere. That makes it more difficult for others facing the same problem to benefit from the conversation when they arrive from search engines, and it risks wasting the time of some of the people involved.

Since it seems like https://stackoverflow.com/questions/78343407/how-to-use-cgo-properly-to-avoid-memory-leaks-working-with-lightgbm-c-api-from is getting answers, I think you should close https://stackoverflow.com/questions/78347190/is-this-the-correct-way-to-use-cgo-to-call-c-code-via-go and link to that.

@alzio2607 (Author)

Hey. Yes, I raised the issue on Stack Overflow as well. Since there are very few people using LightGBM from Go, I wanted to reach a broader audience to have a better chance at resolving this. Although I did go overboard by creating 2 posts; I've removed the second one.

@alzio2607 commented Apr 22, 2024

@jameslamb we dug deeper into this and have come to the conclusion that this issue is the result of an interaction between the Golang garbage collector and the threads created by the C_API.

Here are our findings:

-- First, we wrote a pure C++ script that loaded and released the model via the c_api repeatedly. The RSS was constant throughout the run. We noticed that the process creates as many threads as there are cores available on the machine.

-- Second, since the LightGBM c_api uses the OpenMP interface to manage threads, we fixed the number of threads to 8 by setting the env var OMP_NUM_THREADS. The RSS was still stable, and the process created the specified number of threads and did not spawn more.

[screenshot: stable RSS and fixed thread count with OMP_NUM_THREADS=8 in the pure C++ run]

We believe that the c_api keeps a certain amount of buffer memory with the threads in order to reuse it when required, which is the correct thing to do as an optimization.

These two experiments ruled out the suspicion that the c_api is mismanaging memory.

-- Third, we loaded the same model via a Golang script. We force-triggered GC after every cycle of load and release, and again fixed the number of threads for the process to 8 using OMP_NUM_THREADS=8. This is where things got interesting.
When the GC ran, it transiently zombied out the live threads, and the c_api had to request a new set of threads from the OS in order to load the model. This led to an increase in the total number of threads in the process. It is also worth noting that whenever the number of threads remains unchanged after a cycle, the RSS is stable as well.

[screenshot: thread count and RSS increasing after GC runs in the Go version]

The zombied threads still held on to their buffer memory, hence the gradual increase in RSS. From the following image, it's clear that only 8 threads are in use; the rest are just holding memory.

[screenshot: thread list showing only 8 threads in use, the rest idle but holding memory]

-- Fourth, we stopped force-triggering GC and switched off automatic GC by calling debug.SetGCPercent(-1) in the Golang code (see the sketch below). The result was a stable RSS with the Golang code!

[screenshot: flat RSS after disabling automatic GC]
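
For reference, disabling the GC for this experiment only required the standard runtime/debug call:

import "runtime/debug"

func init() {
	// A negative value disables automatic garbage collection entirely.
	debug.SetGCPercent(-1)
}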

Conclusion: Golang's garbage collection is somehow interfering with the c_api's thread management. The C code loses control over its threads, which hold buffer memory; when those threads are zombied, that memory is effectively lost and keeps contributing to RSS.

Let me know if this issue should still be a part of this repo. I am anyway heading over to the Golang repo to highlight this and find a way to resolve it.

@jameslamb (Collaborator)

Wow, interesting!!!

Thank you for this excellent write-up. It'll take me some time to think through it carefully, but I do have one piece of advice to share that might be useful.

You can build LightGBM with OpenMP support completely turned off like this:

cmake -B build -S . -DUSE_OPENMP=OFF
cmake --build build --target _lightgbm

When using LightGBM to generate predictions, OpenMP is helpful for parallelizing over rows in the input data. It's also helpful in parallelizing the work of initially loading the model from a text file.

But I don't believe OpenMP actually speeds up anything in LightGBM for single-row predictions. So if your service is only generating predictions on a single row at a time, and you're finding that some interaction with LightGBM's use of OpenMP is leading to a memory leak, I recommend using a version of LightGBM built without OpenMP support.

That'll prevent that initial pool of threads from being allocated, and it'll make the behavior of the program invariant to settings like the OMP_NUM_THREADS environment variable.

@alzio2607 commented Apr 26, 2024

@jameslamb Hi, we did try the threadless version and saw an improvement initially, but over time the memory swells up just like with the threaded version.

That said, we seem to have solved the issue. We schedule our model refreshes in a goroutine, so all the calls to the C API are made inside this goroutine. We think the C API allocates some buffers on the native OS thread underlying this goroutine; during GC and the sleeps we add between model refreshes, that native thread is yielded, and so are the buffers (hence the zombied memory).

We used a function provided by Go's runtime package, runtime.LockOSThread(), to permanently pin the model-loader goroutine to a particular native thread.
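
Roughly, the change looks like this (a sketch; modelUpdates and the LoadModel/ReleaseModel wrapper calls are illustrative):

go func() {
	// Pin this goroutine, and therefore every C API call it makes,
	// to a single native OS thread for its whole lifetime.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	for newModel := range modelUpdates { // hypothetical channel delivering refreshed model strings
		ReleaseModel()      // free the previous booster
		LoadModel(newModel) // load the refreshed model
	}
}()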

We are seeing improvements and a much flatter RSS graph now.

[screenshot: flatter RSS graph after pinning the loader goroutine]

Would just like to request that you confirm whether we're correct in assuming that the library makes some allocations locally within the thread.

@jameslamb (Collaborator)

Ah interesting, thanks for sharing that! I personally am not familiar with Go or how its garbage collection works, so I can't offer any feedback on that approach. Glad to hear you've found changes that stabilize your applications, though!

Would just like to request that you confirm whether we're correct in assuming that the library makes some allocations locally within the thread.

I'm not certain, but I think the answer is "yes". I believe the std::vectors in this block (which runs during some prediction codepaths) will be thread-local allocations:

if (objective_function_ != nullptr) {
  #pragma omp parallel for num_threads(OMP_NUM_THREADS()) schedule(static)
  for (data_size_t i = 0; i < num_data; ++i) {
    std::vector<double> tree_pred(num_tree_per_iteration_);
    for (int j = 0; j < num_tree_per_iteration_; ++j) {
      tree_pred[j] = raw_scores[j * num_data + i];
    }
    std::vector<double> tmp_result(num_class_);
    objective_function_->ConvertOutput(tree_pred.data(), tmp_result.data());
    for (int j = 0; j < num_class_; ++j) {
      out_result[j * num_data + i] = static_cast<double>(tmp_result[j]);
    }
  }
}

Those are never explicitly freed in LightGBM's code... LightGBM is just declaring them on the stack and trusting they'll be cleaned up automatically.
