Continuous memory leak #4843
Comments
Are you using Canvas workflows? Maybe #4839 is related. Also I assume you are using prefork pool for worker concurrency? |
Thanks georgepsarakis. I am not using workflows. |
The increase rate seems quite linear, quite weird. Is the worker processing tasks during this time period? Also, can you add a note with the complete command you are using to start the worker? |
Yes. The worker continues to process tasks normally. The worker is started with the following command.
|
This problem is occurring in both the production environment and the test environment. |
We need to understand what the worker is running during the time that the memory increase is observed. Any information and details you can possibly provide would definitely help. It is also good that you can reproduce this. |
Although this case occurred at a different time than the one in the graph, the following log was output at the moment the memory leak started.
It seems that it occurred when the connection to RabbitMQ was temporarily cut off. |
@marvelph so it occurs during RabbitMQ reconnections? Perhaps these issues are related: |
Yes. |
It looks like I'm having the same issue... It has been hard for me to find out what triggers it and why there is a memory leak; it has annoyed me for at least a month. I fell back to Celery 3 and everything is fine. For the memory leak issue, I'm using Ubuntu 16 and Celery 4.1.0 with RabbitMQ, deployed via Docker. The memory leak is in the MainProcess, not the ForkPoolWorker processes. The memory usage of the ForkPoolWorker processes is normal, but the memory usage of the MainProcess keeps increasing: roughly 0.1 MB is leaked every five seconds. The memory leak doesn't start immediately after the worker starts, but maybe after one or two days. I used gdb and pyrasite to inject into the running process and try to debug it, and I checked the log. Any hints for debugging this issue and finding out what really happens? Thanks. |
Since @marvelph mentioned it may relate to RabbitMQ reconnection, I tried stopping my RabbitMQ server. The memory usage did increase after each reconnection; the log follows. So I can confirm the celery/kombu#843 issue. But after the connection is re-established, the memory usage stops gradually increasing, so I'm not sure this is the reason for the memory leak. I will try using Redis to figure out whether this memory leak issue relates to RabbitMQ or not.
|
When I checked the logs, I found a reconnection log entry at the time of the memory leak, but there was also a case where a memory leak started when no reconnection occurred. Also, when I was using Celery 3.x, I did not encounter such a problem. |
@marvelph @dmitry-kostin could you please provide your exact configuration (omitting sensitive information of course) and possibly a task, or sample, that reproduces the issue? Also, do you have any estimate of the average uptime after which the worker memory increase starts appearing? |
The config is close to the defaults: imports = ('app.tasks',). The task is pretty simple and there is no complicated logic in it. I think I can reproduce the whole situation in a clean temporary project, but I have no free time for now; if I'm lucky I will try to put together a full example over the weekend. UPD |
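For anyone trying to follow along before that full example appears, a minimal skeleton consistent with the description above might look like this (the broker URL and the task body are illustrative assumptions, not @dmitry-kostin's actual code):

```python
# app/tasks.py -- hypothetical minimal task module
from celery import Celery

app = Celery("app", broker="amqp://guest:guest@localhost:5672//")
app.conf.update(imports=("app.tasks",))


@app.task
def simple_task(x, y):
    # Deliberately trivial: the report says there is no complicated logic here.
    return x + y
```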
@marvelph @dmitry-kostin @jxltom I noticed you use Python3. Would you mind enabling tracemalloc for the process? You may need to patch the worker process though to log memory allocation traces, let me know if you need help with that. |
@georgepsarakis You mean enable tracemalloc in worker and log stats, such as the top 10 memory usage files, at a specific interval such as 5 minutes? |
@jxltom I think something like that would help locate the part of code that is responsible. What do you think? |
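For anyone who wants to try this, here is a minimal sketch of the idea (not an actual patch from this thread; the 5-minute interval and the top-10 limit simply mirror the numbers mentioned above):

```python
import logging
import threading
import time
import tracemalloc

logger = logging.getLogger(__name__)


def start_tracemalloc_reporter(interval=300, limit=10):
    """Log the source lines with the largest allocations every `interval` seconds."""
    tracemalloc.start()

    def _report_loop():
        while True:
            time.sleep(interval)
            snapshot = tracemalloc.take_snapshot()
            for stat in snapshot.statistics("lineno")[:limit]:
                logger.info("tracemalloc: %s", stat)

    threading.Thread(target=_report_loop, daemon=True, name="tracemalloc-reporter").start()
```

Calling this from the worker's startup code (for example from a `celery.signals.worker_ready` handler, since the leak is reported in the main process) should give a rough picture of where allocations accumulate over time.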
@georgepsarakis I've tried using gdb and https://github.com/lmacken/pyrasite to inject into the leaking process and start debugging via tracemalloc. Here are the top 10 files with the highest memory usage. I use
Here is the difference between two snapshots after around 5 minutes.
|
Any suggestions on how to continue debugging this? I have no clue how to proceed. Thanks. |
I need a little time to extract a project for reproduction. These are the Celery settings.
The scheduler has the following settings.
On EC2, I am using supervisord to run it. |
@georgepsarakis |
@jxltom I bet tracemalloc at 5-minute intervals won't help to locate the problem |
I tried to find out whether similar problems occurred in other running systems. |
@dmitry-kostin What's the difference from the other two normal nodes? Are they both using the same RabbitMQ as broker? Since our discussion mentioned it may be related to RabbitMQ, I started another new node with the same configuration except that it uses Redis instead. So far, this node has no memory leak after running for 24 hours. I will post here if a memory leak appears later |
@marvelph So do you mean that the three systems with the memory leak are using Python 3 while the one that is fine is using Python 2? |
@jxltom no difference at all, and yes, they are on Python 3 with RabbitMQ as the broker and Redis as the backend |
@jxltom |
The pull request I made today with the fix for the Redis broker leaking memory (when connections to the broker fail) was just merged. I'm not aware of any other ways to reproduce memory leaks for #4843 at the moment. Here's a summary of the fixes so far: These fixes should completely prevent leaks due to disconnected connections to the broker:
And, if there are still some scenarios where that doesn't work... There are also these fixes that make
Thank you @auvipy for all the feedback and help with getting this stuff reviewed and merged. |
@pawl thanks to you and your team mates for the great collaboration & contributions. will push point releases with other merged changes next Sunday if not swallowed by family/holiday vibes. but next week for sure |
@auvipy Just to double-check, version 5.2.3 of celery that you pushed recently has the memory leak fixes, right? |
@caleb15 Celery 5.2.3 does have a minor leak fix I didn't mention in my comment above: #7187 But, I'm not sure that one is the main one that is generating the complaints in this thread. I think the main leak fixes are going to come from upgrading kombu to 5.2.3 (if you're using the redis broker) and py-amqp to 5.0.9 (if you're using py-amqp for connecting to rabbitmq). For more details, see: #4843 (comment) You may also want to check out this new section of the docs about handling memory leaks: https://docs.celeryproject.org/en/stable/userguide/optimizing.html#memory-usage |
@auvipy Were you able to confirm that the issue was solved? If you don't know, I'll spend time checking. Please let me know. 🙏 |
It was partially fixed, but another attempt to fix or figure out the remaining leaks would be very helpful. I'm sorry for the late reply, I took a week break |
I've created this repository: https://github.com/Kludex/celery-leak From my observations, the memory grows until a certain point, and then it remains constant. It took around 2k tasks to reach the point where it becomes constant. Can someone point out how to reproduce it, or what I should try in order to reproduce it? |
Seeing this on Celery 4.3.1, Kombu 4.6.11, Redis 4.1.2. Below is the average memory chart. The available memory increases when the service is restarted during deployments twice a day (Mon-Fri). During weekends, available memory keeps decreasing until the service is restarted. @auvipy Any suggestion/fix for this? Does upgrading resolve this issue? |
First of all, we really can't say much about an unsupported version, which was released almost 5 years ago. Using the latest version usually provides more stability in general, and if any issues are raised, they are generally easier to reproduce/fix. |
In our case, what we thought was a memory leak actually turned out to be something else. More info: |
I'm experiencing a memory leak in a forked worker. Essentially, not all memory is freed after successive task executions. |
General tips and guidance on how to approach fixing memory leaks in Python, which can be applied to the Celery project.
Another example:
This example assumes that you have resources that need to be processed. Instead of passing the actual resource object to the Celery task, you maintain a weak reference dictionary, and only pass the id. This way, once the resource is no longer needed, it can be garbage collected, preventing a memory leak.
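A rough sketch of that pattern (the `Resource` class, the registry, and the task below are hypothetical illustrations, not code from this thread):

```python
import weakref

from celery import Celery

app = Celery("app", broker="amqp://localhost")

# Registry holding only weak references: once the rest of the application
# drops a resource, it can be garbage collected even though it is registered here.
_resources = weakref.WeakValueDictionary()


class Resource:
    def __init__(self, resource_id, payload):
        self.id = resource_id
        self.payload = payload


def register(resource):
    _resources[resource.id] = resource


@app.task
def process_resource(resource_id):
    # Only the id crosses the task boundary, not the object itself.
    resource = _resources.get(resource_id)
    if resource is None:
        return None  # Already collected or never registered in this process.
    # ... process resource.payload here ...
    return resource.id
```

Note that a plain in-process registry like this only resolves ids when producer and worker share the same process (e.g. with task_always_eager); across processes the id would normally be looked up in a database or cache instead.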
|
That...kind of reads like a ChatGPT answer. |
I have a memory leak in a process that sends emails. There are 50 Celery tasks executed in parallel at certain intervals (eta); that is, it is not necessary for one sending task to finish before another starts, and I do it with Celery's group(). What this process mainly does is open and close the connection to the mail server many times to send mails, and generate records in the database (around 1000 records in 45 minutes). There comes a point where memory usage climbs to the maximum available. I suppose there is a memory leak and that memory is never recovered: no matter how long after the function ends, it is not released until the worker is restarted. What can you recommend I do to avoid this leak? Django 3.2.18 |
@FabriQuinteros if you use eta tasks, you might find this comment useful: #4843 (comment) |
@norbertcyran I checked it, but my problem is short-term, not long-term. I have many other tasks scheduled besides these. The problem appears when they start to run, not at the moment when I add them to the task queue |
Hi guys, I advise you to use jemalloc. It has helped us considerably reduce memory consumption. Here's my Dockerfile configuration
|
For anyone running into this on Django, this helped with my memory leak. Most answers online mention setting CELERYD_MAX_TASKS_PER_CHILD; this is the right idea, but the naming needs to be updated for new Django/Celery projects, since Celery has renamed certain configuration options. Celery has a command to make this conversion easy. This will change CELERYD_MAX_TASKS_PER_CHILD to CELERY_WORKER_MAX_TASKS_PER_CHILD. To check whether it is working, run Flower; on the Flower -> Pool tab you should see the value reflected. If this approach doesn't work, you can pass it in the worker invocation instead. |
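As an illustration of the Django-style setting (the value 100 is an arbitrary assumption; tune it to your workload):

```python
# settings.py, assuming the app is configured with
# app.config_from_object("django.conf:settings", namespace="CELERY")

# Recycle each prefork child after it has run this many tasks, so any memory
# it leaked or fragmented is released when the child process exits.
CELERY_WORKER_MAX_TASKS_PER_CHILD = 100
```

When configuring the Celery app directly instead of through Django settings, the equivalent lowercase option is `worker_max_tasks_per_child`.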
There is a memory leak in the parent process of Celery's worker.
It is not in the child processes that execute tasks.
It happens suddenly every few days.
Unless Celery is stopped, it consumes the server's memory within tens of hours.
This problem happens at least in Celery 4.1, and it also occurs in Celery 4.2.
Celery is running on Ubuntu 16 and the broker is RabbitMQ.