How to check why a job is killed? #2092

Closed
tctco opened this issue May 19, 2024 · 1 comment

tctco commented May 19, 2024

I currently use rq to schedule neural network training jobs with mmdetection (a framework based on PyTorch) inside a Docker container. However, the training job sometimes gets killed unexpectedly:

2024-05-19 21:41:11 05/19 13:41:11 - mmengine - INFO - load backbone. in model from: /segtracker/resources/models/pretrained/cspnext-tiny_imagenet_600e.pth
2024-05-19 21:41:11 Loads checkpoint by local backend from path: /segtracker/resources/models/pretrained/cspnext-tiny_imagenet_600e.pth
2024-05-19 21:41:11 05/19 13:41:11 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
2024-05-19 21:41:11 05/19 13:41:11 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
2024-05-19 21:41:11 05/19 13:41:11 - mmengine - INFO - Checkpoints will be saved to /backend/trained_models/152-1-13-mouse-det.
2024-05-19 21:41:13 13:41:13 Killed horse pid 2424
2024-05-19 21:41:13 13:41:13 Job stopped by user, moving job to FailedJobRegistry

At first, I thought this might be a memory problem, but increasing the Docker container's memory limit did not resolve it. I also noticed that rq kills the job when PyTorch tries to download a pretrained model:

2024-05-19 21:48:49 creating index...
2024-05-19 21:48:49 index created!
2024-05-19 21:48:49 05/19 13:48:49 - mmengine - INFO - load model from: torchvision://resnet50
2024-05-19 21:48:49 05/19 13:48:49 - mmengine - INFO - Loads checkpoint by torchvision backend from path: torchvision://resnet50
2024-05-19 21:48:49 Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
2024-05-19 21:48:51 13:48:51 Killed horse pid 2475
2024-05-19 21:48:51 13:48:51 Job stopped by user, moving job to FailedJobRegistry

However, mmpretrain (a classification package similar to mmdetection) works smoothly within an rq job, and the training also runs fine when launched with subprocess.run inside the rq job.

How can I find what caused the problem?

Any information would be helpful!
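
For reference, this is roughly how the stored traceback of failed jobs can be pulled out of RQ's FailedJobRegistry; the queue name and Redis connection details below are placeholders, not my actual setup:

```python
# Minimal sketch: inspect failed RQ jobs for their recorded failure reason.
from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

redis = Redis(host="localhost", port=6379)          # placeholder connection
queue = Queue("default", connection=redis)          # placeholder queue name
registry = FailedJobRegistry(queue=queue)

for job_id in registry.get_job_ids():
    job = Job.fetch(job_id, connection=redis)
    # exc_info holds the traceback / failure reason the worker recorded
    print(job_id, job.exc_info)
```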

selwin (Collaborator) commented May 27, 2024

Your log shows Job stopped by user, moving job to FailedJobRegistry, meaning the job was stopped because the worker received a stop-job command for it.
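
The worker logs that line after receiving RQ's stop-job command for the running job. A minimal sketch of how such a command is sent (the connection and job ID are placeholders):

```python
# Minimal sketch: sending RQ's stop-job command, which makes the worker kill
# the horse process and move the job to the FailedJobRegistry.
from redis import Redis
from rq.command import send_stop_job_command

redis = Redis(host="localhost", port=6379)   # placeholder connection
send_stop_job_command(redis, "my-job-id")    # placeholder job ID
```

Whatever part of your stack issues this command for the training job's ID is what is stopping it.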

selwin closed this as completed May 27, 2024