I currently use rq to schedule neural network training jobs with mmdetection (a framework based on PyTorch) in a Docker environment. However, the training job sometimes gets killed unexpectedly:
2024-05-19 21:41:11 05/19 13:41:11 - mmengine - INFO - load backbone. in model from: /segtracker/resources/models/pretrained/cspnext-tiny_imagenet_600e.pth
2024-05-19 21:41:11 Loads checkpoint by local backend from path: /segtracker/resources/models/pretrained/cspnext-tiny_imagenet_600e.pth
2024-05-19 21:41:11 05/19 13:41:11 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
2024-05-19 21:41:11 05/19 13:41:11 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
2024-05-19 21:41:11 05/19 13:41:11 - mmengine - INFO - Checkpoints will be saved to /backend/trained_models/152-1-13-mouse-det.
2024-05-19 21:41:13 13:41:13 Killed horse pid 2424
2024-05-19 21:41:13 13:41:13 Job stopped by user, moving job to FailedJobRegistry
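For context, the jobs are enqueued roughly like this (a minimal sketch; the function and config names are placeholders, not my exact code):

```python
from redis import Redis
from rq import Queue

# Hypothetical training entry point; the real one wraps mmdetection's training runner.
from segtracker.training import run_training

queue = Queue("training", connection=Redis(host="redis"))

# Training runs for hours, so the default 180 s job timeout is raised explicitly.
job = queue.enqueue(
    run_training,
    kwargs={"config": "configs/rtmdet/rtmdet-tiny_mouse-det.py"},  # placeholder config path
    job_timeout="12h",
)
```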
At first, I thought this might be a memory problem, but increasing the Docker container's memory limit did not resolve it. I also noticed that rq would kill the job when PyTorch tries to download a pretrained model:
2024-05-19 21:48:49 creating index...
2024-05-19 21:48:49 index created!
2024-05-19 21:48:49 05/19 13:48:49 - mmengine - INFO - load model from: torchvision://resnet50
2024-05-19 21:48:49 05/19 13:48:49 - mmengine - INFO - Loads checkpoint by torchvision backend from path: torchvision://resnet50
2024-05-19 21:48:49 Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
2024-05-19 21:48:51 13:48:51 Killed horse pid 2475
2024-05-19 21:48:51 13:48:51 Job stopped by user, moving job to FailedJobRegistry
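If the download itself is the trigger, pre-fetching the checkpoint into the torch hub cache before enqueueing the job (so nothing needs to be downloaded inside the worker) might sidestep it; a minimal sketch using the URL from the log above:

```python
import torch

# Downloads the checkpoint into the default hub cache
# (/root/.cache/torch/hub/checkpoints), so the rq job finds it locally.
url = "https://download.pytorch.org/models/resnet50-0676ba61.pth"
torch.hub.load_state_dict_from_url(url, progress=True)
```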
However, mmpretrain (a classification package similar to mmdetection) runs smoothly within an rq job. The mmdetection training also works fine when launched via the subprocess.run function from within the rq job.
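By "works fine with subprocess.run" I mean launching the training as a child process from inside the rq job, roughly like this (tools/train.py is mmdetection's standard training entry point; the config path is a placeholder):

```python
import subprocess

# Running mmdetection training in a separate process from the rq worker succeeds,
# while calling the training loop directly in the worker process gets killed.
subprocess.run(
    ["python", "tools/train.py", "configs/rtmdet/rtmdet-tiny_mouse-det.py"],
    check=True,
)
```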
How can I find what caused the problem?
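For example, is reading the failed job's stored exception info from the FailedJobRegistry the right direction, or is there a better way? A sketch of what I have in mind (the queue name is a placeholder):

```python
from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

redis = Redis(host="redis")
registry = FailedJobRegistry(queue=Queue("training", connection=redis))

# Print whatever traceback / failure reason rq stored for each failed job.
for job_id in registry.get_job_ids():
    job = Job.fetch(job_id, connection=redis)
    print(job_id, job.exc_info)
```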
Any information would be helpful!