🐛 Describe the bug
I was configuring the pytorch/torchserve:0.10.0-gpu Docker image to deploy a model to production and I've encountered the following issue.
The nvgpu package used by the metrics collector fails to work with NVIDIA MIG technology, and it crashes the metrics thread.
After a bit of investigation, the culprit is the nvgpu.gpu_info() function, which parses the nvidia-smi output. On a normal GPU it works fine, since it grabs the Memory-Usage field (roughly the fifth line of the table, second column); a simplified sketch of that parse follows the output below:
urko@port-urkoa:~$ nvidia-smi
Tue Apr 16 08:35:29 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 Off | 00000000:01:00.0 On | N/A |
| N/A 68C P0 38W / 80W | 3091MiB / 6144MiB | 18% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3317 G /usr/lib/xorg/Xorg 1955MiB |
| 0 N/A N/A 3632 G /usr/bin/gnome-shell 279MiB |
| 0 N/A N/A 4915 G ...seed-version=20240414-180149.278000 327MiB |
| 0 N/A N/A 5778 G ...erProcess --variations-seed-version 477MiB |
+---------------------------------------------------------------------------------------+
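For reference, this is roughly what that parse does (a simplified approximation of the logic in nvgpu/__init__.py, not the exact source):

# Simplified sketch of how nvgpu extracts Memory-Usage from the nvidia-smi table.
line = "| N/A   68C    P0    38W /  80W |   3091MiB /  6144MiB |     18%      Default |"
mem_field = line.split('|')[2]                       # '   3091MiB /  6144MiB '
mem_used, mem_total = [int(m.strip().replace('MiB', ''))
                       for m in mem_field.split('/')]
print(mem_used, mem_total)                           # 3091 6144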
However, since MIG changes the nvidia-smi output, it looks like this:
root@torchserve-depl-6479499d9f-8p8j7:/home/model-server# nvidia-smi
Tue Apr 16 06:27:22 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:03:00.0 Off | On |
| N/A 74C P0 63W / 250W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 4 0 0 | 1141MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 2MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
So when nvgpu tries to parse the Memory-Usage field, it gets N/A, tries to convert it to an integer, and that is the error I get.
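The same simplified parse against the MIG memory line shown above reproduces the failure:

# Same sketch as above, fed the MIG-enabled memory line.
line = "| N/A   74C    P0    63W / 250W |                  N/A |     N/A      Default |"
mem_field = line.split('|')[2]                       # '                  N/A '
mem_used, mem_total = [int(m.strip().replace('MiB', ''))
                       for m in mem_field.split('/')]
# ValueError: invalid literal for int() with base 10: 'N'
# ('N/A'.split('/') yields ['N', 'A'], and int('N') fails)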
Error logs
The error I get in the main logs:
2024-04-16T06:25:41,915 [ERROR] Thread-14 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
File "/home/venv/lib/python3.9/site-packages/ts/metrics/metric_collector.py", line 27, in <module>
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 71, in gpu_utilization
info = nvgpu.gpu_info()
File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in gpu_info
mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in <listcomp>
mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
ValueError: invalid literal for int() with base 10: 'N'
Installation instructions
I used a Dockerfile with torchserve as the base image:
FROM pytorch/torchserve:0.10.0-gpu
ENV DEBIAN_FRONTEND=noninteractive
USER 0
RUN apt update && apt install -y python3-opencv python3-pip git build-essential
ENV PYTHONUNBUFFERED=1
RUN pip install opencv-python torchvision torch torchaudio timm numpy scikit-learn matplotlib seaborn pandas
RUN pip install 'git+https://github.com/facebookresearch/detectron2.git'
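The image was built and run the usual way; the tag and port mappings below are just my setup, nothing specific to the repro:

docker build -t torchserve-mig-repro .
docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 torchserve-mig-repro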
Model Packaging
Standard .mar file. Doesn't apply.
config.properties
default_workers_per_model=1
Versions
Standard pytorch/torchserve:0.10.0-gpu Docker image
Repro instructions
The steps to reproduce it (it is MANDATORY to have MIG instances configured):
root@torchserve-depl-6479499d9f-8p8j7:/home/model-server# python3
Python 3.9.18 (main, Aug 25 2023, 13:20:04)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nvgpu
>>> nvgpu.gpu_info()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in gpu_info
mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in <listcomp>
mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
ValueError: invalid literal for int() with base 10: 'N'
>>>
Possible Solution
I know this is a bug in nvgpu, not in torchserve, but AFAIK nvgpu is no longer being maintained, so this might be a good opportunity to switch packages or change the approach. Just my suggestion. Thanks
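One possible direction, just as a sketch: query NVML directly via the pynvml bindings (pip install nvidia-ml-py) instead of scraping nvidia-smi text, and treat missing memory info as a soft failure instead of crashing the thread. Untested against TorchServe itself:

# Sketch of an NVML-based replacement for nvgpu.gpu_info(); assumes the pynvml
# bindings from the nvidia-ml-py package. Not tested inside TorchServe.
import pynvml

def gpu_info():
    pynvml.nvmlInit()
    try:
        infos = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):              # older pynvml versions return bytes
                name = name.decode()
            try:
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                mem_used, mem_total = mem.used // 2**20, mem.total // 2**20
            except pynvml.NVMLError:
                mem_used, mem_total = 0, 0           # report zeros instead of crashing
            infos.append({'index': i, 'type': name,
                          'mem_used': mem_used, 'mem_total': mem_total})
        return infos
    finally:
        pynvml.nvmlShutdown()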
@UrkoAT Thank you for investigating the root cause. We have noticed that there are some bugs in nvgpu, which is also in maintenance mode. TS v0.10.0 provides a feature that allows customizing system metrics (see PR)
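If I read that feature correctly, it is wired up through config.properties, so something along these lines could point TorchServe at a MIG-aware collector (the property name is my reading of the TorchServe metrics docs; the script path and argument are placeholders):

default_workers_per_model=1
system_metrics_cmd=python /home/model-server/custom_system_metrics.py --gpu 1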