🐛 Describe the bug
I was configuring the pytorch/torchserve:0.10.0-gpu Docker image to deploy a model to production and I've encountered the following issue.
The nvgpu package used by the metrics collector fails to work with NVIDIA MIG technology, and it crashes the metrics thread.
After a bit of investigation, the culprit is the nvgpu.gpu_info() function, which parses the nvidia-smi output. On a normal GPU it works fine, since it grabs the Memory-Usage field (roughly the fifth line of the table, second column); a simplified sketch of that parse follows the output below:
urko@port-urkoa:~$ nvidia-smi
Tue Apr 16 08:35:29 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 Off | 00000000:01:00.0 On | N/A |
| N/A 68C P0 38W / 80W | 3091MiB / 6144MiB | 18% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3317 G /usr/lib/xorg/Xorg 1955MiB |
| 0 N/A N/A 3632 G /usr/bin/gnome-shell 279MiB |
| 0 N/A N/A 4915 G ...seed-version=20240414-180149.278000 327MiB |
| 0 N/A N/A 5778 G ...erProcess --variations-seed-version 477MiB |
+---------------------------------------------------------------------------------------+
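For reference, this is roughly what that parse does (a simplified approximation of the logic in nvgpu/__init__.py, not the exact source):

# Simplified sketch of how nvgpu extracts Memory-Usage from the nvidia-smi table.
line = "| N/A   68C    P0    38W /  80W |   3091MiB /  6144MiB |     18%      Default |"
mem_field = line.split('|')[2]                       # '   3091MiB /  6144MiB '
mem_used, mem_total = [int(m.strip().replace('MiB', ''))
                       for m in mem_field.split('/')]
print(mem_used, mem_total)                           # 3091 6144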
However, since MIG changes the nvidia-smi output, it looks like this:
root@torchserve-depl-6479499d9f-8p8j7:/home/model-server# nvidia-smi
Tue Apr 16 06:27:22 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:03:00.0 Off | On |
| N/A 74C P0 63W / 250W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 4 0 0 | 1141MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 2MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
So when nvgpu tries to parse the Memory-Usage field, it gets N/A, tries to convert it to an integer, and that is the error I get.
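The same simplified parse against the MIG memory line shown above reproduces the failure:

# Same sketch as above, fed the MIG-enabled memory line.
line = "| N/A   74C    P0    63W / 250W |                  N/A |     N/A      Default |"
mem_field = line.split('|')[2]                       # '                  N/A '
mem_used, mem_total = [int(m.strip().replace('MiB', ''))
                       for m in mem_field.split('/')]
# ValueError: invalid literal for int() with base 10: 'N'
# ('N/A'.split('/') yields ['N', 'A'], and int('N') fails)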
Error logs
The error I get in the main logs:
2024-04-16T06:25:41,915 [ERROR] Thread-14 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
File "/home/venv/lib/python3.9/site-packages/ts/metrics/metric_collector.py", line 27, in <module>
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 71, in gpu_utilization
info = nvgpu.gpu_info()
File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in gpu_info
mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in <listcomp>
mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
ValueError: invalid literal for int() with base 10: 'N'
Installation instructions
I used a Dockerfile with torchserve as the base image:
FROM pytorch/torchserve:0.10.0-gpu
ENV DEBIAN_FRONTEND=noninteractive
USER 0
RUN apt update && apt install -y python3-opencv python3-pip git build-essential
ENV PYTHONUNBUFFERED=1
RUN pip install opencv-python torchvision torch torchaudio timm numpy scikit-learn matplotlib seaborn pandas
RUN pip install 'git+https://github.com/facebookresearch/detectron2.git'
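The image was built and run the usual way; the tag and port mappings below are just my setup, nothing specific to the repro:

docker build -t torchserve-mig-repro .
docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 torchserve-mig-repro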
Model Packaging
Standard .mar file. Doesn't apply.
config.properties
default_workers_per_model=1
Versions
Standard pytorch/torchserve:0.10.0-gpu Docker image
Repro instructions
The steps to reproduce it (it is MANDATORY to have MIG instances configured):
root@torchserve-depl-6479499d9f-8p8j7:/home/model-server# python3
Python 3.9.18 (main, Aug 25 2023, 13:20:04)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nvgpu
>>> nvgpu.gpu_info()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in gpu_info
mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in <listcomp>
mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
ValueError: invalid literal for int() with base 10: 'N'
>>>
Possible Solution
I know this is a bug in nvgpu, not in torchserve, but AFAIK nvgpu is no longer being maintained, so this might be a good opportunity to switch packages or change the approach. Just my suggestion. Thanks
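One possible direction, just as a sketch: query NVML directly via the pynvml bindings (pip install nvidia-ml-py) instead of scraping nvidia-smi text, and treat missing memory info as a soft failure instead of crashing the thread. Untested against TorchServe itself:

# Sketch of an NVML-based replacement for nvgpu.gpu_info(); assumes the pynvml
# bindings from the nvidia-ml-py package. Not tested inside TorchServe.
import pynvml

def gpu_info():
    pynvml.nvmlInit()
    try:
        infos = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):              # older pynvml versions return bytes
                name = name.decode()
            try:
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                mem_used, mem_total = mem.used // 2**20, mem.total // 2**20
            except pynvml.NVMLError:
                mem_used, mem_total = 0, 0           # report zeros instead of crashing
            infos.append({'index': i, 'type': name,
                          'mem_used': mem_used, 'mem_total': mem_total})
        return infos
    finally:
        pynvml.nvmlShutdown()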
@UrkoAT Thank you for investigating the root cause. We have noticed that there are some bugs in nvgpu, which is also in maintenance mode. TS v0.10.0 provides a feature that allows customizing system metrics (see PR)
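If I read that feature correctly, it is wired up through config.properties, so something along these lines could point TorchServe at a MIG-aware collector (the property name is my reading of the TorchServe metrics docs; the script path and argument are placeholders):

default_workers_per_model=1
system_metrics_cmd=python /home/model-server/custom_system_metrics.py --gpu 1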