Joonas' Note

nvidia-smi Command Summary

2025. 11. 3. 14:05 joonas

    View overall GPU status

    nvidia-smi
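
    To keep watching instead of taking a one-off snapshot, the output can be refreshed in a loop. A minimal sketch (-l is a standard nvidia-smi option; the 2-second interval is just an example):

    # Re-print the full status every 2 seconds (Ctrl+C to stop)
    nvidia-smi -l 2

    # Or redraw in place using watch
    watch -n 1 nvidia-smi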


    View the status of a specific GPU

    The number is the GPU index; a UUID or PCI bus ID can be passed instead.

    nvidia-smi -i 0

    Several GPUs can be shown at once.

    $ nvidia-smi -i 0,3
    Mon Nov  3 14:02:05 2025
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.141.10   Driver Version: 470.141.10   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
    | N/A   48C    P0    41W / 300W |    218MiB / 16155MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
    | N/A   64C    P0    43W / 300W |     10MiB / 16158MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      3174      G   /usr/lib/xorg/Xorg                102MiB |
    |    0   N/A  N/A      4054      G   /usr/lib/xorg/Xorg                 67MiB |
    |    0   N/A  N/A      4253      G   /usr/bin/gnome-shell               26MiB |
    |    3   N/A  N/A      3174      G   /usr/lib/xorg/Xorg                  4MiB |
    |    3   N/A  N/A      4054      G   /usr/lib/xorg/Xorg                  4MiB |
    +-----------------------------------------------------------------------------+
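
    If only a few fields are needed rather than the full table, nvidia-smi can also emit machine-readable output. A sketch using standard --query-gpu fields (the particular field selection here is just an example):

    # Print selected fields for GPUs 0 and 3 as CSV
    nvidia-smi -i 0,3 --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv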


    List all GPUs

    $ nvidia-smi -L
    GPU 0: Tesla V100-DGXS-16GB (UUID: GPU-4a5696d4-12ff-d8a3-f604-d25020c46dc9)
    GPU 1: Tesla V100-DGXS-16GB (UUID: GPU-036e8961-7968-e8ed-05db-3ed1117387ab)
    Unable to determine the device handle for gpu 0000:0E:00.0: Unknown Error
    GPU 3: Tesla V100-DGXS-16GB (UUID: GPU-4ce0c45f-8be0-d20f-db73-e6e0e254b51f)

    A GPU that has died from overload or overheating can also be spotted this way, as the "Unable to determine the device handle" error above shows.
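
    For a closer look at an unhealthy GPU, the detailed query mode helps. A sketch (-q and -d are standard nvidia-smi options; which sections to inspect is up to you):

    # Full detailed report for GPU 0
    nvidia-smi -q -i 0

    # Restrict the report to the temperature and power sections
    nvidia-smi -q -i 0 -d TEMPERATURE,POWER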


    Limit the GPU's maximum power

    Enable persistence mode first so the setting is kept while no client is attached, then set the limit in watts. Both commands require root privileges.

    $ nvidia-smi -pm 1
    
    $ nvidia-smi -pl 220
    Power limit for GPU 00000000:07:00.0 was set to 220.00 W from 300.00 W.
    Power limit for GPU 00000000:08:00.0 was set to 220.00 W from 300.00 W.
    Power limit for GPU 00000000:0E:00.0 was set to 220.00 W from 300.00 W.
    Power limit for GPU 00000000:0F:00.0 was set to 220.00 W from 300.00 W.
    All done.
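
    To confirm the new limit and see the hardware bounds, a quick check (power.limit, power.default_limit, and power.max_limit are standard --query-gpu fields):

    # Show the current, default, and maximum power limit per GPU
    nvidia-smi --query-gpu=index,power.limit,power.default_limit,power.max_limit --format=csv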


    See all options (official documentation)

    https://docs.nvidia.com/deploy/nvidia-smi/index.html
