`convert_to_singleton` seems to hang for OPT-66B
Created by: EIFY
What is your question?
With the checkpoint directory prepared as follows:
$ ls 66b/
dict.txt         reshard-model_part-0-shard0.pt  reshard-model_part-3-shard0.pt  reshard-model_part-6-shard0.pt
gpt2-merges.txt  reshard-model_part-1-shard0.pt  reshard-model_part-4-shard0.pt  reshard-model_part-7-shard0.pt
gpt2-vocab.json  reshard-model_part-2-shard0.pt  reshard-model_part-5-shard0.pt
I had to hack checkpoint_utils.py a bit, since this assumption isn't true for OPT-66B:
https://github.com/facebookresearch/metaseq/blob/ac8659de23b680005a14490d72a874613ab59381/metaseq/checkpoint_utils.py#L390-L391
replacing those lines with the following:
# path to checkpoint...-shared.pt
local_path = local_path.split('.')[0] + '-shard0.pt'
paths_to_load = get_paths_to_load(local_path, suffix="shard")
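(For reference, a slightly more defensive version of the same hack, just a sketch on my part rather than anything from metaseq, would strip only the file extension instead of everything after the first '.', assuming local_path looks like .../reshard-model_part-N.pt:)

import os

# hypothetical variant: drop only the ".pt" extension, so any other dots in
# the path can't truncate it
base, _ext = os.path.splitext(local_path)
local_path = base + '-shard0.pt'
paths_to_load = get_paths_to_load(local_path, suffix="shard")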
Running the following
NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO python -m metaseq.scripts.convert_to_singleton 66b/
is taking a long time (22 hours and counting). Initially nvidia-smi showed all 8 GPUs active (screenshot omitted here); then the process on GPU 5 terminated first, and it has been in the following state for hours:
$ nvidia-smi
Thu Oct 13 19:24:37 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:16.0 Off |                    0 |
| N/A   54C    P0    74W / 300W |  20049MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:17.0 Off |                    0 |
| N/A   53C    P0    72W / 300W |  20133MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:18.0 Off |                    0 |
| N/A   52C    P0    73W / 300W |  19845MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:19.0 Off |                    0 |
| N/A   50C    P0    70W / 300W |  19857MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:00:1A.0 Off |                    0 |
| N/A   54C    P0    76W / 300W |  20073MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   47C    P0    44W / 300W |   1413MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   50C    P0    72W / 300W |  19977MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   54C    P0    69W / 300W |  19905MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1335      C   python                          19788MiB |
|    1   N/A  N/A      1419      C   ...onda/envs/user/bin/python    19872MiB |
|    2   N/A  N/A      1420      C   ...onda/envs/user/bin/python    19584MiB |
|    3   N/A  N/A      1421      C   ...onda/envs/user/bin/python    19596MiB |
|    4   N/A  N/A      1422      C   ...onda/envs/user/bin/python    19812MiB |
|    6   N/A  N/A      1424      C   ...onda/envs/user/bin/python    19716MiB |
|    7   N/A  N/A      1425      C   ...onda/envs/user/bin/python    19644MiB |
+-----------------------------------------------------------------------------+
Is there something obviously wrong here, or something I should try instead? Just in case it really does take this long, I've left it running. The last few INFO-level log lines look like this:
(...)
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 14 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO Channel 14 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO Channel 14 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 15 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 15 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO comm 0x7f5f78003090 rank 1 nranks 8 cudaDev 1 busId 170 - Init COMPLETE
i-0b2d24dbd20c27dd0:1420:3386 [2] NCCL INFO comm 0x7f7408003090 rank 2 nranks 8 cudaDev 2 busId 180 - Init COMPLETE
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO comm 0x7fdfc8003090 rank 4 nranks 8 cudaDev 4 busId 1a0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO comm 0x7f5b60003090 rank 0 nranks 8 cudaDev 0 busId 160 - Init COMPLETE
i-0b2d24dbd20c27dd0:1424:3384 [6] NCCL INFO comm 0x7fd82c003090 rank 6 nranks 8 cudaDev 6 busId 1c0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO comm 0x7fd544003090 rank 5 nranks 8 cudaDev 5 busId 1b0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1421:3389 [3] NCCL INFO comm 0x7f9c64003090 rank 3 nranks 8 cudaDev 3 busId 190 - Init COMPLETE
i-0b2d24dbd20c27dd0:1425:3385 [7] NCCL INFO comm 0x7f3fe0003090 rank 7 nranks 8 cudaDev 7 busId 1d0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:1335 [0] NCCL INFO Launch mode Parallel
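In case it helps with diagnosing this, I can also grab stack traces from the surviving ranks; I was planning to do that with py-spy (not metaseq-specific, just a generic way to check whether they are all blocked in the same collective), using the PIDs from the nvidia-smi output above:
$ pip install py-spy
$ py-spy dump --pid 1335    # rank 0; repeat for the other PIDs listed above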
What's your environment?
- metaseq Version: 7828d728 (Oct 5 main)
- PyTorch Version: 1.12.1+cu113
- OS: Ubuntu 18.04.6 LTS
- How you installed metaseq: pip
- Build command you used (if compiling from source): N/A
- Python version: 3.10
- CUDA/cuDNN version: CUDA 11.8
- GPU models and configuration: 8 x V100 SXM2 32 GB