`convert_to_singleton` seems to hang for OPT-66B
Created by: EIFY
What is your question?
With the checkpoint directory prepared as follows:
$ ls 66b/
dict.txt         reshard-model_part-0-shard0.pt  reshard-model_part-3-shard0.pt  reshard-model_part-6-shard0.pt
gpt2-merges.txt  reshard-model_part-1-shard0.pt  reshard-model_part-4-shard0.pt  reshard-model_part-7-shard0.pt
gpt2-vocab.json  reshard-model_part-2-shard0.pt  reshard-model_part-5-shard0.pt
I had to hack checkpoint_utils.py a bit, since this assumption isn't true for OPT-66B:
https://github.com/facebookresearch/metaseq/blob/ac8659de23b680005a14490d72a874613ab59381/metaseq/checkpoint_utils.py#L390-L391
replacing those lines with the following:
# path to checkpoint...-shared.pt
local_path = local_path.split('.')[0] + '-shard0.pt'
paths_to_load = get_paths_to_load(local_path, suffix="shard")
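(For reference, a slightly more defensive version of the same hack, just a sketch on my part rather than anything from metaseq, would strip only the file extension instead of everything after the first '.', assuming local_path looks like .../reshard-model_part-N.pt:)

import os

# hypothetical variant: drop only the ".pt" extension, so any other dots in
# the path can't truncate it
base, _ext = os.path.splitext(local_path)
local_path = base + '-shard0.pt'
paths_to_load = get_paths_to_load(local_path, suffix="shard")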
Running the following
NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO python -m metaseq.scripts.convert_to_singleton 66b/
is taking a long time (22 hours and counting). Initially nvidia-smi showed all 8 GPUs active (screenshot omitted here); then the process on GPU 5 terminated first, and it has been in the following state for hours:
$ nvidia-smi
Thu Oct 13 19:24:37 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:16.0 Off |                    0 |
| N/A   54C    P0    74W / 300W |  20049MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:17.0 Off |                    0 |
| N/A   53C    P0    72W / 300W |  20133MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:18.0 Off |                    0 |
| N/A   52C    P0    73W / 300W |  19845MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:19.0 Off |                    0 |
| N/A   50C    P0    70W / 300W |  19857MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:00:1A.0 Off |                    0 |
| N/A   54C    P0    76W / 300W |  20073MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   47C    P0    44W / 300W |   1413MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   50C    P0    72W / 300W |  19977MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   54C    P0    69W / 300W |  19905MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1335      C   python                          19788MiB |
|    1   N/A  N/A      1419      C   ...onda/envs/user/bin/python    19872MiB |
|    2   N/A  N/A      1420      C   ...onda/envs/user/bin/python    19584MiB |
|    3   N/A  N/A      1421      C   ...onda/envs/user/bin/python    19596MiB |
|    4   N/A  N/A      1422      C   ...onda/envs/user/bin/python    19812MiB |
|    6   N/A  N/A      1424      C   ...onda/envs/user/bin/python    19716MiB |
|    7   N/A  N/A      1425      C   ...onda/envs/user/bin/python    19644MiB |
+-----------------------------------------------------------------------------+
Is there something obviously wrong here, or something I should try instead? Just in case it really does take this long, I've left it running. The last few INFO-level log lines look like this:
(...)
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 14 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO Channel 14 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO Channel 14 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 15 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 15 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO comm 0x7f5f78003090 rank 1 nranks 8 cudaDev 1 busId 170 - Init COMPLETE
i-0b2d24dbd20c27dd0:1420:3386 [2] NCCL INFO comm 0x7f7408003090 rank 2 nranks 8 cudaDev 2 busId 180 - Init COMPLETE
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO comm 0x7fdfc8003090 rank 4 nranks 8 cudaDev 4 busId 1a0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO comm 0x7f5b60003090 rank 0 nranks 8 cudaDev 0 busId 160 - Init COMPLETE
i-0b2d24dbd20c27dd0:1424:3384 [6] NCCL INFO comm 0x7fd82c003090 rank 6 nranks 8 cudaDev 6 busId 1c0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO comm 0x7fd544003090 rank 5 nranks 8 cudaDev 5 busId 1b0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1421:3389 [3] NCCL INFO comm 0x7f9c64003090 rank 3 nranks 8 cudaDev 3 busId 190 - Init COMPLETE
i-0b2d24dbd20c27dd0:1425:3385 [7] NCCL INFO comm 0x7f3fe0003090 rank 7 nranks 8 cudaDev 7 busId 1d0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:1335 [0] NCCL INFO Launch mode Parallel
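In case it helps with diagnosing this, I can also grab stack traces from the surviving ranks; I was planning to do that with py-spy (not metaseq-specific, just a generic way to check whether they are all blocked in the same collective), using the PIDs from the nvidia-smi output above:
$ pip install py-spy
$ py-spy dump --pid 1335    # rank 0; repeat for the other PIDs listed above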
What's your environment?
- metaseq Version: 7828d728 (Oct 5 main)
- PyTorch Version: 1.12.1+cu113
- OS: Ubuntu 18.04.6 LTS
- How you installed metaseq: pip
- Build command you used (if compiling from source): N/A
- Python version: 3.10
- CUDA/cuDNN version: CUDA 11.8
- GPU models and configuration: 8 x V100 SXM2 32 GB