CLI times out with 350M model
Created by: Mrs-Hudson
🐛 Bug
When I launch the interactive hosted CLI with the 350M-parameter model, the server logs show a timeout and the server never comes up (logs below).
To Reproduce
Run `python -m metaseq_cli.interactive_hosted` with the 350M-parameter OPT model:
2022-05-10 05:09:38 | INFO | metaseq_cli.interactive | Local checkpoint copy already exists, skipping copy
2022-05-10 05:09:39 | INFO | metaseq.distributed.utils | distributed init (rank 0): tcp://localhost:19587
2022-05-10 05:09:41 | INFO | metaseq.distributed.utils | distributed init (rank 2): tcp://localhost:19587
2022-05-10 05:09:41 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
2022-05-10 05:09:41 | INFO | metaseq.distributed.utils | distributed init (rank 1): tcp://localhost:19587
2022-05-10 05:09:41 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
2022-05-10 05:09:41 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-05-10 05:09:41 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2022-05-10 05:09:51 | INFO | torch.distributed.distributed_c10d | Waiting in store based barrier to initialize process group for rank: 2, key: store_based_barrier_key:1 (world_size=2, worker_count=3, timeout=0:30:00)
2022-05-10 05:09:51 | INFO | torch.distributed.distributed_c10d | Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=3, timeout=0:30:00)
2022-05-10 05:10:01 | INFO | torch.distributed.distributed_c10d | Waiting in store based barrier to initialize process group for rank: 2, key: store_based_barrier_key:1 (world_size=2, worker_count=3, timeout=0:30:00)
...
2022-05-10 05:39:33 | INFO | torch.distributed.distributed_c10d | Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=3, timeout=0:30:00)
2022-05-10 05:39:33 | INFO | torch.distributed.distributed_c10d | Waiting in store based barrier to initialize process group for rank: 2, key: store_based_barrier_key:1 (world_size=2, worker_count=3, timeout=0:30:00)
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/azureuser/metaseq/metaseq_cli/interactive_hosted.py", line 308, in <module>
    cli_main()
  File "/home/azureuser/metaseq/metaseq_cli/interactive_hosted.py", line 304, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/azureuser/metaseq/metaseq/distributed/utils.py", line 256, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/home/azureuser/metaseq/metaseq/distributed/utils.py", line 234, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/home/azureuser/metaseq/metaseq/distributed/utils.py", line 198, in distributed_main
    cfg.distributed_training.distributed_rank = distributed_init(cfg)
  File "/home/azureuser/metaseq/metaseq/distributed/utils.py", line 136, in distributed_init
    dist.init_process_group(
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 244, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=3, timeout=0:30:00)
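The failure mode is visible in the numbers the barrier prints: `worker_count=3` can never equal `world_size=2`, so no rank ever leaves the barrier and every rank times out after 30 minutes. The counting logic can be sketched like this (a toy simplification of my own, not metaseq or PyTorch code; the key name and error wording are only mimicked from the log):

```python
import time

class ToyStore:
    """Toy stand-in for the TCP key-value store behind init_process_group."""
    def __init__(self):
        self._counts = {}

    def add(self, key, amount):
        self._counts[key] = self._counts.get(key, 0) + amount
        return self._counts[key]

    def get(self, key):
        return self._counts.get(key, 0)

KEY = "store_based_barrier_key:1"

def store_based_barrier(rank, store, world_size, timeout=0.2):
    """Poll until the check-in count EQUALS world_size; an overshoot
    (more workers than world_size) therefore stalls until the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if store.get(KEY) == world_size:
            return
        time.sleep(0.01)
    raise RuntimeError(
        f"Timed out initializing process group on rank {rank}: "
        f"world_size={world_size}, worker_count={store.get(KEY)}"
    )

store = ToyStore()
for _ in range(3):      # three workers check in: two live ranks plus one extra
    store.add(KEY, 1)

errors = []
for rank in range(2):   # each live rank then polls the count and times out
    try:
        store_based_barrier(rank, store, world_size=2)
    except RuntimeError as e:
        errors.append(str(e))

print(errors)
```

In this sketch, once a third check-in lands in the store, both live ranks are stuck exactly as in the log, which is why the interesting question is where the extra check-in comes from.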
2022-05-10 17:17:18 | INFO | metaseq_cli.interactive | Local checkpoint copy already exists, skipping copy
If I increase the world size and model parallel to 3, the worker count becomes 4.
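Both runs show the same off-by-one (this is my reading of the two logs, not a confirmed diagnosis): exactly one process beyond the configured world size is checking into the store, which would point at a single unexpected worker joining the group, e.g. a stale process left over from an earlier launch, or one process too many being spawned for the configured world size.

```python
# Worker counts reported by the barrier vs. the configured world size,
# taken from the two runs described above. The surplus is one extra
# process in both configurations.
observations = {2: 3, 3: 4}  # world_size -> worker_count from the logs
surplus = {ws: wc - ws for ws, wc in observations.items()}
print(surplus)  # {2: 1, 3: 1}
```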
metaseq Version (e.g., 1.0 or master): 0.0.1
PyTorch Version (e.g., 1.0): 1.10.1+cu113
OS (e.g., Linux): Linux, Ubuntu 18.04.6 LTS (Bionic Beaver)
How you installed metaseq (pip, source): Same as setup instructions
Build command you used (if compiling from source): Same as setup instructions
Python version: 3.8.5
CUDA/cuDNN version: CUDA 11.2 (Cuda compilation tools release 11.2, V11.2.152, per `nvcc --version`)
GPU models and configuration: Azure compute node, Standard_ND40rs_v2 VM (40 cores, 672 GB RAM, 2900 GB disk), 8 x NVIDIA Tesla V100