Issue running metaseq-api-local for blenderbot3_30B model
Created by: Tacacs-1101
🐛 Bug
I have sharded the checkpoints into 4 parts to match the number of GPU nodes I have (4 GPUs, 15 GB each) and set TOTAL_WORLD_SIZE=4 and MODEL_PARALLEL=4, but I get the error "intra layer model parallel group is not initialized". I have checked the earlier issues and tried various combinations of (TOTAL_WORLD_SIZE, MODEL_PARALLEL), e.g. (2, 2), (2, 4), and (4, 4), but the same error keeps appearing. Please suggest a solution.
To Reproduce
- Shard the blenderbot3_30B checkpoints into 4 parts.
- Set the model path, the BPE file path, and MODEL_PARALLEL.
- Execute metaseq-api-local.
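
For anyone trying to reproduce this, here is a minimal sketch of a sanity check for the combinations listed above. It is not part of metaseq: the function name `check_parallel_config` is hypothetical, and it relies on two assumptions, namely that the world size must be a multiple of the model-parallel size and that there is one checkpoint shard per model-parallel rank. Passing this check does not mean the "intra layer model parallel group is not initialized" error is resolved; it only rules out an inconsistent configuration.

```python
def check_parallel_config(total_world_size, model_parallel, num_shards):
    """Return a list of problems; an empty list means the values are consistent.

    Assumptions (not taken from the metaseq source):
      * total_world_size must be a multiple of model_parallel
      * there is one checkpoint shard per model-parallel rank
    """
    problems = []
    if total_world_size < 1 or model_parallel < 1:
        problems.append("TOTAL_WORLD_SIZE and MODEL_PARALLEL must be positive")
        return problems
    if total_world_size % model_parallel != 0:
        problems.append(
            f"TOTAL_WORLD_SIZE={total_world_size} is not a multiple of "
            f"MODEL_PARALLEL={model_parallel}"
        )
    if num_shards != model_parallel:
        problems.append(
            f"{num_shards} checkpoint shards but MODEL_PARALLEL={model_parallel}"
        )
    return problems


if __name__ == "__main__":
    # The (TOTAL_WORLD_SIZE, MODEL_PARALLEL) pairs tried in this report,
    # checked against the 4 checkpoint shards.
    for tws, mp in [(2, 2), (2, 4), (4, 4)]:
        result = check_parallel_config(tws, mp, num_shards=4)
        print((tws, mp), result if result else "looks consistent")
```

Under these assumptions, only (4, 4) is consistent with 4 shards, so the error reported with that setting likely has a different cause than a mismatch between these values.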
