Training hangs at the start if the dataset is too small - on AWS
Created by: Xirider
🐛 Bug
When starting a training run, the model hangs at the first forward pass. This happened when I used the small book dataset from gpu_tests/test_training_integrity.py.
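A rough back-of-envelope check of the suspected cause (a sketch only; every number below is a placeholder I made up, not a value read from the 8m config or the BookCorpusFair shard): if the tokenized book data yields fewer sequences than one global batch needs, some data-parallel ranks would get no sample on the first step.
# Hypothetical sanity check; all numbers are placeholders, not actual config values.
tokens_in_dataset = 500_000        # placeholder: tokens in the small book shard
tokens_per_sample = 2048           # placeholder: sequence length of the 8m config
sequences_per_gpu = 8              # placeholder: batch size per GPU per step
num_gpus = 8

available = tokens_in_dataset // tokens_per_sample
needed = sequences_per_gpu * num_gpus
print(f"{available} sequences available, {needed} needed for one global step")
if available < needed:
    print("some ranks get no data on step 1, so collectives may never complete")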
To Reproduce
On AWS, install an environment with PyTorch compiled with EFA support:
conda create -n sor python=3.8 -y
conda activate sor
pip install bitarray boto3 deepspeed editdistance iopath ipdb ipython pyarrow pytest sacremoses sentencepiece subword-nmt hydra-core==1.0.7 omegaconf==2.0.6 tokenizers more_itertools cython fire
pip install six regex
srun -p train -t 3:00:00 --gpus-per-node=8 --pty bash
conda activate sor
module purge
module load cuda/11.6
module load nccl/2.12.7-cuda.11.6
module load nccl_efa/1.2.0-nccl.2.12.7-cuda.11.6
conda install pytorch=1.12.1=aws* cudatoolkit=11.6 torchvision torchaudio \
--override-channels \
-c https://aws-pytorch.s3.us-west-2.amazonaws.com \
-c pytorch \
-c nvidia \
-c conda-forge
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 265b451de8ba9bfcb67edc7360f3d8772d0a8bea
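# assumption: line 32 of apex's setup.py at this commit is its CUDA/compiler version check; inserting an early return skips it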
sed -i '32 i \ \ \ \ return' setup.py
python -m pip install -v --no-cache-dir --global-option="--cpp_ext" \
--global-option="--cuda_ext" \
--global-option="--deprecated_fused_adam" \
--global-option="--xentropy" \
--global-option="--fast_multihead_attn" ./
cd ..
git clone https://github.com/ngoyal2707/Megatron-LM.git
cd Megatron-LM
git checkout fairseq_v2
pip install -e .
cd ..
git clone https://github.com/facebookresearch/fairscale.git
cd fairscale
git checkout fixing_memory_issues_with_keeping_overlap_may24
pip install -e .
cd ..
git clone https://github.com/facebookresearch/metaseq.git
cd metaseq
pip install -e .
cd ..
Run the CircleCI test, which downloads BookCorpusFair:
python -m pytest -o log_cli=true gpu_tests/test_training_integrity.py
The test also launches the command below. Everything seems to go well until the first forward pass, at which point the model hangs.
python metaseq/launcher/opt_baselines.py --prefix train.8m --model-size 8m --checkpoints-dir ./test-checkpoint --tensorboard-logdir ./test-checkpoint --num-trials 1 --num-gpus 8 --num-nodes 1 --seed 1 --circleci --local --disable-validation --max-epoch 5 --max-update 5 --azure
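Not something the test does today, but a low-effort way to see where each rank is stuck would be to register a stack dumper near the top of the training entry point (metaseq_cli/train.py is my guess for a reasonable spot, not a confirmed hook) and send SIGUSR1 to the hung worker PIDs:
# Hypothetical debugging helper, not part of metaseq: on SIGUSR1, dump the
# Python stack of every thread so a hung rank can be inspected with
# `kill -USR1 <pid>` instead of guessing from the NCCL output alone.
import faulthandler
import signal
import sys

def install_hang_dumper():
    faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
    # Also dump automatically every 10 minutes, in case the hang happens
    # before anyone can attach to the processes.
    faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)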
This is the last output before hanging (with NCCL debug enabled):
2022-10-07 12:38:21 | INFO | metaseq.trainer | begin training epoch 1
2022-10-07 12:38:21 | INFO | metaseq_cli.train | Start iterating over samples
25489:25489 [1] NCCL INFO comm 0x7f37a0008f70 rank 1 nranks 8 cudaDev 1 busId 101d0 - Abort COMPLETE
25493:25493 [5] NCCL INFO comm 0x7fb97c008f70 rank 5 nranks 8 cudaDev 5 busId 901d0 - Abort COMPLETE
25492:25492 [4] NCCL INFO comm 0x7f5804008f70 rank 4 nranks 8 cudaDev 4 busId 901c0 - Abort COMPLETE
25496:25496 [7] NCCL INFO comm 0x7f2a58008f70 rank 7 nranks 8 cudaDev 7 busId a01d0 - Abort COMPLETE
25491:25491 [3] NCCL INFO comm 0x7f0df4008f70 rank 3 nranks 8 cudaDev 3 busId 201d0 - Abort COMPLETE
25494:25494 [6] NCCL INFO comm 0x7fb1a4008f70 rank 6 nranks 8 cudaDev 6 busId a01c0 - Abort COMPLETE
25490:25490 [2] NCCL INFO comm 0x7f929c008f70 rank 2 nranks 8 cudaDev 2 busId 201c0 - Abort COMPLETE
25489:25489 [1] NCCL INFO comm 0x7f36c0008f70 rank 1 nranks 8 cudaDev 1 busId 101d0 - Abort COMPLETE
25493:25493 [5] NCCL INFO comm 0x7fb888008f70 rank 5 nranks 8 cudaDev 5 busId 901d0 - Abort COMPLETE
25492:25492 [4] NCCL INFO comm 0x7f5714008f70 rank 4 nranks 8 cudaDev 4 busId 901c0 - Abort COMPLETE
25496:25496 [7] NCCL INFO comm 0x7f2990008f70 rank 7 nranks 8 cudaDev 7 busId a01d0 - Abort COMPLETE
25491:25491 [3] NCCL INFO comm 0x7f0d04008f70 rank 3 nranks 8 cudaDev 3 busId 201d0 - Abort COMPLETE
25494:25494 [6] NCCL INFO comm 0x7fb0c8008f70 rank 6 nranks 8 cudaDev 6 busId a01c0 - Abort COMPLETE
25490:25490 [2] NCCL INFO comm 0x7f91a8008f70 rank 2 nranks 8 cudaDev 2 busId 201c0 - Abort COMPLETE
With the --benchmark flag, which uses dummy sequences instead of the books dataset, there appears to be no issue.
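That matches the behaviour of a minimal, self-contained sketch of the suspected failure mode (my own assumption about the mechanism, not something confirmed by the logs above): if the dataset is so small that one data-parallel rank gets no batch and skips the first collective, every other rank blocks in that collective, much like the hang described here.
# toy_hang.py - deliberately hangs to illustrate the suspected mechanism.
# Launch with: torchrun --nproc_per_node=8 toy_hang.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    grad = torch.ones(1, device="cuda")
    # Pretend the dataset was too small to give rank 0 a batch, so it never
    # issues the all-reduce that the other ranks are waiting on.
    if rank != 0:
        dist.all_reduce(grad)   # ranks 1-7 block here forever
    print(f"rank {rank} done")

if __name__ == "__main__":
    main()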