Dynamic loss scaler does not fully checkpoint state, causing path dependency wrt restarts
Created by: suchenzang
Dynamic loss scaler has _iter and _last_overflow_iter attributes, which are not checkpointed: https://github.com/fairinternal/fairseq-py/blob/gshard_combine_megatron_fsdp/fairseq/optim/dynamic_loss_scaler.py#L32
As a result, the loss scale evolves as a function of when we checkpoint and resume, rather than being determined by the training trajectory alone.
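For context, the path dependency comes from the scale only growing after scale_window consecutive non-overflow steps, which is tracked via _iter and _last_overflow_iter. A minimal sketch of that counter logic (an approximation for illustration, not the exact code in the linked file):

```python
# Simplified sketch of dynamic loss scaling counters (an approximation,
# not the exact logic in the linked dynamic_loss_scaler.py).
class ScalerSketch:
    def __init__(self, init_scale=2.0 ** 15, scale_factor=2.0, scale_window=2000):
        self.loss_scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self._iter = 0                 # total update steps seen
        self._last_overflow_iter = -1  # step index of the most recent overflow

    def update_scale(self, overflow: bool):
        if overflow:
            # Back off the scale and remember when we last overflowed.
            self._last_overflow_iter = self._iter
            self.loss_scale /= self.scale_factor
        elif (self._iter - self._last_overflow_iter) % self.scale_window == 0:
            # Grow the scale after scale_window clean steps.
            self.loss_scale *= self.scale_factor
        self._iter += 1

# On resume, _iter and _last_overflow_iter are re-initialized (0 and -1),
# so the "clean steps since last overflow" counter restarts and the scale
# grows on a different schedule than in an uninterrupted run.
```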
We should add a flag to checkpoint this state for reproducibility, while also keeping an option to discard/reset this state on load if need be.
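One possible shape for the fix, building on the sketch above (illustrative only; the state_dict/load_state_dict methods and the reset_iter_state option are assumptions, not existing APIs of this class):

```python
# Sketch: checkpoint the scaler counters and gate restoring them behind an
# option (hypothetical names; not an existing fairseq API for this class).
class CheckpointableScalerSketch(ScalerSketch):
    def state_dict(self):
        return {
            "loss_scale": self.loss_scale,
            "_iter": self._iter,
            "_last_overflow_iter": self._last_overflow_iter,
        }

    def load_state_dict(self, state_dict, reset_iter_state=False):
        self.loss_scale = state_dict["loss_scale"]
        if not reset_iter_state:
            # Restore counters so the loss-scale schedule does not depend
            # on when the job was checkpointed / resumed.
            self._iter = state_dict["_iter"]
            self._last_overflow_iter = state_dict["_last_overflow_iter"]
```

The reset_iter_state path covers the "allow this state to be forgotten" case, e.g. when resuming with a different batch size or precision setup where the old overflow history is no longer meaningful.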