No longer able to load provided OPT checkpoint after recent changes
Created by: EIFY
🐛 Bug
To Reproduce
Edit metaseq/service/constants.py as before, in my case:
MAX_SEQ_LEN = 2048
BATCH_SIZE = 2048 # silly high bc we dynamically batch by MAX_BATCH_TOKENS
MAX_BATCH_TOKENS = 3072
DEFAULT_PORT = 6010
MODEL_PARALLEL = 1
TOTAL_WORLD_SIZE = 1
MAX_BEAM = 16
try:
    # internal logic denoting where checkpoints are in meta infrastructure
    from metaseq_internal.constants import CHECKPOINT_FOLDER
except ImportError:
    CHECKPOINT_FOLDER = "/home/jason_chou/redspot_home/350m/"
(...)
where
$ pwd
/home/jason_chou/redspot_home
$ ls 350m/
dict.txt gpt2-merges.txt gpt2-vocab.json reshard.pt
and then run metaseq-api-local, but it no longer works:
$ metaseq-api-local
2022-10-05 22:19:25 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/350m/reshard.pt
2022-10-05 22:19:26 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
sys.exit(cli_main())
File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
distributed_utils.call_main(cfg, worker_main, namespace_args=args)
File "/home/default_user/metaseq/metaseq/distributed/utils.py", line 279, in call_main
return main(cfg, **kwargs)
File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
models = generator.load_model() # noqa: F841
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 565, in load_model
models, _model_args, _task = _load_checkpoint()
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 548, in _load_checkpoint
return checkpoint_utils.load_model_ensemble_and_task(
File "/home/default_user/metaseq/metaseq/checkpoint_utils.py", line 482, in load_model_ensemble_and_task
model = build_model_hook(cfg, task)
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 538, in _build_model
setattr(cfg["model"], "inference", True)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 337, in __setattr__
raise e
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 334, in __setattr__
self.__set_impl(key, value)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 318, in __set_impl
self._set_item_impl(key, value)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 511, in _set_item_impl
self._validate_set(key, value)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 180, in _validate_set
target = self._get_node(key) if key is not None else self
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 465, in _get_node
self._validate_get(key)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 166, in _validate_get
self._format_and_raise(
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/base.py", line 190, in _format_and_raise
format_and_raise(
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/_utils.py", line 821, in format_and_raise
_raise(ex, cause)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/omegaconf/_utils.py", line 719, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigAttributeError: Key 'inference' is not in struct
full_key: model.inference
object_type=dict
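For context, the error is OmegaConf's struct mode at work: assigning a key that the checkpoint's saved model config never defined is rejected. Here is a minimal self-contained sketch of the failure mode and of a guarded assignment that would tolerate old checkpoints (the class and field names are illustrative, not metaseq code; with real OmegaConf one would instead wrap the assignment in open_dict(cfg["model"])):

```python
class StructConfig:
    """Toy stand-in for an OmegaConf struct-mode DictConfig: setting a
    key that is not already present raises, analogous to
    ConfigAttributeError: Key 'inference' is not in struct."""

    def __init__(self, **fields):
        object.__setattr__(self, "_data", dict(fields))

    def __getattr__(self, key):
        try:
            return self._data[key]
        except KeyError:
            raise AttributeError(key) from None

    def __setattr__(self, key, value):
        if key not in self._data:
            raise AttributeError(f"Key '{key}' is not in struct")
        self._data[key] = value


# old checkpoints carry a model config with no "inference" field
model_cfg = StructConfig(arch="transformer_lm")

try:
    model_cfg.inference = True  # mirrors setattr(cfg["model"], "inference", True)
except AttributeError as err:
    print(err)  # Key 'inference' is not in struct

# a guarded assignment would not crash on configs lacking the field
if hasattr(model_cfg, "inference"):
    model_cfg.inference = True
```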
This appears to trace back to the addition of setattr(cfg["model"], "inference", True) in https://github.com/facebookresearch/metaseq/pull/356. However, even with that line commented out, another issue surfaces:
$ metaseq-api-local
2022-10-05 22:23:31 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/350m/reshard.pt
2022-10-05 22:23:31 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
sys.exit(cli_main())
File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
distributed_utils.call_main(cfg, worker_main, namespace_args=args)
File "/home/default_user/metaseq/metaseq/distributed/utils.py", line 279, in call_main
return main(cfg, **kwargs)
File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
models = generator.load_model() # noqa: F841
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 565, in load_model
models, _model_args, _task = _load_checkpoint()
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 548, in _load_checkpoint
return checkpoint_utils.load_model_ensemble_and_task(
File "/home/default_user/metaseq/metaseq/checkpoint_utils.py", line 487, in load_model_ensemble_and_task
model.load_state_dict(state["model"], strict=strict)
File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerLanguageModel:
Missing key(s) in state_dict: "decoder.layer_norm.weight", "decoder.layer_norm.bias".
which seems to be due to recent cleanup PRs (https://github.com/facebookresearch/metaseq/pull/366, https://github.com/facebookresearch/metaseq/pull/380, https://github.com/facebookresearch/metaseq/pull/381).
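To confirm which parameters the checkpoint actually contains versus what the rebuilt model expects, one can diff the two sets of state_dict keys. A sketch with made-up key lists (on a real run you would take them from torch.load("reshard.pt")["model"].keys() and model.state_dict().keys()):

```python
def diff_state_dict_keys(ckpt_keys, model_keys):
    """Return (missing, unexpected): keys the model expects but the
    checkpoint lacks, and keys the checkpoint has but the model doesn't."""
    ckpt, model = set(ckpt_keys), set(model_keys)
    return sorted(model - ckpt), sorted(ckpt - model)


# illustrative key lists only; real ones come from the loaded checkpoint
ckpt_keys = ["decoder.embed_tokens.weight"]
model_keys = [
    "decoder.embed_tokens.weight",
    "decoder.layer_norm.weight",
    "decoder.layer_norm.bias",
]

missing, unexpected = diff_state_dict_keys(ckpt_keys, model_keys)
print(missing)     # ['decoder.layer_norm.bias', 'decoder.layer_norm.weight']
print(unexpected)  # []
```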
Expected behavior
metaseq-api-local up & running
Environment
- metaseq Version: latest main (7828d728)
- PyTorch Version: 1.12.1+cu113
- OS (e.g., Linux): Ubuntu 18.04.6 LTS
- How you installed metaseq: pip
- Python version: 3.10.4
- CUDA/cuDNN version: CUDA 11.8
- GPU models and configuration: 1 x T4