Why is "vocab_size" in the config file 50272 while len(tokenizer) is 50265?
Created by: Zcchill
🐛 Bug
The "vocab_size" in config file is 50272 but the len(tokenizer) is 50265, they not match eacch other.
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
- Run cmd '....'
- See error
Code sample
model.resize_token_embeddings(len(tokenizer))
Expected behavior
The results seem good when I use the code above to align the model to the tokenizer, but I just wonder why the vocab size used for training is 50272. Did I miss some important parameter?
Environment
- metaseq Version (e.g., 1.0 or master):
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux, Windows, MacOS):
- How you installed metaseq (pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information: