Why is "vocab_size" in the config file 50272 while len(tokenizer) is 50265?
Created by: Zcchill
🐛 Bug
The "vocab_size" in config file is 50272 but the len(tokenizer) is 50265, they not match eacch other.
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
- Run cmd '....'
- See error
Code sample
model.resize_token_embeddings(len(tokenizer))
Expected behavior
The results seem good when I use the code above to align the model to the tokenizer, but I just wonder why the vocab size used for training is 50272. Did I miss some important parameter?
Environment
- metaseq Version (e.g., 1.0 or master):
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux, Windows, MacOS):
- How you installed metaseq (pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information: