I’ve been working on transferring LLMs across tokenizers using a new method called ALM (details in our paper [1]). It distills a model trained with one tokenizer into a version that uses another, which makes things like converting subword models into byte-level ones much more effective than was previously possible.
To make this easy to use, I released tokenkit, a library implementing ALM and other tokenizer transfer methods: https://github.com/bminixhofer/tokenkit.
As a demo, I used ALM to create two byte-level instruction-tuned models:
- https://huggingface.co/benjamin/Gemma2-2B-IT-Byte
- https://huggingface.co/benjamin/Llama3-2-3B-IT-Byte
Even though the distillation phase is very short (just 1.2B bytes ≈ 330M subword tokens), the models perform competitively: for example, the byte-level Llama scores 57.0% on MMLU vs. 62.4% for the original Llama-3.2-3B-Instruct.
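A quick back-of-the-envelope check of these numbers, as a sketch that uses only the figures quoted above (the variable names are made up for illustration):

```python
# Figures quoted in the post; nothing here is measured independently.
distill_bytes = 1.2e9            # bytes seen during distillation
distill_subword_tokens = 330e6   # the same data counted in subword tokens
mmlu_byte = 57.0                 # byte-level Llama, MMLU (%)
mmlu_original = 62.4             # original subword Llama, MMLU (%)

# Average number of bytes covered by one subword token in this data.
bytes_per_token = distill_bytes / distill_subword_tokens

# Absolute gap in percentage points, and fraction of the original score kept.
mmlu_gap = mmlu_original - mmlu_byte
mmlu_retained = mmlu_byte / mmlu_original

print(f"{bytes_per_token:.2f} bytes per subword token")
print(f"{mmlu_gap:.1f} point MMLU gap, {mmlu_retained:.1%} of the original score retained")
```

So one subword token covers roughly 3.6 bytes here, which is also roughly how many more sequence positions a byte-level model has to process for the same text.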
This approach opens up an interesting direction: we can potentially keep subword tokenization for pretraining (to squeeze as much text into the model in as little time as possible) and then switch to a more user-friendly tokenization afterwards.
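To make "byte-level" concrete, here is a minimal sketch that assumes token ids are simply the UTF-8 bytes of the text. It is an illustration only, not the tokenizer shipped with the models above (which may, for example, reserve extra ids for special tokens); `to_byte_ids` and `from_byte_ids` are hypothetical helper names.

```python
def to_byte_ids(text: str) -> list[int]:
    """Map text to a sequence of ids in 0..255, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Invert to_byte_ids; raises if the bytes are not valid UTF-8."""
    return bytes(ids).decode("utf-8")


text = "na\u00efve tokenizer"  # contains one non-ASCII character
ids = to_byte_ids(text)

# No learned vocabulary is needed, and the mapping is lossless.
assert from_byte_ids(ids) == text
assert all(0 <= i < 256 for i in ids)

# Non-ASCII characters take more than one byte, so the id sequence
# can be longer than the character count.
print(len(text), "characters ->", len(ids), "byte ids")
```

The appeal is that there are no out-of-vocabulary strings and no surprising token boundaries; the cost is longer sequences, which is why efficiency matters.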
These models aren’t yet optimized for efficiency, but with self-speculative decoding plus a BLT/DTP-style hierarchical architecture and/or linearized attention, they might also be able to replace subword-based models when speed matters.
If you're interested in training your own models, the guide on tokenizer transfer via tokenkit [2] should make it easy. The model cards of the transfers above also contain the exact command used to train them. I’ve been training on fairly limited hardware, so effective transfer is possible even in a (near) consumer-grade setup.
[1]: https://arxiv.org/abs/2503.20083
[2]: https://github.com/bminixhofer/tokenkit/blob/main/docs/token...