Note: alternative to regex splitting in byte tokenizers
In a previous note, I discussed an alternative to setting use_regex to True in a ByteLevel pretokenizer. I suggested applying a ByteLevel normalizer first, and then splitting with a complicated regex in “byte space”. However, this turned out not to work very well: certain character classes in the original regex, such as \s, are very difficult to convert to a pattern in byte space.
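To see why, here is a minimal sketch of the problem, assuming the ByteLevel normalizer from tokenizers.normalizers (the example string is mine):

from tokenizers.normalizers import ByteLevel as ByteLevelNormalizer

# After byte-level normalization, whitespace is remapped to printable
# stand-ins (a space becomes "Ġ", a newline becomes "Ċ", and multi-byte
# whitespace such as U+3000 becomes several remapped bytes), so \s no
# longer matches any of it in byte space.
normalizer = ByteLevelNormalizer()
print(normalizer.normalize_str("a b\nc\u3000d"))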
I was wondering how others handled this, and discovered that you can stack multiple pretokenizers: first a Split pretokenizer with your regex, then a ByteLevel pretokenizer with use_regex set to False. This is, e.g., what Qwen/Qwen3-Embedding-0.6B uses. Doing it this way is correct and achieves my original proposal: a way to split using a regex of your own design, combined with byte-level normalization.
Here’s what that looks like:
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split, ByteLevel, Sequence

# The GPT-2 split pattern, applied in normal Unicode space.
pattern = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
split = Split(Regex(pattern), behavior="isolated")
# ByteLevel only remaps bytes here; use_regex=False disables its built-in splitting.
byte = ByteLevel(use_regex=False, add_prefix_space=False)
pretokenizer = Sequence([split, byte])

# The conventional setup, where ByteLevel does the regex splitting itself.
original = ByteLevel(use_regex=True, add_prefix_space=False)

s = "hello, ご 「きげんよう?」?”"
# Both should print the same list of pre-tokens.
print(pretokenizer.pre_tokenize_str(s))
print(original.pre_tokenize_str(s))
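If you want to check the Qwen/Qwen3-Embedding-0.6B claim above, you can load its tokenizer from the Hub and inspect the pre-tokenizer configuration directly (a quick sketch, assuming network access):

import json
from tokenizers import Tokenizer

# The dumped config should show a Sequence containing a Split followed by
# a ByteLevel, per the description above.
tok = Tokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
print(json.loads(tok.to_str())["pre_tokenizer"])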
This allows you to change your regex freely. One thing to note: add_prefix_space needs to be set to False on both for the two setups to be exactly equivalent. If you do want a prefix space, add a Prepend normalizer instead.
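Here is a sketch of that last case, using the Prepend normalizer from tokenizers.normalizers together with the pretokenizer built above (the BPE model is just a placeholder):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Prepend

tokenizer = Tokenizer(BPE())
# Prepend a single space to the whole input once, rather than relying on
# ByteLevel's add_prefix_space, which would act on every pre-token.
tokenizer.normalizer = Prepend(" ")
tokenizer.pre_tokenizer = pretokenizer  # the Sequence([split, byte]) from above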