Tokenizer decasing
In this post I will talk about something I call tokenizer decasing. Decasing is very similar to putting a lowercase normalizer in front of a tokenizer, but works better.
Cased vs Uncased
Most modern tokenizers are cased: they create different segmentations for words with (leading) uppercase letters. For example, the intfloat/multilingual-e5-base tokenizer gives you the following segmentations:
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("intfloat/multilingual-e5-base")
print(tok.encode("Amsterdam", add_special_tokens=False).tokens)
# ['▁Amsterdam']
print(tok.encode("amsterdam", add_special_tokens=False).tokens)
# ['▁am', 'ster', 'dam']
As you can see, you not only get completely different tokens when tokenizing Amsterdam and amsterdam, but the lowercased version is also tokenized much less efficiently: Amsterdam suddenly takes up 3 tokens instead of 1. Note that the representations of amsterdam and Amsterdam share 0 input tokens. Hence, any similarity between the uppercase and lowercase variants of these strings needs to be learned by the model. Yet another way to put this: there is no a priori way for the model to tell that these strings actually refer to the same concept. So if you, say, have to teach the model to recognize cities, it will need to learn a lot of things multiple times.
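A quick sanity check of the zero-overlap claim, using the same tokenizer as above (just a small illustration):
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("intfloat/multilingual-e5-base")

# Compare the token ids of the cased and lowercased spellings.
upper = set(tok.encode("Amsterdam", add_special_tokens=False).ids)
lower = set(tok.encode("amsterdam", add_special_tokens=False).ids)
print(upper & lower)
# set() -- the two segmentations have no token ids in common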
Uncased tokenizers do not care about the distinction: they were trained on lowercase text, and turn any input string into a lowercased version. Let’s use bert-base-uncased as an example:
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("bert-base-uncased")
print(tok.encode("Amsterdam", add_special_tokens=False).tokens)
# ['amsterdam']
print(tok.encode("amsterdam", add_special_tokens=False).tokens)
# ['amsterdam']
This looks cool! But unfortunately, many tokenizers are inherently cased. Decasing is the process of turning a cased tokenizer into an uncased one.
Why uncase
So why would we ever decase a tokenizer? Here are some observations:
- Many pretrained models use cased tokenizers (e.g., ModernBert).
- Users rarely use case consistently.
- If we lowercase our text before putting it into the model, we never use any of the uppercase tokens in its vocabulary.
For reference, about 12% of all tokens in the multilingual-e5 tokenizer have at least one cased letter, and about 25% of tokens in the modernbert tokenizer are cased. So throwing out these tokens would be pretty wasteful. Hence, decasing!
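If you want to check numbers like these yourself, a rough count along these lines works (exact percentages will vary a bit depending on how you treat special and byte-level tokens; the hub ids below are the ones I assume were used):
from tokenizers import Tokenizer

def cased_fraction(name: str) -> float:
    # Fraction of vocabulary entries containing at least one uppercase character.
    vocab = Tokenizer.from_pretrained(name).get_vocab()
    cased = sum(1 for token in vocab if any(c.isupper() for c in token))
    return cased / len(vocab)

print(cased_fraction("intfloat/multilingual-e5-base"))
print(cased_fraction("answerdotai/ModernBERT-base"))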
The actual decasing
The decasing procedure is really simple: we open the tokenizer internals and lowercase all internal tokens. If the tokenizer is a BPE tokenizer, you should also lowercase the merges, and create new merges for new tokens that need them. Don’t lowercase any special tokens. For any token, if the lowercase version already exists, replace the cased one with an anonymous placeholder.
Then, insert a lowercasing Normalizer, which lowercases all input strings.
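To make the procedure concrete, here is a minimal sketch for a WordPiece tokenizer such as bert-base-cased, where the vocabulary is a plain token-to-id mapping. This is not the real implementation: BPE merges and Unigram score lists need the extra handling described above, which I'm skipping here, and the placeholder token name is just an arbitrary choice.
import json
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("bert-base-cased")
data = json.loads(tok.to_str())

special = {t["content"] for t in data.get("added_tokens", [])}
vocab = data["model"]["vocab"]

new_vocab = {}
# First pass: keep special tokens and tokens that are already lowercase.
for token, idx in vocab.items():
    if token in special or token == token.lower():
        new_vocab[token] = idx
# Second pass: lowercase the cased tokens; if the lowercase form is taken,
# park the old id on an anonymous placeholder so no ids are lost.
for token, idx in vocab.items():
    if token in special or token == token.lower():
        continue
    lowered = token.lower()
    if lowered in new_vocab:
        new_vocab[f"[UNUSED_CASED_{idx}]"] = idx
    else:
        new_vocab[lowered] = idx
data["model"]["vocab"] = new_vocab

# Insert a lowercasing normalizer in front of whatever was already there.
lowercase = {"type": "Lowercase"}
data["normalizer"] = (
    {"type": "Sequence", "normalizers": [lowercase, data["normalizer"]]}
    if data.get("normalizer")
    else lowercase
)

decased = Tokenizer.from_str(json.dumps(data))
print(decased.encode("Amsterdam", add_special_tokens=False).tokens)
# e.g. ['amsterdam'] (the exact pieces depend on the original vocabulary)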
And that’s it! Now your tokenizer will use 100% of the vocabulary. For example:
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("multilingual-e5-base_uncased")
print(tok.encode("Amsterdam", add_special_tokens=False).tokens)
# ['▁amsterdam']
print(tok.encode("amsterdam", add_special_tokens=False).tokens)
# ['▁amsterdam']
(Note that while this sounds really simple, it’s actually pretty annoying, and took me a lot of thinking to get right.)
You can find the code here. It is simple: it exposes a single function that decases your tokenizer. Use it and conquer all tokens.
How well does this work?
Using the above code, I ran some preliminary experiments with two models: ModernBERT and multilingual-e5-base.
I used nanobeir and the English Clustering subset of the Massive Text Embedding Benchmark (MTEB). I used these subsets specifically because they test the inherent similarity learned by the models in a large variety of domains.
Nanobeir results
The scores are NDCG@10, averaged over all datasets.
|          | Modernbert | e5   |
|----------|------------|------|
| Original | 57.68      | 55.4 |
| Lowered  | 56.01      | 56.2 |
| Decased  | 56.13      | 56.5 |
In all cases, the decased version outperformed the lowercased one, but with very small margins. So if your users prefer lowercased input, decasing seems to beat lowercasing. But there’s not a lot of reason to prefer this over the cased version. An interesting alternate reading, however, is that this actually is working quite well out of the box: lowercasing changes the distribution of tokens, so a performance drop is to be expected. That this drop is so small might be a good sign for further fine-tuning.
Clustering results
The scores here are the V-measure.
|          | Modernbert | e5   |
|----------|------------|------|
| Original | 50.8       | 43.1 |
| Lowered  | 50.0       | 43.8 |
| Decased  | 50.0       | 42.7 |
Strangely, the clustering results show a different pattern. For Modernbert, decasing does not seem to help at all. For e5, the pattern is even weirder: lowercasing outperforms the original cased model, but decasing is worse than either. A closer look at the results shows that this result is entirely caused by a single dataset, on which lowercasing improves scores by a whopping 20 points (!). Leaving out this dataset shows the same pattern as before.
Discussion
While the differences are very small, we often see an increase in performance when decasing compared to naively lowercasing. So we can conclude that if your task, users, or domain do not care about casing, it can pay off to decase instead of naively lowercasing. Compared to using the cased model, however, decasing and lowercasing are not always beneficial; whether they help depends on the specific model, task, and dataset. But, luckily for us, it is very cheap to try both.
Future work
There’s many other ways to improve performance of tokenizers, including:
- Finetuning after decasing
- Switching to another tokenizer model during inference (see: Greedy tokenization paper)
- Improving the pretokenizer to return appropriate tokens after punctuation.
- Removing the prefix space if a byte pretokenizer is used.
In the coming weeks, I’ll be tackling some of these topics. I think there’s a lot of leverage in improving tokenizers, so I hope some of these manipulations will show better performance than the ones I just showed you.