Note: alternative to regex splitting in byte tokenizers
In a previous note, I discussed an alternative to setting split to true in a ByteLevel pretokenizer. I suggested using a ByteLevel normalizer first, and then splitting using a complicated regex in “byte space”. However, this turned out to not work very well: there are certain character classes in the original regex, such as \s, that are very difficult to convert to a pattern in byte space.
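To make the failure mode concrete, here is a minimal sketch that reimplements the GPT-2 byte-to-unicode table (the mapping ByteLevel components are based on) purely for illustration, showing why a class like \s stops meaning “whitespace” once the text lives in byte space:

```python
import re

def bytes_to_unicode() -> dict[int, str]:
    # The GPT-2 byte-to-unicode table: printable bytes map to themselves,
    # everything else maps to code points at or above U+0100.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

mapping = bytes_to_unicode()

# A space byte becomes "Ġ" and a newline becomes "Ċ" in byte space.
print(mapping[ord(" ")], mapping[ord("\n")])

# Neither character is matched by \s, so the whitespace class from the
# original pattern no longer selects anything useful after normalization.
print(re.match(r"\s", mapping[ord(" ")]))  # None
```

You would have to enumerate the byte-space counterparts of every whitespace code point by hand, which is exactly the kind of bookkeeping that made the approach unattractive.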
Turning any tokenizer into a greedy one
I recently re-read Greed is All You Need: An Evaluation of Tokenizer Inference Methods. In this paper, the authors show that switching out inference methods for tokenizers can improve performance on various tasks.
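The paper’s exact setups aren’t reproduced here, but the core idea of greedy inference is easy to sketch: segment by repeatedly taking the longest vocabulary entry that matches at the current position, regardless of how the tokenizer was trained. A minimal, hypothetical version:

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match inference over a fixed vocabulary."""
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until we find a match.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            # Fall back to a single (possibly out-of-vocabulary) character
            # so the loop always makes progress.
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

vocab = {"un", "happi", "happiness", "ness", "h", "a", "p", "i", "n", "e", "s", "u"}
print(greedy_tokenize("unhappiness", vocab))  # ['un', 'happiness']
```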
Tokenizer decasing
In this post I will talk about something I call tokenizer decasing. Decasing is very similar to putting a lowercase normalizer in front of a tokenizer, but works better.
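For reference, the baseline that decasing is compared against is just a lowercase normalizer; with the HuggingFace tokenizers library that looks roughly like this (a sketch of the baseline, not of the decasing procedure itself):

```python
from tokenizers.normalizers import Lowercase

# The naive baseline: fold case before the tokenizer ever sees the text.
normalizer = Lowercase()
print(normalizer.normalize_str("Hello WORLD"))  # "hello world"

# Attached to an existing tokenizer it would sit in front of everything else,
# e.g. tokenizer.normalizer = Lowercase()
```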
Using overload to handle tagged union return types
Here’s a function with an idiom I’ve seen a lot (probably copied from sentence-transformers):
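The post’s actual snippet isn’t reproduced here; the shape of the idiom, with made-up names, is a boolean flag that decides which member of the union you get back, which typing.overload can express:

```python
from typing import Literal, overload

# Hypothetical signatures: the flag decides whether you get token strings
# or token ids, so the return type is a tagged union on `return_ids`.
@overload
def tokenize(text: str, return_ids: Literal[True]) -> list[int]: ...
@overload
def tokenize(text: str, return_ids: Literal[False] = ...) -> list[str]: ...

def tokenize(text: str, return_ids: bool = False) -> list[int] | list[str]:
    tokens = text.split()  # stand-in for a real tokenizer
    if return_ids:
        return [hash(t) % 30_000 for t in tokens]
    return tokens

ids = tokenize("hello world", return_ids=True)  # checker infers list[int]
strings = tokenize("hello world")               # checker infers list[str]
```

Without the overloads, every caller would be handed the full union and forced to narrow it themselves.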
Protocols to make untyped code behave
Working with external untyped code in a typed code base can be challenging: you’ll get lots of Any or Unknown, which might propagate through your codebase. This can force you to reach for typing.cast or # type: ignore statements, which kind of defeats the purpose of using static typing in the first place.
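A typing.Protocol lets you describe just the slice of the external object you actually depend on, so the Any stops spreading at the boundary. A minimal sketch with made-up names:

```python
from typing import Any, Protocol

class Embedder(Protocol):
    """Only the part of the external object we rely on."""
    def encode(self, texts: list[str]) -> list[list[float]]: ...

def average_dim(model: Embedder, texts: list[str]) -> float:
    # From here on, the checker knows exactly what `model` offers.
    vectors = model.encode(texts)
    return sum(len(v) for v in vectors) / len(vectors)

# Stand-in for an untyped third-party loader that returns Any.
def load_model() -> Any: ...

model: Embedder = load_model()  # the Any stops propagating at this annotation
```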
Rethinking evaluation and relative performance
Here’s a pop quiz: classifier A scores 90% accuracy on some benchmark. Classifier B scores 80%. How much better is A?
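As a quick sketch of the arithmetic the quiz hinges on: the answer depends on whether you compare accuracies or error rates.

```python
acc_a, acc_b = 0.90, 0.80
err_a, err_b = 1 - acc_a, 1 - acc_b

print(f"{acc_a - acc_b:.0%}")      # 10 percentage points of accuracy
print(f"{acc_a / acc_b - 1:.1%}")  # 12.5% relatively more accurate
print(f"{1 - err_a / err_b:.0%}")  # a 50% reduction in errors
```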
Exposing string types to maximize user happiness
Regular users of my blog will know that I am opposed to what is known as stringly typing: using strings in place of more strongly typed identifiers. As an example, consider a language-specific tokenizer:
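The tokenizer from the post isn’t shown here, but a hypothetical version of both styles makes the term concrete:

```python
from typing import Literal

# Stringly typed: any string is accepted, and typos only fail at runtime.
def tokenize_stringly(text: str, language: str) -> list[str]:
    if language not in ("en", "nl", "de"):
        raise ValueError(f"unsupported language: {language}")
    return text.split()

# More strongly typed: the checker rejects tokenize(text, "engish") outright.
Language = Literal["en", "nl", "de"]

def tokenize(text: str, language: Language) -> list[str]:
    return text.split()
```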