Stéphan Tulkens

NLP Person /// token addict

Note: alternative to regex splitting in byte tokenizers

In a previous note, I discussed an alternative to setting split to true in a ByteLevel pretokenizer. I suggested using a ByteLevel normalizer first, and then splitting using a complicated regex in “byte space”. However, this turned out not to work very well: certain character classes in the original regex, such as \s, are very difficult to convert to a pattern in byte space.
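
Concretely, the wiring looks something like the sketch below (assuming a tokenizers version recent enough to ship a ByteLevel normalizer; the pattern itself is a made-up placeholder, since translating the real regex into byte space is exactly the hard part):

```python
from tokenizers import Regex, Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE

# Sketch of the setup described above: move into byte space during
# normalization, then split there. The pattern is a placeholder: the
# whole problem is that classes like \s have no clean byte-space analogue.
tok = Tokenizer(BPE())
tok.normalizer = normalizers.ByteLevel()
tok.pre_tokenizer = pre_tokenizers.Split(
    Regex(r"Ġ?[^Ġ]+"),  # hypothetical byte-space pattern; Ġ encodes a leading space
    behavior="isolated",
)
```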

Read More

Separate Normalization from Splitting in ByteLevel tokenizers

Read More

Turning any tokenizer into a greedy one

I recently re-read Greed is All You Need: An Evaluation of Tokenizer Inference Methods. In this paper, the authors show that switching out inference methods for tokenizers can improve performance on various tasks.
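
As a rough illustration of what greedy inference means here (my own sketch over plain characters, not the paper's exact procedure; real tokenizers work on bytes and handle unknowns more carefully):

```python
def greedy_encode(text: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy, longest-match-first inference: at each position, take the
    longest vocabulary item that matches, ignoring the tokenizer's
    original merge rules or token probabilities."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(unk)  # no vocabulary item matches at all
            i += 1
    return tokens


print(greedy_encode("unhappiness", {"un", "happiness", "happy", "ness", "h"}))
# ['un', 'happiness']
```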

Read More

Tokenizer decasing

In this post I will talk about something I call tokenizer decasing. Decasing is very similar to putting a lowercase normalizer in front of a tokenizer, but works better.
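
For reference, the baseline this gets compared against is just bolting a Lowercase normalizer onto an existing tokenizer, roughly as in this sketch (bert-base-cased is used purely as an example, and the existing normalizer is replaced rather than chained for brevity; decasing itself is not shown here):

```python
from tokenizers import Tokenizer
from tokenizers.normalizers import Lowercase

# The baseline: lowercase every input before tokenization. The vocabulary
# keeps its cased entries, which simply become unreachable.
tok = Tokenizer.from_pretrained("bert-base-cased")
tok.normalizer = Lowercase()

print(tok.encode("Hello World").tokens)  # e.g. ['hello', 'world'], never 'Hello'
```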

Read More

kwargs.pop is probably a code smell

Sometimes I see something like this:
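
The snippet is cut off in this preview, but the shape of the pattern is something like this made-up example (not the code from the post):

```python
def encode(texts: list[str], **kwargs) -> list[list[float]]:
    # Options arrive through **kwargs and are popped one by one, so the
    # signature no longer tells you (or the type checker) which options exist.
    batch_size = kwargs.pop("batch_size", 32)
    normalize = kwargs.pop("normalize", True)
    device = kwargs.pop("device", "cpu")
    # A misspelled option like batchsize=64 is silently ignored.
    ...
```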

Read More

Using overload to handle tagged union return types

Here’s a function with an idiom I’ve seen a lot (probably copied from sentence-transformers):
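
The function itself is cut off in this preview; a hypothetical version of the idiom, plus the overload treatment, looks roughly like this (the name encode and the convert_to_numpy flag are invented here, loosely echoing the sentence-transformers style):

```python
from typing import Literal, overload

import numpy as np


# The return type depends on a flag, so without overloads the annotation
# collapses to a union that every caller has to narrow by hand.
@overload
def encode(text: str, convert_to_numpy: Literal[True] = ...) -> np.ndarray: ...
@overload
def encode(text: str, convert_to_numpy: Literal[False]) -> list[float]: ...
def encode(text: str, convert_to_numpy: bool = True) -> np.ndarray | list[float]:
    embedding = [0.1, 0.2, 0.3]  # stand-in for a real model forward pass
    return np.array(embedding) if convert_to_numpy else embedding


vec = encode("hello")                          # inferred as np.ndarray
lst = encode("hello", convert_to_numpy=False)  # inferred as list[float]
```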

Read More

Protocols to make untyped code behave

Working with external untyped code in a typed codebase can be challenging: you’ll get lots of Any or Unknown types, which might propagate through your codebase. This can force you to reach for typing.cast or # type: ignore statements, which kind of defeats the purpose of using static typing in the first place.
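
A minimal sketch of the alternative (the ExternalTokenizer class below just stands in for something from an untyped third-party package):

```python
from typing import Protocol


class SupportsTokenize(Protocol):
    """The slice of the external interface we actually rely on."""

    def tokenize(self, text: str) -> list[str]: ...


def count_tokens(tokenizer: SupportsTokenize, text: str) -> int:
    # Typed code only talks to the Protocol, not to the untyped object.
    return len(tokenizer.tokenize(text))


class ExternalTokenizer:
    """Stand-in for a class from an untyped third-party package."""

    def tokenize(self, text):  # note: no annotations
        return text.split()


# Structural typing: the external object satisfies the Protocol without
# inheriting from it, so no casts or type: ignore comments are needed.
print(count_tokens(ExternalTokenizer(), "hello world"))  # 2
```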

Read More

Rethinking evaluation and relative performance

Here’s a pop quiz: classifier A scores 90% accuracy on some benchmark. Classifier B scores 80%. How much better is A?
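
The arithmetic allows several answers: ten accuracy points, a 12.5% relative improvement, or, counted in errors, 10 mistakes per 100 examples instead of 20, i.e. half the errors. Which framing you pick changes the story considerably.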

Read More

Exposing string types to maximize user happiness

Regular users of my blog will know that I am opposed to what is known as stringly typing: using strings in place of more strongly typed identifiers. As an example, consider a language-specific tokenizer:
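
The example is cut off in this preview; the stringly-typed version and the string-typed alternative look roughly like the sketch below (the function names and language codes are invented for illustration):

```python
from typing import Literal

# Stringly typed: any string is accepted, and a typo like "eng" only
# fails (or silently misbehaves) at runtime.
def load_tokenizer_stringly(language: str): ...

# Exposing the string type instead: still plain strings at runtime, but
# the type checker (and autocomplete) knows exactly which values are valid.
Language = Literal["en", "nl", "fr"]

def load_tokenizer(language: Language): ...

load_tokenizer("nl")   # fine
load_tokenizer("eng")  # flagged by the type checker
```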

Read More

String casing in Python

Below are two ways to check if a string is lower-cased in Python.
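
The code itself is cut off in this preview, but the two candidates are presumably str.islower() and a round trip through str.lower(), which do not always agree:

```python
def is_lower_islower(s: str) -> bool:
    return s.islower()


def is_lower_roundtrip(s: str) -> bool:
    return s == s.lower()


# The two disagree on strings without any cased characters:
print(is_lower_islower("123"), is_lower_roundtrip("123"))  # False True
print(is_lower_islower(""), is_lower_roundtrip(""))        # False True
```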

Read More