Stéphan Tulkens

Skeletoken 0.4.0 release

tokenization | Aug 2, 2026

skeletoken has a new version! For those not in the know: skeletoken is a set of Pydantic datamodels that fully describe the Hugging Face tokenizer.json format, so you can validate, edit, and transform tokenizers as typed objects instead of editing JSON files. So: version 0.4.0 is out, and it’s a pretty big jump from 0.3.3, which is why I decided to write a blog post about it.

From Chesterton's fence to Chesterton's gap

Jun 14, 2026

The English writer and Christian apologist G. K. Chesterton is perhaps best known to programmers through a paragraph in which he introduces what is now known as “Chesterton’s fence”. It’s a very simple idea: you walk through a field and see a fence that seems to have no purpose. Instead of tearing it down because you can’t see its use, try to understand, or ask, why somebody put it there. (¹) That’s it!

The full quote is: ‘In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”’ [source]

Make something for someone

creation | May 19, 2026

I recently listened to the album Music for Existing by producer Martyn. I wasn’t really familiar with him; in fact, the album was algorithmically recommended to me because it features Duval Timothy, whose album Meeting with a Judas Tree I adore.

Scikit-learn's fit transform paradigm is probably not for you

python | May 17, 2026

If you’ve ever used code from scikit-learn, you will have seen the following pattern:
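As a rough illustration of the fit/transform pattern, here is a minimal sketch in plain Python. The class `StandardScalerSketch` is a hypothetical stand-in written for this example, not scikit-learn's own implementation; it only mimics the convention that `fit` learns state and returns `self`, and `transform` applies that state.

```python
from statistics import mean, pstdev


class StandardScalerSketch:
    """Hypothetical stand-in for a scikit-learn style transformer."""

    def fit(self, values):
        # fit() learns state from the data and returns self,
        # so that calls can be chained.
        self.mean_ = mean(values)
        self.scale_ = pstdev(values) or 1.0  # avoid division by zero
        return self

    def transform(self, values):
        # transform() applies the learned state to (possibly new) data.
        return [(v - self.mean_) / self.scale_ for v in values]

    def fit_transform(self, values):
        # Convenience method: fit on the data, then transform the same data.
        return self.fit(values).transform(values)


train = [1.0, 2.0, 3.0]
scaler = StandardScalerSketch().fit(train)
scaled_train = scaler.transform(train)
scaled_new = scaler.transform([4.0])  # reuses the state learned from `train`
```

The point of the split is that the state learned in `fit` can be reused on data the object has never seen, as in the last line.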

Evaluating static models on RTEB

static models | Oct 8, 2025

The group of researchers associated with the Massive Text Embedding Benchmark (MTEB) has released a new benchmark: the Retrieval Text Embedding Benchmark (RTEB). As you may know, MTEB ranks models on their ability to perform well at a variety of tasks in a zero-shot setting, and is meant to reflect how well your model transfers to new tasks. Ranking high on MTEB can make or break your model, so it has become something that people optimize for, and as Goodhart put it: “when a measure becomes a target, it ceases to be a good measure”.

Comparing PCA and MRL for static models

static models | Oct 6, 2025

Without dimensionality reduction, static models can be hundreds of megabytes in size. Choosing the right dimensionality-reduction technique can shrink them without sacrificing retrieval quality. I was always a huge fan of Principal Component Analysis (PCA) for making static models smaller. For example, PCA is used in model2vec, was used in an older version of tokenlearn to post-process models, and is used in the newer version of tokenlearn to reduce the dimensionality of the teacher models. Recently, however, I started experimenting with Matryoshka Representation Learning (MRL) for reducing dimensions and, to my surprise, found it to be superior. This blog post thus tries to answer the question: when should you use PCA, and when MRL? If one is better than the other, why? I discuss both techniques, why applying dimensionality reduction to static embeddings makes sense, and some options for future work.
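To make the contrast concrete: PCA needs a projection that is fitted after training, whereas a model trained with MRL is usually shrunk by simply keeping the first k dimensions and re-normalizing. The sketch below shows only that truncation step, under the assumption that the embedding comes from an MRL-trained model; the function name and the example vector are made up for illustration.

```python
import math


def truncate_and_normalize(vector, k):
    """Keep the first k dimensions and rescale to unit length."""
    head = vector[:k]
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero vector untouched instead of dividing by zero.
    return [x / norm for x in head] if norm else head


# Hypothetical 4-dimensional embedding, reduced to 2 dimensions.
embedding = [0.6, 0.8, 0.1, -0.2]
small = truncate_and_normalize(embedding, 2)
```

Note that truncating a model that was not trained with MRL gives no such guarantee, because nothing forces the leading dimensions to carry the most information.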

Static late interaction models

static models | Sep 30, 2025

Late interaction is an interesting paradigm for computing the similarity between two documents, and can be seen as a hybrid of sparse and dense retrieval. In this post, I will show how static models in a late interaction setting actually reduce to sparse models. I will also argue that, in the absence of empirical evidence to the contrary, there’s no good reason to assume that static late interaction models will be much better than their dense counterparts. But first, let’s dive into some fundamentals: I’ll explain what sparse retrieval and dense retrieval are, and how late interaction fits in with both of those paradigms.
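For readers who have not seen late interaction before, the usual scoring rule is MaxSim, as popularized by ColBERT: every query token vector is matched to its most similar document token vector, and the maxima are summed. The following is a minimal sketch with made-up toy vectors, not code from the post.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors, doc_vectors):
    """Sum, over query tokens, of the best match among document tokens."""
    return sum(
        max(dot(q, d) for d in doc_vectors)
        for q in query_vectors
    )


# Toy token vectors (hypothetical): two query tokens, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, -1.0]]

score_a = maxsim_score(query, doc_a)  # 1.0 + 0.5
score_b = maxsim_score(query, doc_b)  # 0.0 + (-1.0)
```

Unlike dense retrieval, which compares one pooled vector per text, the score here depends on token-level matches, which is why the paradigm sits between sparse and dense retrieval.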

Better Greedy Tokenizers: Handling WordPiece's [UNK] Problem

tokenization | Sep 18, 2025

In a previous post, I showed that making a tokenizer greedy, that is, always picking the longest matching subword like WordPiece does, can improve results without retraining. But WordPiece can unfortunately silently break your tokenization.
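To see how this can go wrong, here is a minimal sketch of greedy longest-match tokenization in the style of BERT's WordPiece, with a tiny made-up vocabulary. I am assuming the common behaviour in which a word is replaced by a single [UNK] as soon as any part of it cannot be matched; the function below is an illustration, not the Hugging Face implementation.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first tokenization of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # One unmatched piece turns the whole word into [UNK].
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary.
vocab = {"token", "##ize", "##r"}

ok = wordpiece("tokenizer", vocab)       # ['token', '##ize', '##r']
broken = wordpiece("tokenizers", vocab)  # ['[UNK]']
```

A single missing "##s" is enough to throw away the three pieces that did match, and nothing in the output tells you that this happened.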

Note: alternative to regex splitting in byte tokenizers

tokenization | Aug 12, 2025

In a previous note, I discussed an alternative to setting split to true in a ByteLevel pretokenizer. I suggested using a ByteLevel normalizer first, and then splitting using a complicated regex in “byte space”. However, this turned out to not work very well: there are certain character classes in the original regex, such as \s, that are very difficult to convert to a pattern in byte space.
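The difficulty with \s can be reproduced in a few lines. The sketch below rebuilds the byte-to-character table commonly used by GPT-2 style byte-level tokenizers (I am assuming the ByteLevel normalizer uses the same table) and shows that \s matches three characters in the original text but nothing once the text has been mapped to byte space.

```python
import re


def bytes_to_unicode():
    """Map each byte to a printable character, GPT-2 style."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    # Printable bytes map to themselves.
    mapping = {b: chr(b) for b in printable}
    offset = 0
    for b in range(256):
        if b not in mapping:
            # Everything else is shifted to code points from 256 upwards.
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_MAP = bytes_to_unicode()


def to_byte_space(text):
    """Encode as UTF-8 and replace each byte by its mapped character."""
    return "".join(BYTE_MAP[b] for b in text.encode("utf-8"))


# An ASCII space, a no-break space, and an ideographic space.
text = "a b\u00a0c\u3000d"
original_matches = re.findall(r"\s", text)
mapped = to_byte_space(text)
mapped_matches = re.findall(r"\s", mapped)  # empty: no whitespace survives
```

In byte space the ASCII space becomes the single character "Ġ", but the no-break space becomes two characters and the ideographic space three, so an equivalent of \s has to enumerate multi-character sequences for every Unicode whitespace character.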

Separate Normalization from Splitting in ByteLevel tokenizers

tokenization | Aug 12, 2025

This note is wrong! This was pointed out to me by Sasuke___420. As it turns out, the regex does not behave the same as the original one, specifically for non-ASCII spaces. Upon further reflection, I don’t think you should really use this.