Stéphan Tulkens

From Chesterton's fence to Chesterton's gap

Sun, 14 Jun 2026 00:00:00 +0000

The English writer and Christian apologist G. K. Chesterton is, perhaps, best known to programmers through a paragraph in which he introduces what is now known as “Chesterton’s fence”. It’s a very simple idea: you walk through a field and see a fence which, seemingly, has no purpose. Instead of tearing it down because you can’t see its use, try to understand or ask why somebody put it there. (¹) That’s it!

Paraphrasing: if you think somebody built something bad or in a bad way, try to understand why they did it that way before undoing their work. Being burned while ignoring Chesterton’s fence is a rite of passage for every programmer: you see a piece of code and think “who the hell wrote this”. Then, when rewriting it, you break production, and realize that there was a good reason somebody did the things they did. They weren’t stupid after all. Or, you rewrite it and it’s actually better, and you now know more about the person who wrote it, and maybe can teach them how to build better together.

Chesterton’s fence urges us to slow down, and retrace the thinking steps of the person who built before you, thus putting you in their shoes. Keeping Chesterton’s fence in mind does not only make you a better engineer, but it also makes you empathize more with the people around you, the ones that came before you. It shows you the limits of your own knowledge, but simultaneously shows you what you can teach others around you.

Chesterton’s gap

So having said that, I think there’s an interesting new dynamic at play in software land, which I will call Chesterton’s gap. It’s like Chesterton’s fence, except that people walk through the field and ask themselves why somebody hasn’t built a fence there yet, and then, without asking, build a fence.

To me, this is what it feels like to build open source libraries. The cost of creating lines of code has dropped to ~0, which causes people (²) to submit 10k line PRs without even opening an issue first. The thing is, these PRs make sense. They are not bad! They’re just not necessary. They add features to projects that nobody asked for, add tools that are marginally useful, add configuration scaffolding for IDEs that barely anyone uses. (³) To return to the parable: the fences are well built, they are sturdy, they may even serve a purpose. But I don’t need a fence in that specific location, even if it is free. I just don’t need it, and I don’t want it.

I can also write lines of code for free. I have the same superpowers you do, so if I didn’t add some feature to a project I own, there’s probably a good reason I didn’t add that specific feature. If you’re wondering why I didn’t add it myself, just ask. Don’t build the fence.

Footnotes

The full quote is: ‘IN the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”’ [source] ↩
who am I kidding. ↩
I have ignored maintenance here, but maintenance is a big part of why you should not just accept 10k line PRs. ↩

Make something for someone

Tue, 19 May 2026 00:00:00 +0000

I recently listened to the album Music for Existing by producer Martyn. I wasn’t really familiar with him, and in fact it got algorithmically recommended to me because the album features Duval Timothy, whose album Meeting with a Judas Tree I adore.

The track Musa at Erbil, which features the voice and words of Musa Okwonga, contains a beautiful reflection on modern life, which really resonated with me as a software professional, although I realize that that last part probably wasn’t his intention. It is worth quoting here in full, but please also listen to the track and possibly buy it:

The main thing is to make something together with someone else, it could be a 
meal, a mess, whatever.
The point is, you have to make something. 
It's not what you make that matters, it's that you make it. It's that you 
enjoy making it. 
Look, so much of modern life is about outcome, capitalism, outcome, product, 
outcome, AI, outcome.
But those are not outcomes you achieve in group, they're outcomes you 
achieve alone. 
Sit at your desk, you write something for a deadline for payment, outcome 
alone. 
AI, outcome alone, you type something in, outcome, alone. 
The thing we're missing from that is human process, shared process. 
Shared journeys, it's about that, I don't know why you'd want to do that any 
other way, because it's about shared journey, community, right. 
Community is what makes us human. 
Why would you want to cut that out, why would you go, why would you go from 
A to Z, why would you miss out on the alphabet in between. 
That's the, that's the fun, that's the joy. 
You don't watch a movie and go straight to the end credits, you experience it. 
Why would you go, why would you do that? 
You don't do that in a movie. 
Why would you do that in life? 
How is a movie serving us in the way we live each day. 
Movies were created to describe the human experience, and now they're better than
human experience. 
What are we doing here? 
What are we actually doing? 
So yeah, make something with someone. 
If you can make something and laugh with someone... but best of all, just love 
making something with someone.

This blew me away.

Without beating around the bush: we’re trying to automate ourselves out of a job. I do this, and for a large part this has led me to be very productive. I have built systems I could not otherwise have built because I used coding assistance, and have learned things I would not have otherwise learned. This is to say that this is not something I shy away from.

But at the same time I am afraid of losing the fun part. This is the part which makes the creation of software feel like creation. This is when you make something by hand, together with another human being. You brainstorm about what you want to achieve, share your goals and your dreams, loudly ask daring questions about which machinery is missing, and dream about building this together. And then, you tap a bunch of keys which makes symbols appear on a screen in a little box and you then run some commands and then suddenly the thing you made is alive. You have made something! For a while last year, I worked by myself on a bunch of open source projects, but it just wasn’t fun at all. It wasn’t intellectually stimulating in the same way as writing code together is. This is the dreaming part: it’s making things together, being part of a well-oiled machine together, fixing the machine together when it’s broken, saying sorry when you break it, being a little bit mad at your partner for breaking it but also helping them fix it.

The biggest risk in giving away your agency to a machine is not downskilling or making yourself vulnerable to bugs; it’s losing touch with others, and unlearning what it means to make something together, or even coming to despise working together with others. The worst thing we can become is a bunch of people sitting alone and just efficiently but soullessly contributing code to some god-forsaken pile nobody cares about. To quote the text above: “what are we doing?” We know that good things happen when people actually care about what they make, we just need to be brave enough to accept the consequences of working with humans.

Scikit-learn's fit transform paradigm is probably not for you

Sun, 17 May 2026 00:00:00 +0000

If you’ve ever used code from scikit-learn, you will have seen the following pattern:

import numpy as np

from sklearn.preprocessing import StandardScaler

# randn takes the dimensions as separate arguments, not as a tuple.
X = np.random.randn(100, 32)

scaler = StandardScaler()
scaler.fit(X)
X_transformed = scaler.transform(X)

# Or equivalently
X_transformed = scaler.fit_transform(X)

For all scikit-learn transformers (¹), the fit call sets the internal state of the object, while the transform call uses the set internal state to transform some data into something else. (²) This paradigm is really useful because it allows for zero-cost chaining: any sequence of transformations can be fit_transformed by simply calling fit_transform on all transformations in sequence.
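The chaining described above can be sketched as a plain loop. This is a minimal sketch under stated assumptions: `Center` and `Scale` are hypothetical NumPy-only toy transformers that follow the same fit/transform convention, not scikit-learn classes, and the loop is not how scikit-learn's own pipeline machinery is implemented.

```python
from __future__ import annotations

import numpy as np


class Center:
    """Toy transformer (hypothetical): subtracts the column means seen in fit."""

    def fit(self, X: np.ndarray) -> Center:
        self.mean_ = X.mean(0)
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        return X - self.mean_

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        return self.fit(X).transform(X)


class Scale:
    """Toy transformer (hypothetical): divides by the column standard deviations."""

    def fit(self, X: np.ndarray) -> Scale:
        self.std_ = X.std(0)
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        return X / self.std_

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        return self.fit(X).transform(X)


rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 4))

steps = [Center(), Scale()]

# Fit each step on the output of the previous one.
X_train = X
for step in steps:
    X_train = step.fit_transform(X_train)

# New data only needs transform, using the state that fit left behind.
X_new = rng.normal(loc=3.0, scale=2.0, size=(10, 4))
for step in steps:
    X_new = step.transform(X_new)
```

Because every step exposes the same three methods, the loop does not need to know what any individual step does.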

Conflation between construction and usage

The main point I’ll be making in this article is that scikit-learn’s fit transform paradigm mixes up the factory pattern, that is, an object that instantiates other objects, with the actual objects. This is used really well by scikit-learn, but probably doesn’t fit your codebase.

To illustrate, let’s reimplement the StandardScaler using numpy: (³)

from __future__ import annotations

import numpy as np

class StandardScaler:

    def __init__(self, with_mean: bool = True, with_std: bool = True) -> None:
        self.mean: None | np.ndarray = None
        self.std: None | np.ndarray = None
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X: np.ndarray) -> StandardScaler:
        if self.with_mean:
            self.mean = X.mean(0)
        if self.with_std:
            self.std = X.std(0)
        
        return self

    @property
    def _is_fit(self) -> bool:
        if self.with_mean and self.mean is None:
            return False
        if self.with_std and self.std is None:
            return False
        return True

    def transform(self, X: np.ndarray) -> np.ndarray:
        if not self._is_fit:
            raise ValueError("StandardScaler has not been fit")
        if self.with_mean:
            X = X - self.mean
        if self.with_std:
            X = X / self.std
        return X

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        self.fit(X)
        return self.transform(X)

Let’s first talk about the initializer. In a scikit-learn initializer, you are only supposed to set the so-called hyperparameters of a transformer or estimator. That is, you should only set attributes that do not depend on the data you will use to fit the model. So, in our case, with_mean and with_std determine the behavior of the StandardScaler that is produced by fitting on some data; if we set with_mean to False, we actually get a different object than we would get if we set it to True.

Second, note that the fit function is destructive. It erases the original state, and introduces a completely new state. From a Python perspective, however, the same object is returned; it’s only the internal state that is reset.

Third, note that there is no need to store the hyperparameters once you’ve fit the transformer. ⁴

Fourth, for a given StandardScaler, it is impossible to know statically whether it has been fit or not; the type is the same either way. So, whenever you work with scikit-learn’s internals, you’ll have to continuously check whether the estimators and transformers you work with actually have their internal state set.

Fifth, when you write your own transformers and estimators, it is very easy to incorrectly implement this state. (⁵)
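The second and fourth points can be made concrete with a stripped-down stand-in for the class above. This is a sketch, not scikit-learn code: `TinyScaler` is a hypothetical name and only tracks a mean.

```python
from __future__ import annotations

import numpy as np


class TinyScaler:
    """Stripped-down stand-in (hypothetical) for the StandardScaler above."""

    def __init__(self) -> None:
        self.mean: np.ndarray | None = None

    def fit(self, X: np.ndarray) -> TinyScaler:
        self.mean = X.mean(0)
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        if self.mean is None:
            raise ValueError("TinyScaler has not been fit")
        return X - self.mean


scaler = TinyScaler()
first = scaler.fit(np.zeros((5, 2)))
mean_after_first_fit = first.mean.copy()

# Fitting again returns the very same object, with its state overwritten.
second = scaler.fit(np.ones((5, 2)))
same_object = first is second
first_fit_survived = np.array_equal(first.mean, mean_after_first_fit)

# An unfit scaler has exactly the same type as a fit one, so the mistake
# only shows up when transform is called.
unfit = TinyScaler()
try:
    unfit.transform(np.zeros((2, 2)))
    failed_at_runtime = False
except ValueError:
    failed_at_runtime = True
```

Here `same_object` ends up `True` and `first_fit_survived` ends up `False`: anyone still holding `first` silently gets the state from the second fit.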

Splitting out the factory

So, now on to my main thesis: this whole problem can be avoided by conceding that StandardScaler is both a factory and the object that is constructed by the factory. As such, if we split this up into two separate classes, we’ll see that we’ll end up with much cleaner code.

from __future__ import annotations

import numpy as np

class StandardScaler:

    def __init__(self, mean: np.ndarray | None, std: np.ndarray | None) -> None:
        self.mean: np.ndarray | None = mean
        self.std: np.ndarray | None = std

    def transform(self, X: np.ndarray) -> np.ndarray:
        if self.mean is not None:
            X = X - self.mean
        if self.std is not None:
            X = X / self.std
        return X

class StandardScalerFactory:

    def __init__(self, with_mean: bool = True, with_std: bool = True) -> None:
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X: np.ndarray) -> StandardScaler:
        mean, std = None, None
        if self.with_mean:
            mean = X.mean(0)
        if self.with_std:
            std = X.std(0)
        
        return StandardScaler(mean, std)

    def fit_transform(self, X: np.ndarray) -> tuple[StandardScaler, np.ndarray]:
        scaler = self.fit(X)
        return scaler, scaler.transform(X)

As you can see, we’ve changed the structure considerably. fit now returns an object which implements transform, and only implements transform. fit_transform returns a tuple, the first item of which is the fit object, the second of which is the transformed data. This still allows us to forward state in a single call as follows:

transformers = [...]  # Some list of transformers
X = ...  # Some numpy array

fit_transformers = []
for transformer in transformers:
    fit_transformer, X = transformer.fit_transform(X)
    fit_transformers.append(fit_transformer)

So what did we gain? A couple of things:

1) We can guarantee that the object we’re dealing with has been fit on some data, and is usable.
2) We clearly separate creation (the factory) from usage.
3) We need far fewer checks.

The main advantage of this is that we have very strong typing guarantees. For every fit, we can statically detect what the type of the returned object is, and whether it is usable to transform and predict. For example, with base classes:

from typing import Generic, TypeVar

import numpy as np

class BaseTransformer:

    def transform(self, X: np.ndarray) -> np.ndarray:
        ...

# T can be any subclass of BaseTransformer.
T = TypeVar("T", bound=BaseTransformer)

class BaseFactory(Generic[T]):

    def fit(self, X: np.ndarray) -> T:
        ...
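To show what the base classes buy you, here is a minimal sketch with one concrete pair filled in. `Centerer` and `CentererFactory` are hypothetical names introduced for illustration, the sketch assumes a bound TypeVar, and the frozen dataclass is one possible choice rather than part of the pattern.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Generic, TypeVar

import numpy as np


class BaseTransformer:
    def transform(self, X: np.ndarray) -> np.ndarray:
        raise NotImplementedError


T = TypeVar("T", bound=BaseTransformer)


class BaseFactory(Generic[T]):
    def fit(self, X: np.ndarray) -> T:
        raise NotImplementedError


@dataclass(frozen=True)
class Centerer(BaseTransformer):
    """Hypothetical fit object: it cannot be constructed without a mean."""

    mean: np.ndarray

    def transform(self, X: np.ndarray) -> np.ndarray:
        return X - self.mean


class CentererFactory(BaseFactory[Centerer]):
    def fit(self, X: np.ndarray) -> Centerer:
        return Centerer(mean=X.mean(0))


X = np.arange(12, dtype=float).reshape(4, 3)
centerer = CentererFactory().fit(X)  # a type checker infers Centerer here
X_centered = centerer.transform(X)
```

There is no way to get hold of a `Centerer` that has no mean, so `transform` needs no "has this been fit?" check.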

One downside of this pattern is that the hyperparameters are no longer accessible on the fit object.

In a follow-up post, we’ll investigate how we can improve on this pattern and have our cake and eat it too.

Footnotes

A transformer here is something that transforms some data, not a transformer in the machine learning sense. ↩
scikit-learn also implements predictors, which have fit, predict and fit_predict functions. ↩
In a serious implementation, we’d derive from a base class, use generics, etc. ↩
Although doing so is very useful for reproducing research. ↩
I don’t think this is a problem of scikit-learn itself though. Their estimators are all implemented correctly. This is easy to get wrong, however. ↩

Evaluating static models on RTEB

Wed, 08 Oct 2025 00:00:00 +0000

The group of researchers associated with the Massive Text Embedding Benchmark (MTEB) has released a new benchmark: the Retrieval Text Embedding Benchmark (RTEB). As you may know, MTEB ranks models on their ability to perform well at a variety of tasks in a zero-shot setting, and is meant to reflect how well your model transfers to new tasks. Ranking high on MTEB can make or break your model, so it has become something that people optimize for, and as Goodhart put it: “when a measure becomes a target, it ceases to be a good measure”.

The mechanism behind Goodhart’s law is particularly problematic for MTEB, since all the datasets and evaluations behind it are completely open, making it feasible to hill-climb MTEB without actually training directly on any data from the benchmark. (¹) RTEB solves this issue by keeping a portion of the leaderboard private, a practice that used to be common in so-called shared tasks. (²) Users wishing to appear on the leaderboard need to provide their model and have it tested on the private subset. This solves the issue of adversaries with a lot of compute being able to hill-climb the leaderboard by themselves. The downside of doing this is obviously that keeping the leaderboard up to date is a substantial effort. In addition, RTEB exclusively focuses on retrieval, and only uses datasets that are relevant for retrieval.

Training static models

By and large, there are two good ways to train a static model: (³)

Knowledge distillation: used to create the potion models by Minish. This approach performs basic knowledge distillation using a larger teacher model and the cosine similarity or MSE as a loss function.
Supervised training: detailed in Tom Aarsen’s blog post. This is simply performing supervised training using, e.g., a ranking loss, on very large datasets, without doing any finetuning.
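As a rough illustration of the knowledge distillation route, the sketch below computes the two losses mentioned above for one toy batch. Everything in it is an assumption made for illustration: the teacher vectors are random stand-ins, the student pools tokens by taking their mean, and student and teacher share the same dimensionality. It is not the actual potion training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim = 1000, 64
# Student: a static embedding table, one row per token id (random here).
student_table = rng.normal(size=(vocab_size, dim))

# A toy batch: 8 "sentences" of 12 token ids each, plus the vectors a teacher
# model would have produced for them (random stand-ins, no real teacher).
token_ids = rng.integers(0, vocab_size, size=(8, 12))
teacher_vectors = rng.normal(size=(8, dim))

# Student sentence vector: the mean of its token embeddings.
student_vectors = student_table[token_ids].mean(axis=1)


def mse_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    return float(((student - teacher) ** 2).mean())


def cosine_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    # 1 - cosine similarity, averaged over the batch.
    return float((1.0 - (s * t).sum(axis=1)).mean())


print(mse_loss(student_vectors, teacher_vectors))
print(cosine_loss(student_vectors, teacher_vectors))
```

Training then amounts to updating the rows of `student_table` so that either loss goes down.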

As far as I could tell, both approaches are roughly competitive. Here are the scores for the models on MTEB, where potion models are trained via knowledge distillation, and the static-.+mrl models are trained on large datasets of sentences.

Name	MTEB avg score	MTEB subset
potion-multilingual-128m	47.23	multilingual
static-similarity-mrl-multilingual-v1	47.24	multilingual
potion-base-8m	53.3	english
static-retrieval-mrl-en-v1	51.25	english

As you can see, potion-base-8m is on average better than static-retrieval-mrl-en-v1. At retrieval, however, the supervised model is better than the potion model. For the multilingual models, knowledge distillation and the supervised approach seem to do equally well. The conclusion so far: training on sentence datasets leads to pretty good general models, but very good models at whatever you are training on (retrieval), and multilingual semantics can be learned really well from sentence datasets, even without prior language model training.

Now, let’s turn to RTEB. As noted above, RTEB is specifically meant for retrieval, and also has an English and multilingual subset. This allows us to answer the following question: does knowledge distillation-based training lead to better performance on held-out data than straight supervision? Because we have English and multilingual models in both conditions, we have a very nice way to test. My personal prediction is that knowledge distillation would be better than supervision; even though the supervised models have been trained on large amounts of data, they have been trained to solve specific problems. Knowledge distillation, on the other hand, is about generally mimicking a larger model, and should thus generalize better to unseen datasets.

Results

Overall, supervised models outperform knowledge-distilled ones, particularly on the private leaderboard. I also didn’t do anything; Kenneth Enevoldsen ran the models on RTEB. (⁴) I can’t really report much more than the actual results, so let’s dive right in. Note that, as before, the top two rows are on the multilingual subset, while the bottom two are on the English subset.

Name	RTEB public score	RTEB private score
potion-multilingual-128m	23.23	36.63
static-similarity-mrl-multilingual-v1	24.54	43.73
potion-base-8m	24.11	37.45
static-retrieval-mrl-en-v1	29.09	44.48

As you can see, both of the supervised mrl models surpass their knowledge-distilled counterparts on the private set. This is especially striking for the multilingual model: potion-multilingual-128m tracks the mrl model very closely on the public set, but is much worse on the private set. This is very interesting, and ran counter to my expectation, as all these models were more or less evenly matched on the full MTEB set.
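To put numbers on that, the snippet below subtracts the scores in the table above. It only does arithmetic on the reported values; the pairing of each supervised model with its knowledge-distilled counterpart follows the text.

```python
# Scores copied from the table above: (public, private).
scores = {
    "potion-multilingual-128m": (23.23, 36.63),
    "static-similarity-mrl-multilingual-v1": (24.54, 43.73),
    "potion-base-8m": (24.11, 37.45),
    "static-retrieval-mrl-en-v1": (29.09, 44.48),
}

# (supervised model, knowledge-distilled model) per subset.
pairs = {
    "multilingual": ("static-similarity-mrl-multilingual-v1", "potion-multilingual-128m"),
    "english": ("static-retrieval-mrl-en-v1", "potion-base-8m"),
}

gaps = {}
for subset, (supervised, distilled) in pairs.items():
    public_gap = scores[supervised][0] - scores[distilled][0]
    private_gap = scores[supervised][1] - scores[distilled][1]
    gaps[subset] = (round(public_gap, 2), round(private_gap, 2))
    print(f"{subset}: public gap {public_gap:.2f}, private gap {private_gap:.2f}")

# multilingual: public gap 1.31, private gap 7.10
# english: public gap 4.98, private gap 7.03
```

On the private set the supervised models lead by about 7 points in both subsets, while on the public set the multilingual gap is only 1.31.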

Discussion

This provides some interesting insights for future models. Knowledge distillation is basically free: all you need is a model for whatever domain you want, and a relatively small corpus, but it does not perform as well as supervised learning, even if the data you have does not match your task directly. The main data point here is the performance of static-similarity-mrl-multilingual, which was only trained on similarity datasets, and not on retrieval, but still outperforms the knowledge distilled potion model on retrieval.

Another interesting option missing from these tables is hybrid models; it could be that first performing knowledge distillation and then supervised learning (⁵) outperforms doing either of them alone. One issue with static models, however, is that they are extremely susceptible to catastrophic forgetting; without any intervening model, the embeddings just change shape to suit whatever task you train them on.

Conclusion

I think that hybrid training and knowledge distillation, and especially knowledge distillation on a larger and more diverse set of documents, could be beneficial. In addition, I think the solution space of knowledge distillation for static models remains unexplored. For example, I don’t think anyone has trained a model using hard negatives, or using logit scores. These things will surely be tried by someone. (⁶)

Footnotes

This is obviously against the spirit of the leaderboard, but also how science progresses. This is not necessarily an issue, because when parties are forced to disclose whatever made them take a step up the hill, we learn a little bit. The main issue, in my opinion, is that a single user takes many steps in private, and then only discloses the tricks that worked. ↩
See for example the *SEM shared task series, which gave us the well-known sts datasets. ↩
Note that I am excluding model2vec from training because I view that as an initialization strategy. Models that come straight from model2vec are not competitive; performing knowledge distillation or training is always better. ↩
Thanks! 🙏🙏🙏 ↩
Or the other way around, or both ↩
Probably me… ↩

Comparing PCA and MRL for static models

Mon, 06 Oct 2025 00:00:00 +0000

Without reducing dimensionality, static models can be hundreds of MB large. Choosing the right dimensionality-reduction technique can shrink them without sacrificing retrieval quality. I was always a huge fan of Principal Component Analysis (PCA) for making static models smaller. For example, PCA is used in model2vec and was used in an older version of tokenlearn to post-process models, and is used in the newer version of tokenlearn to reduce the dimension of the teacher models. Recently, however, I started experimenting with Matryoshka Representation Learning (MRL) for reducing dimensions and have found it to be superior, which I found surprising. This blog post thus tries to answer the question: when should you use PCA, and when MRL? If one is better than the other, why? I discuss both techniques, why applying dimensionality reduction to static embeddings makes sense, and some options for future work.

Static models and PCA

First, let’s talk about static models and PCA. Static models are just embedding tables indexed by a tokenizer. Just like good old word embeddings, but better. One determining factor in the performance of a static model is that the embedding space does not model irrelevant or redundant information; because no downstream task exists to process or ignore information, the embedding space needs to handle all of this.
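As a minimal sketch of "an embedding table indexed by a tokenizer": the vocabulary below is a hypothetical toy, whitespace splitting stands in for a real tokenizer, and mean pooling is assumed as the way to turn token vectors into one sentence vector.

```python
import numpy as np

# Hypothetical toy vocabulary; a real static model uses a trained tokenizer.
vocab = {"[UNK]": 0, "pet": 1, "stores": 2, "near": 3, "me": 4}

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))  # one row per token


def encode(text: str) -> np.ndarray:
    # "Tokenize" by whitespace and map unknown words to [UNK].
    token_ids = [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]
    # The sentence vector is just the mean of the looked-up rows.
    return embedding_table[token_ids].mean(axis=0)


vector = encode("Pet stores near me")
```

There is no network between the lookup and the output, which is why the table itself has to carry all the useful structure.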

So now, on to PCA. PCA finds an orthonormal basis of vectors (principal components) such that each successive component captures as much of the remaining variance as possible. (¹) Transformed embeddings are then expressed as linear combinations of these components. As it turns out, in addition to being used for reducing dimensionality, PCA also has the property of making the individual dimensions of your embedding space uncorrelated; i.e., a space for which the expected cosine similarity is close to 0. The expected cosine being close to 0 is caused by all dimensions being centered around 0, and also uncorrelated with other dimensions.

In addition to uncorrelating them, PCA orders the components by the variance they explain. This allows you to truncate embedding spaces to a specified dimension without losing a lot of performance, a property MRL also has.
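Both properties, the ordering by variance and the decorrelation, can be checked on toy data. The sketch below implements PCA through an SVD in plain NumPy; the data is synthetic and the choice of 16 dimensions and k = 4 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "embeddings" with very different variance per direction.
X = rng.normal(size=(2000, 16)) * np.linspace(4.0, 0.1, 16)
# Mix the directions so the variance is no longer axis-aligned.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
X = X @ rotation

# PCA via SVD of the centered data.
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
transformed = X_centered @ components.T

# Components come out sorted by explained variance, largest first.
explained = singular_values**2 / (singular_values**2).sum()
print(explained.round(3))

# Truncating to the first k columns keeps the k highest-variance directions.
k = 4
truncated = transformed[:, :k]
print(f"variance kept with {k} of 16 dims: {explained[:k].sum():.2f}")

# The new dimensions are uncorrelated: the covariance matrix is diagonal.
covariance = np.cov(transformed, rowvar=False)
off_diagonal = covariance - np.diag(np.diag(covariance))
```

Truncation is then nothing more than slicing off the trailing columns.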

The code below demonstrates that PCA creates an expected cosine similarity close to 0:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import pairwise_distances

def cosine_similarity(x: np.ndarray) -> np.ndarray:
    return 1 - pairwise_distances(x, metric="cosine")

# Uniform embeddings are not isotropic.
state = np.random.RandomState(42)
random_uniform = state.uniform(size=(8192, 64))

# Compute similarity
sim = cosine_similarity(random_uniform)
# Compute mean score of the upper triangular matrix
# (otherwise we count double)
mean_score = sim[np.triu_indices_from(sim)].mean()
# mean_score ~= 0.75
p = PCA(n_components=64)
transformed = p.fit_transform(random_uniform)

sim = cosine_similarity(transformed)
mean_score = sim[np.triu_indices_from(sim)].mean()
# mean_score ~= 0.0001

So, as you can see, even without reducing dimensionality, we get the expected mean cosine of 0, simply because of the new basis.
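A rough back-of-the-envelope check of the 0.75 in the snippet above, treating the cosine as a ratio of expectations (an approximation that gets better as the number of dimensions \(d\) grows). For two independent vectors \(\mathbf x\) and \(\mathbf y\) with entries drawn uniformly from \([0, 1]\):

\[\mathbb{E}\big[\langle \mathbf x, \mathbf y \rangle\big] = \sum_{k=1}^{d} \mathbb{E}[x_k]\,\mathbb{E}[y_k] = \frac{d}{4}, \qquad \mathbb{E}\big[\lVert \mathbf x \rVert^2\big] = \sum_{k=1}^{d} \mathbb{E}[x_k^2] = \frac{d}{3}\]

\[\cos(\mathbf x, \mathbf y) \approx \frac{d/4}{\sqrt{d/3}\,\sqrt{d/3}} = \frac{3}{4}\]

After PCA every dimension is centered, so \(\mathbb{E}[x_k] = 0\) and the numerator, and with it the expected cosine, goes to 0.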

This property of PCA was also surprising to us when we made model2vec: we used PCA to just reduce the dimensionality to directly compare to traditional embeddings, such as GloVe, but we saw that even when not reducing dimensionality, performance improved.

So far so good, I loved PCA. I was a PCA apologist. (¹)

Static models and MRL

Matryoshka Representation Learning (MRL) is a relatively new technique. It was proposed in a 2022 paper, but as far as I know really rose to prominence once OpenAI included it in their embedding models. See this blog post by Tom Aarsen for more information.

The idea behind MRL is that, if you train a network with some kind of loss function that operates on vectors, you can evaluate that loss on many contiguous subspaces of those vectors in a single forward pass. This works as follows: you first perform a forward pass to obtain the vectors, and then, for a set of dimensions D, you evaluate the loss at that specific dimension. For example, if our vector is 256-dimensional, and our dimensions D are 32, 64, 128 and 256, we will evaluate the loss four times for each forward pass. This has a few important consequences:

The model learns to create useful representations in the subspaces specified by D, but also in intermediate subspaces.
The model upweights “lower” dimensions, because these are effectively evaluated more often. For example, if there are four dimensions in D, the first dimension of the space is updated four times for each forward pass, while the last dimension is only updated once.
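The mechanism above can be sketched in a few lines. This is a simplified illustration, not the reference implementation: the truncated losses are summed with equal weight, a plain cosine loss stands in for whatever loss you would actually train with, and the vectors are random.

```python
from typing import Callable

import numpy as np


def cosine_loss(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between matching rows, averaged over the batch."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((1.0 - (a * b).sum(axis=1)).mean())


def matryoshka_loss(
    a: np.ndarray,
    b: np.ndarray,
    dims: list[int],
    loss_fn: Callable[[np.ndarray, np.ndarray], float],
) -> float:
    # One set of vectors, one loss evaluation per truncation length.
    return sum(loss_fn(a[:, :dim], b[:, :dim]) for dim in dims)


rng = np.random.default_rng(0)
anchors = rng.normal(size=(16, 256))
positives = anchors + 0.1 * rng.normal(size=(16, 256))

dims = [32, 64, 128, 256]
total = matryoshka_loss(anchors, positives, dims, cosine_loss)

# How many of the truncations include each coordinate: 4 for the first 32
# dimensions, down to 1 for dimensions 128..255.
times_in_loss = sum((np.arange(256) < dim).astype(int) for dim in dims)
```

`times_in_loss` makes the upweighting explicit: the earliest coordinates take part in every one of the four loss terms, the last ones in only one.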

Note that MRL does not guarantee that dimensions are uncorrelated or have an expected cosine of 0. This needs to be a property of the loss function to which MRL is applied. MRL merely guarantees that performance is maintained when the vector is truncated. In practice, static models trained with something like a cosine loss have an expected cosine of 0; this is a useful property to have, so the model should naturally learn and exploit it. Below, we’ll test whether this is actually the case.

Experiments

The MRL paper shows that PCA is worse than MRL; PCA performance degrades more rapidly than MRL performance when the dimensionality is decreased. There’s a possibility that this conclusion does not transfer to static models, as we’re not applying PCA to the output of a model, but to the model itself, something which is impossible for a regular model. So it could be that, for static models, the fact that the whole model can be optimized by PCA is still better than using MRL.

There’s also the caveat that MRL requires a loss function to be optimized, although I think this is easily circumvented in practice. (²)

To see what it all means, I trained two static models: one with MRL, and the other without MRL. They were trained using the recipe from Tom Aarsen’s blog about static models, although I left out the paq and s2orc datasets. TL;DR: it’s just supervised finetuning on a whole bunch of retrieval datasets using the MultipleNegativesRanking loss (also known as InfoNCE). The models were trained for 1 epoch using a very high learning rate of 0.2, 10% warmup and a linear cooldown. I experimented with other configurations but most of this had no effect.
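For readers who have not seen this loss before, here is a NumPy sketch of the in-batch version: each query's paired document is the positive, and every other document in the batch is a negative. The function name and the `scale` value are choices made for this sketch, and the data is random; it is not the training code used for the experiments.

```python
import numpy as np


def multiple_negatives_ranking_loss(
    queries: np.ndarray, documents: np.ndarray, scale: float = 20.0
) -> float:
    """In-batch InfoNCE: documents[i] is the positive for queries[i],
    every other document in the batch acts as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    logits = scale * (q @ d.T)  # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal as the correct class.
    return float(-np.diag(log_probs).mean())


rng = np.random.default_rng(0)
queries = rng.normal(size=(32, 64))
aligned_docs = queries + 0.1 * rng.normal(size=(32, 64))
random_docs = rng.normal(size=(32, 64))

loss_aligned = multiple_negatives_ranking_loss(queries, aligned_docs)
loss_random = multiple_negatives_ranking_loss(queries, random_docs)
```

The loss is low when each query is closest to its own document and high when the pairing carries no signal.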

I then evaluated both models on NanoBEIR. The results in the table below are the mean NDCG@10 over all datasets.

Dim	MRL	no MRL
32	32.52	25.93
64	39.71	34.51
128	45.20	42.36
256	48.10	47.20
512	49.49	49.63
1024	50.30	50.56

As you can see, there is not really a big downside to using MRL. The scores at 512 and 1024 dimensions are a bit lower, but this is a discrepancy I think will disappear. For lower dimensionalities, MRL is much better than not doing MRL, leading to a gain of almost 7 points at very low dimensions.

Now, let’s apply PCA to both of them:

Dim	MRL + PCA	no MRL + PCA
32	32.41 (-0.10)	26.34 (+0.40)
64	39.65 (-0.10)	34.60 (+0.10)
128	44.95 (-0.30)	42.24 (-0.10)
256	48.04 (-0.10)	46.95 (-0.30)
512	49.52 (+0.00)	49.34 (-0.30)
1024	50.24 (-0.10)	50.59 (+0.00)

So, surprisingly, applying PCA after training does not really add any performance, even for really low dimensions. What is also surprising is that this holds regardless of whether we trained with or without MRL; I expected the non-MRL model to benefit from PCA.

Numerically, training with MRL is better at every dimensionality up to 256, and within 0.3 points at 512 and 1024. Applying PCA after training is not useful. Unfortunately for me, PCA is clearly outperformed by MRL. In short, “Friendship ended with PCA, MRL is now my friend”.

Discussion

Now, we need to square why we saw such large improvements when we applied PCA in model2vec, while not seeing any improvements here. Recall that the main reason for performance improvements seems to be that PCA transforms the vectors to an orthogonal basis, which has an expected mean cosine of 0; embedding spaces that do not have this property are worse for static models.

As it turns out, models directly optimized through gradient descent already have this property. For both the MRL and non-MRL models above, the expected cosine similarity between the embeddings approaches 0. So the renormalizing effect of PCA had no additional impact, and hence applying PCA does not have any additional uses beyond actually reducing dimensionality for models not trained with MRL.

Conclusion & Future work

I think this opens up some interesting areas for improvement in static model initializers, such as model2vec. For example, you could just initialize the model randomly, and then train a small auto-encoder with MRL to get the scale-free behavior displayed by MRL. Whether this works better in practice than PCA remains to be seen.

Practical advice:

PCA helps when embeddings aren’t zero-mean and you’re not doing any training
MRL learns truncation robustness directly
combining them doesn’t help

Footnotes

An apologist is not someone who apologizes, but I guess they do sometimes apologize. Sorry if this was confusing. ↩ ↩²
PCA is mathematically equivalent to a 1-layer auto-encoder with a linear activation function, except that the auto-encoder lacks the ordered dimensions property of PCA. As such, we can easily rewrite a “PCA” using MRL by training a reconstruction loss with MRL, which would give it the ordered dimensions property. I think there’s a lot of interesting low-hanging fruit here, because you can easily modify the loss of the auto-encoder, make the network deeper, etc. ↩

Static late interaction models

Tue, 30 Sep 2025 00:00:00 +0000

Late interaction is an interesting paradigm for computing the similarity between two documents, and can be seen as a hybrid of sparse and dense retrieval. In this post, I will show how static models in a late interaction setting actually reduce to sparse models. I will also argue that, in the absence of empirical evidence to the contrary, there’s no good reason to assume that static late interaction models will be much better than their dense counterparts. But first, let’s dive into some fundamentals: I’ll explain what sparse retrieval and dense retrieval are, and how late interaction fits in with both of those paradigms.

Sparse

Sparse retrieval assigns each individual token one or more coefficients, and only scores tokens when they are present in the document. For example, a query like “pet stores” will only return documents that contain those terms. In other words, sparse retrieval does not retrieve semantically related words; it only indexes documents based on the terms that are actually present. Examples of sparse retrieval techniques are SPLADE, DeepImpact, BM25 and uniCOIL. Sparse retrieval tends to be precise because it matches terms exactly, but for the same reason also has trouble bridging the gap between semantically related terms. (¹) In general, sparse retrievers don’t do well if there’s little to no lexical overlap between queries and documents, or if there’s lots of semantic ambiguity.

Dense

In contrast, in dense retrieval, we assign a single vector to a whole document, and just compute dot-product based similarities between vectors to find the most similar ones. Putting everything in a single vector means that we mix up all words in the documents, thus allowing for queries with no lexical overlap to still retrieve relevant documents. The downside of this is that things can get too mixed up; a vector can implicitly model many related things, and it is difficult to predict a priori which vectors will match and why. In a way, it is naïve to expect a single vector to fully express the semantics of a document of arbitrary length.

Late interaction

Which brings us to late interaction. In late interaction, we use a dense model (²) to create one vector for each token in the document. If this sounds excessive, don’t worry, there are many tricks to alleviate the burden of storing and retrieving this many vectors. (³) At query time, instead of calculating the dot product between vectors, we calculate the maxsim similarity. For a given query Q with m tokens and document D with n tokens, the similarity is as follows:

\[s(Q,D) = \sum_{i=1}^{m} \max_{1 \le j \le n} \;\Big\langle \widehat{\mathbf q}_i,\;\widehat{\mathbf d}_j \Big\rangle\]

So, for each query token, we first calculate the similarity (⁴) to each document token, and then take the max of those similarities. The sum over all of the query tokens is the maxsim score.

maxsim allows late interaction models to attach scores to specific tokens, like sparse retrieval models, but also allows for a graded similarity between related tokens, like dense models (and unlike sparse models). As such, we can think of late interaction models as a hybrid between dense and sparse models. There’s many other aspects to dig into, which I won’t cover here, so please read one of the many good posts on the subject.

Static models

The reason late interaction models work is not just because of maxsim, but also because the underlying models are trained to maximize the similarity between a query token and a related document token. These models are contextualized, which means that the model produces different vectors for tokens in different contexts. Static models, on the other hand, always produce the same vector for each token, regardless of the context. This makes static vectors worse, but also much faster. Why and when this is useful is the topic of an upcoming post, but for now let’s assume this is a useful property.

Static late interaction

Now, I will argue that maxsim, when applied to a static model, implicitly leads to a sparse model. First, recall that, in a static model, every occurrence of a token always gets the same vector. This also implies that the similarity between two tokens is always exactly the same: if dog and cat each always get their own fixed vector, then sim(dog, cat) is always the same value. So, this gives us a nice optimization: we can precompute all possible similarities. For a vocabulary V with t tokens, this leads to a t x t-sized matrix, which we call W. Note that W is very big! For a vocabulary size of 30k, this already is a 900 million parameter matrix. In practice we can easily make this matrix extremely sparse by pruning any items below a certain threshold. (⁵)

Now, given W, maxsim reduces to:

\[s(Q,D) = \sum_{i=1}^{m} \max_{1 \le j \le n} \;W_{Q_iD_j}\]

This formulation means that we only need to store token indices and compute query indices to get the same result as we would have gotten when storing all vectors and computing vectors at query time. (⁶) We also still need to store W, however. In addition, it is also unclear whether this is actually efficient.

Fortunately for us there’s yet another shortcut: for each token in document D, we can index the columns from W, and take the max. This leads to a single t-sized vector (one entry per token in the vocabulary V), which we call Y. Y contains the pre-computed max from the document to each possible token. So, if we do this, the only thing we need to do at query time is index this vector using the query tokens, and take the sum. Because the vectors are pretty sparse, and the sparsity is controllable, this leads to a small memory footprint, and small query-time compute. Here’s the equation:

\[s(Q,Y) = \sum_{i=1}^{m} Y_{Q_i}\]

To repeat: during query time, the only thing we do is index. The index consists of a single document-term matrix, with the number of rows equal to the number of documents, and the number of columns equal to t, the vocabulary size.
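To make the indexing step concrete, here is a minimal pure-Python sketch. It assumes a tiny, hand-written, already-pruned W stored as a dict of dicts; the token ids, the similarity values, and the index_document and score helpers are all made up for illustration and are not taken from any library.

```python
# Hypothetical pruned similarity table W: token id -> {related token id: similarity}.
# Entries below the pruning threshold are simply absent.
W = {
    0: {0: 1.0, 1: 0.8},
    1: {0: 0.8, 1: 1.0, 2: 0.4},
    2: {1: 0.4, 2: 1.0},
    3: {3: 1.0},
}


def index_document(doc, W):
    """Precompute Y: for each vocabulary token, the max similarity to any document token."""
    y = {}
    for token in doc:
        for other, sim in W[token].items():
            y[other] = max(y.get(other, 0.0), sim)
    return y


def score(query, y):
    """At query time we only look up the query tokens in Y and sum."""
    return sum(y.get(token, 0.0) for token in query)


docs = {"a": [0, 3], "b": [2]}
# The "document-term matrix": one sparse row Y per document.
index = {name: index_document(doc, W) for name, doc in docs.items()}

query = [1, 3]
scores = {name: score(query, y) for name, y in index.items()}
print(scores)
```

Document "a" does not contain token 1, but it still gets credit for it through the related token 0. Storing each row as a dict is just one way to exploit the sparsity; a sparse matrix format would work equally well.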

One question this raises is whether, for a decently-sized corpus, this document-term matrix is actually smaller than W. The answer is: no, except for really small numbers of documents. This is caused by the fact that a document vector is the max over a lot of tokens, and therefore tends to have a lot of non-zero coefficients. So, in practice, if space is an issue, it might actually be better to still use W. If speed is a concern, it might be better to bite the bullet, and store the extra coefficients.

Sparsity

The older people among you will point to this and say: this is just a sparse index, but with soft weights on related terms! (⁷) And you would be right! In fact, if we set the similarity threshold on W to 1.0 we get a very bad version of BM25. (⁸) Note that this behavior does not appear because we perform some magic trick or manipulation: it is inherent to the way maxsim works. So even if you compute maxsim as in the original equation, you will get this BM25-like behavior.
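The effect of the threshold can be sketched in a few lines of pure Python. The similarity table, the prune helper, and this dict-based maxsim_w are all hypothetical and only serve to illustrate the argument: with a low threshold, related tokens still contribute, while with a threshold of 1.0 only exact matches survive, and the score collapses to counting which query tokens occur in the document.

```python
# Hypothetical similarity table: token id -> {token id: similarity}.
W = {
    0: {0: 1.0, 1: 0.8},
    1: {0: 0.8, 1: 1.0, 2: 0.4},
    2: {1: 0.4, 2: 1.0},
    3: {3: 1.0},
}


def prune(W, threshold):
    """Keep only the entries of W with similarity >= threshold."""
    return {
        token: {other: sim for other, sim in row.items() if sim >= threshold}
        for token, row in W.items()
    }


def maxsim_w(query, doc, W):
    """maxsim over a sparse W; missing entries count as 0."""
    return sum(max(W[d].get(q, 0.0) for d in doc) for q in query)


doc = [0, 3]
query = [1, 3]

soft = maxsim_w(query, doc, prune(W, 0.3))   # related tokens still contribute
exact = maxsim_w(query, doc, prune(W, 1.0))  # only exact matches contribute
overlap = len(set(query) & set(doc))         # plain lexical overlap

print(soft, exact, overlap)
```

With the threshold at 1.0, the score equals the lexical overlap, with no IDF, no length normalization, and no term weighting on top.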

This explains why I think that just computing the maxsim with a static model as a regular late interaction model will never work well: BM25 contains a lot of cool tricks to make sure retrieval works well, including different weighting schemes for queries and documents, a length bias, and two tunable parameters. These are all missing from this algorithm. An interesting task, then, could be to re-add these terms: model2vec for example, adds weighting by inflating and shrinking the norms by token frequency. These weighting terms can be re-added on the query tokens, or put in the index. Similarly (⁹), the length bias in BM25 can also be integrated into this formulation.

Conclusion

This is all preliminary theoretical work, but I think it could be very promising. One thing that specifically is interesting is asymmetric static models, i.e., using different static models to encode queries and documents, which is something I am actively working on. It is currently unclear whether training static models as late interaction models is actually useful. I have trained some static models using PyLate, but this did not lead to good results; training them as regular dense retrievers works much better. More research is needed, as always. Feel free to reach out if you have ideas, I’m always open to talk.

Acknowledgments

Thanks jonah for proofreading and helpful suggestions about SPLADE.
Thanks Ben for suggesting blogs to link to.

Appendix: code sample

Here’s some code showing the methods are equivalent. We don’t precompute the document representations, but in the last function you could just do that.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def maxsim(q: list[int], doc: list[int], vecs: np.ndarray) -> float:
    """Compute the maxsim"""
    q_x = vecs[q]
    d_x = vecs[doc]
    sim = cosine_similarity(q_x, d_x)  # q, d matrix
    maxes = sim.max(1)  # q vector
    return maxes.sum()


def maxsim_w(q: list[int], doc: list[int], W: np.ndarray) -> float:
    """Compute the maxsim with W."""
    vectors = W[doc]  # d, V matrix
    indexed = vectors[:, q]  # d, q matrix
    return indexed.max(0).sum()


def maxsim_doc(q: list[int], doc_w: np.ndarray) -> float:
    """Last step, precompute the documents."""
    return doc_w[q].sum()

random = np.random.RandomState(42)
vectors = random.randn(1000, 32)

doc = [1, 2, 3, 4, 5]
query = [10, 11, 12]
W = cosine_similarity(vectors)
# Document
doc_w = W[doc].max(0)

a = maxsim(query, doc, vectors)
b = maxsim_w(query, doc, W)
c = maxsim_doc(query, doc_w)

assert np.isclose(a, b)
assert np.isclose(b, c)

Footnotes

This is typically alleviated through query expansion techniques. SPLADE is also notable in that it automatically performs query/term expansion within the model, in addition to scoring terms that are present. ↩
These dense models are specifically trained to be late interaction models, but their cores are just pre-trained transformers, like the ones we use for dense retrieval. For training details, see the colbert paper and the colbertv2 paper. You can use PyLate to train, it’s easy! ↩
Examples of this include MuVERA, FastPLAID, maxsim-cpu and probably many others. ↩
In the equation we use the dot product similarity, but the cosine similarity can also be used. ↩
This pruning is empirically justified: because of the maxsim, it is unlikely that tokens with very low similarities ever get selected. And even if they do get selected, it is unlikely that this will lead to a meaningful difference in selected documents. The proof of the pudding is in the eating, however. ↩
In practice, ~90% of time is spent on tokenization, so this isn’t the big win it seems. ↩
If you are young and noticed this: good job buddy! ↩
Bad because it does not have any of the things, i.e., length, query weights, IDF, that make BM25 actually good. ↩
haha ↩

Better Greedy Tokenizers: Handling WordPiece's [UNK] Problem

Thu, 18 Sep 2025 00:00:00 +0000

In a previous post, I showed that making a tokenizer greedy, that is, always picking the longest matching subword like WordPiece does, can improve results without retraining. But WordPiece can unfortunately silently break your tokenization.

Consider this example:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
result = tokenizer.encode("talk" * 10_000).tokens
print(result[:10])
# ['[CLS]', '[UNK]', '[SEP]']

Instead of producing many repetitions of talk (or something else), the tokenizer outputs a single [UNK].

This happens because WordPiece enforces a hard limit on the length of each run (the contiguous string passed to it after pretokenization). The length is determined by a parameter max_input_chars_per_word, which is set to 100 by default. As the name suggests, this parameter puts a maximum on the number of characters any pretoken going into the model has. Once you go over this limit, the model doesn’t crash, but silently produces an [UNK] token.

What this means is that, if you are used to BPE, WordPiece can give you an [UNK] where you would not expect one. In practice, this makes it difficult to use WordPiece with highly multilingual collections, because it becomes much more probable to get long runs. In addition, it also makes it impossible to create a multi-word WordPiece tokenizer.

Why does this parameter exist?

This happens because of how WordPiece itself works. The WordPiece algorithm, as implemented in the Hugging Face tokenizers package, is as follows:

For a given input string and a vocabulary of subwords, do the following:

Initialize two pointers, one at the start of the string, S, and one at the end of the string, E
Check whether the run from S to E forms a valid token. If it doesn’t, decrement E by 1 and check again.
Once you find a valid token, increment S by the length of the token you found, reset E to the end of the string, and repeat.

If you ever don’t find a token, you just emit [UNK] for the whole run. As you can probably see, this algorithm is quadratic as a function of input length. For every subword you find, you will skip to the end of the run and walk back to the near-start. There’s lots of low-hanging fruit to make this more efficient, but that is not what this post is about.
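The steps above can be sketched in plain Python. This is a simplified illustration and not the actual Hugging Face implementation: the function name, the toy vocabulary, and the default arguments are assumptions, although the "##" continuation prefix and the limit of 100 characters mirror the defaults discussed above.

```python
def wordpiece_greedy(
    run: str,
    vocab: set[str],
    max_input_chars_per_word: int = 100,
    unk: str = "[UNK]",
    prefix: str = "##",
) -> list[str]:
    """Greedy longest-match segmentation of a single pretokenized run."""
    # Runs over the limit are not segmented at all.
    if len(run) > max_input_chars_per_word:
        return [unk]

    tokens = []
    start = 0
    while start < len(run):
        end = len(run)  # E always restarts at the end of the run
        match = None
        while end > start:
            piece = run[start:end]
            if start > 0:
                piece = prefix + piece  # non-initial pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # walk E back towards S
        if match is None:
            return [unk]  # no piece found: the whole run becomes [UNK]
        tokens.append(match)
        start = end
    return tokens


toy_vocab = {"talk", "##talk", "##ta", "##lk"}  # hypothetical vocabulary
print(wordpiece_greedy("talktalk", toy_vocab))
print(wordpiece_greedy("talkx", toy_vocab))
print(wordpiece_greedy("talk" * 26, toy_vocab))  # 104 characters, over the limit
```

The inner loop restarts from the end of the run for every token it finds, which is where the quadratic behavior comes from.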

So, this also hopefully makes clear why max_input_chars_per_word exists: when encountering a single run of, say, 100k characters, the WordPiece inference algorithm could conceivably take hours. For example, on my machine, encoding a 500 character string takes 3.66ms (0.007 ms per character), a 5000 character string takes 638ms (0.12 ms per character, a 17x increase), while encoding a 50000 character string takes … too long(¹). It would be really silly to wait for such a long time.(²)

Fixing the parameter

Since we are stuck with the parameter, we might as well make the best of it. As it turns out, there is a very nice solution we can leverage within the Hugging Face ecosystem: the FixedLength pretokenizer. The FixedLength pretokenizer simply splits strings up into tokens of a pre-specified length.

So picture this: you have a tokenizer you like, with a pretokenizer you like. But sometimes, due to the domain you find yourself operating on, you end up with a run that is longer than max_input_chars_per_word. Adding a FixedLength pretokenizer to the pretokenizer you already had solves exactly this issue: pretokenization proceeds as it normally would, but any runs coming out of your previous pretokenizer that are too long are then split up into usable chunks. Problem solved. The only issue you could run into is that you miss tokens you otherwise could have found.
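As a rough illustration of what fixed-length splitting does to a long run, here is a pure-Python sketch. The split_fixed_length helper is hypothetical and only mimics the idea of cutting a run into chunks of a given number of characters; it is not the library's implementation.

```python
def split_fixed_length(run: str, length: int) -> list[str]:
    """Cut a run into consecutive chunks of at most `length` characters."""
    return [run[i : i + length] for i in range(0, len(run), length)]


chunks = split_fixed_length("talk" * 5, 10)
print(chunks)
# ['talktalkta', 'lktalktalk']
```

Each chunk is short enough to pass the length check, but a chunk boundary can land in the middle of what would otherwise have been a single token, as with the third talk here.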

Implementation in skeletoken

This is fully implemented in skeletoken. Let’s return to the example from the top of the article:

from skeletoken import TokenizerModel

model = TokenizerModel.from_pretrained("bert-base-uncased")
model = model.make_model_greedy()

# Make fixedlength really low for demonstration purposes
model.pre_tokenizer.pretokenizers[-1].length = 10
tokenizer = model.to_tokenizer()

result = tokenizer.encode("talk" * 10_000).tokens
print(result[:10])
# ['[CLS]', 'talk', '##talk', '##ta', 'l', '##kt', '##al', '##kt', '##al', '##k']

As you can see, the third talk is chopped up into ##ta and lk, because that’s where the pretokenizer boundaries fell. In practice though, this should almost never occur or matter.

Note that this also makes it possible to use greedy tokenizers without any form of pretokenization, and enables the use of multi-word units in WordPiece tokenizers: because multiword tokenizers generally don’t pretokenize at all, any sequence over 100 characters would produce an [UNK].

Future work

In a future post I’ll dive into how you can make a much faster greedy tokenizer by imposing specific restrictions on the tokenizer model, and then using the Aho-Corasick algorithm with backtracking to find subwords with much lower complexity.

This actually took too long to run. Sorry! ↩
It is equally silly to not implement a more efficient variant when such things exist. For example, moving the end pointer not to the end of the string, but forward by the maximum subword length would completely solve this issue and literally lead to the same solution. ↩

Note: alternative to regex splitting in byte tokenizers

Tue, 12 Aug 2025 00:00:00 +0000

In a previous note, I discussed an alternative for setting split to true in a ByteLevel pretokenizer. I suggested using a ByteLevel normalizer first, and then splitting using a complicated regex in “byte space”. However, this turned out to not work very well: there are certain character classes in an original Regex, such as \s, that are very difficult to convert to a pattern in byte space.

I was wondering about how others did this, and discovered that you can stack multiple pretokenizers by first using a Split pretokenizer with a regex, and then using a ByteLevel pretokenizer with split set to False. This is, e.g., what Qwen/Qwen3-Embedding-0.6B uses. Doing it this way is correct and achieves my original proposal: a way to split using a regex of your own design, with Byte normalization.

Here’s what that looks like:

from tokenizers import Regex
from tokenizers.pre_tokenizers import Split, ByteLevel, Sequence

pattern = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
split = Split(Regex(pattern), behavior="isolated")
byte = ByteLevel(use_regex=False, add_prefix_space=False)
pretokenizer = Sequence([split, byte])

original = ByteLevel(use_regex=True, add_prefix_space=False)

s = "hello, ご　「きげんよう?」？”" 

print(pretokenizer.pre_tokenize_str(s))
print(original.pre_tokenize_str(s))

This allows you to freely change your regex without any difficulties. One thing to note is that add_prefix_space needs to be unset for this to be totally equivalent. If not, you will need to add a Prepend normalizer.

Separate Normalization from Splitting in ByteLevel tokenizers

Tue, 12 Aug 2025 00:00:00 +0000

This note is wrong! This was revealed to me by Sasuke___420. As it turns out, the regex does not work the same as the original one, specifically for non-ascii spaces. Upon further reflection, I don’t think you should really use this.

This is a short note to dissuade you from using a ByteLevel pretokenizer in your tokenizers. The ByteLevel pretokenizer, as implemented in Hugging Face tokenizers, does three things:

Possibly inserts a space in front of your string (if add_prefix_space is True (default))
Encodes your string into a byte encoding
Tokenizes using a regex that is specific to English (if use_regex is True (default))

Here’s an example:

from tokenizers.pre_tokenizers import ByteLevel

b = ByteLevel()
b.pre_tokenize_str("hello, こんにちは")
# A list of three tokens.
# [('Ġhello', (0, 5)), (',', (5, 6)), ('ĠãģĵãĤĵãģ«ãģ¡ãģ¯', (6, 12))]
# The tokenizer inserted a space before "hello"
# It converted to bytes
# And then split.

In the tokenizers package, there’s a distinction between a normalizer and a pretokenizer. A normalizer simply changes your string, but doesn’t split it. For example, if your tokenizer lowercases your input, you’ll use a Lowercase normalizer. A pretokenizer splits your string into “words”, which can then get decomposed into actual tokens. A “word”, in this definition, is a unit whose boundaries a subword token can never cross. For example, if your pretokenizer splits on "-", the string "bench-maxx" will be split into ["bench", "-", "maxx"]. Even if your vocabulary contains a token like "h-m", it will never be found.
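The distinction can be mimicked in a few lines of plain Python. The normalize and pre_tokenize functions below are hypothetical stand-ins for a Lowercase normalizer and a pretokenizer that splits on "-"; they are not the tokenizers API.

```python
import re


def normalize(text: str) -> str:
    """Normalizer: changes the string, but never splits it."""
    return text.lower()


def pre_tokenize(text: str) -> list[str]:
    """Pretokenizer: splits on "-" and keeps the separator as its own word."""
    return [piece for piece in re.split(r"(-)", text) if piece]


normalized = normalize("Bench-Maxx")
words = pre_tokenize(normalized)
print(words)
# ['bench', '-', 'maxx']

# "h-m" occurs in the normalized string, but in none of the words,
# so a model that only sees the words can never produce it.
print("h-m" in normalized, any("h-m" in word for word in words))
```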

In this framework, it makes sense to express steps 1. and 2. above as normalizations, and decouple them from the splitting. This also makes sense from a multilingual point of view: the pretokenization regex used by the Hugging Face pretokenizer is outdated and only works for English. This regex is:

"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

As you can see, it contains common contractions, which only work for English. In fact, applying this to other languages might destroy their tokenization.

Luckily for us, Hugging Face tokenizers contains an equivalent transformation using normalizers and a regex splitter. Unfortunately for us, however, we need to change the regex above because otherwise it splits on various byte tokens.

from tokenizers import Regex
from tokenizers.normalizers import ByteLevel as ByteLevelNormalization, Prepend, Sequence
from tokenizers.pre_tokenizers import Split, ByteLevel

normalizer = Sequence([Prepend(" "), ByteLevelNormalization()])

# Change it to split only on ASCII punctuation
pattern = r"'s|'t|'re|'ve|'m|'ll|'d|Ġ?(?:[\p{L}&&[^Ġ]]|[\p{P}\p{S}&&[^\x00-\x7F]])+|Ġ?\p{N}+|Ġ?[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E]+"
pretokenizer = Split(Regex(pattern), behavior="isolated")

b = ByteLevel()

s = "hello, ごきげんよう?" 

print(pretokenizer.pre_tokenize_str(normalizer.normalize_str(s)))
print(b.pre_tokenize_str(s))

And that’s a wrap! You can now safely add or remove whatever you want to the regex defined above, split however you like, and it will work. One downside of this approach is that writing and interpreting a regex for bytes is quite difficult.

Turning any tokenizer into a greedy one

Sun, 10 Aug 2025 00:00:00 +0000

I recently re-read Greed is All You Need: An Evaluation of Tokenizer Inference Methods. In this paper, the authors show that switching out inference methods for tokenizers can improve performance on various tasks.

In this post, I talk about how this could be interesting, introduce an implementation to switch out inference methods for a HF tokenizer, and present the results on some experiments.

Preliminaries

A tokenizer is, simply put, a program that, given a vocabulary of tokens V, can segment text into a sequence of tokens. These tokens are suitable for input into neural networks, because each token is actually just an index to an embedding table.

Crucially, the vocabulary V can be automatically learned from a large corpus of text. There are many algorithms for doing so, but the most well-known are WordPiece, UnigramLM, and Byte Pair Encoding (BPE). I won’t dive into the details of those algorithms here. What is important is that these methods differ not only in what kind of vocabulary V they learn, but also in how they actually segment text. For example, the WordPiece algorithm just takes the longest possible prefix at any position (a greedy algorithm), while BPE’s segmentation is governed by a separate merge table.

The experiment

The main contribution by the aforementioned paper is showing that switching out the inference algorithm after training actually works well. That is, if you have a vocabulary V learned by a BPE tokenizer, you can segment text using that same vocabulary and, e.g., the WordPiece inference algorithm. This improves performance, especially when switching to a greedy algorithm. This is not what I would have expected by the way, since you are changing the distributions.

To see what this looks like, here’s the standard and greedy segmentations for two phrases, using the ModernBERT tokenizer.

string: "hellooo phonenumber"
normal: ['hell', 'ooo', 'Ġphon', 'en', 'umber']
greedy: ['hello', 'oo', 'Ġphone', 'number']

string: " unilaterally"
normal: ['Ġun', 'il', 'aterally']
greedy: ['Ġunilateral', 'ly']

As you can see, the greedy tokenizer matches our intuitions about language much more closely: hellooo is not related to hell, and unilaterally does not use the prefix un (it should be uni). This is in line with what the authors of the aforementioned paper found: when examining performance on morphological tasks, switching to a greedy algorithm made performance go up.
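To see how the same vocabulary can yield two different segmentations, here is a toy sketch in plain Python. The merge table and vocabulary are invented so that they reproduce the hellooo example above; they are not the actual ModernBERT merges, and both functions are simplified illustrations rather than the Hugging Face implementations.

```python
def bpe_segment(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Start from characters and repeatedly apply the highest-priority merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    pieces = list(word)
    while len(pieces) > 1:
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(pieces, pieces[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank wins, leftmost on ties
        pieces[i : i + 2] = [pieces[i] + pieces[i + 1]]
    return pieces


def greedy_segment(word: str, vocab: set[str]) -> list[str]:
    """Always take the longest vocabulary entry that matches at the current position."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot segment {word!r}")
    return pieces


# Hypothetical merge table, in priority order.
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("o", "o"), ("oo", "o"), ("hell", "o")]
# The vocabulary is every character plus every merge result.
vocab = set("helo") | {a + b for a, b in merges}

print(bpe_segment("hellooo", merges))
print(greedy_segment("hellooo", vocab))
```

Even though hello is in the vocabulary, the merge order commits BPE to oo and ooo before it ever considers joining hell and o, so it ends up with hell and ooo, while the greedy segmenter picks hello right away.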

Implementation

I implemented greedy tokenization by simply switching out the tokenizer model from whatever it was to a WordPiece implementation. This is easy in my package tokenizer-datamodels.

from tokenizerdatamodels import TokenizerModel

# This is a pydantic model.
datamodel = TokenizerModel.from_pretrained("answerdotai/ModernBERT-base")
# This is a HF tokenizer, you can just use it.
greedy = datamodel.make_model_greedy().to_tokenizer()

greedy.encode("hello phonenumber")

tokenizer-datamodels, as the name implies, is just a collection of models that can be used to parse and edit a tokenizer.json, which is the Hugging Face tokenizers construct that contains all information about a tokenizer. It has many of these tiny features, and I’ll be adding more soon, so check it out if that sounds interesting.

Experiments

As mentioned above, greedy works well on intrinsic tasks. But does it actually improve performance on downstream tasks? To find out, I ran two models, multilingual-e5-base and modernbert-embed-base, on NanoBEIR. This is very similar to the setup in my previous blog post about decasing.

	ModernBERT	e5
Original	57.68	57.27
Greedy	55.20	55.90

So, interestingly, switching to a greedy tokenizer completely tanks the scores of the models: it is literally worse on all datasets in NanoBEIR for both models. While we could consider this to be in direct opposition to the results in the paper, I don’t think this is the case.

Discussion

To see why, recall that the results from the paper were based on the tokenization itself; no downstream models were trained. In these experiments, we instead change the tokenizer of a model without retraining it. Now, to see why this is bad, we should realize that tokens don’t have any intrinsic meaning to a model; the model does not know that “hell” is not a good prefix for the word “hellooo”, and “hello” is a better one. To the model, these are just indices to an embedding table. So changing the segmentation of words can’t realistically help the model, because we are changing the underlying token distribution feeding into the model without telling the model about it.

My hypothesis: pre-training an encoder model with a greedy tokenizer leads to better results than training one with a regular BPE tokenizer. Having tokens that more closely follow morphology is probably good for model performance: the model has to learn fewer exceptions, and can rely more on surface form.

Related hypotheses: if you have trillions of tokens, is that still relevant? Aren’t all possible segmentations covered and memorizable? Even if following morphology is better, does it only impact training time, or is the resulting model actually better? Many interesting questions, and things I am eager to explore.