From Chesterton's fence to Chesterton's gap
The British writer and Catholic apologist G. K. Chesterton is perhaps best known to programmers through a paragraph in which he introduces what is now known as “Chesterton’s fence”. It’s a very simple idea: you walk through a field and see a fence that seems to have no purpose. Instead of tearing it down because you can’t see its use, try to understand, or ask, why somebody put it there.[1] That’s it!
-
The full quote is: ‘In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”’ [source]
Make something for someone
I recently listened to the album Music for Existing by producer Martyn. I wasn’t really familiar with him; in fact, the album got algorithmically recommended to me because it features Duval Timothy, whose album Meeting with a Judas Tree I adore.
Scikit-learn's fit transform paradigm is probably not for you
If you’ve ever used code from scikit-learn, you will have seen the following pattern:
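As a rough illustration of that pattern, here is a minimal sketch in plain Python. It does not use scikit-learn itself: the class name StandardScalerSketch is hypothetical and only mimics the fit / transform / fit_transform shape, including the convention of storing learned state in attributes with a trailing underscore.

```python
from statistics import fmean, pstdev


class StandardScalerSketch:
    """Hypothetical stand-in that mimics the fit/transform shape."""

    def fit(self, values):
        # fit() learns state from the data and returns self,
        # which is what makes chained calls possible.
        self.mean_ = fmean(values)
        self.scale_ = pstdev(values) or 1.0  # guard against zero spread
        return self

    def transform(self, values):
        # transform() applies the learned state without changing it.
        return [(v - self.mean_) / self.scale_ for v in values]

    def fit_transform(self, values):
        # Convenience method: learn from the data, then transform it.
        return self.fit(values).transform(values)


train = [1.0, 2.0, 3.0]
scaler = StandardScalerSketch()
train_scaled = scaler.fit_transform(train)  # state is learned here
test_scaled = scaler.transform([4.0])       # and reused here
```

The point of the split is that the statistics come from the training data only, and the same fitted object is then applied unchanged to new data.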
Evaluating static models on RTEB
The group of researchers associated with the Massive Text Embedding Benchmark (MTEB) has released a new benchmark: the Retrieval Text Embedding Benchmark (RTEB). As you may know, MTEB ranks models on how well they perform on a variety of tasks in a zero-shot setting, and is meant to reflect how well your model transfers to new tasks. Ranking high on MTEB can make or break your model, so it has become something that people optimize for, and as Goodhart put it: “when a measure becomes a target, it ceases to be a good measure”.
Comparing PCA and MRL for static models
Without dimensionality reduction, static models can be hundreds of MB in size. Choosing the right dimensionality-reduction technique can shrink them without sacrificing retrieval quality. I have always been a huge fan of Principal Component Analysis (PCA) for making static models smaller. For example, PCA is used in model2vec, was used in an older version of tokenlearn to post-process models, and is used in the newer version of tokenlearn to reduce the dimensionality of the teacher models. Recently, however, I started experimenting with Matryoshka Representation Learning (MRL) for reducing dimensions and, to my surprise, found it to be superior. This blog post thus tries to answer the question: when should you use PCA, and when MRL? If one is better than the other, why? I discuss both techniques, why applying dimensionality reduction to static embeddings makes sense, and some options for future work.
Static late interaction models
Late interaction is an interesting paradigm for computing the similarity between two documents, and can be seen as a hybrid of sparse and dense retrieval. In this post, I will show how static models in a late interaction setting actually reduce to sparse models. I will also argue that, in the absence of empirical evidence to the contrary, there’s no good reason to assume that static late interaction models will be much better than their dense counterparts. But first, let’s dive into some fundamentals: I’ll explain what sparse retrieval and dense retrieval are, and how late interaction fits in with both of those paradigms.
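To make the idea concrete, here is a small sketch of one common late interaction scoring rule, the ColBERT-style MaxSim: every query token vector is matched to its most similar document token vector, and those maxima are summed. That this is the scoring rule meant here is an assumption, and the function names and toy vectors are made up for illustration.

```python
def dot(u, v):
    # Plain dot product between two equally sized vectors.
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vectors, doc_vectors):
    # For each query token, keep only its best-matching document token,
    # then sum those best matches into a single score.
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy token vectors (hypothetical): one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]
doc_b = [[0.0, -1.0]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

Unlike dense retrieval, which compares one vector per document, the score here depends on individual token-to-token matches.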
Better Greedy Tokenizers: Handling WordPiece's [UNK] Problem
In a previous post, I showed that making a tokenizer greedy, that is, always picking the longest matching subword like WordPiece does, can improve results without retraining. Unfortunately, WordPiece can also silently break your tokenization.
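The failure mode can be sketched with a toy greedy longest-match tokenizer. This is a simplified illustration of WordPiece-style inference, not the code from the post: the function name, the tiny vocabulary, and the "##" continuation prefix are assumptions chosen to mirror BERT-style vocabularies.

```python
def wordpiece_greedy(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match tokenization of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark continuation pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Nothing matches at this position: the whole word is
            # replaced by [UNK], and the pieces found so far are dropped.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


vocab = {"token", "##izer", "##s"}
ok = wordpiece_greedy("tokenizers", vocab)
broken = wordpiece_greedy("tokenizerz", vocab)
```

In this sketch, a single unmatched character at the end of "tokenizerz" turns the entire word into one [UNK] token, even though most of it could be tokenized.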
Note: alternative to regex splitting in byte tokenizers
In a previous note, I discussed an alternative to setting split to true in a ByteLevel pretokenizer. I suggested using a ByteLevel normalizer first, and then splitting using a complicated regex in “byte space”. However, this turned out not to work very well: certain character classes in the original regex, such as \s, are very difficult to convert to a pattern in byte space.
Separate Normalization from Splitting in ByteLevel tokenizers
This note is wrong! This was pointed out to me by Sasuke___420. As it turns out, the regex does not behave the same as the original one, specifically for non-ASCII spaces. Upon further reflection, I don’t think you should really use this.
Turning any tokenizer into a greedy one
I recently re-read Greed is All You Need: An Evaluation of Tokenizer Inference Methods. In this paper, the authors show that switching out inference methods for tokenizers can improve performance on various tasks.