<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Stéphan Tulkens</title>
    <description>NLP Person</description>
    <link>https://stephantul.github.io/</link>
    <atom:link href="https://stephantul.github.io/feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Wed, 17 Jun 2026 09:12:15 +0000</pubDate>
    <lastBuildDate>Wed, 17 Jun 2026 09:12:15 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>From Chesterton&apos;s fence to Chesterton&apos;s gap</title>
        <description>&lt;p&gt;The English writer and Christian apologist &lt;a href=&quot;https://en.wikipedia.org/wiki/G._K._Chesterton&quot;&gt;G. K. Chesterton&lt;/a&gt; is, perhaps, best known to programmers through a paragraph in which he introduces what is now known as “Chesterton’s fence”. It’s a very simple idea: You walk through a field and see a fence which, seemingly, has no purpose. Instead of tearing it down because you cannot see its use, try to understand or ask why somebody put it there. (&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;) That’s it!&lt;/p&gt;

&lt;p&gt;Paraphrasing: if you think somebody built something bad or in a bad way, try to understand why they did it that way before undoing their work. Being burned while ignoring Chesterton’s fence is a rite of passage for every programmer: you see a piece of code and think “who the hell wrote this”. Then, when rewriting it, you break production, and realize that there was a good reason somebody did the things they did. They weren’t stupid after all. Or, you rewrite it and it’s actually better, and you now know more about the person who wrote it, and maybe can teach them how to build better together.&lt;/p&gt;

&lt;p&gt;Chesterton’s fence urges us to slow down, and retrace the thinking steps of the person who built before you, thus putting you in their shoes. Keeping Chesterton’s fence in mind does not only make you a better engineer, but it also makes you empathize more with the people around you, the ones that came before you. It shows you the limits of your own knowledge, but simultaneously shows you what you can teach others around you.&lt;/p&gt;

&lt;h1 id=&quot;chestertons-gap&quot;&gt;Chesterton’s gap&lt;/h1&gt;

&lt;p&gt;So having said that, I think there’s an interesting new dynamic at play in software land, which I will call Chesterton’s gap. It’s like Chesterton’s fence, except that people walk through the field and ask themselves why somebody &lt;em&gt;hasn’t&lt;/em&gt; built a fence there yet, and then, without asking, build a fence.&lt;/p&gt;

&lt;p&gt;To me, this is what it feels like to build open source libraries. The cost of creating lines of code has dropped to ~0, which causes people (&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;) to submit 10k line PRs without even opening an issue first. The thing is, these PRs make sense. They are not bad! They’re just not necessary. They add features to projects that nobody asked for, add tools that are marginally useful, add configuration scaffolding for IDEs that barely anyone uses. (&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;) To return to the parable: the fences are well built, they are sturdy, they may even serve a purpose. But I don’t need a fence in that specific location, even if it is free. I just don’t need it, and I don’t want it.&lt;/p&gt;

&lt;p&gt;I can also write lines of code for free. I have the same superpowers you do, so if I didn’t add some feature to a project I own, there’s probably a good reason I didn’t add that specific feature. If you’re wondering why I didn’t add it myself, just ask. Don’t build the fence.&lt;/p&gt;

&lt;h2 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h2&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The full quote is: ‘IN the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”’ [&lt;a href=&quot;https://catholiclibrary.org/library/view?chunk.id=00000011&amp;amp;docId=%2FContemporary-EN%2FXCT.165.html&quot;&gt;source&lt;/a&gt;] &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;who am I kidding. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I have ignored maintenance here, but maintenance is a big part of why you should not just accept 10k line PRs. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sun, 14 Jun 2026 00:00:00 +0000</pubDate>
        <link>https://stephantul.github.io/blog/unfence/</link>
        <guid isPermaLink="true">https://stephantul.github.io/blog/unfence/</guid>
        
        
      </item>
    
      <item>
        <title>Make something for someone</title>
        <description>&lt;p&gt;I recently listened to the album &lt;a href=&quot;https://martyn.bandcamp.com/album/music-for-existing&quot;&gt;Music for Existing&lt;/a&gt; by producer &lt;a href=&quot;https://martyn.bandcamp.com/&quot;&gt;Martyn&lt;/a&gt;. I wasn’t really familiar with him, and in fact it got algorithmically recommended to me because the album features &lt;a href=&quot;https://duvaltimothy.co.uk/&quot;&gt;Duval Timothy&lt;/a&gt;, whose album &lt;a href=&quot;https://duvaltimothy.bandcamp.com/album/meeting-with-a-judas-tree&quot;&gt;Meeting with a Judas Tree&lt;/a&gt; I adore.&lt;/p&gt;

&lt;p&gt;The track &lt;a href=&quot;https://martyn.bandcamp.com/track/musa-at-erbil&quot;&gt;Musa at Erbil&lt;/a&gt;, which features the voice and words of &lt;a href=&quot;https://www.musaokwonga.com/about&quot;&gt;Musa Okwonga&lt;/a&gt;, contains a beautiful reflection on modern life, which really resonated with me as a software professional, although I realize that that last part probably wasn’t his intention. It is worth quoting here in full, but please also &lt;a href=&quot;https://martyn.bandcamp.com/track/musa-at-erbil&quot;&gt;listen to the track and possibly buy it&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;The main thing is to make something together with someone else, it could be a 
meal, a mess, whatever.
The point is, you have to make something. 
It&apos;s not what you make that matters, it&apos;s that you make it. It&apos;s that you 
enjoy making it. 
Look, so much of modern life is about outcome, capitalism, outcome, product, 
outcome, AI, outcome.
But those are not outcomes you achieve in group, they&apos;re outcomes you 
achieve alone. 
Sit at your desk, you write something for a deadline for payment, outcome 
alone. 
AI, outcome alone, you type something in, outcome, alone. 
The thing we&apos;re missing from that is human process, shared process. 
Shared journeys, it&apos;s about that, I don&apos;t know why you&apos;d want to do that any 
other way, because it&apos;s about shared journey, community, right. 
Community is what makes us human. 
Why would you want to cut that out, why would you go, why would you go from 
A to Z, why would you miss out on the alphabet in between. 
That&apos;s the, that&apos;s the fun, that&apos;s the joy. 
You don&apos;t watch a movie and go straight to the end credits, you experience it. 
Why would you go, why would you do that? 
You don&apos;t do that in a movie. 
Why would you do that in life? 
How is a movie serving us in the way we live each day. 
Movies were created to describe the human experience, and now they&apos;re better than
human experience. 
What are we doing here? 
What are we actually doing? 
So yeah, make something with someone. 
If you can make something and laugh with someone... but best of all, just love 
making something with someone.

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This blew me away.&lt;/p&gt;

&lt;p&gt;Without beating around the bush: we’re trying to automate ourselves out of a job. I do this too, and for a large part it has made me very productive. Thanks to coding assistance, I have built systems I could not otherwise have built, and have learned things I would not have otherwise learned. This is to say that it is not something I shy away from.&lt;/p&gt;

&lt;p&gt;But at the same time I am afraid of losing the fun part. This is the part which makes the creation of software &lt;em&gt;feel&lt;/em&gt; like &lt;em&gt;creation&lt;/em&gt;. This is when you make something by hand, together with another human being. You brainstorm about what you want to achieve, share your goals and your dreams, loudly ask daring questions about which machinery is missing, and dream about building this together. And then, you tap a bunch of keys which makes symbols appear on a screen in a little box and you then run some commands and then suddenly the thing you made is alive. You have &lt;em&gt;made&lt;/em&gt; something! For a while last year, I worked by myself on a bunch of open source projects, but it just wasn’t fun at all. It wasn’t intellectually stimulating in the same way as writing code together is. This is the dreaming part: it’s making things together, being part of a well-oiled machine together, fixing the machine together when it’s broken, saying sorry when you break it, being a little bit mad at your partner for breaking it but also helping them fix it.&lt;/p&gt;

&lt;p&gt;The biggest risk in giving away your agency to a machine is not downskilling or making yourself vulnerable to bugs; it is losing touch with others and unlearning what it means to make something together, or even coming to despise working with others. The worst thing we can become is a bunch of people sitting alone, efficiently but soullessly contributing code to some god-forsaken pile nobody cares about. To quote the text above: “what are we doing?” We know that good things happen when people actually care about what they make; we just need to be brave enough to accept the consequences of working with humans.&lt;/p&gt;
</description>
        <pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate>
        <link>https://stephantul.github.io/blog/make-something/</link>
        <guid isPermaLink="true">https://stephantul.github.io/blog/make-something/</guid>
        
        
        <category>creation</category>
        
      </item>
    
      <item>
        <title>Scikit-learn&apos;s fit transform paradigm is probably not for you</title>
        <description>&lt;p&gt;If you’ve ever used code from &lt;a href=&quot;https://scikit-learn.org/stable/index.html&quot;&gt;scikit-learn&lt;/a&gt;, you will have seen the following pattern:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StandardScaler&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;scaler&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StandardScaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;scaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;X_transformed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Or equivalently
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_transformed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; transformers (&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;), the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fit&lt;/code&gt; call sets the internal state of the object, while the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;transform&lt;/code&gt; call uses the set internal state to transform some data into something else. (&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;) This paradigm is really useful because it allows for zero-cost chaining: any sequence of transformations can be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fit_transform&lt;/code&gt;ed by simply calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fit_transform&lt;/code&gt; on all transformations in sequence.&lt;/p&gt;
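
&lt;p&gt;As a minimal sketch of that chaining, assume two toy transformers, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Center&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UnitScale&lt;/code&gt;, and a helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chain_fit_transform&lt;/code&gt;. All three names are made up for illustration and are not scikit-learn classes. The loop feeds the output of each step into the next one:&lt;/p&gt;

```python
import numpy as np


class Center:
    """Toy transformer (hypothetical): subtracts the column means seen in fit."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return X - self.mean_

    def fit_transform(self, X):
        return self.fit(X).transform(X)


class UnitScale:
    """Toy transformer (hypothetical): divides by the column stds seen in fit."""

    def fit(self, X):
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        return X / self.std_

    def fit_transform(self, X):
        return self.fit(X).transform(X)


def chain_fit_transform(steps, X):
    # Each step is fit on the output of the previous step, then transforms it.
    for step in steps:
        X = step.fit_transform(X)
    return X


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32))
X_out = chain_fit_transform([Center(), UnitScale()], X)
```

&lt;p&gt;scikit-learn packages this idea as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sklearn.pipeline.Pipeline&lt;/code&gt;, so you rarely need to write the loop yourself.&lt;/p&gt;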

&lt;h2 id=&quot;conflation-between-construction-and-usage&quot;&gt;Conflation between construction and usage&lt;/h2&gt;

&lt;p&gt;The main point I’ll be making in this article is that scikit-learn’s fit transform paradigm mixes up the factory pattern, that is, an object that instantiates other objects, with the actual objects. This is used really well by scikit-learn, but probably doesn’t fit your codebase.&lt;/p&gt;

&lt;p&gt;To illustrate, let’s reimplement the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StandardScaler&lt;/code&gt; using numpy: (&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;)&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;__future__&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;annotations&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StandardScaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_mean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_mean&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_std&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_std&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StandardScaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;

    &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;property&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_is_fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_mean&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_std&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_is_fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;raise&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;ValueError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;StandardScaler has not been fit&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Let’s first talk about the initializer. In a scikit-learn initializer, you are only supposed to set the so-called &lt;em&gt;hyperparameters&lt;/em&gt; of a transformer or estimator. That is, you should only set attributes that do not depend on the data you will use to fit the model. In our case, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;with_mean&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;with_std&lt;/code&gt; determine the behavior of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StandardScaler&lt;/code&gt; that is &lt;em&gt;produced&lt;/em&gt; by fitting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StandardScaler&lt;/code&gt; on some data; if we set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;with_mean&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt;, we actually get a different object than we would get if we set it to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Second, note that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fit&lt;/code&gt; function is destructive. It erases the original state, and introduces a completely new state. From a Python perspective, however, the same &lt;em&gt;object&lt;/em&gt; is returned; it’s only the internal state that is reset.&lt;/p&gt;
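
&lt;p&gt;A small sketch of that behavior, assuming a stripped-down transformer whose only state is a mean. The class &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MeanCenterer&lt;/code&gt; is a hypothetical stand-in, not part of scikit-learn. Fitting twice returns the same object, but the state from the first fit is overwritten:&lt;/p&gt;

```python
import numpy as np


class MeanCenterer:
    """Hypothetical stand-in for a transformer whose only state is a mean."""

    def __init__(self):
        self.mean = None

    def fit(self, X):
        # Overwrites whatever state an earlier fit call left behind.
        self.mean = X.mean(axis=0)
        return self


centerer = MeanCenterer()

first = centerer.fit(np.zeros((10, 3)))
mean_after_first_fit = first.mean.copy()

second = centerer.fit(np.ones((10, 3)))

# Same Python object, but the state from the first fit is gone.
same_object = first is second
state_changed = not np.array_equal(mean_after_first_fit, second.mean)
```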

&lt;p&gt;Third, note that there is no need to store the hyperparameters once you’ve fit the transformer. &lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Fourth, for a given &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StandardScaler&lt;/code&gt;, it is impossible to know whether it has been fit or not. So, whenever you work with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt;’s internals, you’ll have to continuously check whether the estimators and transformers you work with actually have their internal state set.&lt;/p&gt;
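
&lt;p&gt;To make this concrete: an unfit and a fit scaler have exactly the same type, so only a runtime check can tell them apart. The sketch below assumes scikit-learn is installed and uses its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;check_is_fitted&lt;/code&gt; utility together with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NotFittedError&lt;/code&gt;; the wrapper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_fitted&lt;/code&gt; is a name invented for this example:&lt;/p&gt;

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted


def is_fitted(estimator):
    # Hypothetical helper: turns scikit-learn's runtime check into a boolean.
    try:
        check_is_fitted(estimator)
    except NotFittedError:
        return False
    return True


X = np.random.default_rng(0).standard_normal((100, 32))

unfit_scaler = StandardScaler()
fit_scaler = StandardScaler().fit(X)

# Both objects have the same type, so the type alone says nothing about state.
same_type = type(unfit_scaler) is type(fit_scaler)
```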

&lt;p&gt;Fifth, when you write your own transformers and estimators, it is very easy to incorrectly implement this state. (&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;)&lt;/p&gt;

&lt;h2 id=&quot;splitting-out-the-factory&quot;&gt;Splitting out the factory&lt;/h2&gt;

&lt;p&gt;So, now on to my main thesis: this whole problem can be avoided by conceding that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;StandardScaler&lt;/code&gt; is both a factory and the object that is constructed by the factory. As such, if we split this up into two separate classes, we’ll see that we’ll end up with much cleaner code.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;__future__&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;annotations&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StandardScaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StandardScalerFactory&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_mean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_mean&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_std&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_std&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StandardScaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StandardScaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StandardScaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;scaler&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, we’ve changed the structure considerably. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fit&lt;/code&gt; now returns an object which implements &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;transform&lt;/code&gt;, and only implements &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;transform&lt;/code&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fit_transform&lt;/code&gt; returns a tuple, the first item of which is the fit object, the second of which is the transformed data. This still allows us to forward state in a single call as follows:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;transformers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[...]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Some list of transformers
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# some numpy array:
&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fit_transformers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transformer&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transformers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fit_transformer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transformer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fit_transformers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_transformer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
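&lt;p&gt;To make the loop above concrete, here is a minimal, dependency-free sketch of a two-step pipeline built from factories. It uses plain Python lists instead of NumPy arrays, and the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Centerer&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Scaler&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fit_pipeline&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apply_pipeline&lt;/code&gt; are made up for this illustration; they are not part of scikit-learn or any other library.&lt;/p&gt;

```python
from __future__ import annotations

from statistics import fmean, pstdev


class Centerer:
    """Fit object: subtracts a mean that was computed at fit time."""

    def __init__(self, mean: float) -> None:
        self.mean = mean

    def transform(self, X: list[float]) -> list[float]:
        return [x - self.mean for x in X]


class CentererFactory:
    """Factory: carries no fitted state, it only builds a Centerer."""

    def fit(self, X: list[float]) -> Centerer:
        return Centerer(fmean(X))

    def fit_transform(self, X: list[float]) -> tuple[Centerer, list[float]]:
        fitted = self.fit(X)
        return fitted, fitted.transform(X)


class Scaler:
    """Fit object: divides by a standard deviation computed at fit time."""

    def __init__(self, std: float) -> None:
        self.std = std

    def transform(self, X: list[float]) -> list[float]:
        return [x / self.std for x in X]


class ScalerFactory:
    """Factory: builds a Scaler from the population standard deviation."""

    def fit(self, X: list[float]) -> Scaler:
        # Fall back to 1.0 for constant data so we never divide by zero.
        return Scaler(pstdev(X) or 1.0)

    def fit_transform(self, X: list[float]) -> tuple[Scaler, list[float]]:
        fitted = self.fit(X)
        return fitted, fitted.transform(X)


def fit_pipeline(factories, X):
    """Fit each factory on the output of the previous step."""
    fitted = []
    for factory in factories:
        transformer, X = factory.fit_transform(X)
        fitted.append(transformer)
    return fitted, X


def apply_pipeline(fitted, X):
    """Apply already-fit transformers to new data; no fitting happens here."""
    for transformer in fitted:
        X = transformer.transform(X)
    return X


fitted, X_train = fit_pipeline([CentererFactory(), ScalerFactory()], [1.0, 2.0, 3.0])
X_new = apply_pipeline(fitted, [2.0])
```

&lt;p&gt;Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apply_pipeline&lt;/code&gt; only ever sees fit objects, so it needs no “has this been fit?” check at all.&lt;/p&gt;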

&lt;p&gt;So what did we gain? A couple of things:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;We can guarantee that the object we’re dealing with has been fit on some data, and is usable.&lt;/li&gt;
  &lt;li&gt;We clearly separate creation (the factory) from usage.&lt;/li&gt;
  &lt;li&gt;We need far fewer checks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main advantage of this is that we get very strong typing guarantees. For every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fit&lt;/code&gt;, we can statically determine the type of the returned object, and whether it can be used to transform or predict. For example, with base classes:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;typing&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Generic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TypeVar&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BaseTransformer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TypeVar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;T&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bound&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BaseTransformer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BaseFactory&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Generic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]):&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
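&lt;p&gt;As a rough sketch of how a concrete pair could plug into these base classes: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MinMaxScaler&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MinMaxScalerFactory&lt;/code&gt; below are hypothetical examples, plain lists stand in for NumPy arrays, and the type variable is declared with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bound=&lt;/code&gt; so that it accepts any subclass of the base transformer.&lt;/p&gt;

```python
from __future__ import annotations

from typing import Generic, TypeVar


class BaseTransformer:
    def transform(self, X: list[float]) -> list[float]:
        raise NotImplementedError


# `bound=` means: T can be BaseTransformer or any subclass of it.
T = TypeVar("T", bound=BaseTransformer)


class BaseFactory(Generic[T]):
    def fit(self, X: list[float]) -> T:
        raise NotImplementedError


class MinMaxScaler(BaseTransformer):
    """Hypothetical fit object: rescales values to the [0, 1] range seen at fit time."""

    def __init__(self, low: float, high: float) -> None:
        self.low = low
        # Fall back to 1.0 for constant data so we never divide by zero.
        self.span = (high - low) or 1.0

    def transform(self, X: list[float]) -> list[float]:
        return [(x - self.low) / self.span for x in X]


class MinMaxScalerFactory(BaseFactory[MinMaxScaler]):
    """Hypothetical factory: `fit` is declared to return a MinMaxScaler."""

    def fit(self, X: list[float]) -> MinMaxScaler:
        return MinMaxScaler(min(X), max(X))


# A type checker can infer that `scaler` is a MinMaxScaler,
# and that the factory itself has no `transform` method.
scaler = MinMaxScalerFactory().fit([0.0, 5.0, 10.0])
rescaled = scaler.transform([5.0])
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;transform&lt;/code&gt; on the factory is now an error a type checker can flag before the code ever runs, rather than a runtime “not fitted” exception.&lt;/p&gt;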

&lt;p&gt;One downside of this pattern is that the hyperparameters are no longer accessible on the fit object.&lt;/p&gt;

&lt;p&gt;In a follow-up post, we’ll investigate how we can improve on this pattern and have our cake and eat it too.&lt;/p&gt;

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h3&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;A transformer here is something that transforms some data, not a transformer in the machine learning sense. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;scikit-learn also implements predictors, which have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fit&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;predict&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fit_predict&lt;/code&gt; functions. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In a serious implementation, we’d derive from a base class, use generics, etc. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Although doing so is very useful for reproducing research. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I don’t think this is a problem of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; itself though. Their estimators are all implemented correctly. This is easy to get wrong, however. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate>
        <link>https://stephantul.github.io/blog/fit-transform/</link>
        <guid isPermaLink="true">https://stephantul.github.io/blog/fit-transform/</guid>
        
        
        <category>python</category>
        
      </item>
    
      <item>
        <title>Evaluating static models on RTEB</title>
        <description>&lt;p&gt;The group of researchers associated with the &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;Massive Text Embedding Benchmark&lt;/a&gt; (MTEB) has released a new benchmark: the &lt;a href=&quot;https://huggingface.co/blog/rteb&quot;&gt;Retrieval Text Embedding Benchmark&lt;/a&gt; (RTEB). As you may know, MTEB ranks models on their ability to perform well at a variety of tasks in a zero-shot setting, and is meant to reflect how well your model transfers to new tasks. Ranking high on MTEB can make or break your model, so it has become something that people optimize for, and as &lt;a href=&quot;https://en.wikipedia.org/wiki/Goodhart%27s_law&quot;&gt;Goodhart’s law puts it&lt;/a&gt;: “when a measure becomes a target, it ceases to be a good measure”.&lt;/p&gt;

&lt;p&gt;The mechanism behind Goodhart’s law is particularly problematic for MTEB, since all the datasets and evaluations behind it are completely open, making it feasible to hill-climb MTEB without actually training directly on any data from the benchmark. (&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;) RTEB solves this issue by keeping a portion of the leaderboard private, a practice that used to be common in so-called shared tasks. (&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;) Users wishing to appear on the leaderboard need to provide their model and have it tested on the private subset. This solves the issue of adversaries with a lot of compute being able to hill-climb the leaderboard by themselves. The downside of doing this is obviously that keeping the leaderboard up to date is a substantial effort. In addition, RTEB exclusively focuses on retrieval, and only uses datasets that are relevant for retrieval.&lt;/p&gt;

&lt;h3 id=&quot;training-static-models&quot;&gt;Training static models&lt;/h3&gt;

&lt;p&gt;By and large, there are two good ways to train a static model: (&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;)&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Knowledge distillation: used to create the &lt;a href=&quot;https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062&quot;&gt;potion models&lt;/a&gt; by &lt;a href=&quot;https://minish.ai/&quot;&gt;Minish&lt;/a&gt;. This approach performs basic knowledge distillation using a larger teacher model and the cosine similarity or MSE as a loss function.&lt;/li&gt;
  &lt;li&gt;Supervised training: detailed in &lt;a href=&quot;https://huggingface.co/blog/static-embeddings&quot;&gt;Tom Aarsen’s blog post&lt;/a&gt;. This is simply performing supervised training using, e.g., a ranking loss, on very large datasets, without doing any finetuning.&lt;/li&gt;
&lt;/ol&gt;
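
&lt;p&gt;For illustration, the two distillation losses named in the first item could look roughly like this. This is a plain-Python sketch on single vectors, not the code used to train the potion models, and the function names are made up for this example.&lt;/p&gt;

```python
import math


def cosine_distance_loss(student: list[float], teacher: list[float]) -> float:
    """1 - cosine similarity: zero when both vectors point in the same direction."""
    dot = sum(s * t for s, t in zip(student, teacher))
    student_norm = math.sqrt(sum(s * s for s in student))
    teacher_norm = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (student_norm * teacher_norm)


def mse_loss(student: list[float], teacher: list[float]) -> float:
    """Mean squared error: zero only when the vectors are identical."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Same direction, different length: the cosine loss ignores the difference
# in magnitude, while the MSE penalizes it.
student_vec = [1.0, 0.0]
teacher_vec = [2.0, 0.0]
cosine_value = cosine_distance_loss(student_vec, teacher_vec)
mse_value = mse_loss(student_vec, teacher_vec)
```

&lt;p&gt;The practical difference is that the cosine variant only asks the student to match the direction of the teacher’s embedding, whereas MSE also asks it to match the scale.&lt;/p&gt;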

&lt;p&gt;As far as I could tell, both approaches are roughly competitive. Here are the scores for the models on MTEB, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;potion&lt;/code&gt; models are trained via knowledge distillation, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;static-.+mrl&lt;/code&gt; models are trained on large datasets of sentences.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Name&lt;/th&gt;
      &lt;th&gt;MTEB avg score&lt;/th&gt;
      &lt;th&gt;MTEB subset&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://huggingface.co/minishlab/potion-multilingual-128M&quot;&gt;potion-multilingual-128m&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;47.23&lt;/td&gt;
      &lt;td&gt;multilingual&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://huggingface.co/sentence-transformers/static-similarity-mrl-multilingual-v1&quot;&gt;static-similarity-mrl-multilingual-v1&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;47.24&lt;/td&gt;
      &lt;td&gt;multilingual&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://huggingface.co/minishlab/potion-base-8M&quot;&gt;potion-base-8m&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;53.3&lt;/td&gt;
      &lt;td&gt;english&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1&quot;&gt;static-retrieval-mrl-en-v1&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;51.25&lt;/td&gt;
      &lt;td&gt;english&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As you can see, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;potion-base-8m&lt;/code&gt; is on average better than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;static-retrieval-mrl-en-v1&lt;/code&gt;. At retrieval, however, the supervised model is better than the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;potion&lt;/code&gt; model. For the multilingual models, knowledge distillation and the supervised approach seem to do equally well. The conclusion so far: supervised training on sentence datasets leads to pretty good general models, and to very good models for whatever task you train on (here, retrieval). Multilingual semantics can also be learned really well from sentence datasets, even without prior language model training.&lt;/p&gt;

&lt;p&gt;Now, let’s turn to RTEB. As noted above, RTEB is specifically meant for retrieval, and also has an English and multilingual subset. This allows us to answer the following question: does knowledge distillation-based training lead to better performance on held-out data than straight supervision? Because we have English and multilingual models in both conditions, we have a very nice way to test. My personal prediction is that knowledge distillation would be better than supervision; even though the supervised models have been trained on large amounts of data, they have been trained to solve specific problems. Knowledge distillation, on the other hand, is about generally mimicking a larger model, and should thus generalize better to unseen datasets.&lt;/p&gt;

&lt;h3 id=&quot;results&quot;&gt;Results&lt;/h3&gt;

&lt;p&gt;Overall, supervised models outperform knowledge-distilled ones, particularly on the private leaderboard. I didn’t run any of this myself; &lt;a href=&quot;https://www.linkedin.com/in/kennethenevoldsen/&quot;&gt;Kenneth Enevoldsen&lt;/a&gt; ran the models on RTEB. (&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;) I can’t really report much more than the actual results, so let’s dive right in. Note that, as in the previous table, the top two rows are the multilingual models, while the bottom two are the English models.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Name&lt;/th&gt;
      &lt;th&gt;RTEB public score&lt;/th&gt;
      &lt;th&gt;RTEB private score&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://huggingface.co/minishlab/potion-multilingual-128M&quot;&gt;potion-multilingual-128m&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;23.23&lt;/td&gt;
      &lt;td&gt;36.63&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://huggingface.co/sentence-transformers/static-similarity-mrl-multilingual-v1&quot;&gt;static-similarity-mrl-multilingual-v1&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;24.54&lt;/td&gt;
      &lt;td&gt;43.73&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://huggingface.co/minishlab/potion-base-8M&quot;&gt;potion-base-8m&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;24.11&lt;/td&gt;
      &lt;td&gt;37.45&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1&quot;&gt;static-retrieval-mrl-en-v1&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;29.09&lt;/td&gt;
      &lt;td&gt;44.48&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As you can see, both of the supervised mrl models surpass their knowledge-distilled counterparts on the private set. This is especially striking for the multilingual model: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;potion-multilingual-128m&lt;/code&gt; tracks the mrl model very closely on the public set, but is much worse on the private set. This is very interesting, and ran counter to my expectation, as all these models were more or less evenly matched on the full MTEB set.&lt;/p&gt;
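
&lt;p&gt;To put numbers on those gaps, here is a small sketch that subtracts each distilled model’s score from its supervised counterpart’s score. The scores are copied from the table above; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;score_gap&lt;/code&gt; helper is made up for this example.&lt;/p&gt;

```python
# (public score, private score) per model, copied from the RTEB table.
scores = {
    "potion-multilingual-128m": (23.23, 36.63),
    "static-similarity-mrl-multilingual-v1": (24.54, 43.73),
    "potion-base-8m": (24.11, 37.45),
    "static-retrieval-mrl-en-v1": (29.09, 44.48),
}


def score_gap(supervised: str, distilled: str) -> tuple[float, float]:
    """Return (public gap, private gap) in points, supervised minus distilled."""
    sup_public, sup_private = scores[supervised]
    dis_public, dis_private = scores[distilled]
    return round(sup_public - dis_public, 2), round(sup_private - dis_private, 2)


multilingual_gap = score_gap(
    "static-similarity-mrl-multilingual-v1", "potion-multilingual-128m"
)
english_gap = score_gap("static-retrieval-mrl-en-v1", "potion-base-8m")

print("multilingual (public, private):", multilingual_gap)
print("english (public, private):", english_gap)
```

&lt;p&gt;For the multilingual pair, the gap grows from roughly 1.3 points on the public set to roughly 7.1 points on the private set; for the English pair, it grows from roughly 5.0 to roughly 7.0 points.&lt;/p&gt;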

&lt;h3 id=&quot;discussion&quot;&gt;Discussion&lt;/h3&gt;

&lt;p&gt;This provides some interesting insights for future models. Knowledge distillation is basically free: all you need is a model for whatever domain you want, and a relatively small corpus. It does not, however, perform as well as supervised learning, even if the supervised data does not match your task directly. The main data point here is the performance of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;static-similarity-mrl-multilingual-v1&lt;/code&gt;, which was only trained on similarity datasets, and not on retrieval, but still outperforms the knowledge-distilled &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;potion&lt;/code&gt; model on retrieval.&lt;/p&gt;

&lt;p&gt;One interesting thing missing from these tables is hybrid models; it could be that first performing knowledge distillation and then supervised learning (&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;) outperforms doing either of them alone. One issue with static models, however, is that they are extremely susceptible to catastrophic forgetting; without any intervening model, the embeddings just change shape to suit whatever task you train them on.&lt;/p&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;I think that hybrid training and knowledge distillation, and especially knowledge distillation on a larger and more diverse set of documents, could be beneficial. In addition, I think the solution space of knowledge distillation for static models remains largely unexplored. For example, I don’t think anyone has trained a model using hard negatives, or using logit scores. These things will surely be tried by someone. (&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;)&lt;/p&gt;

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h3&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is obviously against the spirit of the leaderboard, but also how science progresses. This is not necessarily an issue, because when parties are forced to disclose whatever made them take a step up the hill, we learn a little bit. The main issue, in my opinion, is that a single user takes many steps in private, and then only discloses the tricks that worked. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;See, for example, the *SEM shared task series, which gave us the well-known STS datasets. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Note that I am excluding &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt; from training because I view that as an initialization strategy. Models that come straight from model2vec are not competitive; performing knowledge distillation or training is always better. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Thanks! 🙏🙏🙏 &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Or the other way around, or both. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Probably me… &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Wed, 08 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://stephantul.github.io/blog/rteb-static/</link>
        <guid isPermaLink="true">https://stephantul.github.io/blog/rteb-static/</guid>
        
        
        <category>static models</category>
        
      </item>
    
      <item>
        <title>Comparing PCA and MRL for static models</title>
        <description>&lt;p&gt;Without dimensionality reduction, static models can be hundreds of megabytes in size. Choosing the right dimensionality-reduction technique can shrink them without sacrificing retrieval quality. I was always a huge fan of Principal Component Analysis (PCA) for making static models smaller. For example, PCA is used in &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt;, was used in an older version of &lt;a href=&quot;https://github.com/MinishLab/tokenlearn&quot;&gt;tokenlearn&lt;/a&gt; to post-process models, and is used in the newer version of tokenlearn to reduce the dimension of the teacher models. Recently, however, I started experimenting with Matryoshka Representation Learning (MRL) for reducing dimensions and, to my surprise, found it to be superior. This blog post thus tries to answer the question: when should you use PCA, and when MRL? If one is better than the other, why? I discuss both techniques, why applying dimensionality reduction to static embeddings makes sense, and some options for future work.&lt;/p&gt;

&lt;h3 id=&quot;static-models-and-pca&quot;&gt;Static models and PCA&lt;/h3&gt;

&lt;p&gt;First, let’s talk about static models and PCA. Static models are just embedding tables indexed by a tokenizer. Just like good old word embeddings, but better. One determining factor in the performance of a static model is whether the embedding space avoids modeling irrelevant or redundant information; because there is no downstream model to process or ignore information, the embedding space itself needs to handle all of this.&lt;/p&gt;

&lt;p&gt;So now, on to PCA. PCA finds an orthonormal basis of vectors (principal components) such that each successive component captures as much of the remaining variance as possible. (&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;) Transformed embeddings are then expressed as linear combinations of these components. As it turns out, in addition to being used for reducing dimensionality, PCA also has the property of making the individual dimensions of your embedding space &lt;em&gt;uncorrelated&lt;/em&gt;. Because PCA also centers every dimension around 0, the result is a space for which the expected cosine similarity between two embeddings is close to 0.&lt;/p&gt;

&lt;p&gt;In addition to decorrelating the dimensions, PCA orders the components by the variance they explain. This allows you to truncate embedding spaces to a specified dimension without losing a lot of performance, a property MRL also has.&lt;/p&gt;

&lt;p&gt;The code below demonstrates that PCA creates an expected cosine similarity close to 0:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.decomposition&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PCA&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.metrics.pairwise&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pairwise_distances&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;cosine_similarity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pairwise_distances&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cosine&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Uniform embeddings are not isotropic.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RandomState&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;42&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;random_uniform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uniform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8192&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Compute similarity
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sim&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosine_similarity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random_uniform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Compute mean score of the strict upper triangle (k=1): this avoids
# counting pairs twice and skips the diagonal of self-similarities.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean_score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;triu_indices_from&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# mean_score ~= 0.75
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PCA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_components&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;transformed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random_uniform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;sim&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosine_similarity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transformed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mean_score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;triu_indices_from&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# mean_score ~= 0.0
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, as you can see, even without reducing dimensionality, we get the expected mean cosine of 0, simply because of the new basis.&lt;/p&gt;
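
&lt;p&gt;The ordering property can be checked in the same way. The snippet below is a minimal sketch on made-up data: the toy matrix, its decaying per-dimension scale, and the choice of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;k = 8&lt;/code&gt; are all assumptions for illustration, not taken from a real model. It verifies that the components come out sorted by explained variance and uncorrelated, and shows that truncating is nothing more than keeping the first &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;k&lt;/code&gt; columns of the rotated space:&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical toy data: 1000 vectors whose dimensions have a decaying scale.
scales = np.linspace(3.0, 0.1, 32)
toy_embeddings = rng.normal(size=(1000, 32)) * scales

pca = PCA(n_components=32)
rotated = pca.fit_transform(toy_embeddings)

# Components are sorted by explained variance, largest first.
ratios = pca.explained_variance_ratio_
is_sorted = bool(np.all(ratios[:-1] >= ratios[1:]))

# The rotated dimensions are uncorrelated: off-diagonal correlations vanish.
corr = np.corrcoef(rotated, rowvar=False)
max_off_diagonal = float(np.abs(corr - np.eye(32)).max())

# Truncating means keeping only the first k columns of the rotated space.
k = 8
truncated = rotated[:, :k]
kept_variance = float(ratios[:k].sum())
print(truncated.shape, is_sorted, round(kept_variance, 3))
```

&lt;p&gt;How much variance survives the cut depends entirely on the data, so the fraction printed here says nothing about real embedding tables.&lt;/p&gt;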

&lt;p&gt;This property of PCA was also surprising to us when we made &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;model2vec&lt;/code&gt;: we used PCA to just reduce the dimensionality to directly compare to traditional embeddings, such as &lt;a href=&quot;https://nlp.stanford.edu/projects/glove/&quot;&gt;GloVe&lt;/a&gt;, but we saw that even when not reducing dimensionality, performance improved.&lt;/p&gt;

&lt;p&gt;So far so good, I loved PCA. I was a PCA apologist. (&lt;sup id=&quot;fnref:2:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;)&lt;/p&gt;

&lt;h3 id=&quot;static-models-and-mrl&quot;&gt;Static models and MRL&lt;/h3&gt;

&lt;p&gt;Matryoshka Representation Learning (MRL) is a relatively new technique. It was proposed in &lt;a href=&quot;https://arxiv.org/abs/2205.13147&quot;&gt;a 2022 paper&lt;/a&gt;, but as far as I know really rose to prominence once OpenAI included it in their embedding models. See &lt;a href=&quot;https://huggingface.co/blog/matryoshka&quot;&gt;this blog post by Tom Aarsen&lt;/a&gt; for more information.&lt;/p&gt;

&lt;p&gt;The idea behind MRL is that, if you train a network with some kind of loss function that operates on vectors, you can evaluate that loss on many contiguous subspaces of those vectors in a single forward pass. This works as follows: you first perform a forward pass to obtain the vectors, and then, for each dimension in a set of dimensions &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D&lt;/code&gt;, you evaluate the loss on the vectors truncated to that many leading dimensions. For example, if our vector is 256-dimensional, and our dimensions &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D&lt;/code&gt; are 32, 64, 128 and 256, we will evaluate the loss four times for each forward pass. This has a few important consequences:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The model learns to create useful representations in the subspaces specified by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D&lt;/code&gt;, but also in intermediate subspaces.&lt;/li&gt;
  &lt;li&gt;The model upweights “lower” dimensions, because these are effectively evaluated more often. For example, if there are four dimensions in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D&lt;/code&gt;, the first dimension of the space contributes to all four loss terms, and thus receives gradient from each of them, while the last dimension only contributes to one.&lt;/li&gt;
&lt;/ol&gt;
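
&lt;p&gt;To make this concrete, here is a minimal numpy sketch of the idea. It is not the implementation used for the experiments: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;info_nce&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mrl_loss&lt;/code&gt; helpers, the similarity scale of 20 and the toy data are all assumptions, and a real setup would compute gradients with an autograd framework. It only shows how one loss is summed over nested prefixes, and how many loss terms each dimension ends up in:&lt;/p&gt;

```python
import numpy as np

def info_nce(queries: np.ndarray, docs: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives loss: row i of `queries` should match row i of `docs`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = scale * (q @ d.T)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def mrl_loss(queries: np.ndarray, docs: np.ndarray, dims: list) -> float:
    """Sum the same loss over the first d dimensions, for every d in dims."""
    return sum(info_nce(queries[:, :d], docs[:, :d]) for d in dims)

rng = np.random.RandomState(0)
queries = rng.normal(size=(16, 256))
docs = queries + 0.1 * rng.normal(size=(16, 256))  # toy positives near their queries
dims = [32, 64, 128, 256]

total = mrl_loss(queries, docs, dims)

# Number of loss terms that each individual dimension takes part in.
terms_per_dimension = np.array([sum(d > i for d in dims) for i in range(256)])
print(terms_per_dimension[0], terms_per_dimension[255])  # 4 1
```

&lt;p&gt;The count at the end mirrors the second point in the list: the first 32 dimensions take part in all four loss terms, the last 128 in only one.&lt;/p&gt;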

&lt;p&gt;Note that MRL does not guarantee that dimensions are uncorrelated or have an expected cosine of 0. This needs to be a property of the &lt;em&gt;loss function&lt;/em&gt; to which MRL is applied. MRL merely guarantees that performance is maintained when the vector is truncated. In practice, static models trained with something like a cosine loss have an expected cosine of 0; this is a useful property to have, so the model should naturally learn and exploit it. Below, we’ll test whether this is actually the case.&lt;/p&gt;

&lt;h3 id=&quot;experiments&quot;&gt;Experiments&lt;/h3&gt;

&lt;p&gt;The MRL paper shows that PCA is worse than MRL; PCA performance degrades more rapidly than MRL performance when the dimensionality is decreased. There’s a possibility that this conclusion does not transfer to static models, as we’re not applying PCA to the output of a model, but to the model itself, something which is impossible for a regular model. So it could be that, for static models, the fact that the whole model can be optimized by PCA is still better than using MRL.&lt;/p&gt;

&lt;p&gt;There’s also the caveat that MRL requires a loss function to be optimized, although I think this is easily circumvented in practice. (&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;)&lt;/p&gt;

&lt;p&gt;To see what it all means, I trained two static models: one with MRL, and the other without MRL. They were trained using the recipe from &lt;a href=&quot;https://huggingface.co/blog/static-embeddings&quot;&gt;Tom Aarsen’s blog about static models&lt;/a&gt;, although I left out the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;paq&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s2orc&lt;/code&gt; datasets. TL;DR: it’s just supervised finetuning on a whole bunch of retrieval datasets using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MultipleNegativesRanking&lt;/code&gt; loss (also known as InfoNCE). The models were trained for 1 epoch using a very high learning rate of 0.2, 10% warmup and a linear cooldown. I experimented with other configurations, but most of them had no effect.&lt;/p&gt;

&lt;p&gt;I then evaluated both models on &lt;a href=&quot;https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6&quot;&gt;NanoBEIR&lt;/a&gt;. The results in the table below are the mean NDCG@10 over all datasets.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Dim&lt;/th&gt;
      &lt;th&gt;MRL&lt;/th&gt;
      &lt;th&gt;no MRL&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;32.52&lt;/td&gt;
      &lt;td&gt;25.93&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;39.71&lt;/td&gt;
      &lt;td&gt;34.51&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;128&lt;/td&gt;
      &lt;td&gt;45.20&lt;/td&gt;
      &lt;td&gt;42.36&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;256&lt;/td&gt;
      &lt;td&gt;48.10&lt;/td&gt;
      &lt;td&gt;47.20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;512&lt;/td&gt;
      &lt;td&gt;49.49&lt;/td&gt;
      &lt;td&gt;49.63&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1024&lt;/td&gt;
      &lt;td&gt;50.30&lt;/td&gt;
      &lt;td&gt;50.56&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As you can see, there is not really a big downside to using MRL. The scores at the highest dimensionalities (512 and 1024) are a bit lower, but this is a discrepancy I think will disappear. For lower dimensionalities, MRL is much better than not doing MRL, leading to a gain of roughly 6.6 points at 32 dimensions.&lt;/p&gt;

&lt;p&gt;Now, let’s apply PCA to both of them:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Dim&lt;/th&gt;
      &lt;th&gt;MRL + PCA&lt;/th&gt;
      &lt;th&gt;no MRL + PCA&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;32.41 (-0.10)&lt;/td&gt;
      &lt;td&gt;26.34 (+0.40)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;39.65 (-0.10)&lt;/td&gt;
      &lt;td&gt;34.60 (+0.10)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;128&lt;/td&gt;
      &lt;td&gt;44.95 (-0.30)&lt;/td&gt;
      &lt;td&gt;42.24 (-0.10)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;256&lt;/td&gt;
      &lt;td&gt;48.04 (-0.10)&lt;/td&gt;
      &lt;td&gt;46.95 (-0.30)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;512&lt;/td&gt;
      &lt;td&gt;49.52 (+0.00)&lt;/td&gt;
      &lt;td&gt;49.34 (-0.30)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1024&lt;/td&gt;
      &lt;td&gt;50.24 (-0.10)&lt;/td&gt;
      &lt;td&gt;50.59 (+0.00)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, surprisingly, applying PCA after training does not really add any performance, even for really low dimensions. What is also surprising is that this holds regardless of whether we trained with or without MRL; I expected the non-MRL model to benefit from PCA.&lt;/p&gt;

&lt;p&gt;Numerically, training with MRL is better at every dimensionality up to 256, and within 0.3 points of the non-MRL model at 512 and 1024. Applying PCA after training is not useful. Unfortunately for me, PCA is clearly outperformed by MRL. In short, &lt;em&gt;“Friendship ended with PCA, MRL is now my friend”&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;discussion&quot;&gt;Discussion&lt;/h3&gt;

&lt;p&gt;Now, we need to square why we saw such large improvements when we applied PCA in model2vec, while not seeing any improvements here. Recall that the main reason for performance improvements seems to be that PCA transforms the vectors to an orthogonal basis, which has an expected mean cosine of 0; embedding spaces that do not have this property are worse for static models.&lt;/p&gt;

&lt;p&gt;As it turns out, models directly optimized through gradient descent already have this property. For both the MRL and non-MRL models above, the expected cosine similarity between the embeddings approaches 0. So the renormalizing effect of PCA had no additional impact, and hence applying PCA has no use beyond actually reducing dimensionality for models not trained with MRL.&lt;/p&gt;

&lt;h3 id=&quot;conclusion--future-work&quot;&gt;Conclusion &amp;amp; Future work&lt;/h3&gt;

&lt;p&gt;I think this opens up some interesting areas for improvement in static model initializers, such as model2vec. For example, you could just initialize the model randomly, and then train a small auto-encoder with MRL to get the scale-free behavior displayed by MRL. Whether this works better in practice than PCA remains to be seen.&lt;/p&gt;

&lt;p&gt;Practical advice:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;PCA helps when embeddings aren’t zero-mean and you’re not doing any training&lt;/li&gt;
  &lt;li&gt;MRL learns truncation robustness directly&lt;/li&gt;
  &lt;li&gt;combining them doesn’t help&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h3&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;An apologist is not someone who apologizes, but I guess they do sometimes apologize. Sorry if this was confusing. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:2:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;A 1-layer auto-encoder with a linear activation function learns the same subspace as PCA, albeit without the ordered dimensions property of PCA. As such, we can easily rewrite a “PCA” using MRL by training a reconstruction loss with MRL, which would give it the ordered dimensions property. I think there’s a lot of interesting low-hanging fruit here, because you can easily modify the loss of the auto-encoder, make the network deeper, etc. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Mon, 06 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://stephantul.github.io/blog/mrl-pca/</link>
        <guid isPermaLink="true">https://stephantul.github.io/blog/mrl-pca/</guid>
        
        
        <category>static models</category>
        
      </item>
    
      <item>
        <title>Static late interaction models</title>
        <description>&lt;p&gt;Late interaction is an interesting paradigm for computing the similarity between two documents, and can be seen as a hybrid of sparse and dense retrieval. In this post, I will show how &lt;a href=&quot;https://huggingface.co/blog/static-embeddings&quot;&gt;static models&lt;/a&gt; in a late interaction setting actually reduce to sparse models. I will also argue that, in the absence of empirical evidence to the contrary, there’s no good reason to assume that static late interaction models will be much better than their dense counterparts. But first, let’s dive into some fundamentals: I’ll explain what sparse retrieval and dense retrieval are, and how late interaction fits in with both of those paradigms.&lt;/p&gt;

&lt;h3 id=&quot;sparse&quot;&gt;Sparse&lt;/h3&gt;

&lt;p&gt;Sparse retrieval assigns each individual token one or more coefficients, and only scores tokens when they are present in the document. For example, a query like “pet stores” will only return documents that contain those terms. In other words, sparse retrieval does not retrieve semantically related words; it only indexes documents based on the terms that are actually present. Examples of sparse retrieval techniques are &lt;a href=&quot;https://arxiv.org/abs/2107.05720&quot;&gt;SPLADE&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2104.12016&quot;&gt;DeepImpact&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Okapi_BM25&quot;&gt;BM25&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2106.14807&quot;&gt;uniCOIL&lt;/a&gt;. Sparse retrieval tends to be precise because it matches terms exactly, but for the same reason also has trouble bridging the gap between semantically related terms. (&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;) In general, sparse retrievers don’t do well if there’s little to no lexical overlap between queries and documents, or if there’s lots of semantic ambiguity.&lt;/p&gt;

&lt;h3 id=&quot;dense&quot;&gt;Dense&lt;/h3&gt;

&lt;p&gt;In contrast, in dense retrieval, we assign a single vector to a whole document, and just compute dot-product based similarities between vectors to find the most similar ones. Putting everything in a single vector means that we mix up all words in the documents, thus allowing for queries with no lexical overlap to still retrieve relevant documents. The downside of this is that things can get too mixed up; a vector can implicitly model many related things, and it is difficult to predict a priori which vectors will match and why. In a way, it is naïve to expect a single vector to fully express the semantics of a document of arbitrary length.&lt;/p&gt;

&lt;h3 id=&quot;late-interaction&quot;&gt;Late interaction&lt;/h3&gt;

&lt;p&gt;Which brings us to late interaction. In late interaction, we use a dense model (&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;) to create one vector for &lt;em&gt;each token in the document&lt;/em&gt;. If this sounds excessive, don’t worry, there are many tricks to alleviate the burden of storing and retrieving this many vectors. (&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;) At query time, instead of calculating the dot product between vectors, we calculate the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxsim&lt;/code&gt; similarity. For a given query &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Q&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;m&lt;/code&gt; tokens and document &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;n&lt;/code&gt; tokens, the similarity is as follows:&lt;/p&gt;

\[s(Q,D)
= \sum_{i=1}^{m} \max_{1 \le j \le n} \;\Big\langle \widehat{\mathbf q}_i,\;\widehat{\mathbf d}_j \Big\rangle\]

&lt;p&gt;So, for each query token, we first calculate the similarity (&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;) to each document token, and then take the max of those similarities. The sum over all of the query tokens is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxsim&lt;/code&gt; score.&lt;/p&gt;
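
&lt;p&gt;In code, the equation above is only a few lines. This is a minimal sketch rather than how any particular late interaction library implements it: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;normalize&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxsim&lt;/code&gt; helpers and the random token vectors are assumptions for illustration, and cosine similarity is used as the inner product of the normalized vectors:&lt;/p&gt;

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale every row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def maxsim(query_vectors: np.ndarray, doc_vectors: np.ndarray) -> float:
    """Sum over query tokens of the max cosine similarity to any document token."""
    sims = normalize(query_vectors) @ normalize(doc_vectors).T  # shape (m, n)
    return float(sims.max(axis=1).sum())

rng = np.random.RandomState(0)
query_vectors = rng.normal(size=(3, 16))   # m = 3 query tokens
doc_vectors = rng.normal(size=(10, 16))    # n = 10 document tokens
score = maxsim(query_vectors, doc_vectors)
print(score)
```

&lt;p&gt;Because every term is a cosine, the score is bounded by the number of query tokens, and a query made of tokens that literally occur in the document scores exactly that maximum.&lt;/p&gt;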

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxsim&lt;/code&gt; allows late interaction models to attach scores to specific tokens, like sparse retrieval models, but also allows for a graded similarity between related tokens, like dense models (and unlike sparse models). As such, we can think of late interaction models as a hybrid between dense and sparse models. There’s many other aspects to dig into, which I won’t cover here, so please read &lt;a href=&quot;https://medium.com/@varun030403/colbert-a-complete-guide-1552468335ae&quot;&gt;one&lt;/a&gt; &lt;a href=&quot;https://jina.ai/news/jina-colbert-v2-multilingual-late-interaction-retriever-for-embedding-and-reranking/&quot;&gt;of&lt;/a&gt; &lt;a href=&quot;https://weaviate.io/blog/late-interaction-overview&quot;&gt;the&lt;/a&gt; &lt;a href=&quot;https://qdrant.tech/articles/late-interaction-models/&quot;&gt;many&lt;/a&gt; &lt;a href=&quot;https://www.answer.ai/posts/colbert-pooling.html&quot;&gt;good&lt;/a&gt; posts on the subject.&lt;/p&gt;

&lt;h3 id=&quot;static-models&quot;&gt;Static models&lt;/h3&gt;

&lt;p&gt;The reason late interaction models work is not just because of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxsim&lt;/code&gt;, but also because the underlying models are &lt;em&gt;trained&lt;/em&gt; to maximize the similarity between a query token and a related document token. These models are &lt;em&gt;contextualized&lt;/em&gt;, which means that the model produces different vectors for tokens in different contexts. Static models, on the other hand, always produce the same vector for each token, regardless of the context. This makes static vectors worse, but also much faster. Why and when this is useful is the topic of an upcoming post, but for now let’s assume this is a useful property.&lt;/p&gt;

&lt;h3 id=&quot;static-late-interaction&quot;&gt;Static late interaction&lt;/h3&gt;

&lt;p&gt;Now, I will argue that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxsim&lt;/code&gt;, when applied to a static model, implicitly leads to a sparse model. First, recall that, in a static model, every occurrence of a token always gets the same vector. This also implies that the similarity between two tokens is always exactly the same: if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dog&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cat&lt;/code&gt; each always get their own fixed vector, then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sim(dog, cat)&lt;/code&gt; is always the same value. So, this gives us a nice optimization: we can precompute all possible similarities. For a vocabulary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;V&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; tokens, this leads to a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t x t&lt;/code&gt;-sized matrix, which we call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt;. Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt; is very big! For a vocabulary size of 30k, this already is a matrix with 900 million entries. In practice we can easily make this matrix extremely sparse by pruning any items below a certain threshold. (&lt;sup id=&quot;fnref:5&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;)&lt;/p&gt;
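
&lt;p&gt;As a rough sketch of this precomputation: the snippet below builds &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt; for a toy static model and prunes it. The random embedding table, the vocabulary size of 1000 and the threshold of 0.3 are assumptions for illustration; a real setup would use trained embeddings and most likely a proper sparse matrix format:&lt;/p&gt;

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, dim = 1000, 64   # toy sizes; a real vocabulary is far larger
embeddings = rng.normal(size=(vocab_size, dim))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# W[a, b] is the cosine similarity between token a and token b.
W = embeddings @ embeddings.T

# Prune: zero out every entry below a (hypothetical) threshold.
threshold = 0.3
W_pruned = np.where(W >= threshold, W, 0.0)

# Fraction of entries that survive pruning.
density = np.count_nonzero(W_pruned) / W_pruned.size
print(W.shape, density)
```

&lt;p&gt;The diagonal always survives, since every token has similarity 1 to itself; how many other entries survive depends on the embeddings and on the threshold.&lt;/p&gt;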

&lt;p&gt;Now, given &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxsim&lt;/code&gt; reduces to:&lt;/p&gt;

\[s(Q,D)
= \sum_{i=1}^{m} \max_{1 \le j \le n} \;W_{Q_iD_j}\]

&lt;p&gt;This formulation means that we only need to store token indices and compute query indices to get the same result as we would have gotten when storing all vectors and computing vectors at query time. (&lt;sup id=&quot;fnref:6&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;) We also still need to store &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt;, however. In addition, it is also unclear whether this is actually efficient.&lt;/p&gt;

&lt;p&gt;Fortunately for us there’s yet another shortcut: for the tokens in document &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D&lt;/code&gt;, we can select the corresponding &lt;em&gt;columns&lt;/em&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt;, and take the element-wise &lt;em&gt;max&lt;/em&gt; over them. This leads to a single vector with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; entries, which we call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Y&lt;/code&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Y&lt;/code&gt; contains, for each possible query token, the precomputed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max&lt;/code&gt; similarity to any token in the document. So, if we do this, the only thing we need to do at query time is index this vector using the query tokens, and take the sum. Because the vectors are pretty sparse, and the sparsity is controllable, this leads to a small memory footprint, and small query-time compute. Here’s the equation:&lt;/p&gt;

\[s(Q,Y)
= \sum_{i=1}^{m} Y_{Q_i}\]

&lt;p&gt;To repeat: during query time, the only thing we do is index and sum. The index consists of a single document-term matrix, with the number of rows equal to the number of documents, and the number of columns equal to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;V&lt;/code&gt;, the vocabulary size.&lt;/p&gt;
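
&lt;p&gt;To make the shape of this index concrete, here is a minimal sketch with random vectors and a toy corpus. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docs&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Y&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;score_all&lt;/code&gt; are made up for illustration, and the matrix is kept dense for simplicity; a real index would presumably use a sparse format.&lt;/p&gt;

```python
import numpy as np

# Toy setup (assumption): random unit vectors stand in for a static model.
rng = np.random.default_rng(0)
vocab_size, dim = 50, 8
vecs = rng.standard_normal((vocab_size, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
W = vecs @ vecs.T  # (V, V) token-to-token cosine similarities

# Three toy documents, given as lists of token indices.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Document-term matrix: one V-sized row per document, holding the
# max similarity from any token in the document to each vocabulary token.
Y = np.stack([W[doc].max(axis=0) for doc in docs])  # (n_docs, V)


def score_all(query: list[int], Y: np.ndarray) -> np.ndarray:
    """Score every document at once: index the query columns, then sum."""
    return Y[:, query].sum(axis=1)


query = [10, 11, 2]
scores = score_all(query, Y)
print(scores.shape)  # one score per document
```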

&lt;p&gt;One question this raises is whether, for a decently-sized corpus, this document-term matrix is actually smaller than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt;. The answer is: no, except for really small numbers of documents. This is because a document vector is the max over a lot of tokens, and therefore tends to have a lot of non-zero coefficients. So, in practice, if space is an issue, it might actually be better to still use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt;. If speed is a concern, it might be better to bite the bullet, and store the extra coefficients.&lt;/p&gt;

&lt;h3 id=&quot;sparsity&quot;&gt;Sparsity&lt;/h3&gt;

&lt;p&gt;The older people among you will point to this and say: this is just a sparse index, but with soft weights on related terms! (&lt;sup id=&quot;fnref:7&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;) And you would be right! In fact, if we set the similarity threshold on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt; to 1.0 we get a very bad version of BM25. (&lt;sup id=&quot;fnref:8&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;) Note that this behavior does not appear because we perform some magic trick or manipulation: it is inherent to the way &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxsim&lt;/code&gt; works. So even if you compute &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxsim&lt;/code&gt; as in the original equation, you will get this BM25-like behavior.&lt;/p&gt;
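
&lt;p&gt;A small sketch of the threshold case, under the assumption that no two distinct tokens share the same vector: with a threshold of 1.0, only the diagonal of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W&lt;/code&gt; survives, so the score is simply the number of query tokens that occur in the document. The random vectors and the name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;W_hard&lt;/code&gt; are illustrative only.&lt;/p&gt;

```python
import numpy as np

# Toy setup (assumption): random unit vectors, so distinct tokens
# practically never have a cosine similarity of exactly 1.0.
rng = np.random.default_rng(0)
vecs = rng.standard_normal((100, 16))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
W = vecs @ vecs.T

# Threshold at 1.0: keep only (numerically) exact matches.
W_hard = np.where(np.isclose(W, 1.0), 1.0, 0.0)


def maxsim_w(q: list[int], doc: list[int], W: np.ndarray) -> float:
    """maxsim through W: max over document tokens, sum over query tokens."""
    return float(W[doc][:, q].max(axis=0).sum())


doc = [1, 2, 3, 4, 5]
query = [3, 5, 42]

score = maxsim_w(query, doc, W_hard)
overlap = sum(tok in doc for tok in query)
print(score, overlap)  # both count the query tokens found in the document
```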

&lt;p&gt;This explains why I think that just computing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxsim&lt;/code&gt; with a static model as a regular late interaction model will never work well: BM25 contains a lot of cool tricks to make sure retrieval works well, including different weighting schemes for queries and documents, a length bias, and two tunable parameters. These are all missing from this algorithm. An interesting task, then, could be to re-add these terms: &lt;a href=&quot;https://github.com/MinishLab/model2vec&quot;&gt;model2vec&lt;/a&gt;, for example, adds weighting by inflating and shrinking the norms by token frequency. These weighting terms can be re-added on the query tokens, or put in the index. Similarly (&lt;sup id=&quot;fnref:9&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;), the length bias in BM25 can also be integrated into this formulation.&lt;/p&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;This is all preliminary theoretical work, but it could be very promising. One thing that is specifically interesting is &lt;em&gt;asymmetric&lt;/em&gt; static models, i.e., using different static models to encode queries and documents, which is something I am actively working on. It is currently unclear whether training static models as late interaction models is actually useful. I have trained some static models using &lt;a href=&quot;https://github.com/lightonai/pylate&quot;&gt;PyLate&lt;/a&gt;, but this did not lead to good results; training them as regular dense retrievers works much better. More research is needed, as always. Feel free to reach out if you have ideas, I’m always open to talk.&lt;/p&gt;

&lt;h3 id=&quot;acknowledgments&quot;&gt;Acknowledgments&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Thanks &lt;a href=&quot;https://x.com/drexalt&quot;&gt;jonah&lt;/a&gt; for proofreading and helpful suggestions about SPLADE.&lt;/li&gt;
  &lt;li&gt;Thanks &lt;a href=&quot;https://x.com/bclavie&quot;&gt;Ben&lt;/a&gt; for suggesting blogs to link to.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;appendix-code-sample&quot;&gt;Appendix: code sample&lt;/h3&gt;

&lt;p&gt;Here’s some code showing the methods are equivalent. We don’t precompute the document representations, but in the last function you could just do that.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.metrics.pairwise&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosine_similarity&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;maxsim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vecs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Compute the maxsim&quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;q_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vecs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;d_x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vecs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;sim&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosine_similarity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d_x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# q, d matrix
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;maxes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# q vector
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;maxsim_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Compute the maxsim with W.&quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;vectors&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# d, V matrix
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;indexed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vectors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# d, q matrix
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indexed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;maxsim_doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Last step, precompute the documents.&quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;q&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RandomState&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;42&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;vectors&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;W&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosine_similarity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vectors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Document
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;doc_w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxsim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vectors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxsim_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxsim_doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc_w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;isclose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;isclose&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h3&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This is typically alleviated through query expansion techniques. SPLADE is also notable in that it automatically performs query/term expansion within the model, in addition to scoring terms that are present. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;These dense models are specifically trained to be late interaction models, but their cores are just pre-trained transformers, like the ones we use for dense retrieval. For training details, see &lt;a href=&quot;https://arxiv.org/abs/2004.12832&quot;&gt;the colbert paper&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2112.01488&quot;&gt;the colbertv2 paper&lt;/a&gt;. You can use &lt;a href=&quot;https://github.com/lightonai/pylate&quot;&gt;PyLate&lt;/a&gt; to train, it’s easy! &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Examples of this include &lt;a href=&quot;https://arxiv.org/abs/2405.19504&quot;&gt;MuVERA&lt;/a&gt;, &lt;a href=&quot;https://www.lighton.ai/lighton-blogs/fastplaid&quot;&gt;FastPLAID&lt;/a&gt;, &lt;a href=&quot;https://www.mixedbread.com/blog/maxsim-cpu&quot;&gt;maxsim-cpu&lt;/a&gt; and probably many others. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In the equation we use the dot product similarity, but the cosine similarity can also be used. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This pruning is empirically justified: because of the maxsim, it is unlikely that tokens with very low similarities ever get selected. And even if they do get selected, it is unlikely that this will lead to a meaningful difference in selected documents. The proof of the pudding is in the eating, however. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;In practice, ~90% of time is spent on tokenization, so this isn’t the big win it seems. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;If you are young and noticed this: good job buddy! &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Bad because it does not have any of the things, i.e., length, query weights, IDF, that make BM25 actually good. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;haha &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Tue, 30 Sep 2025 00:00:00 +0000</pubDate>
        <link>https://stephantul.github.io/blog/static-colbert/</link>
        <guid isPermaLink="true">https://stephantul.github.io/blog/static-colbert/</guid>
        
        
        <category>static models</category>
        
      </item>
    
      <item>
        <title>Better Greedy Tokenizers: Handling WordPiece&apos;s [UNK] Problem</title>
        <description>&lt;p&gt;In &lt;a href=&quot;https://stephantul.github.io/blog/greedy/&quot;&gt;a previous post&lt;/a&gt;, I showed that making a tokenizer greedy, that is, always picking the longest matching subword like WordPiece does, can improve results without retraining. But &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; can unfortunately silently break your tokenization.&lt;/p&gt;

&lt;p&gt;Consider this example:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tokenizers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tokenizer&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;bert-base-uncased&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;talk&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10_000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# [&apos;[CLS]&apos;, &apos;[UNK]&apos;, &apos;[SEP]&apos;]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Instead of producing many repetitions of talk (or something else), the tokenizer outputs a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[UNK]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This happens because WordPiece enforces a hard limit on the length of each run (the contiguous string passed to it after pretokenization). The length is determined by a parameter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_input_chars_per_word&lt;/code&gt;, which is set to 100 by default. As the name suggests, this parameter puts a maximum on the number of characters any pretoken going into the model has. Once you go over this limit, the model doesn’t crash, but silently produces an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[UNK]&lt;/code&gt; token.&lt;/p&gt;

&lt;p&gt;What this means is that if you are used to BPE, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; could often get you &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[UNK]&lt;/code&gt;. In practice, this makes it difficult to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; with highly multilingual collections, because it becomes much more probable to get long runs. In addition, it also makes it impossible to create a multi-word &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; tokenizer.&lt;/p&gt;

&lt;h1 id=&quot;why-does-this-parameter-exist&quot;&gt;Why does this parameter exist?&lt;/h1&gt;

&lt;p&gt;The reason lies in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; inference algorithm itself. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; algorithm, as implemented in the Hugging Face tokenizers package, is as follows:&lt;/p&gt;

&lt;p&gt;For a given input string and a vocabulary of subwords, do the following:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Initialize two pointers, one at the start of the string, S, and one at the end of the string, E.&lt;/li&gt;
  &lt;li&gt;Check whether the run from S to E forms a valid token. If it does not, decrement E by 1 and check again.&lt;/li&gt;
  &lt;li&gt;Once you find a valid token, increment S by the length of the token you found, reset E to the end of the string, and repeat until S reaches the end.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you ever don’t find a token, you just emit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[UNK]&lt;/code&gt; for the whole run. As you can probably see, this algorithm is quadratic as a function of input length. For every subword you find, you will skip to the end of the run and walk back to the near-start. There’s lots of low-hanging fruit to make this more efficient, but that is not what this post is about.&lt;/p&gt;
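
&lt;p&gt;The steps above can be sketched in plain Python. This is a toy re-implementation for illustration, not the actual Hugging Face code; the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wordpiece_greedy&lt;/code&gt; and the tiny vocabulary are made up, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;##&lt;/code&gt; continuation prefix follows the BERT convention.&lt;/p&gt;

```python
def wordpiece_greedy(
    run: str,
    vocab: set[str],
    max_input_chars_per_word: int = 100,
    prefix: str = "##",
    unk: str = "[UNK]",
) -> list[str]:
    """Toy greedy longest-match tokenization of a single run."""
    # Runs over the limit are not tokenized at all.
    if len(run) > max_input_chars_per_word:
        return [unk]

    tokens = []
    start = 0
    while start != len(run):
        end = len(run)  # E always restarts at the end of the run
        match = None
        while end != start:
            piece = run[start:end]
            if start > 0:
                piece = prefix + piece  # non-initial pieces get the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # walk E back towards S
        if match is None:
            return [unk]  # no piece fits: the whole run becomes [UNK]
        tokens.append(match)
        start = end  # move S past the token we just found
    return tokens


vocab = {"talk", "##talk"}
print(wordpiece_greedy("talktalk", vocab))
print(wordpiece_greedy("talk" * 30, vocab))  # 120 characters, over the limit
```

&lt;p&gt;The nested loops are where the quadratic behavior comes from: every found token restarts the inner scan from the end of the run.&lt;/p&gt;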

&lt;p&gt;So, this also hopefully makes clear why &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_input_chars_per_word&lt;/code&gt; exists: when encountering a single run of, say, 100k characters, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; inference algorithm could conceivably take hours. For example, on my machine, encoding a 500 character string takes 3.66ms (0.007 ms per character), a 5000 character string takes 638ms (0.12 ms per character, a 17x increase), while encoding a 50000 character string takes … too long(&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;). It would be really silly to wait for such a long time.(&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;)&lt;/p&gt;

&lt;h1 id=&quot;fixing-the-parameter&quot;&gt;Fixing the parameter&lt;/h1&gt;

&lt;p&gt;Since we are stuck with the parameter, we might as well make the best of it.
As it turns out, there is a very nice solution we can leverage within the Hugging Face ecosystem: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FixedLength&lt;/code&gt; pretokenizer. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FixedLength&lt;/code&gt; pretokenizer simply splits strings up into tokens of a pre-specified length.&lt;/p&gt;

&lt;p&gt;So picture this: you have a tokenizer you like, with a pretokenizer you like. But sometimes, due to the domain you find yourself operating on, you end up with a run that is longer than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_input_chars_per_word&lt;/code&gt;. Adding a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FixedLength&lt;/code&gt; pretokenizer to the pretokenizer you already had solves exactly this issue: pretokenization proceeds as it normally would, but any runs coming out of your previous pretokenizer that are too long are then split up into usable chunks. Problem solved. The only issue you could run into is that you miss tokens you otherwise could have found.&lt;/p&gt;
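
&lt;p&gt;What a fixed-length split does to a long run can be shown in a few lines of plain Python. This is only an illustration of the idea, not the Hugging Face implementation; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;split_fixed_length&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def split_fixed_length(run: str, length: int) -> list[str]:
    """Split a run into consecutive chunks of at most `length` characters."""
    return [run[i : i + length] for i in range(0, len(run), length)]


chunks = split_fixed_length("talk" * 5, 10)
print(chunks)
# The boundary falls in the middle of the third "talk":
# the first chunk ends in "ta", the second starts with "lk".
```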

&lt;h1 id=&quot;implementation-in-skeletoken&quot;&gt;Implementation in skeletoken&lt;/h1&gt;

&lt;p&gt;This is fully implemented in &lt;a href=&quot;https://github.com/stephantul/skeletoken&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;skeletoken&lt;/code&gt;&lt;/a&gt;. Let’s return to the example from the top of the article:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;skeletoken&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TokenizerModel&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TokenizerModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;bert-base-uncased&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make_model_greedy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Make fixedlength really low for demonstration purposes
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre_tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pretokenizers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;talk&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10_000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# [&apos;[CLS]&apos;, &apos;talk&apos;, &apos;##talk&apos;, &apos;##ta&apos;, &apos;l&apos;, &apos;##kt&apos;, &apos;##al&apos;, &apos;##kt&apos;, &apos;##al&apos;, &apos;##k&apos;]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, the third &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;talk&lt;/code&gt; is chopped up into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;##ta&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lk&lt;/code&gt;, because that’s where the pretokenizer boundaries fell. In practice though, this should almost never occur or matter.&lt;/p&gt;

&lt;p&gt;Note that this also makes it possible to use greedy tokenizers &lt;em&gt;without&lt;/em&gt; any form of pretokenization, and enables the use of multi-word units in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; tokenizers: because multi-word tokenizers generally don’t pretokenize at all, any sequence over 100 characters would otherwise produce an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[UNK]&lt;/code&gt;.&lt;/p&gt;
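
&lt;p&gt;To make the boundary effect in the example above concrete, here is a minimal pure-Python sketch of fixed-length chunking. The helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fixed_length_chunks&lt;/code&gt; is a hypothetical stand-in written for illustration, not the actual pretokenizer; it only assumes that the input is cut every 10 characters.&lt;/p&gt;

```python
def fixed_length_chunks(text, length):
    """Split text into consecutive pieces of at most `length` characters."""
    return [text[i : i + length] for i in range(0, len(text), length)]


# Hypothetical stand-in for a fixed-length pretokenizer with length=10.
chunks = fixed_length_chunks("talk" * 5, 10)
print(chunks)
# ['talktalkta', 'lktalktalk']
```

&lt;p&gt;The first chunk ends in the middle of the third &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;talk&lt;/code&gt;, so the next chunk starts with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lk&lt;/code&gt;, which is then segmented on its own.&lt;/p&gt;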

&lt;h1 id=&quot;future-work&quot;&gt;Future work&lt;/h1&gt;

&lt;p&gt;In a future post I’ll dive into how you can make a much faster greedy tokenizer by imposing specific restrictions on the tokenizer model, and then using the &lt;a href=&quot;https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm&quot;&gt;Aho-Corasick algorithm&lt;/a&gt; with backtracking to find subwords with much lower complexity.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;This actually took too long to run. Sorry! &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;It is equally silly to not implement a more efficient variant when such things exist. For example, moving the end pointer not to the end of the string, but forward by the maximum subword length would completely solve this issue and literally lead to the same solution. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Thu, 18 Sep 2025 00:00:00 +0000</pubDate>
        <link>https://stephantul.github.io/blog/better-greedy/</link>
        <guid isPermaLink="true">https://stephantul.github.io/blog/better-greedy/</guid>
        
        
        <category>tokenization</category>
        
      </item>
    
      <item>
        <title>Note: alternative to regex splitting in byte tokenizers</title>
        <description>&lt;p&gt;In a &lt;a href=&quot;https://stephantul.github.io/blog/note-byte/&quot;&gt;previous note&lt;/a&gt;, I discussed an alternative to setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;use_regex&lt;/code&gt; to True in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteLevel&lt;/code&gt; pretokenizer. I suggested using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteLevel&lt;/code&gt; &lt;em&gt;normalizer&lt;/em&gt; first, and then splitting using a complicated regex in “byte space”. However, this turned out not to work very well: certain character classes in the original regex, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\s&lt;/code&gt;, are very difficult to convert to a pattern in byte space.&lt;/p&gt;

&lt;p&gt;I wondered how others did this, and discovered that you can stack multiple pretokenizers by first using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Split&lt;/code&gt; pretokenizer with a regex, and then using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteLevel&lt;/code&gt; pretokenizer with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;use_regex&lt;/code&gt; set to False. This is, e.g., what &lt;a href=&quot;https://huggingface.co/Qwen/Qwen3-Embedding-0.6B&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen/Qwen3-Embedding-0.6B&lt;/code&gt;&lt;/a&gt; uses. Doing it this way is correct and achieves the goal of my original proposal: a way to split using a regex of your own design, with byte normalization.&lt;/p&gt;

&lt;p&gt;Here’s what that looks like:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tokenizers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Regex&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tokenizers.pre_tokenizers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ByteLevel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Sequence&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&apos;s|&apos;t|&apos;re|&apos;ve|&apos;m|&apos;ll|&apos;d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Regex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;behavior&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;isolated&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;byte&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ByteLevel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;use_regex&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add_prefix_space&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;pretokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Sequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;byte&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;original&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ByteLevel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;use_regex&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add_prefix_space&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;hello, ご　「きげんよう?」？”&quot;&lt;/span&gt; 

&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pretokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre_tokenize_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;original&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre_tokenize_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This allows you to freely change your regex without any difficulties. One thing to note is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_prefix_space&lt;/code&gt; needs to be unset for this to be totally equivalent. If not, you will need to add a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Prepend&lt;/code&gt; normalizer.&lt;/p&gt;
</description>
        <pubDate>Tue, 12 Aug 2025 00:00:00 +0000</pubDate>
        <link>https://stephantul.github.io/blog/note-2-byte/</link>
        <guid isPermaLink="true">https://stephantul.github.io/blog/note-2-byte/</guid>
        
        
        <category>tokenization</category>
        
      </item>
    
      <item>
        <title>Separate Normalization from Splitting in ByteLevel tokenizers</title>
        <description>&lt;p&gt;&lt;em&gt;This note is wrong! This was revealed to me by &lt;a href=&quot;https://x.com/sasuke___420/&quot;&gt;Sasuke___420&lt;/a&gt;. As it turns out, the regex does not work the same as the original one, specifically for non-ASCII spaces. Upon further reflection, I don’t think you should really use this.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a short note to dissuade you from using a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteLevel&lt;/code&gt; pretokenizer in your tokenizers. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteLevel&lt;/code&gt; pretokenizer, as implemented in &lt;a href=&quot;https://github.com/huggingface/tokenizers&quot;&gt;Hugging Face tokenizers&lt;/a&gt;, does three things:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Possibly inserts a space in front of your string (if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_prefix_space&lt;/code&gt; is True, which is the default)&lt;/li&gt;
  &lt;li&gt;Encodes your string into a byte encoding&lt;/li&gt;
  &lt;li&gt;Tokenizes using a regex that is specific to English (if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;use_regex&lt;/code&gt; is True, which is the default)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s an example:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tokenizers.pre_tokenizers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ByteLevel&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ByteLevel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre_tokenize_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hello, こんにちは&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# A list of three tokens.
# [(&apos;Ġhello&apos;, (0, 5)), (&apos;,&apos;, (5, 6)), (&apos;ĠãģĵãĤĵãģ«ãģ¡ãģ¯&apos;, (6, 12))]
# The tokenizer inserted a space before &quot;hello&quot;
# It converted to bytes
# And then split.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the tokenizers package, there’s a distinction between a &lt;em&gt;normalizer&lt;/em&gt; and a &lt;em&gt;pretokenizer&lt;/em&gt;. A &lt;em&gt;normalizer&lt;/em&gt; simply changes your string, but doesn’t split it. For example, if your tokenizer lowercases your input, you’ll use a &lt;a href=&quot;https://huggingface.co/docs/tokenizers/api/normalizers#tokenizers.normalizers.Lowercase&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Lowercase&lt;/code&gt;&lt;/a&gt; normalizer. A &lt;em&gt;pretokenizer&lt;/em&gt; splits your string into “words”, which can then get decomposed into actual tokens. A “word”, in this definition, is a unit whose boundaries no subword token can cross. For example, if your pretokenizer splits on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;-&quot;&lt;/code&gt;, the string &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;bench-maxx&quot;&lt;/code&gt; will be split into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[&quot;bench&quot;, &quot;-&quot;, &quot;maxx&quot;]&lt;/code&gt;. Even if your vocabulary contains a token like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;h-m&quot;&lt;/code&gt;, it will never be found.&lt;/p&gt;
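
&lt;p&gt;The hyphen example can be sketched in plain Python. This is only an illustration using the standard &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;re&lt;/code&gt; module, not the tokenizers package itself, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;split_on_hyphen&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import re


def split_on_hyphen(text):
    """Split on '-' and keep each hyphen as its own piece."""
    return [piece for piece in re.split(r"(-)", text) if piece]


words = split_on_hyphen("bench-maxx")
print(words)
# ['bench', '-', 'maxx']

# Subwords are only searched for inside a single piece, so a vocabulary
# entry that spans the boundary can never be matched.
found = any("h-m" in word for word in words)
print(found)
# False
```

&lt;p&gt;Because every later step only ever sees one piece at a time, the choice of pretokenizer limits which vocabulary entries can be used at all.&lt;/p&gt;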

&lt;p&gt;In this framework, it makes sense to express steps 1 and 2 above as normalizations, and decouple them from the splitting. This also makes sense from a multilingual point of view: the pretokenization regex used by the Hugging Face pretokenizer is outdated and only works for English. This regex is:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;s&quot;&gt;&quot;&apos;s|&apos;t|&apos;re|&apos;ve|&apos;m|&apos;ll|&apos;d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, it contains common contractions, which only work for English. In fact, applying this to other languages might destroy their tokenization.&lt;/p&gt;

&lt;p&gt;Luckily for us, Hugging Face tokenizers contains an equivalent transformation using normalizers and a regex splitter. Unfortunately for us, however, we need to change the regex above because otherwise it splits on various byte tokens.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tokenizers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Regex&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tokenizers.normalizers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ByteLevel&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ByteLevelNormalization&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Prepend&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Sequence&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tokenizers.pre_tokenizers&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ByteLevel&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;normalizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Sequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Prepend&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ByteLevelNormalization&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Change it to split only on ASCII punctuation
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&apos;s|&apos;t|&apos;re|&apos;ve|&apos;m|&apos;ll|&apos;d|Ġ?(?:[\p{L}&amp;amp;&amp;amp;[^Ġ]]|[\p{P}\p{S}&amp;amp;&amp;amp;[^\x00-\x7F]])+|Ġ?\p{N}+|Ġ?[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E]+&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;pretokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Regex&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;behavior&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;isolated&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ByteLevel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;hello, ごきげんよう?&quot;&lt;/span&gt; 

&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pretokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre_tokenize_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normalize_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pre_tokenize_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And that’s a wrap! You can now safely modify the regex defined above, split however you like, and it will work. One downside of this approach is that writing and interpreting a regex for bytes is quite difficult.&lt;/p&gt;
</description>
        <pubDate>Tue, 12 Aug 2025 00:00:00 +0000</pubDate>
        <link>https://stephantul.github.io/blog/note-byte/</link>
        <guid isPermaLink="true">https://stephantul.github.io/blog/note-byte/</guid>
        
        
        <category>tokenization</category>
        
      </item>
    
      <item>
        <title>Turning any tokenizer into a greedy one</title>
        <description>&lt;p&gt;I recently re-read &lt;a href=&quot;https://arxiv.org/abs/2403.01289&quot;&gt;Greed is All You Need: An Evaluation of Tokenizer Inference Methods&lt;/a&gt;. In this paper, the authors show that switching out inference methods for tokenizers can improve performance on various tasks.&lt;/p&gt;

&lt;p&gt;In this post, I talk about why this could be interesting, introduce an implementation to switch out inference methods for a HF tokenizer, and present the results of some experiments.&lt;/p&gt;

&lt;h1 id=&quot;preliminaries&quot;&gt;Preliminaries&lt;/h1&gt;

&lt;p&gt;A tokenizer is, simply put, a program that, given a vocabulary of tokens &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;V&lt;/code&gt;, can segment text into a sequence of tokens. These tokens are suitable for input into neural networks, because each token is actually just an index to an embedding table.&lt;/p&gt;

&lt;p&gt;Crucially, the vocabulary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;V&lt;/code&gt; can be automatically learned from a large corpus of text. There are many algorithms for doing so, but the most well-known are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UnigramLM&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Byte Pair Encoding (BPE)&lt;/code&gt;. I won’t dive into the details of those algorithms here. What is important is that these methods differ not only in what kind of vocabulary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;V&lt;/code&gt; they learn, but also in how they actually segment text. For example, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; algorithm just takes the longest possible prefix at any position (a greedy algorithm), while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BPE&lt;/code&gt;’s segmentation is governed by a separate merge table.&lt;/p&gt;
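
&lt;p&gt;As a rough illustration of the greedy strategy, here is a minimal longest-prefix-match sketch in plain Python. It is a simplification under stated assumptions: the vocabulary is a hypothetical toy set, there are no continuation prefixes such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;##&lt;/code&gt;, and it is not the implementation used by any particular library.&lt;/p&gt;

```python
def greedy_segment(word, vocab, unk="[UNK]"):
    """Repeatedly take the longest vocabulary entry that prefixes the rest of the word."""
    tokens = []
    start = 0
    while start != len(word):
        end = len(word)
        # Shrink the candidate from the right until it is in the vocabulary.
        while end != start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing in the vocabulary matches at this position.
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen only for this example.
toy_vocab = {"hell", "hello", "oo", "ooo", "phon", "phone", "en", "umber", "number"}

print(greedy_segment("hellooo", toy_vocab))
# ['hello', 'oo']
print(greedy_segment("phonenumber", toy_vocab))
# ['phone', 'number']
```

&lt;p&gt;Note that the inner loop starts at the end of the word, so the cost of each step grows with the length of the remaining input.&lt;/p&gt;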

&lt;h1 id=&quot;the-experiment&quot;&gt;The experiment&lt;/h1&gt;

&lt;p&gt;The main contribution by the aforementioned paper is showing that switching out the inference algorithm after training actually works well. That is, if you have a vocabulary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;V&lt;/code&gt; learned by a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BPE&lt;/code&gt; tokenizer, you can segment text using that same vocabulary and, e.g., the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; inference algorithm. This improves performance, especially when switching to a &lt;em&gt;greedy&lt;/em&gt; algorithm. This is not what I would have expected by the way, since you are changing the distributions.&lt;/p&gt;

&lt;p&gt;To see what this looks like, here are the standard and greedy segmentations for two phrases, using the &lt;a href=&quot;https://huggingface.co/answerdotai/ModernBERT-base&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ModernBERT&lt;/code&gt;&lt;/a&gt; tokenizer.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;string: &quot;hellooo phonenumber&quot;
normal: [&apos;hell&apos;, &apos;ooo&apos;, &apos;Ġphon&apos;, &apos;en&apos;, &apos;umber&apos;]
greedy: [&apos;hello&apos;, &apos;oo&apos;, &apos;Ġphone&apos;, &apos;number&apos;]

string: &quot; unilaterally&quot;
normal: [&apos;Ġun&apos;, &apos;il&apos;, &apos;aterally&apos;]
greedy: [&apos;Ġunilateral&apos;, &apos;ly&apos;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, the greedy tokenizer matches our intuitions about language much more closely: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hellooo&lt;/code&gt; is not related to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hell&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unilaterally&lt;/code&gt; does not use the prefix &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;un&lt;/code&gt; (it should be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uni&lt;/code&gt;). This is in line with what the authors of the aforementioned paper found: when examining performance on morphological tasks, switching to a greedy algorithm made performance go up.&lt;/p&gt;

&lt;h1 id=&quot;implementation&quot;&gt;Implementation&lt;/h1&gt;

&lt;p&gt;I implemented greedy tokenization by simply switching out the tokenizer model from whatever it was to a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WordPiece&lt;/code&gt; implementation. This is easy in my package &lt;a href=&quot;https://github.com/stephantul/tokenizer-datamodels&quot;&gt;tokenizer-datamodels&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tokenizerdatamodels&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TokenizerModel&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# This is a pydantic model.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;datamodel&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TokenizerModel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pretrained&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;answerdotai/ModernBERT-base&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# This is a HF tokenizer, you can just use it.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;greedy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datamodel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make_model_greedy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;greedy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hello phonenumber&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tokenizer-datamodels&lt;/code&gt;, as the name implies, is just a collection of models that can be used to parse and edit a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tokenizer.json&lt;/code&gt;, which is the Hugging Face tokenizers construct that contains all information about a tokenizer. It has many of these tiny features, and I’ll be adding more soon, so check it out if that sounds interesting.&lt;/p&gt;

&lt;h1 id=&quot;experiments&quot;&gt;Experiments&lt;/h1&gt;

&lt;p&gt;As mentioned above, greedy works well on intrinsic tasks. But does it actually improve performance on downstream tasks? To find out, I ran two models, &lt;a href=&quot;https://huggingface.co/intfloat/multilingual-e5-base&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;multilingual-e5-base&lt;/code&gt;&lt;/a&gt; and &lt;a href=&quot;https://huggingface.co/nomic-ai/modernbert-embed-base&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;modernbert-embed-base&lt;/code&gt;&lt;/a&gt; on &lt;a href=&quot;https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6&quot;&gt;NanoBEIR&lt;/a&gt;. This is very similar to the setup in &lt;a href=&quot;https://stephantul.github.io/blog/uncasing/&quot;&gt;my previous blog post about decasing&lt;/a&gt;.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;ModernBERT&lt;/th&gt;
      &lt;th&gt;e5&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Original&lt;/td&gt;
      &lt;td&gt;57.68&lt;/td&gt;
      &lt;td&gt;57.27&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Greedy&lt;/td&gt;
      &lt;td&gt;55.20&lt;/td&gt;
      &lt;td&gt;55.90&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, interestingly, switching to a greedy tokenizer hurts both models: ModernBERT drops by about 2.5 points and e5 by about 1.4, and the greedy variant is worse on every dataset in NanoBEIR for both models. While we could consider this to be in direct opposition to the results in the paper, I don’t think this is the case.&lt;/p&gt;

&lt;h1 id=&quot;discussion&quot;&gt;Discussion&lt;/h1&gt;

&lt;p&gt;To see why, recall that the results from the paper were based on the tokenization itself; no downstream models were trained. In these experiments, we instead change the tokenizer of a model &lt;em&gt;without retraining it&lt;/em&gt;. Now, to see why this is bad, we should realize that tokens don’t have any intrinsic meaning to a model; the model does not know that “hell” is not a good prefix for the word “hellooo”, and “hello” is a better one. To the model, these are just indices to an embedding table. So changing the segmentation of words can’t realistically help the model, because we are changing the underlying token distribution feeding into the model without telling the model about it.&lt;/p&gt;
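
&lt;p&gt;A tiny sketch of what the model actually receives may help here. The token ids below are hypothetical and chosen only for illustration; real ids depend on the tokenizer’s vocabulary.&lt;/p&gt;

```python
# Hypothetical token ids; real ids depend on the tokenizer's vocabulary.
token_to_id = {"hell": 0, "hello": 1, "oo": 2, "ooo": 3}

normal = ["hell", "ooo"]
greedy = ["hello", "oo"]

normal_ids = [token_to_id[token] for token in normal]
greedy_ids = [token_to_id[token] for token in greedy]

print(normal_ids)
# [0, 3]
print(greedy_ids)
# [1, 2]

# Same surface string, but the two inputs share no embedding rows.
print(set(normal_ids).isdisjoint(greedy_ids))
# True
```

&lt;p&gt;From the model’s side, the greedy input for this string looks like a different sequence entirely, even though a human reads both segmentations as the same word.&lt;/p&gt;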

&lt;p&gt;My hypothesis: &lt;em&gt;pre-training&lt;/em&gt; an encoder model with a greedy tokenizer leads to better results than training one with a regular BPE tokenizer. Having tokens that more closely follow morphology is probably good for model performance: the model has to learn fewer exceptions, and can rely more on surface form.&lt;/p&gt;

&lt;p&gt;Related questions: if you have trillions of tokens, is that still relevant? Aren’t all possible segmentations covered and memorizable? Even if following morphology is better, does it only impact training time, or is the resulting model actually better? Many interesting questions, and things I am eager to explore.&lt;/p&gt;
</description>
        <pubDate>Sun, 10 Aug 2025 00:00:00 +0000</pubDate>
        <link>https://stephantul.github.io/blog/greedy/</link>
        <guid isPermaLink="true">https://stephantul.github.io/blog/greedy/</guid>
        
        
        <category>tokenization</category>
        
      </item>
    
  </channel>
</rss>
