Static late interaction models
Late interaction is an interesting paradigm for computing the similarity between two documents, and can be seen as a hybrid of sparse and dense retrieval. In this post, I will show how static models in a late interaction setting actually reduce to sparse models. I will also argue that, in the absence of empirical evidence to the contrary, there’s no good reason to assume that static late interaction models will be much better than their dense counterparts. But first, let’s dive into some fundamentals: I’ll explain what sparse retrieval and dense retrieval are, and how late interaction fits in with both of those paradigms.
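To make “late interaction” concrete before the fundamentals: the scoring operation used by ColBERT-style late interaction models is MaxSim, which, for each query token embedding, takes the best-matching document token embedding and sums those maxima. A minimal pure-Python sketch:

```python
def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late interaction (MaxSim) score: for each query token embedding,
    take the maximum dot product over all document token embeddings,
    then sum those maxima."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy example: 2 query token vectors, 3 document token vectors.
score = maxsim([[1, 0], [0, 1]], [[1, 0], [0, 1], [1, 1]])  # 1 + 1 = 2
```

Unlike a single-vector dense model, nothing is pooled before scoring; the token-level interaction happens “late”, at query time.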
Better Greedy Tokenizers: Handling WordPiece's [UNK] Problem
In a previous post, I showed that making a tokenizer greedy, that is, always picking the longest matching subword like WordPiece does, can improve results without retraining. But WordPiece can unfortunately silently break your tokenization.
Note: alternative to regex splitting in byte tokenizers
In a previous note, I discussed an alternative for setting `split` to `true` in a `ByteLevel` pretokenizer. I suggested using a `ByteLevel` normalizer first, and then splitting using a complicated regex in “byte space”. However, this turned out not to work very well: there are certain character classes in the original regex, such as `\s`, that are very difficult to convert to a pattern in byte space.
Turning any tokenizer into a greedy one
I recently re-read Greed is All You Need: An Evaluation of Tokenizer Inference Methods. In this paper, the authors show that switching out inference methods for tokenizers can improve performance on various tasks.
Tokenizer decasing
In this post I will talk about something I call tokenizer *decasing*. Decasing is very similar to putting a lowercase normalizer in front of a tokenizer, but works better.
Using overload to handle tagged union return types
Here’s a function with an idiom I’ve seen a lot (probably copied from `sentence-transformers`).
Protocols to make untyped code behave
Working with external untyped code in a typed code base can be challenging: you’ll get lots of `Any` or `Unknown` types, which might propagate through your codebase. This can force you to reach for `typing.cast` calls or `# type: ignore` comments, which kind of defeats the purpose of using static typing in the first place.
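One way out, sketched minimally here with made-up names, is to describe only the slice of the untyped object you actually use as a `typing.Protocol`; structural typing then lets the checker verify call sites without casts or ignores:

```python
from typing import Protocol

class SupportsPredict(Protocol):
    """Structural type covering only the part of the untyped model we use."""
    def predict(self, inputs: list[float]) -> list[float]: ...

def run(model: SupportsPredict, inputs: list[float]) -> list[float]:
    # `model` can come from an untyped library; the checker only verifies
    # that it structurally matches SupportsPredict.
    return model.predict(inputs)

class ExternalModel:
    """Stand-in for a class from an untyped third-party package."""
    def predict(self, inputs: list[float]) -> list[float]:
        return [x * 2 for x in inputs]

run(ExternalModel(), [1.0, 2.0])
```

Because `Protocol` matching is structural, `ExternalModel` never has to inherit from or even know about `SupportsPredict`.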
Rethinking evaluation and relative performance
Here’s a pop quiz: classifier `A` scores 90% accuracy on some benchmark. Classifier `B` scores 80%. How much better is `A`?
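The answer depends on what you compare, which is presumably the point of the quiz. In accuracy terms `A` looks only modestly better, but in error terms `B` makes twice as many mistakes:

```python
acc_a, acc_b = 0.90, 0.80

# Relative accuracy: A looks only ~12.5% better.
accuracy_ratio = acc_a / acc_b  # ~1.125

# Relative error: B makes twice as many mistakes as A.
err_a, err_b = 1 - acc_a, 1 - acc_b
error_ratio = err_b / err_a  # ~2.0

# Relative error reduction going from B to A:
# A removes half of B's errors.
error_reduction = (err_b - err_a) / err_b  # ~0.5
```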