TypeVars and Unions in python
This post will be about `Union`s, `TypeVar`s, and unions of types with a relevant common subtype (i.e., not `object`). I’ll show how union types are often incorrectly used, and how using a `TypeVar` can solve some of these problems. So, having said that, let’s dive in!
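As a minimal sketch of the kind of problem meant here (the function names are hypothetical, not from the post): a `Union` return type loses the link between input and output types, while a `TypeVar` preserves it.

```python
from typing import TypeVar, Union

def first_bad(items: list[Union[int, str]]) -> Union[int, str]:
    # With a Union return type, callers always get back "int | str",
    # even when they passed a list containing only ints: the more
    # precise type information is lost.
    return items[0]

T = TypeVar("T", int, str)

def first_good(items: list[T]) -> T:
    # With a TypeVar, the return type is tied to the argument type:
    # passing list[int] yields int, passing list[str] yields str.
    return items[0]
```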
Enums with superclasses
In the previous post, I wrote about enumerations, and how they can be really handy when refactoring code. One thing I didn’t touch upon in that post is typed enumerations, which are enumerations that also have a type. As we saw in the previous post, an enumeration member is an association between a name and a value. But this means we need to call `.value` to get the actual value of an enumeration member. This can lead to overly verbose code. Take a logger for example:
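The original logger example is not shown in this excerpt; a minimal sketch of the verbosity in question, with a hypothetical `LogLevel` enum and `log` function, might look like this:

```python
from enum import Enum

class LogLevel(Enum):
    INFO = "INFO"
    ERROR = "ERROR"

def log(level: LogLevel, message: str) -> str:
    # Because LogLevel members are plain Enum members, we must call
    # .value everywhere we need the underlying string.
    return f"[{level.value}] {message}"
```

A typed enumeration (e.g. one that also inherits from `str`) removes the need for `.value`, since each member then *is* a string.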
Enums and refactoring
Enumerations are types that take a set of pre-defined options, called members, which are also assigned values. Usually, enumeration members are, as the name implies, simply mapped to integer values, but any arbitrary value works. Here’s an example of an enumeration in Python for colors:
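The example itself is not included in this excerpt; a minimal reconstruction of a color enumeration with integer values could look like this:

```python
from enum import Enum

class Color(Enum):
    # Each member maps a name to an (arbitrary) value; here, integers.
    RED = 1
    GREEN = 2
    BLUE = 3
```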
Correctly typing recursive hierarchies in Python
I recently tried to create a recursive type in Python using mypy. Recursive types naturally occur when processing nested collections of arbitrary depth, such as lists or dictionaries. For me, this most often happens when processing JSON data in the wild.
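As a sketch of the kind of recursive type meant here (the alias name and the helper function are illustrative, not taken from the post), a JSON value is either a scalar or a list/dict of further JSON values:

```python
from typing import Union

# A recursive type alias for arbitrarily nested JSON data. Plain
# self-referential aliases like this are exactly the shape of type
# that can trip up type checkers such as mypy.
JSON = Union[str, int, float, bool, None, list["JSON"], dict[str, "JSON"]]

def count_strings(data: JSON) -> int:
    # Walk a nested JSON structure of arbitrary depth and count
    # the string leaves.
    if isinstance(data, str):
        return 1
    if isinstance(data, list):
        return sum(count_strings(item) for item in data)
    if isinstance(data, dict):
        return sum(count_strings(value) for value in data.values())
    return 0
```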
Solving a ForwardRef error in pydantic
I recently ran into the following error when initializing a Pydantic `BaseModel`:
The burden of proof for code reviews
This post is about code reviews, and how they can go wrong. A code review is an activity where another team member, who may or may not have been involved in the project or the code, looks at the code, and tries to spot errors or inconsistencies.
Turning a byte pair encoded string back to its surface form
The huggingface tokenizers library can be used to train many varieties of sub-word tokenizers. In short, a sub-word tokenizer is a tokenizer that learns to split up strings into frequently occurring pieces. Ideally, a sub-word tokenizer is exhaustive, which means that it can split up any string, even strings containing sequences it has never seen before, into sub-word tokens. A truly exhaustive sub-word tokenizer is useful because it will never encounter an `<UNK>` symbol, i.e., a thing it doesn’t know what to do with. Reaching this state is difficult when tokenizing on the character level, however, as there are tens of thousands of unique unicode characters, and it is undesirable to give all of these characters separate tokens in the vocabulary.
Fast way to find the rank of an item in pytorch
I recently read an interesting paper by Gehrmann et al. in which the rank of the predictions of a language model is used as a feature vector to distinguish machine-generated from regular text. In implementing this method in pytorch, I ran into an interesting problem that I solved in a really slow way, and subsequently made faster. This blog post shows you how not to do it, and how you can make it faster.
Using itertools.product with dictionaries
In a previous post, I talked about using `itertools.product` with lists. In this post, I used a typical ML experiment as an example, and made a comparison with sklearn’s `GridSearchCV`. It occurred to me that `GridSearchCV` uses dictionaries, while my example only used lists, so in this post I will show you how to build a dictionary iterator using `product`.
Using itertools.product instead of nested for loops
In many programming situations, you will have to compare each item in a list to every other item in the same list, creating the well-known nested for loop.
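A minimal sketch of the equivalence, assuming a small example list: the nested for loop and `itertools.product` yield the same pairs, but the latter keeps the code flat.

```python
from itertools import product

items = ["a", "b", "c"]

# The classic nested for loop over all pairs:
pairs_nested = []
for x in items:
    for y in items:
        pairs_nested.append((x, y))

# The same pairs with a single call to itertools.product:
pairs_product = list(product(items, repeat=2))

assert pairs_nested == pairs_product
```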