Correctly typing cached functions
Caching, or memoization, is a useful way to speed up repeated calls to expensive, pure functions. When calling a function, we save the output, using the parameters of the function as the cache key. Then, instead of recalculating the result on each call, we simply return the value that was stored in the cache.
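As a rough illustration of the idea, here is a minimal sketch of a typed cache decorator. The names `memoize`, `slow_square`, and `calls` are hypothetical and only serve the example, and the sketch assumes all arguments are positional and hashable. It is not necessarily the approach the post itself takes.

```python
import functools
from typing import Callable, Dict, Hashable, List, Tuple, TypeVar

T = TypeVar("T")


def memoize(func: Callable[..., T]) -> Callable[..., T]:
    """Cache results of a pure function, keyed by its positional arguments."""
    cache: Dict[Tuple[Hashable, ...], T] = {}

    @functools.wraps(func)
    def wrapper(*args: Hashable) -> T:
        # Only compute the result the first time we see these arguments.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


calls: List[int] = []  # records every real (uncached) invocation


@memoize
def slow_square(x: int) -> int:
    calls.append(x)
    return x * x


slow_square(4)
slow_square(4)  # served from the cache, so `calls` still has one entry
```

Note that `Callable[..., T]` keeps the return type but discards the parameter types, which is one reason typing cached functions correctly takes some care.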
NewType in Python
New week, new post! This post is about NewType, an underused construct in Python, in my opinion, and a good way to show the difference between typing time and runtime.
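To make the typing time versus runtime distinction concrete, here is a small sketch. `UserId` and `lookup_name` are hypothetical names chosen for illustration.

```python
from typing import NewType

# To a type checker, UserId is a distinct type derived from int.
UserId = NewType("UserId", int)


def lookup_name(user_id: UserId) -> str:
    return f"user-{user_id}"


uid = UserId(42)

# At runtime, calling UserId simply returns its argument unchanged,
# so `uid` is a plain int.
print(type(uid) is int)
print(lookup_name(uid))

# A type checker would reject lookup_name(42), because a bare int is
# not a UserId, even though the call would run without error.
```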
TypeVars and Unions in Python
This post will be about Unions, TypeVars, and unions of types with a relevant common supertype (i.e., not object). I’ll show how union types are often incorrectly used, and how using a TypeVar can solve some of these problems. So, having said that, let’s dive in!
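One common case can be sketched as follows. The functions `double_union` and `double_typevar` and the type variable `Number` are hypothetical and are not taken from the post. With a Union in both the parameter and the return type, the checker no longer knows that the output type matches the input type. A constrained TypeVar keeps that link.

```python
from typing import TypeVar, Union


def double_union(x: Union[int, float]) -> Union[int, float]:
    # The checker only knows the result is "int or float",
    # even when the caller passed an int.
    return x * 2


# A TypeVar constrained to int or float ties the return type to the input.
Number = TypeVar("Number", int, float)


def double_typevar(x: Number) -> Number:
    return x * 2


as_int = double_typevar(2)      # inferred as int
as_float = double_typevar(1.5)  # inferred as float
```

Both functions behave identically at runtime. The difference only shows up in what a type checker can infer for callers.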
Enums with superclasses
In the previous post, I wrote about enumerations, and how they can be really handy when refactoring code. One thing I didn’t touch upon in that post is typed enumerations, which are enumerations that also have a type. As we saw in the previous post, an enumeration member is an association between a name and a value. But this means we need to access .value to get the actual value of an enumeration member. This can lead to overly verbose code. Take a logger for example:
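A minimal sketch of the logger case, assuming the standard library `logging` module. The class names `PlainLevel` and `LogLevel` and the logger name are hypothetical.

```python
import logging
from enum import Enum


class PlainLevel(Enum):
    # A plain enumeration: the member is not an int, so we need .value.
    WARNING = logging.WARNING


class LogLevel(int, Enum):
    # An enumeration with int as a superclass: members are ints themselves.
    INFO = logging.INFO
    WARNING = logging.WARNING


logger = logging.getLogger("enum_demo")

# With the plain enumeration we have to unwrap the member first.
logger.setLevel(PlainLevel.WARNING.value)

# With the int-based enumeration the member can be passed directly.
logger.setLevel(LogLevel.WARNING)
```

Because `LogLevel` members are instances of `int`, they can be used anywhere `logging` expects a numeric level.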
Enums and refactoring
Enumerations are types that take a set of pre-defined options, called members, which are also assigned values. Usually, enumeration members are, as the name implies, simply mapped to integer values, but any arbitrary value will work. Here’s an example of an enumeration in Python for colors:
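A minimal sketch of such an enumeration, with member names and integer values picked purely for illustration:

```python
from enum import Enum


class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3


# Members can be looked up by name or by value.
by_name = Color["BLUE"]
by_value = Color(2)
```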
Correctly typing recursive hierarchies in Python
I recently tried to create a recursive type in Python using mypy. Recursive types naturally occur when processing nested collections of arbitrary depth, such as lists or dictionaries. For me, this most often happens when processing JSON data in the wild.
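As a sketch of what such a type can look like, the alias below describes JSON-like data and refers to itself through string forward references. The alias name `JSON` and the helper `depth` are hypothetical, and whether a type checker accepts the recursive alias depends on the checker and its version.

```python
from typing import Dict, List, Union

# A JSON value is a scalar, a list of JSON values,
# or a dict mapping strings to JSON values.
JSON = Union[str, int, float, bool, None, List["JSON"], Dict[str, "JSON"]]


def depth(value: JSON) -> int:
    """Return how many levels of lists or dicts are nested in `value`."""
    if isinstance(value, list):
        return 1 + max((depth(item) for item in value), default=0)
    if isinstance(value, dict):
        return 1 + max((depth(item) for item in value.values()), default=0)
    return 0  # scalars contain no nesting
```

For example, `depth({"a": [1, {"b": None}]})` walks through a dict, a list, and another dict.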
Solving a ForwardRef error in pydantic
I recently ran into the following error when initializing a Pydantic BaseModel:
The burden of proof for code reviews
This post is about code reviews, and how they can go wrong. A code review is an activity where another team member, who may or may not have been involved in the project or the code, looks at the code, and tries to spot errors or inconsistencies.
Turning a byte pair encoded string back to its surface form
The huggingface tokenizers library can be used to train many varieties of sub-word tokenizers. In short, a sub-word tokenizer is a tokenizer that learns to split up strings into frequently occurring pieces. Ideally, a sub-word tokenizer is exhaustive, which means that it can split up any string, even strings containing sequences it has never seen before, into sub-word tokens. A truly exhaustive sub-word tokenizer is useful because it will never encounter an <UNK> symbol, i.e., a thing it doesn’t know what to do with. Reaching this state is difficult when tokenizing on the character level, however, as there are tens of thousands of unique Unicode characters, and it is undesirable to give all of these Unicode characters separate tokens in the vocabulary.