Stéphan Tulkens

NLP Person

TypeVars and Unions in Python

This post is about Unions, TypeVars, and unions of types that share a relevant common supertype (i.e., not object). I’ll show how union types are often used incorrectly, and how using a TypeVar can solve some of these problems. So, having said that, let’s dive in!
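To give a flavour of the problem before you click through, here is a minimal sketch of my own (the function names are made up, and the post’s actual examples may differ). It shows how a union return type throws away information that a TypeVar preserves.

```python
from typing import TypeVar, Union


def double_union(value: Union[int, float]) -> Union[int, float]:
    # The union return type forgets which member went in: a type
    # checker sees Union[int, float] even when the caller passed an int.
    return value * 2


T = TypeVar("T", int, float)


def double_typevar(value: T) -> T:
    # A constrained TypeVar ties the return type to the argument
    # type: passing an int is now known to return an int.
    return value * 2


x = double_union(3)    # inferred as Union[int, float]
y = double_typevar(3)  # inferred as int
```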

Enums with superclasses

In the previous post, I wrote about enumerations, and how they can be really handy when refactoring code. One thing I didn’t touch upon in that post is typed enumerations, which are enumerations that also have a type. As we saw in the previous post, an enumeration member is an association between a name and a value, which means we need to call .value to get the actual value of a member. This can lead to overly verbose code. Take a logger, for example:
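As a preview of why those .value calls hurt, here is a minimal sketch of my own, assuming logging levels as the running example (the post’s actual logger example may look different):

```python
import logging
from enum import Enum, IntEnum


class PlainLevel(Enum):
    # A plain enum: members merely wrap their value, so the value
    # has to be unpacked explicitly at every call site.
    INFO = logging.INFO


class TypedLevel(IntEnum):
    # A typed enum: members *are* ints, so they can be passed
    # anywhere an int is expected.
    INFO = logging.INFO


logger = logging.getLogger("demo")
logger.setLevel(PlainLevel.INFO.value)  # verbose: .value is required
logger.setLevel(TypedLevel.INFO)        # works directly, no .value
```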

Enums and refactoring

Enumerations are types that take a set of pre-defined options, called members, which are also assigned values. Usually, enumeration members are, as the name implies, simply mapped to integer values, but any arbitrary value will work. Here’s an example of an enumeration in Python for colors:
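The snippet the excerpt refers to is cut off at this point; a minimal color enumeration in that spirit might look like this:

```python
from enum import Enum


class Color(Enum):
    # Each member associates a name with a value; integers are the
    # conventional choice, but any value would do.
    RED = 1
    GREEN = 2
    BLUE = 3


print(Color.RED)        # Color.RED
print(Color.RED.name)   # 'RED'
print(Color.RED.value)  # 1
```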

Correctly typing recursive hierarchies in Python

I recently tried to create a recursive type in Python using mypy. Recursive types naturally occur when processing nested collections of arbitrary depth, such as lists or dictionaries. For me, this most often happens when processing JSON data in the wild.
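As a sketch of what such a type looks like (my own minimal version; the post presumably covers the pitfalls in more detail, and note that mypy only accepts recursive aliases like this out of the box in relatively recent versions):

```python
from typing import Dict, List, Union

# A recursive alias for arbitrarily nested JSON-like data: a JSON value
# is a scalar, a list of JSON values, or a string-keyed dict of them.
JSON = Union[None, bool, int, float, str, List["JSON"], Dict[str, "JSON"]]


def count_leaves(data: JSON) -> int:
    # Walk the structure recursively, counting non-container values.
    if isinstance(data, dict):
        return sum(count_leaves(value) for value in data.values())
    if isinstance(data, list):
        return sum(count_leaves(item) for item in data)
    return 1


print(count_leaves({"a": [1, 2, {"b": None}], "c": "text"}))  # prints 4
```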

Solving a ForwardRef error in pydantic

I recently ran into the following error when initializing a Pydantic BaseModel:

The burden of proof for code reviews

This post is about code reviews, and how they can go wrong. A code review is an activity in which another team member, who may or may not have been involved in the project or the code, looks at the code and tries to spot errors or inconsistencies.

Turning a byte pair encoded string back to its surface form

The huggingface tokenizers library can be used to train many varieties of sub-word tokenizers. In short, a sub-word tokenizer is a tokenizer that learns to split strings into frequently occurring pieces. Ideally, a sub-word tokenizer is exhaustive, which means that it can split any string, even strings containing sequences it has never seen before, into sub-word tokens. A truly exhaustive sub-word tokenizer is useful because it will never encounter an <UNK> symbol, i.e., a thing it doesn’t know what to do with. Reaching this state is difficult when tokenizing on the character level, however, as there are tens of thousands of unique unicode characters, and it is undesirable to give all of them separate tokens in the vocabulary.
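One common way out, used by byte-level BPE tokenizers such as GPT-2’s, is to tokenize bytes rather than characters: UTF-8 reduces any string to at most 256 distinct byte values, and the surface form is recovered by decoding the bytes again. A library-free sketch of that round trip:

```python
# Any unicode string reduces to UTF-8 bytes, so a byte-level vocabulary
# needs at most 256 base symbols instead of one per unicode character.
text = "naïve café"
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)  # 'ï' and 'é' each become two byte tokens

# Recovering the surface form is the reverse operation: concatenate
# the byte tokens and decode the result as UTF-8.
surface = bytes(byte_tokens).decode("utf-8")
assert surface == text
```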

Fast way to find the rank of an item in pytorch

I recently read an interesting paper by Gehrmann et al. in which the rank of the predictions of a language model is used as a feature vector to distinguish machine-generated from regular text. In implementing this method in pytorch, I ran into an interesting problem that I solved in a really slow way, and subsequently made faster. This blog post shows you how not to do it, and how you can make it faster.
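As a preview, here is my own reconstruction of the two approaches (not necessarily the exact code from the post): the slow version sorts the whole vocabulary for every prediction, while the fast one just counts how many scores beat the target’s score.

```python
import torch

# Scores for a batch of 4 predictions over a vocabulary of 10 items,
# plus the index of the correct item for each prediction.
scores = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))

# Slow: fully sort every row, then locate each target's position.
order = scores.argsort(dim=1, descending=True)
slow_ranks = (order == targets.unsqueeze(1)).nonzero()[:, 1]

# Fast: the rank of the target is just the number of items that score
# strictly higher than it, which requires no sorting at all.
target_scores = scores.gather(1, targets.unsqueeze(1))
fast_ranks = (scores > target_scores).sum(dim=1)

assert torch.equal(slow_ranks, fast_ranks)
```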

Using itertools.product with dictionaries

In a previous post, I talked about using itertools.product with lists. In that post, I used a typical ML experiment as an example, and made a comparison with sklearn’s GridSearchCV. It occurred to me that GridSearchCV uses dictionaries, while my example only used lists, so in this post I will show you how to build a dictionary iterator using product.
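As a preview, one way to build such an iterator (not necessarily the post’s exact code) is to split the grid dict into its keys and value lists, take the product of the value lists, and zip each combination back onto the keys:

```python
from itertools import product

# A parameter grid in the style that sklearn's GridSearchCV expects.
grid = {"learning_rate": [0.1, 0.01], "n_layers": [1, 2, 3]}

# Fix the key order once, take the product of the value lists, and
# rebuild a dict for every combination.
keys, value_lists = zip(*grid.items())
for combination in product(*value_lists):
    settings = dict(zip(keys, combination))
    print(settings)
# {'learning_rate': 0.1, 'n_layers': 1}
# {'learning_rate': 0.1, 'n_layers': 2}
# ... six dicts in total
```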

Using itertools.product instead of nested for loops

In many programming situations, you will often have to compare each item in a list to every other item in that list, creating the well-known nested for loop.
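As a small taste (my own example), here is the nested loop and its flat product equivalent side by side:

```python
from itertools import product

items = ["a", "b", "c"]

# The classic nested for loop: every item paired with every item.
for first in items:
    for second in items:
        print(first, second)

# The same pairs from a single flat loop over the cartesian product.
for first, second in product(items, repeat=2):
    print(first, second)
```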
