Stéphan Tulkens

NLP Person

String casing in python

Below are two ways to check if a string is lower-cased in Python.
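
The excerpt does not say which two ways the post has in mind. As a rough sketch, here are two common candidates, `str.islower()` and comparing against `str.lower()`; the helper names are made up for illustration. The two disagree on strings without any cased characters.

```python
def is_lower_method(s: str) -> bool:
    # str.islower() requires at least one cased character,
    # and all cased characters must be lowercase.
    return s.islower()


def is_lower_compare(s: str) -> bool:
    # True whenever lower-casing leaves the string unchanged,
    # which includes strings without any cased characters.
    return s == s.lower()


for s in ["hello", "Hello", "123", ""]:
    print(repr(s), is_lower_method(s), is_lower_compare(s))
```

For `"123"` and `""` the first check returns `False` and the second returns `True`, so which one is "correct" depends on how you want to treat strings without letters.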

Read More

Correctly typing cached functions

Caching, or memoization, is a useful way to speed up repeated calls to expensive, pure functions. When calling a function, we save the output, using the function's parameters as the cache key. Then, instead of recalculating the result on each call, we simply return the value stored in the cache.
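
A minimal sketch of that idea, assuming a function of a single hashable argument. The names `memoize`, `slow_square`, and `call_log` are hypothetical and only serve the illustration; this is not necessarily how the post types its cache.

```python
from typing import Callable, Dict, List


def memoize(func: Callable[[int], int]) -> Callable[[int], int]:
    # The cache maps the function's parameter to its stored output.
    cache: Dict[int, int] = {}

    def wrapper(n: int) -> int:
        if n not in cache:
            cache[n] = func(n)  # compute once, then remember
        return cache[n]

    return wrapper


call_log: List[int] = []  # records every real (non-cached) call


@memoize
def slow_square(n: int) -> int:
    call_log.append(n)
    return n * n


first = slow_square(4)
second = slow_square(4)  # served from the cache, no new call
print(first, second, len(call_log))
```

The standard library offers the same behaviour through `functools.lru_cache`, which is usually the better choice in practice.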

Read More

NewType in python

New week, new post! This post is about NewType, which is, in my opinion, an underused construct in Python and a good way to show the difference between typing time and runtime.
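
To make that difference concrete, here is a small sketch. `UserId` and `describe` are hypothetical names chosen for this example.

```python
from typing import NewType

# For a type checker, UserId is a distinct type derived from int.
UserId = NewType("UserId", int)


def describe(user_id: UserId) -> str:
    return f"user #{user_id}"


uid = UserId(42)

# At runtime, calling UserId simply returns its argument unchanged,
# so uid is a plain int.
print(type(uid) is int)
print(describe(uid))

# describe(42) also runs without error, but a type checker such as
# mypy is expected to reject it, because 42 is an int, not a UserId.
```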

Read More

TypeVars and Unions in python

This post will be about Unions, TypeVars, and unions of types with a relevant common subtype (i.e., not object). I’ll show how union types are often incorrectly used, and how using a TypeVar can solve some of these problems. So, having said that, let’s dive in!
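
As a small sketch of the kind of problem meant here (the function names are hypothetical, and the post's own examples may differ): a function annotated with a Union returns the whole union, while a constrained TypeVar ties the return type to the argument type.

```python
from typing import TypeVar, Union


def double_union(x: Union[int, str]) -> Union[int, str]:
    # The return type is the full union, so the caller no longer
    # knows whether an int or a str comes back.
    return x * 2


# T can only be int or str, and stays the same type within one call.
T = TypeVar("T", int, str)


def double_typevar(x: T) -> T:
    # For a str argument, a type checker should infer a str result.
    return x * 2


print(double_union(2), double_typevar("ab"))
```

Both functions behave identically at runtime; the difference only shows up in what a type checker can conclude about the result.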

Read More

Enums with superclasses

In the previous post, I wrote about enumerations, and how they can be really handy when refactoring code. One thing I didn’t touch upon in that post is typed enumerations, which are enumerations that also have a type. As we saw in the previous post, an enumeration member is an association between a name and a value. But this means we need to call .value to get the actual value of an enumeration member. This can lead to overly verbose code. Take a logger for example:
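
The post's own logger example is not part of this excerpt. As a rough stand-in, the sketch below contrasts a plain `Enum` with an `IntEnum` when setting a log level; the class names `PlainLevel` and `Level` are assumptions made for this illustration.

```python
import logging
from enum import Enum, IntEnum


class PlainLevel(Enum):
    INFO = logging.INFO
    WARNING = logging.WARNING


class Level(IntEnum):
    INFO = logging.INFO
    WARNING = logging.WARNING


logger = logging.getLogger("enum_demo")

# A plain Enum member is not an int, so we have to unwrap it.
logger.setLevel(PlainLevel.INFO.value)

# An IntEnum member is itself an int and can be passed directly.
logger.setLevel(Level.WARNING)

print(logger.level == logging.WARNING)
print(isinstance(Level.INFO, int), isinstance(PlainLevel.INFO, int))
```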

Read More

Enums and refactoring

Enumerations are types that take a set of pre-defined options, called members, each of which is assigned a value. Usually, enumeration members are, as the name implies, simply mapped to integer values, but any arbitrary value can be used. Here’s an example of an enumeration in Python for colors:
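
The post's own example is cut off in this excerpt. A minimal sketch of what such an enumeration could look like, with member names and values chosen here purely for illustration:

```python
from enum import Enum


class Color(Enum):
    # Each member associates a name with a value.
    RED = 1
    GREEN = 2
    BLUE = 3


print(Color.RED.name, Color.RED.value)

# Looking up a member by value returns the same member object.
print(Color(2) is Color.GREEN)
```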

Read More

Correctly typing recursive hierarchies in Python

I recently tried to create a recursive type in Python using mypy. Recursive types naturally occur when processing nested collections of arbitrary depth, such as lists or dictionaries. For me, this most often happens when processing JSON data in the wild.
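
A minimal sketch of such a recursive type for JSON-like data. The alias name `JSON` and the helper `depth` are hypothetical, and whether a type checker accepts the alias depends on its version and settings (recent mypy releases support recursive aliases).

```python
from typing import Dict, List, Union

# The alias refers to itself through the string forward reference "JSON".
JSON = Union[str, int, float, bool, None, List["JSON"], Dict[str, "JSON"]]


def depth(value: JSON) -> int:
    # Scalars have depth 0; each list or dict adds one level.
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    return 0


nested: JSON = {"a": [1, {"b": []}]}
print(depth(nested))
```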

Read More

Solving a ForwardRef error in pydantic

I recently ran into the following error when initializing a Pydantic BaseModel:

Read More

The burden of proof for code reviews

This post is about code reviews, and how they can go wrong. A code review is an activity where another team member, who may or may not have been involved in the project or the code, looks at the code, and tries to spot errors or inconsistencies.

Read More

Turning a byte pair encoded string back to its surface form

The huggingface tokenizers library can be used to train many varieties of sub-word tokenizers. In short, a sub-word tokenizer is a tokenizer that learns to split up strings into frequently occurring pieces. Ideally, a sub-word tokenizer is exhaustive, which means that it can split up any string, even strings containing sequences it has never seen before, into sub-word tokens. A truly exhaustive sub-word tokenizer is useful because it will never encounter an <UNK> symbol, i.e., a thing it doesn’t know what to do with. Reaching this state is difficult when tokenizing on the character level, however, as there are tens of thousands of unique Unicode characters, and it is undesirable to give all of these characters separate tokens in the vocabulary.

Read More