Stéphan Tulkens

NLP Person

Correctly typing recursive hierarchies in Python

I recently tried to create a recursive type in Python using mypy. Recursive types naturally occur when processing nested collections of arbitrary depth, such as lists or dictionaries. For me, this most often happens when processing JSON data in the wild.

I found it quite hard to correctly define a recursive type. If you simply want to know how to create one, here is how.

from typing import Union

Hierarchy = list[Union["Hierarchy", str]]

This is of course assuming you have nested lists of strings. Your use-case might vary.

First attempts

In order to explain what is going on, I’ll first introduce a function that processes a list of lists of arbitrary depth, and returns all strings from all lists. That is, it flattens the list (I know there are easier ways to flatten lists, bear with me.)

Hierarchy = list[list | str]

def flatten_list(hierarchy: Hierarchy) -> list[str]:
    terminals = []
    for item in hierarchy:
        if isinstance(item, str):
            terminals.append(item)
        else:
            terminals.extend(flatten_list(item))

    return terminals

The type here is usually what I would attempt, the assumption being that it is not easy to define a hierarchical type. This example also passes mypy, but this is because the nested list is short for list[Any]. We can check this by adding reveal_type to the code, and re-running mypy:

Hierarchy = list[list | str]

def flatten_list(hierarchy: Hierarchy) -> list[str]:
    terminals = []
    reveal_type(hierarchy)
    for item in hierarchy:
        reveal_type(item)
        if isinstance(item, str):
            terminals.append(item)
        else:
            terminals.extend(flatten_list(item))

    return terminals  
  

Running mypy on this piece of code gives us:

note: Revealed type is "builtins.list[Union[builtins.list[Any], builtins.str]]"
note: Revealed type is "Union[builtins.list[Any], builtins.str]"

Which shows that the nested list is typed as list[Any].

Making it a bit more objective

One way to make the type hierarchical is to introduce a class that has itself as a member. Note that this requires us to import annotations from __future__.

from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator


@dataclass
class Hierarchy:
    items: list[Hierarchy | str]

    def __iter__(self) -> Iterator[Hierarchy | str]:
        return iter(self.items)

This works! It fully type checks and also works in practice. So this means we’re done, right? We’ve created a type that acts as a hierarchical list of items.

…except we forgot one thing! We haven’t actually instantiated these dataclasses. That is, in order to create Hierarchy dataclasses, we have to traverse the hierarchy. This would, again, require us to define a hierarchical type to define the code to create the Hierarchy objects. One way to fix this, is to create Hierarchy objects during traversal. While this works, but I don’t really see the point: at that point you’re just creating objects to satisfy the type checker.

Going primitive

In order to solve this, we need to be able to define a Hierarchy in terms of the primitives of the data we ingest. For arbitrary JSON data, and python, this means we need to define a type over dict, list, and so on.

My first attempt at this was to define something like this:

from __future__ import annotations

Hierarchy = list[Hierarchy | str]

Now, what is super confusing about this, is that this correctly passes mypy type checking, but fails when actually running the code, and with the following error:

NameError: name 'Hierarchy' is not defined

This is probably the first time I saw something that passed static type checks that was simply wrong. Note that pylance, which is the default type checker in VSCode, does flag this as an error. Usually, I don’t see any discrepancies between the two, so this is interesting. Also note that without annotations imported, this also doesn’t pass mypy.

So, what is happening here is that definitions in Python are built in a bottom-up fashion. In order to define Hierarchy, we need to know which types are used by Hierarchy. But this is not possible, because Hierarchy hasn’t been defined at that point.

One way to circumvent this, is by using string aliases instead of the actual type, as follows:

Hierarchy = list["Hierarchy" | str]

Note that this no longer requires us to import annotations. This does pass type checking, but, again, doesn’t actually work! When running this code, it throws a TypeError.

TypeError: unsupported operand type(s) for |: 'str' and 'type'

What is happening here? As it turns out, when type checking, mypy is happy to pretend that "Hierarchy" actually refers to the type Hierarchy which we defined earlier on the same line. This allows for recursive references. But at run-time, "Hierarchy" no longer represents a type, but is just a good old string. So, what we’re actually doing is the following:

result = "Hierarchy" | str

I.e., we’re using the pipe operator, often used in boolean logic, and apply it to an instance of a str and a type (which happens to be str). Arbitrary classes can implement their own logic for the pipe operator by implementing the dunder method __or__. str doesn’t implement __or__, so using | doesn’t make a lot of sense. Note that this can actually give a really really ugly bug if you do use a type that implements __or__, as it won’t crash, but will instead return the value of whatever __or__ returns.

So, given that this is a syntax error, our last recourse could be to introduce good old Union from typing. From python 3.10 onwards, the | is an alias for Union, and I’ve never imported Union since. This is the first time (lots of first times here), that I’ve seen a replacement for a typing operator not being equal to that typing operator, so, this threw me for a loop again. In fact, I only solved it by reading through the comments of this SO answer.

So, circling back to the top of this post, the correct answer is to use typing.Union.

from typing import Union

Hierarchy = list[Union["Hierarchy", str]]

The revealed types in the function above, are:

note: Revealed type is "builtins.list[Union[builtins.str, ...]]"
note: Revealed type is "Union[builtins.str, builtins.list[Union[builtins.str, ...]]]"

Which is exactly what we expected. Nice!