How Python implements dictionaries

Jessica YungProgrammingLeave a Comment

Python Dictionaries: Not even a space-time tradeoff

If you could choose to store things that you’d want to look up later in a Python dictionary or in a Python list, which would you choose?

It turns out that looking up items in a Python dictionary is much faster than looking up items in a Python list. If you search for a fixed number of keys, if you grow your haystack size (i.e. the size of the collection you’re searching through) by 10,000x from 1k entries to 10M entries, using a dict is over 5000x faster than using a list! (Source: Fluent Python by Luciano Ramalho)

Then why not always use dictionaries? Looking up entries in Python dictionaries is fast, but dicts use a lot of memory. Or that used to be the case anyway. From Python 3.6, dictionaries don’t even use that much memory, so dictionaries are almost always the way to go.

For most of this post, we’ll discuss dictionaries as they’re implemented pre-Python 3.6, and I’ll quickly go over the Python 3.6 changes near the end.

What makes lookups in Python dictionaries so much faster?

Dictionaries in Python are implemented using a hash table.

How hash tables work.

A hash table is a way of doing key-value lookups. You store the values in an array, and then use a hash function to find the index of the array cell that corresponds to your key-value pair. A hash function maps a key (e.g. a string) to a hash value and then to an index of the array.

So there are three main elements in hash tables: the keys, the array and the hash function.

In Python dictionaries, we keep hash tables sparse – that is, they contain many empty cells. Specifically, Python tries to keep at least a third of the cells empty.

sparse_matrix
A sparse matrix. Many of the entries are zero.

Walking Through Lookups

Let’s try to understand how dictionaries are implemented by going through how Python looks up an entry in the dictionary. Adding entries to the dictionary follows a similar process.

1. Hash function: Calculate array index value to look up

Python dictionaries are stored as a contiguous block of memory. That means each array cell would start at dict_start + array_cell_index * cell_size.

First, then, we need to decide which index value to look up. Recall this decision is made by the hash function, which maps the item key to a hash value and then to an index.

The first part, mapping an item key to a hash value, is done by the function hash(item_key) (more details here). For example:

One important thing to note is that Python’s hash function is pretty regular. For example, the hash for an integer is the integer itself. Usually, you’d need an irregular hash function – i.e. one that scatters similar-seeming keys to different hash values – to make hash tables work well, but Python doesn’t have this. Instead, Python relies on a good way of resolving hash collisions to make the lookups efficient. More on that later.

The second part, mapping a hash value to an array index, is i = hash(item_key) & mask, where mask=array_size - 1.

& is a bitwise and.  To do a bitwise and A & B, rewrite A and B in binary, and then write a 1 in the digit places where both A and B are 1, and write otherwise. For example, suppose we have a hash value of 500 and an array size of 8, which is the starting size of empty Python dictionaries.

Intuitively, this second part maps each hash value to a value in the range [0, array_size-1], so the location we look up will be in the array.

2. Checking for hash collisions

Now we can look up the cell the array index is pointing to. If the cell is empty and we’re trying to do a lookup, we return a KeyError. If the cell is not empty, we check if the item in the cell is what we’re looking for.

Recall each cell contains the hash value, the item key and the item value. We check if the item key and the hash value are the same as the search key and the hash of the search key using the == operator. If they’re the same, we’ve got what we were looking for! If not, we have a hash collision: either (1) two item keys having the same hash code OR (2) two item keys with different hash codes both point to the same index. (i.e. hashvalue1 & mask = hashvalue2 & mask.)

Hash collision! Two possible cases

Hash collisions can happen because (1) there are an infinite number of strings and only a finite number of hash values (so two strings might have the same hash code), and (2) there are usually fewer cells in the array than hash values, so two hash values may point to the same position in the array. Specifically, if the length of the array is N digits long in binary, if two hash values share the same last N digits, there will be a hash collision.

3. Resolving hash collisions

To resolve the collision, Python searches the other array cells in a scrambled way that depends on the hash value. Because Python’s hash function is relatively regular, the way it resolves collisions is key to implementing lookups efficiently.

You can safely skip this part, but if here’s the code if you’re interested:

This is discussed in much more depth in the docstring in the CPython dictobject source code.

Resizing the dictionary

Recall that in Python we want the dict to be sparse, specifically at least 1/3 empty. So when the dict becomes 2/3 full, Python copies the dict to a different location and makes it bigger. This increases the array_size and increases  mask, which means (1) the lookup now likely uses more digits of the hash value, and (2) the array indices might change too. This is why the order of  dict.keys() might change as you add entries to dict.

Python 3.6: more compact dictionaries

Remember what I said about Python 3.6 dictionaries not using as much memory? This is because the array is reformatted into two arrays, one compact array that holds the <hash value, item key, item value> triples, and a sparse array that holds indices that point to rows in the compact array. Here’s an illustration that shows how it works:

How dictionaries are stored from Python 3.6 onwards.

Pretty clever, huh?

I hope this has helped – if you want to learn more, I recommend you check out the docstring in the CPython dictobject source code or read Laurent’s blog post that walks through more of the source code.

References:

Leave a Reply