Python Dictionaries: Not even a space-time tradeoff
If you could choose to store things that you’d want to look up later in a Python dictionary or in a Python list, which would you choose?
It turns out that looking up items in a Python dictionary is much faster than looking up items in a Python list. If you search for a fixed number of keys, if you grow your haystack size (i.e. the size of the collection you’re searching through) by 10,000x from 1k entries to 10M entries, using a dict is over 5000x faster than using a list! (Source: Fluent Python by Luciano Ramalho)
Then why not always use dictionaries? Looking up entries in Python dictionaries is fast, but dicts use a lot of memory. Or that used to be the case anyway. From Python 3.6, dictionaries don’t even use that much memory, so dictionaries are almost always the way to go.
For most of this post, we’ll discuss dictionaries as they’re implemented pre-Python 3.6, and I’ll quickly go over the Python 3.6 changes near the end.
What makes lookups in Python dictionaries so much faster?
Dictionaries in Python are implemented using a hash table.
A hash table is a way of doing key-value lookups. You store the values in an array, and then use a hash function to find the index of the array cell that corresponds to your key-value pair. A hash function maps a key (e.g. a string) to a hash value and then to an index of the array.
So there are three main elements in hash tables: the keys, the array and the hash function.
In Python dictionaries, we keep hash tables sparse – that is, they contain many empty cells. Specifically, Python tries to keep at least a third of the cells empty.
Walking Through Lookups
Let’s try to understand how dictionaries are implemented by going through how Python looks up an entry in the dictionary. Adding entries to the dictionary follows a similar process.
1. Hash function: Calculate array index value to look up
Python dictionaries are stored as a contiguous block of memory. That means each array cell would start at dict_start + array_cell_index * cell_size.
First, then, we need to decide which index value to look up. Recall this decision is made by the hash function, which maps the item key to a hash value and then to an index.
The first part, mapping an item key to a hash value, is done by the function hash(item_key) (more details here). For example:
One important thing to note is that Python’s hash function is pretty regular. For example, the hash for an integer is the integer itself. Usually, you’d need an irregular hash function – i.e. one that scatters similar-seeming keys to different hash values – to make hash tables work well, but Python doesn’t have this. Instead, Python relies on a good way of resolving hash collisions to make the lookups efficient. More on that later.
The second part, mapping a hash value to an array index, is i = hash(item_key) & mask, where mask=array_size - 1.
bin(500) = '0b111110100' # so 500 in binary is 111110100 (everything after the b).
# and 7 in binary is 1110 (4 + 2 + 1 + 0).
# line them up
000000100 # bitwise and
# which is 4 in base 10.
Intuitively, this second part maps each hash value to a value in the range [0, array_size-1], so the location we look up will be in the array.
2. Checking for hash collisions
Now we can look up the cell the array index is pointing to. If the cell is empty and we’re trying to do a lookup, we return a KeyError. If the cell is not empty, we check if the item in the cell is what we’re looking for.
Recall each cell contains the hash value, the item key and the item value. We check if the item key and the hash value are the same as the search key and the hash of the search key using the
== operator. If they’re the same, we’ve got what we were looking for! If not, we have a hash collision: either (1) two item keys having the same hash code OR (2) two item keys with different hash codes both point to the same index. (i.e.
Hash collisions can happen because (1) there are an infinite number of strings and only a finite number of hash values (so two strings might have the same hash code), and (2) there are usually fewer cells in the array than hash values, so two hash values may point to the same position in the array. Specifically, if the length of the array is N digits long in binary, if two hash values share the same last N digits, there will be a hash collision.
3. Resolving hash collisions
To resolve the collision, Python searches the other array cells in a scrambled way that depends on the hash value. Because Python’s hash function is relatively regular, the way it resolves collisions is key to implementing lookups efficiently.
You can safely skip this part, but if here’s the code if you’re interested:
perturb >>= PERTURB_SHIFT; # PERTURB_SHIFT is a constant.
# >> shifts the bits to the right
# by PERTURB_SHIFT (bits).
# e.g. 9 = 1001 base 2.
# 9 >> 1 = 100 base 2,
# which is 6.
# So 9 >> 1 is 6.
j = (5*j) + 1 + perturb; # this would search through the array
# in a fixed way if not for perturb,
# which makes the search order
# different for different hash keys
use j % 2**i as the next table index; # where i is the current
# array index
This is discussed in much more depth in the docstring in the CPython dictobject source code.
Resizing the dictionary
Recall that in Python we want the dict to be sparse, specifically at least 1/3 empty. So when the dict becomes 2/3 full, Python copies the dict to a different location and makes it bigger. This increases the array_size and increases mask, which means (1) the lookup now likely uses more digits of the hash value, and (2) the array indices might change too. This is why the order of dict.keys() might change as you add entries to dict.
Python 3.6: more compact dictionaries
Remember what I said about Python 3.6 dictionaries not using as much memory? This is because the array is reformatted into two arrays, one compact array that holds the <hash value, item key, item value> triples, and a sparse array that holds indices that point to rows in the compact array. Here’s an illustration that shows how it works:
Pretty clever, huh?
I hope this has helped – if you want to learn more, I recommend you check out the docstring in the CPython dictobject source code or read Laurent’s blog post that walks through more of the source code.
- Fluent Python by Luciano Ramalho, Chapter 3: Dictionaries and Sets
- CPython source code for Dict
- Hash Tables tutorial (Video by Gayle Laakmann Mcdowell for HackerRank)
- Laurent Luce’s blog post on implementing dicts in Python
- Changes to Python dict data structure from Python 3.6 (idea)
- How Python calculates hash values (from 2002, may have changed since)