muhuk's blog

Nature, to Be Commanded, Must Be Obeyed

November 05, 2008

Dictionary Key Validation, Lists, Sets and Iterators

Recently I needed to check if a Python dictionary has a specific set of keys exactly or not. I made some tests to see which method works best. For the sake of simplicity, I’ve compared keys of two dictionaries instead of a dictionary’s keys against a sequence of keys.

All of the following test results are averages of 10 repetitions of the given code, in the form of:

sum(timeit.Timer(<test_code>).repeat(10))/10.0

First, I did some simple tests with small sizes:

>>> timeit.Timer('l1.sort();l2.sort();l1==l2', 'l1,l2=[1,2,3], [3,2,1]')
1.95

>>> timeit.Timer('set(l1)==set(l2)', 'l1,l2=[1,2,3], [3,2,1]')
2.85

The second one with set‘s seems more intuitive to me. But it’s slower than the one with lists. Obviously you need to sort the lists before comparison. Because a dictionary’s keys are not ordered, and therefore its keys() would return a list with unpredictable order of items. On the other hand a set, even though it is unordered, would return the same hash for the same keys.

Now, what happens if we work on a larger number of items:

>>> timeit.Timer('l1.sort();l2.sort();l1==l2', 'l1,l2=range(100), \
... [100-i for i in range(100)]')
15.49

>>> timeit.Timer('set(l1)==set(l2)', 'l1,l2=range(100), \
... [100-i for i in range(100)]')
25.13

More or less the same results. Actually both of these tests are biased towards lists. We initialize our test data as lists. So while the first tests only consist of in-place sorting and comparison, the second ones involve creation of new set objects and comparison.

Let us see what happens when we start with actual dictionaries and do the same comparisons:

>>> timeit.Timer('l1,l2=d1.keys(),d2.keys();l1.sort();l2.sort();l1==l2', \
... 'd1,d2=dict([(str(i), i) for i in range(3)]),dict([(str(3-i), 3-i) \
... for i in range(3)])')
2.49

>>> timeit.Timer('set(d1.keys())==set(d2.keys())', 'd1,d2=dict([(str(i), i) \
... for i in range(3)]),dict([(str(3-i), 3-i) for i in range(3)])')
2.83

And for comparison the first test without sorting (of course it evaluates to False):

>>> timeit.Timer('d1.keys()==d2.keys()', 'd1,d2=dict([(str(i), i) \
... for i in range(3)]),dict([(str(3-i), 3-i) for i in range(3)])')
1.11

We can clearly see initializing sets takes slightly longer than in-place sorting. But what happens if we work with more keys:

>>> timeit.Timer('l1,l2=d1.keys(),d2.keys();l1.sort();l2.sort();l1==l2', \
... 'd1,d2=dict([(str(i), i) for i in range(10)]), \
... dict([(str(10-i), 10-i) for i in range(10)])')
5.21

>>> timeit.Timer('set(d1.keys())==set(d2.keys())', \
... 'd1,d2=dict([(str(i), i) for i in range(10)]), \
... dict([(str(10-i), 10-i) for i in range(10)])')
5.59

When we increased item number to ten performance gap decreased. This is probably because while set comparisons are based on hashes as I mentioned before, list comparisons are item-by-item comparisons due to its mutability. Let’s go even higher and see if the difference becomes more clear:

>>> timeit.Timer('l1,l2=d1.keys(),d2.keys();l1.sort();l2.sort();l1==l2', \
... 'd1,d2=dict([(str(i), i) for i in range(25)]), \
... dict([(str(25-i), 25-i) for i in range(25)])')
16.21

>>> timeit.Timer('set(d1.keys())==set(d2.keys())', \
... 'd1,d2=dict([(str(i), i) for i in range(25)]), \
... dict([(str(25-i), 25-i) for i in range(25)])')
9.79

This time sets are almost 100% faster. It seems sets perform better than sorted lists overall. They are not much slower at lower item counts and significantly faster at higher item counts. I dawned on me when using sets I can take advantage of iterators instead of creating list objects. Here are the results for all three sizes:

>>> timeit.Timer('set(d1.iterkeys())==set(d2.iterkeys())', \
... 'd1,d2=dict([(str(i), i) for i in range(3)]), \
... dict([(str(3-i), 3-i) for i in range(3)])')
2.07
>>> timeit.Timer('set(d1.iterkeys())==set(d2.iterkeys())', \
... 'd1,d2=dict([(str(i), i) for i in range(10)]), \
... dict([(str(10-i), 10-i) for i in range(10)])')
3.79
>>> timeit.Timer('set(d1.iterkeys())==set(d2.iterkeys())', \
... 'd1,d2=dict([(str(i), i) for i in range(25)]), \
... dict([(str(25-i), 25-i) for i in range(25)])')
8.92

It performs best for all sizes. In addition it is quite intuitive and readable. I think I have found the foundation of my comparison function, it should look similar to this:

if set(d.iterkeys())==set(sequence_of_keys):

If you have any questions, suggestions or corrections feel free to drop me a line.