I recently contributed the Set
data structure to the Elixir programming language.
The Set
is implemented through a HashSet.
1 2 3 4 |
|
Elixir’s Set implementation is often faster than Erlangs own sets. Which is pretty impressive since Elixir runs on the same Erlang VM. It does this by adapting the internal data structure based on the size of the sets.
Lets explore how Elixir does this while also explaining a little about Elixir:
Growing Data structures
For small sets the internal representation uses an ordered set (which really is just a list).
We refer to this as a Bucket.
1
|
|
Being ordered allows us to perform faster access and modification.
For example finding a member of a set:
1 2 3 |
|
Order in the list is not maintained in the data structure but in the insertion/put:
Putting elements into a Bucket
Before we jump into some Elixir code its important to understand that the order of functions in Elixir is important. We start from the top function and if this function matches a criteria we call it otherwise we try the function below and so on. If we find no function that matches that’s a error and execution aborts.
Here is the Elixir code for inserting a member into a bucket.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
|
Expanding into a Trie
Upon reaching a threshold the Set changes the internal representation into a Trie (What the hell is a trie).
Why a Trie?
For a Trie the time required for insert, delete and find operations is almost identical. As a result, for situations where code is inserting, deleting and finding in equal measure, tries can handily beat binary search trees.
In a Trie we don’t store any keys or hashes, instead a elements position in the tree defines the key with which it is associated.
Hence we can take advantage of Erlangs fast tuples to model the trie. A multi-depth trie is just tuples inside tuples.
Buckets growing into Tries
Lets work through an example. For simplicity we will assume the expansion happens at the very small size of 5 elements:
1 2 3 4 |
|
Create a new Erlang Tuple
1
|
|
Redistribute the bucket elements into the Tuple
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
Finally add the new set member into the Trie
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
|
Notice while we have a more complicated data structure with a trie the leaves are always plain old buckets. Hence insertion with a trie is a case of finding the right bucket and then just reusing our bucket_put
function. Beautiful and fast :)
Tries expanding into deeper Tries
Things get a little more complicated with multi depth tries. If you are interested digging more into the implementation you can see all the source code of HashSet on Github.
Contraction
We can also remove elements from a set so in much the same as we expanded we have to contract the internal representation of the set.
Performance Benchmarks
Here is some sample data comparing Erlang sets to Elixir Sets. Smaller is better.
The benchmark scripts are on github (https://gist.github.com/josephwilk/5808525) if you are interested in running them yourself.