Joseph Wilk

Joseph Wilk

Things with code, creativity and computation.

Sets in Elixir

I recently contributed the Set data structure to the Elixir programming language. The Set is implemented through a HashSet.

1
2
3
4
s1 = HashSet.new([1, 2, 3, 4])
s2 = HashSet.new([3, 4])

Set.difference(s1, s2) # => HashSet<[1, 2]>

Elixir’s Set implementation is often faster than Erlangs own sets. Which is pretty impressive since Elixir runs on the same Erlang VM. It does this by adapting the internal data structure based on the size of the sets.

Lets explore how Elixir does this while also explaining a little about Elixir:

Growing Data structures

For small sets the internal representation uses an ordered set (which really is just a list).

We refer to this as a Bucket.

1
[1, 2, 3, 4]

Being ordered allows us to perform faster access and modification.

For example finding a member of a set:

1
2
3
# Only needs a single comparison to know if 1 is in the Set.
# Since the first element is greater than 1 we know its not in the Set.
Set.member?(HashSet.new [2, 3, 4, 5], 1)

Order in the list is not maintained in the data structure but in the insertion/put:

Putting elements into a Bucket

Before we jump into some Elixir code its important to understand that the order of functions in Elixir is important. We start from the top function and if this function matches a criteria we call it otherwise we try the function below and so on. If we find no function that matches that’s a error and execution aborts.

Here is the Elixir code for inserting a member into a bucket.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# We have found the place to insert the new member.
defp bucket_put([m|_]=bucket, member) when m > member do 
  { [member|bucket], 1 }
end

# Member is already in the set, we don't added it. (notice how member is being used to pattern match here).
# Its the same as writing: `defp bucket_put([m|bucket], member) when m == member do
defp bucket_put([member|bucket], member) do
  { [member|bucket], 0 }
end

# Recursive case, keep calling bucket_put as this is not the point to insert.
defp bucket_put([e|bucket], member) do
  { rest, count } = bucket_put(bucket, member)
  { [e|rest], count }
end

# Empty set, insert the number
defp bucket_put([], member) do
  { [member], 1 }
end

Expanding into a Trie

Upon reaching a threshold the Set changes the internal representation into a Trie (What the hell is a trie).

Why a Trie?

For a Trie the time required for insert, delete and find operations is almost identical. As a result, for situations where code is inserting, deleting and finding in equal measure, tries can handily beat binary search trees.

In a Trie we don’t store any keys or hashes, instead a elements position in the tree defines the key with which it is associated.

Hence we can take advantage of Erlangs fast tuples to model the trie. A multi-depth trie is just tuples inside tuples.

Buckets growing into Tries

Lets work through an example. For simplicity we will assume the expansion happens at the very small size of 5 elements:

1
2
3
4
set = Set.new [0, 1, 2, 3]

# Apply an operation that forces the internal structure to change
set = Set.put(set, 4)

Create a new Erlang Tuple

bucket to trie

1
  :erlang.make_tuple(node_size = 4, [])

Redistribute the bucket elements into the Tuple

bucket to trie

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# This returns the index that a set value should have in the tuple.
defp bucket_nth_index(set_value, trie_depth) do
  node_shift = 2
  node_bitmap = 0b0011

  (:erlang.phash2(set_value) >>> (node_shift * trie_depth)) &&& node_bitmap
end

# Create a new node or add to an existing node.
# Based on the bucket_nth_index of the set member put it into a bucket at that index in the tuple.
defp node_relocate(node // :erlang.make_tuple(node_size = 4, []), bucket, depth) do
  :lists.foldl fn member, new_node ->
    position_in_tuple = bucket_nth_index(member, depth)
    set_elem(new_node, position_in_tuple, bucket_put(elem(new_node, position_in_tuple), member))
  end, new_node, bucket
end

Finally add the new set member into the Trie

bucket to trie

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# Base case. We have found the right bucket, just insert the new member into it.
defp node_put(node, 0, member) do
  position = bucket_index(member)
  { new, count } = bucket_put(elem(node, position), member)
  { set_elem(node, position, new), count }
end

# Recursive case. Keeping trying to find the right bucket.
defp node_put(node, depth, member) do
  position = bucket_index(member)
  { new, count } = node_put(elem(node, position), depth - 1, bucket_next(hash), member)
  { set_elem(node, position, new), count }
end

# The bucket index of this member
defp bucket_index(member) do
  node_bitmap = 0b0011
  hash = :erlang.phash2(member)
  hash &&& node_bitmap
end

# The bucket index of this element with depth + 1.
defp bucket_next(member) do
  node_shift = 2
  hash = :erlang.phash2(member)
  hash >>> node_shift
end

Notice while we have a more complicated data structure with a trie the leaves are always plain old buckets. Hence insertion with a trie is a case of finding the right bucket and then just reusing our bucket_put function. Beautiful and fast :)

Tries expanding into deeper Tries

Things get a little more complicated with multi depth tries. If you are interested digging more into the implementation you can see all the source code of HashSet on Github.

Contraction

We can also remove elements from a set so in much the same as we expanded we have to contract the internal representation of the set.

Performance Benchmarks

Here is some sample data comparing Erlang sets to Elixir Sets. Smaller is better.

HashSet vs :sets

The benchmark scripts are on github (https://gist.github.com/josephwilk/5808525) if you are interested in running them yourself.

Comments