Cell hash collisions
To help with the design of a dynamic UBB class with lbmpy, I wrote a `Cell` hashmap to keep track of boundary nodes. While investigating an unrelated performance bottleneck, I found out that the hashing of a `Cell` object was prone to collisions for indices larger than 64, reaching a rate of 1% for indices larger than 512. This stood out in my unit tests because my hashmap stores global cell indices, and as such it commonly handles large indices.
The current `Cell` hash function was introduced by dec54ed4. From what I understand, it originates from the MurmurHash2 library included in Boost, and was probably extracted from the `<boost/functional/hash.hpp>` header. However, the extracted code comes from the `hash_combine(std::size_t& seed, std::size_t const& value)` function, and since `value` has to be a `std::size_t`, an extra call to `std::size_t std::hash<int>(cell[i])` was added, instead of extracting the `hash_combine(std::size_t& seed, int const& value)` overload directly. The latter was originally used by waLBerla on 64-bit architectures (that overload is quite slow, because it doesn't rely on `std::hash<int>()` to hash the `int`).
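For reference, the combine step in question looks roughly like this. This is a paraphrase of the Boost pattern to illustrate the point, not the exact waLBerla code (`hash_coord` is a name I made up here):

```cpp
#include <cstddef>
#include <functional>

// Boost-style hash_combine (paraphrased for illustration): `value` must
// already be a std::size_t, which is why the current Cell hash runs each
// coordinate through std::hash<int> first.
inline void hash_combine(std::size_t& seed, std::size_t const& value)
{
   seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Effect of the extra std::hash<int> call for a single coordinate:
inline std::size_t hash_coord(std::size_t seed, int coord)
{
   hash_combine(seed, std::hash<int>{}(coord));
   return seed;
}
```

Note that with `seed == 0` the two shift terms contribute nothing, so the first combined coordinate reduces to `value + 0x9e3779b9`.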
I could be wrong, but it looks like injecting a `std::hash<int>()` into the former function has weakened the hashing algorithm, at least for the range of values that is typical of `Cell` objects. Interestingly, by removing the golden ratio constant (`0x9e3779b9`) and tweaking the bit shifts, it's possible to remove collisions for cell indices up to 512 on both 64-bit and 32-bit architectures, while also reducing runtime. Here is the new implementation:
```cpp
inline std::size_t hash_value( const Cell & cell )
{
   std::size_t seed = 0;
   std::hash<cell_idx_t> hasher;
   // limit bitwise left shifts to 32 positions to avoid overflow on 32-bit archs
   // (if seed is zero, the first bitwise left and right shifts do nothing)
   seed ^= hasher(cell.x()) + (seed << 11) + (seed >> 6);
   seed ^= hasher(cell.y()) + (seed << 11) + (seed >> 6);
   seed ^= hasher(cell.z()) + (seed << 11) + (seed >> 6);
   return seed;
}
```
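For completeness, using `hash_value` with the standard containers still needs a small adapter functor. Here is a self-contained sketch; the `Cell` stand-in and the `CellHash` functor below are hypothetical minimal versions for illustration, not waLBerla's actual classes:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

using cell_idx_t = int; // stand-in for waLBerla's cell_idx_t typedef

// Minimal stand-in for waLBerla's Cell, just enough to exercise the hash
struct Cell
{
   cell_idx_t coords[3];
   cell_idx_t x() const { return coords[0]; }
   cell_idx_t y() const { return coords[1]; }
   cell_idx_t z() const { return coords[2]; }
   bool operator==(const Cell& o) const
   {
      return coords[0] == o.coords[0] && coords[1] == o.coords[1]
          && coords[2] == o.coords[2];
   }
};

// The proposed hash, unchanged
inline std::size_t hash_value(const Cell& cell)
{
   std::size_t seed = 0;
   std::hash<cell_idx_t> hasher;
   seed ^= hasher(cell.x()) + (seed << 11) + (seed >> 6);
   seed ^= hasher(cell.y()) + (seed << 11) + (seed >> 6);
   seed ^= hasher(cell.z()) + (seed << 11) + (seed >> 6);
   return seed;
}

// Hasher functor so std::unordered_map can pick up hash_value()
struct CellHash
{
   std::size_t operator()(const Cell& c) const { return hash_value(c); }
};

using CellMap = std::unordered_map<Cell, int, CellHash>;
```

A boundary-node map then becomes e.g. `CellMap boundary_nodes; boundary_nodes[Cell{{128, 256, 512}}] = someIndex;`.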
A benchmark is attached (`benchmark.cpp`). Here are the results:
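For readers without the attachment, the collision counts below can be reproduced along these lines. This is only a sketch of the methodology with names I made up (`hash_xyz`, `count_collisions`); the attached `benchmark.cpp` is the authoritative version and may count over a different index range, so exact figures can differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>

using cell_idx_t = int;

// The proposed hash, applied directly to coordinates
inline std::size_t hash_xyz(cell_idx_t x, cell_idx_t y, cell_idx_t z)
{
   std::hash<cell_idx_t> hasher;
   std::size_t seed = 0;
   seed ^= hasher(x) + (seed << 11) + (seed >> 6);
   seed ^= hasher(y) + (seed << 11) + (seed >> 6);
   seed ^= hasher(z) + (seed << 11) + (seed >> 6);
   return seed;
}

// Count how many cells of a box_length^3 box hash to an already-taken value
inline std::uint64_t count_collisions(cell_idx_t box_length)
{
   std::unordered_set<std::size_t> seen;
   std::uint64_t collisions = 0;
   for (cell_idx_t x = 0; x < box_length; ++x)
      for (cell_idx_t y = 0; y < box_length; ++y)
         for (cell_idx_t z = 0; z < box_length; ++z)
            if (!seen.insert(hash_xyz(x, y, z)).second)
               ++collisions;
   return collisions;
}
```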
Benchmark results on a 64-bit computer

```
Count collisions with current_hash_algorithm
box_length 16: 0 collisions (0.0%)
box_length 32: 786 collisions (0.0%)
box_length 64: 48'378 collisions (0.2%)
box_length 128: 1'771'416 collisions (0.8%)
box_length 256: 16'510'817 collisions (0.9%)
box_length 512: 135'273'859 collisions (1.0%)

Timing hash algorithm with current_hash_algorithm
box_length 16: 0 milliseconds
box_length 32: 0 milliseconds
box_length 64: 0 milliseconds
box_length 128: 1 milliseconds
box_length 256: 7 milliseconds
box_length 512: 53 milliseconds

Timing insertions in unordered map with current_hash_algorithm
box_length 16: 0.00 seconds
box_length 32: 0.00 seconds
box_length 64: 0.02 seconds
box_length 128: 0.13 seconds
box_length 256: 1.80 seconds
box_length 512: 53.06 seconds

Count collisions with new_hash_algorithm
box_length 16: 0 collisions (0.0%)
box_length 32: 0 collisions (0.0%)
box_length 64: 0 collisions (0.0%)
box_length 128: 0 collisions (0.0%)
box_length 256: 0 collisions (0.0%)
box_length 512: 0 collisions (0.0%)

Timing hash algorithm with new_hash_algorithm
box_length 16: 0 milliseconds
box_length 32: 0 milliseconds
box_length 64: 0 milliseconds
box_length 128: 1 milliseconds
box_length 256: 10 milliseconds
box_length 512: 78 milliseconds

Timing insertions in unordered map with new_hash_algorithm
box_length 16: 0.00 seconds
box_length 32: 0.00 seconds
box_length 64: 0.02 seconds
box_length 128: 0.18 seconds
box_length 256: 1.54 seconds
box_length 512: 13.25 seconds

Count collisions with boost_hash_algorithm
box_length 16: 0 collisions (0.0%)
box_length 32: 0 collisions (0.0%)
box_length 64: 0 collisions (0.0%)
box_length 128: 0 collisions (0.0%)
box_length 256: 0 collisions (0.0%)
box_length 512: 0 collisions (0.0%)

Timing hash algorithm with boost_hash_algorithm
box_length 16: 0 milliseconds
box_length 32: 0 milliseconds
box_length 64: 0 milliseconds
box_length 128: 2 milliseconds
box_length 256: 13 milliseconds
box_length 512: 101 milliseconds

Timing insertions in unordered map with boost_hash_algorithm
box_length 16: 0.00 seconds
box_length 32: 0.00 seconds
box_length 64: 0.04 seconds
box_length 128: 0.73 seconds
box_length 256: 7.47 seconds
box_length 512: 83.00 seconds
```
Benchmark results on a 32-bit computer

Note: collisions were tested with box_length=480 instead of 512 to reduce the memory footprint and avoid a `std::bad_alloc`.

```
Count collisions with current_hash_algorithm
box_length 16: 0 collisions (0.0%)
box_length 32: 786 collisions (0.0%)
box_length 64: 48'378 collisions (0.2%)
box_length 128: 1'771'416 collisions (0.8%)
box_length 256: 16'510'817 collisions (0.9%)
box_length 480: 111'387'188 collisions (1.0%)

Timing hash algorithm with current_hash_algorithm
box_length 16: 0 milliseconds
box_length 32: 0 milliseconds
box_length 64: 0 milliseconds
box_length 128: 3 milliseconds
box_length 256: 24 milliseconds

Timing insertions in unordered map with current_hash_algorithm
box_length 16: 0.00 seconds
box_length 32: 0.01 seconds
box_length 64: 0.05 seconds
box_length 128: 0.46 seconds
box_length 256: 5.35 seconds

Count collisions with new_hash_algorithm
box_length 16: 0 collisions (0.0%)
box_length 32: 0 collisions (0.0%)
box_length 64: 0 collisions (0.0%)
box_length 128: 0 collisions (0.0%)
box_length 256: 0 collisions (0.0%)
box_length 480: 0 collisions (0.0%)

Timing hash algorithm with new_hash_algorithm
box_length 16: 0 milliseconds
box_length 32: 0 milliseconds
box_length 64: 0 milliseconds
box_length 128: 3 milliseconds
box_length 256: 24 milliseconds

Timing insertions in unordered map with new_hash_algorithm
box_length 16: 0.00 seconds
box_length 32: 0.01 seconds
box_length 64: 0.07 seconds
box_length 128: 0.51 seconds
box_length 256: 3.86 seconds

Count collisions with boost_hash_algorithm
box_length 16: 0 collisions (0.0%)
box_length 32: 0 collisions (0.0%)
box_length 64: 14 collisions (0.0%)
box_length 128: 646 collisions (0.0%)
box_length 256: 35'821 collisions (0.0%)
box_length 480: 1'482'657 collisions (0.0%)

Timing hash algorithm with boost_hash_algorithm
box_length 16: 0 milliseconds
box_length 32: 0 milliseconds
box_length 64: 1 milliseconds
box_length 128: 4 milliseconds
box_length 256: 33 milliseconds

Timing insertions in unordered map with boost_hash_algorithm
box_length 16: 0.00 seconds
box_length 32: 0.01 seconds
box_length 64: 0.15 seconds
box_length 128: 1.57 seconds
box_length 256: 14.79 seconds
```
While the performance gain of the new hash implementation over the current one is noticeable for cell indices >= 64, it's unclear to me how useful that is for end users. Maintaining a hashmap of global cell indices that contains the indices of more than 10% of the blockforest cells is quite a specific use case.
Due to the memory-intensive nature of the benchmark, it's unclear what happens beyond 512 (maybe my implementation wraps around eventually); for such a use case I would prefer the Boost version. Even though it is slower and already shows a few collisions on 32-bit architectures, it is definitely better tested.
I'm sharing this in case it could prove useful to anyone. If you don't think there are other use cases where this collision rate could be a problem, feel free to close this ticket.