Floating Point Keys in Go: How to Make Hash Functions Unpredictable

The Go Programming Language has a built-in map datatype that is implemented using hash tables. The only types that can be used as map keys are those for which the language defines an equality operator ==; this includes primitive types like Booleans, strings, numbers, and pointers, but also composite types like arrays and structures. For all these types Go automatically generates a suitable hash function. Maps are not extensible, however: if you need a hash table that supports custom comparison or hash algorithms you have to write your own.

Floating point numbers can be used as map keys, and since Go follows the normal conventions for comparison, so maps treat all NaN keys as distinct since NaN ≠ NaN. As a consequence NaN keys can be inserted into maps but are unretrievable; in fact, any number of NaN keys can be inserted, and because they are all distinct it’s possible to end up with a map in which every entry is unretrievable. Normally this would be merely a curiosity, an interesting side effect of the quirky semantics of floating point numbers. But this curiosity can lead to an exploitable performance issue if the hash function being used maps every NaN to a single hash code. In this case all the NaN keys collide in the hash table and the runtime cost of inserting n of them drops to O(n2), down from the O(n) we would expect for uniformly distributed hash codes.

Overloading servers by causing hash table collisions has become a a real security threat in recent years. Many languages and hashing algorithms were found to be susceptible to these so-called hash-flooding attacks and had to be updated. In the following I will discuss how Go prevents hash-flooding attacks, in particular for the special of case floating point keys.

Go’s implementation of maps uses a set of runtime functions for comparing and hashing primitive types defined in runtime/alg.go. The low-level hash functions defined in this file have a common interface:

func strhash(a unsafe.Pointer, h uintptr) uintptr {
  ...
}

func f32hash(p unsafe.Pointer, h uintptr) uintptr {
 ...
}

func f64hash(p unsafe.Pointer, h uintptr) uintptr {
 ...
}

The first argument is an untyped pointer p to the data that should be hashed, and the second argument is a seed value h. All functions return a hash code of type uintptr.

The seed value serves two related purposes. If a fixed number h0 is used as the seed, its effect is to choose a particular hash function fh0 from a family of functions. Choosing a different h0 for every run of a program is one common technique of protecting against hash-flooding attacks because it makes hash codes much harder to predict. In Go, every map has its own seed which is initialized to a random number, so all hash codes in Go change from one program run to the next and even from one hash table to the next.

The second purpose of seed values is to chain together a series of hash function calls. To compute the hash code for an array a of floating point numbers, for example, Go hashes each entry individually but seeds each call to f32hash with hash code of all previous entries:

h := h0
for _, v := range a {
  h = f32hash(v, h)
}
return h

Whenever arrays and structures must be hashed, the Go compiler generates a sequence of hash function calls like this.

Let’s now take a look at the implementation of f32hash itself. Depending on the floating point number f that should be hashed, the algorithm distinguishes three cases. If you’ve read the previous articles in this series you will correctly predict that two of these cases are there to handle signed zeros and NaNs:

func f32hash(p unsafe.Pointer, h uintptr) uintptr {
  f := *(*float32)(p)
  switch { 
  case f == 0:
    return c1 * (c0 ^ h) // +0, -0 
  case f != f:
    return c1 * (c0 ^ h ^ uintptr(fastrand())) // any kind of NaN 
  default:  
    return memhash(p, h, 4) 
  }
}

If f is plus or minus zero, the function returns a hash code that only depends on the seed h. The expression c1 * (c0 ^ h) is a variant of a so-called multiplicative hash function, and c0 and c1 are two large integer constants and are used to scramble the bits of h.

If f is NaN, the function returns a random number. This is quite unexpected: hash functions generally have to be deterministic since hash tables require that each key has a single well-defined hash code. It’s easy to see why a random hash code is not a problem in this case: NaN keys are unretrievable anyway, so for the observable behavior of maps it doesn’t matter whether we use a fixed or a random hash code for NaN. But for the runtime performance it does matter. Recall from the introduction of this article that using a fixed hash code for NaN is problematic since inserting n NaN values into a hash table requires O(n2) time. Using a random hash code for NaN ensures that collisions are very unlikely and that inserting any sequence of n floating point numbers into a map runs in expected O(n) time. This blog post by Russ Cox and the discussion that follows explains Go’s choice of a nondeterministic hash function in more detail.

Finally, if f is any other finite or infinite number, it is treated as a 4-byte memory block and hashed using a generic hash function for byte sequences. Note that this is possible only because any two floating point numbers (except zero and NaN) are equal if and only if they have the same bit representation.

Using such a generic hash function has several benefits, the most important being that hash functions for byte sequences are a well-researched problem, so it’s comparatively easy to choose a function that

  • produces well-distributed hash codes;
  • is hard to predict and therefore resilient to hash-flooding attacks;
  • can be implemented efficiently.

In this case, using a generic hash function also has the benefit that we don’t have to deal with the internal encoding of floating point numbers at all.

As a side-effect of treating most float point numbers as blocks of raw memory, the result of f32hash depends on the processor’s byte order. This would be a problem if the hash codes were exchanged between different systems, but Go only requires that hash functions return predictable values during the lifetime of a single program.

So, to summarize, Go uses three techniques to prevent hash flooding attacks for floating point numbers:

  1. All hash functions are seeded using random initial values. In effect, every map uses a different hash function chosen randomly from a family of hash functions.
  2. NaNs are mapped to random hash codes to prevent collisions.
  3. All other floating point numbers except ±0 are hashed using a generic hash function that is designed to be resilient to hash collision attacks.
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s