Using Floating Point Numbers as Hash Keys

Strings and integers are the most common keys for hash tables, but in principle every type can be used as a hash key, provided we define a suitable comparison operator and hash function. Most languages that support hash tables also provide hash functions for all built-in types, including floating point numbers. In this series of articles I will take a look at how languages like Lua, Python, JavaScript, and C++ handle floating point keys. It turns out that there are significant differences between these languages, both in the way they handle special values like NaN and in the way they define their hash functions.

This article addresses a few general questions: what is the difference between floating point numbers and other types of hash keys? What are common pitfalls when defining hash functions for floats? How do we handle special values such as minus zero, infinity, and NaN?

Floating point keys aren’t widely used and a common recommendation is to avoid them altogether. The main argument against floating point numbers as keys is that they are “inaccurate” and therefore unsuitable for indexing hash tables. This is, of course, a common problem with floating point numbers, and it shouldn’t be surprising that rounding and truncation errors can lead to unexpected results when they affect table keys:

>>> table = {}
>>> table[0.3] = "Hello"
>>> table[0.1 + 0.2] = "World"
>>> table
{0.30000000000000004: 'World', 0.3: 'Hello'}

In general, floating point numbers as hash keys is error-prone for the same reason comparing them for equality is error-prone. There are many ways to compare floating point numbers, but unfortunately most of them are useless for hash tables: it’s easy to define different notions of “approximately equal”, but it’s hard or impossible to define hash functions that map approximately equal keys to identical hash values. Why? Let’s take a look at what happens with the common *epsilon comparison*, for example: two numbers a, b are considered equal if |a – b| < ε. With this definition of equality all numbers between -ε/2; and ε/2 are considered equal and therefore must have the same hash value h. But the numbers between 0 and ε are also equal to each other, so their hash value must be h as well. If we continue in this manner we see that all all numbers must have the same hash value h regardless of the choice of ε. The idea of approximate comparison is unfortunately hard to reconcile with the non-approximate nature of hash functions.

But we can also turn the argument from previous paragraph around: comparing two floating point numbers for equality may be error-prone in general, but in situation where it is safe, it’s also safe to use the numbers as hash table keys. After all, there’s nothing inaccurate about floating point numbers or comparison operations themselves: a number like 1010.101112 · 217 has a precise value, and comparing it to other floats is a well-defined operation. It’s only certain operations like arithmetic or decimal-to-binary conversions that introduce rounding errors. And not even all arithmetic operations on floats are inaccurate. For example, the following equations are always true:

>>> 1.0 + 2.0 == 3.0
True
>>> 0.5 * 2 == 1.0
True

In some scripting languages like Javascript or older versions of Lua, floats are the only numerical type provided by the language, so the language has to deal with floating point keys at some level.

How do you define a hash function for floating point numbers? Here are the basic requirements:

  1. Values x should be mapped to unsigned integers hash(x) that can be used to index the hash table.
  2. If x == y then hash(x) == hash(y) so that we can find keys in the hash table after inserting them.
  3. If x != y then hash(x) != hash(y) with high probability so that there are few collisions and the hash table performs efficiently.

The second condition is the one that’s most interesting because it relates the hash function to the equality operator used by the hash table. This is especially relevant for floating point numbers where the standard == operator has a few special cases that must be taken into account:

  • There are two distinct floating point zeros, +0.0 and -0.0. In general they should map to the same hash value.
  • Especially in dynamically typed languages, floating point numbers may be just one representation for a more general “number” type. In this case, we may want to map all numbers that are numerically equal (e.g., 1.0f, 1.0, and 1) to the same hash value.

NaN values pose a special problem because basically all languages follow the IEEE convention that NaN != NaN. In the context of hash tables this means that that if you insert a NaN key into a hash table, you can never find it using == since the comparison always returns false! Likewise, it becomes possible to insert an arbitrary number of NaN keys into a hash table, as the following Python example demonstrates:

>>> table = {}
>>> table[float('NaN')] = True
>>> table[float('NaN')] = False
>>> table
{nan: True, nan: False}
>>> float('NaN') in table
False

In the next few posts we will see how different programming languages handle all these issues.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s