Floating Point Keys in Lua: a Simple Ad-Hoc Solution

Lua is an economical programming language: it’s fast, compact, and both the language and its standard library are small but surprisingly powerful. There are only a handful of built-in data types and just one language feature for structuring data: tables can hold arbitrary key/value pairs, and represent, depending on the choice of keys and values, everything from arrays, sets, dictionaries, and structures to classes and objects. An array, for example, is a table indexed with integer keys, and a structure is a table indexed with strings that represent the field names.

All built-in types can be used as table keys and values. In this article I will explain how Lua handles numeric keys, and in particular floating point keys. In keeping with Lua’s tradition, the solution is practical and efficient. But, also in keeping with Lua’s tradition, the implementation isn’t trivial and narrowly skirts undefined behavior. Let’s take a look.

Lua has a single number type that represents both floating point and integer numbers. Before version 5.3, Lua internally stored all numbers as double-precision floats. This worked reasonably well, but restricted the range of representable integers and made operations like bit manipulation more difficult. For these reasons, Lua now has the ability to store integers directly, but in a way that is mostly invisible to programmers: the interpreter automatically converts between the two number representations as necessary. For example, 1 and 1.0 have a different internal representation but are treated as the same number.

t = {}
t[1] = "Hello"
t[1.0] = "World"  -- replace existing entry

This behavior is hardcoded into Lua’s table implementation which converts integral floating point keys to integers when looking up numeric keys. One convenient consequence of this rule is that it ensures that both ”+0.0” and ”-0.0” refer to the same table entry.

In addition to integral numbers, NaN values also receive special treatment. As explained in the previous post, the convention that NaN ≠ NaN can cause hash tables to behave inconsistently. For this reason, Lua simply disallows NaN values as table keys: inserting a NaN key fails with a runtime error, and looking up a NaN key always returns nil:

t = {}
t[0.0 / 0.0] = true   -- runtime error: table index is NaN
print(t[0.0 / 0.0])   -- always prints nil

For the remaining floating point numbers, Lua uses a simple hashing scheme. Here is a simplified version of the default hash function for floating point numbers; The original implementation is a bit more complicated since Lua can be configured to use different floating point and integer types to store numbers, but the underlying algorithm is the same:

static int l_hashfloat (double n) {
  if (!isfinite(n))
    return 0;
  int exp;
  double m = frexp(n, &exp);
  int mi = (int)(m * -(double)INT_MIN);
  unsigned u = (unsigned)exp + (unsigned)mi;
  return u >= (unsigned)INT_MAX ? u : ~u;
}

Let’s take a look at how this function works. For NaN and infinity we simply return zero. The remaining finite numbers are decomposed into their mantissa and exponent from which a non-negative hash value is computed.

The function frexp is defined in math.h and makes it possible to access the internal representation of floating point numbers in a portable manner: it computes a signed mantissa m an integer exponent exp so that the original number is m·2exp. The signed mantissa returned by frexp is normalized so that |m| < 1. Mapping m to the whole integer range should therefore be as simple as (int)(m * INT_MAX), but for reasons I will explain below, Lua uses the expression (int)(m * -(double)INT_MIN) instead. The mantissa and the exponent are combined by adding them and the result is converted to a non-negative value that is returned from the function. Unsigned arithmetic is used in these last steps to avoid undefined behavior due to overflow.

Even though the implementation of l_hashfloat is only a few lines of code, understanding it in detail isn’t trivial due to the way it mixes signed, unsigned and floating point arithmetic and bit operations:

  1. Since -INT_MIN is a power of two (int)(m * -(double)INT_MIN) extracts the most significant bits of the mantissa. On architectures with 32-bit integers, the lowest 22 bits of the mantissa are discarded, so floating point values that are close together are mapped to the same hash key.
  2. -INT_MIN > INT_MAX, so you might be concerned about overflow. Luckily, frexpt guarantees that m < 1 and conversion to int truncates the result of the multiplication to the range -INT_MAX ≤ mi ≤ INT_MAX. That’s one aspect of the C standard I wasn’t aware of: conversion from floating point to integer only overflows if the result aftertruncation is out of range; for example (unsigned)-0.5 is well-defined and always 0, but (unsigned)-1.0 is undefined behavior.
  3. Speaking of overflow: even small changes to the code can lead to integer overflow and undefined behavior. For example, we cannot write -INT_MIN because the negation is performed using integer arithmetic which overflows — we have to write -(double)INT_MIN. For the same reason the addition in line 7 and the negation in line 8 must be performed using unsigned integers.
  4. In principle, it would be possible to scale the mantissa using the expression (int)(m * INT_MAX), but the behavior of this expression is actually harder to analyze than the one using -INT_MIN. The main reason is that INT_MAX isn’t a power of two, so converting INT_MAX to double and multiplying it by m can involve rounding. In contrast, the expression using -INT_MIN doesn’t involve rounding since multiplying by a power of two leaves the mantissa of m unchanged and only changes the exponent.
  5. The combination of unsigned arithmetic, potential overflow, and binary negation in the last two steps makes it hard to state precisely what the function actually computes.

My overall impression is that Lua’s hash function for floats was designed for compactness and maybe speed, but not for good distribution of the hash keys. Its main defect is that the least significant bits of the mantissa are simply discarded on systems with 32-bit integers. These collisions may not be an issue for the majority of Lua programs where fractional floating point keys are rare, but as we will see in the next posts, it’s possible to define better hash functions that are only slightly longer, not significantly slower, and easier to understand.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s