Like most other scripting languages Python comes with a built-in dictionary type based on hash tables. Unlike other languages the rules for hashing different types are not baked into the language definition or the dictionary implementation. Instead, the object-oriented nature of Python is leveraged to allow types, including built-in types, to define their own hash functions. Objects are said to be hashable if they define a hash function and a matching comparison operator.
As we discussed in the introductory post of this series, the hash function and comparison operator cannot be defined independently: hash tables only work correctly if any two objects that are “equal” also have the same hash key. This poses a special challenge for numeric types: Python can represent numbers in a variety of ways, including arbitrary-precision integers, rationals, binary and decimal floating point numbers, or complex numbers. Since all of these number types can be compared to each other, each type must have a hash function that ensures that numbers that are numerically equal also map to the same hash key. For example, the number “two” should have the same hash key, regardless of whether it’s stored as an integer “2”, as a floating point number “2.0”, as a fraction “4 / 2”, or as a complex number “2.0 + 0.0i”.
In the previous article we saw how Lua solves this problem by switching between different hash function on the fly: integral floating point numbers and integers use one hash function and fractional numbers another. Python takes a different approach: it defines a unified hash function for all numeric types. In this article I will explain how this hash function works and how it can be computed efficiently for the special case of binary floating point numbers. The resulting algorithm is an ingenious demonstration of how to combine floating point arithmetic, modular arithmetic, and bit operations. Let’s find out how it works!
The main idea behind Python’s implementation is to treat every number as a fraction a / b and to compute a unique hash value from the two integers a and b by performing the division modulo a prime number P. Here is the formula for non-negative fractions; for negative fractions we simply set hash(x) = -hash(-x).
hash(a / b) = (a mod P) /P (b mod P)
The operator “/P” on the right-hand side performs division modulo P and can be implemented by multiplication with the modular multiplicative inverse.
The hash method in Python’s
Fraction class is a direct implementation of this algorithm:
class Fraction: def __hash__(self): # Compute multiplicative inverse dinv = pow(self._denominator, _PyHASH_MODULUS - 2, _PyHASH_MODULUS) if not dinv: hash_ = _PyHASH_INF else: hash_ = abs(self._numerator) * dinv % _PyHASH_MODULUS result = hash_ if self >= 0 else -hash_ # Return value -1 is reserved for errors return -2 if result == -1 else result
In this code,
_PyHASH_MODULUS is the prime P and
_PyHASH_INF is a special hash value that is returned for infinity and numbers that have no multiplicative inverse. The multiplicative inverse is computed using a formula that follows from Fermat’s little theorem in number theory.
How do we implement this hash function for floating point numbers?
Every floating point number can be written as m·2e, and it’s easy to rewrite this as a fraction a / b. If the exponent e ≥ 0 the number m·2e is an integer and we have a = m · 2e and b=1. If e < 0 the number is fractional and we have a = m and b=2-e. Note that in either case b is a power of two, which will turn out to be the key to computing the modular division efficiently. The main problem is that these numbers a and b can be huge, so we would have to use arbitrary precision arithmetic to compute the hash value.
But there is a better way: the crucial insight is that modular multiplication and division by a power of two can be implemented using bit rotations if the prime P is chosen to be a Mersenne prime. Mersenne primes are primes that have the special form Mp=2p-1. They are quite rare, but fortunately for us computer scientists, there are two Mersenne primes that fit nicely into 32-bit and 64-bit integers: Python uses M31=231-1 on 32-bit machines and M61=261-1 on 64-bit machines.
The fact that Mersenne primes are one less than a power of two has an important consequence: it implies that doubling a number modulo Mp simply rotates the bits of that number by one place. More specifically, given a number a with 0 ≤ a < Mp (in other words a is any p-bit number except Mp), we have
The “rotl” function rotates the bits of the p-bit number a. It’s easy to see that this is true. If the highest bit of a is set, we can write a · 2 = 2p + r, where r contains the lower bits of a shifted to the left. In this case we have (2p + r) mod Mp = 1 + r = rotl(a, 1). Otherwise, if the highest bit of a isn’t set, all the bits are simply shifted to the left and we have a · 2 mod Mp = a · 2 = rotl(a, 1).
More generally, for any non-negative number a we have
and, likewise for divisions
By starting with a and repeatedly multiplying or dividing by two, we obtain the following expressions for arbitrary powers of two:
Not only does this transformation reduce the problem of modular multiplication and division to simple bit rotations, it also reduces the magnitude of all numbers to a range where we can use machine arithmetic and do not even have to worry about integer overflow.
This last equation allows us to efficiently compute the hash key for floating point numbers. The whole algorithm proceeds as follows:
- Replace x by its absolute value, but remember the original sign for later.
- Find non-negative integers a and e so that x = a / 2e.
- Compute a mod Mp. In practice, this step can be combined with the previous step by extracting bits from the mantissa of x, accumulating them into a mod Mp and adjusting e as we go. See Python’s implementation linked below for details.
- Compute (a / 2e) mod Mp by rotating the bits of (a mod Mp) according to the above equation.
- Negate the hash key if x was originally negative.
And that’s how Python hashes floating point numbers. The original implementation follows the above description except for some boundary cases. It also uses the following trick to express right rotations as left rotations:
It’s obvious that a lot of thought went into Pythons design for numeric hash functions. Even though it’s not the only language that maps numerically equal values to the same hash key, its solution based on a unified hash function is certainly unique.