Like most other scripting languages, Python comes with a built-in dictionary type based on hash tables. *Unlike* most other languages, the rules for hashing different types are not baked into the language definition or the dictionary implementation. Instead, Python leverages its object-oriented nature to allow types, including built-in types, to define their own hash functions. Objects are said to be *hashable* if they define a hash function and a matching comparison operator.

As we discussed in the introductory post of this series, the hash function and comparison operator cannot be defined independently: hash tables only work correctly if any two objects that are “equal” also have the same hash key. This poses a special challenge for numeric types: Python can represent numbers in a variety of ways, including arbitrary-precision integers, rationals, binary and decimal floating point numbers, or complex numbers. Since all of these number types can be compared to each other, each type must have a hash function that ensures that **numbers that are numerically equal also map to the same hash key**. For example, the number “two” should have the same hash key, regardless of whether it’s stored as an integer “2”, as a floating point number “2.0”, as a fraction “4 / 2”, or as a complex number “2.0 + 0.0i”.
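This invariant can be checked directly in CPython:

```python
from fractions import Fraction

# All of these represent the number "two", so they must hash alike.
values = [2, 2.0, Fraction(4, 2), complex(2.0, 0.0)]
hashes = {hash(v) for v in values}
assert len(hashes) == 1  # one hash key for one numeric value
```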

In the previous article we saw how Lua solves this problem by switching between different hash functions on the fly: integral floating point numbers and integers use one hash function, fractional numbers another. Python takes a different approach: it defines a **unified hash function for all numeric types**. In this article I will explain how this hash function works and how it can be computed efficiently for the special case of binary floating point numbers. The resulting algorithm is an ingenious demonstration of how to combine floating point arithmetic, modular arithmetic, and bit operations. Let’s find out how it works!

The main idea behind Python’s implementation is to treat every number as a fraction *a / b* and to compute a unique hash value from the two integers *a* and *b* by performing the division modulo a prime number *P*. Here is the formula for non-negative fractions; for negative fractions we simply set hash(*x*) = -hash(-*x*).

hash(*a* / *b*) = (*a* mod *P*) /_{P} (*b* mod *P*)

The operator “/_{P}” on the right-hand side performs division modulo *P* and can be implemented by multiplication with the modular multiplicative inverse.
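As a sketch, the formula can be reproduced with Python’s built-in three-argument `pow` and compared against the hash of an actual `Fraction`. The helper name `fraction_hash` is mine; the modulus is taken from `sys.hash_info`, so this works on both 32-bit and 64-bit builds:

```python
import sys
from fractions import Fraction

P = sys.hash_info.modulus  # the prime P (2**61 - 1 on 64-bit builds)

def fraction_hash(a, b):
    """hash(a / b) = (a mod P) * inverse(b mod P) mod P, for a >= 0, b > 0."""
    inv = pow(b % P, P - 2, P)  # modular inverse via Fermat's little theorem
    return (a % P) * inv % P

assert fraction_hash(1, 3) == hash(Fraction(1, 3))
assert fraction_hash(22, 7) == hash(Fraction(22, 7))
```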

The hash method in Python’s `Fraction` class is a direct implementation of this algorithm:

```python
class Fraction:
    def __hash__(self):
        # Compute multiplicative inverse
        dinv = pow(self._denominator, _PyHASH_MODULUS - 2, _PyHASH_MODULUS)
        if not dinv:
            hash_ = _PyHASH_INF
        else:
            hash_ = abs(self._numerator) * dinv % _PyHASH_MODULUS
        result = hash_ if self >= 0 else -hash_
        # Return value -1 is reserved for errors
        return -2 if result == -1 else result
```

In this code, `_PyHASH_MODULUS` is the prime *P* and `_PyHASH_INF` is a special hash value that is returned for infinity and for numbers that have no multiplicative inverse. The multiplicative inverse is computed using a formula that follows from *Fermat’s little theorem* in number theory.
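A quick sanity check of this inverse, using the modulus that CPython reports through `sys.hash_info`:

```python
import sys

P = sys.hash_info.modulus   # the Mersenne prime used by this build
b = 7
inv = pow(b, P - 2, P)      # Fermat: b**(P-2) is congruent to b**(-1) (mod P)
assert b * inv % P == 1     # inv really is the multiplicative inverse of b
```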

How do we implement this hash function for floating point numbers?

Every floating point number can be written as *m*·2^{e}, and it’s easy to rewrite this as a fraction *a* / *b*. If the exponent *e* ≥ 0, the number *m*·2^{e} is an integer and we have *a* = *m*·2^{e} and *b* = 1. If *e* < 0, the number is fractional and we have *a* = *m* and *b* = 2^{-e}. Note that in either case *b* is a power of two, which will turn out to be the key to computing the modular division efficiently. The main problem is that these numbers *a* and *b* can be huge, so we would have to use arbitrary precision arithmetic to compute the hash value.
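Python can already perform this decomposition for us: `float.as_integer_ratio` returns *a* and *b* (with *b* a power of two), and `math.frexp` exposes the mantissa and exponent directly:

```python
import math

# For binary floats, the denominator is always a power of two.
a, b = (3.5).as_integer_ratio()
assert (a, b) == (7, 2)           # 3.5 == 7 / 2

a, b = (0.1).as_integer_ratio()
assert b & (b - 1) == 0           # b is a power of two
assert a / b == 0.1               # a / 2**e reproduces the float exactly

m, e = math.frexp(675.0)          # 675.0 == m * 2**e with 0.5 <= m < 1
assert m * 2 ** e == 675.0
```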

But there is a better way: the crucial insight is that **modular multiplication and division by a power of two can be implemented using bit rotations** if the prime *P* is chosen to be a *Mersenne prime*. Mersenne primes are primes that have the special form *M _{p}* = 2^{p} - 1. They are quite rare, but fortunately for us computer scientists, there are two Mersenne primes that fit nicely into 32-bit and 64-bit integers: Python uses *M _{31}* = 2^{31} - 1 on 32-bit machines and *M _{61}* = 2^{61} - 1 on 64-bit machines.

The fact that Mersenne primes are one less than a power of two has an important consequence: it implies that doubling a number modulo *M _{p}* simply rotates the bits of that number by one place. More specifically, given a number *a* with 0 ≤ *a* < *M _{p}* (in other words, *a* is any *p*-bit number except *M _{p}*), we have

*a* · 2 mod *M _{p}* = rotl(*a*, 1)

The “rotl” function rotates the bits of the *p*-bit number *a* one place to the left. It’s easy to see that this is true. If the highest bit of *a* is set, we can write *a* · 2 = 2^{p} + *r*, where *r* contains the lower bits of *a* shifted to the left. In this case we have (2^{p} + *r*) mod *M _{p}* = 1 + *r* = rotl(*a*, 1). Otherwise, if the highest bit of *a* isn’t set, all the bits are simply shifted to the left and we have *a* · 2 mod *M _{p}* = *a* · 2 = rotl(*a*, 1).
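The doubling-as-rotation identity is easy to verify with a small helper (the names `rotl` and `M` are mine, not Python’s):

```python
P_BITS = 61
M = 2 ** P_BITS - 1          # the Mersenne prime 2**61 - 1

def rotl(a, n, p=P_BITS):
    """Rotate the p-bit number a left by n places."""
    n %= p
    return ((a << n) | (a >> (p - n))) & M

# Doubling modulo M is a one-bit left rotation, for any 0 <= a < M.
for a in (1, 12345, M - 1, 2 ** 60 + 17):
    assert a * 2 % M == rotl(a, 1)
```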

More generally, for any non-negative number *a* we have

(*a* · 2) mod *M _{p}* = rotl(*a* mod *M _{p}*, 1)

and, likewise, for divisions

(*a* / 2) mod *M _{p}* = rotr(*a* mod *M _{p}*, 1)

By starting with *a* and repeatedly multiplying or dividing by two, we obtain the following expressions for *arbitrary powers of two*:

(*a* · 2^{e}) mod *M _{p}* = rotl(*a* mod *M _{p}*, *e* mod *p*)

(*a* / 2^{e}) mod *M _{p}* = rotr(*a* mod *M _{p}*, *e* mod *p*)

Not only does this transformation reduce the problem of modular multiplication and division to simple bit rotations, it also reduces the magnitude of all numbers to a range where we can use machine arithmetic and do not even have to worry about integer overflow.
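These rotation identities can be checked numerically with a hypothetical `rotl`/`rotr` helper pair, where division by 2^{e} means modular multiplication by the inverse of 2^{e}:

```python
P_BITS = 61
M = 2 ** P_BITS - 1          # the Mersenne prime 2**61 - 1

def rotl(a, n, p=P_BITS):
    """Rotate the p-bit number a left by n places."""
    n %= p
    return ((a << n) | (a >> (p - n))) & M

def rotr(a, n, p=P_BITS):
    """Rotate right = rotate left by the complementary amount."""
    return rotl(a, p - n % p)

INV2 = pow(2, M - 2, M)      # modular inverse of 2, for "division" by powers of two

for a in (7, 987654321, 2 ** 60 - 3):
    for e in (1, 5, 61, 100):
        # multiplication by 2**e is a left rotation by e mod p ...
        assert a * 2 ** e % M == rotl(a % M, e)
        # ... and division by 2**e (multiplication by 2**-e) is a right rotation
        assert a * pow(INV2, e, M) % M == rotr(a % M, e)
```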

This last equation allows us to efficiently compute the hash key for floating point numbers. The whole algorithm proceeds as follows:

- Replace *x* by its absolute value, but remember the original sign for later.
- Find non-negative integers *a* and *e* so that *x* = *a* / 2^{e}.
- Compute *a* mod *M _{p}*. In practice, this step can be combined with the previous step by extracting bits from the mantissa of *x*, accumulating them into *a* mod *M _{p}* and adjusting *e* as we go. See Python’s implementation linked below for details.
- Compute (*a* / 2^{e}) mod *M _{p}* by rotating the bits of (*a* mod *M _{p}*) according to the above equation.
- Negate the hash key if *x* was originally negative.
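A simplified sketch of this procedure, skipping CPython’s bit-by-bit mantissa loop and the special cases for infinities and NaN (the helper names `float_hash` and `rotl` are mine):

```python
import sys

M = sys.hash_info.modulus          # 2**61 - 1 on 64-bit builds
P_BITS = M.bit_length()            # 61

def rotl(a, n):
    """Rotate the P_BITS-bit number a left by n places."""
    n %= P_BITS
    return ((a << n) | (a >> (P_BITS - n))) & M

def float_hash(x):
    sign = -1 if x < 0 else 1
    x = abs(x)                     # step 1: drop the sign
    a, b = x.as_integer_ratio()    # step 2: x = a / 2**e with b = 2**e
    e = b.bit_length() - 1
    h = rotl(a % M, -e % P_BITS)   # steps 3-4: divide by 2**e via rotation
    h *= sign                      # step 5: restore the sign
    return -2 if h == -1 else h    # -1 is reserved for errors

for x in (2.0, 0.5, -1.0, -3.75, 1e300, 1e-300):
    assert float_hash(x) == hash(x)
```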

And that’s how Python hashes floating point numbers. The original implementation follows the above description except for some boundary cases. It also uses the following trick to express right rotations as left rotations:

rotr(*a*, *e*) = rotl(*a*, (*p* - *e*) mod *p*)

It’s obvious that a lot of thought went into Python’s design for numeric hash functions. Even though it’s not the only language that maps numerically equal values to the same hash key, its solution based on a unified hash function is certainly unique.