Hash Function for Floats

Hash function for floats

It depends on the application but most of time floats should not be hashed because hashing is used for fast lookup for exact matches and most floats are the result of calculations that produce a float which is only an approximation to the correct answer. The usually way to check for floating equality is to check if it is within some delta (in absolute value) of the correct answer. This type of check does not lend itself to hashed lookup tables.

EDIT:

Normally, because of rounding errors and inherent limitations of floating point arithmetic, if you expect that floating point numbers a and b should be equal to each other because the math says so, you need to pick some relatively small delta > 0, and then you declare a and b to be equal if abs(a-b) < delta, where abs is the absolute value function. For more detail, see this article.

Here is a small example that demonstrates the problem:

float x = 1.0f;
x = x / 41;
x = x * 41;
if (x != 1.0f)
{
    std::cout << "ooops...\n";
}

Depending on your platform, compiler and optimization levels, this may print ooops... to your screen, meaning that the mathematical equation x / y * y = x does not necessarily hold on your computer.

There are cases where floating point arithmetic produces exact results, e.g. reasonably sized integers and rationals with power-of-2 denominators.

implementing a hash table-like data structure with floating point keys where values within a tolerance are binned together

The best way to do this is to use fixed point arithmetic.

The implementation in the question details works but is more obfuscated than it needs to be. What it treats as an epsilon or a tolerance is actually a "bin width" -- a one-dimensional spacing between grid lines partitioning the real number line -- and thus if you are expecting the epsilon value to act like a tolerance you will notice counter-intuitive behavior around the edges of bins / near grid lines.

In any case a clearer way to think about this problem is to not try to use a notion of "tolerance" but instead use the notion of "significant digits". Treat only n base-10 digits right of the decimal as mattering and parametrize on that n. What this results in essentially is using fixed point values as keys rather than floating point values; in the above implementation it is akin to using an epsilon of 0.0001 instead of 0.0005.

But rather than just modifying the epsilon in the original code, there is now no reason to not just make the fixed point values a public type and using that type as the key of an unordered_map exposed to the user. Previously we wanted to hide the key type by wrapping the implementation's unordered_map in a custom data structure, because in that case the keys were opaque, didn't have an intuitive meaning. Using fixed point keys in a normal unordered_map has the side benefit of making it such that we do not have to implement wrapper methods for all the standard std::unordered_map calls since the user is now given an actual unordered_map.

code below:

template<int P=4>
class fixed_point_value
{
    static constexpr double calc_scaling_factor(int digits_of_precision)
    {
        return (digits_of_precision == 1) ? 10.0 : 10.0 * calc_scaling_factor(digits_of_precision - 1);
    }

    static constexpr double scaling_factor = calc_scaling_factor(P);

    template<int P>
    friend struct fixed_point_hash;

public:
    fixed_point_value(float val) : 
        impl_(static_cast<long long>(std::llround(scaling_factor * val)))
    {}

    bool operator==(fixed_point_value<P> fpv) const 
    {
        return impl_ == fpv.impl_;
    }

    float to_float() const
    {
        return static_cast<float>(impl_ / scaling_factor);
    }

private:
    long long impl_;
};

template<int P = 4>
struct fixed_point_hash
{
    std::size_t operator()(fixed_point_value<P> key) const {
        return std::hash<long long>{}(key.impl_);
    }
};

template<typename V, int P = 4>
using fixed_point_table = std::unordered_map<fixed_point_value<P>, V, fixed_point_hash<P>>;

int main()
{
    fixed_point_table<std::string, 4> t4;

    t4[12.0453f] = "yes";

    // these will all be "no" except 12.0453f because we have 4 base-10 digits of precision i.e.
    // 4 digits right of the decimal must be an exact match
    std::cout << "precision = 4" << std::endl;
    std::cout << "12.0446f => " << (t4.find(12.0446f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0447f => " << (t4.find(12.0447f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0448f => " << (t4.find(12.0448f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0449f => " << (t4.find(12.0449f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0450f => " << (t4.find(12.0450f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0451f => " << (t4.find(12.0451f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0452f => " << (t4.find(12.0452f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0453f => " << (t4.find(12.0453f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0454f => " << (t4.find(12.0454f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0455f => " << (t4.find(12.0455f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0456f => " << (t4.find(12.0456f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0457f => " << (t4.find(12.0457f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0458f => " << (t4.find(12.0458f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0459f => " << (t4.find(12.0459f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0460f => " << (t4.find(12.0460f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "\n";

    fixed_point_table<std::string, 3> t3;
    t3[12.0453f] = "yes"; // 12.0453 will round to the fixed point value 12.045.
    std::cout << "precision = 3" << std::endl;
    std::cout << "12.0446f => " << (t3.find(12.0446f) != t3.end() ? "yes" : "no") << std::endl; // rounds to 12.045 so yes;
    std::cout << "12.0447f => " << (t3.find(12.0447f) != t3.end() ? "yes" : "no") << std::endl; // rounds to 12.045 so yes;
    std::cout << "12.0448f => " << (t3.find(12.0448f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0449f => " << (t3.find(12.0449f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0450f => " << (t3.find(12.0450f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0451f => " << (t3.find(12.0451f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0452f => " << (t3.find(12.0452f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0453f => " << (t3.find(12.0453f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0454f => " << (t3.find(12.0454f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0455f => " << (t3.find(12.0455f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0456f => " << (t3.find(12.0456f) != t3.end() ? "yes" : "no") << std::endl; // 12.0456f rounds to the 3 digits of precison fixed point value 12.046 so no
    std::cout << "12.0457f => " << (t3.find(12.0457f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0458f => " << (t3.find(12.0458f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0459f => " << (t3.find(12.0459f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0460f => " << (t3.find(12.0460f) != t3.end() ? "yes" : "no") << std::endl; // '

}

Does std::hash guarantee equal hashes for equal floating point numbers?

std::hash has same guarantees for all types over which it can
be instantiated: if two objects are equal, their hash codes will
be equal. Otherwise, there's a very large probability that they
won't. So you can rely on a double as a key in an
unordered_map to work as expected: if two doubles are not
equal (as defined by ==), they will probably have a different
hash (and even if they don't, they're different keys, because
unordered_map also checks for equality).

Obviously, if your values are the results of inexact
calculations, they aren't appropriate keys for unordered_map
(nor perhaps for any map).

Hashing floating point values

Why are you wanting to hash floating point values? For the same reason that comparing floating point values for equality has a number of pitfalls, hashing them can have similar (negative) consequences.

However given that you really do want to do this, I suspect that the boost algorithm is complicated because when you take into account denormalized numbers different bit patterns can represent the same number (and should probably have the same hash). In IEEE 754 there are also both positive and negative 0 values that compare equal but have different bit patterns.

This probably wouldn't come up in the hashing if it wouldn't have come up otherwise in your algorithm but you still need to take care about signaling NaN values.

Additionally what would be the meaning of hashing +/- infinity and/or NaN? Specifically NaN can have many representations, should they all result in the same hash? Infinity seems to have just two representations so it seems like it would work out ok.

Hash Codes for Floats in Java

These auto-generated hashcode functions are not very good.

The problem is that small integers cause very "sparse" and similar bitcodes.

To understand the problem, look at the actual computation.

System.out.format("%x\n", Float.floatToIntBits(1));
System.out.format("%x\n", Float.floatToIntBits(-1));
System.out.format("%x\n", Float.floatToIntBits(3));
System.out.format("%x\n", Float.floatToIntBits(-3));

gives:

As you can see, the - is the most significant bit in IEEE floats. Multiplication with 31 changes them not substantially:

The problem are all the 0s at the end. They get preserved by integer multiplication with any prime (because they are base-2 0s, not base-10!).

So IMHO, the best strategy is to employ bit shifts, e.g.:

final int h1 = Float.floatToIntBits(x);
final int h2 = Float.floatToIntBits(z);
return h1 ^ ((h2 >>> 16) | (h2 << 16));

But you may want to look at Which hashing algorithm is best for uniqueness and speed? and test for your particular case of integers-as-float.

Hash Function for Floats