By Ben B.


2011-11-11 13:49:50 8 Comments

EnumerableObject : IEnumerable<Foo> wraps a List<Foo>.

If a.SequenceEqual(b) for two EnumerableObject instances, then they are equal.

Therefore, a GetHashCode must be implemented. The problem is that XORing the hash of each element in the list will return the same hash code for any list containing the same elements, regardless of order. That still works correctly, but it will result in many collisions, which will slow down retrieval, etc.

What is a good, fast GetHashCode method for lists of objects that is order dependent?
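To make the collision problem concrete, here is a small sketch (using int elements in place of Foo) comparing a plain XOR combine, which is commutative and therefore order-blind, against a multiply-and-add combine, which weights each position differently:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class HashDemo
{
    // XOR combine: commutative, so element order is lost.
    public static int XorHash(IEnumerable<int> items) =>
        items.Aggregate(0, (h, x) => h ^ x.GetHashCode());

    // Multiply-and-add combine: each position gets a different weight,
    // so reordering the elements changes the result.
    public static int OrderedHash(IEnumerable<int> items)
    {
        unchecked
        {
            return items.Aggregate(19, (h, x) => h * 31 + x.GetHashCode());
        }
    }

    public static void Main()
    {
        var a = new List<int> { 1, 2, 3 };
        var b = new List<int> { 3, 2, 1 };
        Console.WriteLine(XorHash(a) == XorHash(b));         // True: collision
        Console.WriteLine(OrderedHash(a) == OrderedHash(b)); // False
    }
}
```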


@MovGP0 2018-01-10 16:55:25

The default .GetHashCode() method usually just returns a hash based on the object reference. This is because calculating the hash code of every item in an enumerable list can be very time-intensive. Instead of overriding the existing behaviour, I prefer to use an extension method, and apply it only where the hash code needs to be determined deterministically from the sequence contents:

using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
    public static int GetSequenceHashCode<TItem>(this IEnumerable<TItem> list)
    {
        if (list == null) return 0;
        const int seedValue = 0x2D2816FE;
        const int primeNumber = 397;
        // item == null (rather than Equals(item, default(TItem))) so that
        // default-valued structs still contribute their own hash codes.
        return list.Aggregate(seedValue, (current, item) =>
            unchecked((current * primeNumber) + (item == null ? 0 : item.GetHashCode())));
    }
}

@Jon Skeet 2011-11-11 13:54:05

I'd do it the same way I normally combine hash codes - with an addition and a multiplication:

public override int GetHashCode()
{
    unchecked
    {
        int hash = 19;
        foreach (var foo in foos)
        {
            hash = hash * 31 + foo.GetHashCode();
        }
        return hash;
    }
}

(Note that you shouldn't add anything to the list after this has been used for the key in a hash table of any description, as the hash will change. This also assumes that there are no null entries - if there could be, you need to take account of that.)
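If the list can contain nulls, one common tweak (my addition, not part of the answer itself; the wrapping class shape here is assumed for the sake of a runnable example) is to substitute a constant for null entries:

```csharp
using System.Collections.Generic;

class EnumerableObject
{
    private readonly List<object> foos = new List<object>();

    public EnumerableObject(IEnumerable<object> items) { foos.AddRange(items); }

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 19;
            foreach (var foo in foos)
            {
                // A null entry contributes 0 instead of throwing.
                hash = hash * 31 + (foo?.GetHashCode() ?? 0);
            }
            return hash;
        }
    }
}
```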

@Vlad 2011-11-11 13:57:59

Strangely enough, List<T>'s implementation of GetHashCode is just inherited from object.

@Ben B. 2011-11-11 14:20:26

Because List<T> equality is not sequence equality.

@Jon Skeet 2012-11-27 06:44:48

@MK_Dev: Well, they're both primes, which generally helps. I'll confess to not understanding the maths behind why this hash generally works well. Multiplication by 31 is easily optimizable to a shift and a subtraction, which makes it attractive. There's a page explaining this kind of hash somewhere around, but I can't remember where, I'm afraid :(
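The shift-and-subtract optimisation mentioned here rests on the identity x * 31 = x * 32 - x = (x << 5) - x, which holds for any int in unchecked 32-bit arithmetic:

```csharp
using System;

static class Mul31Demo
{
    public static void Main()
    {
        unchecked
        {
            int h = 0x2D2816FE; // arbitrary hash state
            // Multiplication by 31 rewritten as a shift and a subtraction.
            Console.WriteLine(h * 31 == (h << 5) - h); // True
        }
    }
}
```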

@David Schwartz 2013-12-02 20:18:10

@Jon You're probably referring to this page, which you mentioned in this answer.

@Jon Skeet 2013-12-02 20:22:21

@DavidSchwartz: Hmm - I think I've seen a longer page about Bernstein, but that's definitely better than nothing :)

@nawfal 2013-12-14 18:00:57

@JonSkeet a null check in the hash function will add to completeness :)

@Jon Skeet 2013-12-14 20:40:41

@nawfal: I'd rather not distract from the general approach, to be honest. Hopefully anyone with a collection including null references can figure it out...

@Rand Random 2016-06-17 08:28:14

@JonSkeet does that solution produce consistent results with a List of strings across multiple AppDomains? I have a WCF server where multiple applications from various platforms (XP, Vista, Win7, or different .NET Frameworks 3.5 and higher) can connect, and in all those situations I need a consistent hash code from a list of strings. Is that the case? If not, how would I achieve that?

@Jon Skeet 2016-06-17 10:29:46

@RandRandom: You shouldn't be using GetHashCode for that - it's not meant to be consistent across different processes. You should use something like SHA-256 for that.
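A sketch of the SHA-256 approach Jon suggests, for a process-independent hash of a sequence of strings. The length-prefixed framing below is one choice I am assuming here (so that {"ab","c"} and {"a","bc"} hash differently), not a prescribed format:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class StableHash
{
    // Produces the same value for the same sequence of strings in any
    // process or runtime, unlike GetHashCode().
    public static string HashStrings(IEnumerable<string> strings)
    {
        var sb = new StringBuilder();
        foreach (var s in strings)
        {
            // Length-prefix each entry so concatenation is unambiguous.
            sb.Append(s.Length).Append(':').Append(s);
        }
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
            return Convert.ToBase64String(hash);
        }
    }
}
```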

@Rand Random 2016-06-17 11:32:53

@JonSkeet can you share your idea in that question: stackoverflow.com/questions/37880511/…

@Muhammad Rehan Saeed 2019-06-12 08:06:50

What about null vs empty collections? In this case, both will have hash codes of 19. Should they be different? Why would you want that to be so?

@Jon Skeet 2019-06-12 08:45:43

@MuhammadRehanSaeed: What do you mean by a "null collection"? You can't call GetHashCode on a null collection. If foos is null, the code in the answer will throw an exception. If it can be null, then the method should take that into account. Whether null and empty should be considered equal is a context-specific decision.

@Muhammad Rehan Saeed 2019-06-12 09:41:02

@JonSkeet I meant the case where foos could be null. Why would you differentiate null from an empty collection? Under what context is such a differentiation valid? A colleague is proposing to do so, and I have no answer. stackoverflow.com/questions/56557128/…

@Jon Skeet 2019-06-12 10:45:41

@MuhammadRehanSaeed: You should ask your colleague rather than us. But if both states are valid, it seems perfectly reasonable to differentiate between them. (Someone carrying an empty box isn't the same as someone not carrying a box at all...)

@Jon Hanna 2011-11-11 16:01:27

Firstly, double-check that you need a hashcode at all. Are you going to be putting these lists into a hash-mapped structure (e.g. dictionary, hashset, etc)? If not, forget about it.

Now, assuming that you mean that EnumerableObject already overrides Equals(object) (and hopefully therefore also implements IEquatable<EnumerableObject>) for some reason, then this is indeed necessary. You want to balance speed versus bit distribution.

A good starting point is a mult+add or a shift+xor, like:

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    foreach(var item in this)
    {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

(This assumes that you are using item.Equals() for your sequence equality comparison; if you're using an IEqualityComparer's Equals, you'll need to call into its GetHashCode.)

From there we can optimise.

If null items are disallowed, remove the null check (but be careful: the code will then throw if it ever does encounter a null).

If very large lists are common, we need to reduce the number of items examined, while trying not to produce lots of collisions. Compare the following implementations:

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int max = Math.Min(Count, 16);
    for (int i = 0; i != max; ++i)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int min = Math.Max(-1, Count - 16);
    for (int i = Count - 1; i != min; --i)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

public override int GetHashCode()
{
    int res = 0x2D2816FE;
    int step = Count / 16 + 1;
    for (int i = 0; i < Count; i += step)
    {
        var item = this[i];
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
    }
    return res;
}

Each of these restricts the total number of items examined, which speeds execution but risks poorer-quality hashes. Which (if any) is best depends on whether collections with the same start or the same end are more likely.

Changing the number 16 above adjusts the balance: smaller is faster, while larger gives better hash quality with a lower risk of collisions.

Edit: And now you can use my implementation of SpookyHash v. 2:

public override int GetHashCode()
{
  var hasher = new SpookyHash();//use methods with seeds if you need to prevent HashDos
  foreach(var item in this)
    hasher.Update(item.GetHashCode());//or relevant feeds of item, etc.
  return hasher.Final().GetHashCode();
}

This will create a much better distribution than mult+add or shift+xor, while also being particularly fast (especially in 64-bit processes as the algorithm is optimised for that, though it works well on 32-bit too).

@Chris Nielsen 2013-09-16 20:31:04

Can you explain where 0x2D2816FE came from? Google suggests this is .Net's hash code for an empty string, but I am not sure why that would be a good starting value. Jon Skeet's answer used 19 in a similar way, but I don't understand that either.

@Jon Hanna 2013-09-17 08:36:02

@ChrisNielsen it's arbitrary, and yes, I just borrowed it from that use. The main thing is that it shouldn't be zero, so as to have a different code for an empty list than for null (most uses will assign zero to null objects), as those are two particularly common cases.

@nawfal 2013-11-09 08:44:19

@JonHanna you would want a unchecked scope for GetHashCode there.

@supercat 2014-01-07 19:38:16

If the list will never be mutated, the time to hash everything once will be a fixed multiple of the time spent building the list. I would favor hashing every item but caching the hash value. If the list will support limited forms of mutation (e.g. Add only), it may be possible to maintain an incrementally-updated hash value as the list grows. Also, in many cases I'd suggest keeping more than 32 bits of state within the hashing loop, and rendering the result to 32 bits at the end. That will help reduce the danger of "systematic" collisions.
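A sketch of the Add-only incremental scheme supercat describes (my illustration, with an assumed class shape): fold each appended item into a running hash as it arrives, so GetHashCode becomes O(1) instead of a full pass over the list:

```csharp
using System.Collections;
using System.Collections.Generic;

class HashedList<T> : IEnumerable<T>
{
    private readonly List<T> items = new List<T>();
    private int hash = 19; // same seed and multiplier as the loop-based version

    public void Add(T item)
    {
        items.Add(item);
        unchecked
        {
            // Fold the new item in exactly as the foreach loop would.
            hash = hash * 31 + (item == null ? 0 : item.GetHashCode());
        }
    }

    public override int GetHashCode() => hash; // O(1): maintained on each Add

    public IEnumerator<T> GetEnumerator() => items.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```

Note this only stays valid if the items themselves are immutable (or at least have stable hash codes): a mutated item silently leaves the cached value stale.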

@Jon Hanna 2014-01-07 20:33:37

@supercat almost all uses of GetHashCode() will memoise the hash obtained, making the value of caching it within the method limited.

@supercat 2014-01-07 20:50:31

@JonHanna: It's true that things like Dictionary will memoise the hash values of objects that are added; since adding an item requires calling GetHashCode() to figure out where it goes, storing the value poses minimal extra overhead. On the other hand, if an object gets stored in or checked against more than one dictionary, or if a dictionary is asked repeatedly for an item it does not contain, each such action will result in another GetHashCode call. Unless one knows the list will be small and repeated calls to nested items' GetHashCode() will be fast, caching would seem wise.

@Jon Hanna 2014-01-07 21:15:54

@supercat that's assuming it is in fact the same instance that is hashed each time, when if anything the point of overriding is that it might not be, and often won't be. I'd also consider that if calling GetHashCode() is particularly expensive, the problem is elsewhere, which funnily enough has been a scratch I've been itching of late (I wouldn't suggest the answer above any more, but something based on bitbucket.org/JonHanna/spookilysharp, or at least on what it'll soon be after the initial ironing-out-the-wrinkles period).

@supercat 2014-01-07 22:31:46

@JonHanna: The point of overriding is that it won't always be the same instance. The point of caching is that GetHashCode() may sometimes end up getting called multiple times on the same instance. That doesn't have to happen very often for caching to be a "win", and in some cases it can make a really huge difference (if the items cache their hash codes, it may not matter so much if the collection does, but if caching isn't considered worthwhile for the collection, why should one expect that it was considered worthwhile for the items)?
