Which hashing algorithm is best for uniqueness and speed?

  • Which hashing algorithm is best for uniqueness and speed? Example (good) uses include hash dictionaries.

    I know there are things like SHA-256 and such, but these algorithms are designed to be secure, which usually means they are slower than algorithms that are less unique. I want a hash algorithm designed to be fast, yet remain fairly unique to avoid collisions.

    For what purpose, security or other?

    @Orbling, for implementation of a hash dictionary. So collisions should be kept to a minimum, but it has no security purpose at all.

    Note that you will need to expect at least *some* collisions in your hash table, otherwise the table will need to be enormous to be able to handle even a relatively small number of keys...

    what is the input to your hash function? A simple byte array?

    @stacko For the moment, I'm only concerned with strings.

    If uniqueness (low chance of collision) is all you care about, then any of the usual functions (SHA family) with a long-bit output should do. As for determining which is the fastest, use the newest OpenSSL source compiled with all the architecture speedups made specifically for your platform, then test it with 'openssl speed' as mentioned here http://www.madboa.com/geek/openssl/#benchmark-speed

    Uniqueness is a side-effect of security, though. Secure Hash algorithms are supposed to be suitable for message authentication.

    Take a look at Google's sparsehash template library. http://code.google.com/p/sparsehash/?redir=1

    Why would a CRC not do the job?

    @nohros Unfortunately, since CityHash's minimum hash size is 64 bits, it's too large to be used in practice. It also depends on SSE4.2, which limits its ability to be deployed in practice.

    @IanBoyd: City Hash "too large to be used in practice"? Can you explain? Nothing forces you to use the whole 64-bit result. Also, it doesn't require SSE 4.2 for normal City Hash. There is an (incompatible) variant, CityHashCrc, which requires SSE 4.2 and produces a longer hash, but you can ignore that if it's inappropriate for your needs.

    You can always find a pathetic case for any algorithm.

    @ThorbjørnRavnAndersen Except the ones with very strict bounds. They don't tend to be the interesting algorithms though, as they're often more costly than the ones that are cheaper in some cases…

    @DonalFellows care to share an example?

    Great post! Could you also check Yann Collet's xxHash (creator of LZ4), which is twice as fast as Murmur? Homepage: http://code.google.com/p/xxhash/ More info: http://fastcompression.blogspot.fr/2012/04/selecting-checksum-algorithm.html?spref=tw

    @user1249: It's a long time ago, so I don't have the code any more and my memory might be dodgy. I needed a hash algorithm for a structure where each of the members of the structure could only take a relatively small number of values, so I constructed a dictionary that mapped each member to a bit pattern; ORing them together (with the right shifts) would give me a value which fit in 32 bits. That's the easiest way of doing a perfect hashing function I know, but it was *totally* dependent on the data model I was working with; even minor tweaks would break it utterly.
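    The bit-packing scheme described in that comment can be sketched in a few lines. The field names and value sets below are invented for illustration (the commenter says the real code is long gone); the point is that each member's small value set maps to its own bit range, so distinct records can never collide.

```python
# Hypothetical sketch of a perfect hash built by packing small-domain fields
# into disjoint bit ranges and OR-ing them together. Field names and value
# sets are invented for illustration.

COLOR_BITS = {"red": 0, "green": 1, "blue": 2}      # fits in 2 bits
SHAPE_BITS = {"circle": 0, "square": 1, "star": 2}  # fits in 2 bits
SIZE_BITS = {"small": 0, "large": 1}                # fits in 1 bit

def perfect_hash(color: str, shape: str, size: str) -> int:
    # Each field occupies its own bit range, so no two distinct
    # records can ever produce the same value: zero collisions.
    return (COLOR_BITS[color]
            | (SHAPE_BITS[shape] << 2)
            | (SIZE_BITS[size] << 4))
```

    As the comment notes, this is totally dependent on the data model: add one more possible value to a field and the bit layout (and the hash) breaks.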

    I am curious as to the intent of said hash. Depending on your needs, security is the opposite of speed. Fast means easier to brute force, while hashes such as scrypt work to the opposite. I've found that SHA512 is generally good enough, and depending on the settings scrypt can be overkill. Also, as always, one should salt one's hashes, and add random seed bytes into encrypted streams to ensure greater uniqueness.

    @Tracker1 it's been a while, but this question was not at all about security. This was for hashes in a hash table, not for hashing passwords

    @zvrba Depends on the algorithm. bcrypt is designed to be slow.

    I can't post an answer here, so I'll write a comment. A property not cited here, but important in preventing DoS attacks, is the predictability of collisions. That is why some cryptographic researchers built SipHash, which happens to be fast as well.

    @XavierCombelle that doesn't always matter though, especially for local processes that do not interact with the outside world

    @Izkata `bcrypt` is not a hashing algorithm. It is a key derivation scheme that can be used with a hashing algorithm.

    I'd like to see results for xxHash too!

    Soo... we're actually asking for a well distributed hash function (rather than unique)? I spent a good few minutes getting confused by the answers until I figured this out.

    `lh_strhash` from the OpenSSL library is also one to consider. In my tests it beat *murmur3*, *fnv1* and *djb2* for 16-char string hashing. (which will encompass an overwhelming majority of dictionary words) **see lhash.c**

    i wrote a program to test all algorithms supported by your PHP installation: http://paste.debian.net/872031/ (on my installation of PHP 7.0.11, that's 46 different algos)

    We can compute SHA-256/Whirlpool/etc. on a GPU or even on an FPGA. All such things are already implemented, and you can find them on GitHub =)

    There's a lot of confusion and misinformation here. "unique" is the wrong concept; you want even distribution among your buckets. Message digests achieve "uniqueness" by producing lots of bits, which are useless for hash tables. And most cryptographic hashes (sipHash excepted) are way too slow ... people recommending SHA512 are way off base. xxHash and sipHash are good choices ... neither is included in the accepted answer, which is now quite out of date. (And the test is bad ... there should be collisions.)

    All decent hash functions that were actually designed as hash functions have good distributions, so the accepted answer (which is now quite out of date) mostly peers into the wrong hole. The only thing that matters is speed, but that depends on what the inputs look like -- meaning their lengths. Here's the comparison you want for the stated purpose of a hash function for a hash table: https://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/

    One shouldn't assume that SHA is only a security hash; in git it is also used to avoid collisions: https://stackoverflow.com/questions/28792784/why-does-git-use-a-cryptographic-hash-function In fact, we shouldn't even use it for passwords, since there are dedicated password hashing algorithms now (Argon2)

  • Ian Boyd (accepted answer, 8 years ago)

    I tested some different algorithms, measuring speed and number of collisions.

    I used three different key sets:

    • A list of 216,553 English words (in lowercase)
    • 216,553 "random" (Type 4 UUID) GUIDs
    • 216,553 number strings ("1" to "216553", think ZIP codes)

    For each corpus, the number of collisions and the average time spent hashing was recorded.

    I tested: Murmur2, FNV-1 and FNV-1a, DJB2 and DJB2a, SDBM, SuperFastHash, CRC32, and LoseLose.

    Results

    Each result contains the average hash time, and the number of collisions

    Hash           Lowercase      Random UUID  Numbers
    =============  =============  ===========  ==============
    Murmur            145 ns      259 ns          92 ns
                        6 collis    5 collis       0 collis
    FNV-1a            152 ns      504 ns          86 ns
                        4 collis    4 collis       0 collis
    FNV-1             184 ns      730 ns          92 ns
                        1 collis    5 collis       0 collis▪
    DJB2a             158 ns      443 ns          91 ns
                        5 collis    6 collis       0 collis▪▪▪
    DJB2              156 ns      437 ns          93 ns
                        7 collis    6 collis       0 collis▪▪▪
    SDBM              148 ns      484 ns          90 ns
                        4 collis    6 collis       0 collis▪▪▪
    SuperFastHash     164 ns      344 ns         118 ns
                       85 collis    4 collis   18742 collis
    CRC32             250 ns      946 ns         130 ns
                        2 collis    0 collis       0 collis
    LoseLose          338 ns        -             -
                   215178 collis
    


    Do collisions actually happen?

    Yes. I started writing my test program to see if hash collisions actually happen - and are not just a theoretical construct. They do indeed happen:

    FNV-1 collisions

    • creamwove collides with quists

    FNV-1a collisions

    • costarring collides with liquid
    • declinate collides with macallums
    • altarage collides with zinke
    • altarages collides with zinkes

    Murmur2 collisions

    • cataract collides with periti
    • roquette collides with skivie
    • shawl collides with stormbound
    • dowlases collides with tramontane
    • cricketings collides with twanger
    • longans collides with whigs

    DJB2 collisions

    • hetairas collides with mentioner
    • heliotropes collides with neurospora
    • depravement collides with serafins
    • stylist collides with subgenera
    • joyful collides with synaphea
    • redescribed collides with urites
    • dram collides with vivency

    DJB2a collisions

    • haggadot collides with loathsomenesses
    • adorablenesses collides with rentability
    • playwright collides with snush
    • playwrighting collides with snushing
    • treponematoses collides with waterbeds

    CRC32 collisions

    • codding collides with gnu
    • exhibiters collides with schlager

    SuperFastHash collisions

    • dahabiah collides with drapability
    • encharm collides with enclave
    • grahams collides with gramary
    • ...snip 79 collisions...
    • night collides with vigil
    • nights collides with vigils
    • finks collides with vinic

    Randomnessification

    The other subjective measure is how randomly distributed the hashes are. Mapping the resulting HashTables shows how evenly the data is distributed. All the hash functions show good distribution when mapping the table linearly:

    [Image: linear map of the hash table]

    Or as a Hilbert Map (XKCD is always relevant):

    [Image: Hilbert curve map of the hash table]

    Except when hashing number strings ("1", "2", ..., "216553") (for example, zip codes), where patterns begin to emerge in most of the hashing algorithms:

    SDBM:

    [Image: SDBM number-keys map, showing patterns]

    DJB2a:

    [Image: DJB2a number-keys map, showing patterns]

    FNV-1:

    [Image: FNV-1 number-keys map, showing patterns]

    All except FNV-1a, which still looks pretty random to me:

    [Image: FNV-1a number-keys map]

    In fact, Murmur2 seems to have even better randomness with Numbers than FNV-1a:

    [Image: Murmur2 number-keys map]

    When I look at the FNV-1a "number" map, I think I see subtle vertical patterns. With Murmur I see no patterns at all. What do you think?


    The extra ▪ in the table denotes how bad the randomness is, with FNV-1a being the best and LoseLose being the worst:

          Murmur2: .
           FNV-1a: .
            FNV-1: ▪
             DJB2: ▪▪
            DJB2a: ▪▪
             SDBM: ▪▪▪
    SuperFastHash: .
              CRC: ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
         Loselose: ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
                   ▪
                   ▪▪▪▪▪▪▪▪▪▪▪▪▪
                   ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
                   ▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
    

    I originally wrote this program to decide if I even had to worry about collisions: I do.

    And then it turned into making sure that the hash functions were sufficiently random.

    FNV-1a algorithm

    The FNV1 hash comes in variants that return 32, 64, 128, 256, 512 and 1024 bit hashes.

    The FNV-1a algorithm is:

    hash = FNV_offset_basis
    for each octetOfData to be hashed
        hash = hash xor octetOfData
        hash = hash * FNV_prime
    return hash
    

    Where the constants FNV_offset_basis and FNV_prime depend on the return hash size you want:

    Hash Size  
    ===========
    32-bit
        prime: 2^24 + 2^8 + 0x93 = 16777619
        offset: 2166136261
    64-bit
        prime: 2^40 + 2^8 + 0xb3 = 1099511628211
        offset: 14695981039346656037
    128-bit
        prime: 2^88 + 2^8 + 0x3b = 309485009821345068724781371
        offset: 144066263297769815596495629667062367629
    256-bit
        prime: 2^168 + 2^8 + 0x63 = 374144419156711147060143317175368453031918731002211
        offset: 100029257958052580907070968620625704837092796014241193945225284501741471925557
    512-bit
        prime: 2^344 + 2^8 + 0x57 = 35835915874844867368919076489095108449946327955754392558399825615420669938882575126094039892345713852759
        offset: 9659303129496669498009435400716310466090418745672637896108374329434462657994582932197716438449813051892206539805784495328239340083876191928701583869517785
    1024-bit
        prime: 2^680 + 2^8 + 0x8d = 5016456510113118655434598811035278955030765345404790744303017523831112055108147451509157692220295382716162651878526895249385292291816524375083746691371804094271873160484737966720260389217684476157468082573
        offset: 1419779506494762106872207064140321832088062279544193396087847491461758272325229673230371772250864096521202355549365628174669108571814760471015076148029755969804077320157692458563003215304957150157403644460363550505412711285966361610267868082893823963790439336411086884584107735010676915
    

    See the main FNV page for details.

    All my results are with the 32-bit variant.
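    For the curious, here is what that 32-bit variant might look like in Python. This is a sketch (not the original Delphi test code) using the 32-bit prime and offset basis from the table above; it reproduces the FNV-1a collisions listed earlier.

```python
FNV_PRIME_32 = 16777619          # 2^24 + 2^8 + 0x93
FNV_OFFSET_BASIS_32 = 2166136261

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each octet, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for octet in data:
        h ^= octet
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the hash in 32 bits
    return h

# The famous collision from the list above:
assert fnv1a_32(b"costarring") == fnv1a_32(b"liquid")
```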

    FNV-1 better than FNV-1a?

    No. FNV-1a is all around better. There were more collisions with FNV-1a when using the English word corpus:

    Hash    Word Collisions
    ======  ===============
    FNV-1   1
    FNV-1a  4
    

    Now compare lowercase and uppercase:

    Hash    lowercase word Collisions  UPPERCASE word collisions
    ======  =========================  =========================
    FNV-1   1                          9
    FNV-1a  4                          11
    

    In this case FNV-1a isn't "400%" worse than FNV-1, only 20% worse.

    I think the more important takeaway is that there are two classes of algorithms when it comes to collisions:

    • collisions rare: FNV-1, FNV-1a, DJB2, DJB2a, SDBM
    • collisions common: SuperFastHash, Loselose

    And then there's the how evenly distributed the hashes are:

    • outstanding distribution: Murmur2, FNV-1a, SuperFastHash
    • excellent distribution: FNV-1
    • good distribution: SDBM, DJB2, DJB2a
    • horrible distribution: Loselose

    Update

    Murmur? Sure, why not


    Update

    @whatshisname wondered how a CRC32 would perform, added numbers to the table.

    CRC32 is pretty good. Few collisions, but slower, and the overhead of a 1k lookup table.

    Snip all erroneous stuff about CRC distribution - my bad


    Up until today I was going to use FNV-1a as my de facto hash-table hashing algorithm. But now I'm switching to Murmur2:

    • Faster
    • Better randomnessification of all classes of input

    And I really, really hope there's something wrong with the SuperFastHash algorithm I found; it's too bad to be as popular as it is.

    Update: From the MurmurHash3 homepage on Google:

    (1) - SuperFastHash has very poor collision properties, which have been documented elsewhere.

    So I guess it's not just me.

    Update: I realized why Murmur is faster than the others. MurmurHash2 operates on four bytes at a time. Most algorithms are byte by byte:

    for each octet in Key
       AddTheOctetToTheHash
    

    This means that as keys get longer Murmur gets its chance to shine.
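    To show that four-bytes-at-a-time structure concretely, here is a sketch of 32-bit MurmurHash2 in Python (constants `m = 0x5bd1e995` and `r = 24` are from Austin Appleby's published algorithm; a Python port obviously loses the speed advantage, so this only illustrates the block structure, not the performance):

```python
def murmur2(data: bytes, seed: int = 0) -> int:
    """MurmurHash2, 32-bit: mixes 4-byte blocks, then the leftover tail."""
    m, r = 0x5BD1E995, 24
    h = seed ^ len(data)
    # Body: consume the key one 32-bit block (four bytes) at a time.
    nblocks = len(data) // 4
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> r
        k = (k * m) & 0xFFFFFFFF
        h = (h * m) & 0xFFFFFFFF
        h ^= k
    # Tail: the remaining 0-3 bytes are mixed in byte by byte.
    tail = data[4 * nblocks:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & 0xFFFFFFFF
    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15
    return h
```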


    Update

    GUIDs are designed to be unique, not random

    A timely post by Raymond Chen reiterates the fact that "random" GUIDs are not meant to be used for their randomness. They, or a subset of them, are unsuitable as a hash key:

    Even the Version 4 GUID algorithm is not guaranteed to be unpredictable, because the algorithm does not specify the quality of the random number generator. The Wikipedia article for GUID contains primary research which suggests that future and previous GUIDs can be predicted based on knowledge of the random number generator state, since the generator is not cryptographically strong.

    Randomness is not the same as collision avoidance, which is why it would be a mistake to try to invent your own "hashing" algorithm by taking some subset of a "random" GUID:

    int HashKeyFromGuid(Guid type4uuid)
    {
       //A "4" is put somewhere in the GUID.
       //I can't remember exactly where, but it doesn't matter for
       //the illustrative purposes of this pseudocode
       int guidVersion = ((type4uuid.D3 & 0x0f00) >> 8);
       Assert(guidVersion == 4);
    
       return (int)GetFirstFourBytesOfGuid(type4uuid);
    }
    

    Note: Again, I put "random GUID" in quotes, because it's the "random" variant of GUIDs. A more accurate description would be Type 4 UUID. But nobody knows what type 4, or types 1, 3 and 5 are. So it's just easier to call them "random" GUIDs.

    All English Words mirrors

    It's interesting that FNV-1a seems to have better randomness than FNV-1 (at least, based on the "number string" results), yet has more collisions on the word list you tested. I wonder if this is just a weird coincidence, but your sample size seems to suggest otherwise.

    @DanielB i think it's safe to say that all the serious hashing algorithms will have collisions, but less than a 0.01% chance of a collision. The rest is just luck.

    As a random-dot stereogram enthusiast, I had to try viewing your FNV-1a Numbers image as a stereogram. And yes, it is indeed still quite a repetitive pattern for that use case. Treating the image as flowing left to right then top to bottom, I see significant repetition at 805 pixels (is that 805 bits?). 805 = 23 * 7 * 5. I see no repetition in Murmur2. =)

    @maaartinus On the downside, as a practical matter (non-cryptographic) hash functions are usually used to implement a hash table. Hash tables are usually keyed by a *string*. Trying to pick out the correct 4 bytes of a guid so you can convert it to an integer, so you can convert it to a string adds needless complexity. And it also reduces the usefulness (e.g. if i want to find the item for CustomerGUID "{B5225B82-F3DE-41E7-A70B-91F5AB1F9709}")

    For random input and an ideal hashing function, you expect about 5.5 collisions with N=216553 and a 32-bit word size (d=2^32), by N - d + d*((d-1)/d)^N. The probability of no collisions in this ideal case is p ≃ ((d-1)/d)^(N*(N-1)/2) ≃ 0.00426 (0.426%). It was sheer luck that CRC32 had low collisions. So when collisions aren't significantly above 5.5 and input is sufficiently random, only speed matters. (For consecutive integers, low collisions are desired.)
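    That expectation is easy to sanity-check numerically (a quick calculation added for illustration, not part of the original benchmark; the count is N minus the expected number of distinct hash values, N - d + d*((d-1)/d)^N):

```python
N = 216_553   # number of keys hashed in the tests above
d = 2 ** 32   # number of possible 32-bit hash values

# Expected collisions = N minus the expected number of distinct hash values.
expected_collisions = N - d + d * ((d - 1) / d) ** N
print(f"{expected_collisions:.2f}")  # about 5.46
```

    So the "collisions rare" algorithms in the answer (0-7 collisions per corpus) are behaving about as well as an ideal random function would.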

    It would be really interesting to see how SHA compares; not because it's a good candidate for a hashing algorithm here, but to see how any cryptographic hash compares with these made-for-speed algorithms.

    @IanBoyd: I did a Java CityHash port at one point, and it was faster than my coworkers' Murmur3 implementation. Could we possibly just see the low 32 bits of CityHash benchmarked?

    @Michael: I'd expect SHA to have no collisions and a perfectly random looking distribution. Anything else would be unacceptable from a cryptographic POV.

    One addition: CRC can be used as a hash function, but it wasn't designed for this purpose. Mandatory reading: A painless guide to CRC error detection algorithms (http://www.ross.net/crc/download/crc_v3.txt)

    Care to add Phil Bagwell's hash to the bench? ( described in my answer below)

    A new hash by the name of 'xxHash', by Yann Collet, was doing the rounds recently. I'm always suspicious of a new hash. It would be interesting to see it in your comparison, (if you aren't tired of people suggesting random hashes they've heard of to be added...)

    The results are x86-specific. Please note that the performance of Murmur2 is worse on ARM, for instance. I don't know if this is relevant for the question, but it's still important.

    This is Phil Bagwell's hash (32-bit version, Scala, but should be self-explanatory): `import java.lang.Integer.reverseBytes; def sbhash(i: Int) = reverseBytes(i * 0x9e3775cd) * 0x9e3775cd`

    Indeed. The performance numbers announced by the xxHash project page look impressive, maybe too much to be true. Well, at least, it's an open-source project : http://code.google.com/p/xxhash/

    I'm really curious about how the SHA family would fare. If performance is not incredibly worse, the guarantee of no collisions is enough to make it a better choice, since you're virtually never going to be bottlenecked by hashing bandwidth.

    How about testing a true random function for collisions and randomness for comparison?

    Hi Ian, my Delphi implementation of SuperFastHash is correct. When implementing I created a test set in C and Delphi to compare the results of my implementation and the reference implementation. There are no differences. So what you see is the actual badness of the hash... (That is why I also published a MurmurHash implementation: http://landman-code.blogspot.nl/2009/02/c-superfasthash-and-murmurhash2.html )

    Great answer. One thing, though, that I think must be stressed, is that while MurmurHash2 does seem to be the best, doesn't mean it's the *only* hash function that should *ever* be used. FNV-1a has one significant advantage over Murmur, that being ease of implementation (two constants and a loop of two simple operations), and it is still *good* at the things MurmurHash is *better* at.

    One question; what would an implementation of FNV-1a look like that used 4 bytes at a time? How well would it do at these tests?

    Some analysis/graphs can be found at Pluto Scarab — Hash Functions http://home.comcast.net/~bretm/hash/ , or at the MurmurHash old pages: https://sites.google.com/site/murmurhash/avalanche (from https://sites.google.com/site/murmurhash/statistics )

    Another study of hash speed is http://playcontrol.net/opensource/LuaHashMap/benchmarks.html but that may also be sensitive to the hash table rebuild strategy.

    https://news.ycombinator.com/item?id=6547734 suggests that SipHash would be a valuable addition to this answer.

    @IanBoyd: This is one of the best answers I've seen on any stackexchange site. Bravo. :). Is there any chance of you posting all the code publicly? I'd like to add a few things to see for myself... Thanks.

    @JimR i could, but it's Delphi, using various code files of helper routines and classes that i've collected over the years. You're better off to start over in C# or something. The real work was implementing the algorithms. It would be a lot of pain to cull down the relevant files into the required routines; and even then it was only designed for Delphi 7 - using an ide from this decade would require code-rework. In other words: no. :)

    @IanBoyd: Ahh well, I guess I'll put off looking for that Delphi 7 Enterprise box in the garage... :)

    Which CRC32 have you used? Apparently there is at least IEEE, Castagnoli and Koopman.

    @yaccz i chewed whatever gum i found laying around. This one i think it was

    I'm a bit late to this party but I was wondering how md5 compares with that list. I've found it to be a bit faster than crc, but I don't know how it compares in terms of collisions to the others.

    If I'm not incorrect, it's `SuperFastHash collisions` not `SuperFastCash collisions`, right?

    Excellent post! What surprises me is how well good old CRC32 performed. I have always shied away from this for "serious" applications as I thought (due to age and simplicity) it would not perform as well as more modern algorithms.

    Is the poster aware this is not just an awesome answer - this is the world's de facto reference resource on the subject? Anytime I need to deal with hashes, that solves my issue so fast and authoritatively that I don't ever need anything else.

    I tested a bit more, and basically CRC32 is by far the fastest on a modern CPU if you use zlib (just macports uses the slow SW fallback) or the use the SSE4.2 CRC32-C function manually. This needs 1 cycle per op, with 3 wait cycles. zlib provides on most platforms access to the HW-accelerated function which is optimized to run in parallel (to avoid the 3 wait cycles) for larger keys. There are several more really fast hash function variants, tested at: http://www.sanmayce.com/Fastest_Hash/index.html#TRIADvsCRC https://github.com/rurban/perl-hash-stats https://github.com/rurban/Perfect-Hash

    @IanBoyd The list of words -- http://www.sitopreferito.it/html/all_english_words.html seems to be unavailable. Do you have a local copy lying around?

    @KedarMhaswade archive.org has it.

    murmur3 is also available now. Would you please run the same test and see how many collisions there are? https://github.com/PeterScott/murmur3

    I need some extra information. I've tried making a test program, but it fails miserably. I get 20000+ collisions at best. So I'd like to ask: `1: How many buckets do you use ?` and `2: What's your definition of a collision ?`. It may be my code that's completely off, but my own definition of a collision is that a hash (after modulo reduction) matches another hash (after modulo reduction). I've also tried comparing the hashes before modulo reduction, but still 20000+ collisions at best. Do you count collisions each time a word has the same hash, or do you count them just the first time ?

    @IanBoyd - as an aside, do you have a high-res version of the Hilbert Map above? This would made a beautiful addition to this geek's wall art. I'd be willing to pay you for it.

    @cciotti I was about to say that i can do that. But then i remembered: that map **is** 1:1 resolution. There are `216,553` words in the English language, and you see `524,288` pixels (where white is the unfilled hash map). There are no more pixels to show! Were you thinking you'd just like to be able to see a high resolution Hilbert color gradient? 16:10 aspect ratio and 4,000 pixels high?

    @IanBoyd - That makes sense. I was hoping to find something high-res that I could print at a reasonable size. I have just the spot and was considering a DNA marker print but I like this too.

    @mcfinnigan "Costarring liquid"

    In the page https://web.archive.org/web/20070221060514/http://www.sitopreferito.it/html/all_english_words.html OUTSOURCING and OUTSOURCINGS each repeat twice. Did you account for this? Or did you just use another version of this database without this bug?

    @IanBoyd, your answer is really impressive, like many others have noted, and I'm as curious as the next guy, but I also like to get shit done, rather than spending my whole day reading about the intricacies of various hashing algorithms ;) Could you please add a *Conclusion* section or a *TL;DR*? It sounds like maybe Murmur2 is the "best", or maybe Murmur3? It's hard to tell without studying the whole thing. I realize "best" may depend on the situation, but some kind of overview would make your answer much more useful.

    @IanDunn It's tucked away in there; but i decided on **MurMur2**. I wasn't able to implement MurMur3 easily, and so it wasn't tested. Originally i was going to use FNV-1, but changed my mind to Murmur2.

    The 64-bit SuperFastHash Pascal implementation is incorrect. It shouldn't use Cardinal(), which will truncate the pointer on 64-bit.

    The table is very confusing to me. _"The extra * in the above table denotes how bad the randomness is. With FNV-1a being..."_ but the table is _under_ the statement and I don't understand what the '*' mean there...

    @Noein The `*` denotes relative *badness*, with FNV-1 having no "badness", so there is no `*` next to it. Meanwhile **LoseLose** is so bad that the `*` spill out over the edge of the chart onto the floor.

    For the randomness aspect of the two most evenly hashed algorithms, you might look into the Serial Correlation Coefficient (Knuth, pp. 64–65). Applying ent's random analysis on this as a bit stream (rather than byte stream) could be an interesting addition to the analysis.

    @IanBoyd But FNV-1 has a point, FNV-1_a_ has no points, and _CRC_ has a lot of points while SuperFastHash has none. There's the confusion :/

    I wrote and refined my own hash function to replace md5 in jdupes with something faster; it was specifically designed to reflect basic CPU instructions and use as few operations as possible. I ran the All English Words (upper and lower case, with OUTSOURCING[S] deduped) and number sequence tests to check collisions: 1 pair collided for UC, no collisions for LC or numbers. My other tests in my actual programs show very low collision rates as well; I'd be interested in more thorough testing and commentary as seen here!

    Would you mind putting your code for comparison up on GitHub or someplace that would accept contributions? It would be nice to have this in an easy place, and there's a few additional hashes that I wouldn't mind seeing added myself.

    Your answer is VERY comprehensive, HOWEVER: i don't see anything (i used ctrl+F) about cores or threads. What is your hardware spec for the analysis? Perhaps it's due to my limited understanding of cryptology, but wouldn't the CPU type, # of cores, multithreading, compiler opts, and memory type all have non-linear effects on these algos? Or is it not affecting your current test sets (or affecting them proportionally)?

    @Nick Fortunately i wasn't really investigating the hand-optimized assembly multi-core versions of the algorithms. I was testing the actual algorithms as i implemented them, as they'd actually be used in the language i wrote them in. It's also fortunate that multiple cores wouldn't help hashing your typical 10 character string - the string can already fit inside one 64-byte cache line inside one core. And modern CPUs are already so pipelined that they're able to do a lot.

    @Nick I ***highly*** recommend Eric Brumer's talk Native Code Performance and Memory: The Elephant in the CPU. He explains how memory access is everything. Memory is the bottleneck, not processing power. 99% of your CPU die is dedicated to getting stuff out of memory.

    @IanBoyd - thanks for the clarification - there i went again, getting all excited thinking about all these use cases, and as it turns out, my use case is pretty similar to yours - "short" string hashing :) I've added the link to my list of todos. Hopefully i'll have time on the weekend!

    What about SipHash because of HashDoS mentioned at https://en.wikipedia.org/wiki/MurmurHash#Attacks ?

    @Nick Even if memory weren't the bottleneck, hash functions usually are difficult or impossible to parallelize unless they produce final hashes larger than the machine word size (in which case they MIGHT be able to use more cores to compute partial hashes and then recombine them at the end) but even then some pretty specific restrictions would have to be placed on the hash function to force it to be parallelizable. In general most machine-word sized non-secure hash functions are so fast already that the effort is not worth it.

    How do you generate those **randomness pixel graphs**? I tried calculating SDBM for increasing numbers, then dividing the output of SDBM by 0xFFFFFFFF, and checking if it's over or under .5, but it's almost always under. It looks much worse than your graph. MurMurHash3 works great with my method though. This is my output for SDBM: http://i.imgur.com/MMQuYxY.png

    @bryc Each pixel is an entry in the internal array. If the array slot is empty, it's drawn as white. If the array slot is filled: it's drawn with a color. So then i just create an image that is sqrt(arrayLength) wide and tall. It also helps that, at the time, i was using a prime number probe. I've since switched to linear probe - cause cache hits.
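    Based on that description, a rough sketch of how such an occupancy map could be built (a simplified illustration: probing and rendering details are omitted, and Python's built-in `hash` stands in for the real hash functions):

```python
import math

def occupancy_map(keys, hash_fn, table_size):
    """Return a square grid of 0/1 cells: 1 where a hash-table slot is
    filled (drawn as a color), 0 where it is empty (drawn as white)."""
    slots = [0] * table_size
    for key in keys:
        slots[hash_fn(key) % table_size] = 1
    # The image is sqrt(arrayLength) wide and tall.
    side = math.isqrt(table_size)
    return [slots[row * side:(row + 1) * side] for row in range(side)]

# Example with Python's built-in hash as a stand-in hash function:
grid = occupancy_map([str(i) for i in range(1, 1001)],
                     lambda s: hash(s) & 0x7FFFFFFF,
                     64 * 64)
```

    Each row of `grid` becomes one row of pixels; clustering of the 1s reveals the patterns discussed above.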

    You should have published an IEEE paper with this much information. ;)

    @IanBoyd Would love seeing Adler32 added to this list for comparison. I had been using it for String hashing and saw a related post on a sister site from Adler himself pointing out it should not be used that way. Also, I bet I'm not alone in wishing to see this posted elsewhere as an updated blog type post with more details included for those interested.

    @IanBoyd Asking @user124114's question: `1: How many buckets do you use? 2: What's your definition of a collision?` He tried the experiment and got 20,000+ collisions at best. `Do you count collisions each time a word has the same hash, or do you count them just the first time?`

    I wonder how the PJW/ELF hash fares. It's standard in the UNIX world, so I'd hope the collision rates are not too bad.
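For reference, the classic PJW/ELF hash is short enough to sketch here (a straight Python port of the usual C version):

```python
def elf_hash(data: bytes) -> int:
    """The classic PJW/ELF hash, as used for System V / ELF symbol tables."""
    h = 0
    for byte in data:
        h = ((h << 4) + byte) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xFFFFFFFF  # clear the top nibble that was folded back in
    return h
```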

    Now that you've got these measurements, it's worth noting for readers that the tradeoff of speed and collisions changes depending on the use case. Every use case will pay some cost A for the non-collision case, and a different cost B for the collision case. The total cost of a sequence of N accesses then, with H being the cost of the hash itself and c the collision rate, is `N*(H + (1-c)A + cB)`, and you want to choose your hash to minimize that.
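Plugging illustrative numbers (purely hypothetical ones of mine) into that formula shows how a slower hash with fewer collisions can still lose to a cheaper one:

```python
def total_cost(N, H, A, B, c):
    """N accesses, hash cost H, non-collision cost A, collision cost B,
    collision rate c: N * (H + (1-c)*A + c*B), as in the comment above."""
    return N * (H + (1 - c) * A + c * B)

# Hypothetical per-access costs; collisions cost 10x a clean access.
fast_hash = total_cost(1_000_000, H=10, A=5, B=50, c=0.10)  # cheap, collides more
slow_hash = total_cost(1_000_000, H=25, A=5, B=50, c=0.01)  # careful, expensive
# With these numbers the cheap hash wins despite 10x the collision rate.
```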

    Any chance to add Bob Jenkins' hash to the comparison? That (one-at-a-time) is the one I use. (I just found this thread from the Reddit post.)

    I bet the collision between `codding` and `gnu` for CRC32 is an easter egg!

    Since the question is about a hash function for a hash table, this is largely irrelevant: collisions necessarily occur when you fold the hash into a relatively small number of buckets, and some functions will do poorly on some hash table sizes. And the best hash functions for hash tables, such as xxHash, SipHash, and Bob Jenkins' SpookyHash, aren't included. People asking for SHA and MD5 are in the wrong ballpark for speed, so maybe include them just to demonstrate that.
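The folding point is easy to demonstrate: with a power-of-two table size, reduction keeps only the low bits of the hash, so hashes that differ only in their high bits collide anyway (example values are mine):

```python
h1, h2 = 0x12345678, 0xABCD5678  # distinct hashes, differing only in high bits
buckets = 1 << 16                # 65,536-slot table

assert h1 != h2
assert h1 % buckets == h2 % buckets  # same bucket after folding
```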

    @Yves If you look at the definition and history of CRC32, you will see that cannot possibly be the case. Good thing for you that "I bet" is just an emphatic way to state unfounded beliefs, rather than a commitment to making a payment.

    @PhilMiller Collisions happen upon table insertion, and the costs are internal to the hash table implementation. They don't depend on the "use case", nor do they apply to "accesses". Further, for any given hash table implementation, the costs A and B are not constant--they depend on the key and the prior content of the table--so your formula is wrong. The right way to formulate it is with worst case, best case, and average costs for insertion, deletion and query. Use cases should pick implementations accordingly.

    @JimBalter Hash collisions can absolutely impose costs on lookups as well as insertions. For the most trivial illustration, consider separate chaining where entries that collide require added traversal of a list to locate.

    @PhilMiller I didn't say that collisions can't impose costs on lookups; of course they can. But you wrote "The total cost of a sequence of N accesses then ... is [formula]", which should read "insertions", not "accesses". The cost of accesses is different from the cost of insertions, and again are *not constant* -- the cost of traversal of a chain making that obvious. And you ignored the rest of my points. I won't respond further.

    Who would guess that snushing playwrighting could cause collisions...

    Dictionary words plumless and buckeroo are also a CRC32 collision, based on https://www.lammertbies.nl/comm/info/crc-calculation.html Wondering how CityHash would do.
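That pair is easy to confirm without an online calculator; Python's zlib implements the same standard CRC-32 (the zip/PNG polynomial):

```python
import zlib

# "plumless" and "buckeroo" produce the same 32-bit CRC-32 checksum.
assert zlib.crc32(b"plumless") == zlib.crc32(b"buckeroo")
print(hex(zlib.crc32(b"plumless")))  # the shared checksum value
```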

    @IanBoyd It would be nice if you could include Java String Hashcode algorithm for comparison. :)
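For reference, Java's `String.hashCode` is simple enough to reproduce exactly: h = 31*h + charValue over the characters, with Java's 32-bit signed overflow semantics.

```python
def java_string_hashcode(s: str) -> int:
    """Java's String.hashCode(): s[0]*31^(n-1) + ... + s[n-1],
    reduced with 32-bit signed integer overflow as in Java."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret as a signed 32-bit int, as Java would return it.
    return h - 0x1_0000_0000 if h >= 0x8000_0000 else h

# In Java, "abc".hashCode() is 96354; this matches.
```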

    This is an awesome answer, and deserves all its upvotes. In addition to the numbers on compute time and collisions, I would like to see an indication of implementation complexity. In a constrained situation, such as a SQLite query in which you can't define your own functions, which of the hash functions listed here can be practically implemented?

Licensed under CC BY-SA with attribution


Content dated before 6/26/2020 9:53 AM