Why are hash functions one way? If I know the algorithm, why can't I calculate the input from it?
Why can't a password hash be reverse engineered?
I've looked into this ages ago and have read lots on it, but I can't find the explanation of why it can't be done. An example will make it easier to understand my question and to keep things simple we will base it on a hashing algorithm that doesn't use a salt (LanMan).
Say my password is "Password". LanMan will hash this and store it in the database. Cracking programs can brute force these by hashing password guesses that you provide. It then compares the generated hash to the hash in the database. If there is a match, it works out the password.
Why, if the password cracker knows the algorithm to turn a plain text password into a hash, can't it just reverse the process to calculate the password from the hash?
If a password cracker knows the process to turn a cow into ground beef, does that mean he can "just reverse" it and turn ground beef into a cow?
please migrate this to [crypto.se] or even [security.se] because I only see wrong answers getting upvoted and the ones that are correct don't really address the **why** part of the question.
Agreed, I am somewhat confused as to why these severely flawed answers are getting so many upvotes.
The problem with all of the provided answers below is that they seem to explain why you can't get the answer back, but then raise the issue "given this, you'd be more likely to gain access than if it stored the password in plain-text, because you no longer need an exact match." The only way this is covered is when people say "Oh, but it's really unlikely". It's still more likely than if you have to get an exact match!
_"is just like a worm that is getting off a branch on a tree to take a shower at the river, if he decides to go back to the exact branch he was before, he will miserably fail and everyone will laugh at him"_
What a very frustrating set of answers to come across. The answer is simple: each hash can be the result of an infinite number of strings being hashed, so there is no way of knowing which one a hash was meant to represent - even more simply put, a hash doesn't represent any one value.
Let me invent a simple "password hashing algorithm" to show you how it works. Unlike the other examples in this thread, this one is actually viable, if you can live with a few bizarre password restrictions. Your password is two large prime numbers, x and y. For example:
x = 48112959837082048697 y = 54673257461630679457
You can easily write a computer program to calculate xy in O(N^2) time, where N is the number of digits in x and y. (Basically that means that it takes four times as long if the numbers are twice as long. There are faster algorithms, but that's irrelevant.) Store xy in the password database.
x*y = 2630492240413883318777134293253671517529
A child in fifth grade, given enough scratch paper, could figure out that answer. But how do you reverse it? There are many algorithms people have devised for factoring large numbers, but even the best algorithms are slow compared to how quickly you can multiply x by y. And none of those algorithms could be performed by a fifth grader, unless the numbers were very small (e.g., x=3, y=5).
That is the key property: the computation is much simpler going forwards than backwards. For many problems, you must invent a completely new algorithm to reverse a computation.
This has nothing to do with injective or bijective functions. When you are cracking a password, it often doesn't matter if you get the same password or if you get a different password with the same hash. The hash function is designed so it is hard to reverse it and get any answer at all, even a different password with the same hash. In crypto-speak: a hash function vulnerable to a preimage attack is utterly worthless. (The password hashing algorithm above is injective if you have a rule that x < y.)
What do cryptography experts do? Sometimes, they try to figure out new algorithms to reverse a hash function (pre-image). They do exactly what you say: analyze the algorithm and try to reverse it. Some algorithms have been reversed before, others have not.
Exercise for the reader: Suppose the password database contains the following entry:
What is the password? (This one is actually not too difficult for a computer.)
Footnote: Due to the small number of passwords that people choose in practice, a good password hash is not merely difficult to compute backwards but also time-consuming to compute forwards, to slow down dictionary attacks. As another layer of protection, randomized salt prevents the use of precomputed attack tables (such as "rainbow tables").
Footnote 2: How do we know that it is hard to reverse a hash function? Unfortunately, we don't. We just don't know any easy ways to reverse hash functions. Making a hash function that is provably difficult to reverse is the holy grail of hash function design, and it has not been achieved yet (perhaps it will never happen).
Ok, so tell me the answer to the exercise you have given me? I have no idea how to work that out, how would i?? And what is injective and bijective?
@Mucker: A function f(x) is injective if f(x) = f(y) implies x = y, i.e., no two inputs have the same output. A bijective function is an injective function that is also surjective, i.e., for every possible output there is a corresponding input. When people say "bijective" in this thread they really should be saying "injective". Both concepts are not really relevant to password hash security. Telling you the answer to the reader exercise defeats its purpose, I never wrote down the answer anyway (it does exist, I just don't know what it is).
@DietrichEpp could you please add to your answer that most cryptographic hashing functions actually use simple operations like `add, and, or, xor, rotate, mod` to do the hashing and not prime numbers? (just to make it clearer)
@Mucker: Obviously it factors to 1451730470513778492236629598992166035067 x 2425967623052370772757633156976982469681 (Actually, took about ten minutes of CPU time using SIQS method on a single 3GHz core.)
@drjimbob: Nice! Now here's a link I used: http://primes.utm.edu/lists/small/small.html
I would like to add that hashes are also compression. You can hash a (high entropy) x GiB file and receive a 160bit digest. Information is lost (even if len(m) < len(hash(m)) ) and trying to find out what is missing is very hard.
@Tie-fighter: Very, very lossy compression - as you note, the amount of lost information is so huge that it's compression only in a very technical sense, I'm tempted to add.
"it often doesn't matter if you get the same password or if you get a different password" - But it does when the input is not fully arbitrary but must fulfill some requirement, such as a password + known salt, or a message in a certain protocol. This makes the data-destroying aspect of hashes beneficial to security. (Of course, if you try long enough you can find an input that does follow the requirements, but that takes even longer)
@BartvanHeukelom: That's not actually true. Destroying data is irrelevant, the only relevance is the cost of a preimage attack. Using salt does not increase the cost of a preimage attack, it only prevents attackers from running preimage attacks in parallel or ahead of time. Don't misunderstand, salt is essential -- but there is no real benefit for non-injective functions here.
@juanpastas: Your edit is incorrect. If you use the product xy as your password, then the scheme does not work. The password is both numbers, and the product is stored in the password database.
I have a question, how we choose 2 prime numbers? If I have a hash-table to store value like 6(2*3),63(7*9), of course, much larger value of product of 2 prime numbers. And we do 'reverse engineering', can we get the 2 prime numbers much faster? and if we keep building this table like 9x9 table(multiplication table), is this method possible to break this hash mechanism?
@Timeless: The hash table method would never work, because the hash table would be too large, and it would take too much time to create the hash table. The state of the art technique for factoring large numbers is the *general number field sieve* which is still too slow for large numbers.
Now THAT is a good question.
We must first give a precision: many one-way functions, in particular hash function as commonly used in cryptography, accept inputs from a space which is much larger than the space of output values. For instance, SHA-256 is defined for inputs which are strings of up to 18446744073709551615 bits; there are 218446744073709551616-1 possible inputs, but since the output is always a sequence of 256 bits, there are only 2256 possible outputs for SHA-256. Necessarily, some distinct inputs yield the same output. Therefore, for a given output of SHA-256, it is not possible to unambiguously recover the input which was used, but, possibly, it might be possible to compute an input which yields the given output value. Preimage resistance is about that: the difficulty of finding a matching input for an output (regardless of how that output was obtained in the first place).
So we talk about a function that everybody can compute over any input (using a publicly known program, no secret value involved -- we are not talking about encryption).
What academics say
It is unclear whether one-way functions can actually exist. Right now, we have many functions that no one knows how to invert; but this does not mean that they are impossible to invert, in a mathematical sense. Note, though, that it is not proven that one-way functions cannot exist, so hope remains. Some people suspect that whether one-way functions may exist or not could be one of these irksome mathematical assertions which can be neither proven nor disproved (Gödel's theorem proves that such things must exist). But there is no proof of that either.
Therefore, there is no proof that any given hash function is really resistant to preimages.
There are some functions which can be linked to well-known hard problems. For instance, if n is the product of two big primes, then the function x ⟼ x2 mod n is hard to invert: being able to compute square roots modulo a non-prime integer n (on a general basis) is equivalent to being able to factor n, and that problem is known to be hard. Not proven to be hard, mind you; only that mathematicians have tried to efficiently factor big integers for (at least) the last 2500 years, and although some progress has been made, none of these smart people found a really killer algorithm for that. World record for factorization of a "RSA modulus" (a product of two randomly chosen big primes of similar lengths) is a 768-bit integer.
Some hash functions based on such "hard problems" have been proposed; see for instance MASH-1 and MASH-2 (on the RSA problem) and ECOH (with elliptic curves). Only a few such functions exist, because:
Turning a "hard problem" into a secure hash function is not easy; there are lots of tricky issues. For instance, while extracting square roots modulo a non-prime n is usually hard, there are values for which square root extraction is easy.
The performance of such hash functions tends to be, let's say, suboptimal. Like being 100x slower than a more commonly used SHA-1.
The more "standard" way of building a hash function is to get cryptographers together and have them gnaw at some proposed designs; the functions which survive cryptanalytic attempts for a few years are then considered "probably robust". The SHA-3 competition is such an effort; the winner should be announced later this year. On the 51 candidates (the ones who succeeded the administrative step), 14 were retained for "round 2" and these 14 have been relatively closely looked at by many cryptographers, and none of them found anything really worth saying about the functions. The list has been reduced to 5 and will be further reduced to 1 "soon", but not for security reasons (most of the actual data was about performance, not resistance).
What makes MD5 hard to invert
Since we do not know how to prove that a function is hard to invert, the best we can do is to give it a try on a specific function, so as to get an "intuition" of how the function achieves its apparent resistance.
I choose MD5, which is well known. Yes, MD5 is "broken", but that's for collisions, not preimages. There is a known preimage attack which is, at least theoretically, faster than the generic way (the "generic way" is "luck", i.e. trying inputs until a match is found, for an average cost of 2128 evaluations since MD5 has a 128-bit output; the Sasaki-Aoki attack has cost 2123.4, which is lower, but still way too high to be actually tried, so the result is still theoretical). But MD5 is relatively simple and has withstood attacks for quite some time, so it is an interesting example.
MD5 consists in a number of evaluations of a "compression function" over data blocks. The input message is first padded, so that its length becomes a multiple of 512 bits. It is then split into 512-bit blocks. A 128-bit running state (held in four 32-bit variables called A, B, C and D) is initialized to a conventional value, then processed with the compression function. The compression function takes the running state and one 512-bit message block, and mixes them into a new value for the running state. When all message blocks have been thus processed, the final value of the running state is the hash output.
So let's concentrate on the compression function. It works like this:
- Inputs: the running state (A B C D) and a message block M. The message block is 512 bits; we split it into 16 32-bit words M0, M1, M2,... M15.
- Output: the new running state value.
- Save the current state in some variables: A → A', B → B', C → C' and D → D'
- Do 64 rounds which look like this:
- Compute T = B + ((A + fi(B, C, D) + Mk + Xi) <<< si). This reads like this: we compute a given function fi (a simple bitwise function, which depends on the round number i) over B, C, and D. Add to that the value of A, one message word Mk and a constant Xi (additions are done modulo 232). Rotate the result to the left by some bits (the shift amount also depends on the round). Finally, add B: the result is T.
- Rotate the state words: D → A, C → D, B → C, T → B.
- Add the saved state values to the current state variables: A + A' → A, B + B' → B, C + C' → C, D + D' → D.
The important point is that there are 64 rounds, but only 16 message words. This means that each message word enters the processing four times. I write that in bold because it is the central point; resistance to preimages comes from that characteristic. Which message word is used in each round is described in the MD5 specification (RFC 1321); the specification also describes the functions fi, the rotate counts si and the 32-bit constants Xi.
Now suppose that you are trying to "invert" MD5; you begin from the output and work slowly up the compression function. First, you must decide the output of round 64. Indeed, the output of the compression function is the sum of the output of round 64, and the saved state (the A' B' C' D' values). You have neither, so you must choose. Your hope is that you will be able to find values for the message words which will allow you to obtain for input of round 1 some values which are coherent with your arbitrary decision on A' and its brothers.
Let's see how things look when you walk the compression function backward. You have the output of a round (the variables A, B, C and D after the round) and you want to recompute the input of that round. You already know the previous values of B, C and D, but for A and Mk you have plenty of choice: each 32-bit value is possible for A, and each has a corresponding Mk. At first, you are glad of that; who would spurn such freedom ? Just choose a random Mk, and this yields the corresponding A with just a few operations (try it !).
But after you have inverted that way 16 rounds (the rounds 49 to 64, since you are working backwards), freedom disappears. You have "chosen" the values of all the message words. When trying to invert round 48, you want to recompute the value of A just before that round; as per the MD5 specification, message word M2 is used in round 48, and you have already chosen the value of M2 (when inverting round 63). So there is only one choice for A. So what, would you say. One choice is sufficient to continue the backward walk. So you continue.
Now, you are at the beginning of the compression function. Remember that, initially, you made an arbitrary choice of the values of A' B' C' D': this allowed you to compute the output of round 64, and begin the backward walk. Now you have obtained the input of round 1, which should be identical to A' B' C' D'... and it does not match. That's quite normal: you chose A' B' C' D' arbitrarily, and you also chose the message words Mk arbitrarily, so it can be expected that it won't work most of the time. So you try to repair the computation, by retrospectively altering either your initial choice of A' B' C' D', or one or several of the random choices for Mk. But each modification on any Mk implies modifications elsewhere, because each Mk is used four times. So you need other modifications to cancel out the other ones, and so on...
At that point you begin to understand the problem of inverting MD5: every time you touch a single bit, it triggers an awful lot of modifications throughout the algorithm, which you need to cancel out by touching other bits, and there are just too many interactions. Basically, you juggle with 2128 balls at the same time, and that's way too much to keep track of all of them.
If each message block was 2048-bit long, split into 64 words, and each message word was used only once in MD5, then you could invert it easily. You do as above: arbitrary selection of A' B' C' D', arbitrary selection of message words for rounds 64 to 5; and for the first four rounds, you just consider the value you wish to obtain for the round input (the value which matches your arbitrary choice of A', B', C' or D') and work out the corresponding message word. Easy as pie. But MD5 does not process data by 2048-bit blocks, but by 512-bit blocks, and each message word is used four times.
Some additional twists
The structure of the compression function of MD5 is actually a generalization of a Feistel cipher. In a Feistel cipher, the data is split into two halves, and, for each round, we alter one half by adding/xoring it to an intermediate value which is computed from the other half and from the key; and then we swap the two halves. Extend this scheme to a four-parts split, and you get the same structure than the MD5 rounds -- with a 90º rotate: MD5 looks like the encryption of the current state using the message block as key (and there is the extra addition of the output of round 64 with the saved state, which departs MD5 from a rotated cipher).
So maybe we can build hash functions out of block ciphers ? Indeed we can: that's what Whirlpool is about. A hash function built over a rotated block cipher (the message block is the key); the block cipher of Whirlpool is "W", a derivative of Rijndael, better known as the AES. But W has bigger blocks (512 bits instead of 128 bits) and a reforged key schedule.
When you make a hash function out of a rotated block cipher, then preimage attacks on the hash function are somewhat equivalent to key reconstruction attacks on the block cipher; so there is some hope that if the block cipher is secure, then so is the hash function. There again, there are snarky details. Also, for such a structure, collisions on the hash function are like related-key attacks on the block cipher; related-key attacks are usually considered non fatal, and often ignored (for instance, they were not part of the evaluation criteria for the AES competition, and Rijndael is reputed a bit flaky in that respect, which is why W has a brand new key schedule).
Some newer designs are built over a block cipher which is not rotated, so that security of the hash function can be derived more directly from security of the block cipher; see for instance the SHA-3 candidate Skein, defined over a block cipher called Threefish.
Conversely, one could try to make a block cipher out of a hash function. See for instance SHACAL, which is SHA-1 "set upright". And, on cue, SHACAL has some related-key weaknesses which are quite similar to the known weaknesses of SHA-1 with regards to collisions (no actual collision was computed, but we have a method which should be almost a million times faster than the generic collision-finding algorithm).
Therefore, contrary to what I said in the introduction of this post, we have been talking encryption all along. There is still much to be discovered and studied about the links between hash functions and symmetric encryption.
TL;DR: there is no TL;DR for this message. Read it whole, or begone.
Best TL;DR quote ever. I think I need to create a new stack in my evernote just for your answers. Do you author any articles or books by chance?
I don't care that it's late, I need to say this: Really good explaination that really shows the complexity you can create using algorithms. I had this ignorant thought that everything easily could be done backwards if you knew how to do it forward (using computers), and this clearly shows that it's not that easy. The example with MD5 was great as well, since it lets you actually see the complexity for what it is (unlike with analogies [which are great too, don't get me wrong]). Again, really great and enlightening article; hope to read more from you.
" x ⟼ x2 mod n is hard to invert " ... This seems unlikely, especially since you (or whoever is using this inside a hash function they designed, e.g., the NSA) have access to those big primes.
Hi , When you say "It is unclear whether one-way functions can actually exist. Right now, we have many functions that no one knows how to invert; but this does not mean that they are impossible to invert, in a mathematical sense", what are you referring to? For example, if we look at the function "floor", do we claim that it is "not impossible to invert"? Thanks!
@AsheKetchum A one-way function is by definition preimage resistant, so the meaning is not exactly what you would expect. If you have `floor(n) = 7`, I can "invert" that with `n = 7.2`. Even if that is not the original value, I still "inverted" it. I didn't discover the original value of `n` that you may have had in mind, but I did discover _a_ value which solves the equation, proving that it is not one-way in the cryptographic sense.
@cnd That equation was just an example of a one-way function called a "trapdoor function". Functions of that kind are _normally_ one-way, but not if you have access to certain secret variables used in creating the function, in that case the primes multiplied together to derive _n_. Real hashes do not use trapdoor functions, so their one-wayedness is unconditional and not dependent on the secrecy of some value.
The first step to to the answer here is seeing examples, like the nice one from @Dietrich, of functions that are much harder to run in one direction than the inverse, and have resisted many attempts to find a speed breakthrough. But the problem is complex, so I'll try to flesh it out some more.
Lots of people seem to be falling into the trap (heh) of thinking that hash functions are actually somehow magical - that they really are absolute "one way functions" that mathematically can't be run backwards at all, just because they are called hashes. This is not a healthy way to think about it in a security forum. It is often wrong in practice. And it is always wrong in theory, given the basic mathematical definition of a function as a mapping from a domain to an image.
All hashes can be reversed, in principle. It may be messy and brutish (as in brute-force), it may take a impractically long time with today's hardware, and it may even hold up over the long haul, but mathematically it is simply a matter of time. As @mucker noted, all the information is there to find the original password, (or, at least, a password that works). If we forget that, we forget the danger of clever heuristics for cherry-picking likely passwords, which make the news regularly. Hashing is an engineering problem and the primary challenge is an efficiency one - how to make it expensive to find the password given the hash. One of the principle results of that kind of thinking is the importance of making password hashes slow
And the science and mathematics of hashing is only slowly getting better. There really aren't any proofs that any hashes really are hard. @Dietrich's answer is a nice way of illustrating how ideal hash functions might be possible. But just look at the real experts describing how we don't have proofs for any of the best crypto algorithms: What's the mathematical model behind the security claims of symmetric ciphers and digest algorithms?
The fact that LanMan was cited in the question is yet more evidence that we need to avoid idealizing hashes. LanMan is anything but an ideal hash function, easily defeated by a combination of a bit of analysis and a bit of brute forcing. For another popular example of a horrid hash function see MySQL OLD_PASSWORD cryptanalysis?.
So get yourself back out of the trap - falling into it needn't be a one-way trip. Recognize that hashes are reversible, and keep that trusty security-mindset active as you look for the best way to reverse them. That's often the best way to find ones that really are hard to reverse. I'm not trying to cast aspersions on the best practices out there, like bcrypt or PBKDF2 or scrypt. But the evidence is clear that even good programmers get this stuff wrong all too often. so be careful with how you use them and don't try to invent your own.
I'm trying to figure out what you could mean by "all the information is there to find the original password." Do you mean "all the information is there to find a password that will generate the same hash value with the given hash algorithm"? Because the former just isn't true... many hashes lose information.
@LarsH you're right, most hashes lose information, and you may not be able to find the original password. But most of the time you just need a password that results in the same hash, and that is always possible, given enough resources, so long at it is a valid hash. I've updated my answer a bit.
Because that's how Cryptographic Hash Functions work, they are one-way (from plain to hash) mathematical functions. Algorithms are made and tested specifically to avoid that, and also avoid collisions (2 different plain texts generate the same hash).
You can read more on wikipedia, but the main point of the article is:
The ideal cryptographic hash function has four main or significant properties:
- it is easy (but not necessarily quick) to compute the hash value for any given message
- it is infeasible to generate a message that has a given hash
- it is infeasible to modify a message without changing the hash
- it is infeasible to find two different messages with the same hash
Most of the attacks on hash functions are based on finding collisions (so 2 different plain texts will match the same hash) or pre-generating millions of hashes and comparing them until you find the plain that generated it.
Long history short: if a hash algorithm is reverse-engineerable or can be attacked that way, it's not a good hash algorithm.
For passwords, investigating using BCrypt, this post has a great deal of info on it.
Hashes are not designed to avoid collisions. Collisions are always present, in abundance, since there are many more possible input values than output values. As Wikipedia says, the goal is simply to make it infeasible to _find_ the collisions. And as I note in my answer, the unfortunate fact is that only a small number of hash functions have any track record of actually meeting the requirements laid out, despite the many that have been designed and popularized.
This answer basically says "hash functions are one-way because hash functions are one way". You might want to give a more rigorous mathematical explanation of how a hash function works to better describe the _why_ of this fact.
Imagine a hash function that uses a single bit for the hash. So your hash can either be 0 or 1.
And let's say the hash function adds up every byte of data and if the data was even, the hash value is 0. If the data was odd, the hash is 1.
Do you see why you couldn't recover your data by reverse engineering that hash function?
It's the same for actual hash algorithms, only the formulas are significantly better than the function I just described.
Your difficulty may be that you are considering hash as far as their use for passwords. It's not obvious why you cannot recover a 8 character password from a 128 bit hash. But that hash function you use for passwords can also be used to calculate the hash of an entire terabyte of data, and the hash will still take only 128 bits of data. Obviously, you cannot reverse engineer that 128 bit hash and recover your terabyte of data.
Also, assuming you had every possibly permutation of a single terabyte of data, there would be a huge amount of different datas that generate the same hash. After all, if you have more than 2^127 different permutations of data, you become likely to encounter two different datas that have the same hash.
There are algorithms that are inherently not reversible; they change an input A into an output B in such a way that even if you know the exact steps of the algorithm, you can't recover A from B.
A very simple example: convert each character in the password to its ASCII value and sum all the values. There is no way you can recover the original password from the result.
But… you don't need the original password, you just need any password whose hash is the same. In other words you need a string whose sum of ASCII values is the same as the hash value, and that's easy.
Agreed. But the question is asking "why can't it just reverse the process to calculate the password from the hash", not "how can I match the hash even if don't know the password".
As explained in other answers, cryptographic hash functions are difficult to reverse because they are designed in such a way that reversing them is computationally very expensive - not because there are multiple possible answers. In your example, although it is impossible to be certain exactly what the original password was, it is trivial to narrow it down to a relatively small set of passwords, which is a massive security flaw in addition to the one explained by Neil G.
There is one aspect of the problem that people are missing in the previous answers. That is the many-to-one nature of hash functions. Since (most) hash functions are fixed length output (e.g., 256 bit), technically there are infinitely many strings that all hash to the same value.
For example, if you take all all 512 bit strings (of which there are 2^512). There are only 2^256 outputs of the hash function. Thus, for each output of the hash function, there are roughly 2^256 512 bit strings which hash to that value. I say roughly because we don't know if the hash function actually is a random function, it could have slight biases.
Thus, given a digest, there are many strings which hash to the same value. Therefore, if you define "reversing a hash function" as outputting the users password, how is your reversing function going to deal with the potentially infinite number of strings that result in the given digest?
funny thing: a few hours ago (probably before you read the answers) we had the problem of responses only focussing on that aspect of the hashing function completely ignoring the other (more important) points. Anyway, I think that the current answers don't focus on that because the user is talking about passwords which _usually_ have much less possible combinations than the output of most cryptographic hashing functions.
A reversing function can't know which preimage is the original password used by the user, though it will often be pretty clear based on common password practices. But it doesn't have to, since any of the preimages will work as the password.
@nealmcb, true except for in a few certain circumstances. For example, if a salt is used. Only the preimage with the proper salt will work (another reason to use salts). But yes, it will, with overwhelming probability, be the case that the correct preimage will be distinguishable. If, however, there are 2^256 preimages, that would be an impossible amount of data to search through.
@mikeazo A salt doesn't help to counter a preimage attack. If your database has been compromised, then the hacker has both the hashes and the salts, so his workload is identical to if he was operating on a hash without a salt. Instead of computing `preimage(hash)` he computes `preimage(hash||salt)`. What a salt does help counter is dictionary attacks (the hacker will have to launch a separate dictionary attack on each password, rather than one for the entire database), and rainbow tables (the rainbow table won't have included the salt in the computation).
You're asking "why is it important that hash functions be one-way?" It's a security property.
There are two kinds of "hash" (or "message digest" as they're called) in common use today. One is a plain message digest, which you may be familiar with as a checksum algorithm, such as CRC32. The algorithm is designed so that a single bit change in the input will yield a different digest value. The primary purpose of this is to ensure that a message is not been corrupted by accident. CRC32 checksums are present on every TCP/IP packet, and a mis-match results in retransmission to correct the error.
Message digests are often used in cryptography as part of "signing" a message. The message is encrypted by the sender with his private key, and anyone can use the public key to validate that it was encrypted by only the sender. But RSA public key cryptography can only encrypt messages smaller than the key size (256 bytes), which are much shorter than most useful messages. Message digest algorithms produce values smaller than RSA keys. So by encrypting the digest instead of the message, RSA signatures can be used on any size message.
But an ordinary message digest is not secure against an attacker. Consider a very simple checksum that just sums the values of the characters. If you signed such a checksum, I could swap out any other message that yields the same checksum, and the signatures would match, fooling the victim.
Another common use for message digests is password protection during storage. If you encrypt the passwords before storing them in the system, a system administrator who knows the key could decrypt them all. (You may have noticed this problem recently when some web sites were hacked.)
To avoid these problems, a different kind of hash is needed, one that is "cryptographically secure." A secure hash algorithm has two additional properties, collision resistance, and non-reversibility.
Collision resistance means that I should not be able to find a message that produces the same digest. That way I can't swap my evil message for your good message.
Non-reversibility property means that I can't turn a digest back into a plaintext so I can't decrypt the original message, like the user's password.
Creating a digest is a very similar problem to encryption, in that you have to scramble the data in such a way that it leaks no information about the original data. It's even harder, because the same math has to not give any clues about how to successfully create a collision.
Others have explained why good cryptographic hash functions are difficult to reverse - but according to this Wikipedia article, LanMan is poorly designed and can be reversed relatively easily:
Although it is based on DES, a well-studied block cipher, the LM hash is not a true one-way function as the password can be determined from the hash because of several weaknesses in its implementation... By mounting a brute force attack on each half separately, modern desktop machines can crack alphanumeric LM hashes in a few hours... In 2003, Ophcrack, an implementation of the rainbow table technique, was published. It specifically targets the weaknesses of LM encryption, and includes pre-computed data sufficient to crack virtually all alphanumeric LM hashes in a few seconds.
This doesnt really address the actual question. And besides, it's not true that it can be reversed - bruteforce is not the reverse (or inverse) of a hash function.
It answers part of the question - Mucker specifically asked about LanMan, in which it _is_ quite easy to find a matching password given a hash. The point is that this particular algorithm has weaknesses (splitting the password into two parts, and converting lower case letters to upper case) that make it very easy to brute force. Can you explain the distinction you are making between inverting the hash function and brute forcing it - I would call the latter a special case of the former?
Because the OP is asking about the internals of hash functions, he's asking why the function cannot simply be reversed *mathematically* speaking. Brute force is orthogonal to hash inversion, it doesn't care what the actual function *is*. It basically goes around the hash, not back up it.
I really don't understand the distinction you are trying to make. The whole point of a brute force algorithm is to invert the hash. It has exactly the same inputs and outputs as any other (correct) method of inverting the function. It isn't even _necessarily_ the slowest method. If you are pointing out that - if the hash function is multivalued - it cannot be inverted in the strict mathematical sense (because it isn't an injection) - then I agree but that isn't really relevant: a hash function can be injective, in fact it is desirable for collisions to be rare.
@James - no, a brute force does not reverse anything. It tries the entire address space with an algorithm and provides the entire output space. Where there is a match, you can make some assumptions.
I think we are misunderstanding each other. I am using the word 'invert' in the mathematical sense - i.e., 'find the input of a function given its output' (and I am using 'reverse' as a synonym). A brute force method is just one way of doing this - it doesn't matter that we generate lots of other outputs of the function in the process - most algorithms produce useless junk along the way. The OP asked why the password can't be obtained given the hash and the algorithm - and the answer is that it can be - it is just computationally difficult, though in LanMan's case, not difficult enough.
I agree with your underlying point that in the LanMan case, a combination of clever math and brute force produces a reverse function that is more than fast enough for the real world. But even if there were no analysis of the function involved to speed up a brute force approach, from a mathematical standpoint I'd still call a dumb brute-force function a "reverse" function. And certainly a reverse-engineered function. Just not pretty engineering....
I think there are many reasons, but one is obvious: a digest produced by a hash function can never contain infinite information, since the digest has finite bits. But the hash function can be used to hash inputs of infinite information. The input can actually be anything.
The hardship to find out a collision is not the answer. The real hardship is proving your original data is actually the only possible input that matches a certain digest. I think you may never calculate one input and claim it is the only answer to the digest.