How to determine what type of encoding/encryption has been used?
Is there a way to find what type of encryption/encoding is being used? For example, I am testing a web application which stores the password in the database in an encrypted format (WeJcFMQ/8+8QJ/w0hHh+0g==). How do I determine what hashing or encryption is being used?
Some content on (or links to) a methodology would be in order here, explaining how to identify certain types of crypto or encoding in a completely zero-knowledge scenario. Most of these answers say "it's impossible", and my gut feeling tells me that nothing in our industry is impossible.
@atdre Thanks for the bounty. The question seems focussed on password hashing formats - is that your focus also? That seems best to me, and if people want to answer the question for file formats, they can ask another question.
@atdre: "Impossible" is usually a shortcut for "infeasible with current technology/won't finish before the heat death of the universe".
I asked a similar question on SE: http://stackoverflow.com/questions/988642/how-would-i-reverse-engineer-a-cryptographic-algorithm
Your example string (WeJcFMQ/8+8QJ/w0hHh+0g==) is the Base64 encoding of a sequence of 16 bytes, which do not look like meaningful ASCII or UTF-8. If this is a value stored for password verification (i.e. not really an "encrypted" password, rather a "hashed" password) then this is probably the result of a hash function computed over the password; the one classical hash function with a 128-bit output is MD5. But it could be just about anything.
The "normal" way to find out is to look at the application code. Application code exists in a tangible, bulky form (executable files on a server, source code somewhere...) which is not, and cannot be, protected as well as a secret key can. So reverse engineering is the "way to go".
Barring reverse engineering, you can make a few experiments to try to make educated guesses:
- If the same user "changes" his password but reuses the same one, does the stored value change? If yes, then part of the value is probably a randomized "salt" or IV (assuming symmetric encryption).
- Assuming that the value is deterministic from the password for a given user, if two users choose the same password, does it result in the same stored value? If no, then the user name is probably part of the computation. You may want to try to compute MD5("username:password") or other similar variants, to see if you get a match.
- Is the password length limited? Namely, if you set a 40-character password and cannot successfully authenticate by typing only the first 39 characters, then this means that all characters are important, and this implies that this really is password hashing, not encryption (the stored value is used to verify a password, but the password cannot be recovered from the stored value alone).
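The second experiment can be automated. Below is a minimal sketch, assuming you can set a known password for a test account and read back the Base64-encoded stored value. The particular ways of combining user name and password, and the list of algorithms tried, are guesses for illustration only; the function names are hypothetical.

```python
import base64
import hashlib


def candidate_inputs(username, password):
    """Yield a few hypothetical ways an application might combine the fields."""
    yield password
    yield f"{username}:{password}"
    yield f"{password}:{username}"
    yield username + password
    yield password + username


def find_matches(stored_b64, username, password,
                 algorithms=("md5", "sha1", "sha256")):
    """Return every (algorithm, input) pair whose digest equals the stored value."""
    stored = base64.b64decode(stored_b64)
    matches = []
    for algo in algorithms:
        for text in candidate_inputs(username, password):
            digest = hashlib.new(algo, text.encode("utf-8")).digest()
            if digest == stored:
                matches.append((algo, text))
    return matches


if __name__ == "__main__":
    # Demo value built here for illustration; it is NOT the value from the question.
    demo_stored = base64.b64encode(
        hashlib.md5(b"alice:hunter2").digest()
    ).decode("ascii")
    print(find_matches(demo_stored, "alice", "hunter2"))
```

An empty result does not prove that the value is not a hash; it only means none of the guessed combinations matched (a salt, a different character encoding, or an iterated hash would all defeat this check).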
Thanks for the input. Please tell me more about how you confirmed it's a Base64 encoding of a sequence of 16 bytes. **Regarding your experiments:** Yes, this is a value stored for password verification. 1) If a user changes password, then the stored value changes too. 2) If two users choose the same password, the stored value is the same. 3) Password length is not limited.
@Learner: _any_ sequence of 24 characters, such that the first 22 are letters, digits, '+' or '/', and the last two are '=' signs, is a valid Base64 encoding of a 128-bit value. And any 128-bit value, when encoded with Base64, yields such a sequence.
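To check this yourself, decode the string and count the bytes. A small sketch using only the Python standard library (the function name is just for illustration):

```python
import base64


def describe_base64(value):
    """Decode a Base64 string and report its size and hex representation."""
    raw = base64.b64decode(value, validate=True)  # rejects non-Base64 characters
    return {"bytes": len(raw), "bits": len(raw) * 8, "hex": raw.hex()}


info = describe_base64("WeJcFMQ/8+8QJ/w0hHh+0g==")
print(info)
# -> {'bytes': 16, 'bits': 128, 'hex': '59e25c14c43ff3ef1027fc3484787ed2'}
```

Every 4 Base64 characters carry 3 bytes; the two `=` signs say that the last group of 4 carries only 1 byte, so 24 characters give 5 * 3 + 1 = 16 bytes.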
This is the right answer, even though I wanted to hear some potential tricks if the application code isn't available. If you are dealing with a closed-source binary application -- check out http://aluigi.org/mytoolz.htm#signsrch
If it is MD5 you could try any of the MD5 cracking websites like http://www.md5decrypter.co.uk/. None of these are fast or guaranteed to give a result. Google for "md5 cracking" to find more. That will also give you an extended list at http://www.stottmeister.com/blog/2009/04/14/how-to-crack-md5-passwords/
Edit: I just noticed a very cool script named hashID. The name pretty much describes it.
Generally speaking, using experience to make educated guesses is how these things are done.
Here is a list with a very large number of hash outputs, so that you know what each one looks like and can create signatures/patterns or just verify by eye.
There are two main things you first pay attention to:
- the length of the hash (each hash function has a specific output length)
- the alphabet used (is it all English letters? only digits 0-9 and letters A-F, so hex? which special characters are there, if any?)
Several password cracking programs (John the Ripper, for example) apply some pattern matching on the input to guess the algorithm used, but this only works on generic hashes. For example, if you take any hash output and rotate each letter by 1, most pattern matching schemes will fail.
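To make that fragility concrete, here is a deliberately naive signature check, sketched under the assumption that the value is a plain hex digest. The length table is a small illustrative subset (other functions share these lengths), and both function names are hypothetical; real tools use far larger signature sets.

```python
import hashlib
import re

# Hex digest lengths (in characters) for a few well-known hash functions.
# Illustrative subset only: other functions produce the same lengths.
HEX_LENGTHS = {32: ["MD5"], 40: ["SHA-1"], 64: ["SHA-256"], 128: ["SHA-512"]}


def guess_from_hex(text):
    """Naive signature check: hex alphabet plus a known length."""
    if not re.fullmatch(r"[0-9a-fA-F]+", text):
        return []
    return HEX_LENGTHS.get(len(text), [])


def rotate_letters(text):
    """Shift every ASCII letter forward by one position (z wraps to a)."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + 1) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + 1) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


digest = hashlib.md5(b"password").hexdigest()
print(guess_from_hex(digest))                  # the plain digest is recognised
print(guess_from_hex(rotate_letters(digest)))  # the rotated one is not
```

The rotated string turns every `f` into a `g`, which is outside the hex alphabet, so the check gives up even though the underlying algorithm has not changed.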
Thank you. I took a look and tried it with a few passwords. But the method only works if we already know the password is hashed and not encrypted, right?
Passwords are usually not encrypted, they are hashed, and usually represented in the form the function outputs them, so you can find many of those forms in the links above. In the end, the output of all functions is just binary digits, which are usually represented and stored either in hex or in Base64.
What you have posted is 16 bytes (128 bits) of Base64-encoded data. The fact that it is Base64 encoded doesn't tell us much, because Base64 is not an encryption or hashing algorithm; it is a way to encode binary data as text. The block does carry one useful piece of information, though: the output is 16 bytes long. We can compare this to the output or block sizes of commonly used schemes and figure out what it can't be. By far the most common schemes are:
The next thing we need to do is to look at other blocks of cipher text to figure out the answer to the following question:
- Are all cipher texts the same length, even for different input lengths?
If not all blocks are the same length, then you aren't looking at a hashing algorithm but an encryption one. Since the output of a block cipher is always a multiple of the underlying block size, a ciphertext whose length is not evenly divisible by 16 bytes can't be AES and is therefore most likely DES or 3DES.
If you can enter a password and observe the output, this can be determined very quickly. Just enter a 17-character password and look at the length of the stored value. If it's 16 bytes you have MD5, 20 bytes means SHA-1, 24 bytes means DES or 3DES, and 32 bytes means AES.
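The 17-character probe works because a block cipher pads its input up to the next multiple of the block size, while a hash always gives the same output length. A rough sketch of that arithmetic follows; it assumes PKCS#7-style padding, no IV or salt stored alongside the ciphertext, and only the four schemes named above, and the function names are hypothetical.

```python
import base64

# Mapping taken from the paragraph above; it assumes only these schemes are in play.
LENGTH_HINTS = {16: "MD5", 20: "SHA-1", 24: "DES or 3DES", 32: "AES"}


def padded_length(plaintext_len, block_size):
    """Ciphertext length under PKCS#7-style padding (always adds 1..block_size bytes)."""
    return (plaintext_len // block_size + 1) * block_size


def interpret_probe(stored_b64):
    """Interpret the stored value produced by a 17-character test password."""
    size = len(base64.b64decode(stored_b64))
    return size, LENGTH_HINTS.get(size, "unknown")


print(padded_length(17, 8))    # DES/3DES use 8-byte blocks
print(padded_length(17, 16))   # AES uses 16-byte blocks

# Demo value built here for illustration: 24 zero bytes, Base64 encoded.
demo = base64.b64encode(bytes(24)).decode("ascii")
print(interpret_probe(demo))
```

A stored IV, a salt, or an authentication tag would add to these lengths, so treat an "unknown" result as a hint to look for extra fields rather than as a dead end.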
:-p I don't get it. Is it Base64-encoded data? How can you tell? And I definitely didn't get the part where you say "if any blocks are not evenly divisible by 16 bytes it is probably DES or 3DES, otherwise AES most likely". Please shed some light on this. :)
@The Learner I've added to the answer to hopefully make it clearer. If you can use chosen plaintext you can probably work it out from this.
Thank you so much. I'm starting to get the second part. After asking this question here, I read that hashing is many-to-one, meaning that many different texts can hash to the same output (please correct me if I am wrong). Coming to our scenario, the ciphertext blocks are not all the same length, so can I now be sure that they are encrypted rather than hashed? But I still don't get how you determined that the data I posted is 16 bytes of Base64-encoded data. Given a ciphertext or hash, how would I tell whether it is 16 or 32 bytes of data?
@the learner The = as padding is characteristic of Base64. This tool http://home2.paulschou.net/tools/xlate/ (from a Google search for "base64 decoder hex") will convert from Base64 to an array of hex values. Just paste your text into the Base64 box, click decode, and count the bytes. You can also estimate the size by taking the number of characters in the Base64 string and multiplying by 0.75.
@Karthik, there are very few ways to represent a number (which is what the end result of hashing or encrypting will be) as compact, human-readable text. The dominant ways are hexadecimal (A-F, 0-9) and Base64 (A-Z, a-z, 0-9, +, /). There's also technically the possibility of trying to view the data as a text encoding (ASCII, UTF-8, etc.), although that is typically just a sign of not knowing how to view the data (e.g., trying to open a binary file in Notepad). You'd mostly recognize the representation by simply looking at what kinds of characters appear and taking a guess from there.
Then base 64 also has the telltale padding (the `=`s). That's really the biggest indicator of being base64. But then you have to remember that many places don't store *just* a hash, but actually have a text representation that includes separators and other fields (eg, for salt, an ID of which hashing algorithm was used, etc). That frankly allows for a lot more information to be gleaned, but doesn't apply here. Important to keep in mind since those other characters aren't a part of how the data is encoded.
It depends upon the format - some protocols for storing encrypted text have a cleartext portion that defines how it's encrypted. From your example, I'm doubtful since the string you reference is so short that it looks like it's just the encrypted text.
I'd suggest a couple thoughts:
The "==" on the end would definitely be padding, so don't include that in any decryption attempts.
You may be dealing with a hash or a salted hash, rather than encryption. In that case, trying to "decrypt" the data won't work - you need to match passwords by using the same hash and/or salt value that was used originally. There is no way with a salted password to get the original value.
Your absolute best bet is to get a copy of the code that is used to store the passwords. Somewhere in there, the passwords are undergoing a cryptographic operation. Find the code to learn what's happening here. 9 times out of 10, they are using some sort of API for the hashing/salting/encryption and you can imitate or reverse it using the same API.
The `==` on the end is a tell-tale sign of the value being base64 encoded. The padding is necessary; you don't just *drop* it. First you base64-decode the data, THEN play with it.
Encoding can generally be guessed at. For example, the string you posted in your question is Base64 encoded. The equals signs are padding in the Base64 scheme. That's something I know on-sight from experience.
If you gave me a string that was encrypted, I may be able to tell you the encoding but I can't tell you the algorithm used to encrypt it unless some sort of metadata is available. The reason is this: encryption algorithms work by producing what appears to be random data. If I encrypted two sentences each with two ciphers (four outputs), you would be unable to tell me with any confidence which ciphertext belonged to which cipher unless you decrypted it or broke the cipher.
In regards to your specific instance, passwords are usually hashed. That means you can't recover the password from the hash, but you can test to see if the hash matches for the password. In that regard, @john's answer is golden. If you can input a password that you know and then try common schemes against it, you can learn what the hash used is.
If this is indeed a simple password hash, we might be able to use Google to crack it. Base64 is hard to search for, though, with all those slashes and plus signs, so let's first convert that hash into hexadecimal:
$ perl -MMIME::Base64 -le 'print unpack "H*", decode_base64 "WeJcFMQ/8+8QJ/w0hHh+0g=="'
59e25c14c43ff3ef1027fc3484787ed2
Unfortunately (or perhaps fortunately, depending on your perspective), we're not lucky enough to actually find a preimage (the site currently lists this hash as "cracking..."), but the fact that it's on that list at all does strongly suggest that it's indeed an unsalted MD5 hash of a real password.
The only way is to guess. With experience, your guesses will become more accurate.
For example: Based on length of output: MD5 output is 128 bits, or 16 bytes, SHA1 output is 160 bits, or 20 bytes. Based on charset of output: BASE64 produces output with printable characters.
At the end of the day, it's the trial-and-error approach that teaches you how.
The only way is when there's some metadata that tells you. For instance, I've been working with PDFs lately, and the format includes a dictionary containing the filter, algorithm, key size etc. But if all you've got is the ciphertext, then all you've got is some opaque blob of data.
This is very weak security on all fronts! The plaintext is P4$$w0rdP4$$w0rd and it's encrypted using XOR encryption, with the key CdZ4MLMPgYtAE9gQ80gMtg==. This produces the ciphertext posted by the OP above, WeJcFMQ/8+8QJ/w0hHh+0g==.
First, use xxd to get the underlying binary of the plaintext:
echo -n 'P4$$w0rdP4$$w0rd' | xxd -b -c16
01010000 00110100 00100100 00100100 01110111 00110000 01110010 01100100 01010000 00110100 00100100 00100100 01110111 00110000 01110010 01100100
Next, base64-decode the key and use xxd to get the underlying binary of the key:
echo -n 'CdZ4MLMPgYtAE9gQ80gMtg==' | base64 -d | xxd -b -c16
00001001 11010110 01111000 00110000 10110011 00001111 10000001 10001011 01000000 00010011 11011000 00010000 11110011 01001000 00001100 10110110
Now, XOR the two binary strings:
      01010000 00110100 00100100 00100100 01110111 00110000 01110010 01100100 01010000 00110100 00100100 00100100 01110111 00110000 01110010 01100100 (plaintext)
[XOR] 00001001 11010110 01111000 00110000 10110011 00001111 10000001 10001011 01000000 00010011 11011000 00010000 11110011 01001000 00001100 10110110 (key)
      -----------------------------------------------------------------------------------------------------------------------------------------------
      01011001 11100010 01011100 00010100 11000100 00111111 11110011 11101111 00010000 00100111 11111100 00110100 10000100 01111000 01111110 11010010 (ciphertext)
Finally, use bc, xxd, and base64 to convert the binary ciphertext to base64:
echo "obase=16; ibase=2; 01011001111000100101110000010100110001000011111111110011111011110001000000100111111111000011010010000100011110000111111011010010" | bc | xxd -p -r | base64
This produces WeJcFMQ/8+8QJ/w0hHh+0g==, which is the ciphertext posted by the OP in the question above.
I apologize if this answer seems contrived. Admittedly, it is. Questions similar to this, where the poster provides only some ciphertext and asks for some insight as to how that ciphertext could have been produced, seem to come up quite frequently on security.stackexchange.com, and this question is often referenced as a duplicate of those. The point of this answer is to illustrate that questions of this nature are unanswerable, because there are infinitely many solutions to these types of questions.
Wait, so did you just pick a password, pick a "hash" method (XOR), and then brute force for a key that produced the given ciphertext? Your last sentence there is gold but I think might be worth emphasizing a bit more. Someone who doesn't pay close attention could easily walk away thinking that you have "solved" the problem.
@Conor Mancone, Thanks for the feedback. XOR encryption can be 'reverse engineered', so that if you know 2 of the 3 variables (i.e. plaintext, key, ciphertext), you can easily find the third. [plaintext xor key = ciphertext, ciphertext xor key = plaintext, plaintext xor ciphertext = key]. I just made up a plaintext (P4$$w0rdP4$$w0rd), then used the third equation above to find the key that would produce the ciphertext that the OP posted, given the plaintext that I chose. I could have chosen any plaintext.
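The same derivation can be sketched in a few lines of Python. The first plaintext is the one made up above; the second is an arbitrary 16-byte string chosen here only to show that any plaintext of the right length "works". The helper name is hypothetical.

```python
import base64


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


ciphertext = base64.b64decode("WeJcFMQ/8+8QJ/w0hHh+0g==")

# plaintext xor ciphertext = key
plaintext = b"P4$$w0rdP4$$w0rd"
key = xor_bytes(plaintext, ciphertext)
print(base64.b64encode(key).decode("ascii"))  # -> CdZ4MLMPgYtAE9gQ80gMtg==

# Any other 16-byte plaintext yields a different key that "explains"
# the same ciphertext equally well.
other_plaintext = b"0123456789abcdef"
other_key = xor_bytes(other_plaintext, ciphertext)
print(xor_bytes(other_plaintext, other_key) == ciphertext)  # -> True
```

This is exactly why the ciphertext alone proves nothing: for every candidate plaintext there is a key that maps it to the observed bytes.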