How to detect the encoding of a file?

  • On my filesystem (Windows 7) I have some text files (These are SQL script files, if that matters).

    When opened with Notepad++, in the "Encoding" menu some of them are reported to have an encoding of "UCS-2 Little Endian" and some of "UTF-8 without BOM".

What is the difference here? They all seem to be perfectly valid scripts. How can I tell what encoding a file has without Notepad++?

    There is a pretty simple way using Firefox. Open your file using Firefox, then View > Character Encoding. Detailed here.

Use heuristics. Check out `enca` and `chardet` for POSIX systems.

I think an alternative answer is TRIAL and ERROR. `iconv` in particular is useful for this purpose. Essentially you run the corrupted strings/text through different encodings and see which one works. You win when the characters are no longer corrupted. I'd love to answer here with a programmatic example, but unfortunately it's a protected question.
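The trial-and-error approach can be sketched in Python (a stand-in for running `iconv -f <encoding>` repeatedly; the candidate list here is an assumption, adjust it for the encodings you expect):

```python
# Trial and error: try decoding the raw bytes with several candidate
# encodings and collect the ones that decode without errors.
# Note: latin-1 accepts every possible byte, so it never "fails" --
# which is exactly why this approach narrows things down but can't
# always give a unique answer.
CANDIDATES = ["utf-8", "utf-16", "latin-1", "cp1252"]

def plausible_encodings(raw):
    ok = []
    for enc in CANDIDATES:
        try:
            raw.decode(enc)          # roughly: iconv -f enc file
            ok.append(enc)
        except UnicodeDecodeError:
            pass
    return ok

# UTF-16 bytes (with BOM) are rejected by utf-8 but pass the others.
raw = "SELECT 1;".encode("utf-16")
print(plausible_encodings(raw))  # ['utf-16', 'latin-1', 'cp1252']
```

The final check (are the characters "no longer corrupted"?) still needs a human eye, since several encodings can decode the same bytes to different, equally valid-looking text.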

FF uses the Mozilla Charset Detectors. Another simple way is to open the file with MS Word; it guesses the encoding quite correctly, even for various ancient Chinese and Japanese codepages.

If `chardet` or `chardetect` is not available on your system, you can install the package via your package manager (e.g. `apt search chardet` — on Ubuntu/Debian the package is usually called `python-chardet` or `python3-chardet`) or via *pip* with `pip install chardet` (or `pip install cchardet` for the faster C-optimized version).
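For reference, the Python API is as simple as the CLI (a sketch; `chardet` is a third-party package, so the crude BOM-check fallback here only guards against a missing install and is not equivalent to chardet's detection):

```python
# Guess an encoding with chardet's detect(); fall back to a crude
# BOM check if the third-party package is not installed.
data = "création de table".encode("utf-8")

try:
    import chardet
    guess = chardet.detect(data)["encoding"]  # a string, or None if unknown
except ImportError:
    guess = "utf-8-sig" if data.startswith(b"\xef\xbb\xbf") else None

print(guess)
```

Like all detectors it works on statistics, so short samples (such as this one) give less reliable guesses than whole files.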

  • Files generally indicate their encoding with a file header. There are many examples here. However, even reading the header you can never be sure what encoding a file is really using.

For example, a file with the first three bytes 0xEF,0xBB,0xBF is probably a UTF-8 encoded file. However, it might be an ISO-8859-1 file which happens to start with the characters ï»¿. Or it might be a different file type entirely.

    Notepad++ does its best to guess what encoding a file is using, and most of the time it gets it right. Sometimes it does get it wrong though - that's why that 'Encoding' menu is there, so you can override its best guess.

    For the two encodings you mention:

    • The "UCS-2 Little Endian" files are UTF-16 files (based on what I understand from the info here) so probably start with 0xFF,0xFE as the first 2 bytes. From what I can tell, Notepad++ describes them as "UCS-2" since it doesn't support certain facets of UTF-16.
    • The "UTF-8 without BOM" files don't have any header bytes. That's what the "without BOM" bit means.
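Those header bytes can be checked directly with the BOM constants in Python's `codecs` module (a minimal sketch; as the "UTF-8 without BOM" case shows, the absence of a BOM proves nothing):

```python
import codecs

# Map well-known BOM byte sequences to a human-readable label.
# Order matters: the UTF-32 LE BOM begins with the UTF-16 LE BOM,
# so the longer sequence must be tested first.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8 with BOM"),                              # EF BB BF
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),                               # FF FE 00 00
    (codecs.BOM_UTF16_LE, "UTF-16 LE (what Notepad++ calls UCS-2)"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),                               # FE FF
]

def sniff_bom(first_bytes):
    for bom, label in BOMS:
        if first_bytes.startswith(bom):
            return label
    return "no BOM (could be UTF-8, ASCII, ISO-8859-1, ...)"

print(sniff_bom(b"\xff\xfeS\x00E\x00"))  # UTF-16 LE
print(sniff_bom(b"SELECT 1;"))           # no BOM
```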

    Why would a file that starts with a BOM be auto-detected as "UTF-8 without BOM"?

And if a file started with 0xFF,0xFE it should be auto-detected as UTF-16, not UCS-2. "UCS-2" is probably guessed because the file contains mainly ASCII characters and thus every other byte is null.
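The "every other byte is null" observation is easy to verify (a sketch of the kind of heuristic a detector might use, not Notepad++'s actual algorithm):

```python
# ASCII-only text encoded as UTF-16 LE: each character becomes a
# two-byte code unit, and for ASCII the high byte is always 0x00.
raw = "SELECT * FROM t;".encode("utf-16-le")  # no BOM with the -le codec
print(raw[:8])                 # b'S\x00E\x00L\x00E\x00'
print(raw.count(0), len(raw))  # exactly half of the bytes are null
```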

@MichaelBorgwardt You are definitely right on the UTF-2. The UCS-2/UTF-16 is a bit less clear. Will update my answer.

    Gah, meant to say "UTF-8" not "UTF-2" in my previous comment.

    With experience, alas, metadata (“headers”) can also be wrong. The database holding the information could be corrupted, or the original uploader could have got this wrong. (This has been a significant problem for us in the past few months; some data was uploaded as “UTF-8” except it was “really ISO8859-1, since they're the same really?!” Bah! Scientists should be kept away from origination of metadata; they just get it wrong…)

Actually I think it's "funny" that the encoding problem still exists in 2014, since no file in the world will begin with "" and I'm very surprised when I see an HTML page that has been loaded with the wrong encoding. It's a matter of probability: it's unthinkable to pick the wrong encoding when another encoding would avoid strange characters. Looking for the encoding that avoids strange characters would work in 99.9999% of cases, I guess. But still there are errors. Also, the advice to use ASCII instead of UTF-8 to save space is very confusing; this idea of performance confuses junior developers.

Floppy disks became obsolete. Encodings are still all here.. :o

    "no file in the world" sounds to me like "no-one would ever do that".

    1k like for "Notepad++ does its best to guess what encoding a file is using".

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM