How can I test the encoding of a text file... Is it valid, and what is it?

  • I have several .htm files that open in Gedit without any warning or error, but when I open the same files in Jedit, it warns me of invalid UTF-8 encoding...

    The HTML meta tag states "charset=ISO-8859-1". Jedit allows a list of fallback encodings and a list of encoding auto-detectors (currently "BOM XML-PI"), so my immediate problem has been resolved. But it got me thinking: what if the metadata weren't there?

    When the encoding information is simply not available, is there a CLI program that can make a best guess about which encodings may apply?

    And, although it is a slightly different issue, is there a CLI program that tests the validity of a known encoding?

  • lesmana (correct answer)

    The file command makes a "best guess" about the encoding. Use the -i option to have file print the MIME type, which includes the charset.

    Demonstration:

    $ file -i *
    umlaut-iso88591.txt: text/plain; charset=iso-8859-1
    umlaut-utf16.txt:    text/plain; charset=utf-16le
    umlaut-utf8.txt:     text/plain; charset=utf-8
    

    Here is how I created the UTF-8 file (the other two are converted from it further down):

    $ echo ä > umlaut-utf8.txt 
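
    This assumes the terminal is running a UTF-8 locale, so that echo writes UTF-8 bytes; you can check with locale (on a UTF-8 system it prints UTF-8):

    $ locale charmap
    UTF-8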
    

    Nowadays nearly everything is UTF-8, but convince yourself:

    $ hexdump -C umlaut-utf8.txt 
    00000000  c3 a4 0a                                          |...|
    00000003
    

    Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding
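
    As a quick sanity check of that table: ä is code point U+00E4, and a two-byte UTF-8 sequence packs the code point bits into the pattern 110xxxxx 10xxxxxx. Decoding c3 a4 by hand with shell arithmetic recovers the code point:

    $ printf '%x\n' $(( (0xc3 & 0x1f) << 6 | (0xa4 & 0x3f) ))
    e4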

    Convert to the other encodings:

    $ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt 
    $ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt 
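
    iconv needs explicit source and target encodings. iconv -l lists the encoding names it accepts; GNU iconv is forgiving about case and punctuation in these names, which is why the shorthands utf8 and iso88591 above work. The list is long and varies by system, so its output is omitted here:

    $ iconv -l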
    

    Check the hex dump:

    $ hexdump -C umlaut-iso88591.txt 
    00000000  e4 0a                                             |..|
    00000002
    $ hexdump -C umlaut-utf16.txt 
    00000000  ff fe e4 00 0a 00                                 |......|
    00000006
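
    In the UTF-16 dump, the leading ff fe is the little-endian byte order mark (BOM), e4 00 is the ä, and 0a 00 is the newline. Converting back to UTF-8 drops the BOM and reproduces the original three bytes:

    $ iconv -f utf16 -t utf8 umlaut-utf16.txt | hexdump -C
    00000000  c3 a4 0a                                          |...|
    00000003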
    

    Create something "invalid" by mixing all three:

    $ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt 
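
    A hex dump shows the result is just the eleven bytes of the three files back to back:

    $ hexdump -C umlaut-mixed.txt 
    00000000  e4 0a c3 a4 0a ff fe e4  00 0a 00                 |...........|
    0000000b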
    

    What file says:

    $ file -i *
    umlaut-iso88591.txt: text/plain; charset=iso-8859-1
    umlaut-mixed.txt:    application/octet-stream; charset=binary
    umlaut-utf16.txt:    text/plain; charset=utf-16le
    umlaut-utf8.txt:     text/plain; charset=utf-8
    

    without -i:

    $ file *
    umlaut-iso88591.txt: ISO-8859 text
    umlaut-mixed.txt:    data
    umlaut-utf16.txt:    Little-endian UTF-16 Unicode text, with no line terminators
    umlaut-utf8.txt:     UTF-8 Unicode text
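
    If you want just the charset without the MIME type, reasonably recent versions of file also support --mime-encoding:

    $ file --mime-encoding *
    umlaut-iso88591.txt: iso-8859-1
    umlaut-mixed.txt:    binary
    umlaut-utf16.txt:    utf-16le
    umlaut-utf8.txt:     utf-8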
    

    The file command has no notion of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. As humans we may be able to recognize that a file is a text file with some umlauts in a "wrong" encoding, but a computer would need some sort of artificial intelligence to reach the same conclusion.

    One might argue that the heuristics of file are a sort of artificial intelligence; even if so, it is a very limited one.
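
    As for the side question of testing validity against a known encoding: iconv works as a validator, because it exits with an error as soon as it hits a byte sequence that is illegal in the source encoding. A minimal sketch using the files from above (the exact error text depends on your iconv implementation):

    $ iconv -f utf-8 -t utf-8 umlaut-utf8.txt > /dev/null && echo valid || echo invalid
    valid
    $ iconv -f utf-8 -t utf-8 umlaut-mixed.txt > /dev/null && echo valid || echo invalid
    iconv: illegal input sequence at position 0
    invalid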

    Here is more information about the file command: http://www.linfo.org/file_command.html

    Thanks, that worked... I had tried `file`, but without any option :( ... I've now also tried a mix of UTF-16, UTF-8 and ISO-8859-1; `file -i` reported `unknown-8bit`. So this also seems to be the answer to: "How to detect an invalid/unknown encoding"

    For those who get here on a Mac: it's `file -I`, with a capital 'I' instead of a lowercase one.

Licensed under CC BY-SA with attribution

