How can I test the encoding of a text file... Is it valid, and what is it?
I have several .htm files which open in gedit without any warning or error, but when I open the same files in jEdit, it warns me of invalid UTF-8 encoding.

The HTML meta tag states "charset=ISO-8859-1". jEdit allows a list of fallback encodings and a list of encoding auto-detectors (currently "BOM XML-PI"), so my immediate problem has been resolved. But it got me thinking: what if the metadata weren't there?
When the encoding information is simply not available, is there a CLI program that can make a "best guess" at which encodings may apply?

And, although it is a slightly different issue: is there a CLI program that tests the validity of a known encoding?
Similar to "How to auto detect text file encoding?" http://superuser.com/questions/301552/how-to-auto-detect-text-file-encoding
The `file` command makes "best guesses" about the encoding. Use the `-i` option to make `file` print information about the encoding:
```
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt:    text/plain; charset=utf-16le
umlaut-utf8.txt:     text/plain; charset=utf-8
```
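If you only want the charset and not the full MIME type, `file` also supports `-b` (brief, no filename) combined with `--mime-encoding`. A minimal sketch (the sample filename is just for illustration):

```shell
# Create a small UTF-8 sample file ("ä" is the two bytes 0xC3 0xA4 in UTF-8).
printf '\xc3\xa4\n' > sample-utf8.txt

# -b suppresses the filename; --mime-encoding prints only the charset guess.
file -b --mime-encoding sample-utf8.txt
# prints: utf-8
```

This form is handy in scripts, where you want to capture just the encoding name in a variable.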
Here is how I created the files:
```
$ echo ä > umlaut-utf8.txt
```
Nowadays everything is UTF-8, but convince yourself:
```
$ hexdump -C umlaut-utf8.txt
00000000  c3 a4 0a                                          |...|
00000003
```
Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding
Convert to the other encodings:
```
$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt
```
Check the hex dump:
```
$ hexdump -C umlaut-iso88591.txt
00000000  e4 0a                                             |..|
00000002
$ hexdump -C umlaut-utf16.txt
00000000  ff fe e4 00 0a 00                                 |......|
00000006
```
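Note the leading `ff fe` in the UTF-16 file: that is the byte-order mark (BOM), which announces little-endian UTF-16; "ä" then follows as `e4 00`. The BOM is one of the few reliable encoding signals a detector can use (it is what jEdit's "BOM" auto-detector looks for). A quick sketch that isolates it:

```shell
# Recreate the UTF-16LE file byte-for-byte: BOM, then "ä", then newline.
printf '\xff\xfe\xe4\x00\x0a\x00' > umlaut-utf16.txt

# Show just the first two bytes: the byte-order mark.
head -c 2 umlaut-utf16.txt | od -An -tx1
# prints: ff fe
```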
Create something "invalid" by mixing all three:
```
$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt
```
```
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-mixed.txt:    application/octet-stream; charset=binary
umlaut-utf16.txt:    text/plain; charset=utf-16le
umlaut-utf8.txt:     text/plain; charset=utf-8
```
```
$ file *
umlaut-iso88591.txt: ISO-8859 text
umlaut-mixed.txt:    data
umlaut-utf16.txt:    Little-endian UTF-16 Unicode text, with no line terminators
umlaut-utf8.txt:     UTF-8 Unicode text
```
The `file` command has no notion of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. As humans, we might recognize that a file is a text file with some umlauts in a "wrong" encoding, but a computer would need some sort of artificial intelligence for that. One might argue that the heuristics of `file` are a kind of artificial intelligence, but even so, it is a very limited one.
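That said, for your second question, validity against a *known* encoding can be tested mechanically, without any guessing. One way is to round-trip the file through `iconv`: it exits non-zero as soon as it hits a byte sequence that is illegal in the source encoding. A sketch, using a deliberately invalid file (the filenames are just for illustration):

```shell
# 0xE4 is "ä" in ISO-8859-1, but a bare 0xE4 is an illegal byte in UTF-8.
printf '\xe4\n' > bad-utf8.txt

# iconv converts UTF-8 to UTF-8 (a no-op) and fails on the first invalid sequence.
if iconv -f UTF-8 -t UTF-8 bad-utf8.txt > /dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "invalid UTF-8"
fi
# prints: invalid UTF-8
```

The same pattern works for other encodings by changing the `-f`/`-t` arguments, though note that in single-byte encodings like ISO-8859-1 every byte is "valid", so there is nothing for such a check to reject.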
Here is more information about the `file` command.
Thanks, that worked... I had tried `file`, but without any option :( ... I've now also tried a mix of UTF-16, UTF-8, and ISO-8859-1. `file -i` reported `unknown-8bit`. So this also seems to be the answer to: "How to detect an invalid/unknown encoding".