Should UTF-16 be considered harmful?

  • I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?"

    Why do I ask this question?

    How many programmers are aware of the fact that UTF-16 is actually a variable length encoding? By this I mean that there are code points that, represented as surrogate pairs, take more than one element.
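    To make this concrete, here is a minimal C++ sketch (my own illustration, not part of the original report) showing that a single code point outside the BMP occupies two UTF-16 code units, which is where naive "one element = one character" code goes wrong:

      #include <iostream>
      #include <string>

      int main()
      {
          std::u32string cp  = U"\U0001D11E";  // U+1D11E as one code point
          std::u16string u16 = u"\U0001D11E";  // the same character in UTF-16

          std::cout << cp.size()  << '\n';     // 1 code point
          std::cout << u16.size() << '\n';     // 2 code units: a surrogate pair

          // Looking at u16[0] alone sees only the high surrogate 0xD834,
          // which is exactly the mistake behind the editing bugs listed below.
          std::cout << std::hex << static_cast<unsigned>(u16[0]) << ' '
                    << static_cast<unsigned>(u16[1]) << '\n';    // d834 dd1e
      }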

    I know; lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, Win32 APIs, Qt GUI libraries, the ICU Unicode library, etc. However, with all of that, there are lots of basic bugs in the processing of characters outside the BMP (characters that have to be encoded using two UTF-16 elements).

    For example, try to edit one of these characters:

    • 𝄞 (U+1D11E) MUSICAL SYMBOL G CLEF
    • 𝕥 (U+1D565) MATHEMATICAL DOUBLE-STRUCK SMALL T
    • 𝟶 (U+1D7F6) MATHEMATICAL MONOSPACE DIGIT ZERO
    • 𠂊 (U+2008A) Han Character

    You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference.

    For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:

    • Opera has problems with editing them (deleting required 2 presses of Backspace)
    • Notepad can't deal with them correctly (deleting required 2 presses of Backspace)
    • File name editing in Windows dialogs is broken (deleting required 2 presses of Backspace)
    • All Qt3 applications can't deal with them: they show two empty squares instead of one symbol.
    • Python encodes such characters incorrectly when used directly: u'X'!=unicode('X','utf-16') on some platforms when X is a character outside the BMP.
    • Python 2.5 unicodedata fails to get properties of such characters when Python is compiled with UTF-16 Unicode strings.
    • StackOverflow seems to remove these characters from the text if edited directly as Unicode characters (these characters are shown using HTML Unicode escapes).
    • WinForms TextBox may generate an invalid string when limited with MaxLength.

    It seems that such bugs are extremely easy to find in many applications that use UTF-16.

    So... Do you think that UTF-16 should be considered harmful?

    I tried copying the characters to a filename and tried to delete them and had no problems. Some Unicode characters read right to left and keyboard input handling sometimes changes to accommodate that (depending on the program used). Can you post the numeric codes for the specific characters you are having trouble with?

    Have you tried working with them in Notepad to see how this works? For example, edit a file name containing one of these characters, put the cursor to the right of the character and press Backspace. You'll see that in both Notepad and the file-name editing dialog it takes two presses of Backspace to remove the character.

    The double backspace behavior is mostly intentional http://blogs.msdn.com/michkap/archive/2005/12/21/506248.aspx

    Not really correct. Let me explain: if you write "שָׁ", the compound character that consists of "ש",‎ "ָ" and "ׁ" (the vowel marks), then removing each one of them is logical: you remove one code point when you press Backspace, and you remove the whole character including the vowels when you press Del. But you never produce an **illegal** state of the text -- illegal code points. Thus, the situation where you press Backspace and get illegal text is incorrect.
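    (To illustrate the "illegal state" point: a backspace that simply drops the last UTF-16 code unit can leave an unpaired surrogate behind. A correct implementation has to check for surrogates, roughly as in this sketch; this is my own illustration of the failure mode, not code from any of the applications mentioned.)

      #include <cstddef>
      #include <string>

      static bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
      static bool is_low_surrogate (char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

      // Remove the last *code point* from a UTF-16 buffer. Erasing only one
      // code unit, as the buggy editors do, would leave an illegal lone surrogate.
      void backspace(std::u16string& text)
      {
          if (text.empty())
              return;

          std::size_t n = 1;
          if (is_low_surrogate(text.back()) && text.size() >= 2
              && is_high_surrogate(text[text.size() - 2]))
              n = 2;                     // drop the whole surrogate pair

          text.erase(text.size() - n);
      }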

    Are you referring to how sin and shin are composed of two code points, and by deleting the code-point for the dot you get an "illegal" character?

    No, you get "vowelless" writing. It is totally legal. More then that, in most of cases vowels like these (shin/sin) are almost ever written unless they are required for clearification of something that is not obvious from context like שׁם and שׂם these are two different words, but according to context you know which one of is vowelless שם means.

    CiscoIPPhone: If a bug is "reported several different times, by many different people", and then a couple years later a developer writes on a dev blog that "Believe it or not, the behavior is mostly intentional!", then (to put it mildly) I tend to think it's probably not the best design decision ever made. :-) Just because it's intentional doesn't mean it's not a bug.

    For the record, I don't have problems with any of these characters in Apple's TextEdit.app (which uses Cocoa and thus UTF-16), but trying to insert them in Emacs (which uses a variant of UTF-8 internally) produces garbage. I do think that such bugs are not the fault of the character encoding, but of the lack of competence of the programmers involved.

    BTW, I've just checked editing these letters, and they don't give me any problems in either Opera or Windows 7. Opera seems to edit them properly, and so does Notepad. A file with these letters in the name was created successfully.

    @Malcolm, first, there is no problem creating such files; the question is about editing them. Now, I tested on XP; maybe MS fixed this issue in 7. Take a look at how Backspace works: do you need to hit it once or twice?

    Once. I specifically checked for this issue, and in Windows 7 the problem with characters beyond the BMP seems to be gone. Maybe this problem was already solved in Vista.

    @Malcolm - even though that does not make UTF-16 any less harmful :-)

    Well, I don't think that the mere existence of crappy implementations indicates harmfulness of the standard at all. :p This is just an update on the current situation: how problematic characters beyond the BMP are now in Windows (and Opera).

    Great post. UTF-16 is indeed the "worst of both worlds": UTF8 is variable-length, covers all of Unicode, requires a transformation algorithm to and from raw codepoints, is backward-compatible with ASCII, and has no endianness issues. UTF32 is fixed-length, requires no transformation, but takes up more space and has endianness issues. So far so good: you can use UTF32 internally and UTF8 for serialization. But UTF16 has no benefits: it's endian-dependent, it's variable-length, it takes lots of space, and it's not ASCII-compatible. The effort needed to deal with UTF16 properly would be better spent on UTF8.

    UTF-8 has the same caveats as UTF-16. Buggy UTF-16 handling code exists, although there is probably less of it than buggy UTF-8 handling code (most code handling UTF-8 thinks it's handling ASCII, Windows-1252, or ISO 8859-1).

    @Ian: UTF-8 *DOES NOT* have the same caveats as UTF-8. You cannot have surrogates in UTF-8. UTF-8 does not masquerade as something it’s not, but most programmers using UTF-16 are using it wrong. I know. I've watched them again and again and again and again.

    @tchrist UTF-16 can sometimes require more than 16-bits to represent a single code-point, UTF-8 can sometimes require more than 8-bits to represent a single code-point. UTF-16 can sometimes use multiple code points to represent a single character, UTF-8 can sometimes use multiple code points to represent a single character. `U+0061 U+0301 U+0317` forms one character: `á̗`. When converted to UTF-8 the byte sequence (without the BOM) is `61 CC 81 CC 97`. When converted to UTF-16 the byte sequence (without the BOM) is `61 00 01 03 17 03`. Same caveats.
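    (A quick C++ sketch, my addition, that reproduces the code-unit and byte sequences quoted above for U+0061 U+0301 U+0317; the UTF-16 output below is the code units, which serialize little-endian to the `61 00 01 03 17 03` bytes mentioned.)

      #include <cstdio>
      #include <string>

      int main()
      {
          // C++11..17: a u8 literal is an array of char holding the UTF-8 bytes
          // (in C++20 its element type changes to char8_t).
          const char utf8[] = u8"a\u0301\u0317";
          std::u16string utf16 = u"a\u0301\u0317";

          for (const char* p = utf8; *p != '\0'; ++p)
              std::printf("%02X ", static_cast<unsigned char>(*p));      // 61 CC 81 CC 97
          std::printf("\n");

          for (char16_t u : utf16)
              std::printf("%04X ", static_cast<unsigned>(u));            // 0061 0301 0317
          std::printf("\n");
      }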

    @Ian You are welcome to spout off the theory all you want: it’s wasted on me. I teach this stuff myself. I can promise you that the UTF-16 problems are everywhere. These people can’t even get code points right. No one using UTF-8 ever screws that up. It’s these damned two-bit use-to-be-UCS2 P.O.S. UTF-16 interfaces that screw people up. That is the real world. That is the calibre of the average UTF-16 programmer out there. What are you some Microsoft apologist or something? It’s a screwed-up choice that has caused endless misery in this world: you can’t make a silk purse from a sow’s ear.

    I don't see how someone fresh to the subject can be stymied simply because it is named "UTF-16", yet if you change the name to "UTF-8" it becomes obvious and intuitive.

    @Ian: You have listed *common* caveats, that doesn't mean that the two have the *same* caveats. UTF-16 has more: It has endianness issues, and it does not contain ASCII as a subset. Those two alone make a huge difference.

    @Kerrek S: In terms of writing code to handle caveats, endian order is not an issue for programmers. Take me, for example, as a programmer who is dealing with UTF-8 and UTF-16: multi-character diacritics, the BMP and surrogate pairs are (still) difficult to handle. Endian order is trivial. UTF-16 not containing an ASCII subset? What is UTF-16 missing? ASCII has `ACK` (0x06), UTF-8 has `ACK` (0x06), UTF-16 has `ACK` (0x0006).

    Also, UTF-8 doesn't have the problem because everyone treats it as a variable width encoding. The reason UTF-16 has the problem is because everyone treats it like a fixed width encoding.

    @Christoffer Hammarström You can't blame one non-fixed width encoding for being non-fixed width, while embracing another non-fixed width encoding because it's non-fixed width.

    UTF-32 good enough for you?

    Tell me about it, I've been shouting this at my stupid Windows programming colleagues for years. The only safe encodings are UTF32 and UTF8 (as long as people don't treat it as a fixed-length encoding).

    Can you elaborate on the assertion that "Python encodes such characters incorrectly"? How would you even write this into a file? AFAIK, Python cannot read files whose encoding is not a superset of ASCII (at byte level).

    Another example: JavaScript's `charCodeAt` selects UTF-16 words, not Unicode characters. This arguably isn't a bug, but applications that assume charCodeAt works on Unicode characters will be broken.

    I think this link provides some useful context for your question, though it is not related to an answer.

    Please enjoy the grand summary of the popular POV at: http://www.utf8everywhere.org/

    I think @Ringding is right, the Python example seems flawed. In Python 2, `unicode('', 'utf-16')` takes the raw bytes of `''` (UTF-8 bytes, if the source is UTF-8) and decodes them as UTF-16; that obviously goes wrong.

    I'd rather see UTF-8 as well; but I have to say, I've seen just as many people who have said "my char strings are now UTF-8" and not dealt with the problems therein at all.

    @larsmans: That's because using regular quotes like that tells the interpreter that there are bytes inside. If you use `u''`, it should work correctly. Python 2 has the "helpful" feature of automatically encoding/decoding between UTF-8/ASCII in some situations. In Python 3 your example works because the quotes denote a type that contains Unicode codepoints, not bytes.

    @tchrist: "UTF-8 DOES NOT have the same caveats as UTF-8." But surely UTF-8 has EXACTLY the same caveats that UTF-8 does? (Sorry, I couldn't help myself...)

    I am not sure what caveats UTF-8 has, but at least those caveats (if they exist) would be a lot more visible than UTF-16's, because a broken non-ASCII result will immediately look wrong.

  • This is an old answer.
    See UTF-8 Everywhere for the latest updates.

    Opinion: Yes, UTF-16 should be considered harmful. The very reason it exists is because some time ago there used to be a misguided belief that widechar is going to be what UCS-4 now is.

    Despite the "anglo-centrism" of UTF-8, it should be considered the only useful encoding for text. One can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed. But when they do, text is not only for human readers.

    On the other hand, the overhead of UTF-8 is a small price to pay, and it brings significant advantages, such as compatibility with unaware code that just passes strings around as char*. This is a great thing. There are few useful characters which are SHORTER in UTF-16 than they are in UTF-8.

    I believe that all other encodings will die eventually. This implies that MS-Windows, Java, ICU and Python will stop using UTF-16 as their favorite. After long research and discussions, the development conventions at my company ban using UTF-16 anywhere except in OS API calls, and this despite the importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF-8 std::strings to native UTF-16, which Windows itself does not support properly.

    To people who say "use what is needed where it is needed", I say: there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations, though, is that every std::string or char* parameter be considered Unicode-compatible.

    I am also against the "use what you want" approach. I see no reason for such liberty. There's enough confusion on the subject of text, resulting in all this broken software. Having said the above, I am convinced that programmers must finally reach a consensus on UTF-8 as the one proper way. (I come from a non-ASCII-speaking country and grew up on Windows, so I'd be the last person you'd expect to attack UTF-16 on religious grounds.)

    I'd like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time-checked Unicode correctness, ease of use and better cross-platform portability of the code. The suggestion differs substantially from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research into these recommendations led to the same conclusion. So here goes:

    • Do not use wchar_t or std::wstring anywhere other than at the point adjacent to APIs that accept UTF-16.
    • Don't use _T("") or L"" UTF-16 literals (these should IMO be taken out of the standard as part of the deprecation of UTF-16).
    • Don't use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
    • Yet, keep _UNICODE always defined, so that passing char* strings to WinAPI does not silently compile.
    • std::strings and char* anywhere in the program are considered UTF-8 (if not said otherwise).
    • All my strings are std::string, though you can pass char* or a string literal to convert(const std::string &).
    • Only use Win32 functions that accept wide chars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way:

      ::SetWindowTextW(Utils::convert(someStdString or "string literal").c_str())
      

      (The policy uses conversion functions below.)

    • With MFC strings:

      CString someoneElse; // something that arrived from MFC. Converted as soon as possible, before passing any further away from the API call:
      
      std::string s = str(boost::format("Hello %s\n") % Convert(someoneElse));
      AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK);
      
    • Working with files, filenames and fstream on Windows:

      • Never pass std::string or const char* filename arguments to fstream family. MSVC STL does not support UTF-8 arguments, but has a non-standard extension which should be used as follows:
      • Convert std::string arguments to std::wstring with Utils::Convert:

        std::ifstream ifs(Utils::Convert("hello"),
                          std::ios_base::in |
                          std::ios_base::binary);
        

        We'll have to manually remove the conversion when MSVC's attitude to fstream changes.

      • This code is not multi-platform and may have to be changed manually in the future
      • See fstream unicode research/discussion case 4215 for more info.
      • Never produce text output files with non-UTF8 content
      • Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and WinAPI conventions above.

    // For interface to win32 API functions
    std::string convert(const std::wstring& str, unsigned int codePage /*= CP_UTF8*/)
    {
        // Ask me for implementation..
        ...
    }
    
    std::wstring convert(const std::string& str, unsigned int codePage /*= CP_UTF8*/)
    {
        // Ask me for implementation..
        ...
    }
    
    // Interface to MFC
    std::string convert(const CString &mfcString)
    {
    #ifdef UNICODE
        return Utils::convert(std::wstring(mfcString.GetString()));
    #else
        return mfcString.GetString();   // This branch is deprecated.
    #endif
    }
    
    CString convert(const std::string &s)
    {
    #ifdef UNICODE
        return CString(Utils::convert(s).c_str());
    #else
        Exceptions::Assert(false, "Unicode policy violation. See W569"); // This branch is deprecated as it does not support unicode
        return s.c_str();   
    #endif
    }
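
    For reference, a possible implementation of the two conversion helpers above might look like the following. This is only a sketch built on the Win32 MultiByteToWideChar/WideCharToMultiByte APIs, with error handling trimmed and the Utils namespace omitted; it is not the author's actual code, which is intentionally left out above.

    #include <string>
    #include <windows.h>

    // UTF-16 (wchar_t on Windows) -> narrow string in the given code page.
    std::string convert(const std::wstring& str, unsigned int codePage = CP_UTF8)
    {
        if (str.empty()) return std::string();
        int len = ::WideCharToMultiByte(codePage, 0, str.data(), (int)str.size(),
                                        nullptr, 0, nullptr, nullptr);
        std::string out(len, '\0');
        ::WideCharToMultiByte(codePage, 0, str.data(), (int)str.size(),
                              &out[0], len, nullptr, nullptr);
        return out;
    }

    // Narrow string in the given code page -> UTF-16.
    std::wstring convert(const std::string& str, unsigned int codePage = CP_UTF8)
    {
        if (str.empty()) return std::wstring();
        int len = ::MultiByteToWideChar(codePage, 0, str.data(), (int)str.size(),
                                        nullptr, 0);
        std::wstring out(len, L'\0');
        ::MultiByteToWideChar(codePage, 0, str.data(), (int)str.size(),
                              &out[0], len);
        return out;
    }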
    

    I would like to add a little comment. Most Win32 "ANSI" functions receive strings in the local code page. For example, std::ifstream can accept a Hebrew file name if the locale encoding is a Hebrew one, like code page 1255. All that is needed for Windows to support these encodings is for MS to add a UTF-8 code page to the system. That would make life much simpler: all the "ANSI" functions would then be fully Unicode capable.

    FWIW the AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK) example should probably really have been a call to a wrapper of that function that accepts std::string(s). Also, the Assert(false) in the functions toward the end should be replaced with static assertions.

    I can't agree. The advantages of utf16 over utf8 for many Asian languages completely dominate the points you make. It is naive to hope that the Japanese, Thai, Chinese, etc. are going to give up this encoding. The problematic clashes between charsets are when the charsets mostly seem similar, except with differences. I suggest standardising on: fixed 7bit: iso-irv-170; 8bit variable: utf8; 16bit variable: utf16; 32bit fixed: ucs4.

    @Charles: thanks for your input. True, some BMP characters are longer in UTF-8 than in UTF-16. But, let's face it: the problem is not in bytes that BMP Chinese characters take, but the software design complexity that arises. If a Chinese programmer has to design for variable-length characters anyway, it seems like UTF-8 is still a small price to pay compared to other variables in the system. He might use UTF-16 as a compression algorithm if space is so important, but even then it will be no match for LZ, and after LZ or other generic compression both take about the same size and entropy.

    What I basically say is that the simplification offered by having One encoding, which is also compatible with existing char* programs and is also the most popular today for everything, is hard to overstate. It is almost like in the good old "plaintext" days. Want to open a file by name? No need to care what kind of Unicode you are doing, etc. I suggest we, developers, confine UTF-16 to very special cases of severe optimization where a tiny bit of performance is worth man-months of work.

    Well, if I had to choose between UTF-8 and UTF-16, I would definitely stick to UTF-8, as it has no BOM, is ASCII-compatible, and has the same encoding scheme for any plane. But I have to admit that UTF-16 is simpler and more efficient for most BMP characters. There's nothing wrong with UTF-16 except the psychological aspects (mostly that "fixed-size" isn't fixed size). Sure, one encoding would be better, but since both UTF-8 and UTF-16 are widely used, each has its advantages.

    @Malcolm: UTF-8, unfortunately, has a BOM too (the bytes 0xEF 0xBB 0xBF). As silly as it looks (there is no byte-order problem with a single-byte encoding), it is true, and it is there for a different reason: to signal that this is a UTF stream. I have to disagree with you about BMP efficiency and UTF-16 popularity. It seems that the majority of UTF-16 software does not support it properly (e.g. all of the Win32 API, of which I am a fan), and this is inherent; the easiest fix seems to be switching them to another encoding. The efficiency argument is only true for a very narrow set of uses (I use Hebrew, and even there it does not hold).

    Well, what I meant is that you don't have to worry about byte order. UTF-8 can indeed have a BOM (it is actually the BOM character U+FEFF encoded in 3 bytes), though it's neither required nor recommended by the standard. As for the APIs, I think the problem is that they were designed when surrogate pairs either didn't exist yet or weren't really adopted. And when something gets patched up, it's never as good as a redesign from scratch. The only (painful) way is to drop any backwards compatibility and redesign the APIs. Whether they should switch to UTF-8 in the process, I don't know.

    @Malcolm, I think the natural way of this redesign is through changing the existing ANSI APIs. That way, existing broken programs will become unbroken (see my answer). This adds to the argument: UTF-16 must die.

    I'm sorry, I didn't really get why the transition to UTF-8 should be less painful. I also think that the inconsistency in C++ makes it worse. Say, Java is very specific about characters: char[] is no more than a char array, String is a string and Character is a character. Meanwhile, C++ is a mess, with all the new stuff added to an existing language. To my mind, they should have abandoned any backwards compatibility and designed C++ in a way that doesn't allow mixing up structured programming and OOP, or Unicode and other encodings. Not that I want to start a holy war; that's merely my opinion.

    UTF-8's disadvantage is NOT a small price to pay at all. Looking up an arbitrary character is an O(n) operation, and other more complex operations can be far, far worse than with UTF-16. Also, UTF-8 is variable-length, just as UTF-16 is, so what's the point? UTF-8 was designed for storage and interoperability with ASCII. UTF-16 is the preferred way to store strings in memory, as anything outside the BMP is incredibly rare (you're writing in Klingon?). With a little trick, such as storing characters outside the BMP in a hash or map, UTF-16 can have constant processing time.

    @iconiK: non-English BMP text is also quite rare. Consider all program sources and markup languages. One should have very good reasons to use UTF-16. To measure the price of breaking changes, see what is going on in the Linux world with regard to Unicode.

    Linux has had a specific requirement when choosing to use UTF-8 internally: compatibility with Unix. Windows didn't need that, and thus when the developers implemented Unicode, they added UCS-2 versions of almost all functions handling text and made the multibyte ones simply convert to UCS-2 and call the other ones. They later replaced UCS-2 with UTF-16. Linux, on the other hand, kept to 8-bit encodings and thus used UTF-8, as it's the proper choice in that case.

    You may wish to read my answer again. Windows does not support UTF-16 properly to date. Also, the reason for choosing UCS-2 was different; again, see my answer. For Linux, I believe the main reason was compatibility not with Unix but with existing code: for instance, if your ANSI app copies files, getting names from command arguments and calling system APIs, it will remain completely intact with UTF-8. Isn't that wonderful?

    @Pavel: The bug you linked to (Michael Kaplan's blog entry) has long been resolved by now. Michael said in the post already that it's fixed in Vista and I can't reproduce it on Windows 7 as well. While this doesn't fix legacy systems running on XP, saying that »there is still no proper support« is plain wrong.

    @Johannes: [1] Many thanks for the info. [2] IMO a programmer, today, should be able to write programs that support Windows XP. It is still a popular OS, and I don't know of a Windows update that fixes it.

    Well, the program works just fine; it just has a little trouble dealing with astral planes, but that's an OS issue, not one with your program. It's like asking for current versions of Uniscribe to be backported to old OSes so that people on XP can enjoy a few scripts that would previously render improperly. It's not something MS does. Besides, XP is almost a decade old by now and supporting it becomes a major burden in some cases (see, for example, the reasoning why Paint.NET will require Vista with its 4.0 release). Mainstream support for that OS has already ended, too; only security bugs are fixed now.

    Still not a convincing reason to use UTF-16 for the in-memory representation of strings on Windows :) I wish the Windows 7 guys would extend their support of the already existing #define CP_UTF8 instead.

    @Pavel Radzivilovsky: I fail to see how your code, using UTF-8 everywhere, will protect you from bugs in the Windows API? I mean: You're copying/converting strings for all calls to the WinAPI that use them, and still, if there is a bug in the GUI, or the filesystem, or whatever system handled by the OS, the bug remains. Now, perhaps your code has a specific UTF-8 handling functions (search for substrings, etc.), but then, you could have written them to handle UTF-16 instead, and avoid all this bloated code (unless you're writing cross-platform code... There, UTF-8 could be a sensible choice)

    @Pavel Radzivilovsky: BTW, your writings about *"I believe that all other encodings will die eventually. This implies that MS-Windows, Java, ICU and Python will stop using UTF-16 as their favorite."* and *"In particular, I think adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++0x."* are either quite naive or very, very arrogant. And this is coming from someone coding at home with Linux who is happy with UTF-8 chars. To put it bluntly: **It won't happen**.

    @paercebal: If the majority of the code is API calls, it is very simple code. Typically, the majority of code dealing with strings is in libraries that treat them as opaque cookies, and that is what they are optimized for. Hence, the bloating argument fails. As for UTF-16 being the 'favorite' of ICU and Python, this is very questionable: these tools use UTF-16 internally, and changing it as part of their evolution is the easiest path. It can happen in any major release, because it doesn't break the interfaces.

    In ICU we already see more and more UTF-8 interfaces and optimizations. However, UTF-16 works perfectly well, and makes complicated lookup efficient, more than with UTF-8. We will not see ICU drop UTF-16 internally. UTF-16 in memory, UTF-8 on the wire and on disk. All is good.

    @Steven, it looks like differentiating between wire and RAM is not as small a thing as it may seem. BTW, comparison is cheaper with UTF-8. I agree that ICU is certainly a major player in this market, and there's no need to "drop" support of anything. The simplification of application design and testing with UTF-8 is exactly what will, in my humble opinion, drive UTF-16 to extinction, and the sooner the better.

    @Pavel Radzivilovsky I meant, drop UTF-16 as the internal processing format. Can you expand on 'not a small thing'? And, anyways, UTF-16/UTF-8/UTF-32 have a 1:1:1 mapping. I'm much more interested in seeing non-Unicode encodings die. As far as UTF-8 goes for simplification, you say "they can just pass strings as char*"- right, and then they assume that the char* is some ASCII-based 8-bit encoding. Plenty of errors creep in when toupper(), etc, is used on UTF-8. It's not wonderful, but it is helpful.

    @Steve, first and foremost, I agree about non-Unicode encodings. There's no argument about that. Practically, it has already happened; they are already dead, in this exact sense: any non-Unicode operation on a string is considered a bug just like any other software bug, or a 'text crime' in my company's slang. It is true that char* misleads many into Unicode bugs as well. Good luck with toupper() on a UTF-8 string, or, say, with assuming that ICU's toupper does not change the number of characters (as in the German eszett converting to SS). After the standard has been established, there's no more reason for bugs.

    @Steve, 2; and then we come to a more subtle thing, which is everything around human engineering, safety, and designing the proper way of working so that the developer does less and the machine does more. This is exactly where UTF-16 doesn't fit. Most applications do not reverse or even sort strings. Most often strings are treated as cookies: a file name here and there, concatenation here and there, embedding in programming languages such as SQL, and other really simple transformations. In this world, there's very little reason to have a different format in RAM than on the wire.

    @Pavel: "...that widechar is going to be what UCS-4 now is." this is incorrect in general since widechar is not fixed to be 2 bytes in size, unless you restrict yourself to Windows. You should write "UCS-2" instead of "widechar".

    @ybungalobill Right; I should edit this. In fact, I will do this when wchar_t is standardized to hold one Unicode character.

    @Pavel: In fact your sentence is just wrong, because wchar_t is not meant to be UTF-16, it has absolutely no connection to UTF-16, and it is already UCS-4 on some compilers. wchar_t is a C++ type that is actually (quote from the standard) "a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales". So the only problem here is your system (windows) that doesn't have a UCS-4 locale.

    I'm mostly impressed with how this long rant completely fails in arguing its point, at least outside the narrow world of having to deal with UTF-16 in C and pointers. That might be considered dangerous, but that is if anything C's fault, not UTF-16.

    Well, as I mentioned earlier, I didn't find this post very convincing either. This post goes into details of handling UTF-16 in certain APIs or languages. If the software doesn't handle the standard properly, that's a problem. But what's wrong with the encoding itself anyway? If some software implements only half of the standard, that's not the standard's problem.

    There are so many things wrong in those bullets that they can't even be captured in a comment. But probably the most dangerous one is to store UTF-8 in std::string in a Windows environment. The problem is, everything in the Windows world assumes that char* strings are in the current system code page. Use one wrong API on that string, and you are assured of many hours of debugging. The other problem is the religious recommendation of UTF-8 no matter what. "There's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise" is pushed, with no advantage given.

    Wow, that's a really insightful comment. I'll start converting all my apps to UTF-8 right now. Thanks!

    @Mihai there *is* this advantage. You'll start noticing it when you don't do it and get cryptic runtime encoding exceptions that nobody can possibly understand or track back to their source. Python 3 has made the jump, and guess what: the frequent encoding issues I had in Python 2 magically disappeared completely.

    @flying sheep: And the reason you had problems in Python2 and not in Python3 is that Py3 is much stricter with distinguishing between Unicode and bytes. Apart from how you encode literal strings in your source file (and that can be changed to any other encoding if wanted without problems), you can't - or more exactly you SHOULDN'T - detect what encoding python is using internally. When communicating with the rest of the world you have to specify the encoding anyhow to avoid problems (otherwise you get a platform specific encoding).

    Yeah, I know. I love it this way: input gets converted to strings while reading it, and the error happens there (if any), not anywhere else in the code. Makes us guess less about where the fuck it sneaked in. (Just like Scala's `Option` keeps Scala programmers from encountering `NullPointerException`s: an implicit, error-prone process is replaced by a deliberate choice I make.)

    Sorry, but our programming languages should hide the encoding from us as an implementation detail. We need datatypes that logically represent Unicode characters without worrying about how they are stored on the hard drive.

    I'll agree to use a variable-length encoding for text when I am given an O(1) way of accessing a random character from a string. UCS2 and UCS4 have this nice ability, and are therefore surely better suited to internal uses than UTF-{8,16}?

    @holdenweb: You cannot access a random character even in UCS-4; remember that character != codepoint. Moreover, there is absolutely *no* application where accessing the nth character makes sense. All access to text is always sequential.
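    (A minimal sketch, my addition, of what that sequential access looks like over UTF-8: the lead byte alone tells how far to advance, assuming well-formed input, so walking and counting code points needs no random indexing.)

      #include <cstddef>
      #include <string>

      // Advance from the start of one UTF-8 code point to the start of the next.
      // Assumes well-formed UTF-8: i always points at a lead byte.
      std::size_t next_code_point(const std::string& s, std::size_t i)
      {
          unsigned char lead = static_cast<unsigned char>(s[i]);
          if      (lead < 0x80) return i + 1;  // 1-byte sequence (ASCII)
          else if (lead < 0xE0) return i + 2;  // 2-byte sequence
          else if (lead < 0xF0) return i + 3;  // 3-byte sequence
          else                  return i + 4;  // 4-byte sequence
      }

      std::size_t count_code_points(const std::string& utf8)
      {
          std::size_t n = 0;
          for (std::size_t i = 0; i < utf8.size(); i = next_code_point(utf8, i))
              ++n;
          return n;
      }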

    @Pavel: Why doesn't the answer contain an implementation for convert()?

    Guys, thank you all for the feedback, both positive and negative! The discussion has inspired me and friends to publish a clear manifesto on the subject. Please enjoy, share and support: http://www.utf8everywhere.org/

    _T("") is not part ofthe standard, it is MS/Windows stuff.

    @PavelRadzivilovsky There is a memory leak in ::SetWindowTextW(Utils::convert(someStdString or "string literal").c_str()) ..., isn't there?

    There is no leak - the memory is freed properly by the destructor of std::wstring

    Python has switched away from UCS-2 (which was only used for windows builds): http://python.org/dev/peps/pep-0393/

    The very start of this answer **is wrong**. The whole point of UTF-16 is to encode 21-bit code points into a 16-bit word space.

    I strongly disagree with the "Don't use `TCHAR`" part, because TCHAR is actually the **key** for switching from "ANSI" and UTF-16 to UTF-8 with less pain. Create `MessageBoxU8` as a different function from both `MessageBoxA` and `MessageBoxW` and do the same for all other string functions, and that way you can develop new programs that support UTF-8 without breaking the old ones that don't.

    Dear Medinoc, please consider that one of the major points of the utf8everywhere.org manifesto is to say that you should never be switching between ANSI and any unicode encoding or support ANSI in the first place. There should be no non-unicode-aware programs written, compiled or tested.

    The footnote on Python should mention the https://docs.python.org/3/library/io.html#io.StringIO API, as that is the preferred way to manipulate large amounts of text that nevertheless still fit into memory. The fact that it exposes a file like seek/tell API without providing support for O(1) code point indexing lets it default to avoiding the memory cost of providing the latter, and also provide a better foundation for the kinds of cursor based algorithms needed to manipulate graphemes and characters rather than code points.

    @ncoghlan, I would appreciate if you formulate the text to insert. Also, if you are into Python - it's high time to write to Python authors to change their internal string implementation to the better, UTF-8 way.

    I'm the author of http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-programming-python/, and "UTF-8 everywhere" is a proposal that assumes every piece of software in the world is a POSIX program designed to manipulate streaming data. It's a wonderful design choice in that domain, but far more questionable elsewhere. The Python ecosystem, by contrast, also encompasses Windows, JVM and CLR based programming, and array oriented programming in addition to stream processing, with a suite of battle tested text manipulation algorithms optimised to run on fixed width encodings.

    As far as the text to insert goes: If you are manipulating text data in Python, and don't need O(1) code point indexing, then you should be using https://docs.python.org/3/library/io.html#io.StringIO as your data storage API, not `str`. https://docs.python.org/3/library/io.html#io.BytesIO is an alternative option for storing UTF-8 data directly in its encoded form.
