I’m doing some research on Unicode and compression algorithms right now for a side-project I’m working on, and I came across a highly ranked Google search result for a UTF-8 munging code snippet that is so idiotic I couldn’t let it pass without comment. If this post helps even one person who would’ve otherwise followed the linked advice, it is worth it.
First, some background. UTF-8 is a character encoding format that can handle pretty much any character under the Sun, from the English alphabet to Japanese kanji to obscure extinct languages. It even includes thousands of esoteric symbols used in smaller fields of study that you’ve probably never heard of. But the nice thing about UTF-8 is that it is variable-length: standard ASCII characters (including everything on a standard English keyboard) take only one byte each, most other alphabetic scripts such as Greek, Cyrillic, Hebrew, and Arabic take two bytes per character, CJK characters like those kanji take three, and nothing ever takes more than four.
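To make that concrete, here’s a quick illustration (Python 3.8+, purely for demonstration; any language that speaks UTF-8 will show the same byte counts) of how many bytes UTF-8 spends on characters from different scripts:

```python
# Purely illustrative: one sample character per script,
# printed with its code point, byte length, and raw UTF-8 bytes.
for ch in ("A", "é", "я", "日", "𝄞"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  {ch}  ->  {len(encoded)} byte(s): {encoded.hex(' ')}")

# U+0041  A  ->  1 byte(s): 41
# U+00E9  é  ->  2 byte(s): c3 a9
# U+042F  я  ->  2 byte(s): d0 af
# U+65E5  日  ->  3 byte(s): e6 97 a5
# U+1D11E  𝄞  ->  4 byte(s): f0 9d 84 9e
```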
So now you see why the linked “solution” is so stupid. This guy says he is “designing a little client/server binary message format” and wants “a simple and quick way to encode strings”. Well, duh: use UTF-8, no ifs, ands, or buts about it. It’s simple, quick, and already implemented in any programming language you can think of, so it requires no additional coding. There are all sorts of dumb ways to unnecessarily reinvent the wheel in software engineering, but trying to come up with your own character encoding is particularly idiotic. It’s really tricky to get right because there are so many corner cases you’ll never even know existed until they cause your application to break. The Unicode Consortium exists for a reason: what they do is hard.
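Just to show how little code “use UTF-8” actually is, here it is in Python, for example (the standard library of basically every modern language has an equivalent one-liner):

```python
message = "héllo, 世界"

# str -> bytes: this is the entire "encoding solution" for a binary protocol
wire_bytes = message.encode("utf-8")

# bytes -> str on the receiving side; round-trips losslessly
assert wire_bytes.decode("utf-8") == message
```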
This guy even confesses that his expected input will probably not contain Unicode characters longer than 2 bytes. So there is no justification at all for what he does next: he creates a mangled version of UTF-8 that turns every character whose encoding is three bytes or longer into a question mark, instead of just leaving it as is. So instead of allowing a rare character to take an extra byte or two, it gets destroyed, irreversibly. And to accomplish this, he has to create his own custom encoding solution that is an intentionally broken version of UTF-8. That’s the worst part: he’s wasting time creating completely unnecessary code, that will need to be maintained, that will need to be debugged. And for what?
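I’m not going to reproduce his code here, but the effect amounts to something like this hypothetical Python sketch (the function name and the cutoff are my reconstruction from his description, not his actual snippet):

```python
def mangle(text: str) -> bytes:
    # Hypothetical reconstruction: code points above U+07FF (anything
    # UTF-8 would encode in three or more bytes) get flattened to '?'.
    return "".join(ch if ord(ch) <= 0x7FF else "?" for ch in text).encode("utf-8")

print(mangle("prix: 10€"))          # b'prix: 10?'  (the euro sign is gone for good)
print("prix: 10€".encode("utf-8"))  # b'prix: 10\xe2\x82\xac'  (two extra bytes, no data loss)
```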
Of course, none of the people responding to his thread point out that what he is trying to do is stupid. They just smile and hand him some rope.