Best practices for character sets

By | 2014-09-24

You may not understand every language, but that doesn’t mean your applications can’t. Regardless of your customer’s language choice, your application should be able to process, transfer, and store their data. Even if you don’t provide a localized user interface, your application should allow your customer to enter text in their own language and in their own script. For example, my word processor is localized into English, but it allows me to enter text in a variety of scripts and languages.

How is that possible? The most basic requirement for this ability is to use a single character set internally. If you want to handle all scripts, your only choice is the Unicode character set.

Rule #1:
Use Unicode as your character set.

Unicode has several possible encodings, including UTF-8, UTF-16 (with its byte-order-specific variants UTF-16LE and UTF-16BE), and UTF-32. Code points are the integer values from 0 to 0x10FFFF, the range allocated for character definitions. An encoding transforms code points into code units, which are used to serialize a character for storage or transmission. A single code point becomes one or more code units during encoding.
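To make the code point versus code unit distinction concrete, here is a minimal Python sketch. The helper name `code_unit_count` and the sample characters are my own choices for illustration; the codec names are the standard ones that ship with Python.

```python
def code_unit_count(ch: str, encoding: str, unit_bytes: int) -> int:
    """Return how many code units one code point needs in the given encoding.

    unit_bytes is the size of a single code unit: 1 for UTF-8,
    2 for UTF-16, 4 for UTF-32.
    """
    return len(ch.encode(encoding)) // unit_bytes


# Illustrative sample code points, written as escapes to keep the source ASCII.
samples = [
    "A",           # U+0041 LATIN CAPITAL LETTER A
    "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u20ac",      # U+20AC EURO SIGN
    "\U0001F600",  # U+1F600 GRINNING FACE (outside the Basic Multilingual Plane)
]

for ch in samples:
    utf8 = code_unit_count(ch, "utf-8", 1)
    utf16 = code_unit_count(ch, "utf-16-le", 2)
    utf32 = code_unit_count(ch, "utf-32-le", 4)
    print(f"U+{ord(ch):04X}: utf-8={utf8} utf-16={utf16} utf-32={utf32}")
```

The last sample needs four UTF-8 code units, two UTF-16 code units (a surrogate pair), and a single UTF-32 code unit.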

Although all the encodings are well-defined, UTF-8 is the easiest to use, primarily because of its code unit size. UTF-8 has 8-bit (byte) code units, so it is immune to the byte-ordering issues that affect encodings with larger code units. I recommend that you use UTF-8 everywhere possible in your system. You’ll avoid mistakes in telling little-endian and big-endian layouts apart, which are a risk with the other encodings.
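The byte-ordering problem is easy to reproduce. The following Python sketch, using a sample character of my own choosing, shows the same code point serialized in both UTF-16 byte orders, and what happens when the reader guesses the wrong one.

```python
text = "\u20ac"  # U+20AC EURO SIGN, chosen only as an example

utf16_le = text.encode("utf-16-le")  # low byte first:  AC 20
utf16_be = text.encode("utf-16-be")  # high byte first: 20 AC
utf8 = text.encode("utf-8")          # one fixed byte sequence: E2 82 AC

print(utf16_le.hex(), utf16_be.hex(), utf8.hex())

# Reading little-endian data as big-endian does not fail here.
# It silently yields a different character instead.
wrong = utf16_le.decode("utf-16-be")
print(f"expected U+{ord(text):04X}, got U+{ord(wrong):04X}")
```

UTF-8 has only one possible byte sequence for each code point, so there is no byte order for the reader to get wrong.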

Rule #2:
Use UTF-8 everywhere possible.

OK, so we’ve got the rules for character set and encoding choice. Now you have to implement those rules.

Complex systems typically have many points of failure for textual data loss. Those points are usually hand-off points across systems:
1. File export and import.
2. Outbound and inbound HTTP request data and parameters.
3. Database connections and schemas.
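As a rough illustration of the first two hand-off points, here is a minimal Python sketch that names UTF-8 explicitly instead of relying on platform defaults. The sample text, the temporary file name, and the variable names are hypothetical. Database settings vary too much between products to sketch here.

```python
import os
import tempfile
from urllib.parse import quote, unquote

# Hypothetical sample text mixing Latin and Han characters.
text = "Gr\u00fc\u00dfe, \u4e16\u754c"

# 1. File export and import: always pass the encoding explicitly,
#    because the default depends on the platform and locale.
path = os.path.join(tempfile.mkdtemp(), "export.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# 2. HTTP parameters: percent-encode the UTF-8 bytes on the way out
#    and decode them with the same encoding on the way in.
encoded_param = quote(text, encoding="utf-8")
decoded_param = unquote(encoded_param, encoding="utf-8")

print(round_tripped == text, decoded_param == text)
print(encoded_param)
```

If either side of a hand-off assumes a different encoding, the round trip stops being lossless, which is exactly the failure these checks are meant to expose.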

Each of these deserves its own discussion. Moreover, each has specific implementation details for different products. Unfortunately, I can’t cover any of them adequately in this particular post. However, I’ll try to touch on these subjects in a future update. If you have questions about any of them, let me know. I’ll use your suggestions to help me decide which to address first.

For now you have my own best practices for character set choice when creating any system. Good luck!

//John O.
