Best practices for character sets

You may not understand every language, but that doesn’t mean your applications can’t. Regardless of your customer’s language choice, your application should be able to process, transfer, and store their data. Even if you don’t provide a localized user interface, your application should allow your customer to enter text in their own language and in their own script. For example, my word processor is localized into English, but it allows me to enter text in a variety of scripts and languages.

How is that possible? The most basic requirement for this ability is to use a single character set internally. If you want to handle all scripts, your only choice is the Unicode character set.

Rule #1:
Use Unicode as your character set.

Unicode has several possible encodings, including UTF-32, UTF-16, UTF-16LE, UTF-16BE, and UTF-8. These encodings transform Unicode code points into code units. Code points are the values between 0 and 0x10FFFF, the range of integers that are allocated for character definitions. An encoding transforms code points into code units, which are used to serialize a character for storage or transmission. A single code point becomes 1 or more code units during encoding.

Although all the encodings are well-defined, UTF-8 is the easiest to use, primarily because of its code unit size. UTF-8 has 8-bit (byte) code units, which are immune to the common memory design issues involving byte ordering. I recommend that you use UTF-8 everywhere possible in your system; you’ll avoid the little-endian versus big-endian layout mistakes that come with the other encodings.
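To make the code point/code unit distinction concrete, here’s a small illustrative Java sketch (my own example, with made-up helper names) showing the same single code point turning into different numbers of code units in UTF-8 and UTF-16:

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitSizes {
    // Number of 8-bit code units (bytes) in the UTF-8 encoding of s
    static int utf8Units(String s)  { return s.getBytes(StandardCharsets.UTF_8).length; }
    // Number of 16-bit code units; Java strings are UTF-16 internally
    static int utf16Units(String s) { return s.length(); }

    public static void main(String[] args) {
        String kanji = "家"; // the single code point U+5BB6
        System.out.println(utf8Units(kanji));  // 3 eight-bit code units
        System.out.println(utf16Units(kanji)); // 1 sixteen-bit code unit
    }
}
```

One code point, two different serializations: three bytes in UTF-8, one 16-bit unit in UTF-16.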

Rule #2:
Use UTF-8 everywhere possible.

OK, so we’ve got the rules for character set and encoding choice. Now you have to implement those rules.

Complex systems typically have many points of failure for textual data loss. Those points are usually hand-off points across systems:
1. File export and import
2. Outbound and inbound HTTP request data and parameters
3. Database connections and schemas

Each of these deserves its own discussion. Moreover, each has specific implementation details for different products. Unfortunately I can’t cover any of them adequately in this particular post. However, I’ll try to touch on these subjects in a future update. If you have questions about any of them, let me know. I’ll use your suggestions to help me decide which to address first.

For now you have my own best practices for character set choice when creating any system. Good luck!

//John O.

Unicode Characters and Alternative Glyphs


Unicode defines thousands of characters. Some “characters” are surprising, and others are obvious. When I look at the Unicode standard and consider the lengthy debates over whether a character should be included, I can imagine the discussion and rationalization involved. Deciding whether to include a character can be difficult.

One of the more difficult concepts for me to appreciate is the difference between light and dark (or black and white) characters. A real example will help me explain this. Consider the “smiley face” characters U+263A and U+263B:  ☺ and ☻. These characters are named WHITE SMILING FACE and BLACK SMILING FACE respectively.

These are not the only characters with white and black options; dozens of others exist. There are even white and black options for the telephone: BLACK TELEPHONE and WHITE TELEPHONE.

Of course, once these characters go into the standard, they should stay. One shouldn’t remove existing characters. However, a serious question does arise when considering WHITE and BLACK options for a character.

The question I have is this: Why? Why isn’t the white and black color variation simply a font variation of the same character? The Unicode standard clearly states that it avoids encoding glyph variations of the same character. That makes a lot of sense. However, in practice, the standard at least appears to do exactly the opposite for many characters. I can only guess that someone on the standards committee made a very good, logical, and well-supported argument for the character differentiation.

My hope for future versions of the standard is that these kinds of color variations will be avoided. Not being on the committee when these characters were added, I cannot really complain, and I hope that my comments here don’t come across that way. However, in the future, I’d like the standard to include annotations for these characters that describe why they deserve separate code points. It certainly isn’t clear from the existing characters’ notes, and I’m sure that others would be curious about the reasons as well.

Unicode 6.1 Released


The Unicode Consortium announced the release of Unicode 6.1.0 yesterday. The new version adds characters for additional languages in China, other Asian countries and Africa. This version of the standard introduces 732 new characters.

In addition, the standard also added “labels” for character properties that will supposedly help implementers create better regular expressions that are both easier to read and easier to validate. I admit little knowledge about these labels at the moment, but will research and report on them in the future if time allows.

One of the oddities of the new version is the inclusion of 200 emoji variants. This is perhaps the only part of the standard that I just don’t understand. Back in the day when I was more involved in Unicode development, we had a huge effort to unify variants of Chinese characters. We preached that Unicode characters were abstract entities whose glyph renderings were determined by font and the style preferences of developers and apps. Now it appears that the Unicode Consortium has changed its position on this. Or maybe partially? The addition of 200 emoji “variants” just seems unnecessary, but that’s just my opinion, and I admit that I may not know all the issues that informed the consortium’s decision.

We have some examples, straight from the announcement, that show only 4 of the 200 new emoji variants:

[Image: the TENT emoji in its text style and its emoji style]

As the image shows, the “TENT” emoji has two variants — a text style and a more colorful, graphical emoji style. The standard defends these variants by saying that they allow implementations to distinguish preferred display styles. I think that is what fonts are for. Personally, I just don’t think variants are needed, and I think they make things more difficult for applications.

What do you think about variants in general? And what about emoji variants specifically?


Terminology: Unicode Character Encoding


In a recent blog, I described the terms character set, charset, and coded character set. In this blog, we’ll take a small step forward to define a few more terms:

  • encoding form
  • code unit
  • encoding scheme

Before going too much further: you can get all the information in this blog from a much more authoritative source, Unicode Technical Report 17 (UTR 17). UTR 17 describes the Unicode Character Encoding Model and more formally defines all the terms you’ll find in this blog. The added value, if any, of this blog is that I’ll attempt to describe these terms in just a few paragraphs instead of several pages. Still, when you’re feeling a bit adventurous and energetic, you might take on UTR 17.

Character Encoding Form

An encoding form is a mapping from a code point to a sequence of code units. A code unit has a specific width, for example, 8-bits, 16-bits, or 32-bits. Any Unicode code point value can be mapped to any of these forms. One other note about encoding forms — there are two varieties: fixed width and variable width.

A fixed width encoding form encodes every code point in a fixed number of code units. That is, every code point can be encoded into the same number of code units. UTF-32 is a fixed width encoding form.

Variable width encoding forms encode code points in 1, 2, or more code units. Some variable width encoding forms include UTF-8, and UTF-16. In UTF-8, a character may require from 1 to 4 code units of 8-bits. In UTF-16, characters require 1 or 2 code units of 16-bits.
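You can see the variable widths in action with a short illustrative Java sketch (the helper names are my own, not standard API):

```java
import java.nio.charset.StandardCharsets;

public class VariableWidth {
    static int utf8Units(String s)  { return s.getBytes(StandardCharsets.UTF_8).length; }
    static int utf16Units(String s) { return s.length(); } // Java strings are UTF-16

    public static void main(String[] args) {
        String bmp  = "A";           // U+0041, inside the Basic Multilingual Plane
        String supp = "\uD83D\uDE00"; // U+1F600, a supplementary code point

        System.out.println(utf8Units(bmp));   // 1 UTF-8 code unit
        System.out.println(utf8Units(supp));  // 4 UTF-8 code units
        System.out.println(utf16Units(bmp));  // 1 UTF-16 code unit
        System.out.println(utf16Units(supp)); // 2 UTF-16 code units (a surrogate pair)
        // Either way, each string holds exactly one code point:
        System.out.println(supp.codePointCount(0, supp.length())); // 1
    }
}
```

The supplementary character is one code point, yet it costs four UTF-8 code units or two UTF-16 code units.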

Code Unit

I’ve already hinted at this definition, but it’s worth repeating. A code unit is a fixed-size integer that takes up a specific number of bits; for example, a code unit can be 8, 16, 32, or even 64 bits on some computer architectures. Code points are mapped to sequences of code units. A single character (code point) can be mapped to several different code unit representations depending on the encoding form.

Character Encoding Scheme

An encoding scheme is a serialization technique that encodes code units into a byte stream. Since UTF-8 is already an 8-bit (byte) oriented encoding form, UTF-8 is also an encoding scheme.

Because of little-endian and big-endian hardware differences, the UTF-32 and UTF-16 encoding forms can be serialized into two different schemes each. The specific scheme flavors for UTF-32 are UTF-32BE and UTF-32LE, big-endian and little-endian respectively. UTF-16 has similar schemes: UTF-16BE and UTF-16LE.
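An illustrative Java sketch (my own helper names) makes the byte-order difference between the two UTF-16 schemes visible:

```java
import java.nio.charset.StandardCharsets;

public class EncodingSchemes {
    static byte[] utf16be(String s) { return s.getBytes(StandardCharsets.UTF_16BE); }
    static byte[] utf16le(String s) { return s.getBytes(StandardCharsets.UTF_16LE); }

    public static void main(String[] args) {
        // The single 16-bit code unit 0x0041 ("A") serializes to a different
        // byte stream depending on the scheme:
        byte[] be = utf16be("A"); // 0x00, 0x41
        byte[] le = utf16le("A"); // 0x41, 0x00
        System.out.printf("UTF-16BE: %02X %02X%n", be[0], be[1]);
        System.out.printf("UTF-16LE: %02X %02X%n", le[0], le[1]);
    }
}
```

Same code unit, two byte orders — exactly the ambiguity the BE/LE scheme names exist to resolve.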

Did that clear anything up? Or did it just confuse things more? Let me know and I’ll try to clarify.

“Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.”

Not forgotten

No, I haven’t forgotten my promise to cover a few more Unicode terms. However, please excuse me while I recover from my recent vacation. In this case, my vacation has rendered me useless for a couple days after my return. Hundreds of emails have gathered in my email INBOX, and I’m still processing them.

I will be back in a couple more days to describe the following:

  • encoding scheme
  • code unit
  • encoding form

In the process of describing the above, we’ll look at another term UTF-32. We’ll eventually get to UTF-16 and UTF-8 after that.

Be patient, I’m almost done with my INBOX.


Unicode Terminology


I am sometimes asked whether Unicode is a 16-bit character set. The answer is not a simple no, but it is no. The question always reminds me how important terminology is. Terminology is the focus of this particular post.

At one point long ago, when Unicode was a relative newcomer to the character set stage, it started out with a character space whose potential character values ranged from 0x0000 through 0xFFFF. At that time, until around 1995, Unicode could have been called a 16-bit character set; each character could be represented with a single 16-bit integer value.

However, starting in 1996, Unicode’s character range expanded. With Unicode 2.0, the character set defined character values in the range 0x0000 through 0x10FFFF. That’s 21 bits of code space. Unicode can no longer be called a 16-bit character set. With today’s computer architectures, which offer no 21-bit integer type, you really have to say that Unicode is a 32-bit character set. But now we have to be careful how we use these terms.

The rest of this discussion is admittedly tiresome and precise. We have to define some terms to make sure we’re talking about the same things. Bear with me.

The fact is that Unicode is much more than just a character set. It embodies a set of best practices and standards for defining characters, encoding them, naming them, and processing them. In the Unicode Consortium’s own words, Unicode is:

the standard for digital representation of the characters used in writing all of the world’s languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language.

The Unicode standard also defines the Unicode character set. This is a coded character set (CCS). A coded character set assigns an integer value to each of its characters; each character’s integer value is also called a code point. The current Unicode standard allows for code point values all the way up to 0x10FFFF.

Often when we refer to Unicode code point values, we use another notation. Instead of writing the code point value as a hexadecimal number with the ‘0x’ prefix, we use ‘U+’. So, in this alternate notation, to make sure others know that we’re explicitly talking about Unicode code point values, we write U+10FFFF. However, I’m not picky about this. It is, though, a noteworthy distinction. Strictly speaking, 0x10FFFF is just a very large hexadecimal number; U+10FFFF is a specific Unicode code point value.
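As an illustrative aside, a couple of lines of Java can produce this U+ notation from any code point (the helper name here is my own invention):

```java
public class CodePointNotation {
    // Format a code point in U+ notation: at least four hex digits, more if needed
    static String uPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(uPlus("家".codePointAt(0))); // U+5BB6
        System.out.println(uPlus(0x10FFFF));           // U+10FFFF
    }
}
```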

So, we’ve established that Unicode is not a 16-bit character set, although it is a character set. Specifically, it is a coded character set; remember how I’ve defined a CCS above. Sometimes you’ll hear other terms used as equivalents of a coded character set. The terms character set and charset are often used as synonyms, though strictly speaking neither implies an assignment of code point values.

An encoding is something else, and it refers to how we serialize a code point for storage or transfer. Those clever people within the Unicode Technical Committee have devised several ways to encode the Unicode (coded) character set, giving us 3 common encodings:

  • UTF-32
  • UTF-16
  • UTF-8

Terms We’ve Learned

Here are the terms we’ve used so far:

  • character set
  • coded character set/charset
  • character encoding

Next Up

Next time, let’s talk about these encodings: UTF-32, UTF-16, and UTF-8.


Using Combining Sequences for Numbers


Today I just happened to be looking through some of the precomposed Unicode circled numbers, numbers like ①, ②, ③, and so on. Just in case your system doesn’t support the fonts for these characters, here’s an image that shows what I mean:

[Image: precomposed circled number characters]

I wasn’t all that surprised to see these CIRCLED DIGIT ZERO, CIRCLED DIGIT ONE, CIRCLED DIGIT TWO, through CIRCLED DIGIT NINE characters. However, I was surprised to see precomposed characters for other numbers, numbers all the way up to 50:

[Image: CIRCLED NUMBER FIFTY]

Why stop at 50? Well, obviously Unicode can’t encode every number. Since Unicode doesn’t define a CIRCLED NUMBER FIFTY ONE, how can I create one using combining sequences? For the single digits above, I have a couple of options for displaying them:

  1. a precomposed character like U+2460 — ①
  2. a combining sequence like U+0031 U+20DD — the digit 1 followed by the COMBINING ENCLOSING CIRCLE character

Again, if you can’t see that character sequence, here’s the image of U+0031 U+20DD:

[Image: the digit 1 enclosed by a combining circle]
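Here’s a small illustrative Java sketch showing that the combining sequence really is two code points, while the precomposed character is one (the helper name is my own):

```java
public class CombiningSequence {
    static int codePoints(String s) { return s.codePointCount(0, s.length()); }

    public static void main(String[] args) {
        // DIGIT ONE followed by COMBINING ENCLOSING CIRCLE: two code points
        // that render (font permitting) as a single circled "1" glyph
        String combined = "\u0031\u20DD";
        // The single precomposed character U+2460 CIRCLED DIGIT ONE
        String precomposed = "\u2460";

        System.out.println(codePoints(combined));    // 2
        System.out.println(codePoints(precomposed)); // 1
    }
}
```

Note that the two forms are visually similar but are distinct character sequences as far as string comparison is concerned.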


Alright, so there we have a great example of using two Unicode code points together to form a single visual glyph on-screen. But how do I get the COMBINING ENCLOSING CIRCLE character to combine over two preceding digits? What if there were no precomposed CIRCLED NUMBER FIFTY ONE? There isn’t one, by the way. And yet I want to enclose two or more arbitrary digits with the COMBINING ENCLOSING CIRCLE character. Hmmm….

Sigh…. I have to admit that I don’t actually know how to do this. I suspect that I can use some of Unicode’s control characters like START OF GUARDED AREA and END OF GUARDED AREA or …. I don’t really know.

When I find out, I’ll repost. If you know, please share!

Encoding Unicode Characters When UTF-8 Is Not An Option

The other day I suggested that you use UTF-8 to encode your Java source code files. I still think that’s a best practice. If you can do that, you owe it to yourself to follow that advice.

But what if you can’t store text as UTF-8? Perhaps your repository won’t allow it. Or maybe you simply can’t standardize on UTF-8 across your teams. What then? In that case, you should use ASCII to encode your text files. It’s not an optimal solution. However, I can help you get more interesting Unicode characters into your Java source files despite the file encoding limitation.

The trick is to use the native2ascii tool to convert your non-ASCII Unicode characters to \uXXXX escapes. After creating and editing a file containing UTF-8 text like this:

String interestingText = "家族";

You would instead run the native2ascii tool on the file to produce an ASCII file that encodes the non-ASCII characters in  \u-encoded notation like this:

String interestingText="\u5BB6\u65CF";

In your compiled code, the result is the same. Given the correct font, the characters will display properly. U+5BB6 and U+65CF are the code points for “家族”. Using this type of \u-encoding, we’ve solved the problem of getting the non-ASCII characters into your text file and repository. Simply save the converted, \u-encoded file instead of the original, non-ASCII file.
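If you want to convince yourself that the two forms really are identical after compilation, here’s a tiny illustrative sketch (the class and method names are my own):

```java
public class EscapedLiterals {
    static boolean sameString() {
        // javac resolves \u escapes before compilation, so both literals
        // produce exactly the same string at runtime
        String raw = "家族";
        String escaped = "\u5BB6\u65CF";
        return raw.equals(escaped);
    }

    public static void main(String[] args) {
        System.out.println(sameString()); // true
    }
}
```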

The native2ascii tool is part of your Java Development Kit (JDK). You will use it like this:

native2ascii -encoding UTF-8 <inputfile> <outputfile>

There you have it…an option for getting Unicode characters into your Java files without actually using UTF-8 encoding.


Best practice: Use UTF-8 as your source code encoding


Software engineering teams have become more distributed in the last few years. It’s not uncommon to have programmers in multiple countries, maybe a team in Belarus and others in Japan and in the U.S. Each of these teams most likely speaks different languages, and most likely their host systems use different character encodings by default. That means that everyone’s source code editor creates files in different encodings too. You can imagine how mixed up and munged a shared source code repository might become when teams save, edit and re-save source files in multiple charset encodings. It happens, and it has happened to me when working with remote teams.

Here’s an example: you create a test file containing ASCII text. Overnight, your Japanese colleagues edit the file, save it with a new test, and add the following line:

String example = "Fight 文字化け!";

They save the file and submit it to the repository using Shift-JIS or some other common legacy encoding. You pick up the file the next day, add a couple of lines, save it, and BAM! Data loss. Your editor creates garbage characters because it attempts to save the file in the ISO-8859-1 encoding. Instead of the correct Japanese text above, your file now contains the text “Fight ?????” Not cool, not fun. And you’ve most likely broken the test as well.
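You can reproduce this kind of data loss in a few lines of illustrative Java; an ISO-8859-1 encoder silently replaces every character it cannot represent with ‘?’, which is exactly what a lossy save does:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    static String saveAsLatin1(String s) {
        // getBytes() replaces unmappable characters with '?', so the
        // original Japanese text is unrecoverable after this round trip
        byte[] saved = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(saved, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(saveAsLatin1("Fight 文字化け!")); // Fight ????!
    }
}
```

Once the ‘?’ bytes are written to disk, no amount of re-decoding brings the original characters back.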

How can you avoid these charset mismatches? The answer is to use a common charset across all your teams. The answer is to use Unicode, and more specifically, to use UTF-8. The reason is simple, and I won’t try hard to defend it; it just seems obvious. Unicode is a superset of all other commonly used character sets. UTF-8, a specific encoding of Unicode, is backward compatible with the ASCII character encoding, and all programming language source keywords and syntax (that I know of) are composed of ASCII text. UTF-8 as a common charset encoding will allow all of your teams to share files, use characters that make sense for their tests or other source files, and never lose data again because of charset encoding mismatches.

If your editor has a setting for file encodings, use it and choose UTF-8. Train your team to use it too. Set up your Ant scripts and other build tools to use the UTF-8 encoding for compilations. You might have to explicitly tell your Java compiler that source files are in UTF-8, but this change is worth making. By the way, the javac command-line argument you need is simply “-encoding UTF-8”.

Recently I was using NetBeans 7 in a project and discovered that it defaults to UTF-8. Nice! I was pleasantly surprised.

Regardless of your editor, look into this. Find out how to set your file encodings to UTF-8. You’ll definitely benefit from this in a distributed team environment in which people use different encodings. Standardize on this and make it part of your team’s best practices.


Encoding URLs for non-ASCII query params

Are you a web service API developer? The web truly is a world-wide web. Unfortunately, a great number of globally unaware developers are on it. This creates an odd situation in which web services are globally accessible but only locally or regionally aware.

There are a few important things to remember when creating a global web service. Let’s just cover ONE today: non-ASCII query parameters are valid, useful, and often necessary for a decent, global web service.

It seems obvious to me, and it probably does to you: sometimes a service needs to exchange or process non-ASCII data. The world is a big place, and although English is an important part of the global web, more people speak other languages. English accounts for a large share of web content, but lots of people use Chinese or an Indic language too. Let’s make sure your web service can process all those characters, whether the text is English or any other language!

Let’s look at some examples of non-ASCII query params:

name=田中&city=東京
名前=田中&市=東京
In these examples, you must perform two steps to get the query params (both keys and values) into the correct form:

  1. Convert the keys and their values to UTF-8 if they are not already
  2. Perform the “percent encoding” on each UTF-8 code unit

To do #1, you’ll need to use whatever character conversion utility you have in your developer’s library, such as the Java charset encoding converters.

The #2 step is the important one for this blog. For each code unit (byte) in the UTF-8-encoded query portion, you must “percent encode” the code unit. Let’s look at the first example’s query params:

name=田中&city=東京
The JavaScript function encodeURI actually does a good job of doing this for us:

encodeURI("name=田中&city=東京") produces the string:

name=%E7%94%B0%E4%B8%AD&city=%E6%9D%B1%E4%BA%AC
Notice that you should also apply this encoding to the keys in the param list. In the next example, I’ve used Japanese for both keys and values.

encodeURI("名前=田中&市=東京") produces this string:

%E5%90%8D%E5%89%8D=%E7%94%B0%E4%B8%AD&%E5%B8%82=%E6%9D%B1%E4%BA%AC
Note that both the keys and values have been “percent encoded”.
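If you’re producing these params on the server side in Java rather than in JavaScript, here’s an illustrative sketch using the standard URLEncoder, which performs both steps at once (note that URLEncoder does form encoding, a close cousin of percent encoding, and the Charset overload shown requires Java 10 or later):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PercentEncode {
    static String encode(String s) {
        // Converts the text to UTF-8, then percent encodes each byte
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("田中")); // %E7%94%B0%E4%B8%AD
        // Caveat: as form encoding, a space becomes "+" rather than "%20";
        // for typical query param values that is usually acceptable
    }
}
```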

On the server side, your server will decode these values into the correct strings if you have configured it properly. Correct configuration usually involves a charset conversion filter for a servlet container, and sometimes just a config setting for Apache.

More on this at a later time.