A Little Java Character History

By | 2016-04-29

Logo60s2

The Java language has supported Unicode from its beginning. In those early days, the Unicode character set defined characters with integer values in the range 0x0000 through 0xFFFF. That’s 65,536 possible character values in the full Unicode set. Java’s char type was defined to represent a single character in that range.

However, Unicode changed. It grew bigger. It can now define character values all the way up to 0x10FFFF. The range grew by 16x. As a result of that growth, Java’s char type simply cannot represent every possible Unicode character anymore. A char still has its original range definition, so it can only have an unsigned integer value up to 0xFFFF.

Fortunately, the Unicode consortium considered how its growth might affect existing systems. It created a clever encoding form that allowed systems to use two 16-bit values as an alias for character values above 0xFFFF. That encoding form is called UTF-16. The consortium quickly partitioned a couple special ranges within the original 65,536 values that could be used in this encoding form. Those special values are called surrogates. A pair of surrogates, in the UTF-16 encoding form, can represent any defined character above 0xFFFF.

To keep up with the expanded Unicode range, Java’s char type has changed its definition a little bit. It is now a Unicode code unit in the UTF-16 encoding form. It’s still a 16-bit value, but you can’t really think of it as just a character anymore. It’s a code unit. Some 16-bit code unit values are complete characters, but some are only part of a surrogate pair. Remember surrogate values are not complete characters. A valid surrogate pair represents a single Unicode character somewhere above 0xFFFF.

So, let’s get right to the point. Sometimes a char is a complete character, and sometimes its only part of a surrogate pair. This makes text processing tricky.

In a future post, I’ll describe how to correctly iterate through a Java string. Because a char isn’t what is used to be, parsing a string isn’t as simple as it once was.

One thought on “A Little Java Character History

  1. I18n GAL

    Excellent illustration of the 16-bit unit handled by Java – Thanks, John

Leave a Reply

Your email address will not be published. Required fields are marked *