Archive

Posts Tagged ‘encoding’

Terminology: Unicode Character Encoding

November 13th, 2011 joconner No comments

Unicodelogo

In a recent blog, I described the terms character set, charset, and coded character set. In this blog, we’ll take a small step forward to define a few more terms:

  • encoding form
  • code unit
  • encoding scheme

Before going to much further, you can get all the information in this blog from a much more authoritative source, the Unicode Technical Report 17 (UTR 17). UTR 17 describes the Unicode Character Encoding Model and more formally define all the terms you’ll find in this blog. The added value, if any, of this blog is that I’ll attempt to describe these terms in just a few paragraphs instead of several pages. Still, when you’re feeling a bit adventurous and energetic, you might take on that UTR 17.

Character Encoding Form

An encoding form is a mapping from a code point to a sequence of code units. A code unit has a specific width, for example, 8-bits, 16-bits, or 32-bits. Any Unicode code point value can be mapped to any of these forms. One other note about encoding forms — there are two varieties: fixed width and variable width.

A fixed width encoding form encodes every code point in a fixed number of code units. That is, every code point can be encoded into the same number of code units. UTF-32 is a fixed width encoding form.

Variable width encoding forms encode code points in 1, 2, or more code units. Some variable width encoding forms include UTF-8, and UTF-16. In UTF-8, a character may require from 1 to 4 code units of 8-bits. In UTF-16, characters require 1 or 2 code units of 16-bits.

Code Unit

I’ve already hinted at this definition. It’s worth repeating though. A code unit is a sequence of fixed-size integers that take up a specific number of bits. For example, a code unit can be 8, 16, 32, or even 64-bits on some computer architectures. Code points are mapped to sequences of code units. A single character (code point) can be be mapped to several different code unit representations depending on the encoding form.

Character Encoding Scheme

An encoding scheme is a serialization technique that encodes code units into a byte stream. Since UTF-8 is already an 8-bit (byte) oriented encoding form, UTF-8 is also an encoding scheme.

Because of little-endian and big-endian hardware differences, the UTF-32 and UTF-16 encoding forms can be serialized into two different schemes each. The specific scheme flavors for UTF-32 are UTF-32BE and UTF-32LE, big-endian and little-endian respectively. UTF-16 has similar schemes: UTF-16BE and UTF-16LE.

Did that clear anything up? Or just confuse more. Let me know and I’ll try to clarify.

“Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.”

VN:F [1.9.22_1171]
Rating: 4.0/5 (1 vote cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)
Categories: Unicode Tags: ,

Unicode Terminology

October 23rd, 2011 joconner No comments

Logo60s2

I am sometimes asked whether Unicode is a 16-bit character set. The answer is not a simple no, but it is no. The question always reminds me how important terminology is too. Terminology is the focus of this particular post.

At one point long ago, when Unicode was a relative newcomer to the character set stage, it did in fact start out with a character space that had potential character values in the range from 0×0000 through 0xFFFF. In that case, and at that time until around 1995, Unicode could have been called a 16-bit character set. That is, each character could be represented with a single 16-bit integer value.

However, starting in 1996, Unicode’s character range expanded. With Unicode 2.0, the character set defined character values in the range 0×0000 through 0x10FFFF. That’s 21-bits of code space. Unicode can no longer be called a 16-bit character set. With today’s computer architectures, you really have to say that Unicode is a 32-bit character set. But now we have to be careful how we use these terms.

The rest of this discussion is admittedly tiresome and precise. We have to define some terms to make sure we’re talking about the same things. Bear with me.

The fact is that Unicode is much more than just a character set. It embodies a set of best practices and standards for defining characters, encoding them, naming them, and processing them. In the Unicode Consortium’s own words, Unicode is:

the standard for digital representation of the characters used in writing all of the world’s languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language.

The Unicode standard also defines the Unicode character set. This is a coded character set (CCS). A coded character set assigns an integer value to each of its characters. Each character’s numeric integer value is also called a code point. The current Unicode standard allows for code point values all the way up to 0x10FFFF. Often when we refer to Unicode code point values, we use another notation. Instead of writing the code point value as a hexadecimal number with the ’0x’ prefix, we use ‘U+”. So, in this alternate notation, to make sure others know that we’re explicitly talking about Unicode code point values, we write U+10FFFF. However, I’m not picky about this. It is, though, a noteworthy distinction. Strictly speaking, 0x10FFFF is just a very large hexadecimal number. U+10FFFF is a specific Unicode code point value.

So, we’ve established that Unicode is not a 16-bit character set, although it is a character set. Specifically, it is a coded character set. Remember how I’ve defined a CCS above. Sometimes you’ll hear other terms that are equivalent to a coded character set. The terms character set and charset are often used as synonyms, though strictly speaking neither imply that an assignment of code point values.

An encoding is something else, and it refers to how we serialize a code point for storage or transfer. Those clever people within the Unicode Technical Committee have devised several ways to encode the Unicode (coded) character set, giving us 3 common encodings:

  • UTF-32
  • UTF-16
  • UTF-8

Terms We’ve Learned

Here are the terms we’ve used so far:

  • character set
  • coded character set/charset
  • Character encoding

Next Up

Next time, let’s talk about these encodings: UTF-32, UTF-16, and UTF-8

 

VN:F [1.9.22_1171]
Rating: 0.0/5 (0 votes cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)
Categories: Unicode Tags: , ,