Terminology: Unicode Character Encoding



In a recent blog, I described the terms character set, charset, and coded character set. In this blog, we’ll take a small step forward to define a few more terms:

  • encoding form
  • code unit
  • encoding scheme

Before going too much further, I should say that you can get all the information in this blog from a much more authoritative source, the Unicode Technical Report 17 (UTR 17). UTR 17 describes the Unicode Character Encoding Model and more formally defines all the terms you’ll find in this blog. The added value, if any, of this blog is that I’ll attempt to describe these terms in just a few paragraphs instead of several pages. Still, when you’re feeling a bit adventurous and energetic, you might take on UTR 17 itself.

Character Encoding Form

An encoding form is a mapping from a code point to a sequence of code units. A code unit has a specific width, for example, 8 bits, 16 bits, or 32 bits. Any Unicode code point value can be mapped to any of these forms. One other note about encoding forms: there are two varieties, fixed width and variable width.

A fixed width encoding form encodes every code point in the same number of code units. UTF-32 is a fixed width encoding form: every code point takes exactly one 32-bit code unit.

Variable width encoding forms encode code points in one, two, or more code units. UTF-8 and UTF-16 are both variable width encoding forms. In UTF-8, a character may require from 1 to 4 code units of 8 bits each. In UTF-16, a character requires 1 or 2 code units of 16 bits each.
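To make the fixed versus variable width distinction concrete, here is a small Python sketch that counts the code units a character needs in each encoding form. The helper name `code_unit_count` is my own invention for illustration. It leans on Python's built-in codecs and uses the big-endian variants only so that no byte order mark gets added to the output.

```python
def code_unit_count(ch: str, form: str) -> int:
    """Return how many code units `ch` needs in the given encoding form."""
    # Width of one code unit, in bytes, for each encoding form.
    unit_bytes = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[form]
    # Pick the big-endian codec so Python does not prepend a byte order mark.
    codec = form if form == "utf-8" else form + "-be"
    return len(ch.encode(codec)) // unit_bytes


for ch in ["A", "é", "€", "😀"]:
    counts = {f: code_unit_count(ch, f) for f in ("utf-8", "utf-16", "utf-32")}
    print(f"U+{ord(ch):04X}", counts)
```

The UTF-32 count stays at 1 for every character, while the UTF-8 and UTF-16 counts grow as the code point value gets larger.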

Code Unit

I’ve already hinted at this definition. It’s worth repeating though. A code unit is a single fixed-size integer that takes up a specific number of bits. In the Unicode encoding forms, a code unit is 8, 16, or 32 bits wide. Code points are mapped to sequences of code units. A single character (code point) can be mapped to several different code unit representations depending on the encoding form.
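If it helps to see actual values, the following Python sketch lists the code units that one code point maps to in each encoding form. The function name `code_units` is hypothetical, and splitting the encoded bytes into fixed-size chunks is just one convenient way to recover the units, not an official algorithm.

```python
def code_units(ch: str, form: str) -> list:
    """Return the code unit values for `ch` in the given encoding form."""
    # Width of one code unit, in bytes, for each encoding form.
    n = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[form]
    # Big-endian codecs avoid a byte order mark in the encoded bytes.
    codec = form if form == "utf-8" else form + "-be"
    data = ch.encode(codec)
    # Reassemble each group of n bytes into a single integer code unit.
    return [int.from_bytes(data[i:i + n], "big") for i in range(0, len(data), n)]


for form in ("utf-8", "utf-16", "utf-32"):
    units = code_units("😀", form)  # U+1F600
    print(form, [hex(u) for u in units])
```

For U+1F600 this gives four 8-bit units in UTF-8, two 16-bit units (a surrogate pair) in UTF-16, and a single 32-bit unit in UTF-32.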

Character Encoding Scheme

An encoding scheme is a serialization technique that encodes code units into a byte stream. Since UTF-8 is already an 8-bit (byte) oriented encoding form, UTF-8 is also an encoding scheme.

Because of little-endian and big-endian hardware differences, the UTF-32 and UTF-16 encoding forms can be serialized into two different schemes each. The specific scheme flavors for UTF-32 are UTF-32BE and UTF-32LE, big-endian and little-endian respectively. UTF-16 has similar schemes: UTF-16BE and UTF-16LE.
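Here is a quick Python sketch of how the same code units end up as different byte streams under different encoding schemes. The `serialize` helper is my own name, and I'm assuming Python's codec names `utf-16-be`, `utf-16-le`, `utf-32-be`, and `utf-32-le` as stand-ins for the UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE schemes.

```python
def serialize(text: str, scheme: str) -> str:
    """Encode `text` with the given scheme and return the bytes as hex."""
    return text.encode(scheme).hex()


euro = "€"  # U+20AC: one 16-bit code unit (0x20AC) in UTF-16
for scheme in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{scheme:10}", serialize(euro, scheme))
```

The BE and LE outputs hold the same bytes in opposite order within each code unit, while UTF-8 has only one possible byte order because its code units are single bytes.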

Did that clear anything up, or just confuse things more? Let me know and I’ll try to clarify.

“Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.”
