Monthly Archives: November 2011

Deconstructing BCP 47

BCP 47 stands for Best Current Practice 47, and even with the acronym spelled out, the name alone means almost nothing. So, what is BCP 47?

BCP 47 is the current best practice for creating language codes. A language code is a text identifier for a specific human language. The code lets you describe the language in terms of a base language, the script used to write that language, and even the particular region in which the language is used. BCP 47 prescribes the code and its parts with enough precision to identify a natural, human language and distinguish it from other languages.
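To make the language, script, and region parts concrete, here is a rough Python sketch that splits a simple tag such as zh-Hant-TW into those three pieces. It is not a real BCP 47 parser: the function name split_language_tag is made up for this example, and the sketch assumes the tag contains only a language subtag optionally followed by a script and a region. Variants, extensions, private-use subtags, and grandfathered tags are all ignored.

```python
import re

def split_language_tag(tag):
    """Split a simple language[-script][-region] tag into its parts.

    Hypothetical helper for illustration only; it handles just the
    three subtag kinds discussed in the post.
    """
    parts = tag.split("-")
    language = parts[0]
    if not re.fullmatch(r"[A-Za-z]{2,8}", language):
        raise ValueError(f"unexpected language subtag: {language!r}")

    result = {"language": language.lower(), "script": None, "region": None}
    for subtag in parts[1:]:
        is_script = re.fullmatch(r"[A-Za-z]{4}", subtag) is not None
        is_region = re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtag) is not None
        if is_script and result["script"] is None and result["region"] is None:
            # Script subtags are four letters, conventionally title case.
            result["script"] = subtag.title()
        elif is_region and result["region"] is None:
            # Region subtags are two letters or three digits.
            result["region"] = subtag.upper()
        else:
            raise ValueError(f"subtag not handled by this sketch: {subtag!r}")
    return result

print(split_language_tag("zh-Hant-TW"))
print(split_language_tag("es-419"))
```

The first call yields language zh, script Hant, and region TW. The second has no script and uses a three-digit region code. For production use, a library that tracks the subtag registry is a safer choice than a hand-rolled splitter like this one.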

BCP 47 is a standard that builds on other standards, and it prescribes how to combine them to create a language code. It draws on at least the following existing standards: ISO 639 for language codes, ISO 15924 for script codes, and ISO 3166-1 (alongside UN M.49) for region codes.

Why is this important to you in the internationalization or localization business? It is important because our industry requires common standards and agreement for how to communicate, transfer, and exchange language data. A BCP 47 tag is necessary to accurately identify language text across different applications and tools.

Lots of existing applications, tools, and platforms already use BCP 47:

This is not an exhaustive list, but hopefully it gives you a sense of the importance of this standard. When you need to tag data with a language identifier, you should seriously consider BCP 47 instead of any home-grown convention.

Having provided plenty of links in this post, I hope you’ll take some time to familiarize yourself with this important language tagging standard. Happy reading!



Terminology: Unicode Character Encoding


In a recent blog, I described the terms character set, charset, and coded character set. In this blog, we’ll take a small step forward to define a few more terms:

  • encoding form
  • code unit
  • encoding scheme

Before going too much further, I should point out that you can get all the information in this blog from a much more authoritative source, Unicode Technical Report 17 (UTR 17). UTR 17 describes the Unicode Character Encoding Model and more formally defines all the terms you’ll find in this blog. The added value, if any, of this blog is that I’ll attempt to describe these terms in just a few paragraphs instead of several pages. Still, when you’re feeling a bit adventurous and energetic, you might take on UTR 17.

Character Encoding Form

An encoding form is a mapping from a code point to a sequence of code units. A code unit has a specific width, for example, 8 bits, 16 bits, or 32 bits. Any Unicode code point value can be mapped to a sequence of code units in any of these forms. One other note about encoding forms: there are two varieties, fixed width and variable width.

A fixed width encoding form encodes every code point in a fixed number of code units. That is, every code point can be encoded into the same number of code units. UTF-32 is a fixed width encoding form.

Variable width encoding forms encode code points in one, two, or more code units. UTF-8 and UTF-16 are variable width encoding forms. In UTF-8, a character may require from 1 to 4 code units of 8 bits each. In UTF-16, a character requires 1 or 2 code units of 16 bits each.
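The difference between fixed and variable width is easy to see by counting code units. The short Python sketch below does that for a few sample characters. The helper name code_unit_counts is my own, and it leans on Python's built-in codecs, dividing the encoded byte length by the code unit size of each form.

```python
def code_unit_counts(ch):
    """Return how many code units one character needs in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# A, e with acute accent, euro sign, and an emoji outside the BMP
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))
```

UTF-32 always reports one code unit, while the UTF-8 count climbs from 1 to 4 and the UTF-16 count becomes 2 only for the last character.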

Code Unit

I’ve already hinted at this definition. It’s worth repeating though. A code unit is a fixed-size integer that takes up a specific number of bits. In the Unicode encoding forms, a code unit is 8, 16, or 32 bits wide. Code points are mapped to sequences of code units. A single character (code point) can be mapped to several different code unit representations depending on the encoding form.
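As an illustration of mapping a code point to code units, here is a minimal hand-written version of the UTF-16 mapping in Python. The function names are hypothetical, and the second function exists only to cross-check the first against Python's own codec.

```python
def utf16_code_units(code_point):
    """Map one Unicode code point to its UTF-16 code units (16-bit integers)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Code points in the Basic Multilingual Plane fit in one code unit.
        return [code_point]
    # Supplementary code points are split into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

def utf16_code_units_via_codec(code_point):
    """Same result, obtained from Python's built-in UTF-16 encoder."""
    data = chr(code_point).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

for cp in (0x41, 0x20AC, 0x1F600):
    units = utf16_code_units(cp)
    print(f"U+{cp:04X} ->", [f"0x{u:04X}" for u in units])
```

U+0041 and U+20AC each map to a single 16-bit code unit, while U+1F600 maps to the pair 0xD83D, 0xDE00.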

Character Encoding Scheme

An encoding scheme is a serialization technique that encodes code units into a byte stream. Since UTF-8 is already an 8-bit (byte) oriented encoding form, UTF-8 is also an encoding scheme.

Because of little-endian and big-endian hardware differences, the UTF-32 and UTF-16 encoding forms can be serialized into two different schemes each. The specific scheme flavors for UTF-32 are UTF-32BE and UTF-32LE, big-endian and little-endian respectively. UTF-16 has similar schemes: UTF-16BE and UTF-16LE.
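The byte order difference between the schemes shows up as soon as you look at the serialized bytes. This small Python sketch prints the bytes for the euro sign (U+20AC) under each scheme. The helper name serialized is an assumption for the example, and the codec names are the ones Python's standard library uses.

```python
def serialized(text, scheme):
    """Return the bytes of text under the given encoding scheme, as spaced hex."""
    return text.encode(scheme).hex(" ")

euro = "\u20ac"
for scheme in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{scheme:10}", serialized(euro, scheme))
```

The UTF-16BE bytes are 20 ac and the UTF-16LE bytes are ac 20: the same 16-bit code unit, written in opposite byte orders. UTF-32 shows the same pattern across four bytes, and UTF-8 has only one byte order because its code units are already single bytes.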

Did that clear anything up, or just confuse you more? Let me know and I’ll try to clarify.

“Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.”

Making Millions from Android App Store?

Finally, after a couple of years of waiting, my cell phone contract ended, and I was able to affordably switch both carriers and phones. I promptly acquired the following new hardware after switching my cell service provider from ATT to Verizon (for no particular reason other than that ATT happened to be more intolerable at the time):

  • Droid Bionic cell phone
  • Samsung Galaxy Tab 10.1

Now that I have the hardware, I just have to download the Android SDK, write a new application, and become a millionaire by next weekend. Can’t wait!

Oh, but wait, that link about becoming a millionaire is for the iOS app store. Crud. Sigh….

It’s interesting to do a Google search for app store millionaire. It seems that many people have made both small and large fortunes from the Apple App Store. I’ve heard that luck is a big factor in hitting it BIG. And of course, some of the big hits are just really stupid.

It’s also interesting to note what my Google search didn’t immediately find. So far I haven’t found examples of Android market millionaires. Hmm… why is that?

Now that I have this hardware, my next step is to download the Android SDK and pump out a cool, viral app that makes me millions of dollars too. But have I picked the wrong platform? It’s obvious that the Android market doesn’t yet have the same number of customers as the Apple App Store. Will that change? What will it take?


Not forgotten

No, I haven’t forgotten my promise to cover a few more Unicode terms. However, please excuse me while I recover from my recent vacation. In this case, my vacation has rendered me useless for a couple days after my return. Hundreds of emails have gathered in my email INBOX, and I’m still processing them.

I will be back in a couple more days to describe the following:

  • encoding scheme
  • code unit
  • encoding form

In the process of describing the above, we’ll look at another term, UTF-32. We’ll eventually get to UTF-16 and UTF-8 after that.

Be patient, I’m almost done with my INBOX.