Monthly Archives: October 2011

Unicode Terminology


I am sometimes asked whether Unicode is a 16-bit character set. The answer is no, but it is not a simple no. The question always reminds me how important terminology is. Terminology is the focus of this particular post.

At one point long ago, when Unicode was a relative newcomer to the character set stage, it did in fact start out with a character space that had potential character values in the range from 0x0000 through 0xFFFF. At that time, until around 1995, Unicode could have been called a 16-bit character set. That is, each character could be represented with a single 16-bit integer value.

However, starting in 1996, Unicode’s character range expanded. With Unicode 2.0, the character set defined character values in the range 0x0000 through 0x10FFFF. That’s 21 bits of code space, so Unicode can no longer be called a 16-bit character set. With today’s computer architectures, you really have to say that Unicode is a 32-bit character set. But now we have to be careful how we use these terms.

The rest of this discussion is admittedly tiresome and precise. We have to define some terms to make sure we’re talking about the same things. Bear with me.

The fact is that Unicode is much more than just a character set. It embodies a set of best practices and standards for defining characters, encoding them, naming them, and processing them. In the Unicode Consortium’s own words, Unicode is:

the standard for digital representation of the characters used in writing all of the world’s languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language.

The Unicode standard also defines the Unicode character set. This is a coded character set (CCS). A coded character set assigns an integer value to each of its characters. Each character’s integer value is also called a code point. The current Unicode standard allows for code point values all the way up to 0x10FFFF. Often when we refer to Unicode code point values, we use another notation: instead of writing the code point value as a hexadecimal number with the ‘0x’ prefix, we use ‘U+’. So, in this alternate notation, to make sure others know that we’re explicitly talking about Unicode code point values, we write U+10FFFF. I’m not picky about this, but it is a noteworthy distinction. Strictly speaking, 0x10FFFF is just a very large hexadecimal number; U+10FFFF is a specific Unicode code point value.
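As a quick illustration (a sketch of my own in Python, not from any standard), a code point really is just an integer, and the U+ notation is just hexadecimal with a different prefix:

```python
# Code points are integers; chr() and ord() convert between
# characters and their code point values.
print(ord("A"))             # 65, i.e. U+0041
print(chr(0x20AC))          # € (U+20AC, EURO SIGN)
print(f"U+{0x10FFFF:04X}")  # U+10FFFF, the largest valid code point
```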

So, we’ve established that Unicode is not a 16-bit character set, although it is a character set. Specifically, it is a coded character set. Remember how I’ve defined a CCS above. Sometimes you’ll hear other terms used as equivalents of a coded character set. The terms character set and charset are often used as synonyms, though strictly speaking neither implies an assignment of code point values.

An encoding is something else: it refers to how we serialize a code point for storage or transfer. Those clever people within the Unicode Technical Committee have devised several ways to encode the Unicode (coded) character set, giving us three common encodings:

  • UTF-32
  • UTF-16
  • UTF-8
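To make the distinction between the character set and its encodings concrete, here is a small sketch of my own in Python, showing one code point serialized three different ways:

```python
# One code point, three encodings.
euro = "\u20ac"  # U+20AC, EURO SIGN
print(euro.encode("utf-8"))      # b'\xe2\x82\xac'   -> 3 bytes
print(euro.encode("utf-16-be"))  # b' \xac'          -> 2 bytes (one 16-bit unit)
print(euro.encode("utf-32-be"))  # b'\x00\x00 \xac'  -> 4 bytes

# A code point beyond U+FFFF needs two 16-bit units in UTF-16
# (a surrogate pair), which is exactly why Unicode is no longer 16-bit:
grin = "\U0001F600"  # U+1F600
print(len(grin.encode("utf-16-be")))  # 4 bytes, i.e. two 16-bit units
```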

Terms We’ve Learned

Here are the terms we’ve used so far:

  • character set
  • coded character set/charset
  • character encoding

Next Up

Next time, let’s talk about these encodings: UTF-32, UTF-16, and UTF-8.


Seeking input on article topics

Hi there again,

This is just a quick note to say thank you for reading this blog. Internationalization is definitely a favorite topic of mine. The problem is that I enjoy so many topics that sometimes I don’t stay focused.

To help with that, I’m asking you to make suggestions. What would you like to read about here? What globalization topics interest you the most? What topics do you have trouble finding information about on the web or in your favorite magazine? I’m here to help you, of course, but I get a lot out of researching material too!

Have a great weekend!

PDF files are bad source documents for translation?

While reading my latest copy of Multilingual magazine, I found an interesting assertion about creating translation-friendly source documents. One article starts its discussion by stating this:

Whenever possible, avoid using PDF files as the source document format
for translation. Always try to provide the original file format … [because]
PDF files cannot currently be edited in some programs and instead have to
be transformed into another format (usually Word) before translation.

[Multilingual, Oct/Nov 2011, “Creating translation-oriented source documents,” p. 43]

This statement makes sense in some ways, I suppose. For example, I don’t usually think of PDF files as being easily editable, and perhaps translation tools don’t usually understand the format. However, being somewhat new to the PDF format, I’ll ask you to forgive my ignorance. I do have some questions, especially for translation and translation tools companies:

  1. Do you agree with the author’s statement? If so, why?
  2. If PDF files do fall into this category of being difficult source documents, can translation tool vendors do something about that? What is possible?
  3. If you get a PDF document to translate, what do you do? Return it? Does it fit into your translation workflow and toolsets?

Google’s Dart is JavaScript++?

Several years ago, Google was unhappy with the pace of change in the Java language and community. Their solution was to create the Dalvik VM and their Android platform. They were careful not to call it a Java platform, a Java implementation, a JVM, or a JRE; instead, they said it was the Java language on the Dalvik VM.

Now, is Google doing something similar with JavaScript? It is true that the JavaScript language is evolving slowly. Google has certainly shown that it can innovate without a standards body before (see Java above). Is Google trying the same thing again, but with JavaScript? Is Google attempting to ignore the ECMAScript/JavaScript standards community and move ahead without it?

Google has a new language called Dart. This language is like JavaScript but has many new language features. It will be interesting to see whether Google gets as much benefit and praise for this as it did for Android and the Dalvik VM.

More on Dart:

What I find so interesting about the Dart announcement is that Google already has a great tool for developing web applications without coding directly in JavaScript: GWT, the Google Web Toolkit. Basically, it’s their very popular toolset for writing applications in Java and compiling them down to browser-neutral JavaScript. If you’re unfamiliar with GWT, yes, you read that correctly…a compiler from Java to JavaScript. As a user of GWT, I can say that it works great! And it already has great appeal in the developer community. So why Dart? Why yet another language that they compile to JavaScript?

Answers to Which Countries Have Multiple Time Zones

Yesterday I asked the question:

Which countries have multiple time zones?

And I promised an answer today. Congratulations to Paul Clapham, who made a great effort listing them!

The W3C document Working with Time Zones is the source of my information. That document lists the following countries as having more than one time zone:

  • Argentina
  • Australia
  • Brazil
  • Canada
  • Chile
  • Democratic Republic of the Congo
  • Ecuador
  • France
  • Greenland
  • Indonesia
  • Kazakhstan
  • Kiribati
  • Mexico
  • Micronesia
  • Mongolia
  • New Zealand
  • Portugal
  • Russia
  • Spain
  • The United States
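As a small aside, the multiple-zone behavior is easy to demonstrate in code. This is a sketch of my own (Python 3.9+, using the standard library zoneinfo module; the zone names are IANA tz database keys), showing a single instant falling at different local hours in two United States zones:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# One instant in October 2011, viewed from two US time zones.
instant = datetime(2011, 10, 15, 12, 0, tzinfo=timezone.utc)
ny = instant.astimezone(ZoneInfo("America/New_York"))     # EDT, UTC-4
la = instant.astimezone(ZoneInfo("America/Los_Angeles"))  # PDT, UTC-7
print(ny.hour, la.hour)  # 8 5 -- same moment, three hours apart locally
```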


Countries with Multiple Time Zones

Recently I came across a W3C document about times, dates, and time zones. The document claims that only 20 countries observe more than one time zone. The United States is in that list. Can you name the rest of them?

I’ll post the answers to this question tomorrow! Until then, which countries do you think have more than one time zone?

P.S. Please provide your answers as comments!

Offensive Hand Gestures of the World

Recently I told some colleagues that I wanted to write a book about offensive hand gestures around the world. In my business, it’s common knowledge that icons of hand gestures just don’t localize well. We’re told to avoid using icons of hands and fingers because we might offend someone in a different culture. I’ve always known that anything can be offensive to someone somewhere, so of course I’ve accepted this as truth. Seems reasonable to avoid hand gestures when I can.

However, in all my years working in globalization and localization, I’ve never once been provided a comprehensive list of gestures that should be avoided. And not surprisingly, I’ve always wanted to see that list! In fact, I’ve wanted that list of offensive or rude gestures so much that I promised I’d write a book on the topic.

Oddly enough, it turns out that someone else has already beaten me to the task. And I’ve discovered this not a moment too soon: I was already interviewing hand models. 🙂 So, now that I know someone has beaten me to it, I suppose I can take that book off my todo list and just get a copy.

In my extensive research on the subject, I’ve found 3 books. THREE! On one of my late-night runs to Barnes and Noble, I saw the first one on my list in the discounted book pile. What a pity! Now pull out your credit card and take a look at these gems. I know I’ll order at least one of these:

  1. Rude Hand Gestures of the World
  2. Gestures: The Do’s and Taboos of Body Language Around the World
  3. 70 Japanese Gestures: No Language Communication

It’s really too bad too…I could feel myself being pulled into this work. Maybe I can hope that the original authors do a 2nd edition and allow me to help? Man, that would have been a fun book to research and write! But alas, I’m too late. At least I’ve finally found the book I’ve wondered about for so long!