Archive

Author Archive

Deconstructing BCP 47

November 29th, 2011 joconner No comments

BCP 47 stands for Best Current Practice 47, and even spelled out, the name alone means almost nothing. So, what is BCP 47?

BCP 47 is the current best practice for creating language codes (BCP 47 itself calls them language tags). A language code is a text identifier for a specific human language. The code lets you describe the language in terms of a base language, the script used to write that language, and even the particular region in which the language is used. BCP 47 prescribes the code and its parts with enough precision to identify a natural, human language and distinguish it from other languages.
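
To make those parts concrete, here is a minimal Python sketch that splits a simple tag into its language, script, and region subtags. The function name split_language_tag is my own invention for illustration, not part of any library, and the sketch deliberately ignores the rest of the BCP 47 grammar (extended language subtags, variants, extensions, and private-use subtags).

```python
def split_language_tag(tag):
    """Split a simple BCP 47 tag into language, script, and region.

    Hypothetical helper for illustration only. It handles tags shaped
    like language[-Script][-REGION] and stops at anything else.
    """
    subtags = tag.split("-")
    parts = {"language": subtags[0].lower(), "script": None, "region": None}
    for subtag in subtags[1:]:
        # Script subtags are four letters, e.g. "Hant" or "Latn".
        is_script = len(subtag) == 4 and subtag.isalpha()
        # Region subtags are two letters ("TW") or three digits ("419").
        is_region = (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        )
        if is_script and parts["script"] is None and parts["region"] is None:
            parts["script"] = subtag.title()
        elif is_region and parts["region"] is None:
            parts["region"] = subtag.upper()
        else:
            break  # variants, extensions, etc. are out of scope here
    return parts


print(split_language_tag("zh-Hant-TW"))
# {'language': 'zh', 'script': 'Hant', 'region': 'TW'}
print(split_language_tag("es-419"))
# {'language': 'es', 'script': None, 'region': '419'}
```

For real work you would want a library that implements the full grammar and the subtag registry; this is only meant to show how the pieces of a tag line up.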

BCP 47 is a standard that uses other standards, and it prescribes how to combine those standards together to create a language code. BCP 47 is a combination of at least the following existing standards:

Why is this important to you in the internationalization or localization business? It is important because our industry requires common standards and agreement for how to communicate, transfer, and exchange language data. A BCP 47 tag is necessary to accurately identify language text across different applications and tools.

Lots of existing applications, tools, and platforms already use BCP 47:

This is not an exhaustive list, but hopefully it gives you a sense of the importance of this standard. When you need to tag data with a language identifier, you should seriously consider BCP 47 instead of any home-grown convention.

Having provided plenty of links in this post, I hope you’ll take some time to familiarize yourself with this important language tagging standard. Happy reading!

 

 

Categories: Language, Standards, Web Tags:

Terminology: Unicode Character Encoding

November 13th, 2011 joconner No comments


In a recent blog, I described the terms character set, charset, and coded character set. In this blog, we’ll take a small step forward to define a few more terms:

  • encoding form
  • code unit
  • encoding scheme

Before going too much further, I should point out that you can get all the information in this blog from a much more authoritative source, Unicode Technical Report 17 (UTR 17). UTR 17 describes the Unicode Character Encoding Model and more formally defines all the terms you’ll find in this blog. The added value, if any, of this blog is that I’ll attempt to describe these terms in just a few paragraphs instead of several pages. Still, when you’re feeling a bit adventurous and energetic, you might take on UTR 17.

Character Encoding Form

An encoding form is a mapping from a code point to a sequence of code units. A code unit has a specific width, for example, 8 bits, 16 bits, or 32 bits. Any Unicode code point value can be mapped to any of these forms. One other note about encoding forms: there are two varieties, fixed width and variable width.

A fixed width encoding form encodes every code point in the same number of code units. UTF-32 is a fixed width encoding form: every code point takes exactly one 32-bit code unit.

Variable width encoding forms encode code points in 1, 2, or more code units. UTF-8 and UTF-16 are variable width encoding forms. In UTF-8, a character may require from 1 to 4 code units of 8 bits each. In UTF-16, a character requires 1 or 2 code units of 16 bits each.
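
If you want to see those counts for yourself, here is a small Python sketch. It leans on Python’s built-in codecs and simply divides the encoded byte length by the code unit size; the function name code_unit_counts is made up for this example.

```python
def code_unit_counts(ch):
    """Return how many code units one character needs in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }


# U+0041, U+00E9, U+20AC, and U+1F600 (outside the first 16-bit range)
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))
# U+0041 {'UTF-8': 1, 'UTF-16': 1, 'UTF-32': 1}
# U+00E9 {'UTF-8': 2, 'UTF-16': 1, 'UTF-32': 1}
# U+20AC {'UTF-8': 3, 'UTF-16': 1, 'UTF-32': 1}
# U+1F600 {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 1}
```

The UTF-32 column never changes, which is exactly what fixed width means.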

Code Unit

I’ve already hinted at this definition. It’s worth repeating though. A code unit is a single fixed-size integer that takes up a specific number of bits. The Unicode encoding forms use code units of 8, 16, or 32 bits. Code points are mapped to sequences of code units. A single character (code point) can be mapped to several different code unit representations depending on the encoding form.
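
Here is a rough Python sketch that shows the actual code unit values for one code point in each encoding form. The helper code_units is hypothetical; it encodes with Python’s built-in codecs and then unpacks the bytes into integers of the right width with the struct module.

```python
import struct


def code_units(ch, form):
    """Return the list of code unit values for one character."""
    if form == "UTF-8":
        return list(ch.encode("utf-8"))  # each byte is one code unit
    if form == "UTF-16":
        data = ch.encode("utf-16-be")
        return list(struct.unpack(f">{len(data) // 2}H", data))
    if form == "UTF-32":
        data = ch.encode("utf-32-be")
        return list(struct.unpack(f">{len(data) // 4}I", data))
    raise ValueError(f"unknown encoding form: {form}")


smiley = "\U0001F600"  # code point U+1F600
for form in ["UTF-8", "UTF-16", "UTF-32"]:
    print(form, [hex(unit) for unit in code_units(smiley, form)])
# UTF-8 ['0xf0', '0x9f', '0x98', '0x80']
# UTF-16 ['0xd83d', '0xde00']
# UTF-32 ['0x1f600']
```

One code point, three different sequences of code units.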

Character Encoding Scheme

An encoding scheme is a serialization technique that encodes code units into a byte stream. Since UTF-8 is already an 8-bit (byte) oriented encoding form, UTF-8 is also an encoding scheme.

Because of little-endian and big-endian hardware differences, the UTF-32 and UTF-16 encoding forms can be serialized into two different schemes each. The specific scheme flavors for UTF-32 are UTF-32BE and UTF-32LE, big-endian and little-endian respectively. UTF-16 has similar schemes: UTF-16BE and UTF-16LE.
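
Python’s built-in codecs make the byte-order difference easy to see. This sketch serializes one character, the euro sign U+20AC, with each scheme named above; the helper name serialized_bytes is just for illustration.

```python
def serialized_bytes(text):
    """Return the serialized bytes, as hex, for several encoding schemes."""
    schemes = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]
    return {scheme: text.encode(scheme).hex() for scheme in schemes}


for scheme, hex_bytes in serialized_bytes("\u20ac").items():
    print(f"{scheme:10} {hex_bytes}")
# utf-8      e282ac
# utf-16-be  20ac
# utf-16-le  ac20
# utf-32-be  000020ac
# utf-32-le  ac200000
```

The BE and LE results hold the same code unit; only the order of the bytes differs.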

Did that clear anything up, or just confuse things more? Let me know and I’ll try to clarify.

“Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.”

Categories: Unicode Tags: ,

Making Millions from Android App Store?

November 12th, 2011 joconner No comments

Finally, after a couple of years’ wait, my cell phone plan contract ended, and I was able to affordably switch both carriers and phones. I promptly acquired the following new hardware after switching my cell service provider from AT&T to Verizon (no particular reason other than AT&T happened to be more intolerable at the time):

  • Droid Bionic cell phone
  • Samsung Galaxy Tab 10.1

Now that I have the hardware, I just have to download the Android SDK, write a new application, and become a millionaire by next weekend. Can’t wait!

Oh, but wait, that link about becoming a millionaire is for the iOS app store. Crud. Sigh….

It’s interesting to do a Google search for app store millionaire. It seems that many people have made both small and large fortunes from the Apple App Store. I’ve heard that luck is a big factor in hitting it BIG too. And of course, some of the big hits are just really stupid too.

It’s also interesting to note what my Google search didn’t immediately find. So far I haven’t found examples of Android market millionaires. Hmm… why is that?

Now that I have this hardware, my next step is to download the Android SDK and pump out a cool, viral app that makes me millions of dollars too. But have I picked the wrong platform? It’s obvious that the Android market doesn’t yet have the same number of customers as the Apple App Store. Will that change? What will it take?

 

Categories: Android Tags:

Not forgotten

November 2nd, 2011 joconner No comments

No, I haven’t forgotten my promise to cover a few more Unicode terms. However, please excuse me while I recover from my recent vacation. In this case, my vacation has rendered me useless for a couple days after my return. Hundreds of emails have gathered in my email INBOX, and I’m still processing them.

I will be back in a couple more days to describe the following:

  • encoding scheme
  • code unit
  • encoding form

In the process of describing the above, we’ll look at another term, UTF-32. We’ll eventually get to UTF-16 and UTF-8 after that.

Be patient, I’m almost done with my INBOX.

 

Categories: Unicode Tags:

Unicode Terminology

October 23rd, 2011 joconner 2 comments


I am sometimes asked whether Unicode is a 16-bit character set. The answer is not a simple no, but it is no. The question always reminds me how important terminology is too. Terminology is the focus of this particular post.

At one point long ago, when Unicode was a relative newcomer to the character set stage, it did in fact start out with a character space that had potential character values in the range from 0x0000 through 0xFFFF. In that case, and at that time until around 1995, Unicode could have been called a 16-bit character set. That is, each character could be represented with a single 16-bit integer value.

However, starting in 1996, Unicode’s character range expanded. With Unicode 2.0, the character set defined character values in the range 0x0000 through 0x10FFFF. That’s 21 bits of code space. Unicode can no longer be called a 16-bit character set. On today’s computer architectures, the smallest common integer type that can hold any of these values is 32 bits wide, so you might be tempted to say that Unicode is a 32-bit character set. But now we have to be careful how we use these terms.

The rest of this discussion is admittedly tiresome and precise. We have to define some terms to make sure we’re talking about the same things. Bear with me.

The fact is that Unicode is much more than just a character set. It embodies a set of best practices and standards for defining characters, encoding them, naming them, and processing them. In the Unicode Consortium’s own words, Unicode is:

the standard for digital representation of the characters used in writing all of the world’s languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language.

The Unicode standard also defines the Unicode character set. This is a coded character set (CCS). A coded character set assigns an integer value to each of its characters. Each character’s integer value is also called a code point. The current Unicode standard allows for code point values all the way up to 0x10FFFF. Often when we refer to Unicode code point values, we use another notation. Instead of writing the code point value as a hexadecimal number with the ‘0x’ prefix, we use ‘U+’. So, in this alternate notation, to make sure others know that we’re explicitly talking about Unicode code point values, we write U+10FFFF. However, I’m not picky about this. It is, though, a noteworthy distinction. Strictly speaking, 0x10FFFF is just a very large hexadecimal number. U+10FFFF is a specific Unicode code point value.
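
A quick Python sketch shows the relationship between a character, its integer code point, and the U+ notation. The helper name u_plus is my own; ord, chr, and sys.maxunicode are standard Python.

```python
import sys


def u_plus(ch):
    """Format a character's code point in U+ notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"


print(u_plus("A"))                 # U+0041
print(u_plus("\u20ac"))            # U+20AC
print(u_plus(chr(0x10FFFF)))       # U+10FFFF
print(sys.maxunicode == 0x10FFFF)  # True: the largest code point Python accepts
print((0x10FFFF).bit_length())     # 21: bits needed for the largest code point
```

The last line confirms the arithmetic: the largest code point value needs 21 bits, which is more than 16 and less than 32.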

So, we’ve established that Unicode is not a 16-bit character set, although it is a character set. Specifically, it is a coded character set. Remember how I’ve defined a CCS above. Sometimes you’ll hear other terms used as if they were equivalent to a coded character set. The terms character set and charset are often used as synonyms, though strictly speaking neither implies an assignment of code point values.

An encoding is something else, and it refers to how we serialize a code point for storage or transfer. Those clever people within the Unicode Technical Committee have devised several ways to encode the Unicode (coded) character set, giving us 3 common encodings:

  • UTF-32
  • UTF-16
  • UTF-8

Terms We’ve Learned

Here are the terms we’ve used so far:

  • character set
  • coded character set/charset
  • character encoding

Next Up

Next time, let’s talk about these encodings: UTF-32, UTF-16, and UTF-8.

 

Categories: Unicode Tags: , ,

Seeking input on article topics

October 14th, 2011 joconner No comments

Hi there again,

This is just a quick note to say thank you for reading this blog. Internationalization is definitely a favorite topic of mine. The problem is that I enjoy so many topics that sometimes I don’t stay focused.

To help with that, I’m asking you to make suggestions. What would you like to read about here? What globalization topics interest you the most? What topics do you have trouble finding information about on the web or in your favorite magazine? I’m here to help you of course, but I get a lot out of researching material too!

Have a great weekend!

Categories: Uncategorized Tags:

PDF files are bad source documents for translation?

October 12th, 2011 joconner 3 comments

While reading my latest copy of Multilingual magazine, I found an interesting assertion about creating translation-friendly source documents. One article starts its discussion by stating this:

Whenever possible, avoid using PDF files as the source document format
for translation. Always try to provide the original file format … [because]
PDF files cannot currently be edited in some programs and instead have to
be transformed into another format (usually Word) before translation.

[Multilingual, Oct/Nov 2011, "Creating translation-oriented source documents," p. 43]

This statement makes sense in some ways I suppose. For example, I don’t usually think of PDF files as being easily editable, and perhaps translation tools don’t usually understand the format. However, being somewhat new to the PDF format, I’ll ask that you please forgive my ignorance, but I do have some questions, especially if you are a translation or translation tools company:

  1. Do you agree with the author’s statement? If so, why?
  2. If PDF files do fall into this category of being difficult source documents, can translation tool vendors do something about that? What is possible?
  3. If you get a PDF document to translate, what do you do? Return it? Does it fit into your translation workflow and toolsets?
Categories: Localization Tags: ,

Google’s Dart is JavaScript++?

October 11th, 2011 joconner 3 comments

Several years ago, Google was unhappy with the pace of change in the Java language and community. Their solution was to create the Dalvik VM and their Android platform. They were careful to not call it a Java platform, a Java implementation, JVM, or JRE; instead, they said it was the Java language on the Dalvik VM.

Is Google now doing something similar with JavaScript? It is true that the JavaScript language is evolving slowly. Google has certainly shown before that it can innovate without a standards body (see Java above). Is Google trying the same thing again, this time ignoring the ECMAScript/JavaScript standards community and moving ahead without them?

Google has a new language called Dart. This language is like JavaScript, but has many new language features. It will be interesting to see if Google gets as much benefit and praise for this as they did for Android and the Dalvik VM.

More on Dart:

What I find so interesting about the Dart announcement is that Google already has a great tool for developing web applications without coding directly in JavaScript: GWT, the Google Web Toolkit. Basically, it’s their very popular toolset for writing applications in Java and compiling them down to browser-neutral JavaScript. If you’re unfamiliar with GWT, yes, you read that correctly: a compiler from Java to JavaScript. As a user of GWT, I can say that it works great! It already has great appeal in the developer community. So why Dart? Why yet another language that compiles to JavaScript?

Categories: JavaScript, Web Tags: , ,

Answers to Which Countries Have Multiple Time Zones

October 6th, 2011 joconner No comments

Yesterday I asked the question:

Which countries have multiple time zones?

And I promised an answer today. Congratulations to Paul Clapham, who made a great effort listing them!

The W3C document Working with Time Zones is the source of my information. That document lists the following countries as having more than one time zone:

  • Argentina
  • Australia
  • Brazil
  • Canada
  • Chile
  • Democratic Republic of the Congo
  • Ecuador
  • France
  • Greenland
  • Indonesia
  • Kazakhstan
  • Kiribati
  • Mexico
  • Micronesia
  • Mongolia
  • New Zealand
  • Portugal
  • Russia
  • Spain
  • The United States

 


Countries with Multiple Time Zones

October 6th, 2011 joconner 1 comment

Recently I came across a W3C document about times, dates, and time zones. The document claims that only 20 countries observe more than one time zone. The United States is in that list. Can you name the rest of them?

I’ll post the answers to this question tomorrow! Until then, which countries do you think have more than one time zone?

P.S. Please provide your answers as comments!
