The absolute minimum you should know about internationalization


Internationalization is a design and engineering task that prepares your software product to be localized. It doesn’t create a localized product; instead, it puts your product in a state that allows localization. The goal of internationalization should be a single code base that can be used as-is to create multiple localized versions of your product.

This article provides a high-level description of some issues you must resolve during internationalization. This is not a comprehensive list:

  • Character sets
  • Resource externalization
  • User interface design
  • Data formats
  • Sorting

Character Sets

Your application will most likely manage, manipulate, store, and display information. Much of the information will be user-readable text. One property of text is its character set.

If you want a global-ready application, the choice of character set is simple: Unicode. Unicode allows you to manage text in practically any script without losing data due to character conversion problems. Regardless of the default character set of the underlying host OS, your application should convert text to Unicode for internal manipulation. Additionally, your application should transmit and store text as Unicode. Doing anything else adds complexity that no modern operating environment requires.

Unicode has several possible encodings, including UTF-16 and UTF-8. My experience is that developers rarely get to use just one of these. However, they are BOTH Unicode. Their only significant difference is how a specific code point is encoded in code units. Unless you have a well-understood reason for doing otherwise, I suggest you store and transmit text in UTF-8. Your specific programming language may require you to use UTF-16 for text operations. When displaying text to your user, you might use UTF-16 in a desktop application. When rendering HTML views, you can typically use UTF-16 or UTF-8. I suggest you use UTF-8 everywhere possible.
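To make the difference between the encodings concrete, here is a small sketch that counts the code points, UTF-16 code units, and UTF-8 bytes for the same text. It assumes a JavaScript runtime that provides the standard TextEncoder; the function name describeEncoding is only illustrative.

```javascript
// Count code points, UTF-16 code units, and UTF-8 bytes for the same text.
function describeEncoding(text) {
  return {
    codePoints: Array.from(text).length, // string iteration is by code point
    utf16Units: text.length, // JavaScript strings are sequences of UTF-16 code units
    utf8Bytes: new TextEncoder().encode(text).length, // TextEncoder always produces UTF-8
  };
}

console.log(describeEncoding("A")); // { codePoints: 1, utf16Units: 1, utf8Bytes: 1 }
console.log(describeEncoding("\u20ac")); // euro sign: { codePoints: 1, utf16Units: 1, utf8Bytes: 3 }
console.log(describeEncoding("\u{1F600}")); // emoji: { codePoints: 1, utf16Units: 2, utf8Bytes: 4 }
```

The text is the same Unicode in every case; only the number of code units needed to carry each code point changes.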

Resource Externalization

A resource is any text label, message, graphic image, video, audio, or other application file that you intend to present to the user. Instead of hard-coding these resources into your application code, you should extract them into external files that can be used at run-time. By extracting user-facing resources into resource files, you make translation and localization easier. Practically every programming environment provides a mechanism for creating external resource files. 
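As a rough illustration of externalized text, the sketch below looks up messages by key instead of hard-coding them. The bundles object stands in for external resource files, and the names bundles, getMessage, and the message keys are all hypothetical; your programming environment most likely has its own resource mechanism that you should prefer.

```javascript
// Stand-ins for external resource files (for example en.json and fr.json).
// In a real project these objects would be loaded at run time.
const bundles = {
  en: { greeting: "Hello", farewell: "Goodbye" },
  fr: { greeting: "Bonjour" }, // "farewell" has not been translated yet
};

// Look up a message by key. Fall back to the default language, and
// finally to the key itself so that a missing string is easy to spot.
function getMessage(allBundles, locale, key, defaultLocale = "en") {
  const localized = allBundles[locale];
  if (localized && key in localized) return localized[key];
  const fallback = allBundles[defaultLocale];
  if (fallback && key in fallback) return fallback[key];
  return key;
}

console.log(getMessage(bundles, "fr", "greeting")); // Bonjour
console.log(getMessage(bundles, "fr", "farewell")); // Goodbye (fallback to en)
```

Because the application code only knows the keys, a translator can add or change a language without touching the code.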

User Interface Design

User interface layout is often affected by the length of text labels, fields, and other visible text. When designing layouts, remember that field and label sizes will increase for some language translations. Design your user interface with the largest label and field lengths in mind. Additionally, follow the typical rules for avoiding culturally sensitive images, hand gestures, and body parts. Also, avoid concatenating shorter pieces of text to build up larger sentences. When translated, the concatenated text rarely has correct syntax or meaning.
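One common alternative to concatenation is to give translators a whole sentence with named placeholders. The sketch below is a minimal version of that idea; formatMessage, the templates object, and the placeholder syntax are assumptions made for illustration, and it does not attempt plural or gender handling.

```javascript
// Translators receive the complete sentence and can move the
// placeholders to wherever their grammar requires.
const templates = {
  en: "{user} sent you {item}.",
  de: "{user} hat dir {item} geschickt.", // the verb is split around the object
};

// Replace each {name} placeholder with the matching parameter.
// Unknown placeholders are left untouched so they are easy to notice.
function formatMessage(template, params) {
  return template.replace(/\{(\w+)\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(params, name) ? String(params[name]) : match
  );
}

console.log(formatMessage(templates.en, { user: "Ana", item: "a photo" }));
// Ana sent you a photo.
console.log(formatMessage(templates.de, { user: "Ana", item: "ein Foto" }));
// Ana hat dir ein Foto geschickt.
```

Building the German sentence by appending fragments in English word order would have put the verb in the wrong place.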

Some languages are written from right-to-left. If targeting those languages, remember that the entire layout of page components is often arranged from right-to-left. You may need to create a “reversible” layout that can accommodate those languages and cultures.

Data Formats

Numbers and dates have different formats around the world. Digit separators, currency symbols, and date field orders are all part of the many differences that you’ll need to consider. Fortunately, you don’t have to discover the correct formats and standards for every culture. Many programming environments already provide libraries to format numbers, currencies, and dates using the Common Locale Data Repository (CLDR) formats. 

The main point I want to share about formats is this: separate concerns for data formats by storing and manipulating data in a canonical, non-localized form and apply localized formats only in the “view” layer of your application.
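Here is a small sketch of that separation using the standard Intl objects built into JavaScript. The values stay in canonical form and a locale is applied only when rendering. The helper names and sample values are made up for illustration, and the exact output depends on the locale data shipped with your runtime.

```javascript
// Canonical, non-localized values as they might be stored or transmitted.
const amount = 1234567.89;
const shippedAt = new Date(Date.UTC(2024, 6, 4)); // 4 July 2024, UTC

// View-layer helpers: the locale is applied only at display time.
function formatNumber(value, locale) {
  return new Intl.NumberFormat(locale).format(value);
}

function formatDate(value, locale) {
  return new Intl.DateTimeFormat(locale, {
    year: "numeric",
    month: "numeric",
    day: "numeric",
    timeZone: "UTC", // keep the example independent of the host time zone
  }).format(value);
}

console.log(formatNumber(amount, "en-US")); // 1,234,567.89
console.log(formatNumber(amount, "de-DE")); // 1.234.567,89
console.log(formatDate(shippedAt, "en-US")); // 7/4/2024
console.log(formatDate(shippedAt, "en-GB")); // 04/07/2024
```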


Sorting

Languages have sorting rules. Those rules help you find names or products in long lists. Dictionaries, phone books, and product catalogs use linguistic sorting to help people find information quickly. When presenting long lists to your users, your application should use those sorting rules as well. Learn and use the sorting or collation libraries in your programming language or technology environment.
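As one example of such a library, JavaScript ships with Intl.Collator. The sketch below sorts the same short list for German and for Swedish, which disagree about where "ä" belongs. The word list and the sortForLocale helper are illustrative only.

```javascript
const words = ["z", "ä", "a"];

// The default sort compares UTF-16 code units, so "ä" lands after "z".
const byCodeUnit = [...words].sort();

// Sort a copy of the list using the collation rules of the given locale.
function sortForLocale(list, locale) {
  return [...list].sort(new Intl.Collator(locale).compare);
}

console.log(byCodeUnit); // [ 'a', 'z', 'ä' ]
console.log(sortForLocale(words, "de")); // [ 'a', 'ä', 'z' ]
console.log(sortForLocale(words, "sv")); // [ 'a', 'z', 'ä' ]
```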


Internationalization is an effort to create products that can be translated and localized for many languages and cultures. Creating an internationalized product requires that you consider and plan for a variety of common technical issues. A few of those issues are character set choice, user-interface design, data formats and sorting. You rarely have to solve those issues yourself; you can often find and use existing libraries for this purpose.



Reactions as Another Aspect of Social Media

One of the new trends in making web content more social is the recording of reader impressions or reactions. For example, I just read an article about Father’s Day and the article included a poll that allowed me to quickly provide my response or impression of the content. The poll wasn’t a questionnaire that I’d never take the time to fill out. Instead, it was just a few buttons or image maps that require a single click:

[Image: article impression poll]

What’s interesting about this is that not only do I get to enjoy the article content, but I also get an indication of how others perceive or respond to the content — obviously making the content more social. What a great idea! 

Another interesting part of this to me is the choice to keep the response anonymous and aggregated. The above image, for example, shows the categories of reader response but doesn’t tell me exactly who responded in any of the categories. Certainly it would be possible, especially if this were tied to Facebook or Google Plus, to see what my friends or colleagues think about the content too.

I wonder whether the anonymity preference is specific to US English readers. As I think about it, I’m happy to participate in the poll, but I might not want to make my specific opinion public knowledge. I wonder if other cultures would feel differently in general? What groups of people would feel more open to expressing opinions publicly and associating their real or online identities to their response?

Oh, my response to this particular article was “THINK”…but not about the article content. Instead, the article and the poll made me think about changes in social media. Every time I think we’ve tapped our creative juices out, somebody thinks of something new and impressive to make the online world more social. 

Language Signals on the Web


Presenting a user interface in the customer’s language should be a high priority for your product management team. If it isn’t, they’re not doing their job, in my opinion. Assuming you have the feature in your product roadmap, how do you choose the UI language for your customer on the web? After all, web applications have multiple, sometimes conflicting language signals.

A language signal is an indicator that gives your application a hint of your customer’s preferred language. In a web application, these signals are numerous. To help you in choosing from all these signals, I believe you should honor the preferences in the following priority. That is, check each signal for its existence in this order, and use the first signal that is available:

  1. query parameters, for example
  2. domain name or path parameters, i.e. or
  3. persistent application preferences
    • cookies
    • customer profile or settings
  4. browser accept-language headers
  5. geolocation hints
  6. default application language
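The priority order above can be sketched as a simple loop that returns the first usable signal. Everything named here is hypothetical: the signal property names, the resolveLanguage function, and the choice to check cookies before the customer profile. The sketch also assumes you only accept languages your application actually supports, which is my addition to the list.

```javascript
// Signals in priority order. Each entry names a property that holds the
// language tag found in that source, or is undefined when absent.
const SIGNAL_PRIORITY = [
  "queryParam",
  "domainOrPath",
  "cookie",
  "profile",
  "acceptLanguage",
  "geolocation",
];

// Return the first available, supported signal; otherwise the default.
function resolveLanguage(signals, supported, defaultLanguage) {
  for (const source of SIGNAL_PRIORITY) {
    const tag = signals[source];
    if (tag && supported.includes(tag)) {
      return { language: tag, source };
    }
  }
  return { language: defaultLanguage, source: "default" };
}

const supported = ["en", "fr", "de"];
console.log(resolveLanguage({ cookie: "fr", acceptLanguage: "de" }, supported, "en"));
// { language: 'fr', source: 'cookie' }
```

Returning the source along with the language makes it easier to debug why a customer saw a particular language.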

Query Parameters

Query parameters are often used to override every other language or application signal. If parameters are used, your customers (QE engineers or even end users) are intentionally trying to coerce the application into ignoring all other language signals. Query parameters beat out any other language signal when they are provided in the same request.

Domain name or Path Parameters

Sometimes you will partition your localized sites by domain name or by language tag paths. A domain name partition means that you select different or even localized domain names for specific markets. For example, your French site could be You can also distinguish language preference on the path like this: or When query parameters don’t exist, this is the next choice in our prioritization.
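Here is a rough sketch of reading the first two signals from a request URL with the standard URL class. The URL shapes are assumptions for illustration only: a query parameter named lang, a language tag as the first path segment, and a language tag as the first label of the host name, all on the reserved example.com domain. Your own sites may be laid out differently.

```javascript
// Look for a language signal in the URL: query parameter first,
// then the path, then the host name.
function languageFromUrl(urlString, supported) {
  const url = new URL(urlString);

  const fromQuery = url.searchParams.get("lang"); // hypothetical parameter name
  if (fromQuery && supported.includes(fromQuery)) {
    return { language: fromQuery, source: "query" };
  }

  const firstSegment = url.pathname.split("/")[1]; // "/de/products" -> "de"
  if (firstSegment && supported.includes(firstSegment)) {
    return { language: firstSegment, source: "path" };
  }

  const firstLabel = url.hostname.split(".")[0]; // "fr.example.com" -> "fr"
  if (supported.includes(firstLabel)) {
    return { language: firstLabel, source: "domain" };
  }

  return null; // no language signal in the URL
}

const siteLanguages = ["en", "fr", "de", "ja"];
console.log(languageFromUrl("https://fr.example.com/de/products?lang=ja", siteLanguages));
// { language: 'ja', source: 'query' }
```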

Persistent Settings

Of course, if your application has allowed the user to select a language preference, the application should honor that preference. The preference may be stored in a cookie or even in a user profile attribute on the server.

Accept-Language Header

Most browsers provide a list of user language preferences in each request. These languages are sent as the value of the Accept-Language request header. The header can carry one or more language tags, each with an optional quality value (q) that ranks it, so the list expresses the user’s order of preference when requesting content. In the absence of other signals, your application should respond to the Accept-Language header.
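A minimal sketch of reading that header follows. It splits the header value into language tags, reads the optional q value of each one (a missing q counts as 1), and returns the tags in preference order. The function name parseAcceptLanguage is my own, and the parser is intentionally lenient rather than a strict implementation of the HTTP grammar.

```javascript
// Parse an Accept-Language header value into tags ordered by preference.
function parseAcceptLanguage(header) {
  return header
    .split(",")
    .map((item, index) => {
      const [tag, ...params] = item.trim().split(";");
      const qParam = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      const q = qParam ? Number(qParam.slice(2)) : 1; // a missing q means 1
      return { tag: tag.trim(), q, index };
    })
    // Drop empty entries, unparsable q values, and q=0 ("not acceptable").
    .filter((entry) => entry.tag && !Number.isNaN(entry.q) && entry.q > 0)
    // Highest q first; keep the header order for equal q values.
    .sort((a, b) => b.q - a.q || a.index - b.index)
    .map((entry) => entry.tag);
}

console.log(parseAcceptLanguage("fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5"));
// [ 'fr-CH', 'fr', 'en', 'de', '*' ]
```

You would then walk the returned list and pick the first tag your application supports.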

Geolocation Hints

The last signal that actually provides information about the user is the geographic location from which the user is accessing your content. Although imperfect and imprecise, geography can provide a hint to your customer’s language preference. It’s definitely not the best indicator because multiple languages can be spoken in any geographic location. In a pinch, though, you may be able to provide a language selection tool that provides a list of the most prominent languages spoken in a specific area of the world.

Default Application Language

Finally, when all else fails and there have been no other indicators, you can provide the UI in the default language of the application. If your company is in Germany, maybe the default is German. If it’s the U.S., your default language is most likely English…or maybe even Spanish. You have to display the application in some language, and the default at this point is your last option.

In Summary

To summarize, a web application can serve a global audience. In doing so, it may accommodate customers in a variety of languages. The language of your application’s user interface can be chosen from numerous signals from the user. Those signals are important data points to consider when making the language choice to present to the user. Using the signals described in this article, you’ll be able to consider some of the more important language preference indicators. Follow the prioritization I’ve outlined here, and you’ll make the right language choice most of the time…until you don’t. And there will be times when you don’t make the right choice from all these signals. When that happens, and it will happen, you have to give your users some way to correct the choice. Take a look at my previous blog entry about language selection widgets for help with that.



Deconstructing BCP 47

BCP 47 stands for Best Current Practice 47, and even spelled out, the name alone means almost nothing. So, what is BCP 47?

BCP 47 is the current best practice for creating language codes, which the standard itself calls language tags. A language code is a text identifier for a specific human language, and it lets you define the language in terms of a base language, a script used to write that language, and even a particular region in which the language is used. BCP 47 prescribes the code and its parts with enough precision to uniquely identify a natural, human language and distinguish it from other languages.
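To see those parts in practice, the sketch below uses the standard Intl.Locale object in JavaScript to split a tag into its language, script, and region. The helper name describeLanguageTag and the sample tags are only illustrative.

```javascript
// Break a BCP 47 language tag into its main parts.
function describeLanguageTag(tag) {
  const locale = new Intl.Locale(tag); // throws a RangeError for malformed tags
  return {
    canonical: locale.toString(), // canonical casing, for example "en-US"
    language: locale.language,
    script: locale.script, // undefined when the tag has no script subtag
    region: locale.region, // undefined when the tag has no region subtag
  };
}

console.log(describeLanguageTag("zh-Hant-TW"));
// { canonical: 'zh-Hant-TW', language: 'zh', script: 'Hant', region: 'TW' }
console.log(describeLanguageTag("EN-us"));
// { canonical: 'en-US', language: 'en', script: undefined, region: 'US' }
```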

BCP 47 is a standard that uses other standards, and it prescribes how to combine those standards together to create a language code. BCP 47 is a combination of at least the following existing standards:

Why is this important to you in the internationalization or localization business? It is important because our industry requires common standards and agreement for how to communicate, transfer, and exchange language data. A BCP 47 tag is necessary to accurately identify language text across different applications and tools.

Lots of existing applications, tools, and platforms already use BCP 47:

This is not an exhaustive list, but hopefully it gives you a sense of the importance of this standard. When you need to tag data with a language identifier, you should seriously consider BCP 47 instead of any home-grown convention.

Having provided plenty of links in this post, I hope you’ll take some time to familiarize yourself with this important language tagging standard. Happy reading!



The difference between 1st and 3rd party cookies

The question: What’s a third party cookie? OK, let’s assume I’m an expert at these things, which I’m not, but let’s just assume that I play an expert at these things. Here’s your answer….

Cookies are small pieces of information that your browser stores on behalf of the sites you visit. If you visit a site, say, that site might decide to store some state on your browser: a cookie. The cookie is a 1st party cookie because it is created and sent back and forth between your computer and that site,, in your browser’s URL field.

So what’s a 3rd party cookie? Well, often a document will fetch additional pieces of information from other sites, maybe a javascript file or an image, or maybe even entire documents. Anytime the document from imports a file or script or image from a different web site (the 3rd party site), that site can also set cookies. Those cookies are 3rd party cookies.

Let’s look at this a little more closely. Imagine you browse to This site might send back a 1st party cookie: FOO=1. The cookie could be stored in the domain of, and only visits to that would prompt the browser to automatically send that cookie value to on subsequent visits. Because of the rules around cookie security, however, your browser would never send an cookie to another site like

Now imagine that the file has additional links to other sites. Maybe hello.html has an image link that pulls a photo from Now is being called from your document. Your browser will not send any cookie values to the referenced site. However, the example2 site may decide to drop a cookie as well. Since it is not the primary site of your document, which is, the cookie is called a 3rd party cookie.
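The distinction can be sketched as a comparison between the site of the document and the site that serves a referenced file. The host names below are hypothetical and use the reserved example.com and example.org domains. The naiveSite helper is a deliberate simplification: real browsers consult the Public Suffix List, so this version would misclassify hosts under suffixes such as co.uk.

```javascript
// Naive "site" of a host: its last two labels. This is an assumption made
// for illustration; browsers use the Public Suffix List instead.
function naiveSite(hostname) {
  return hostname.split(".").slice(-2).join(".");
}

// Classify cookies set by a resource relative to the page that loads it.
function cookieParty(pageUrl, resourceUrl) {
  const pageSite = naiveSite(new URL(pageUrl).hostname);
  const resourceSite = naiveSite(new URL(resourceUrl).hostname);
  return pageSite === resourceSite ? "1st party" : "3rd party";
}

const page = "https://www.example.com/hello.html";
console.log(cookieParty(page, "https://static.example.com/app.js")); // 1st party
console.log(cookieParty(page, "https://images.example.org/photo.jpg")); // 3rd party
```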

Hmmm… so here we have examples of 1st party and 3rd party cookies. So, what’s a 2nd party cookie? I don’t actually know the answer to that question. If the site of your document is the 1st party and referenced sites are 3rd party, that really leaves the browser or its user as the 2nd party. Maybe a 2nd party cookie (which I’ve never really heard in any discussion) would simply be a cookie manually created by the user? Hmmmm…. probably not an important point. However, hopefully you know the difference between 1st and 3rd party cookies!

[Image: cookie example]

Determining a visitor’s timezone

We’ve already decided that determining a timezone for a desktop application is easy. It’s too easy, and so let’s not even waste our time there. Instead, let’s think about something more difficult: how do you determine the timezone of a visitor to your website?

If your site authenticates users, you have most of your problem solved. Along with your user’s preference for username, password, and favorite soccer team (if soccer is your web site’s focus), you can encourage users to register their locale and timezone. This really isn’t so much to ask, not if you are going to offer them rich, useful, or entertaining content.

So ask for a timezone preference! When you ask, however, make sure you ask for something more useful than a simple UTC time offset. Knowing that a visitor is in a UTC-08:00 time zone is helpful but not as helpful as knowing that the same visitor is in the America/Los_Angeles time zone. The latter option obviously provides more information about the user. Of course, the America/Los_Angeles time zone tells your system that a visitor requires a UTC-8 offset, but it also differentiates this user from someone in Canada who may use the same hour:minute offset, and it carries the daylight saving rules that a fixed offset cannot express. It’s more information! More information usually translates into a better user experience, especially if you take care to utilize that information to customize the experience.
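To show what the named zone adds, here is a sketch that computes the UTC offset of a named time zone at a given instant using the standard Intl.DateTimeFormat. The helper name utcOffsetMinutes is hypothetical, and note that its sign follows the usual convention (negative west of UTC), which is the opposite of getTimezoneOffset().

```javascript
// Offset of a named time zone from UTC, in minutes, at a given instant.
// Negative values are west of UTC (for example -480 for UTC-08:00).
function utcOffsetMinutes(timeZone, date) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hourCycle: "h23", // hours 0 to 23, so midnight is never reported as 24
    year: "numeric",
    month: "numeric",
    day: "numeric",
    hour: "numeric",
    minute: "numeric",
    second: "numeric",
  }).formatToParts(date);
  const get = (type) => Number(parts.find((p) => p.type === type).value);
  // Rebuild the zone's wall-clock time as if it were UTC, then compare.
  const wallClockAsUtc = Date.UTC(
    get("year"), get("month") - 1, get("day"),
    get("hour"), get("minute"), get("second")
  );
  return Math.round((wallClockAsUtc - date.getTime()) / 60000);
}

const january = new Date("2024-01-15T12:00:00Z");
const july = new Date("2024-07-15T12:00:00Z");
console.log(utcOffsetMinutes("America/Los_Angeles", january)); // -480
console.log(utcOffsetMinutes("America/Los_Angeles", july)); // -420
console.log(utcOffsetMinutes("America/Phoenix", january)); // -420
```

The same zone name yields different offsets in winter and summer, and the same offset can belong to different zones, which is why storing only the offset loses information.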

It is also possible to get the browser’s default timezone using a bit of JavaScript. Is this the user’s preference? Maybe, maybe not, but it is available. I’ve heard arguments that suggest that this is not the correct timezone to use. However, my opinion is emphatically this: the timezone of the user’s host pc is probably the best thing you have available in the absence of a specific user preference setting. Yes, people move around; yes, a user can visit your site on the west coast one day and then the east coast the next day without changing the pc setting.

// getTimezoneOffset() returns UTC minus local time, in minutes (480 for UTC-08:00).
const d = new Date();
const tzOffset = d.getTimezoneOffset();

Is this perfect? No, not at all. In fact, this JavaScript really just provides the offset, in minutes, between UTC and local time; the value is positive for zones west of UTC. Still, for formatting a time with a correct timezone offset, this is useful.
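As a small follow-up sketch, the offset can be turned into a readable label, and most current browsers can also report the host's named time zone through the standard Intl API. The helper name formatUtcOffset is my own invention, and the zone name you get back depends entirely on how the visitor's machine is configured.

```javascript
// Turn the value from getTimezoneOffset() into a label such as "UTC-08:00".
function formatUtcOffset(minutesWestOfUtc) {
  const total = -minutesWestOfUtc; // flip to the usual "east is positive" sign
  const sign = total < 0 ? "-" : "+";
  const abs = Math.abs(total);
  const hours = String(Math.floor(abs / 60)).padStart(2, "0");
  const minutes = String(abs % 60).padStart(2, "0");
  return `UTC${sign}${hours}:${minutes}`;
}

// Offset of the host right now, and the host's named time zone.
const offsetLabel = formatUtcOffset(new Date().getTimezoneOffset());
const zoneName = Intl.DateTimeFormat().resolvedOptions().timeZone;

console.log(formatUtcOffset(480)); // UTC-08:00
console.log(formatUtcOffset(-330)); // UTC+05:30
console.log(offsetLabel, zoneName); // depends on the host settings
```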

If you don’t use a user preference setting in your app or a bit of JavaScript to query your visitor’s host timezone, what else do we have? Hmm…that’s an interesting question. What else is available for determining the timezone of a user visiting your site? Well, they do have an IP address. There are public services and databases that attempt to map an IP address to a location and timezone, but I just don’t know how accurate they are. I suppose I don’t have any specific reason to doubt their viability; it certainly seems possible at some level. But I’ve not actually spoken with anyone who has used this accurately or successfully. If you have, let me know.

Until next time!