Java and BCP 47 Language Tags

Since Java 7, Java’s Locale class has taken on the role of a language tag as defined by BCP 47 (currently RFC 5646). The newer language tag support lets developers be more precise in identifying language resources. A Locale used as a language tag identifies a language in this general form:

language-script-region-variant-extension-privateuse


Of course, you can continue to think of a locale as a lang_region_variant identifier, but Java now follows RFC 5646 so that the Locale class supports language, script, broader regions, and even newer extensions if needed. And if the rules for generating this string of identifiers seem intimidating, you can use the Locale.Builder class to build up the tag without worrying about malforming it.
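Here is a minimal sketch of Locale.Builder in action. The class and method names (LanguageTagBuilderDemo, serbianLatin) are my own, chosen only for illustration; the Builder calls themselves are part of java.util.Locale since Java 7.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

class LanguageTagBuilderDemo {
    // Builds a Locale for Serbian written in Latin script as used in Serbia.
    static Locale serbianLatin() {
        return new Locale.Builder()
                .setLanguage("sr")   // ISO 639 language code
                .setScript("Latn")   // ISO 15924 script code
                .setRegion("RS")     // ISO 3166 region code
                .build();
    }

    public static void main(String[] args) {
        System.out.println(serbianLatin().toLanguageTag()); // sr-Latn-RS

        // The builder validates each subtag and rejects ill-formed input.
        try {
            new Locale.Builder().setLanguage("not a language");
        } catch (IllformedLocaleException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The builder throws IllformedLocaleException as soon as a subtag is ill-formed, so a bad tag never makes it into a Locale.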

The primary language identifier is almost the same item you’ve always known; it’s an ISO 639 2-letter or 3-letter code. The spec recommends using the shortest code available for a language.

The script is new. You can now add a proper script identifier that specifies the writing system used for the language. Some languages can be written in more than one writing system. For example, Japanese writers can use several different scripts: kanji, hiragana, katakana, and even “romaji” or Latin script. Serbian is another language often written in either Latin or Cyrillic characters.
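To see the script subtag from the reading side, here is a small sketch that parses a tag with Locale.forLanguageTag and asks for its script. The helper name scriptOf is hypothetical; the tags are just examples matching the languages mentioned above.

```java
import java.util.Locale;

class ScriptSubtagDemo {
    // Returns the 4-letter script code of a language tag, or "" if the tag has none.
    static String scriptOf(String languageTag) {
        return Locale.forLanguageTag(languageTag).getScript();
    }

    public static void main(String[] args) {
        System.out.println(scriptOf("sr-Cyrl-RS")); // Cyrl
        System.out.println(scriptOf("sr-Latn-RS")); // Latn
        System.out.println(scriptOf("ja-JP").isEmpty()); // true: no script subtag given
    }
}
```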

The region identifier was once limited to 2-letter ISO 3166 codes, but now you can also use the United Nations 3-digit macro geographical region codes in the region portion of a language tag. A macro geographical region identifies a larger region that comprises more than one country. For example, the UN currently defines Eastern Europe to be macro region 151 and includes 10 countries within it.
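As a rough illustration, the sketch below puts the Eastern Europe code 151 into the region portion of a tag. Pairing it with Russian ("ru") is only my assumption for the example, and the class and method names are made up.

```java
import java.util.Locale;

class MacroRegionDemo {
    // Russian as used across the UN macro region 151 (Eastern Europe).
    static Locale russianInEasternEurope() {
        return new Locale.Builder()
                .setLanguage("ru")
                .setRegion("151") // 3-digit UN region code instead of a 2-letter country
                .build();
    }

    public static void main(String[] args) {
        Locale locale = russianInEasternEurope();
        System.out.println(locale.toLanguageTag()); // ru-151
        System.out.println(locale.getCountry());    // 151
    }
}
```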

Eastern Europe 151

Finally, you can use variant, extension, and privateuse sub-tags to provide even more context for a language tag. See RFC 5646 for more details on these. I suggest that you also use the Locale.Builder class to assist if you need to use this level of detail.
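Here is a sketch of what that level of detail can look like with Locale.Builder. The specific subtags are only examples I picked: the registered variant "1996" (German orthography of 1996), the Unicode extension keyword "co-phonebk" (phonebook collation), and a made-up private use subtag "myapp".

```java
import java.util.Locale;

class DetailedTagDemo {
    static Locale detailedGerman() {
        return new Locale.Builder()
                .setLanguage("de")
                .setRegion("DE")
                .setVariant("1996")                                   // variant subtag
                .setUnicodeLocaleKeyword("co", "phonebk")             // "u" extension
                .setExtension(Locale.PRIVATE_USE_EXTENSION, "myapp")  // "x" private use (hypothetical value)
                .build();
    }

    public static void main(String[] args) {
        Locale locale = detailedGerman();
        System.out.println(locale.toLanguageTag());           // de-DE-1996-u-co-phonebk-x-myapp
        System.out.println(locale.getVariant());              // 1996
        System.out.println(locale.getUnicodeLocaleType("co")); // phonebk
    }
}
```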

Take a look at the Locale documentation for all the details on using these new features. They definitely give you much more control of how you identify and use language resources in your internationalized applications.

Language Signals on the Web


Presenting a user interface in the customer’s language should be a high priority for your product management team. If it isn’t, they’re not doing their job, in my opinion. Assuming you have the feature in your product roadmap, how do you choose the UI language for your customer on the web? After all, web applications have multiple, sometimes conflicting, language signals.

A language signal is an indicator that gives your application a hint of your customer’s preferred language. In a web application, these signals are numerous. To help you in choosing from all these signals, I believe you should honor the preferences in the following priority. That is, check each signal for its existence in this order, and use the first signal that is available:

  1. query parameters, for example
  2. domain name or path parameters, i.e. or
  3. persistent application preferences
    • cookies
    • customer profile or settings
  4. browser accept-language headers
  5. geolocation hints
  6. default application language
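The priority list above can be sketched as a simple "first signal wins" lookup. This is only an illustration under my own assumptions: each signal has already been extracted from the request and is passed in as an Optional<Locale>, and the class, method, and parameter names are hypothetical. It assumes Java 8 or later.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.stream.Stream;

class LanguageSignalResolver {
    /**
     * Returns the first language signal that is present, checked in priority
     * order, or the application default when no signal is available.
     */
    static Locale resolve(Optional<Locale> queryParameter,
                          Optional<Locale> domainOrPath,
                          Optional<Locale> storedPreference,
                          Optional<Locale> acceptLanguage,
                          Optional<Locale> geolocationHint,
                          Locale applicationDefault) {
        return Stream.of(queryParameter, domainOrPath, storedPreference,
                         acceptLanguage, geolocationHint)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .findFirst()
                .orElse(applicationDefault);
    }

    public static void main(String[] args) {
        // No query parameter or path signal; a stored preference beats the header.
        Locale chosen = resolve(
                Optional.empty(), Optional.empty(),
                Optional.of(Locale.FRENCH), Optional.of(Locale.GERMAN),
                Optional.empty(), Locale.ENGLISH);
        System.out.println(chosen.toLanguageTag()); // fr
    }
}
```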

Query Parameters

Query parameters are often used to override every other language or application signal. If parameters are used, your customers (QE engineers or even end users) are intentionally trying to coerce the application into ignoring all other language signals. Query parameters beat any other language signal provided in the same request.

Domain name or Path Parameters

Sometimes you will partition your localized sites by domain name or by language tag paths. A domain name partition means that you select different or even localized domain names for specific markets. For example, your French site could be You can also distinguish language preference on the path like this: or When query parameters don’t exist, this is the next choice in our prioritization.

Persistent Settings

Of course, if your application has allowed the user to select a language preference, the application should honor that preference. The preference may be stored in a cookie or even in a user profile attribute on the server.

Accept-Language Header

Most browsers send a list of the user’s language preferences in each request. These languages are provided in the Accept-Language request header. The header can carry one or more language tags, optionally weighted with quality values (q), and they indicate the user’s order of preference when requesting content. In the absence of other signals, your application should respond to the Accept-Language header.
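If you are on Java 8 or later, the Locale class can do most of the header work for you. The sketch below is one possible approach, not the only one: it parses the header value with Locale.LanguageRange.parse and matches it against the languages your application supports with Locale.lookup. The class name, the choose method, and the sample header are my own assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class AcceptLanguageDemo {
    /**
     * Picks the best supported locale for an Accept-Language header value,
     * or returns the fallback when the header is missing, malformed, or unmatched.
     */
    static Locale choose(String header, List<Locale> supported, Locale fallback) {
        if (header == null || header.trim().isEmpty()) {
            return fallback;
        }
        try {
            // parse() orders the ranges by descending q weight.
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(header);
            Locale match = Locale.lookup(ranges, supported);
            return match != null ? match : fallback;
        } catch (IllegalArgumentException e) {
            return fallback; // malformed header
        }
    }

    public static void main(String[] args) {
        List<Locale> supported = Arrays.asList(Locale.ENGLISH, Locale.FRENCH);
        Locale chosen = choose("fr-CH, fr;q=0.9, en;q=0.8", supported, Locale.ENGLISH);
        System.out.println(chosen.toLanguageTag()); // fr
    }
}
```

Locale.lookup progressively truncates each range, so a request for fr-CH still matches a site that only offers plain fr.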

Geolocation Hints

The last signal that actually provides information about the user is the geographic location from which the user is accessing your content. Although imperfect and imprecise, geography can provide a hint to your customer’s language preference. It’s definitely not the best indicator because multiple languages can be spoken in any geographic location. In a pinch, though, you may be able to provide a language selection tool that provides a list of the most prominent languages spoken in a specific area of the world.

Default Application Language

Finally, when all else fails and there have been no other indicators, you can provide the UI in the default language of the application. If your company is in Germany, maybe the default is German. If it’s the U.S., your default language is most likely English…or maybe even Spanish. You have to display the application in some language, and the default at this point is your last option.

In Summary

To summarize, a web application can serve a global audience. In doing so, it may accommodate customers in a variety of languages. Your application’s UI language can be selected based on numerous signals from the user. Those signals are important data points to consider when choosing the language to present. Using the signals described in this article, you’ll be able to consider some of the more important language preference indicators. Follow the prioritization I’ve outlined here, and you’ll make the right language choice most of the time…until you don’t. And there will be times when you don’t make the right choice from all these signals. When that happens, and it will happen, you have to give your users some way to correct the choice. Take a look at my previous blog entry about language selection widgets for help with that.



Picking a scripting language

I’ve been working with Java for a dozen years now, actually more. I don’t really want to learn another language, or forget one. I’ve already forgotten Perl too many times. But the problem is that I actually do need to learn another language. Java just doesn’t do everything for me.

For example, when I need to process a huge log file and write some data to another file, I don’t really want the overhead of writing in Java. What I want is to scribble something out and run it. I might keep the script around, but I might throw it away too. I need another tool. Perl once did this for me, long ago. Then for whatever reason, I didn’t need it anymore. Now I need it again, but remembering how many times I’ve forgotten Perl, I’m thinking maybe there’s a better language. Maybe there’s something that I can actually remember from week to week as my infrequent needs call upon it.

I’ve been thinking about a few language options:

  • Python 
  • Ruby
  • Bash
  • JavaScript

I’ve only read the introduction sections of books about Ruby and Python. Python just irks me with its dependence on whitespace. I’m sure that’s a frequent complaint. For those who overcome that somewhat petty problem, the language seems to satisfy. But there’s something about those functions and methods with __something__ names surrounded by underscore characters. Come on, what’s up with that? The things that really do appeal to me about Python are the general ideas that explicitness is better than obscurity, that one common way is better than a dozen equally flexible ways, and that there is a best way to do something. Those ideas are comfortable and appealing.

Ruby is fully object-oriented, and it actually does read nicely. I’m also interested in multiple spoken languages, character sets, etc., and I’m not sure whether Ruby fully embraces Unicode as its character set. Maybe there are ways to make it work with UTF-8, but I haven’t quite advanced that far.

Bash? Uh no.

Can you believe that I actually considered JavaScript briefly? When run on a JVM with the Rhino implementation, your JavaScript code has full access to the JRE class libraries. Used this way, it really is only a way to script Java calls. For what I want, no “native” JavaScript functions exist to read the underlying file system or to create new files. Without the boilerplate overhead of a full Java application, I suppose I could squeak out some extra productivity. In the end though, it really is just a way to work with Java code. JavaScript might be great within a browser, but on the file system? Hmmm, probably not.

So what are your ideas about a general purpose scripting language? Is Perl still the best choice for system work, moving files around, parsing out some key values and writing them elsewhere? Did you move to Python and finally just accept the annoying whitespace issue? Or is Ruby a good tool for me? What do you know about these? Any suggestions?