John O'Conner

Software internationalization

Wednesday Apr 23, 2008

Encoding URIs in JavaScript

As you pass data from the browser to the application server to the database, opportunities for data loss lurk. I highlighted some of those conversion points earlier, but I neglected a browser issue. The JavaScript layer has its own lossy points of interest. One of those points is the escape function.

The escape function "encodes" a string by replacing non-ASCII characters and some ASCII punctuation symbols with escape sequences of the form %XX, where X is a hex digit. Unicode characters from \u0080 through \u00FF are converted to the %XX form as well. Unicode characters in higher ranges take the form %uXXXX. So, as an example, the name José will take the form Jos%E9.
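If you'd like to see that behavior for yourself, here is a minimal sketch. It assumes an environment that still ships the legacy global escape function (current browsers and Node.js do), and the variable names are only illustrative.

```javascript
// Legacy escape(): characters below U+0100 become %XX,
// characters above that become %uXXXX.
const latin1Name = escape("Jos\u00E9");   // input is "José"
const cjkText = escape("\u65E5\u672C");   // input is "日本"
const asciiPunct = escape("a b&c");       // some ASCII punctuation is escaped too

console.log(latin1Name); // Jos%E9
console.log(cjkText);    // %u65E5%u672C
console.log(asciiPunct); // a%20b%26c
```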

The problem with this is that the escape mechanism is broken if you want to use UTF-8 as your document encoding. If you dynamically compose URL strings with parameters, those parameters will not be escaped correctly. The character é is U+00E9, and its UTF-8 encoding is the two bytes C3 A9, so instead of Jos%E9 that URI component should really be Jos%C3%A9.
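Here is a rough sketch of what goes wrong. It uses JavaScript's own decodeURIComponent as a stand-in for a UTF-8 aware decoder on the server; that is an assumption for illustration only, since a real server might substitute a replacement character rather than raise an error.

```javascript
const escaped = escape("Jos\u00E9");   // Jos%E9, a lone Latin-1 byte
const utf8Expected = "Jos%C3%A9";      // what a UTF-8 receiver expects

// A UTF-8 decoder sees E9 as the start of a multi-byte sequence
// that never completes, so decoding fails.
let decodeError = null;
try {
  decodeURIComponent(escaped);
} catch (err) {
  decodeError = err;
}

const roundTrip = decodeURIComponent(utf8Expected);

console.log(decodeError instanceof URIError); // true
console.log(roundTrip === "Jos\u00E9");       // true
```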

Fortunately, JavaScript has resolved the problem, but the solution means you'll have to use another function. The escape function is deprecated in ECMAScript v 3. Instead, you should use the function encodeURI or encodeURIComponent. These functions convert their argument to the UTF-8 encoding and then %XX encode each resulting byte of the non-ASCII characters. Two forms of the function exist so that you have greater control over whether characters like "?" and "&" are encoded: encodeURI leaves those URI delimiters alone, and encodeURIComponent encodes them. You'll need to check your documentation for details.
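A short sketch of the difference between the two functions follows. The example.com URL and the sample strings are made up for illustration.

```javascript
const value = "Jos\u00E9 & Mar\u00EDa?";   // "José & María?"

// encodeURIComponent: for a single parameter name or value.
// Delimiters such as "&" and "?" are encoded.
const asComponent = encodeURIComponent(value);

// encodeURI: for a complete URI. Delimiters are left intact.
const asWholeUri = encodeURI("http://example.com/search?name=Jos\u00E9&lang=es");

console.log(asComponent); // Jos%C3%A9%20%26%20Mar%C3%ADa%3F
console.log(asWholeUri);  // http://example.com/search?name=Jos%C3%A9&lang=es
```

As a rule of thumb, reach for encodeURIComponent when you are encoding one piece of a URI, and encodeURI only when the string is already a complete URI.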

What's this mean for you? Maybe nothing if you're hopelessly attached to ISO-8859-1. However, if you're trying to reach a global market with your product, chances are very good that you've decided to use UTF-8 for your character set encoding. That's an excellent choice, but you'll have to manage the conversion points. In a nutshell, that simply means that you'll need to use UTF-8 from front to back consistently.

Part of managing those conversion points is consistently providing well-formed URIs to your application server. If you use JavaScript to manipulate data or to create dynamic URIs in your application, make sure you toss aside that deprecated escape function. Take a look at encodeURI and encodeURIComponent instead.
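If you build dynamic URIs by hand, one way to keep them well-formed is to funnel every parameter through encodeURIComponent. The buildQueryUri helper below is hypothetical, not part of any library, and the URL is a placeholder; treat it as a sketch rather than a finished utility.

```javascript
// Hypothetical helper: builds a query URI and encodes every
// parameter name and value as UTF-8 percent-escapes.
function buildQueryUri(base, params) {
  const query = Object.entries(params)
    .map(([name, value]) =>
      encodeURIComponent(name) + "=" + encodeURIComponent(String(value)))
    .join("&");
  return query ? base + "?" + query : base;
}

const uri = buildQueryUri("http://example.com/people", {
  name: "Jos\u00E9",        // "José"
  city: "S\u00E3o Paulo",   // "São Paulo"
});

console.log(uri);
// http://example.com/people?name=Jos%C3%A9&city=S%C3%A3o%20Paulo
```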

Saturday Apr 12, 2008

Migrating from Latin-1 to UTF-8

You'd think this sort of problem would be resolved by now, but it's not. It's still almost impossible to quickly and easily migrate an application from the all-too-common default Latin-1 character set encoding to UTF-8. The problem isn't that UTF-8 can't handle the conversion. No, that's definitely not it. UTF-8 can represent any Latin-1 character and much, much more. The problem is that the Latin-1 charset is so deeply ingrained as the default in every software interface that you end up with many faulty conversion points. A conversion point is a handoff point between one software component and another, a place where character encodings matter and where faulty conversions are way too common.

Here's an example: a simple web application that stores names and addresses in a database. Chances are, if you haven't done anything explicit to change this, the web page itself will have no charset encoding associated with it. And neither will your application server. And neither will your database. And without explicit settings, many applications use Latin-1 as the default character set. So, you'll be able to enter, store, retrieve, and display common Western European names, but you won't be able to handle Russian or Japanese or Chinese or, well, you get the idea.

So let's imagine you decide to convert from Latin-1 to UTF-8 so that you open up your application to the rest of the world's languages and scripts. What does that mean? What must you do? How do you start?

Here are some of the charset conversion points you'll need to resolve as you migrate through this problem:

  1. database tables
  2. database connections
  3. application and/or web server frameworks
  4. web page
  5. form encodings
  6. JavaScript or other browser scripts

To help you get started, I've discussed the first 4 conversion points in the article Character Conversions from Browser to Database. Go ahead, take a look. But come back here to let me know what you think.

I'll talk about some of the JavaScript issues in an upcoming blog post.

Thursday Apr 10, 2008

Updating timezone data in older Java VMs

Wouldn't it be nice if you could install the latest version of the JDK or JRE in your production environment? But maybe you just can't do that because of your company policy, testing cycles, or adoption process. Unfortunately, whether you can update or not, things change. Some things affect your application whether you want them to or not. For example, as timezone data changes, software must change to keep pace. Wouldn't it be great if you could update just the timezone data in your existing VM? That might be easier to get approved than replacing your complete JDK/JRE.

The Timezone Updater Tool allows you to update timezone data in older JDK/JREs. Using the updater tool, you're able to update the data without updating the entire JDK/JRE. You can learn more about this product from the online article Timezone Updater Tool.

Tuesday Apr 08, 2008

Basic Definitions for a Unicode Discussion

If you want to communicate, defining confusing terms right up front is always a good first step. So I'll try to define some Unicode terms:

Character: The smallest unit of meaning in a written language. This unit typically has a common shape and meaning, although specific shapes can vary quite dramatically. Specific shapes are more commonly called glyphs.
Character Set: An unordered collection of characters.
Coded Character Set: An ordered character set in which each character has an assigned integer value.
Code Point: The integer value of a character within a coded character set.
Character Encoding: A mapping of code points to a series of bytes.
Code Unit: The smallest fixed-size unit of an encoded character: a single octet (byte) in UTF-8, or a 16-bit value in UTF-16.
Charset: Often used as a synonym for Coded Character Set.
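To make a few of these terms concrete, here is a small sketch that looks at code points and their UTF-8 encoded bytes. It assumes a runtime that provides the standard TextEncoder global (current browsers and Node.js do); the variable names are illustrative.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Code point: the integer assigned to the character.
const eAcute = "\u00E9";                          // "é"
const eAcuteCodePoint = eAcute.codePointAt(0);    // 0xE9

// Character encoding: the bytes that represent that code point in UTF-8.
const eAcuteBytes = Array.from(encoder.encode(eAcute));

// A code point outside the Basic Multilingual Plane needs four UTF-8 bytes.
const grin = "\u{1F600}";
const grinCodePoint = grin.codePointAt(0);        // 0x1F600
const grinBytes = Array.from(encoder.encode(grin));

const hex = (bytes) => bytes.map((b) => b.toString(16).toUpperCase()).join(" ");

console.log(eAcuteCodePoint.toString(16), hex(eAcuteBytes)); // e9 C3 A9
console.log(grinCodePoint.toString(16), hex(grinBytes));     // 1f600 F0 9F 98 80
```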

You can always see more terms by visiting the Unicode Glossary.

Unicode 5.1 released this week

The Unicode Consortium released Unicode 5.1 this week. The release now contains more than 100,000 characters, adds Indic and Southeast Asian scripts, and enables ideographic variation sequences used in Japanese, Chinese, and Korean text. Learn more about these additions by reading the Unicode 5.1 Press Release.

