These characters aren’t exotic!

Recently I had the opportunity to sign up for health benefits on a third-party site that manages these things for my employer. Sites that collect data often limit the set of characters that you may use in each field. That’s reasonable for numeric fields, date fields, etc. After all, you don’t want invalid data in a field, and you’d like to help users enter correct data wherever possible.

However, I think it’s unreasonable to limit characters that are legitimately used in a field type. For example, these characters show up all the time within perfectly valid names:

  • APOSTROPHE '
  • HYPHEN -
  • ACUTE ACCENT ´
  • DIAERESIS ¨

Come on…in 2010 these are not exotic characters. They show up in all kinds of ordinary, common names…like O’Conner, for example! In the figure below, the data collection form rejects the apostrophe, even though it’s part of my name.
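Here is a minimal sketch of a more forgiving name check, assuming a Java back end. The class name `NameValidator`, the method `isPlausibleName`, and the exact set of allowed separators are my own illustrative choices, not anything the benefits site uses. The idea is to accept letters and combining marks from any script, plus apostrophes, hyphens, periods, and spaces between them.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class NameValidator {
    // Letters (\p{L}) and combining marks (\p{M}) from any script, separated by
    // spaces, apostrophes (U+0027 or U+2019), periods, or hyphens.
    // The separator set is an assumption for illustration; adjust it to your data.
    private static final Pattern NAME = Pattern.compile(
            "[\\p{L}\\p{M}]+(?:[ '\\u2019.-]+[\\p{L}\\p{M}]+)*\\.?");

    static boolean isPlausibleName(String input) {
        if (input == null) {
            return false;
        }
        // Normalize to NFC so "e" + COMBINING ACUTE ACCENT is stored like "é".
        String name = Normalizer.normalize(input.trim(), Normalizer.Form.NFC);
        return !name.isEmpty() && NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleName("O'Conner"));        // true
        System.out.println(isPlausibleName("O\u2019Conner"));   // true
        System.out.println(isPlausibleName("Zo\u00EB"));        // true
        System.out.println(isPlausibleName("Jean-Luc"));        // true
        System.out.println(isPlausibleName("12345"));           // false
    }
}
```

A check like this only says that a name looks plausible. It is not a substitute for escaping or parameterized queries when the value reaches the database.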

Unfortunately this is all too common. Do you have a problematic name? Share it with me…what name causes you grief in online forms?

IUC 34 Submission

The submission deadline for technical sessions and tutorials for the upcoming Unicode Conference has been extended. You now have until Friday to procrastinate.

I’ve submitted my proposal, a technical session. Here’s the somewhat offbeat proposed session:

Character Conversions from Browser to Database…and Back Again

As characters travel through a typical web-based application, they must cross several boundaries and borders. These include the following:

  • browser processing and JavaScript
  • browser request creation
  • middle-tier request processing
  • database storage

Along the way, many forces conspire to transform and manipulate your noble, beautiful characters into unrecognizable, mutant forms. What causes those conversions and transformations? How do you combat destructive conversions?

This session takes you on a journey with some great characters on their travels from browser to database…and back again. Some make it back whole, and others simply never survive…
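To give a flavor of one such destructive conversion, here is a small sketch of the classic case: text encoded as UTF-8 at one boundary and decoded as ISO-8859-1 (Latin-1) at the next. The class and method names are hypothetical, and this is only an illustration of the effect, not material from the session itself.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode with one charset, decode with another: the classic boundary bug.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage. This works only because ISO-8859-1 maps every byte
    // to exactly one char, so no information was lost along the way.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Jos\u00E9";                  // "José"
        String garbled = misdecode(name);           // é (C3 A9) becomes two chars
        System.out.println(garbled.equals("Jos\u00C3\u00A9")); // true
        System.out.println(repair(garbled).equals(name));      // true
    }
}
```

Not every wrong conversion can be reversed this way. If a step replaces unknown bytes with "?" or U+FFFD, the original character is gone for good.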

Writing UTF-8 CSV Files for Excel

Yesterday a coworker complained that Excel wasn’t displaying a CSV (comma separated values) file correctly. Our application allows the user to send a report via email. The application provides the report as a CSV file. Because the report can contain multilingual text, we’ve decided to encode it in UTF-8. Unfortunately, when users click on the file to display it, usually in Excel, all of the multi-byte encoded characters display incorrectly.

The problem was immediately clear to me…Excel was opening the UTF-8 encoded files, but it was incorrectly identifying them as Latin-1 encoded files. In the absence of any charset identification, Excel must guess about a file’s content encoding. In our environment, many host PCs use en_US locales with Latin-1 as the typical charset. Excel uses that default to read and display CSV files.

My solution to the problem was to use the byte order mark (BOM) to identify the CSV file as a Unicode file. I instructed my colleague to prepend the U+FEFF character to the file. The Java application that writes the file uses a FileWriter that encodes to UTF-8 to create the CSV file. It was simple to just output the BOM as the first character in the file.
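The fix can be sketched roughly as follows. This is not our application's actual code: the class `CsvBomWriter` and its methods are hypothetical, and I use an `OutputStreamWriter` with an explicit charset here because a plain `FileWriter` uses the platform default encoding unless it is given a charset (the charset constructors arrived in Java 11).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class CsvBomWriter {
    // Writes U+FEFF first; the UTF-8 encoder turns it into the bytes EF BB BF.
    static void writeCsv(OutputStream out, String csv) throws IOException {
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write('\uFEFF');
            writer.write(csv);
        }
    }

    // Convenience wrapper for in-memory use, for example an email attachment.
    static byte[] csvBytes(String csv) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeCsv(buffer, csv);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = csvBytes("name,city\nJos\u00E9,M\u00FCnchen\n");
        // Show the first three bytes, which should be the UTF-8 BOM.
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF); // EF BB BF
    }
}
```

To write to disk instead, pass a `FileOutputStream` to `writeCsv`. Java's UTF-8 encoder does not emit a BOM on its own, so the explicit write is what makes the difference.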

Now when our customers double-click on these files, Excel opens the file, notices the BOM, and automatically selects UTF-8 as the file’s charset encoding. Now Excel displays the previously mangled characters correctly. And I was able to help resolve a problem with an easy solution.

Maybe you can give your applications a hint about plain text files as well. Writing the BOM to your file can help Unicode-enabled applications know how to decode your Unicode files.

Unicode Haiku #3

And sure to be someone’s favorite, submitted by Jon Hanna at IUC 33:

A harsh lonely night,
my Private Use Area
has no assignments

Unicode Haiku #2

Submitted at IUC 33 by Ken Lunde:

Beyond BMP
So many ideographs
So many Extensions

Unicode Haiku #1

Submitted at the recent International Unicode Conference by Mark Crispin:

Unicode has planes
But not a power of two
Strangely seventeen

An Internationalization and Unicode Web Service?

Chances are that your favorite development platform already has internationalization support built into it. And probably Unicode charset support is there too. For example, the Java and .NET platforms contain lots of APIs for formatting dates, numbers, etc. in locale-sensitive ways. And you can get Unicode character data easily too. Unfortunately, the Unicode standard changes periodically, typically much more frequently than your development platform. Your applications probably do NOT have full awareness of the latest Unicode database properties. Maybe that’s not important to you? But maybe it is? What do you do in that case? What can you do? You wait…and wait a little more. You wait until your platform’s authors update their Unicode support. That can be a very long time.
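As a rough illustration of what "platform Unicode data" means in practice, here is a small sketch that asks the Java runtime what it knows about a code point. The class `UnicodeProperties` and the method `describe` are hypothetical names; `Character.getName` and `Character.UnicodeScript` assume Java 7 or later. The answers come from whatever Unicode version your runtime ships with, which is exactly the limitation described above.

```java
class UnicodeProperties {
    // Reports what this runtime's built-in Unicode tables say about a code point.
    static String describe(int codePoint) {
        if (!Character.isDefined(codePoint)) {
            // Either truly unassigned, or assigned in a newer Unicode version
            // than the one this runtime knows about.
            return String.format("U+%04X is unassigned in this runtime", codePoint);
        }
        return String.format("U+%04X %s script=%s",
                codePoint,
                Character.getName(codePoint),
                Character.UnicodeScript.of(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0041)); // U+0041 LATIN CAPITAL LETTER A script=LATIN
        System.out.println(describe(0x00E9)); // U+00E9 LATIN SMALL LETTER E WITH ACUTE script=LATIN
    }
}
```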

Do you think there’s a need for a web service to provide up-to-date Unicode character property information? If there were such a thing, would you use it?

Maybe Unicode character data isn’t exciting enough for you. So how about this? What if a web service existed that could provide human-readable dates, times, and numbers in locale-sensitive formats? Sure, those APIs exist in Java and .NET…but again, those platforms aren’t always up to date with the latest locale data and formats. What if your application could use a web service to retrieve those formats?
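For comparison, this is roughly what the built-in route looks like in Java today. The class `LocaleFormats` and its method are hypothetical names for this sketch. The output depends on the locale data bundled with the runtime, so a hypothetical web service would be replacing the data behind calls like this one.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocaleFormats {
    // Formats a number using the locale data that ships with the runtime.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.891;
        System.out.println(formatNumber(value, Locale.US));      // 1,234,567.891
        System.out.println(formatNumber(value, Locale.GERMANY)); // 1.234.567,891
    }
}
```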

Oh, and we haven’t mentioned calendar support yet. With Java, you get Gregorian, a Japanese Imperial calendar, a Thai one…and maybe that’s it. Those are important of course. But what if you need something more exotic? A Hebrew or Islamic or…any number of other calendars in use today and in the past. What if a web service existed that could provide that information to your application? Would you use it?
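As a point of reference, here is a sketch of calendar conversion using the `java.time.chrono` package. Note the assumption: that package arrived in Java 8, so it is newer than the platform described above. It adds Hijrah (Islamic) and Minguo chronologies, but the JDK still does not ship a Hebrew one. The class `CalendarYears` and its methods are hypothetical names.

```java
import java.time.LocalDate;
import java.time.chrono.HijrahDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

class CalendarYears {
    // Thai Buddhist proleptic year (the ISO year plus 543).
    static int thaiBuddhistYear(LocalDate iso) {
        return ThaiBuddhistDate.from(iso).get(ChronoField.YEAR);
    }

    // Year within the current Japanese imperial era.
    static int japaneseYearOfEra(LocalDate iso) {
        return JapaneseDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    // Year in the Hijrah calendar (the JDK default variant is Umm al-Qura).
    static int hijrahYear(LocalDate iso) {
        return HijrahDate.from(iso).get(ChronoField.YEAR);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2010, 6, 1);
        System.out.println(thaiBuddhistYear(date));  // 2553
        System.out.println(japaneseYearOfEra(date)); // 22 (Heisei 22)
        System.out.println(hijrahYear(date));        // 1431
    }
}
```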

No. I don’t have this web service today. But I wonder if others see a need for it. If so, drop me a note, let me know your ideas. If you don’t think it’s needed, let me know that too. I’m interested in arguments for and against such a service.

UTF-8 charset encoding update

The UTF-8 encoding is easy to abuse in some ways. Or rather, sometimes people use it in unexpected ways.

Recently the Java platform received an update to reject one kind of malformed UTF-8 byte sequence, called “non-shortest form.” You can learn more about this fix and its implications for you in the article “Overhauling the Java UTF-8 Charset.”
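To make "non-shortest form" concrete: the slash character U+002F needs only the single byte 2F in UTF-8, but the two-byte sequence C0 AF carries the same bits in an overlong form. A minimal sketch of strict checking follows, assuming a Java runtime that includes the fix; the class `StrictUtf8` and the method `isWellFormed` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Returns true only if the bytes decode as well-formed UTF-8.
    static boolean isWellFormed(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] shortest = {0x2F};                     // "/" in shortest form
        byte[] overlong = {(byte) 0xC0, (byte) 0xAF}; // "/" in non-shortest form
        System.out.println(isWellFormed(shortest)); // true
        System.out.println(isWellFormed(overlong)); // false
    }
}
```

Overlong forms matter for security because a filter that looks for the byte 2F will not see a slash hidden inside C0 AF, while a lenient decoder further down the line would still produce one.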

Unicode support doesn’t mean your application is internationalized

Over the years, I’ve helped many organizations internationalize their software products. One of the most common misunderstandings is how Unicode will help their product. Customers sometimes mistakenly believe that Unicode support will be sufficient to internationalize their products. Sometimes they believe that Unicode “support” is a single, yes-no, on-off ability, when instead Unicode support is typically implemented in various stages and levels.

Unicode is a character encoding standard. It’s a big standard, with lots of nuances. Your products can implement “Unicode support” in many ways. The result is that those products will be able to manipulate, process, store, and perhaps even display the world’s scripts in a variety of ways BUT not usually in all ways. Your product’s ability to support Unicode is not a binary ability; instead, you should understand that products can have “Unicode support” at a variety of levels. In the simplest case, your product might only store and retrieve Unicode characters correctly. At a more sophisticated level, your product may be able to sort, search, or display Unicode characters. Again, Unicode “support” in a product cannot be evaluated by a single check-box or yes-no answer. Typically, products support Unicode in some ways but not in others.
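Sorting is a handy way to see the difference between levels. In the sketch below, both methods store and retrieve the strings perfectly well, but only the second sorts them the way a reader would expect. The class `SortLevels` and its methods are hypothetical names, and `Collator` here stands in for whatever collation support your platform offers.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SortLevels {
    // Level 1: compares raw UTF-16 code unit values.
    static List<String> binarySort(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(null); // null means natural (String.compareTo) ordering
        return copy;
    }

    // Level 2: compares using locale-sensitive collation rules.
    static List<String> collatedSort(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("zebra", "\u00E9clair", "apple");
        System.out.println(binarySort(words));                   // [apple, zebra, éclair]
        System.out.println(collatedSort(words, Locale.ENGLISH)); // [apple, éclair, zebra]
    }
}
```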

Implementing even the most sophisticated levels of Unicode support doesn’t mean your product is internationalized. Internationalization is the process of preparing a software code base to be easily localized. Internationalization creates a product that has no particular bias towards a single culture or language. That product can be localized for a specific culture. Unicode support can be a key component of an internationalization effort, but it is only one component. Like Unicode support, your internationalization support will have different levels of sophistication and ability.

To summarize, products can support Unicode in a variety of ways. Supporting Unicode does not usually mean that your product has the ability to perform every possible function on Unicode characters. Instead, “support” usually means that you can do some things with Unicode but probably not others. Additionally, supporting Unicode isn’t the only step to internationalize your products. Unicode is only one step, an important step. Internationalization is the process of creating a product that is easier to localize, one that has cultural biases removed so that a specific culture or locale can be supported more easily after localization. You might use Unicode as a step in your internationalization efforts, but Unicode itself doesn’t create an internationalized product.

Contact me or leave a comment if you have questions about how Unicode can help your product. If I can help, I will. If I can’t, I probably know someone who can.

What is Unicode?

Unicode is a character set standard. This particular standard assigns a unique number to every character used around the globe, regardless of written and spoken language, computing platform, or application. Unicode includes the characters from other, more limited character sets. Prior to Unicode, smaller character sets assigned character values differently from each other. Unicode unifies all other character sets; every character gets its own unique value.
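Those unique numbers are called code points, usually written as U+ followed by hexadecimal digits. Here is a small sketch that lists the code points in a string; the class `CodePoints` and the method `toUPlus` are hypothetical names, and `String.codePoints()` assumes Java 8 or later.

```java
import java.util.stream.Collectors;

class CodePoints {
    // Lists each character's Unicode code point in the usual U+XXXX notation.
    static String toUPlus(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // "A", "é", a CJK ideograph, and an emoji stored as a surrogate pair.
        String sample = "A\u00E9\u4E2D\uD83D\uDE00";
        System.out.println(toUPlus(sample)); // U+0041 U+00E9 U+4E2D U+1F600
        System.out.println(sample.length()); // 5, because the emoji uses two chars
    }
}
```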

You can get more information about Unicode from the Unicode Home Page.
