Monthly Archives: September 2011

Using Combining Sequences for Numbers

Circled number fifty

Today I just happened to be looking through some of the precomposed Unicode circled numbers, numbers like ①, ②, ③, and so on. Just in case your system, doesn’t support the fonts for these characters, here’s an image that shows what I mean:

Precomp circled numbers

I wasn’t all that surprised to see these CIRCLED DIGIT ZERO, CIRCLED DIGIT ONE, CIRCLED DIGIT TWO, through CIRCLED DIGIT NINE characters. However, I was surprised to see precomposed characters for other numbers, numbers all the way up to 50:

Circled number fifty CIRCLED NUMBER FIFTY

Why stop at 50? Well, obviously Unicode can’t encode every number. Although Unicode doesn’t define a CIRCLED NUMBER FIFTY ONE, how can I create this using combining sequences? For example, for the above single digits, I have a couple options for displaying these:

  1. a precomposed character like U+2460 — ①
  2. a combining sequence like U+0031 U+20DD —   1⃝  the digit 1 followed by the COMBINING ENCLOSING CIRCLE character     ⃝

Again, if you can’t see that character sequence, here’s the image of U+0031 U+20DD:

Combining one

 

Alright, so there we have a great example of using two Unicode code points together to form a single visual glyph on-screen. But how do I get the COMBINING ENCLOSING CIRCLE character to combine over two previous digits? What if there were not a precomposed CIRCLED NUMBER FIFTY ONE? There’s isn’t one, by the way. And yet I want to enclose two or more arbitrary digits with the COMBINING ENCLOSING CIRCLE character. Hmmm….

Sigh…. I have to admit that I don’t actually know how to do this. I suspect that I can use some of Unicode’s control characters like START OF GUARDED AREA and END OF GUARDED AREA or …. I don’t really know.

When I find out, I’ll repost. If you know, please share!

Encoding Unicode Characters When UTF-8 Is Not An Option

The other day I suggested that you use UTF-8 to encode your Java source code files. I still think that’s a best practice. If you can do that, you owe it to yourself to follow that advice.

But what if you can’t store text as UTF-8? Perhaps your repository won’t allow it. Or maybe you simply can’t standardize on UTF-8 across the groups. What then? In that case, you should use ASCII to encode your text files. It’s not an optimal solution. However, I can help you get more interesting Unicode characters into your Java source files despite the actual file encoding limitation.

The trick is to use the native2ascii tool to convert your non-ASCII unicode characters to a \uXXXX encoding. After editing and creating a file contain UTF-8 text like this:

String interestingText = "家族";

You would instead run the native2ascii tool on the file to produce an ASCII file that encodes the non-ASCII characters in  \u-encoded notation like this:

String interestingText="\u5BB6\u65CF";

In your compiled code, the result is the same. Given the correct font, the characters will display properly. U+5BB6 and U+65CF are the code points for “家族”. Using this type of \u-encoding, we’ve solved the problem of getting the non-ASCII characters into your text file and repository. Simply save the converted, \u-encoded file instead of the original, non-ASCII file.

The native2ascii tool is part of your Java Development Kit (JDK). You will use it like this:

native2ascii -encoding UTF-8 <inputfile> <outputfile>

There you have it…an option for getting Unicode characters into your Java files without actually using UTF-8 encoding.

 

Best practice: Use UTF-8 as your source code encoding

Logo60s2

Software engineering teams have become more distributed in the last few years. It’s not uncommon to have programmers in multiple countries, maybe a team in Belarus and others in Japan and in the U.S. Each of these teams most likely speaks different languages, and most likely their host systems use different character encodings by default. That means that everyone’s source code editor creates files in different encodings too. You can imagine how mixed up and munged a shared source code repository might become when teams save, edit and re-save source files in multiple charset encodings. It happens, and it has happened to me when working with remote teams.

Here’s an example, you create a test file containing ASCII text. Overnight, your Japanese colleagues edit and save the file with a new test and add the following line in it:

String example = "Fight 文字化け!";

They save the file and submit it to the repository using Shift-JIS or some other common legacy encoding. You pick up the file the next day, add a couple lines to it, save it, and BAM! Data loss. Your editor creates garbage characters because it attempts to save the file in the ISO-8859-1 encoding. Instead of the correct Japanese text from above, your file now contains the text “Fight ?????” Not cool, not fun. And you’ve most likely broken the test as well.

How can you avoid these charset mismatches? The answer is to use a common charset across all your teams. The answer is to use Unicode, and more specifically, to use UTF-8. The reason is simple. I won’t try hard to defend this. It’s just seems obvious. Unicode is a superset of all other commonly used character sets. UTF-8, a specific encoding of Unicode, is backward compatible with ASCII character encoding, and all programming language source keywords and syntax (that I know) is composed of ASCII text. UTF-8 as a common charset encoding will allow all of your teams to share files, use characters that make sense for their tests or other source files, and never lose data again because of charset encoding mismatches.

If your editor has a setting for file encodings, use it and choose UTF-8. Train your team to use it too. Set up your ANT scripts and other build tools to use the UTF-8 encoding for compilations. You might have to explicitly tell your java compiler that source files are in UTF-8, but this is worth making the change. By the way, the javac command line argument you need is simply “-encoding UTF-8″.

Recently I was using NetBeans 7 in a project and discovered that it defaults to UTF-8. Nice! I was pleasantly surprised.

Regardless of your editor, look into this. Find out how to set your file encodings to UTF-8. You’ll definitely benefit from this in a distributed team environment in which people use different encodings. Standardize on this and make it part of your team’s best practices.

“Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.”

User error using Java 7 Locale

Yesterday I pushed a blog entry about my experience with the Locale class in Java 7. As I experimented with the class, I discovered a new category enumeration:

  • Locale.Category.DISPLAY
  • Locale.Category.FORMAT

I learned some more about the Locale.setDefault methods that indicate a user error on my part. I stated that ResourceBundle is tied to the DISPLAY default locale. I made this assumption because all the Format subclasses do use the FORMAT category’s default locale setting. I inferred that ResourceBundle would do something similar — that it would use the DISPLAY default locale. It just isn’t true.

In some ways Java 7 has introduced the concept of 3 separate Locale categories: the two above and a system-wide category. It turns out that Locale.setDefault(aLocale) will reset both the DISPLAY and FORMAT defaults to the system-wide locale. However, a call to Locale.setDefault(Category.DISPLAY, anotherLocale) does not affect the system-wide locale.

It is interesting to note that various Format subclasses will use the FORMAT default locale if you don’t explicitly use a locale param in the their getInstance methods. Surprisingly (to me however), the ResourceBundle.getBundle method does NOT use the DISPLAY locale when you don’t provide an explicit locale parameter; it uses the system-wide default locale that was set at startup or the locale you’ve set with Locale.setDefault(aLocale).

So there you have it. My mistake. Hope this helps clarify.

Now that I’ve described the current behavior, I do have an opinion on this. For what it’s worth, my opinion is this: if the system default locale is used by ResourceBundle.getBundle, then other locale-sensitive classes should also use the system-wide default when you don’t explicitly provide a locale in their creation methods. The difference in how locale-sensitive classes use a default locale is confusing in its current Java 7 state. All methods that use a default locale instance should probably use the system-wide locale default.

Oh, and one other thing… If you’re going to have Locale “Categories”, you might as well introduce another: Locale.Category.ALL or Locale.Category.SYSTEM. When you call Locale.setDefault(ALL, aLocale), I would hope that the call would set the default for all other categories that exist now and in the future. Yes, I do realize that a call to Locale.setDefault(aLocale) already resets all the other category defaults, but for consistency’s sake, we need a Category.ALL. My argument for this is simple: it’s consistent and as a user I just expected to see it. When I didn’t, I became confused. I wrote some demo code, and it didn’t work as expected. Yes, it’s user error, but I just know some of you will make the same error if you haven’t already read about the differences here.

 

Bugs in Java 7 Locale or ResourceBundle?

While working on a chapter in an upcoming APress book, I was experimenting with Java 7’s Locale and ResourceBundle classes. Java 7 introduces two new Locale categories: DISPLAY and FORMAT. You can set the default locale for localizable user interface resources independently from the default locale for data Format subclasses. For example, you supposedly can set up a DISPLAY locale to display Spanish (es) resources for the user interface text but use American English (en-US) for date formats.

This does seem to work correctly the first time you load a ResourceBundle, but subsequent calls to getBundle seem to get stuck with previous DISPLAY locale settings. For example, take a look at the following code:


public void setCategoryDefaultLocale(Locale displayLocale, Locale formatLocale) {
    Locale.setDefault(Locale.Category.DISPLAY, displayLocale);
    Locale.setDefault(Locale.Category.FORMAT, formatLocale);   
}

public void demoDefaultLocaleSettings() {
    DateFormat df = DateFormat.getDateTimeInstance(DateFormat.SHORT,
        DateFormat.SHORT);
    ResourceBundle resource = ResourceBundle.getBundle("mypackage.resource.SimpleResources");
    String greeting = resource.getString("GOOD_MORNING");
    String date = df.format(NOW);
    System.out.printf("DISPLAY LOCALE: %s\n",
        Locale.getDefault(Locale.Category.DISPLAY));
    System.out.printf("FORMAT LOCALE:  %s\n",
        Locale.getDefault(Locale.Category.FORMAT));
    System.out.printf("%s, %s\n\n", greeting, date );
    ResourceBundle.clearCache();   
}

Now I call the two methods above in succession:


setCategoryDefaultLocale(Locale.forLanguageTag("es-MX"), Locale.US);
demoDefaultLocaleSettings();
setCategoryDefaultLocale(Locale.US, Locale.forLanguageTag("es-MX"));
demoDefaultLocaleSettings();

In the following output, notice that the first call to demoDefaultLocaleSettings does actually grab the es-MX resources for the DISPLAY locale, and it formats the NOW date using the en-US FORMAT locale. That’s exactly what I’d expect. However, the subsequent call (even after flushing the ResourceBundle cache), still loads the es-MX resources despite the explicit request for en-US resources. I expected that ResourceBundle would have loaded and retrieved “Good morning!” after the second call to setCategoryDefaultLocale.


DISPLAY LOCALE: es_MX
FORMAT LOCALE:  en_US
¡Buenos días!, 9/18/11 1:59 AM

DISPLAY LOCALE: en_US
FORMAT LOCALE:  es_MX
¡Buenos días!, 18/09/11 01:59 AM

I suspect that the real problem isn’t in the Locale.setDefault methods. Instead, I’m going to blame ResourceBundle for this until I can prove otherwise.

If you’ve seen this in the past, let me know. I’m coming up empty for now trying to resolve why I don’t get U.S. English in the second call. Let me know if you know anything about this! Maybe a bug? Maybe user error?

Java 7 Locale Changes

I admit that I haven’t been particularly active in the i18n or Java standards bodies in the last few years. Pardon me, but I had a wild rumpus in a startup for almost 3 years, then joined Yahoo for a couple years, and BAM! … during that time, the Unicode Consortium added emoji, and the Oracle Java folks overhauled the once relatively simple java.util.Locale class among other things. That’ll teach me to turn my back on those folks.

You can almost forgive the Unicode people for their addition of emoji… I just know that Ken Lunde had something to do with it but haven’t been able to pin the whole issue on him yet. :) [Just kidding people, just kidding Ken!]

BUT more interesting to me right now…have you seen java.util.Locale lately. What the heck happened there? Java 7 introduces Locale.Builder, Locale.Category, and a Locale.forLanguageTag method. With Java 7’s support of BCP 47 (the best practices for language coding), the Locale got a beefy new Builder inner class to help with…well, to help with building BCP 47 compliant locales. More on this in a different post. With the BCP 47 support, I suppose you might need the forLanguageTag method too. Now here’s what stumps me though — that Category class!

As far as I can tell, this category is to help you differentiate a locale instance for two different purposes: user interface language and locale-sensitive data formats. The new categories are the following:

  • DISPLAY
  • FORMAT

Do we really need a separate Category class to make that explicit?

Locale has another new method: setDefault(Locale.Category, Locale newLocale). I can’t wait to play with this a bit more. I suspect that this will set the default resource bundles (DISPLAY) for UI language and the default FORMAT for things like DateFormat, etc. However, I can’t understand why these two categories are actually needed in the platform libraries. Convenience maybe? You could always use two different Locale instances for this in the past, but I suppose this makes that easier? I’ll have to play around with these to find out more.

Come back again in a couple days after I look into this new Category and how it affects internationalization. I’m having fun discovering some of the new functionality and am eager to let you know what I find.

The difference between 1st and 3rd party cookies

The question: What’s a third party cookie? OK, let’s assume I’m an expert at these things, which I’m not, but let’s just assume that I play an expert at these things. Here’s your answer….

Cookies are small pieces of information stored in your browser’s cached files. If you visit a site, say example.com, that site might decide to store some state on your browser — a cookie. The cookie is a 1st party cookie because it is created and sent back and forth between your computer and that site, example.com, in your browser’s url field.

So what’s a 3rd party cookie? Well, often a document will fetch additional pieces of information from other sites, maybe a javascript file or an image, or maybe even entire documents. Anytime the document from example.com imports a file or script or image from a different web site (the 3rd party site), that site can also set cookies. Those cookies are 3rd party cookies.

Let’s look at this a little more closely. Imagine you browse to http://example.com/hello.html. This site might send back a 1st party cookie: FOO=1. The cookie could be stored in the domain of example.com, and only visits to that example.com would prompt the browser to automatically send that cookie value to example.com on subsequent visits. Because of the rules around cookie security, however, your browser would never send an example.com cookie to another site like example2.com.

Now imagine that the example.com/hello.html file has additional links to other sites. Maybe hello.html has an image link that pulls a photo from example2.com. Now example2.com is being called from your example.com/hello.html document. Your browser will not send any example.com cookie values to the referenced example2.com site. However, the example2 site may decide to drop a cookie as well. Since it is not the primary site of your document, which is example.com/hello.html, the example2.com cookie is called a 3rd party cookie.

Hmmm… so here we have examples of 1st party and 3rd party cookies. So, what’s a 2nd party cookie? I don’t actually know the answer to that question. If the site of your document is the 1st party and referenced sites are 3rd party, that really leaves the browser or its user as the 2nd party. Maybe a 2nd party cookie (which I’ve never really heard in any discussion) would simply be a cookie manually created by the user? Hmmmm…. probably not an important point. However, hopefully you know the difference between 1st and 3rd party cookies!

Cookie example