Archive for the ‘Unicode’ Category

Unicode Characters and Alternative Glyphs

February 13th, 2013 joconner

Unicode defines thousands of characters. Some “characters” are surprising, and others are obvious. When I look at the Unicode standard and consider the lengthy debates over whether a character should be included, I can imagine the discussion and rationalization involved. Deciding whether to include a character can be difficult.

One of the more difficult concepts for me to appreciate is the difference between light and dark (or black and white) characters. A real example will help me explain this. Consider the “smiley face” characters U+263A and U+263B:  ☺ and ☻. These characters are named WHITE SMILING FACE and BLACK SMILING FACE respectively.

These are not the only characters that come in white and black options; dozens of others exist. There are even BLACK TELEPHONE and WHITE TELEPHONE characters.

Of course, once these characters go into the standard, they should stay. One shouldn’t remove existing characters. However, a serious question does arise when considering WHITE and BLACK options for a character.

The question I have is this: why isn’t the white and black color variation simply a font variation of the same character? The Unicode standard clearly states that it avoids encoding glyph variations of the same character. That makes a lot of sense. In practice, however, the standard at least appears to do exactly the opposite for many characters. I can only guess that someone on the standards committee made a very good, logical, and well-supported argument for differentiating these characters.

My hope for future versions of the standard is that these kinds of color variations will be avoided. Not having been on the committee when these characters were added, I cannot really complain, and I hope that my comments here don’t come across that way. In the future, however, I’d like the standard to include annotations for these characters that describe why they deserve separate code points. It certainly isn’t clear from the existing characters’ notes, and I’m sure that others are curious about the reasons as well.

Unicode 6.1 Released

February 1st, 2012 joconner

The Unicode Consortium announced the release of Unicode 6.1.0 yesterday. The new version adds characters for additional languages in China, other Asian countries and Africa. This version of the standard introduces 732 new characters.

In addition, the standard adds “labels” for character properties that will supposedly help implementers create regular expressions that are both easier to read and easier to validate. I admit I know little about these labels at the moment, but I will research and report on them in the future if time allows.

One of the oddities of the new version is the inclusion of 200 emoji variants. This is perhaps the only part of the standard that I just don’t understand. Back in the day, when I was more involved in Unicode development, we had a huge effort to unify variants of Chinese characters. We preached that Unicode characters were abstract entities whose glyph renderings were determined by fonts and the style preferences of developers and apps. Now it appears that the Unicode Consortium has changed its position on this, or at least partially. The addition of 200 emoji “variants” just seems unnecessary, but that’s only my opinion, and I admit that I may not know all the issues that informed the consortium’s decision.

We have some examples, straight from the announcement, that show only 4 of the 200 new emoji variants:

[Image: the TENT character rendered in a text style and in a colorful emoji style]

As the image shows, the “TENT” emoji has two variants: a text style and a more colorful, graphical emoji style. The standard defends these variants by saying that they allow implementations to distinguish preferred display styles. I think that is what fonts are for. Personally, I just don’t think the variants are needed, and I think they make things more difficult for applications.
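
If I understand the mechanism correctly, these variants are expressed as variation sequences rather than as separate code points: the base character followed by an invisible variation selector. Here’s a minimal Java sketch, assuming the TENT character (U+26FA) participates in these sequences:

public class EmojiVariants {
    public static void main(String[] args) {
        // U+26FA TENT followed by a variation selector:
        String textStyle  = "\u26FA\uFE0E"; // VS15 (U+FE0E) requests the plain text style
        String emojiStyle = "\u26FA\uFE0F"; // VS16 (U+FE0F) requests the colorful emoji style

        // Both strings contain the same base character; only the requested
        // display style differs, and a font is free to ignore the request.
        System.out.println(textStyle + " " + emojiStyle);
    }
}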

What do you think about variants in general? And what about emoji variants specifically?

Terminology: Unicode Character Encoding

November 13th, 2011 joconner

In a recent post, I described the terms character set, charset, and coded character set. In this post, we’ll take a small step forward and define a few more terms:

  • encoding form
  • code unit
  • encoding scheme

Before going much further, I should point out that you can get all the information in this post from a much more authoritative source, the Unicode Technical Report 17 (UTR 17). UTR 17 describes the Unicode Character Encoding Model and more formally defines all the terms you’ll find here. The added value, if any, of this post is that I’ll attempt to describe these terms in just a few paragraphs instead of several pages. Still, when you’re feeling adventurous and energetic, you might take on UTR 17 itself.

Character Encoding Form

An encoding form is a mapping from code points to sequences of code units. A code unit has a specific width, for example, 8, 16, or 32 bits. Any Unicode code point value can be mapped into any of these forms. One other note about encoding forms: they come in two varieties, fixed width and variable width.

A fixed width encoding form encodes every code point in the same number of code units. UTF-32 is a fixed width encoding form: every code point becomes exactly one 32-bit code unit.

Variable width encoding forms encode code points in 1, 2, or more code units. UTF-8 and UTF-16 are variable width encoding forms. In UTF-8, a character may require from one to four 8-bit code units. In UTF-16, a character requires one or two 16-bit code units.

Code Unit

I’ve already hinted at this definition, but it’s worth repeating. A code unit is a fixed-size integer that occupies a specific number of bits: 8, 16, 32, or even 64 bits on some computer architectures. Code points are mapped to sequences of code units, and a single character (code point) can be mapped to several different code unit representations depending on the encoding form.
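
A small Java sketch makes this concrete. The JDK’s charset converters do the counting for us (UTF-32 isn’t in StandardCharsets, so I look it up by name; it’s available in common JDKs):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodeUnitCounts {
    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I, a code point outside the BMP
        String s = new String(Character.toChars(0x10400));

        int utf8Units  = s.getBytes(StandardCharsets.UTF_8).length;          // 8-bit code units
        int utf16Units = s.getBytes(StandardCharsets.UTF_16BE).length / 2;   // 16-bit code units
        int utf32Units = s.getBytes(Charset.forName("UTF-32BE")).length / 4; // 32-bit code units

        System.out.println("UTF-8:  " + utf8Units  + " code units"); // 4
        System.out.println("UTF-16: " + utf16Units + " code units"); // 2 (a surrogate pair)
        System.out.println("UTF-32: " + utf32Units + " code unit");  // 1
    }
}

The same code point maps to four 8-bit code units, two 16-bit code units, or one 32-bit code unit, depending on the encoding form.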

Character Encoding Scheme

An encoding scheme is a serialization technique that encodes code units into a byte stream. Since UTF-8 is already an 8-bit (byte) oriented encoding form, UTF-8 is also an encoding scheme.

Because of little-endian and big-endian hardware differences, the UTF-32 and UTF-16 encoding forms can be serialized into two different schemes each. The specific scheme flavors for UTF-32 are UTF-32BE and UTF-32LE, big-endian and little-endian respectively. UTF-16 has similar schemes: UTF-16BE and UTF-16LE.
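
You can see the scheme difference in a few lines of Java; the letter “A” is the single 16-bit code unit 0x0041:

import java.nio.charset.StandardCharsets;

public class EncodingSchemes {
    public static void main(String[] args) {
        // The same 16-bit code unit, serialized by two different schemes:
        byte[] be = "A".getBytes(StandardCharsets.UTF_16BE);
        byte[] le = "A".getBytes(StandardCharsets.UTF_16LE);

        for (byte b : be) System.out.printf("%02X ", b); // prints: 00 41
        System.out.println();
        for (byte b : le) System.out.printf("%02X ", b); // prints: 41 00
    }
}

The code unit is the same; only its serialized byte order changes.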

Did that clear anything up? Or did it just confuse things more? Let me know, and I’ll try to clarify.

“Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.”

Not forgotten

November 2nd, 2011 joconner

No, I haven’t forgotten my promise to cover a few more Unicode terms. However, please excuse me while I recover from my recent vacation, which has rendered me useless for a couple of days after my return. Hundreds of emails have gathered in my inbox, and I’m still processing them.

I will be back in a couple more days to describe the following:

  • encoding scheme
  • code unit
  • encoding form

In the process of describing the above, we’ll look at another term, UTF-32. We’ll eventually get to UTF-16 and UTF-8 after that.

Be patient; I’m almost done with my inbox.

Unicode Terminology

October 23rd, 2011 joconner

I am sometimes asked whether Unicode is a 16-bit character set. The answer is not a simple no, but it is no. The question always reminds me how important terminology is, too. Terminology is the focus of this particular post.

At one point long ago, when Unicode was a relative newcomer to the character set stage, it did in fact start out with a character space whose potential character values ranged from 0x0000 through 0xFFFF. At that time, until around 1995, Unicode could have been called a 16-bit character set. That is, each character could be represented with a single 16-bit integer value.

However, starting in 1996, Unicode’s character range expanded. With Unicode 2.0, the character set defined character values in the range 0x0000 through 0x10FFFF. That’s 21 bits of code space, so Unicode can no longer be called a 16-bit character set. With today’s computer architectures, you practically have to say that Unicode is a 32-bit character set. But now we have to be careful how we use these terms.

The rest of this discussion is admittedly tiresome and precise. We have to define some terms to make sure we’re talking about the same things. Bear with me.

The fact is that Unicode is much more than just a character set. It embodies a set of best practices and standards for defining characters, encoding them, naming them, and processing them. In the Unicode Consortium’s own words, Unicode is:

the standard for digital representation of the characters used in writing all of the world’s languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language.

The Unicode standard also defines the Unicode character set. This is a coded character set (CCS). A coded character set assigns an integer value to each of its characters; a character’s integer value is also called its code point. The current Unicode standard allows for code point values all the way up to 0x10FFFF. Often when we refer to Unicode code point values, we use another notation: instead of writing the code point value as a hexadecimal number with the '0x' prefix, we use 'U+'. So, in this alternate notation, to make sure others know that we’re explicitly talking about Unicode code point values, we write U+10FFFF. I’m not picky about this, but it is a noteworthy distinction. Strictly speaking, 0x10FFFF is just a very large hexadecimal number; U+10FFFF is a specific Unicode code point value.
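
Java happens to represent code points as plain int values, which makes the distinction easy to see in code; a small sketch:

public class CodePoints {
    public static void main(String[] args) {
        int codePoint = 0x10FFFF; // the largest possible Unicode code point

        System.out.println(Character.MAX_CODE_POINT == codePoint); // true
        System.out.println(Character.isValidCodePoint(codePoint)); // true
        // A valid code point is not necessarily an assigned character:
        System.out.println(Character.isDefined(codePoint));        // false
    }
}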

So, we’ve established that Unicode is not a 16-bit character set, although it is a character set. Specifically, it is a coded character set, as defined above. Sometimes you’ll hear other terms used as synonyms for a coded character set: character set and charset, for example, although strictly speaking neither implies an assignment of code point values.

An encoding is something else, and it refers to how we serialize a code point for storage or transfer. Those clever people within the Unicode Technical Committee have devised several ways to encode the Unicode (coded) character set, giving us 3 common encodings:

  • UTF-32
  • UTF-16
  • UTF-8

Terms We’ve Learned

Here are the terms we’ve used so far:

  • character set
  • coded character set/charset
  • character encoding

Next Up

Next time, let’s talk about these encodings: UTF-32, UTF-16, and UTF-8.

Using Combining Sequences for Numbers

September 30th, 2011 joconner

Today I happened to be looking through some of the precomposed Unicode circled numbers: ①, ②, ③, and so on. Just in case your system doesn’t have font support for these characters, here’s an image that shows what I mean:

[Image: the precomposed circled number characters]

I wasn’t all that surprised to see the CIRCLED DIGIT ZERO through CIRCLED DIGIT NINE characters. However, I was surprised to see precomposed characters for other numbers, all the way up to 50:

[Image: CIRCLED NUMBER FIFTY]

Why stop at 50? Well, obviously Unicode can’t encode every number. Since Unicode doesn’t define a CIRCLED NUMBER FIFTY ONE, how could I create one using a combining sequence? For the single digits above, I have a couple of options for display:

  1. a precomposed character like U+2460 — ①
  2. a combining sequence like U+0031 U+20DD, 1⃝, the digit 1 followed by the COMBINING ENCLOSING CIRCLE character

Again, if you can’t see that character sequence, here’s the image of U+0031 U+20DD:

[Image: the digit 1 enclosed by a combining circle]

Alright, so there we have a great example of using two Unicode code points together to form a single visual glyph on screen. But how do I get the COMBINING ENCLOSING CIRCLE character to combine over two preceding digits? What if there were no precomposed CIRCLED NUMBER FIFTY ONE? There isn’t one, by the way. And yet I want to enclose two or more arbitrary digits with the COMBINING ENCLOSING CIRCLE character. Hmmm….
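
Constructing these sequences in code is easy, by the way; it’s the rendering that’s the open question. A small Java sketch of the options discussed above:

public class CircledNumbers {
    public static void main(String[] args) {
        String precomposed = "\u2460";  // ① CIRCLED DIGIT ONE
        String combined    = "1\u20DD"; // DIGIT ONE + COMBINING ENCLOSING CIRCLE

        // A combining mark attaches only to the single preceding character,
        // so this encloses just the "1", not the whole "51":
        String fiftyOne = "51\u20DD";

        System.out.println(precomposed + " " + combined + " " + fiftyOne);
    }
}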

Sigh…. I have to admit that I don’t actually know how to do this. I suspect that I can use some of Unicode’s control characters like START OF GUARDED AREA and END OF GUARDED AREA or …. I don’t really know.

When I find out, I’ll repost. If you know, please share!

Encoding Unicode Characters When UTF-8 Is Not An Option

September 27th, 2011 joconner

The other day I suggested that you use UTF-8 to encode your Java source code files. I still think that’s a best practice. If you can do that, you owe it to yourself to follow that advice.

But what if you can’t store text as UTF-8? Perhaps your repository won’t allow it, or maybe you simply can’t standardize on UTF-8 across your teams. What then? In that case, you should use ASCII to encode your text files. It’s not an optimal solution, but I can help you get more interesting Unicode characters into your Java source files despite that file encoding limitation.

The trick is to use the native2ascii tool to convert your non-ASCII Unicode characters to \uXXXX escapes. After creating and editing a file containing UTF-8 text like this:

String interestingText = "家族";

You would instead run the native2ascii tool on the file to produce an ASCII file that encodes the non-ASCII characters in  \u-encoded notation like this:

String interestingText="\u5BB6\u65CF";

In your compiled code, the result is the same. Given the correct font, the characters will display properly. U+5BB6 and U+65CF are the code points for “家族”. Using this type of \u-encoding, we’ve solved the problem of getting the non-ASCII characters into your text file and repository. Simply save the converted, \u-encoded file instead of the original, non-ASCII file.

The native2ascii tool is part of your Java Development Kit (JDK). You will use it like this:

native2ascii -encoding UTF-8 <inputfile> <outputfile>
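
If you’re curious what the tool does to your text, here’s a rough sketch of the conversion for BMP characters (my own illustration, not the JDK’s actual implementation):

public class ToAsciiEscapes {
    // Replace every non-ASCII char with a \uXXXX escape, roughly what native2ascii does.
    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("String interestingText = \"家族\";"));
        // prints: String interestingText = "\u5BB6\u65CF";
    }
}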

There you have it: an option for getting Unicode characters into your Java files without actually using UTF-8 encoding.

Best practice: Use UTF-8 as your source code encoding

September 22nd, 2011 joconner

Software engineering teams have become more distributed in the last few years. It’s not uncommon to have programmers in multiple countries, maybe a team in Belarus and others in Japan and the U.S. Each of these teams most likely speaks a different language, and most likely their host systems use different character encodings by default. That means that everyone’s source code editor creates files in different encodings too. You can imagine how mixed up and munged a shared source code repository can become when teams save, edit, and re-save source files in multiple charset encodings. It happens, and it has happened to me when working with remote teams.

Here’s an example: you create a test file containing ASCII text. Overnight, your Japanese colleagues edit the file, add a new test, and include the following line:

String example = "Fight 文字化け!";

They save the file and submit it to the repository using Shift-JIS or some other common legacy encoding. You pick up the file the next day, add a couple of lines, save it, and BAM! Data loss. Your editor creates garbage characters because it attempts to save the file in the ISO-8859-1 encoding. Instead of the correct Japanese text above, your file now contains the text “Fight ?????” Not cool, not fun. And you’ve most likely broken the test as well.
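
Here’s a small Java sketch of one way this kind of corruption happens, decoding bytes with the wrong charset:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) throws Exception {
        String original = "Fight 文字化け!";

        // The bytes as a colleague's editor saved them:
        byte[] shiftJisBytes = original.getBytes(Charset.forName("Shift_JIS"));

        // Your editor misinterprets those same bytes as ISO-8859-1:
        String garbled = new String(shiftJisBytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // prints mojibake, not the original Japanese
    }
}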

How can you avoid these charset mismatches? Use a common charset across all your teams: use Unicode, and more specifically, use UTF-8. The reason is simple, and I won’t try hard to defend it; it just seems obvious. Unicode is a superset of all other commonly used character sets. UTF-8, a specific encoding of Unicode, is backward compatible with the ASCII character encoding, and all programming language source keywords and syntax (that I know of) are composed of ASCII text. UTF-8 as a common charset encoding will allow all of your teams to share files, use characters that make sense for their tests or other source files, and never again lose data because of charset encoding mismatches.

If your editor has a setting for file encodings, use it and choose UTF-8. Train your team to use it too. Set up your Ant scripts and other build tools to use the UTF-8 encoding for compilation. You might have to explicitly tell your Java compiler that source files are in UTF-8, but that’s worth the change. By the way, the javac command line argument you need is simply "-encoding UTF-8".
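
For example, if you build with Ant, the javac task accepts an encoding attribute (the srcdir and destdir values here are placeholders):

<javac srcdir="src" destdir="build" encoding="UTF-8"/>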

Recently I was using NetBeans 7 in a project and discovered that it defaults to UTF-8. Nice! I was pleasantly surprised.

Regardless of your editor, look into this. Find out how to set your file encodings to UTF-8. You’ll definitely benefit from this in a distributed team environment in which people use different encodings. Standardize on this and make it part of your team’s best practices.

“Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.”

Encoding URLs for non-ASCII query params

June 24th, 2011 joconner

Are you a web service API developer? The web truly is a world-wide web, yet a great number of globally unaware developers work on it. This creates an odd situation in which web services are globally accessible but only locally or regionally aware.

There are a few important things to remember when creating a global web service. Let’s cover just ONE today: non-ASCII query parameters are valid, useful, and often necessary for a decent global web service.

It seems obvious to me, and it probably does to you: sometimes a service needs to exchange or process non-ASCII data. The world is a big place, and although English is an important part of the global web, it is only a part; more people speak other languages, and lots of them use Chinese or an Indic language. Let’s make sure your web service can process all those non-ASCII characters, whether the text is English or any other language!

Let’s look at some examples of non-ASCII query params:

  • http://example.com?name=田中&city=東京
  • http://example.com?名前=田中&市=東京

In these examples, you must perform two steps to get the query params (both keys and values) into the correct form:

  1. Convert the keys and their values to UTF-8 if they are not already.
  2. Perform “percent encoding” on each UTF-8 code unit.

For step 1, use whatever character conversion utilities your development library provides, for example, the Java charset encoding converters.

Step 2 is the important one for this post. You must “percent encode” each code unit (byte) of the UTF-8 query portion, writing it as a percent sign followed by the code unit’s two hexadecimal digits. Let’s look at the first example’s query params:

name=田中&city=東京

The JavaScript function encodeURI actually does a good job of doing this for us:

encodeURI("name=田中&city=東京") produces the string:

name=%E7%94%B0%E4%B8%AD&city=%E6%9D%B1%E4%BA%AC

Notice that you should also apply this encoding to the keys in the param list. In the next example, I’ve used Japanese text for both keys and values.

encodeURI("名前=田中&市=東京") produces this string:

%E5%90%8D%E5%89%8D=%E7%94%B0%E4%B8%AD&%E5%B8%82=%E6%9D%B1%E4%BA%AC

Note that both the keys and values have been “percent encoded”.

On the server side, your server will decode these values into their correct UTF-8 string values if you have configured it correctly. Correct configuration usually means a charset conversion filter in a servlet container, and sometimes just a config setting for Apache.
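
For the servlet container case, a minimal filter might look like the following sketch (using the javax.servlet API; the class name is mine). Note that some containers apply this setting only to the request body and need a separate connector setting, such as Tomcat’s URIEncoding attribute, to decode the query string as UTF-8:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class Utf8EncodingFilter implements Filter {
    public void init(FilterConfig config) throws ServletException {}

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Must be called before any parameter is read from the request.
        request.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    public void destroy() {}
}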

More on this at a later time.

Attending IUC 34 and career longevity

October 20th, 2010 joconner

After a few years away from the internationalization crowd, I’m attending the Internationalization and Unicode Conference again this year. How great to see old friends and to make new ones. Some things are new, including some new people. However, many things are old, or at least definitely older.

What’s old? Well, for one, the problems. They’re the same problems, over and over again. It seems like every new tool, application, and operating system struggles with internationalization as if it were a new problem. And it isn’t. After almost two decades in this industry, I am still surprised that we talk about resource bundles, date and time formats, and so on. I keep thinking this stuff is resolved and done. But every year, IUC reminds me that it is not. Every new platform, tool, and application repeats the mistakes of the past and solves these problems again and again, as if they were new. Why is that?

Some things were very new, mainly products and specific technologies. We have new characters in Unicode. Old languages (JavaScript) are getting more internationalization support in a future version. Windows 7. Twitter. And gray hair. That’s definitely new. Some of my very good colleagues in the industry have aged, and it reminds me of my own age and career in internationalization.

About 15 years ago, my friend Bill Hall and I mused that we might one day be out of a job in the internationalization (i18n) industry. We thought that we and others like us would solve all the internationalization issues, make everyone aware, and create libraries that everyone would use everywhere. We really thought that we could work hard, solve all the problems, and finally make our jobs unnecessary or obsolete. The funny thing is that here we are 15 years later, and it’s clear to me now that we didn’t permanently solve any problem. We provided temporary solutions, but nothing permanent. It’s humbling to think that one’s life work hasn’t produced many reusable solutions, but this knowledge has a silver lining too.

So, I suppose my rant boils down to this: welcome to all the new individuals in this industry! All you people at Twitter, those working on Android or Chrome, all you newer Adobe Flash and Flex folks, and all the new individuals representing a host of other companies: welcome to the internationalization industry, and welcome to IUC! You can be happy to know that despite all your hard work today, you will always have a job in this industry tomorrow. You are in a great, vibrant, long-lived career. Despite your best efforts, you will probably never work yourself out of a job! What you do is needed today and tomorrow, and it most likely always will be.

Bittersweet? Definitely. Sigh….
