Best practice: Use UTF-8 as your source code encoding

September 22nd, 2011 joconner

Software engineering teams have become more distributed in the last few years. It’s not uncommon to have programmers in multiple countries, maybe a team in Belarus and others in Japan and in the U.S. These teams most likely speak different languages, and their host systems most likely use different character encodings by default. That means that everyone’s source code editor creates files in different encodings too. You can imagine how mixed up and munged a shared source code repository might become when teams save, edit and re-save source files in multiple charset encodings. It happens, and it has happened to me when working with remote teams.

Here’s an example. You create a test file containing only ASCII text. Overnight, your Japanese colleagues edit the file, adding a new test that contains the following line:

String example = "Fight 文字化け!";

They save the file in Shift-JIS or some other common legacy encoding and submit it to the repository. You pick up the file the next day, add a couple of lines to it, save it, and BAM! Data loss. Your editor creates garbage characters because it attempts to save the file in the ISO-8859-1 encoding, which cannot represent Japanese characters. Instead of the correct Japanese text from above, your file now contains the text “Fight ?????” Not cool, not fun. And you’ve most likely broken the test as well.
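
To make the loss concrete, here is a minimal sketch in Java. The class and method names (EncodingLossDemo, roundTrip) are mine, invented for illustration, and the Japanese text is written with Unicode escapes so the example does not depend on the encoding of the file it lives in. It only simulates the save step: it encodes the string to bytes and decodes them again, which is roughly what an editor does when it saves and reopens a file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingLossDemo {
    // The test string from the example above, spelled with Unicode escapes
    // so that this file stays pure ASCII.
    static final String ORIGINAL = "Fight \u6587\u5b57\u5316\u3051!";

    // Encode the text to bytes with the given charset, then decode it again.
    // This stands in for "save the file, then reopen it".
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // ISO-8859-1 has no Japanese characters, so the text is damaged.
        String latin1 = roundTrip(ORIGINAL, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 intact? " + latin1.equals(ORIGINAL));

        // UTF-8 can represent every Unicode character, so nothing is lost.
        String utf8 = roundTrip(ORIGINAL, StandardCharsets.UTF_8);
        System.out.println("UTF-8 intact? " + utf8.equals(ORIGINAL));
    }
}
```

In the ISO-8859-1 round trip each Japanese character comes back as a single “?”, because String.getBytes substitutes the charset’s replacement byte for anything it cannot encode. The same round trip through UTF-8 returns the string unchanged.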

How can you avoid these charset mismatches? The answer is to use a common charset across all your teams: Unicode, and more specifically, UTF-8. The reason is simple, and I won’t try hard to defend it. It just seems obvious. Unicode is a superset of all other commonly used character sets. UTF-8, a specific encoding of Unicode, is backward compatible with the ASCII character encoding, and the keywords and syntax of every programming language that I know are composed of ASCII text. Using UTF-8 as a common charset encoding allows all of your teams to share files and use characters that make sense for their tests or other source files, without losing data to charset encoding mismatches.
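
The backward compatibility with ASCII is easy to check for yourself. The following is a small sketch, with class and method names (Utf8AsciiCompat, sameBytesInAsciiAndUtf8, utf8Length) that I made up for the purpose. It compares the bytes that the same text produces in US-ASCII and in UTF-8, and counts how many bytes UTF-8 needs for the Japanese word from the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8AsciiCompat {
    // True when the text produces identical bytes in US-ASCII and in UTF-8.
    static boolean sameBytesInAsciiAndUtf8(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.US_ASCII),
                             text.getBytes(StandardCharsets.UTF_8));
    }

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Pure ASCII source text: byte-for-byte the same in both encodings.
        System.out.println(sameBytesInAsciiAndUtf8("String example = \"Fight!\";"));

        // Four Japanese characters (written as Unicode escapes): three bytes each in UTF-8.
        System.out.println(utf8Length("\u6587\u5b57\u5316\u3051"));
    }
}
```

An ASCII-only source file is therefore already a valid UTF-8 file, so switching a team to UTF-8 should not require touching files that contain nothing but ASCII.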

If your editor has a setting for file encodings, use it and choose UTF-8. Train your team to use it too. Set up your Ant scripts and other build tools to use the UTF-8 encoding for compilations. You might have to explicitly tell your Java compiler that source files are in UTF-8, but the change is worth making. By the way, the javac command line argument you need is simply “-encoding UTF-8”.
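
If you want to see the “-encoding UTF-8” argument at work without leaving Java, the sketch below passes it to the compiler through the standard javax.tools API. Treat it as an illustration under assumptions, not a build recipe: the class name CompileWithUtf8, the method compileUtf8Sample, and the throwaway Sample class are all hypothetical, and the code needs a full JDK because a plain JRE has no system compiler.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class CompileWithUtf8 {
    // Writes a tiny source file as UTF-8 into a temp directory and compiles it
    // with "-encoding UTF-8". Returns the compiler's exit code (0 means success).
    static int compileUtf8Sample() {
        try {
            Path dir = Files.createTempDirectory("utf8-demo");
            Path source = dir.resolve("Sample.java");

            // The escapes below become real Japanese characters in this string,
            // so the file written to disk contains multi-byte UTF-8 text.
            String code = "public class Sample {\n"
                    + "    String example = \"Fight \u6587\u5b57\u5316\u3051!\";\n"
                    + "}\n";
            Files.write(source, code.getBytes(StandardCharsets.UTF_8));

            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("No system Java compiler found; a JDK is required");
            }

            // Same arguments you would give javac on the command line.
            return compiler.run(null, null, null,
                    "-encoding", "UTF-8",
                    "-d", dir.toString(),
                    source.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("javac exit code: " + compileUtf8Sample());
    }
}
```

The point of the explicit argument is that the result no longer depends on whatever default encoding the build machine happens to have.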

Recently I was using NetBeans 7 in a project and discovered that it defaults to UTF-8. Nice! I was pleasantly surprised.

Regardless of your editor, look into this. Find out how to set your file encodings to UTF-8. You’ll definitely benefit from this in a distributed team environment in which people use different encodings. Standardize on this and make it part of your team’s best practices.
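
One way to back up such a team practice is an automated check that flags files that are not valid UTF-8. Here is a minimal sketch of the idea, assuming you already have the file contents as a byte array. The class and method names (Utf8Check, isValidUtf8) are hypothetical, and how you wire this into your build is up to you.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true when the bytes decode cleanly as UTF-8.
    // The decoder is told to report bad input instead of silently replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an accented final letter

        // Saved as UTF-8: passes the check.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));

        // Saved as ISO-8859-1: the accented letter is a lone byte 0xE9,
        // which is not a complete UTF-8 sequence, so the check fails.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A check like this cannot tell you which legacy encoding a file was saved in. It only tells you that the file is not UTF-8, which is usually enough to catch the mistake before it spreads through the repository.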

“Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.”

Categories: Java, NetBeans, Unicode
  1. September 22nd, 2011 at 04:38 | #1

    I agree completely. The ASCII base of UTF-8 plus its full Unicode compatibility make it natural for this purpose.

    Although not as human-readable, NCRs (Numeric Character References) are another option if the number of non-ASCII characters in a particular chunk of source code is relatively small. An example is 一 for U+4E00 (if that displays as the actual ideograph, what I meant was the following eight-character ASCII sequence: “& # x 4 E 0 0 ;”), or whatever notation is accepted in the programming language you’re using. I use NCRs in HTML on a regular basis, mainly because they work in all modern browsers, and are not susceptible to corruption.

  2. Martin
    September 26th, 2011 at 13:33 | #2

    1) Eclipse editor defaults to Latin1 encoding, instead of UTF-8. Even more annoyingly, it is a workspace property, so one needs to change it with every new workspace. Easy to forget about.
    2) Java’s properties files only support Latin1. There are plugins for Eclipse which automatically convert non Latin1 (non-ASCII) characters to their \uXXXX notation when editing property files texts.

  3. Martin
    September 26th, 2011 at 13:35 | #3

    Forgot about my favourite PIA:
    3) MySQL’s usually set to Latin1 by default.

  4. Siva
    September 27th, 2011 at 01:48 | #4

    Hi,
    Last month I spent almost 2 days debugging an issue related to character encoding.
    When I generated a new JSP file using Eclipse, it got some “windows-XXX” kind of encoding by default. I was getting some text from the DB with characters that are not supported by “Windows-XXX”, and page rendering terminated abruptly without any error/clue.

    So using a consistent character encoding will save developers from losing hair :-)

  5. Leslie
    October 3rd, 2011 at 10:56 | #5

    Hello and thank you for your article and for the comments. I see your name is O’Connor. An bhfuil Gaeilge agat? An Labríonn tú Gaeilge? I’m wondering, because I’m attempting to pass some text in Irish from a MySQL database through my Java application to a browser, but I’m losing the Irish fada right from the database. So I’ve been referred to your article and have attempted to change the settings in Eclipse. But I’m still observing character distortions. Right now I’m going back to re-examine my settings in the database.

  6. October 3rd, 2011 at 12:11 | #6

    @Leslie
    Unfortunately I do not speak Gaelic at all…and I’m only guessing at your question. :)

    I think you should read the following article, that describes some of the charset changes in a typical web application: http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/

  7. Leslie
    October 14th, 2011 at 14:27 | #7

    @Leslie
    Resolved – Was just a problem communicating with the database.

  8. John O’Conner
    October 14th, 2011 at 15:51 | #8

    @Leslie
    Glad to hear you resolved the problem, Leslie!
