JavaScript file encodings

All text files have a character encoding regardless of whether you explicitly declare it. JavaScript files are no exception. This article describes both how and why you should declare an encoding when importing script files into an HTML document.

JavaScript’s Character Model

A JavaScript engine’s internal character set is Unicode. The ECMAScript 5.1 standard says that strings are sequences of 16-bit code units, interpreted as UTF-16. Once inside the JavaScript interpreter, all characters and strings are stored and accessed as UTF-16 code units. However, before being processed by the JavaScript engine, a JavaScript file’s charset can be anything, not necessarily a Unicode encoding.
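To make the code unit idea concrete, here is a small sketch that inspects two strings from inside the engine. The helper name `describe` is my own, and the sample characters are written as escapes so the result does not depend on how this snippet itself is saved.

```javascript
// Strings expose UTF-16 code units, not whole characters.
const bmp = "\u00E9";       // U+00E9 fits in a single 16-bit code unit
const astral = "\u{1F600}"; // U+1F600 needs two code units (a surrogate pair)

// Hypothetical helper: summarize how the engine sees a string.
function describe(str) {
  return {
    codeUnits: str.length,                       // count of UTF-16 code units
    codePoints: Array.from(str).length,          // string iteration walks code points
    firstUnit: str.charCodeAt(0).toString(16),   // first 16-bit code unit, in hex
    firstPoint: str.codePointAt(0).toString(16), // full code point at index 0, in hex
  };
}

console.log(describe(bmp));    // codeUnits: 1, firstUnit: "e9"
console.log(describe(astral)); // codeUnits: 2, firstUnit: "d83d", firstPoint: "1f600"
```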

Character Encoding Conversion

When you import a JavaScript file into an HTML document, by default the browser uses the document’s charset to convert the JavaScript file into the interpreter’s encoding (UTF-16). You can also use an explicit charset when importing a file. When an HTML file charset and a JavaScript file charset are different, you will most likely see conversion mistakes. The results are mangled, incorrect characters.
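The mismatch can be reproduced outside the browser. This is only a sketch: I am assuming the Spanish word is "niña", and I build its ISO-8859-1 bytes by hand rather than reading a real file.

```javascript
// Bytes of the (assumed) word "niña" as stored in ISO-8859-1: n, i, 0xF1, a.
const latin1Bytes = new Uint8Array([0x6e, 0x69, 0xf1, 0x61]);

// Wrong: decode the bytes as UTF-8, which is what a UTF-8 page does by default.
// 0xF1 is not valid UTF-8 here, so it becomes U+FFFD, the replacement character.
const wrong = new TextDecoder("utf-8").decode(latin1Bytes);

// Right: ISO-8859-1 maps each byte to the code point with the same value.
const right = String.fromCharCode(...latin1Bytes);

console.log(JSON.stringify(wrong)); // the third character is U+FFFD
console.log(JSON.stringify(right)); // the third character is U+00F1
```

Once the replacement character has been substituted, the original byte is gone, so the fix has to happen at decode time and not afterwards.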

Conversion Problems

I created a simple demonstration of the potential problem. The demo has 5 files:

  • jsencoding.html — base HTML file, UTF-8 charset
  • stringmgr.js — a basic string resource mgr, UTF-8 charset
  • resource.js — an English JavaScript resource file containing the word family, UTF-8 charset
  • resource_es.js — a Spanish file containing the word girl, ISO-8859-1 charset
  • resource_ja.js — a Japanese file containing the word baseball, SHIFT-JIS charset

In the base HTML file, I’ve imported 3 JavaScript resource files using the following import statements:

    <script src="resource.js"></script>
    <script src="resource_es.js"></script>
    <script src="resource_ja.js"></script>


The image shows how the text resources have been converted incorrectly. The browser imported the Spanish JavaScript file using the HTML file’s UTF-8 encoding even though the file is stored using ISO-8859-1. The Japanese resource script is stored as SHIFT-JIS and doesn’t convert correctly either.

After updating the import statements, we see a better result:

    <script src="resource.js" charset="UTF-8"></script>
    <script src="resource_es.js" charset="ISO-8859-1"></script>
    <script src="resource_ja.js" charset="SHIFT-JIS"></script>

Correct conversions


To avoid charset conversion problems when importing JavaScript files and JavaScript resources, you should include the file charset. An even better practice is to use UTF-8 as your charset in all files, which minimizes these conversion problems significantly.

You can check out the code for this article on my GitHub account here:
I18n Examples

Best practices for character sets

You may not understand every language, but that doesn’t mean your applications can’t. Regardless of your customer’s language choice, your application should be able to process, transfer, and store their data. Even if you don’t provide a localized user interface, your application should allow your customer to enter text in their own language and in their own script. For example, my word processor is localized into English, but it allows me to enter text in a variety of scripts and languages.

How is that possible? The most basic requirement for this ability is to use a single character set internally. If you want to handle all scripts, your only choice is the Unicode character set.

Rule #1:
Use Unicode as your character set.

Unicode has several possible encodings, including UTF-32, UTF-16, UTF-16LE, UTF-16BE, and UTF-8. These encodings transform Unicode code points into code units. Code points are the values between 0 and 0x10FFFF, the range of integers that are allocated for character definitions. An encoding transforms code points into code units, which are used to serialize a character for storage or transmission. A single code point becomes 1 or more code units during encoding.
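As a rough illustration of "1 or more code units", the sketch below counts the code units a single character needs in each encoding form. The function name `codeUnitCounts` is my own invention, and the UTF-32 count is stated by definition, not computed.

```javascript
// Hypothetical helper: how many code units does one character need?
function codeUnitCounts(ch) {
  return {
    utf8: new TextEncoder().encode(ch).length, // 8-bit code units (TextEncoder always emits UTF-8)
    utf16: ch.length,                          // 16-bit code units
    utf32: 1,                                  // by definition, one 32-bit unit per code point
  };
}

// Samples written as escapes: U+0041, U+00F1, U+7403, U+1F600.
for (const ch of ["A", "\u00F1", "\u7403", "\u{1F600}"]) {
  console.log(ch, codeUnitCounts(ch));
}
// utf8 counts are 1, 2, 3, 4; utf16 counts are 1, 1, 1, 2.
```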

Although all the encodings are well-defined, UTF-8 is the easiest to use primarily because of its code unit size. UTF-8 has 8-bit (byte) code units that are immune to the common memory design issues involving byte ordering. I recommend that you use UTF-8 everywhere possible in your system. You’ll avoid mistakes in determining little endian and big endian layouts with the other encodings.
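Here is a small sketch of the byte-order issue. The helpers `utf16Bytes` and `utf8Bytes` are names I made up for this example; they serialize the same character and print the bytes in hex.

```javascript
// Format a byte sequence as space-separated hex pairs.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// Hypothetical helper: serialize a string's UTF-16 code units in a chosen byte order.
function utf16Bytes(str, littleEndian) {
  const view = new DataView(new ArrayBuffer(str.length * 2));
  for (let i = 0; i < str.length; i++) {
    view.setUint16(i * 2, str.charCodeAt(i), littleEndian);
  }
  return toHex(new Uint8Array(view.buffer));
}

// Hypothetical helper: serialize a string as UTF-8.
function utf8Bytes(str) {
  return toHex(new TextEncoder().encode(str));
}

const sample = "\u00F1"; // U+00F1
console.log(utf16Bytes(sample, false)); // big endian:    00 f1
console.log(utf16Bytes(sample, true));  // little endian: f1 00
console.log(utf8Bytes(sample));         // one byte order only: c3 b1
```

The two UTF-16 results differ only in byte order, which is exactly the detail a reader has to know or guess. UTF-8 has a single serialization.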

Rule #2:
Use UTF-8 everywhere possible.

OK, so we’ve got the rules for character set and encoding choice. Now you have to implement those rules.

Complex systems typically have many points of failure for textual data loss. Those points are usually hand-off points across systems:
1. File export and import
2. Outbound and inbound HTTP request data and parameters
3. Database connections and schemas
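As one small illustration of the HTTP hand-off point, the standard JavaScript URL functions percent-encode parameters as UTF-8 bytes. This sketch only shows the client side; the sample word and the parameter name `q` are assumptions, and the server still has to decode with the same encoding.

```javascript
const term = "ni\u00F1a"; // assumed sample text containing U+00F1

// encodeURIComponent always percent-encodes the UTF-8 bytes of the string.
const encoded = encodeURIComponent(term);

// URLSearchParams serializes query strings the same way.
const query = new URLSearchParams({ q: term }).toString();

// Decoding with the matching function restores the original text.
const decoded = decodeURIComponent(encoded);

console.log(encoded); // ni%C3%B1a
console.log(query);   // q=ni%C3%B1a
console.log(decoded === term);
```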

Each of these deserves its own discussion. Moreover, each has specific implementation details for different products. Unfortunately I can’t cover any of them adequately in this particular post. However, I’ll try to touch on these subjects in a future update. If you have questions about any of them, let me know. I’ll use your suggestions to help me decide which to address first.

For now you have my own best practices for character set choice when creating any system. Good luck!

//John O.

The absolute minimum you should know about internationalization


Internationalization is a design and engineering task that prepares your software product to be localized. It doesn’t create a localized product; instead, it puts your product in a state that allows localization. The goal of internationalization should be a single code base that can be used as-is to create multiple localized versions of your product.

This article provides a high-level description of some issues you must resolve during internationalization. This is not a comprehensive list:

  • Character sets
  • Resource externalization
  • User interface design
  • Data formats
  • Sorting

Character Sets

Your application will most likely manage, manipulate, store, and display information. Much of the information will be user-readable text. One property of text is its character set.

If you want a global-ready application, the choice of character set is simple: Unicode. Unicode allows you to manage text in practically any script without losing data due to character conversion problems. Regardless of the default character set of the underlying host OS, your application should convert text to Unicode for internal manipulation. Additionally, your application should transmit and store text as Unicode. Doing anything else is unnecessarily complicated in any modern operating environment.

Unicode has several possible encodings, including UTF-16 and UTF-8. My experience is that developers rarely get to use just one of these. However, they are BOTH Unicode. Their only significant difference is how a specific code point is encoded in code units. Unless you have a well-understood reason for doing otherwise, I suggest you store and transmit text in UTF-8. Your specific programming language may require you to use UTF-16 for text operations. When displaying text to your user, you might use UTF-16 in a desktop application. When rendering HTML views, you can typically use UTF-16 or UTF-8. I suggest you use UTF-8 everywhere possible.

Resource Externalization

A resource is any text label, message, graphic image, video, audio, or other application file that you intend to present to the user. Instead of hard-coding these resources into your application code, you should extract them into external files that can be used at run-time. By extracting user-facing resources into resource files, you make translation and localization easier. Practically every programming environment provides a mechanism for creating external resource files. 
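A minimal sketch of the idea follows. The bundle contents, the keys, and the `getString` function are all hypothetical. In a real project each bundle would live in its own file, and your environment's resource mechanism would replace this lookup.

```javascript
// Hypothetical resource bundles, keyed by locale. Escapes keep the sample
// independent of this snippet's own file encoding.
const bundles = {
  en: { greeting: "Hello", family: "family" },
  es: { greeting: "Hola", girl: "ni\u00F1a" },
};

// Look up a key, falling back from "es-MX" to "es" to the default "en".
function getString(locale, key) {
  const candidates = [locale, locale.split("-")[0], "en"];
  for (const name of candidates) {
    const bundle = bundles[name];
    if (bundle && Object.prototype.hasOwnProperty.call(bundle, key)) {
      return bundle[key];
    }
  }
  return key; // make a missing resource visible instead of throwing
}

console.log(getString("es-MX", "greeting")); // found in "es"
console.log(getString("es-MX", "family"));   // falls back to "en"
console.log(getString("ja", "unknown"));     // no bundle has it, so the key comes back
```

The application code only ever refers to keys such as `greeting`, so translators can add a bundle without touching the code.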

User Interface Design

User interface layout is often affected by the length of text labels, fields and other visible text. When designing layouts, remember that field and label sizes will increase for some language translations. Design your user interface with the largest label and field lengths in mind. Additionally, follow the typical rules for avoiding culturally sensitive images, hand gestures, and body parts. Also, avoid concatenating shorter pieces of text to build up larger sentences. When translated, the concatenated text rarely has correct syntax or meaning.
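One common alternative to concatenation is a whole-sentence template with named placeholders, so the translator controls word order. The sketch below is illustrative only: the message strings and the `formatMessage` helper are made up, and it does not handle plural forms.

```javascript
// Hypothetical whole-sentence templates. Each translation places the
// placeholders wherever its grammar needs them.
const messages = {
  en: "{count} files copied to {folder}",
  de: "{count} Dateien wurden nach {folder} kopiert",
};

// Replace {name} placeholders with values; unknown placeholders are left as-is.
function formatMessage(template, values) {
  return template.replace(/\{(\w+)\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(values, name) ? String(values[name]) : match
  );
}

const values = { count: 3, folder: "Backup" };
console.log(formatMessage(messages.en, values));
console.log(formatMessage(messages.de, values));
```

Note that the German sentence puts the verb at the end, after the folder name. Concatenating "files copied to" with a folder name could never produce that order.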

Some languages are written from right-to-left. If targeting those languages, remember that the entire layout of page components is often arranged from right-to-left. You may need to create a “reversible” layout that can accommodate those languages and cultures.

Data Formats

Numbers and dates have different formats around the world. Digit separators, currency symbols, and date field orders are all part of the many differences that you’ll need to consider. Fortunately, you don’t have to discover the correct formats and standards for every culture. Many programming environments already provide libraries to format numbers, currencies, and dates using the Common Locale Data Repository (CLDR) formats. 

The main point I want to share about formats is this: separate concerns for data formats by storing and manipulating data in a canonical, non-localized form and apply localized formats only in the “view” layer of your application.
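In JavaScript, the built-in `Intl` objects are one way to apply that separation. The sketch below keeps a plain number and a `Date` as the canonical values and formats them only at display time. The `formatForView` name and the sample values are my own, and exact date output can vary with the locale data your runtime ships.

```javascript
// Canonical, non-localized values as they would be stored or transmitted.
const amount = 1234567.891;
const when = new Date(Date.UTC(2024, 0, 31)); // 31 January 2024, UTC

// Hypothetical view-layer helper: localize only when presenting to the user.
function formatForView(locale) {
  return {
    number: new Intl.NumberFormat(locale).format(amount),
    date: new Intl.DateTimeFormat(locale, { timeZone: "UTC" }).format(when),
  };
}

console.log(formatForView("en-US")); // number: "1,234,567.891"
console.log(formatForView("de-DE")); // number: "1.234.567,891"
```

The stored values never change. Only the separators and field order chosen at the view layer do.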


Sorting

Languages have sorting rules. Those rules help you find names or products in long lists. Dictionaries, phone books, and product catalogs use linguistic sorting to help people find information quickly. When presenting long lists to your users, your application should use those sorting rules as well. Learn and use the sorting or collation libraries in your programming language or technology environment.
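In JavaScript, the collation library is `Intl.Collator`. This short sketch compares the default sort, which orders by UTF-16 code unit, with linguistic sorts. The sample list is my own, and the results assume a runtime with locale data for German and Swedish.

```javascript
// "z", "ä" (U+00E4), and "a", deliberately out of order.
const names = ["z", "\u00E4", "a"];

// Default sort compares UTF-16 code units, so "ä" lands after "z".
const byCodeUnit = [...names].sort();

// Linguistic sorts: German places "ä" with "a"; Swedish treats it as a
// separate letter near the end of the alphabet.
const german = [...names].sort(new Intl.Collator("de").compare);
const swedish = [...names].sort(new Intl.Collator("sv").compare);

console.log(byCodeUnit); // a, z, ä
console.log(german);     // a, ä, z
console.log(swedish);    // a, z, ä
```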


Internationalization is an effort to create products that can be translated and localized for many languages and cultures. Creating an internationalized product requires that you consider and plan for a variety of common technical issues. A few of those issues are character set choice, user-interface design, data formats and sorting. You rarely have to solve those issues yourself; you can often find and use existing libraries for this purpose.

More Resources


What is Unicode?

Unicode is a character set standard. This particular standard assigns a unique number to every character used around the globe, regardless of written and spoken language, computing platform, or application. Unicode includes all the characters used from other more limited character sets. Prior to Unicode, smaller character sets assigned character values differently from each other. Unicode unifies all other character sets; every character gets its own, unique value.
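The "unique number" is easy to see from JavaScript. In this sketch the `toUPlus` helper is my own name, and the sample characters are arbitrary picks from different scripts.

```javascript
// Hypothetical helper: show a character's Unicode number in U+XXXX notation.
function toUPlus(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Latin, Latin with a diacritic, a CJK ideograph, and an emoji.
const samples = ["A", "\u00F1", "\u91CE", "\u{1F600}"];
console.log(samples.map(toUPlus)); // U+0041, U+00F1, U+91CE, U+1F600

// The mapping works in the other direction too.
const fromNumbers = String.fromCodePoint(0x91ce, 0x7403);
console.log(fromNumbers === "\u91CE\u7403");
```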

You can get more information about Unicode from the Unicode Home Page.