JavaScript file encoding


Although JavaScript itself uses Unicode internally, you can still run into charset conversion problems. The following example shows how such a problem arises with one very simple HTML file and one JS file.

In this example, a hello.html document says “Hello” when you click a button. Each button calls a snippet of JavaScript (the sayHello function) to display an alert dialog box. BTN 1 invokes the sayHello function using a local variable, localCustName, which contains the text “José”. BTN 2 invokes the same function using an externally defined variable, remoteCustName, which also contains the text “José”.

hello.html

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Hello, world!</title>
        <script type="text/javascript" >
            var localCustName = "José";
            function sayHello(custName) {
                // == null is true for both null and undefined
                if (custName == null) {
                    custName = "world";
                }
                alert("Hello, "+ custName);
            }
        </script>
        <script type="text/javascript" src="./remoteCustName.js"></script>
    </head>
    <body>

        <p>Hello, world!</p>
        <p>
        <button onclick="sayHello(localCustName)">BTN 1: Say hello to local José</button>
        </p>
        <p>
        <button onclick="sayHello(remoteCustName)">BTN 2: Say hello to remote José</button>
        </p>
    </body>
</html>

remoteCustName.js

// this file is encoded as ISO-8859-1
var remoteCustName = "José";

When you load the hello.html file, you’ll see two buttons. One says hello to the “José” stored in a local JavaScript variable; the other says hello to the “José” stored in an external js file. Note that the HTML file is encoded as UTF-8 and the js file as ISO-8859-1. These encodings are arbitrary and could have been any of the encodings defined in the IANA charset registry. The point is that the two encodings differ from each other.

Suppose you click BTN 1. You should see this:

Figure 1:
btn1

In this example, the HTML file is UTF-8. The localCustName variable begins as UTF-8 text in the HTML file itself, and the interpreter converts it from UTF-8 into its internal string representation (UTF-16 code units), which is conveniently also Unicode.
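
To make that conversion concrete, here is a minimal sketch (not part of the original example) that compares the UTF-8 bytes of the text with the string the interpreter ends up holding. It assumes an environment that provides the standard TextEncoder and TextDecoder globals, such as a current browser or Node.js; the variable names are illustrative only.

```javascript
// "José" written with an escape so this snippet does not depend on its own file encoding.
const custName = "Jos\u00E9";

// The bytes as they would sit in a UTF-8 encoded HTML file: 'é' takes two bytes.
const utf8Bytes = new TextEncoder().encode(custName);
console.log(Array.from(utf8Bytes).join(", ")); // 74, 111, 115, 195, 169

// Decoding those bytes as UTF-8 gives back the original four-character string.
const decoded = new TextDecoder("utf-8").decode(utf8Bytes);
console.log(decoded === custName, decoded.length); // true 4

// Inside the interpreter, 'é' is the single UTF-16 code unit 0xE9.
console.log(decoded.charCodeAt(3).toString(16)); // e9
```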

Now let’s imagine you click BTN 2. You should see this:

Figure 2:
btn2

In Figure 2, the button uses the variable from the external JS file, which is encoded as ISO-8859-1. When the browser pulls in remoteCustName.js, it converts the file’s bytes to Unicode. But how does it know the source encoding? Unless told otherwise, it assumes the external file uses the same encoding as the HTML document, which is UTF-8. That guess is wrong here: in ISO-8859-1 the ‘é’ is the single byte 0xE9, which on its own is not a valid UTF-8 sequence. So the remoteCustName text inside the interpreter is Unicode, but the conversion was incorrect, and the alert shows a garbled character where the ‘é’ should be.
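
The wrong guess can be reproduced outside the browser. The sketch below is an illustration under assumptions: it hard-codes the four bytes that “José” occupies in ISO-8859-1 and decodes them twice, using the standard TextDecoder for UTF-8 and String.fromCharCode as a stand-in for an ISO-8859-1 decoder (this works because ISO-8859-1 byte values equal the Unicode code points). The names latin1Bytes, wrongGuess, and rightGuess are made up for this example.

```javascript
// The bytes of "José" as stored in an ISO-8859-1 file: 'é' is the single byte 0xE9.
const latin1Bytes = new Uint8Array([0x4a, 0x6f, 0x73, 0xe9]);

// Wrong guess: treat the bytes as UTF-8. 0xE9 starts a multi-byte sequence that
// never completes, so the decoder substitutes U+FFFD (the replacement character).
const wrongGuess = new TextDecoder("utf-8").decode(latin1Bytes);
console.log(wrongGuess === "Jos\uFFFD"); // true

// Right guess: map each byte to the code point with the same value.
const rightGuess = String.fromCharCode(...latin1Bytes);
console.log(rightGuess === "Jos\u00E9"); // true
```

The U+FFFD in wrongGuess is the garbled character that the alert displays for BTN 2.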

What’s the fix?

We can fix this by telling the browser explicitly which encoding the JS file uses. The following revised script tag in hello.html does this:

...
<script type="text/javascript" charset="ISO-8859-1" src="./remoteCustName.js"></script>
...

Now, when we click on either BTN 1 or BTN 2, we see the same thing:

Figure 3:
btn1
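
A complementary approach, which the example above does not use, is to keep the .js file pure ASCII by writing non-ASCII characters as \uXXXX escapes. ASCII bytes mean the same thing in both UTF-8 and ISO-8859-1, so a wrong guess between those two encodings no longer matters. The sketch below is only an illustration: asciiSource is a hypothetical stand-in for the file contents (with a return added so the value can be inspected), and new Function is used purely to evaluate that text in place of a script tag.

```javascript
// Hypothetical contents of remoteCustName.js, written with a \u escape for 'é'.
// The doubled backslash is only needed because the text sits inside a string here,
// and the trailing return exists only so this demo can read the value back.
const asciiSource = 'var remoteCustName = "Jos\\u00E9"; return remoteCustName;';

// Every character is below 0x80, so the file's bytes are identical in
// ISO-8859-1 and UTF-8.
const isAscii = Array.from(asciiSource).every((ch) => ch.charCodeAt(0) < 0x80);
console.log(isAscii); // true

// Evaluate the text; new Function stands in for loading the file via a script tag.
const value = new Function(asciiSource)();
console.log(value === "Jos\u00E9"); // true
```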

The Problem

JavaScript uses Unicode as its underlying character set for all text strings. However, characters don’t simply appear in the interpreter; they get there from a file. File types that commonly contain JavaScript program text include these:

  • html
  • js
  • jsp

The JavaScript interpreter receives text from these files and interprets it as JavaScript. Although all text inside the interpreter is Unicode, the encoding of the surrounding html, js, or jsp file is not always a Unicode encoding. The text that contains the JavaScript source can arrive in a variety of charset encodings.

The Solution

There are several things to remember about charset encodings and JavaScript:

  1. The JavaScript interpreter works with Unicode.
  2. The JavaScript interpreter converts JavaScript text into Unicode.
  3. The JavaScript interpreter assumes that JavaScript strings are encoded in the charset of the enclosing HTML or JSP document.
  4. When linking to external JavaScript files (.js) from HTML, the interpreter will assume that the external file is encoded in the same charset as the HTML document unless you override that assumption with a charset attribute.
  5. Always use the charset attribute in script tags.
  6. Specifically, you probably should save all JavaScript files as UTF-8 encoded files and use the charset="UTF-8" attribute in script tags.
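
For the last point, an existing ISO-8859-1 file has to be re-encoded, not just relabeled. The following is a minimal sketch of that step, assuming Node.js (where Buffer is a global); the function name latin1ToUtf8 is hypothetical and not part of any library.

```javascript
// Re-encode ISO-8859-1 bytes as UTF-8. Assumes Node.js, where Buffer is a global.
function latin1ToUtf8(latin1Buffer) {
  // Decode: Node's "latin1" encoding maps each byte to the code point of the same value.
  const text = latin1Buffer.toString("latin1");
  // Encode the resulting string as UTF-8 bytes.
  return Buffer.from(text, "utf8");
}

const before = Buffer.from([0x4a, 0x6f, 0x73, 0xe9]); // "José" in ISO-8859-1
const after = latin1ToUtf8(before);
console.log(Array.from(after).join(", ")); // 74, 111, 115, 195, 169
```

In a real conversion, the input bytes would come from reading remoteCustName.js (for example with fs.readFileSync) and the result would be written back to disk; the script tag would then declare UTF-8 instead of ISO-8859-1.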
