Writing UTF-8 CSV Files for Excel

Yesterday a coworker complained that Excel wasn’t displaying a CSV (comma-separated values) file correctly. Our application lets the user send a report via email, and it provides that report as a CSV file. Because the report can contain multilingual text, we’ve decided to encode it in UTF-8. Unfortunately, when users click on the file to display it, usually in Excel, all of the multi-byte encoded characters display incorrectly.

The problem was immediately clear to me: Excel was opening the UTF-8 encoded files but incorrectly identifying them as Latin-1 encoded files. In the absence of any charset identification, Excel must guess a file’s content encoding. In our environment, many host PCs use en_US locales with Latin-1 as the typical charset, and Excel uses that default to read and display CSV files.
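
To make the symptom concrete, here is a minimal sketch of the misreading described above: the text is encoded as UTF-8 and then decoded as Latin-1. The class and method names are made up for illustration, and this only mimics the effect; it is not how Excel itself is implemented.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes as Latin-1,
    // which is the wrong guess described in the paragraph above.
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is two bytes in UTF-8: C3 A9.
        // Read as Latin-1, each byte becomes its own character.
        String garbled = misreadAsLatin1("\u00e9");
        System.out.println(garbled.length()); // 2 characters instead of 1
    }
}
```

Every character outside ASCII turns into two or more unrelated Latin-1 characters this way, which matches the mangled output our users saw.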

My solution to the problem was to use the byte order mark (BOM) to identify the CSV file as a Unicode file. I instructed my colleague to prepend the U+FEFF character to the file; encoded as UTF-8, that character becomes the three bytes EF BB BF. The Java application that creates the CSV file uses a FileWriter that encodes to UTF-8, so it was simple to output the BOM as the first character in the file.
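
A minimal sketch of that fix might look like the following. It uses an OutputStreamWriter with an explicit UTF-8 charset rather than our application's actual writer, and the class name, method name, file name, and sample data are all assumptions for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class BomCsvWriter {

    // Writes csvBody to the file as UTF-8, preceded by a BOM.
    static void writeCsv(Path file, String csvBody) throws IOException {
        try (OutputStream out = Files.newOutputStream(file);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            // U+FEFF must be the very first thing written; the UTF-8
            // encoder turns it into the bytes EF BB BF.
            writer.write('\uFEFF');
            writer.write(csvBody);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical report content with non-ASCII characters.
        writeCsv(Paths.get("report.csv"), "name,city\r\nJos\u00e9,M\u00fcnchen\r\n");
    }
}
```

The important detail is ordering: the BOM goes straight to the underlying writer before any CSV content, so nothing (not even a quote character) precedes it in the file.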

Now when our customers double-click on these files, Excel opens the file, notices the BOM, and automatically selects UTF-8 as the file’s charset encoding. The previously mangled characters display correctly, and I was able to help resolve the problem with an easy fix.

Maybe you can give your applications a hint about plain text files as well. Writing the BOM to your file can help Unicode-enabled applications know how to decode your Unicode files.
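
On the reading side, an application can take the hint by checking the first three bytes before choosing a decoder. The sketch below is only an illustration of that idea: the names are hypothetical, and the Latin-1 fallback is an assumption that mirrors the default described earlier, not a description of what Excel actually does.

```java
import java.nio.charset.StandardCharsets;

final class BomSniffer {

    // True if the data begins with the UTF-8 encoding of U+FEFF (EF BB BF).
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Decodes as UTF-8 (skipping the BOM) when a BOM is present;
    // otherwise falls back to Latin-1 (an assumed default).
    static String decode(byte[] data) {
        if (startsWithUtf8Bom(data)) {
            return new String(data, 3, data.length - 3, StandardCharsets.UTF_8);
        }
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 0xC3, (byte) 0xA9};
        System.out.println(decode(withBom).length()); // one character: U+00E9
    }
}
```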

2 Comments

  1. Ben says:

    pardon the obvious pun but that’s the BOM

  2. VM says:

    Hi there,

    I am trying to display the Thai baht symbol in a CSV file. I write it using java.util.Currency:

    Currency.getInstance("THB").getSymbol(new Locale("th", "TH"))

    I enabled Thai using the Control Panel, but Excel shows a garbage character, whereas I am able to see the symbol in Notepad. I used your suggestion but it didn’t work.

    I also tried UTF-16LE encoding when writing the BOM, but that did not work either. Any suggestions? Thanks.

    Here is the test case:
    ======================
    import java.io.*;
    import org.apache.commons.io.FileUtils;
    import com.Ostermiller.util.CSVPrint;
    import com.Ostermiller.util.ExcelCSVPrinter;
    import java.util.*;

    public class ThaiCurrencySymbolTest {

        public static void main(String[] args) throws Exception {
            File f1 = new File("C:\\Test-Thai-Symbol.csv");
            CSVPrint csvWriter = ThaiCurrencySymbolTest.getCSVPrintWriter(f1);
            csvWriter.print(Currency.getInstance("THB").getSymbol(new Locale("th", "TH")) + "1523.65");
            csvWriter.flush();
            csvWriter.close();
        }

        public static CSVPrint getCSVPrintWriter(File file) throws IOException {
            //OutputStreamWriter buf = new OutputStreamWriter(new FileOutputStream(file),"TIS620");
            OutputStreamWriter buf = new OutputStreamWriter(new FileOutputStream(file),"UTF-8");
            CSVPrint csvWriter = new ExcelCSVPrinter(buf);
            csvWriter.setAlwaysQuote(true);
            csvWriter.print("\uFEFF");
            return csvWriter;
        }
    }
