Writing UTF-8 CSV Files for Excel
Yesterday a coworker complained that Excel wasn’t displaying a CSV (comma-separated values) file correctly. Our application allows the user to send a report via email, and it provides that report as a CSV file. Because the report can contain multilingual text, we’ve decided to encode it in UTF-8. Unfortunately, when users click on the file to display it, usually in Excel, all of the multi-byte encoded characters display incorrectly.
The problem was immediately clear to me: Excel was opening the UTF-8 encoded files, but it was incorrectly identifying them as Latin-1 encoded files. In the absence of any charset identification, Excel must guess at a file’s content encoding. In our environment, many host PCs use en_US locales with Latin-1 as the typical charset, and Excel uses that default to read and display CSV files.
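To make the misreading concrete, here is a minimal Java sketch that encodes a string as UTF-8 and then decodes the same bytes as Latin-1, which is roughly what Excel appeared to be doing. The class and method names are my own, chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode text as UTF-8, then decode the bytes with the wrong charset (Latin-1).
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00e9";          // e with acute accent, one character
        String garbled = misread(original);  // two bytes read as two separate characters
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The single accented character is stored as the two bytes C3 A9, so a Latin-1 reader shows two unrelated characters (U+00C3 and U+00A9) in its place.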
My solution to the problem was to use the byte-order mark (BOM) to identify the CSV file as a Unicode file. I instructed my colleague to prepend the U+FEFF character to the file; in UTF-8 it is encoded as the three bytes EF BB BF. The Java application that writes the file uses a FileWriter that encodes to UTF-8 to create the CSV file, so it was simple to output the BOM as the first character in the file.
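A minimal sketch of that idea, assuming a plain OutputStreamWriter rather than our actual application code (the class name, method name, and sample data are made up for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class BomCsvExample {
    // Write U+FEFF first, then the CSV text, all encoded as UTF-8.
    static void writeCsvWithBom(OutputStream out, String csv) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write('\uFEFF'); // the BOM must be the very first character
        writer.write(csv);
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeCsvWithBom(buffer, "name,city\r\nJos\u00e9,M\u00e1laga\r\n");
        byte[] bytes = buffer.toByteArray();
        // Show the first three bytes of the output in hex.
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF); // prints "EF BB BF"
    }
}
```

To write a real file, pass a FileOutputStream instead of the in-memory buffer used here.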
Now when our customers double-click on these files, Excel opens the file, notices the BOM, and automatically selects UTF-8 as the file’s charset encoding. The previously mangled characters display correctly, and I was able to help resolve a problem with an easy solution.
Maybe you can give your applications a hint about plain text files as well. Writing the BOM to your file can help Unicode-enabled applications know how to decode your Unicode files.
pardon the obvious pun but that’s the BOM
Hi there,
I am trying to display the Thai baht symbol in a CSV file. I write it using java.util.Currency:
Currency.getInstance("THB").getSymbol(new Locale("th", "TH"))
I enabled Thai in the Control Panel, but Excel shows a garbage character, whereas I can see the symbol in Notepad. I tried your suggestion, but it didn’t work.
I also tried UTF-16LE encoding to set the BOM, but that did not work either. Any suggestions? Thanks.
Here is the test case:
======================
import java.io.*;
import org.apache.commons.io.FileUtils;
import com.Ostermiller.util.CSVPrint;
import com.Ostermiller.util.ExcelCSVPrinter;
import java.util.*;

public class ThaiCurrencySymbolTest {
    public static void main(String[] args) throws Exception {
        File f1 = new File("C:\\Test-Thai-Symbol.csv");
        CSVPrint csvWriter = ThaiCurrencySymbolTest.getCSVPrintWriter(f1);
        csvWriter.print(Currency.getInstance("THB").getSymbol(new Locale("th", "TH")) + "1523.65");
        csvWriter.flush();
        csvWriter.close();
    }

    public static CSVPrint getCSVPrintWriter(File file) throws IOException {
        //OutputStreamWriter buf = new OutputStreamWriter(new FileOutputStream(file),"TIS620");
        OutputStreamWriter buf = new OutputStreamWriter(new FileOutputStream(file),"UTF-8");
        CSVPrint csvWriter = new ExcelCSVPrinter(buf);
        csvWriter.setAlwaysQuote(true);
        csvWriter.print("\uFEFF");
        return csvWriter;
    }
}
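
One possible cause worth checking in a test case like this: the BOM is sent through the CSV printer after setAlwaysQuote(true), so it may be written as a quoted field, which would make the file start with a quote character rather than the BOM. I have not verified this against the Ostermiller library. The sketch below uses a hypothetical quoteField helper as a stand-in for a quote-everything printer, and only compares the first byte produced by each ordering.

```java
import java.nio.charset.StandardCharsets;

class BomPlacementDemo {
    // Hypothetical stand-in for a CSV printer that quotes every field.
    static String quoteField(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // BOM printed as if it were a CSV field: the opening quote comes first.
    static byte[] bomAsField(String value) {
        String line = quoteField("\uFEFF") + "," + quoteField(value);
        return line.getBytes(StandardCharsets.UTF_8);
    }

    // BOM written to the underlying stream before any CSV output.
    static byte[] bomFirst(String value) {
        String line = "\uFEFF" + quoteField(value);
        return line.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "\u0E3F1523.65"; // U+0E3F is the Thai baht sign
        System.out.printf("as field:  first byte %02X%n", bomAsField(value)[0] & 0xFF);
        System.out.printf("BOM first: first byte %02X%n", bomFirst(value)[0] & 0xFF);
    }
}
```

If that is what is happening, writing the BOM directly to the OutputStreamWriter (for example buf.write('\uFEFF')) before wrapping it in the CSV printer should put the BOM at the very start of the file.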