Okapi / Charset_Handling

Background

If you're not familiar with Unicode, read this to get started.

Best Practices

Never trust the default platform encoding

Many I/O methods will cheerfully let you call them without specifying an encoding, and depending on what kind of machine you're using and what kind of data you're testing with, your code might even work! Later, when you are sleeping, somebody else will try to run the code and it will break horribly. This happens because Java falls back to a default encoding unless you tell it not to, and before Java 18 that default depends on the platform.

For example, I can write code like this:

String data = "This String contains a snowman: \u2603";
Writer w = new OutputStreamWriter(new FileOutputStream("snowman.txt"));
w.write(data);
w.close();

This code is trying to write out Unicode data (U+2603, "snowman") to a file. This will work on platforms whose default encoding can represent U+2603 (such as UTF-8) and corrupt the data on others: with a default like windows-1252, the snowman is silently written as "?". To ensure this always works, you need to specify the encoding for the character-to-byte conversion that OutputStreamWriter is doing:

Writer w = new OutputStreamWriter(new FileOutputStream("snowman.txt"), StandardCharsets.UTF_8);
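As a fuller illustration of the fix above, here is a minimal sketch of a complete write with an explicit encoding. The class name SnowmanWriter and the helper writeUtf8 are made up for this example; try-with-resources replaces the explicit close() so the file is closed even if the write fails.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example class, not part of Okapi.
public class SnowmanWriter {

    // Writes data to the file as UTF-8, ignoring the platform default encoding.
    static void writeUtf8(Path path, String data) throws IOException {
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(path.toFile()), StandardCharsets.UTF_8)) {
            w.write(data);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("snowman", ".txt");
        writeUtf8(path, "This String contains a snowman: \u2603");
        // In UTF-8, U+2603 occupies three bytes (E2 98 83), so the file is
        // two bytes longer than the number of characters in the String.
        System.out.println(Files.size(path) + " bytes written to " + path);
    }
}
```

Because the encoding is fixed in the code, the bytes on disk should be the same on every machine that runs it.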

Don't use FileWriter or FileReader

FileWriter and FileReader assume the default encoding unless you use the Charset-accepting constructors that were only added in Java 11. Don't use them! Use OutputStreamWriter or InputStreamReader to wrap a stream and specify a Charset to use.
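A minimal sketch of the reading side, assuming you want the whole file as a String. The class ExplicitCharsetReader and the method readAll are hypothetical names chosen for this example; the point is only that InputStreamReader wraps a FileInputStream and is given the Charset explicitly.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical example class, not part of Okapi.
public class ExplicitCharsetReader {

    // Reads the entire file, decoding its bytes with the given Charset
    // instead of whatever the platform default happens to be.
    static String readAll(String fileName, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

A call such as readAll("snowman.txt", StandardCharsets.UTF_8) then decodes the file the same way on every platform.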

Prefer Charset objects to String charset names

There are very few cases in which we should ever need to catch UnsupportedEncodingException. Usually, this checked exception is being thrown by a standard library function that takes a String parameter representing a charset name. This charset is almost always something that we know is supported, like UTF-8:

InputStream is = ...;
try {
    Reader r = new InputStreamReader(is, "UTF-8");
}
catch (UnsupportedEncodingException e) {
    throw new OkapiUnsupportedEncodingException(e); // this will never happen
}

Many of these methods also offer a signature that takes a Charset instance instead. This is much nicer, since it doesn't throw a checked exception. Java 7 simplified Charset use by providing constants for some of the most commonly-used Charsets in the new StandardCharsets class. Now we can rewrite that code without the exception handling getting in the way:

InputStream is = ...;
Reader r = new InputStreamReader(is, StandardCharsets.UTF_8);

There's lots of old code that still passes charsets around by name. We'll clean that up in time. Meanwhile, we should try to use and pass around Charset objects whenever possible. If you only have a name, you can convert it to a Charset by using Charset.forName(), which throws an unchecked exception (UnsupportedCharsetException or IllegalCharsetNameException) for reasons that only the JDK developers know. Notably, you will need to do this if you need to use UTF-32, which StandardCharsets doesn't expose.
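A minimal sketch of converting a name to a Charset at the boundary where the name comes in. The class CharsetLookup and the fallback behavior are assumptions made for this example; whether falling back is appropriate depends on the caller. Note that UTF-32 ships with common JDKs but is not one of the charsets the Java specification guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical example class, not part of Okapi.
public class CharsetLookup {

    // Resolves a charset name once, returning the fallback if the name is
    // syntactically illegal or not supported by this JVM.
    static Charset lookup(String name, Charset fallback) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // UTF-32 has no StandardCharsets constant, so it must be looked up by name.
        Charset utf32 = lookup("UTF-32", StandardCharsets.UTF_8);
        System.out.println("Resolved charset: " + utf32.name());
    }
}
```

After the lookup, the rest of the code can pass the Charset object around and never touch the name again.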

Additional Charset Utilities (new in M25)

The net.sf.okapi.common.Util class contains several additional methods to wallpaper over Charset-related gaps in the Java class library:

* Util.URLEncodeUTF8() and Util.URLDecodeUTF8() are wrappers for URLEncoder and URLDecoder methods that don't take Charset and require exception handling.
* Util.charsetPrintWriter() creates a PrintWriter that writes to a given filename with a given Charset. (PrintWriter provides a constructor that takes filename + charset name, but not a native Charset.)
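To show the kind of gap these utilities cover, here is a rough sketch of what such wrappers could look like. This is not Okapi's actual implementation: the class CharsetHelpers and the method names urlEncodeUtf8, urlDecodeUtf8 and printWriterFor are invented for illustration, and the real Util methods may differ in signature and behavior.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers, sketched for illustration only.
public class CharsetHelpers {

    // Hides the checked exception that the name-based URLEncoder API forces on callers.
    static String urlEncodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Same idea for decoding.
    static String urlDecodeUtf8(String s) {
        try {
            return URLDecoder.decode(s, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds a PrintWriter for a file from a Charset object rather than a charset name.
    static PrintWriter printWriterFor(String fileName, Charset charset) throws IOException {
        return new PrintWriter(new OutputStreamWriter(new FileOutputStream(fileName), charset));
    }
}
```

The design choice in each case is the same: accept or assume a Charset up front so callers never deal with charset names or UnsupportedEncodingException.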
