Detect the charset in Java strings

Blog
Dec 13, 2017
Lluís Turró Cutiller
7,077
0

Before start, I would like to mention Apache Tika and juniversalchardet. Tika is a full-featured file type detection library and, because so much features, takes a big amount of dependencies. I haven't tried juniversalchardet for does not detect ISO-8859-1, which is the reason I needed charset detection.

Since none well suited my problem, I decided to detect charsets myself and, once results were in production, share it with anyone else. Hope you like it

Why charset detection?

Anyone developing web applications with data inputs and third-party frameworks, with a different charset than UTF-8, might have encountered the need to auto-detect charset. Guessing the source of the input on utility classes, or passing the charset along among methods, doesn't seem to be the right way and isn't always possible.

Changing the string charset

We'll need a convert method, in order to change the string charset. The most simple way would be using String supplied methods. Something like:

public String convert(String value, String fromEncoding, String toEncoding) {
  return new String(value.getBytes(fromEncoding), toEncoding);
}

The problem remains, though. The variable fromEncoding isn't always known.

Charset guessing

Guessing? Well, let's be clear, we are guessing. Also taking some premises that might be not true. For instance, we probe using UTF-8 against a set of expected charsets. The good thing about it is that we know the elements at play and can change them at will.

The approach is very simple: if I do change the string from the expected charset to UTF-8 and then back from UTF-8 to the expected charset, shouldn't be the resulting string exactly the same than the original one?

Let's put this at work:

public static String charset(String value, String charsets[]) {
  String probe = StandardCharsets.UTF_8.name();
  for(String c : charsets) {
    Charset charset = Charset.forName(c);
    if(charset != null) {
      if(value.equals(convert(convert(value, charset.name(), probe), probe, charset.name()))) {
        return c;
      }
    }
  }
  return StandardCharsets.UTF_8.name();
}

A possible call to the charset() method would be:

String detectedCharset = charset(value, new String[] { "ISO-8859-1", "UTF-8" });

As I said, the approach uses the premise that UTF-8 will behave well on all transformations and that there is a reduced set of expected charsets. I haven't tried probing the whole Charset.availableCharsets(). In case you do and find a better way, please let me know.

7 review(s)

Comments


© turro.org, 2011-2018

lluis@turro.org
Tel. +34 609323947