Java - Automatic Charset Detection of Web Page

jchardet is a Java port of Mozilla's automatic charset detection algorithm. It is the library I used to detect the charset of web pages.

How do browsers guess the charset of a web page? They examine the data byte by byte and try to infer the charset (this is what happens when you choose View->Auto-Select or Auto-Detect from the menu). The algorithm, originally developed by Frank Tang, looks at the byte sequence and, based on the value of each byte, uses elimination logic to narrow the candidates down to a final charset. If there is a tie between EUC charsets, it applies a second step to break the tie. That step uses frequency statistics of characters in a given language.
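To make the elimination idea concrete, here is a minimal sketch. It is not jchardet's or Mozilla's code. It uses the JDK's own strict decoders as a stand-in technique: a candidate charset is eliminated as soon as the bytes are not a legal sequence in it. The class name, method names, and the candidate list are all assumptions made up for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of jchardet or Mozilla's detector.
class CharsetElimination {

    // Candidate charsets to test (an arbitrary choice for this sketch).
    static final List<Charset> CANDIDATES = Arrays.asList(
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);

    // True if every byte sequence in data is legal in the given charset.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // illegal byte sequence: eliminate this candidate
        }
    }

    // Names of the candidates that survive elimination, in candidate order.
    static List<String> survivors(byte[] data) {
        List<String> names = new ArrayList<>();
        for (Charset candidate : CANDIDATES) {
            if (decodesCleanly(data, candidate)) {
                names.add(candidate.name());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // "h", e-acute, "llo" encoded as UTF-8 and as ISO-8859-1.
        byte[] utf8 = {'h', (byte) 0xC3, (byte) 0xA9, 'l', 'l', 'o'};
        byte[] latin1 = {'h', (byte) 0xE9, 'l', 'l', 'o'};
        System.out.println(survivors(utf8));   // [UTF-8, ISO-8859-1]
        System.out.println(survivors(latin1)); // [ISO-8859-1]
    }
}
```

Note that the UTF-8 input leaves two survivors, because ISO-8859-1 accepts any byte. Ties like this are why elimination alone is not enough and a second, statistics-based step is needed.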

As quoted from the website: Java's String (and char) types store data as Unicode values. When handling international text from an outside source, we need to provide the encoding of the text so that it is converted to the correct Unicode values. This means you have to know the encoding of all the text that your Java code handles. Many Internet-based Java applications have to deal with data from arbitrary sources, and the encoding is not always explicitly known. For example, in an HTML page with no meta tag explicitly specifying the charset, it is very hard to determine the encoding, and the conversion to a Java Unicode string will end up corrupting the data.
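A small sketch of that corruption, using only the standard library: the same five bytes are decoded once with the charset they were written in and once with a wrong guess. The class and method names are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of decoding bytes with the right and wrong charset.
class DecodeDemo {

    // Converts raw bytes to a Java (Unicode) string using the stated charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute, encoded as UTF-8 (e-acute takes two bytes).
        byte[] raw = {'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9};

        String right = decode(raw, StandardCharsets.UTF_8);
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4: the two bytes become one character
        System.out.println(wrong.length()); // 5: each byte becomes its own character
        System.out.println(right.equals(wrong)); // false
    }
}
```

No exception is thrown in the wrong case. The string simply contains different characters (U+00C3 and U+00A9 instead of U+00E9), which is why a bad guess tends to go unnoticed until the text is displayed.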

NOTE: For Python, Universal Encoding Detector can be used.

