
Java – Language Identification in Web Page

By admin on Nov 9, 2007 in Java, open source

Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents. One basic problem in text categorization is identifying the language a document is written in, which can be solved with n-gram-based text categorization.

An n-gram is a (short) sequence of atoms such as bytes, characters, or words.
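To make the definition concrete, here is a minimal sketch that lists the character n-grams of a string. The class name NGramExample and the method charNGrams are made up for this illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

class NGramExample {

    /** Returns every contiguous character n-gram of length n, in order of appearance. */
    static List<String> charNGrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        // Slide a window of width n over the text, one character at a time.
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("java", 2)); // prints [ja, av, va]
    }
}
```

The same sliding window works over a byte[] or a list of words; only the type of the atom changes.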

NGramJ is a Java-based library containing two types of n-gram-based applications. Its major focus is to provide robust, state-of-the-art language recognition (or language guessing, as some more correctly call it). Both types are meant to be embedded into larger applications.

NGramJ – This uses n-grams of bytes to determine both the language and the encoding of a byte sequence. In symbols:
NGramJ : byte[] -> (Language, Encoding)
CNgram – This uses n-grams of characters to determine the language of a character sequence. In symbols:
CNgram : char[] -> Language
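
The idea behind a mapping from characters to a language can be sketched in a few lines of Java. This is not NGramJ's API or its actual algorithm. It is a rough illustration of one common approach to n-gram-based text categorization: build a ranked profile of the most frequent n-grams for each language, build the same kind of profile for the input, and pick the language whose ranking is closest ("out-of-place" distance). The class name LanguageGuesserSketch, its methods, the constants MAX_N and PROFILE_SIZE, and the tiny training sentences are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class LanguageGuesserSketch {
    // Hypothetical tuning values, chosen only for this illustration.
    static final int MAX_N = 3;          // use 1-, 2- and 3-grams
    static final int PROFILE_SIZE = 300; // keep at most this many n-grams per profile

    /** Maps the most frequent character n-grams of the text to their rank (0 = most frequent). */
    static Map<String, Integer> profile(CharSequence text) {
        String s = text.toString().toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= MAX_N; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties are broken alphabetically so the ranking is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ranks = new HashMap<>();
        for (int r = 0; r < entries.size() && r < PROFILE_SIZE; r++) {
            ranks.put(entries.get(r).getKey(), r);
        }
        return ranks;
    }

    /** Sum of rank differences; n-grams missing from the language profile cost PROFILE_SIZE. */
    static long distance(Map<String, Integer> doc, Map<String, Integer> lang) {
        long total = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer rank = lang.get(e.getKey());
            total += (rank == null) ? PROFILE_SIZE : Math.abs(rank - e.getValue());
        }
        return total;
    }

    /** Returns the name of the language profile closest to the text, or null if there are none. */
    static String guess(CharSequence text, Map<String, Map<String, Integer>> languages) {
        Map<String, Integer> doc = profile(text);
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : languages.entrySet()) {
            long d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training data; a real profile would be built from far more text.
        Map<String, Map<String, Integer>> languages = new LinkedHashMap<>();
        languages.put("en", profile(
            "the quick brown fox jumps over the lazy dog and then the dog runs away from the fox"));
        languages.put("de", profile(
            "der schnelle braune fuchs springt ueber den faulen hund und dann rennt der hund vor dem fuchs weg"));
        System.out.println(guess("the dog and the fox", languages)); // prints en
    }
}
```

With training text this small the guess is only reliable for inputs that closely resemble the samples. A byte-based variant that also guesses the encoding would need one profile per (language, encoding) pair and is left out here.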
