N-Gram Analysis with TCatNG
By admin on Aug 30, 2009 in Java, open source
The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.
N-Grams are short sequences of bytes or letters, and their statistics provide valuable information about the byte sequences and strings they are drawn from. N-Gram based approaches are very powerful in text categorization because every string is decomposed into small parts, so errors (a misspelled word, for instance) tend to affect only a limited number of those parts, leaving the remainder intact.
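To make the decomposition concrete, here is a minimal sketch in plain Java. It does not use the TCatNG API; the class and method names (NGramDemo, countNGrams, sharedNGrams) are made up for illustration. It counts the overlapping character N-Grams of a string and shows how many trigrams survive a one-letter spelling difference.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration; not part of TCatNG.
class NGramDemo {
    // Count every overlapping character N-Gram of length n in the text.
    static Map<String, Integer> countNGrams(String text, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Number of distinct N-Grams that two strings have in common.
    static int sharedNGrams(String a, String b, int n) {
        Set<String> shared = new HashSet<>(countNGrams(a, n).keySet());
        shared.retainAll(countNGrams(b, n).keySet());
        return shared.size();
    }

    public static void main(String[] args) {
        // Bigrams of "banana": ba, an, na, an, na
        System.out.println(countNGrams("banana", 2)); // prints {ba=1, an=2, na=2}
        // A single changed letter only disturbs the trigrams that overlap it.
        System.out.println(sharedNGrams("categorization", "categorisation", 3)); // prints 9
    }
}
```

Each of the two spellings has 12 distinct trigrams, and 9 of them are identical, so a comparison based on N-Gram statistics still sees the two strings as very similar.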
Character N-Grams also do not require a separator to be specified, explicitly or implicitly, as is necessary for words. Consequently, N-Gram analysis is a valuable approach for text written in any language based on an alphabet and the concatenation text-construction operator, and it eliminates the need for complex text tokenization, stemming, or lemmatization.
There are many possible applications: categorizing documents by topic, detecting the author of a text, or recognizing the language and encoding of a sequence of bytes (e.g. in a search engine, to determine the language of a document). Language and encoding recognition is actually the first application this software package was designed for, but many other uncharted areas are yours to explore.
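As an illustration of the language recognition use case, the sketch below ranks the trigrams of a text by frequency and compares that ranking against one reference ranking per language, using an "out-of-place" rank distance. This is one well-known way to categorize text with N-Grams and is not taken from the TCatNG API; the class name, the constants, and the tiny training sentences are all assumptions chosen to keep the example short. Real profiles would be built from much larger samples.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical classifier for illustration; not part of TCatNG.
class RankProfileClassifier {
    static final int N = 3;              // N-Gram length (assumed value)
    static final int PROFILE_SIZE = 300; // maximum profile length (assumed value)

    private final Map<String, List<String>> profiles = new LinkedHashMap<>();

    // Build a profile: the N-Grams of the text, most frequent first.
    static List<String> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + N <= text.length(); i++) {
            counts.merge(text.substring(i, i + N), 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Descending count; ties broken alphabetically so the order is deterministic.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked.subList(0, Math.min(PROFILE_SIZE, ranked.size()));
    }

    // Sum of rank differences; N-Grams missing from the category get a fixed penalty.
    static int distance(List<String> doc, List<String> category) {
        Map<String, Integer> rankOf = new HashMap<>();
        for (int r = 0; r < category.size(); r++) {
            rankOf.put(category.get(r), r);
        }
        int total = 0;
        for (int r = 0; r < doc.size(); r++) {
            Integer other = rankOf.get(doc.get(r));
            total += (other == null) ? PROFILE_SIZE : Math.abs(r - other);
        }
        return total;
    }

    // Store a reference profile for one category (here, a language).
    void train(String label, String sample) {
        profiles.put(label, profile(sample));
    }

    // Return the label whose profile is closest to the text, or null if untrained.
    String classify(String text) {
        List<String> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        RankProfileClassifier c = new RankProfileClassifier();
        // Toy training data, far too small for real use.
        c.train("english", "the quick brown fox jumps over the lazy dog and the cat");
        c.train("spanish", "el rapido zorro marron salta sobre el perro perezoso y el gato");
        System.out.println(c.classify("the dog and the fox"));
    }
}
```

The same structure works for topic or author categorization: only the labels and the training samples change, since nothing in the code depends on words or on a particular language.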
In sum, this package offers a robust research framework for experimenting with text categorization using character N-Grams.