RSS Feed for This PostCurrent Article

N-Gram Analysis with TCatNG

The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.

N-Grams are short sequences of bytes or letters, and their statistics provide valuable informations about byte sequences and strings. N-Gram based approaches are very powerful in text categorization because every string is decomposed into small parts, and errors tend to affect only a limited number of those parts, leaving the remainder intact.

The use of character N-Grams also does not explicitly or implicitly require the specification of a separator, as it is necessary for words. Consequently, analyzing a text in terms of N-Grams constitutes a valuable approach for text written in any language based on an alphabet and the concatenation text-construction operator, eliminating the need for complex text tokenization, stemming, and/or lemmatization.

There are many possible applications: categorizing documents by topic, detecting the author of a text, or recognizing the language and encoding for a bunch of bytes (i.e. in a search engine, to figure the language of a document). This is actually the first application this software package was designed for, but many other uncharted areas are up to you to explore.

In sum, this package offers a robust research framework for experimenting with text categorization using character N-Grams.

Trackback URL

Sorry, comments for this entry are closed at this time.