The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques
to the process of categorizing text files.
N-Grams are short sequences of bytes or letters, and their statistics provide valuable information
about byte sequences and strings. N-Gram-based approaches are robust in text categorization
because every string is decomposed into small overlapping parts, so an error tends to affect only
a limited number of those parts and leaves the remainder intact.
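
The decomposition described above can be sketched in a few lines of Java. This is only an
illustration and not part of the TCatNG API: the class NGramSketch and its methods ngrams and
sharedDistinct are hypothetical names, and characters are treated as Java char values (UTF-16
code units) for simplicity.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not a TCatNG class.
final class NGramSketch {

    /** Returns every overlapping character N-Gram of length n, in order of appearance. */
    static List<String> ngrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Counts the distinct N-Grams that occur in both strings. */
    static int sharedDistinct(String a, String b, int n) {
        Set<String> shared = new HashSet<>(ngrams(a, n));
        shared.retainAll(new HashSet<>(ngrams(b, n)));
        return shared.size();
    }

    public static void main(String[] args) {
        // The 12 trigrams of "categorization".
        System.out.println(ngrams("categorization", 3));
        // A one-letter spelling difference changes only 3 of them; prints 9.
        System.out.println(sharedDistinct("categorization", "categorisation", 3));
    }
}
```

A single changed letter touches at most n of the N-Grams, which is why the rest of the
string's profile survives a typo or a spelling variant.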
Character N-Grams also do not require a separator to be specified, explicitly or implicitly,
as is necessary for words. Consequently, N-Gram analysis is a valuable approach for text
written in any language based on an alphabet and the concatenation text-construction
operator, and it eliminates the need for complex text tokenization, stemming, and/or
lemmatization.
There are many possible applications: categorizing documents by topic, detecting the author
of a text, or recognizing the language and encoding of a sequence of bytes (e.g. in a search
engine, to determine the language of a document). The latter is the application this software
package was originally designed for, but many other areas remain for you to explore.
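
As a rough illustration of the language-recognition use case, the sketch below compares the
trigram counts of an unknown text with those of a short sample per language and picks the
closest one by cosine similarity. It is a minimal sketch under stated assumptions, not the
algorithm or API that TCatNG actually uses: the class LanguageGuessSketch, its methods, the
choice of cosine similarity, and the tiny sample sentences are all hypothetical. Real
profiles would be built from much larger training texts.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not a TCatNG class.
final class LanguageGuessSketch {

    /** Counts how often each character N-Gram of length n occurs in the text. */
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity of two count profiles; 0.0 if either profile is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the label of the most similar sample, or null if there are no samples. */
    static String guess(String text, Map<String, String> samples, int n) {
        Map<String, Integer> query = profile(text, n);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, String> sample : samples.entrySet()) {
            double score = cosine(query, profile(sample.getValue(), n));
            if (score > bestScore) {
                bestScore = score;
                best = sample.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training samples (assumptions for the example only).
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("en", "the quick brown fox jumps over the lazy dog and the cat");
        samples.put("de", "der schnelle braune fuchs springt in den wald und der hund folgt");
        System.out.println(guess("the dog and the fox", samples, 3));
        System.out.println(guess("der hund und der fuchs", samples, 3));
    }
}
```

No tokenizer or word list is involved: spaces are simply characters inside the trigrams.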