RSS Feed for This PostCurrent Article

Open Source Text Analytics

GATE – General Architecture for Text Engineering


Apache UIMA

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.

UIMA enables applications to be decomposed into components, for example “language identification” => “language specific segmentation” => “sentence boundary detection” => “entity detection (person/place names etc.)”. Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.


RapidMiner (formerly YALE) and its plugins provide more than 400 operators for all aspects of Data Mining. Meta operators automatically optimize the experiment designs and users no longer need to tune single steps or parameters any longer. A huge amount of visualization techniques and the possibility to place breakpoints after each operator give insight into the success of your design – even online for running experiments. On this page we discuss the main groups of operators and give operator examples for each of the groups.


NLTK is an open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks, with distributions for Windows, Mac OSX and Linux.


OpenNLP provides the organizational structure for coordinating several different projects which approach some aspect of Natural Language Processing. OpenNLP also defines a set of Java interfaces and implements some basic infrastructure for NLP components.

R Text Mining

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. R Text Mining package can be used for text analysis.

Trackback URL

Sorry, comments for this entry are closed at this time.