Open Source Text Analytics
By admin on Mar 14, 2009 in Java, open source
GATE – General Architecture for Text Engineering
GATE is
- the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, a leading toolkit for Text Mining
- used worldwide by thousands of scientists, companies, teachers and students
- composed of an architecture, a free open source framework (or SDK) and a graphical development environment
- used for all sorts of language processing tasks, including Information Extraction in many languages
- funded by the EPSRC, BBSRC, AHRC, the EU and commercial users
- a 100% Java reference implementation of ISO TC37/SC4, used with XCES in the ANC
- 10 years old in 2005, used in many research projects and compatible with IBM’s UIMA
- based on MVC, mobile code, continuous integration, and test-driven development, with code hosted on SourceForge
Apache UIMA
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA enables applications to be decomposed into components, for example “language identification” => “language specific segmentation” => “sentence boundary detection” => “entity detection (person/place names etc.)”. Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
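The decomposition described above can be illustrated with a minimal Python sketch. This is not the UIMA API: the names `Document`, `Annotation`, `run_pipeline` and the toy `GAZETTEER` are hypothetical, and the language and sentence rules are deliberately naive. It only shows the general idea of independent components that read and write annotations on one shared structure, while a small driver manages the data flow between them.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # e.g. "language", "sentence", "entity"
    begin: int  # start offset in the document text
    end: int    # end offset (exclusive)
    value: str  # label carried by the annotation


@dataclass
class Document:
    """Shared structure handed from one component to the next."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        # Return annotations of one kind, ordered by position.
        found = [a for a in self.annotations if a.kind == kind]
        return sorted(found, key=lambda a: a.begin)


# Hypothetical lookup table standing in for a real entity detector.
GAZETTEER = {
    "Alice Smith": "person",
    "Acme Corp": "organization",
    "London": "place",
}


def identify_language(doc):
    # Toy heuristic: a few English function words mean "en".
    words = set(re.findall(r"[a-z]+", doc.text.lower()))
    lang = "en" if words & {"the", "in", "for", "and"} else "unknown"
    doc.annotations.append(Annotation("language", 0, len(doc.text), lang))


def detect_sentences(doc):
    # Toy rule: a sentence runs up to the next '.', '!' or '?'.
    for m in re.finditer(r"\S[^.!?]*[.!?]", doc.text):
        doc.annotations.append(Annotation("sentence", m.start(), m.end(), ""))


def detect_entities(doc):
    # Builds on the sentence annotations left by the previous component.
    for sent in doc.select("sentence"):
        span = doc.text[sent.begin:sent.end]
        for name, label in GAZETTEER.items():
            for m in re.finditer(re.escape(name), span):
                begin = sent.begin + m.start()
                doc.annotations.append(
                    Annotation("entity", begin, begin + len(name), label)
                )


PIPELINE = [identify_language, detect_sentences, detect_entities]


def run_pipeline(text, components=PIPELINE):
    # The driver owns the data flow: each component sees the same Document.
    doc = Document(text)
    for component in components:
        component(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline("Alice Smith works for Acme Corp in London.")
    for ann in result.select("entity"):
        print(result.text[ann.begin:ann.end], ann.value)
```

Because each component only touches the shared `Document`, components can be reordered, replaced or tested in isolation. That is the property the framework description above emphasizes, although real UIMA components also ship XML descriptors and a typed analysis structure that this sketch leaves out.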
RapidMiner
RapidMiner (formerly YALE) and its plugins provide more than 400 operators for all aspects of Data Mining. Meta operators automatically optimize the experiment designs, so users no longer need to tune single steps or parameters by hand. A large number of visualization techniques and the possibility to place breakpoints after each operator give insight into the success of your design, even while experiments are running.
NLTK
NLTK is a suite of open source Python modules, linguistic data and documentation for research and development in natural language processing. It supports dozens of NLP tasks, with distributions for Windows, Mac OS X and Linux.
OpenNLP
OpenNLP provides the organizational structure for coordinating several different projects which approach some aspect of Natural Language Processing. OpenNLP also defines a set of Java interfaces and implements some basic infrastructure for NLP components.
R Text Mining
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. The R text mining package (tm) can be used for text analysis.