In my previous article, Combine Crunch and Lucene for Efficient Web Page Indexing, I mentioned that I used Crunch and Lucene in one of my projects. The project actually aims to build a semantic knowledge framework. As part of the project, I need to do content/knowledge extraction, semantic tagging, knowledge storage and knowledge representation.
Here I am going to describe GATE, a General Architecture for Text Engineering. I also came across Apache UIMA (formerly IBM UIMA), which is used for Unstructured Information Management applications that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. After comparing the two, I decided GATE would be more suitable for my needs.
GATE is (from the website):
- the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, a leading toolkit for Text Mining
- used worldwide by thousands of scientists, companies, teachers and students
- comprised of an architecture, a free open source framework (or SDK) and graphical development environment
- used for all sorts of language processing tasks, including Information Extraction in many languages
GATE is aimed at scientists and developers performing experiments with language and computation.
The various scientific and engineering disciplines to which GATE is relevant are:
- Computational Linguistics: part of the science of language that uses computation as an investigative tool.
- Natural Language Processing: part of the science of computation whose subject matter is data structures and algorithms for human language processing.
- Language Engineering: building language processing systems whose cost and outputs are measurable and predictable.
GATE is used in a range of project areas, including:
- Knowledge Management and Semantic Web
- Digital Libraries and Cultural Heritage
- E-science and bioinformatics
- Human Language Technology
For starters, it may take a while to dig through the documentation and get familiar with it.
It is a great open source tool, at least for me.