The core of my work is semantic enrichment through various NLP techniques (e.g. annotators or taggers, implemented using methods of various kinds, from statistical/Bayesian to rule-based and gazetteer/ontology-based).
Until now I've been using a UIMA-based commercial solution that does not leverage Hadoop, but I'm now pressed to move to the Hadoop ecosystem in order to gain flexibility and speed, and to be able to deliver mobile applications with the latest GUI standards.
I've searched this discussion group and already found several postings, but most of them precede the release of plugin/ingest-*. In my opinion this is a major improvement and opens up a wider application of semantic technologies in the core of Elasticsearch.
Nevertheless, I am missing an architectural design for the plugin/ingest mechanism with respect to dependency resolution in tagging.
For example, let's say a tagger uses a semantic pattern such as /[0-9]+(Mio)* $CURRENCY$/
to match any kind of large money transaction. It assumes that the text has previously been tokenized and that a "currency tagger" has already been run.
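To make that dependency concrete, here is a minimal, self-contained Java sketch (the Token class, the tag value "CURRENCY" and all names are hypothetical stand-ins of mine, not any existing API): the money-transaction pattern can only match on tokens that an upstream currency tagger has already labelled.

    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    public class MoneyPatternTagger {
        // a token as produced by an upstream tokenizer + tagger (hypothetical types)
        static class Token {
            final String text;
            final String tag; // e.g. "CURRENCY", set by the currency tagger
            Token(String text, String tag) { this.text = text; this.tag = tag; }
        }

        private static final Pattern NUMBER = Pattern.compile("[0-9]+");

        // checks the pattern [0-9]+(Mio)* $CURRENCY$ starting at token i
        static boolean largeTransactionAt(List<Token> tokens, int i) {
            if (i >= tokens.size() || !NUMBER.matcher(tokens.get(i).text).matches()) {
                return false;
            }
            int j = i + 1;
            while (j < tokens.size() && tokens.get(j).text.equals("Mio")) {
                j++;
            }
            return j < tokens.size() && "CURRENCY".equals(tokens.get(j).tag);
        }

        public static void main(String[] args) {
            // the upstream tokenizer and currency tagger are simulated by hand here
            List<Token> tokens = Arrays.asList(
                new Token("500", null), new Token("Mio", null), new Token("EUR", "CURRENCY"));
            System.out.println(largeTransactionAt(tokens, 0)); // prints: true
        }
    }

If the currency tagger has not run, the match silently fails, which is exactly the kind of ordering constraint I would like the infrastructure to resolve.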
In my current understanding of the ingest mechanism, the dependency resolution for the above strategy is left entirely to the implementation on the ingest node. This would have as consequences:
a) large code duplication (e.g. some basic activities would have to be repeated again and again),
b) recomputation (e.g. the same basic features would have to be recomputed several times on the same string when using different NERs), and
c) the same NLP taggers could end up being implemented multiple times within the ES environment (see the uimaFIT sketch after this list).
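For contrast, here is how a UIMA/uimaFIT aggregate declares such a chain once, in one place (the three annotators are empty, hypothetical stand-ins; only the uimaFIT calls themselves are real API):

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
    import org.apache.uima.fit.factory.JCasFactory;
    import org.apache.uima.fit.pipeline.SimplePipeline;
    import org.apache.uima.jcas.JCas;

    public class DependencyChainDemo {
        // empty hypothetical stand-ins for real annotators
        public static class TokenizerAnnotator extends JCasAnnotator_ImplBase {
            @Override public void process(JCas jcas) { /* add Token annotations */ }
        }
        public static class CurrencyTagger extends JCasAnnotator_ImplBase {
            @Override public void process(JCas jcas) { /* tag CURRENCY tokens */ }
        }
        public static class MoneyTransactionTagger extends JCasAnnotator_ImplBase {
            @Override public void process(JCas jcas) { /* match [0-9]+(Mio)* $CURRENCY$ */ }
        }

        public static void main(String[] args) throws Exception {
            // the dependency chain is declared once; every consumer inherits it
            AnalysisEngineDescription tokenizer = createEngineDescription(TokenizerAnnotator.class);
            AnalysisEngineDescription currency  = createEngineDescription(CurrencyTagger.class);
            AnalysisEngineDescription money     = createEngineDescription(MoneyTransactionTagger.class);

            JCas jcas = JCasFactory.createJCas();
            jcas.setDocumentText("The deal was worth 500 Mio EUR.");
            SimplePipeline.runPipeline(jcas, tokenizer, currency, money);
        }
    }

Something equivalent in the ingest mechanism, i.e. a way for a processor to declare which annotations it requires instead of recomputing them, is what I am missing.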
Several open source Java technologies are available for this kind of pipeline work:
- GATE (https://gate.ac.uk/download/; https://sourceforge.net/projects/gate/)
- Stanford NLP (http://nlp.stanford.edu/software/; https://github.com/stanfordnlp/CoreNLP)
- DKPro (https://www.ukp.tu-darmstadt.de/research/current-projects/dkpro/)
- LingPipe (http://alias-i.com/lingpipe/)
- Weka (http://www.cs.waikato.ac.nz/ml/weka/)
- Mallet (https://github.com/mimno/Mallet.git)
- JSRE (https://hlt-nlp.fbk.eu/technologies/jsre)
- cTAKES (http://ctakes.apache.org/)
- OpenNLP (http://opennlp.apache.org/)
- Behemoth (https://github.com/DigitalPebble/behemoth)
- BioNLP UIMA wrappers (http://bionlp-uima.sourceforge.net/)
- ClearTK (http://cleartk.github.io/cleartk/)
- JCoRe (http://www.julielab.de/Resources/JCoRe+NLP+Tools.html)
- NaCTeM tools (http://www.nactem.ac.uk/software.php)
- ...
I am considering setting up a hackathon in Germany to speed up this development, thus transforming Elasticsearch into a "CognitiveElasticSearch" ... so the timing is essential, but even more essential is the QC/QA of the underlying infrastructure.
I recently found out that others have walked this path before and decided to use Spark with UIMA.
I would very much like to understand the design methodology of the Elasticsearch ingest plugin mechanism, namely where it makes more sense to use this approach instead of Spark with UIMA.
For example, here is a complete set of examples of how to use DL4J (Deep Learning for Java), which utilises UIMA on the Spark platform:
https://github.com/deeplearning4j/dl4j-spark-cdh5-examples
and the following project shows the use of the cTAKES UIMA module from within the Spark framework.
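To make the comparison concrete, here is a minimal sketch of the Spark-with-UIMA pattern as I picture it (assumptions: the Spark 2.x Java API and uimaFIT on the classpath; NoOpAnnotator is a hypothetical stand-in for a real engine such as cTAKES). Since UIMA engines are not serializable, one engine is built lazily per executor JVM:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
    import org.apache.uima.fit.factory.AnalysisEngineFactory;
    import org.apache.uima.jcas.JCas;

    public class SparkUimaDemo {
        // hypothetical annotator; a real pipeline would plug in cTAKES etc. here
        public static class NoOpAnnotator extends JCasAnnotator_ImplBase {
            @Override public void process(JCas jcas) { /* add annotations here */ }
        }

        // UIMA engines are not serializable, so build one lazily per executor JVM
        private static AnalysisEngine engine;
        private static synchronized AnalysisEngine engine() throws Exception {
            if (engine == null) {
                engine = AnalysisEngineFactory.createEngine(NoOpAnnotator.class);
            }
            return engine;
        }

        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("spark-uima").setMaster("local[*]"));
            JavaRDD<String> docs = sc.parallelize(Arrays.asList(
                "The deal was worth 500 Mio EUR.", "Another document."));
            JavaRDD<Integer> lengths = docs.map(text -> {
                AnalysisEngine ae = engine();
                JCas jcas = ae.newJCas();
                jcas.setDocumentText(text);
                ae.process(jcas); // annotations would be read off the jcas here
                return jcas.getDocumentText().length();
            });
            System.out.println(lengths.collect());
            sc.stop();
        }
    }

The open question for me is then which of the two places, an ingest processor or a Spark map function, is the better home for such an engine.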