Plugin/ingest-NLP semantic enrichment design and pluggability vs Spark UIMA?

The core of my work is semantic enrichment through various NLP techniques (e.g. annotators or taggers implemented with methods of various kinds, from Bayesian to rule-based and gazetteer/ontology-based).
Until now I have been using a UIMA-based commercial solution that does not build on Hadoop, but I am now under pressure to move to the Hadoop ecosystem to gain flexibility and speed, and to be able to deliver mobile applications that meet the latest GUI standards.

I searched this discussion group and found several postings, though most of them predate the release of plugin/ingest-*. In my opinion this is a major improvement that opens up a wider application of semantic technologies in the core of Elasticsearch.

Nevertheless, I am missing an architectural design for the plugin/ingest mechanism w.r.t. dependency resolution in tagging.

For example, let's say a tagger uses a semantic pattern such as /[0-9]+(Mio)* $CURRENCY$/ to identify any kind of large money transaction. It assumes that the text has already been tokenized and that a "currency tagger" has already been run.
In my current understanding of the Ingest mechanism, resolving such dependencies is left entirely to the implementation of the ingest node. The consequences would be:
a) large code duplication (e.g. some basic activities would have to be repeated again and again),
b) recomputation (e.g. the same basic features would have to be recomputed several times on the same string when using different NER taggers),
c) different NLP taggers could be implemented multiple times within the ES environment.
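
To make the concern concrete, here is a minimal sketch in Python (not Elasticsearch or UIMA code) of what declared dependencies between taggers could look like. The `Annotator` class, `run_pipeline`, and the toy currency list are all hypothetical names invented for this illustration. The only point is that each tagger states what it requires, the framework orders the taggers, and each annotation layer is computed once and shared.

```python
import re
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Callable

@dataclass(frozen=True)
class Annotator:
    name: str        # annotation layer this tagger produces
    requires: tuple  # layers that must already exist
    run: Callable    # (text, layers so far) -> annotations

CURRENCIES = {"EUR", "USD", "GBP"}  # toy gazetteer, illustration only

def tokenize(text, layers):
    return re.findall(r"\S+", text)

def tag_currency(text, layers):
    # Indices of tokens that are known currency codes.
    return [i for i, tok in enumerate(layers["tokens"]) if tok in CURRENCIES]

def tag_money(text, layers):
    # Token-level approximation of /[0-9]+(Mio)* $CURRENCY$/.
    tokens, hits = layers["tokens"], []
    for i in layers["currency"]:
        j = i - 1
        while j >= 0 and tokens[j] == "Mio":
            j -= 1
        if j >= 0 and tokens[j].isdigit():
            hits.append(" ".join(tokens[j:i + 1]))
    return hits

def run_pipeline(text, annotators):
    by_name = {a.name: a for a in annotators}
    graph = {a.name: set(a.requires) for a in annotators}
    layers = {}
    # static_order() yields every layer after the layers it depends on.
    for name in TopologicalSorter(graph).static_order():
        layers[name] = by_name[name].run(text, layers)  # computed exactly once
    return layers

# Registration order does not matter; dependencies decide execution order.
layers = run_pipeline(
    "Acme paid 500 Mio EUR and 20 USD in fees",
    [
        Annotator("money", ("tokens", "currency"), tag_money),
        Annotator("currency", ("tokens",), tag_currency),
        Annotator("tokens", (), tokenize),
    ],
)
print(layers["money"])  # ['500 Mio EUR', '20 USD']
```

Whether something like this belongs inside the ingest framework or inside each plugin is exactly the design question I am asking about.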

Several open source Java technologies are available for that:

I am considering setting up a hackathon in Germany to speed up this development, thus transforming Elasticsearch into a "CognitiveElasticSearch". Timing is therefore essential, but even more essential is the QC/QA of the underlying infrastructure.

I recently found out that others have walked this path before and decided to use Spark with UIMA.

I would very much like to understand the design methodology of the Elasticsearch ingest plugin, namely where it makes more sense to use this approach instead of Spark with UIMA.

For example, here is a complete set of examples on how to use DL4J (Deep Learning for Java) that utilizes UIMA on the Spark platform:

https://github.com/deeplearning4j/dl4j-spark-cdh5-examples

and in the following project the use of the cTAKES UIMA module from within the Spark framework

At this point, there are very few people that know the details of the
Ingest node. Unless someone has been bravely running off nightly builds,
only the developers of the code know the inner details.

My take is that it will be somewhat of a replacement for Logstash, which is
slow (written in Ruby, what do you expect). Without looking at the code, I
am assuming most of your concerns (duplication of logic, etc...) will hold
true.

There are plugins for most Java-based NLP libraries. Mostly unofficial and
not maintained.

Good luck,

Ivan

The ingest plugin architecture is described at

To me, the ability to mangle documents before they reach the indexer has always been essential. I use client-based approaches, but I have also experimented with UIMA/OpenNLP plugins. Ingest nodes might make it easier to decide where to modify documents.

The main reason for implementing ingest was the observation that Logstash performance is weak and that plugins like attachment-mapper are too clumsy to reuse, on top of the duplication of code. So, in my understanding, ingest nodes are aimed at replacing certain Logstash input filters and slow/clumsy ES plugins (because I developed the JDBC river, I am also planning to implement something like a tabular data ingest).

Using NLP techniques, there are three challenges:

  1. the training data
  2. the excessive resource consumption while indexing (dozens of GB of RAM help a lot)
  3. the many methods of retrieval, most importantly document scoring for relevance

Training NLP datasets is mostly a magic skill and is outside the scope of ES. But the trained NLP datasets drive the quality of subsequent ES indexing. So one challenge would be to find a way to integrate the generation of such datasets into the overall workflow, preferably via an ES API as well. NLP software differs widely at this stage.

Dedicated ES ingest nodes will definitely help with semantic analysis of bulk input, but there is nothing in ES for semantic retrieval. For example, UIMA output is an object graph, and to index and search such graphs, additional machinery is required: a query language that returns graphs, not documents, and a data model (maybe JSON-LD somehow embedded in ES JSON).
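
As a rough illustration of the data-model half of that point, the sketch below flattens a tiny annotation graph into a JSON-LD-style structure nested inside an ordinary JSON document. The `annotations` field, the `http://example.org/nlp#` vocabulary, and the `to_es_document` helper are assumptions made up for this example, not an existing Elasticsearch or UIMA convention.

```python
import json

def to_es_document(text, entities, relations):
    """Embed an annotation graph in a plain JSON document (hypothetical layout).

    entities:  list of (type, begin, end) character spans
    relations: list of (subject_index, predicate, object_index)
    """
    nodes = [
        {
            "@id": f"_:e{i}",  # blank node identifier
            "@type": etype,
            "label": text[begin:end],
            "begin": begin,
            "end": end,
        }
        for i, (etype, begin, end) in enumerate(entities)
    ]
    for subj, predicate, obj in relations:
        # Edges are stored as references to other node ids.
        nodes[subj].setdefault(predicate, []).append({"@id": f"_:e{obj}"})
    return {
        "text": text,
        "annotations": {
            "@context": {"@vocab": "http://example.org/nlp#"},
            "@graph": nodes,
        },
    }

doc = to_es_document(
    "Acme paid 500 Mio EUR",
    entities=[("Organization", 0, 4), ("Money", 10, 21)],
    relations=[(0, "paid", 1)],
)
print(json.dumps(doc, indent=2))
```

Even with such a layout, the whole graph is still indexed and returned as one document, so the missing piece remains a query language that can match and return subgraphs.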

Another topic is scoring documents (object graphs). Boosting docs by adding numerical factors from semantic analysis clearly belongs on the Lucene analyzer side and is not a task for ingest.

@luto, you have listed a lot of software; I'm not sure whether you have studied each in detail with respect to fitting it into ES. I have a little experience with OpenNLP, UIMA, and Stanford NLP, with a bit of NER. I got them to run as a Lucene analyzer/tokenizer, but failed in the quest for graphs.

If Elasticsearch were to be used as the training set, there would need to
exist a bi-directional dependency between the ingest node and the rest of
the data cluster. The ingest node would need to know not only the document
being indexed, but the state of the cluster to close the feedback loop. I
doubt the ingest node was created for such a use case.

Would the ingest nodes even have state? If the model was trained
beforehand, would the ingest node be able to reload this data? I have seen
the GitHub issue, but I have not been following it. I am comfortable with
structuring the data before indexing it to Elasticsearch.

Ivan

Ingest allows for general-purpose document preprocessing: people can adjust the source of documents just before ES indexes the data. How the source of a document is adjusted is completely up to the implementation of a processor. My knowledge of NLP is very limited, but the ingest infrastructure can be used for semantic analysis of documents. For example, this OpenNLP plugin [1] could easily be implemented as an ingest processor. The ingest processor would just use the already provided models to semantically tag / enrich documents. Training of models is not something that ingest will ever solve, and I think it should happen offline / in the background.
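
To illustrate the idea only, here is a conceptual model in Python; it is not the actual ingest API, and `LowercaseProcessor`, `GazetteerTagger`, and `run_pipeline` are hypothetical names. Each processor takes the document source and returns the adjusted source, and the tagging processor relies on a model that was produced beforehand (here just a toy dict).

```python
class LowercaseProcessor:
    """Lowercases one field of the document source."""

    def __init__(self, field):
        self.field = field

    def execute(self, source):
        if self.field in source:
            source[self.field] = source[self.field].lower()
        return source


class GazetteerTagger:
    """Adds entity tags using a model that was loaded beforehand."""

    def __init__(self, field, target_field, model):
        self.field = field
        self.target_field = target_field
        self.model = model  # surface form -> entity type, produced offline

    def execute(self, source):
        text = source.get(self.field, "")
        # Naive substring lookup; a real tagger would work on tokens.
        source[self.target_field] = [
            {"name": name, "type": etype}
            for name, etype in sorted(self.model.items())
            if name in text
        ]
        return source


def run_pipeline(processors, source):
    # Each processor receives the source left behind by the previous one.
    for processor in processors:
        source = processor.execute(source)
    return source


model = {"berlin": "LOCATION", "acme": "ORGANIZATION"}  # toy model
doc = run_pipeline(
    [LowercaseProcessor("body"), GazetteerTagger("body", "entities", model)],
    {"body": "Acme opens an office in Berlin"},
)
print(doc["entities"])
```

The enriched source is what would then be handed to the indexer; no training happens anywhere in this path.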

I'm not sure how things will eventually evolve; this depends on how ingest ends up being used. However, I think it is likely that we will end up with an ingest-nlp plugin that encapsulates several processors, so that we avoid code duplication and are able to reuse certain computations.

In cases with more complex ingestion architectures, Logstash should be used, for example when data also needs to be stored in other data stores or when queuing is required (for instance to deal with spikes in the ingestion rate). Logstash is continuously being improved; for example, parts of it are being rewritten in Java for performance reasons. In fact, the long-term plan is that both ingest and Logstash will use the same processor framework / processor implementations under the hood.

[1] spinscale/elasticsearch-opennlp-plugin on GitHub: additional OpenNLP mapping type for Elasticsearch in order to perform named entity recognition

In theory the model could be trained in the background and a processor's model could be swapped on the fly when training has completed. However, there is no specific support for this at the moment.
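
A minimal sketch of what such an on-the-fly swap could look like, written in Python purely for illustration. Nothing here is an Elasticsearch API: `SwappableModel` and `TaggingProcessor` are hypothetical, and the "model" is just a set of known terms.

```python
import threading

class SwappableModel:
    """Holds the current model and lets a background job replace it."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._model

    def swap(self, new_model):
        # Replace the reference in one step; documents already being
        # processed keep the snapshot they took earlier.
        with self._lock:
            old, self._model = self._model, new_model
        return old


class TaggingProcessor:
    def __init__(self, holder):
        self.holder = holder

    def execute(self, source):
        model = self.holder.get()  # one consistent snapshot per document
        source["entities"] = sorted(
            tok for tok in source["text"].split() if tok in model
        )
        return source


holder = SwappableModel({"EUR"})
processor = TaggingProcessor(holder)

before = processor.execute({"text": "paid in EUR and USD"})["entities"]
holder.swap({"EUR", "USD"})  # e.g. called when background training finishes
after = processor.execute({"text": "paid in EUR and USD"})["entities"]
print(before, after)  # ['EUR'] ['EUR', 'USD']
```

The open question is who would call `swap` on every node and how the new model would be distributed, which is where the missing support comes in.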

spinscale highlighted this limitation in the deprecation notice of the
plugin you referenced:

"Upgrading your model requires a restart of all of the Elasticsearch nodes,
resulting in unwanted downtime."

So the ingest nodes will also not be able to, at least as of now, reload
the model. I do think the ingest nodes are better designed to be stateless.
I would much rather keep some logic outside of the cluster.

Ivan

I have been doing text mining for 20 years. I first started with my own code in Perl, then moved to GATE, then shifted to the commercial area (Highlight from SRI International; LUXID from TEMIS, now ExpertSystems). My experience with UIMA is limited to the LUXID product, which uses UIMA under the hood.
For some academic projects I then worked with Peregrine and JSRE.
I formally evaluated most of the other technologies I mentioned, and also reviewed them in a chapter of a book I published last year.
However, things move fast, and I just discovered that ClearTK (http://cleartk.github.io/cleartk/) has evolved considerably and now offers itself as an interface to many established NLP components... perhaps it would be worth having an ingest-ClearTK node.

Unfortunately, spinscale/elasticsearch-opennlp-plugin has been discontinued; the authors recommend not using it.

I was wondering what @davidtalby would think about the above strategy. He recently published an approach where he

The architecture is built out of open source big data components – Kafka and Spark Streaming for real-time data ingestion & processing, Spark for modeling, and Titan and Elasticsearch for enabling low-latency access to results.

The data science components include a UIMA pipeline with custom annotators, machine learning models for implicit inferences, and dynamic ontologies for representing & learning new relationships between concepts. Source code will be made available after the talk to enable you to hack away on your own.

I just discovered the POJO wrapper approach using Spark, published in 2014 by Phil Ogren: https://spark-summit.org/2014/wp-content/uploads/2014/07/Leveraging-UIMA-in-Spark-Philip-Ogren.pdf

What is not clear to me is when it is more appropriate to perform the enrichment using Spark and when using the Elasticsearch ingest node.

Indeed, that seems to be the solution I was seeking: see the following report from August 2015
https://groups.google.com/forum/#!topic/dkpro-core-user/X1z4Ziuas68