Using the Python plugin to create zentity models in Elasticsearch

@gioorso

You're right that zentity was designed to resolve a single entity per request in real time. This contrasts with the more common approach of resolving a population of entities in batch. I made a brief comparison of the two approaches in this presentation (Slide 13).

At some level it would be possible to use zentity to resolve a population of entities. For example, you could scroll over every document in an index, use zentity to resolve each document against the others, associate each document _id from the hits with an entity ID that you generate, and exclude those document _ids from subsequent iterations of the batch process. But this approach has limited scalability: the list of excluded document _ids grows unbounded with each request, and omitting those exclusions would result in many redundant searches. There are more appropriate solutions for population-scale entity resolution that operate in batch, but none are open source as far as I'm aware.
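Here's a minimal sketch of that scroll-and-resolve loop, just to make the idea concrete. It assumes the Python Elasticsearch client and zentity's resolution API; the index name `users`, the entity model `person`, and the attribute field names are placeholders you would replace with your own, and the exclusion list is tracked client-side in a Python set.

```python
import uuid

import requests
from elasticsearch import Elasticsearch, helpers

ES_URL = "http://localhost:9200"
INDEX = "users"                # placeholder index name
ENTITY_TYPE = "person"         # placeholder zentity entity model name
ATTRIBUTE_FIELDS = ["first_name", "last_name", "email"]  # placeholder fields mapped in the model

es = Elasticsearch(ES_URL)
resolved_ids = set()           # _ids already assigned to an entity (the "exclusion list")
entity_assignments = {}        # _id -> generated entity ID

for doc in helpers.scan(es, index=INDEX, query={"query": {"match_all": {}}}):
    doc_id = doc["_id"]
    if doc_id in resolved_ids:
        continue  # already resolved in an earlier iteration; skip the redundant search

    # Build the resolution request from the document's attribute values.
    attributes = {
        field: [doc["_source"][field]]
        for field in ATTRIBUTE_FIELDS
        if doc["_source"].get(field) is not None
    }
    response = requests.post(
        f"{ES_URL}/_zentity/resolution/{ENTITY_TYPE}",
        json={"attributes": attributes},
    )
    response.raise_for_status()

    # Associate every document returned by the resolution job with a generated entity ID.
    entity_id = str(uuid.uuid4())
    for hit in response.json()["hits"]["hits"]:
        entity_assignments[hit["_id"]] = entity_id
        resolved_ids.add(hit["_id"])

print(f"Resolved {len(resolved_ids)} documents into "
      f"{len(set(entity_assignments.values()))} entities")
```

As the paragraph above notes, `resolved_ids` grows without bound as the population grows, which is exactly where this approach stops scaling.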

I view zentity as an appropriate solution in two cases:

  1. When the scope of your analysis is limited to a single entity or a small network of entities from the greater population, and you want to simplify your architecture by skipping batch entity resolution; or
  2. When you have resolved a population of entities in batch and then want to resolve subsequent incoming entities in real time (see the sketch after this list).
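
For the second case, a single real-time resolution request per incoming record is all you need. This is a hedged sketch, again assuming a `person` entity model and placeholder attribute names:

```python
import requests

ES_URL = "http://localhost:9200"   # placeholder cluster address
ENTITY_TYPE = "person"             # placeholder zentity entity model name

# Attribute values of one incoming record, e.g. from a web form or a message queue.
incoming = {
    "first_name": ["Alice"],
    "last_name": ["Jones"],
    "email": ["alice.jones@example.net"],
}

# Resolve this one record against the already-indexed population in real time.
response = requests.post(
    f"{ES_URL}/_zentity/resolution/{ENTITY_TYPE}",
    json={"attributes": incoming},
)
response.raise_for_status()
hits = response.json()["hits"]["hits"]
print(f"Incoming record matched {len(hits)} existing documents")
```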