I believe you've already read about the Enrich pipeline and I think that's one option. Another would be to have an isolated data enrichment service, in which your application requests enrichment before indexing the data.
The second option is how we do it here at our company.
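For completeness, the enrich pipeline option (the first one) roughly looks like this. This is only a minimal sketch using the plain Java HTTP client against an unsecured local cluster; the index, policy, pipeline, and field names (`session-lookup`, `session-country-policy`, `session-country-enrich`, `sessionId`, `country`, `geo`) are made up for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EnrichSetup {

    // Assumes an unsecured local cluster; adjust the URL and add auth for a real deployment.
    private static final String ES = "http://localhost:9200";

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // 1. Define the enrich policy: match on sessionId and copy the country field.
        send(http, "PUT", "/_enrich/policy/session-country-policy", """
            { "match": { "indices": "session-lookup",
                         "match_field": "sessionId",
                         "enrich_fields": ["country"] } }
            """);

        // 2. Execute the policy so Elasticsearch builds the hidden enrich index from session-lookup.
        send(http, "POST", "/_enrich/policy/session-country-policy/_execute", null);

        // 3. Create an ingest pipeline whose enrich processor adds the country to incoming events.
        send(http, "PUT", "/_ingest/pipeline/session-country-enrich", """
            { "processors": [
                { "enrich": { "policy_name": "session-country-policy",
                              "field": "sessionId",
                              "target_field": "geo" } } ] }
            """);
    }

    private static void send(HttpClient http, String method, String path, String body) throws Exception {
        HttpRequest.BodyPublisher payload = body == null
                ? HttpRequest.BodyPublishers.noBody()
                : HttpRequest.BodyPublishers.ofString(body);
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ES + path))
                .header("Content-Type", "application/json")
                .method(method, payload)
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(method + " " + path + " -> " + response.statusCode());
    }
}
```

Indexing with `?pipeline=session-country-enrich` (or setting it as the index's default pipeline) then adds the country to every new event.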
Thanks @RabBit_BR,

I read that using the Enrich pipeline will create an enrich index, and it was a bit unclear to me how that index is maintained. In this use case the "enrich" documents only need to exist for a short time, and I was concerned that the index might grow quite large since we will have a lot of them coming in (many millions per month). Can one set a time-to-live on the "enrich index"?
If you need to update the data in the enrich index you need to use the execute API every time; enrich indices are best suited for static data, or data that is not updated frequently.
In your case it seems that you would need to frequently update it.
The best way is to do that enrichment before indexing the data in Elasticsearch, but sometimes this is hard to do, or not possible in a practical way.
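In other words, every time the lookup data changes you have to run the execute API again so the hidden enrich index is rebuilt. A minimal standalone sketch, reusing the hypothetical policy name from the earlier example:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReExecuteEnrichPolicy {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // Re-running the execute API rebuilds the hidden enrich index
        // from the current contents of the source index.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_enrich/policy/session-country-policy/_execute"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```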
Is the session id unique? Could the events, like created and action, happen too close to each other, like within milliseconds?
The sessionId is unique for that session, which normally lasts seconds to minutes (1-10 documents). Events would not be milliseconds apart but more like seconds; it is manual user input.
We haven't done any index mapping for now, if that is what you mean, just the default.
This is what the mapping looks like:
Yeah, but how are you indexing the data? Are you using Logstash, Filebeat, or a custom script?
Do you have any control over the source of your data? For example, how is it created?
The enrichment depends on how you are indexing your data; you would need to store the key-value pair of sessionId and country somewhere, and read it while indexing.
Yes, we have control; we are saving through org.springframework.data.elasticsearch.repository.ElasticsearchRepository.
The problem for us is that we can scale our services, and thus we would have to first do a lookup against ES before enriching. If we had just one entry point I guess it would be easy to cache the first entry, unless we had a shared cache/store mechanism outside ES, but that complicates things.
Yeah, but this is the right approach in this case.
You would need to store it somewhere else, maybe in a cache database like Redis/Memcached, and query it every time you receive a new document, before indexing it in Elasticsearch.
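As a rough sketch of that approach, assuming Spring Boot with Spring Data Redis plus the Spring Data Elasticsearch repository you already use, and assuming (as suggested above) that the first event of a session carries the country; the entity, repository, index, and key names are made up for illustration:

```java
import java.time.Duration;

import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

// Hypothetical event document; field names are made up for illustration.
@Document(indexName = "session-events")
class SessionEvent {
    @Id String id;
    String sessionId;
    String action;
    String country;   // present on the first event of a session, missing on the rest
}

interface SessionEventRepository extends ElasticsearchRepository<SessionEvent, String> {
}

@Service
class SessionEventIndexer {

    private final StringRedisTemplate redis;          // shared cache, visible to every scaled instance
    private final SessionEventRepository repository;  // the existing Spring Data Elasticsearch repository

    SessionEventIndexer(StringRedisTemplate redis, SessionEventRepository repository) {
        this.redis = redis;
        this.repository = repository;
    }

    void index(SessionEvent event) {
        String key = "session:country:" + event.sessionId;

        if (event.country != null) {
            // First event of the session: remember the country for the follow-up events.
            // The TTL keeps the cache small, since sessions only last seconds to minutes.
            redis.opsForValue().set(key, event.country, Duration.ofMinutes(30));
        } else {
            // Later events: enrich from the shared cache before indexing.
            event.country = redis.opsForValue().get(key);
        }

        repository.save(event);
    }
}
```

This keeps the lookup out of Elasticsearch and behaves the same no matter how many instances of the service are running.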