Enrich document with data from same index

Hi, we have one index into which we insert documents that track, e.g., a session and what is done in it, i.e.:

document 1:
sessionId = 17
type=created
country=DE

document 2:
sessionId = 17
type=action

Now, at insertion of doc 2, we want to enrich it with data from doc 1 (same sessionId); in this case, add the field country=DE to doc 2.

A session is not long-lived (seconds/minutes) and there will be a lot of newly created sessions.

How can this be accomplished? I read a bit about ingest pipelines, but I am wondering if that is the way to go or if there are better ways?

Hi @rickardo

I believe you've already read about the enrich processor, and I think that's one option. Another would be an isolated data enrichment service, which your application calls to enrich each document before indexing it.
The second option is how we do it at our company.

Thanks @RabBit_BR ,
I read that using an enrich pipeline will create an enrich index, but it was a bit unclear to me how that index is maintained. In this use case the "enrich" documents only need to exist for a short time, and I was concerned that the index might grow quite large since we will have a lot of them coming in (many millions per month). Can one set a time-to-live on the enrich index?

If you need to update the data in the enrich index, you need to call the execute policy API every time; enrich indices are best suited for static data, or data that is not updated frequently.

In your case it seems that you would need to frequently update it.

The best way is to do that enrichment before indexing the data in Elasticsearch, but sometimes this is pretty hard to do or even not possible in a useful way.
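For reference, the enrich setup would look something like this (the index, policy, and pipeline names here are just examples, not from your setup):

```json
PUT /_enrich/policy/session-country-policy
{
  "match": {
    "indices": "sessions",
    "match_field": "sessionId",
    "enrich_fields": ["country"]
  }
}

POST /_enrich/policy/session-country-policy/_execute

PUT /_ingest/pipeline/session-enrich
{
  "processors": [
    {
      "enrich": {
        "policy_name": "session-country-policy",
        "field": "sessionId",
        "target_field": "session"
      }
    }
  ]
}
```

The `_execute` call is what builds the internal enrich index, and it does not happen automatically when new source documents arrive; with new sessions created every few seconds you would have to re-execute the policy constantly, which is why this is a poor fit for your case.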

Is the sessionId unique? Could events like created and action happen too close to each other, like milliseconds apart?

How are you indexing your data?

Thanks @leandrojmp !

OK, I didn't realize it was best suited for static data.

The sessionId is unique for that session, which normally lasts seconds/minutes (1-10 documents). Events would not be milliseconds apart but more like seconds, since it is manual user input.

We haven't done any explicit index mapping for now, if that is what you mean, just the default.
This is what the mapping looks like:

"country": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
},
"sessionId": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

Yeah, but how are you indexing the data? Are you using Logstash, Filebeat, a custom script?

Do you have any control over the source of your data? For example, how is it created?

The enrichment depends on how you are indexing your data; you would need to store the key-value pair of sessionId and country somewhere, and read it while indexing.

Yes, we have control; we are saving through org.springframework.data.elasticsearch.repository.ElasticsearchRepository.

The problem for us is that we can scale our services, and thus we would have to first do a lookup against ES before enriching. If we had just one entry point, I guess it would be easy to cache the first entry. Unless we had a shared cache/store mechanism outside ES, but that complicates things.

Yeah, but this is the right approach in this case.

You would need to store it somewhere else, maybe a cache database like Redis/Memcached and query it every time you receive a new document, before indexing in Elasticsearch.

The enrich processor will not work in this case.
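The flow would look roughly like the sketch below. A `ConcurrentHashMap` stands in for the shared store (in production it would be Redis/Memcached with a TTL matching the short session lifetime, shared across all your service instances); the class and method names are illustrative, not from any library:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SessionEnricher {

    // sessionId -> country; stand-in for a shared Redis/Memcached store.
    private final Map<String, String> sessionStore = new ConcurrentHashMap<>();

    // Called when a type=created document arrives: remember its country.
    public void onSessionCreated(String sessionId, String country) {
        sessionStore.put(sessionId, country);
    }

    // Called before indexing any follow-up document: copy the country
    // from the created event, if we have seen it.
    public Map<String, String> enrich(Map<String, String> doc) {
        String country = sessionStore.get(doc.get("sessionId"));
        if (country != null) {
            doc.put("country", country);
        }
        return doc; // then save via ElasticsearchRepository as usual
    }

    public static void main(String[] args) {
        SessionEnricher enricher = new SessionEnricher();
        enricher.onSessionCreated("17", "DE");

        Map<String, String> action = new ConcurrentHashMap<>();
        action.put("sessionId", "17");
        action.put("type", "action");

        System.out.println(enricher.enrich(action).get("country")); // DE
    }
}
```

Because the store is external to each service instance, it doesn't matter which instance receives the `created` event and which one receives the later `action` events.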

Ok, good to know! Thanks for your quick response!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.