Hi all,
I'm fairly new to Elasticsearch and have a question about merging documents and removing duplicates.
I have the following data structures, one per Kafka topic:
topic1:
{
"_id": "a12345",
"userId" : "x123",
"externalId": "0987654",
"firstName" : "first name",
"lastName" : "last name",
"birthDate" : -157770000000,
"vipFlag" : "1",
"email" : "user@domain.org",
"mobile" : "+12345678",
"zipCode" : "12345",
"city" : "City name"
}
topic2:
{
"_id": "b12345",
"userId" : "x123",
"externalId": "0987654",
"firstName" : "first name",
"lastName" : "last name",
"birthDate" : -157770000000,
"email" : "user@domain.org",
"mobile" : "+12345678",
"zipCode" : "12345",
"city" : "City name"
}
topic3:
{
"_id": "c12345",
"externalId": "0987654",
"firstName" : "first name",
"lastName" : "last name",
"email" : "user@domain.org",
"mobile" : "+12345678",
"zipCode" : "12345"
}
topic4:
{
"_id": "b12345",
"userId" : "x123",
"firstName" : "first name",
"lastName" : "last name",
"birthDate" : -157770000000,
"email" : "user@domain.org",
"mobile" : "+12345678",
"zipCode" : "12345",
"city" : "City name"
}
The information is pushed continuously into ES from the topics in Kafka.
I'm considering using the Kafka Elasticsearch Sink Connector to store the data in a separate index per topic, and then merging the values, based on the merging rules below, into a new index that will be used for querying.
The following fields should be used as keys for merging the documents from the topics:
- topic1 with topic2: userId
- topic1 with topic4: userId
- topic1_2_4 with topic3: externalId
As a result, I'm expecting one index with either a parent-child relation or merged fields (whichever is easier).
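To make the expected result concrete, here is a minimal Python sketch of the merged-fields variant on plain in-memory dicts, with no Elasticsearch or Kafka involved. It is only an illustration of the rules above, not a proposed implementation. The function names `merge_fields` and `merge_topics` are made up, and two behaviours are assumptions on my part because the rules do not cover them: when two topics carry the same field, the first value seen wins, and topic3 documents whose externalId matches nothing are skipped.

```python
from typing import Any

Doc = dict[str, Any]


def merge_fields(target: Doc, source: Doc) -> None:
    """Copy fields that target lacks; existing values win (assumption)."""
    for key, value in source.items():
        if key != "_id":  # per-topic ids are not carried into the merged doc
            target.setdefault(key, value)


def merge_topics(topic1: list[Doc], topic2: list[Doc],
                 topic3: list[Doc], topic4: list[Doc]) -> list[Doc]:
    merged_by_user: dict[str, Doc] = {}

    # Rules 1 and 2: topic1 with topic2 and topic1 with topic4, keyed on userId.
    for docs in (topic1, topic2, topic4):
        for doc in docs:
            user_id = doc.get("userId")
            if user_id is None:
                continue  # cannot be matched by userId
            merge_fields(merged_by_user.setdefault(user_id, {}), doc)

    # Rule 3: the merged topic1_2_4 documents with topic3, keyed on externalId.
    by_external = {doc["externalId"]: doc
                   for doc in merged_by_user.values() if "externalId" in doc}
    for doc in topic3:
        target = by_external.get(doc.get("externalId"))
        if target is not None:  # unmatched topic3 docs are skipped (assumption)
            merge_fields(target, doc)

    return list(merged_by_user.values())


# Shortened versions of the sample documents above.
topic1 = [{"_id": "a12345", "userId": "x123", "externalId": "0987654",
           "vipFlag": "1", "city": "City name"}]
topic2 = [{"_id": "b12345", "userId": "x123", "externalId": "0987654",
           "birthDate": -157770000000}]
topic3 = [{"_id": "c12345", "externalId": "0987654",
           "email": "user@domain.org"}]
topic4 = [{"_id": "b12345", "userId": "x123", "mobile": "+12345678"}]

merged = merge_topics(topic1, topic2, topic3, topic4)
```

With these shortened samples, `merged` holds a single document for userId x123 that combines the fields of all four topics. That is the shape I would like to end up with in the query index.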
Is this doable in Elasticsearch, or should I write an external Kafka consumer + Elasticsearch indexer application to merge and insert the data?
If it's doable in Elasticsearch, what would be the right way to solve this?