I am loading employee data from JSON files into Elasticsearch using Logstash.
A JSON file can contain multiple records for the same employee, each carrying different data, e.g. addresses, languages, etc.
Each of these records can contain the same fields with the same or different values.
The goal is to reconcile all records so that Elasticsearch holds only one document per employee, containing the merged, de-duplicated data from all incoming JSON records.
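For illustration, two simplified incoming records for e1 (values made up) might look like this:

{ "id": "e1", "languages": [ "en" ], "addresses": [ { "zip": "12345", "city": "ROCHESTER" } ] }
{ "id": "e1", "languages": [ "en", "spn" ], "addresses": [ { "zip": "12346", "city": "ROCHESTER" } ] }

Both should end up merged into the single document shown at the bottom of this post.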
Logstash processes each incoming record on a separate thread, based on the number of pipeline workers.
This can lead to the following scenario (rare, but possible):
- employee e1 has 4 records in the JSON file,
- each record is picked up by one of 4 Logstash workers/threads at roughly the same time,
- each worker/thread tries to write/update the Elasticsearch index document for employee e1,
- whichever worker/thread finishes last, its record's data ends up in the index document, and the data from the other 3 workers/threads is lost (either completely or partially); a sketch of the output configuration where I think this race exists follows this list.
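A minimal sketch of that output configuration (host and index name are placeholders, not my real values):

output {
  elasticsearch {
    hosts       => ["localhost:9200"]  # placeholder host
    index       => "employees"         # placeholder index name
    document_id => "%{id}"             # every record for e1 targets the same document
    # the default action is "index", so each event replaces the whole document;
    # with 4 workers racing on e1, whichever bulk request lands last wins
  }
}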
Questions:
- I suspect this can happen, but I am not sure, so the first question is: is this a valid scenario?
- If it can happen, is there a way to handle this use case in Logstash or in Elasticsearch?
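For concreteness, one approach I have been considering (untested; the script and option values are my own sketch) is a scripted upsert, so the merge happens on the Elasticsearch side, where updates to a single document are applied sequentially and version conflicts can be retried instead of silently overwritten:

output {
  elasticsearch {
    hosts             => ["localhost:9200"]  # placeholder host
    index             => "employees"         # placeholder index name
    document_id       => "%{id}"
    action            => "update"
    scripted_upsert   => true                # run the script even when the document does not exist yet
    retry_on_conflict => 5                   # re-run the merge if two workers collide on the same document
    script_lang       => "painless"
    script_type       => "inline"
    # the full Logstash event is exposed to the script as params.event
    script => "
      if (ctx._source.id == null) { ctx._source.id = params.event.id; }
      if (ctx._source.languages == null) { ctx._source.languages = []; }
      if (params.event.languages != null) {
        for (l in params.event.languages) {
          if (!ctx._source.languages.contains(l)) { ctx._source.languages.add(l); }
        }
      }
      if (ctx._source.addresses == null) { ctx._source.addresses = []; }
      if (params.event.addresses != null) {
        for (a in params.event.addresses) {
          if (!ctx._source.addresses.contains(a)) { ctx._source.addresses.add(a); }
        }
      }
    "
  }
}

My understanding is that this would avoid the lost-update problem, because Elasticsearch serializes updates to a single document on its primary shard and retry_on_conflict re-runs the script on a version conflict, but I am not sure whether this is the right (or idiomatic) way to do it.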
Sample employee document from the Elasticsearch index:
{
  "id": "e1",
  "languages": [ "en", "spn" ],
  "addresses": [
    {
      "zip": "12345",
      "phone": "1234512345",
      "address_2": "MAYO CLINIC",
      "address": "200 1ST ST SW",
      "state": "MN",
      "city": "ROCHESTER"
    },
    {
      "zip": "12346",
      "phone": "1234512346",
      "address_2": "a2",
      "address": "a1",
      "state": "MN",
      "city": "ROCHESTER"
    }
  ]
}