Scenario: We are adding an object to an array in Elasticsearch via Logstash's ES output plugin. Both are on version 6.8. It is a scripted upsert using a stored script on Elasticsearch, and here is the stored script on ES:
"add_script" : {
"lang" : "painless",
"source" : "if (ctx._source.tags != null ) { ctx._source.tags.add(params.event.get('tags')[0])} else {ctx._source.tags = params.event.get('tags')} "
}
The Logstash output is:
output {
  elasticsearch {
    action => "update"
    hosts => "*****"
    index => "%{index}"
    document_id => "%{cid}"
    scripted_upsert => true
    upsert => ""
    script_lang => ""
    script_type => "indexed"
    script => "add_script"
    timeout => 120
  }
}
With the above two pieces in place, we are trying to push a tags object into the array if it exists, or create a new tags array if it doesn't. Simple stuff so far!
Issue: Under load testing, when the ES endpoint becomes momentarily unavailable (that unavailability itself is a topic for a later discussion), or when Logstash restarts with a few events in the queue, then the moment the connection is restored, the script ignores the != null check and overwrites the entire tags array. For instance, if the tags array was initially like this in ES:
{
  "cid": 1,
  "tags": [
    { "id": 1, "tag": "test_tag1" },
    { "id": 2, "tag": "test_tag2" },
    { "id": 3, "tag": "test_tag3" }
  ]
}
and the last event that came in for update was
"tags": [
  { "id": 4, "tag": "test_tag4" }
]
then instead of this array being appended, the end document looks like this:
{
  "cid": 1,
  "tags": [
    { "id": 4, "tag": "test_tag4" }
  ]
}
This happens only when the connection to ES is restored, as mentioned earlier: either when ES encounters a host-unreachable error, or when Logstash restarts while some events are still in the queue and they start getting applied as soon as it comes back up.
So far, a few things we have tried with no luck:
- Reducing the bulk size and the number of workers in the Logstash pipeline, on the theory that we might be overwhelming ES.
- Increasing retry_interval from 2s to 30s. The theory here was that when we reconnect to ES and the script executes, there is a window during which ES returns no record for a matching document; the script then treats the document as empty, runs the else branch, and overwrites the entire array. Increasing the interval was meant to give ES some time to warm up.
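One defensive change we have not yet tried is making the merge idempotent, so that a replayed or duplicated event after a reconnect cannot clobber or duplicate anything. A Python sketch of that rule (the dedupe-by-id policy is our assumption about what a safer script would do, not what the current stored script does; the equivalent would need to be ported to Painless):

```python
# Sketch: merge incoming tags into the existing ones, keyed by "id",
# so that applying the same event twice (e.g. after a retry) is harmless.
def merge_tags(existing, incoming):
    existing = existing or []             # treat a missing field as empty, not as "replace"
    seen = {t["id"] for t in existing}    # ids already stored on the document
    merged = list(existing)
    for tag in incoming:
        if tag["id"] not in seen:         # skip tags that are already present
            merged.append(tag)
            seen.add(tag["id"])
    return merged

doc_tags = [{"id": 1, "tag": "test_tag1"}]
event_tags = [{"id": 4, "tag": "test_tag4"}]
once = merge_tags(doc_tags, event_tags)
twice = merge_tags(once, event_tags)      # replaying the event adds nothing
```

With this rule, even if the upsert path fires twice for the same event, the document converges to the same state, which would at least contain the damage while the root cause is investigated.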
Help is needed as soon as possible to rectify this situation: we have just stopped a major product feature release due to this issue and cannot figure out any way around it. Why, on reconnection, does the update not honour the null check if (ctx._source.tags != null)? Can we call this a bug in Elasticsearch?