We've been experiencing an intermittent issue for weeks, if not months, ever since upgrading from Elasticsearch 5.2.x to 5.5.3.
In short, one or both Elasticsearch nodes will suddenly, and without any apparent warning, stop accepting search requests. In fact, the node appears to refuse to do any work at all. The world stops. That's more than just poetic: maybe GC is stopping the world?
We run a 2-node cluster (with minimum master nodes set to 1) on Ubuntu 16.04 LTS. Each node runs on its own server and has a 16G heap.
The servers are very capable bare-metal Dell blade servers, with SSDs, 180+ GB memory, 80 CPU cores etc. #justsayin
We run a few other things on those servers, and if it's relevant I'll disclose them, but basically the servers are way under-utilized all day.
The outage happens at any time of day, even in the wee hours of the morning when website traffic is at its lowest. It lasts 5-20 minutes. Sometimes the stalled node mysteriously self-recovers and starts accepting requests once more. But more often than not we manually restart the stalled Elasticsearch node (the node, not the server).
Noteworthy: the Kibana/X-Pack monitoring graph for the stalled node mysteriously goes blank for the duration of the outage. No data. Nothing.
Here are excerpts from the Elasticsearch log files. They're too large to gist!
app1 is the Elasticsearch node that stalls, so its log file is incredibly noisy and verbose.
app2 is the other node, which of course observes the issue from a different but useful viewpoint.
Some commentary on app1 log:
- The outage appears to start around 2018-01-03T18:50
- Some ParsingExceptions are thrown before that, but I think they're unrelated
- "monitoring execution is skipped until previous execution terminated" log lines build up around 2018-01-03T18:51, which seems to suggest something is awry
- By 2018-01-03T18:52:36, the outage is in full swing, with numerous EsRejectedExecutionException exceptions thrown
Commentary on app2 log:
- At 2018-01-03T18:55:08, app2 learns that app1 has left the cluster, and it all goes downhill from there
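To make the timeline in the commentary above easier to check against the full logs, here is a minimal Python sketch that counts the quoted log markers per minute. It assumes each log entry starts with a timestamp like [2018-01-03T18:52:36,123] (the default Elasticsearch 5.x layout, as far as I know); the function names are made up for illustration. Stack-trace lines without a timestamp are attributed to the last timestamped line, so the numbers are line occurrences, not distinct events.

```python
import re
from collections import Counter, defaultdict

# Markers quoted in the log commentary above.
MARKERS = (
    "ParsingException",
    "monitoring execution is skipped until previous execution terminated",
    "EsRejectedExecutionException",
)

# Assumed line prefix: [2018-01-03T18:52:36,123]; only the minute is captured.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}")


def marker_timeline(lines):
    """Return {minute: Counter({marker: line_count})}."""
    timeline = defaultdict(Counter)
    current_minute = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            current_minute = match.group(1)
        if current_minute is None:
            continue  # nothing to attribute this line to yet
        for marker in MARKERS:
            if marker in line:
                timeline[current_minute][marker] += 1
    return timeline


def print_timeline(path):
    """Print one line per minute that contains at least one marker."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        timeline = marker_timeline(handle)
    for minute in sorted(timeline):
        counts = ", ".join(
            f"{name}={n}" for name, n in sorted(timeline[minute].items())
        )
        print(minute, counts)
```

Calling print_timeline() on the app1 log (the path is whatever your install uses) should show whether the skipped-monitoring lines really do start before the rejections.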
We've been running Elasticsearch for 3-4 years, starting with 1.7, then 2.0, then 5.2, and now 5.5.x. It's been stable as a rock until we upgraded to 5.5.
We're out of ideas on what the problem may be. Early on we thought the problem might be GC related, so we progressively increased the heap size from 8G to 16G, but the problem persists.
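One way to test the GC theory with numbers rather than gut feel is to poll the nodes stats API (GET _nodes/stats/jvm) and watch how the per-collector counters move. Below is a rough Python sketch of that idea. The base URL, the 10-second interval and the function names are assumptions for illustration, and the GC fraction is only approximate because it ignores the time the request itself takes. A fully stalled node may not answer at all, which would be a data point in itself.

```python
import json
import time
from urllib.request import urlopen


def gc_totals(stats):
    """Map node name -> {collector: (collection_count, collection_time_in_millis)}."""
    totals = {}
    for node in stats["nodes"].values():
        collectors = node["jvm"]["gc"]["collectors"]
        totals[node["name"]] = {
            name: (c["collection_count"], c["collection_time_in_millis"])
            for name, c in collectors.items()
        }
    return totals


def gc_delta(before, after, interval_ms):
    """Per (node, collector): collections run and share of the interval spent in GC."""
    report = {}
    for node, collectors in after.items():
        for name, (count, millis) in collectors.items():
            # A node that was absent from the previous snapshot reports zero change.
            old_count, old_millis = before.get(node, {}).get(name, (count, millis))
            report[(node, name)] = {
                "collections": count - old_count,
                "gc_fraction": (millis - old_millis) / interval_ms,
            }
    return report


def poll(base_url="http://localhost:9200", interval_s=10):
    """Print GC activity between consecutive polls until interrupted."""
    previous = None
    while True:
        with urlopen(f"{base_url}/_nodes/stats/jvm", timeout=5) as response:
            current = gc_totals(json.load(response))
        if previous is not None:
            delta = gc_delta(previous, current, interval_s * 1000)
            for key, value in sorted(delta.items()):
                print(key, value)
        previous = current
        time.sleep(interval_s)
```

A gc_fraction close to 1.0 for the old collector just before a stall would point at GC; values near zero would suggest looking elsewhere. Counters reset when a node restarts, so a negative delta just means the node was bounced between polls.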
We've googled over and over again to find someone else on the intertubes with the same problem, but came up empty.