We recently upgraded from ES 2.4.4 to 5.2.2 and have observed random cluster crashes ever since. We're at a loss as to what causes them; all we know is that heap usage suddenly grows to full capacity on all nodes within a matter of minutes. The nodes don't recover by themselves, and a full cluster restart is required.
There is no increase of requests to the cluster that would justify the growth. Also, the cluster may run for a few days without any heap spikes (see first graph below).
The logs show only an increasing number of GC runs as used heap grows, and these runs eventually start to take a long time (over 1 min each).
We've yet to take a heap dump at the time of a crash. Up until now the priority was to restore the cluster since the issue is affecting our production environment.
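To catch the next spike early enough to take that heap dump, we're considering polling the nodes stats API (`GET /_nodes/stats/jvm`) and flagging nodes whose heap usage crosses a threshold. Below is a minimal sketch of that idea, not something we run yet. The URL `http://localhost:9200`, the 75% threshold, and the function names are placeholders chosen for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: address of any node in the cluster
HEAP_ALERT_PERCENT = 75           # assumption: arbitrary alert threshold


def nodes_over_threshold(stats, threshold=HEAP_ALERT_PERCENT):
    """Return sorted (node_name, heap_used_percent) pairs at or above threshold.

    `stats` is the parsed JSON body of GET /_nodes/stats/jvm.
    """
    hot = []
    for node in stats.get("nodes", {}).values():
        pct = node["jvm"]["mem"]["heap_used_percent"]
        if pct >= threshold:
            hot.append((node["name"], pct))
    return sorted(hot)


def fetch_jvm_stats(base_url=ES_URL):
    """Fetch JVM stats for all nodes from the nodes stats API."""
    url = base_url + "/_nodes/stats/jvm"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def report(base_url=ES_URL):
    """Print one line per node whose heap usage is at or above the threshold."""
    for name, pct in nodes_over_threshold(fetch_jvm_stats(base_url)):
        print(f"{name}: heap at {pct}%")
```

Calling `report()` every minute or so (for example from cron) should give us a few minutes of warning, during which we could run `jmap -dump:format=b,file=heap.hprof <pid>` on an affected node before restarting it.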
The following graph shows how our cluster behaves normally and the heap spike in the middle.
This is the same heap spike as in the previous graph, zoomed in. Heap usage grew from ~10 GB to 30 GB in about 8 minutes.
Any advice on how to analyse this further? What other information can I provide to help shed light on this issue?