Upgrading ES 2.3.3 to 5.2 Causing Cluster Crash!

itaydvir · March 30, 2017, 8:52am

Hi,

We upgraded our cluster from ES 2.3.3 to 5.2.
Our new 5.2 cluster was built on much stronger servers (for future growth) than our 2.3.3 cluster, which was rock solid.
Everything ran smoothly for a few hours; monitoring showed that the servers were calm and not working hard at all.
Then heap usage suddenly jumped from 6.1GB to the maximum (15GB), causing long GCs; the nodes became unresponsive and then disconnected from the cluster.
All other metrics look stable: CPU (very low), indexing rate, get rate and search rate.
Query cache, request cache and fielddata are stable as well.
Our usage is pretty basic: mainly textual search and some aggregations (sum, avg, and some that contain scripts). We make very little use of nested documents and no use of parent/child relationships.

It is not clear why the heap jumps to the maximum so fast. It seems to be a major bug in Elasticsearch or Lucene.

Then we found out about the Lucene 6.4.0 memory leak, so we tried deploying again, this time with ES 5.2.2 + Lucene 6.4.1, but the same scenario happened.

Here is some technical information about the cluster:

3 master nodes
5 data nodes (32GB RAM, 8 cores)
Indices are not that big; the most used ones contain 1.5 million docs and take 4GB of store size
Indices used for aggs contain about 6 million docs and weigh about 15GB.

Attaching some graphs from the time of the crash:

Heap (all nodes; the bottom lines are master nodes)

GC data (of the first node that started the issue): [image: GC.png]

Index, Merge, Get & Search rates: [image: rates.png]

We've been working with Elasticsearch for 5 years now and have upgraded versions a few times. This is the first time we have encountered such a major issue.
This is critical for us, so thank you in advance for helping!

spinscale · March 31, 2017, 7:09am

Hey,

It might make sense to open a GitHub issue for this one. A heap dump taken while the GC is happening, if you have or can obtain one, might also be useful. Any log file entries would be great too (like the GC log information), and of course any other non-standard messages in the log, such as exceptions, warnings or errors.

--Alex
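
To know when to grab the heap dump and GC logs mentioned above, it can help to watch per-node heap usage and old-generation GC counts. Below is a minimal Python sketch, not an official tool, that flags nodes whose heap is close to the maximum. It assumes the input is a parsed response from the node stats API (GET _nodes/stats/jvm) and reads only the jvm.mem.heap_used_percent and jvm.gc.collectors.old.collection_count fields. The function name, the 85% threshold and the sample values are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def nodes_over_heap_threshold(
    stats: Dict, threshold_pct: int = 85
) -> List[Tuple[str, int, int]]:
    """Return (node_name, heap_used_percent, old_gc_count) for every node
    whose heap usage is at or above threshold_pct, highest usage first."""
    flagged = []
    for node in stats.get("nodes", {}).values():
        jvm = node.get("jvm", {})
        heap_pct = jvm.get("mem", {}).get("heap_used_percent", 0)
        old_gc = (
            jvm.get("gc", {})
            .get("collectors", {})
            .get("old", {})
            .get("collection_count", 0)
        )
        if heap_pct >= threshold_pct:
            flagged.append((node.get("name", "unknown"), heap_pct, old_gc))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample, shaped like a heavily trimmed _nodes/stats/jvm response.
sample = {
    "nodes": {
        "a1": {
            "name": "data-1",
            "jvm": {
                "mem": {"heap_used_percent": 97},
                "gc": {"collectors": {"old": {"collection_count": 412}}},
            },
        },
        "b2": {
            "name": "data-2",
            "jvm": {
                "mem": {"heap_used_percent": 41},
                "gc": {"collectors": {"old": {"collection_count": 3}}},
            },
        },
        "c3": {
            "name": "master-1",
            "jvm": {
                "mem": {"heap_used_percent": 22},
                "gc": {"collectors": {"old": {"collection_count": 1}}},
            },
        },
    }
}

if __name__ == "__main__":
    for name, heap_pct, old_gc in nodes_over_heap_threshold(sample):
        print(f"{name}: heap {heap_pct}% used, {old_gc} old-gen collections")
```

In practice the dict would come from decoding the JSON body of the stats request against one of the cluster's nodes. A node that stays flagged while its old-generation collection count keeps climbing is a reasonable candidate for the heap dump.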

itaydvir · April 1, 2017, 7:36pm

Thanks Alex, will do.

system · April 29, 2017, 7:37pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
