ES 5.2.2: Sudden heap spikes followed by cluster crash

Hello community,

We recently upgraded from ES 2.4.4 to 5.2.2 and have observed random cluster crashes since then. We're at a loss as to what causes them, other than knowing that heap usage suddenly grows to its full capacity on all nodes in a matter of minutes. The nodes don't recover by themselves and a full cluster restart is required.

There is no increase in requests to the cluster that would explain the growth. The cluster may also run for a few days without any heap spikes (see the first graph below).

The logs only show an increasing number of GC runs as the used heap grows; these runs eventually start to take a long time (over 1 minute).

We've yet to take a heap dump at the time of a crash. Up until now the priority was to restore the cluster since the issue is affecting our production environment.

The following graph shows how our cluster behaves normally and the heap spike in the middle.

This is the same heap spike as in the graph before, but zoomed in. Heap memory grew from ~10GB to 30GB in around 8min.

Any advice on how to analyse this further? What other information can I provide to help shed light on this issue?

Cheers
Stefan

Hey,

We are also seeing this issue with our 5.2.2 cluster; at the moment we have to restart misbehaving nodes every couple of days. We managed to bring that down to a few nodes per week by limiting the query cache to three percent.
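
In case it helps: as far as I know the query cache limit is a static node-level setting, so it goes into elasticsearch.yml on every node and needs a node restart to take effect. Ours looks like this:

indices.queries.cache.size: 3%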

Our most recent node crash looks like this (you can clearly see the query cache ramping up until the node is stuck):

There is also a thread from people who had similar problems; in their case it was related to a bug in aggregations:

Hope this will help you until everything's fixed :slight_smile:

Hi André

Thanks for your reply.

Do your nodes crash individually or all at once? Since our first crash and the full cluster restart, the nodes have always crashed all at once. The situation would be much improved if only one node crashed at a time.

Nevertheless, another crash would at least be useful for analysis purposes: I'd like to create a heap dump and submit it to ES.
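
The rough plan, assuming the JDK's jmap is available on the node and the JVM still responds at all, is something along these lines (path and pid are placeholders):

jmap -dump:format=b,file=/tmp/es-heap.hprof <pid of the Elasticsearch process>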

I've also seen the Discuss post you linked, but it seems our symptoms are a bit different, which is why I created my own topic.

I'll give reducing the query cache a go and will keep this post updated.

Cheers
Stefan

Ouch, all nodes heap-spike and die?

For us it's usually just one node stalling, but the whole cluster becomes unresponsive.

A quick dig into the master log helps to identify the misbehaving node, e.g. in this case it was es-big-15:

[2017-04-03T14:17:28,806][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [es-big-03] collector [cluster-stats-collector] timed out when collecting data
[2017-04-03T14:17:50,244][ERROR][o.e.x.m.c.i.IndicesStatsCollector] [es-big-03] collector [indices-stats-collector] timed out when collecting data
[2017-04-03T14:18:00,251][ERROR][o.e.x.m.c.i.IndexStatsCollector] [es-big-03] collector [index-stats-collector] timed out when collecting data
[2017-04-03T14:18:10,253][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [es-big-03] collector [cluster-stats-collector] timed out when collecting data
[2017-04-03T14:18:20,260][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [es-big-03] collector [index-recovery-collector] timed out when collecting data
[2017-04-03T14:18:38,889][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [es-big-03] failed to execute on node [Brcd-4CaQ3ejYCv9y_L5ug]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-big-15][192.168.16.36:9300][cluster:monitor/nodes/stats[n]] request_id [2239432562] timed out after [15000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:908) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
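
If at least one node still answers HTTP, the cat nodes API is also a quick way to spot the node whose heap is pinned; something like this (pointing curl at any reachable node):

curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m'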

@AirOnSkin Before we found the issue with our aggregations, we were given the suggestion to turn off all our caching and clear the caches because of a Lucene bug that had been discovered and can lead to memory leaks. It didn't help us specifically, but I thought I'd pass it along in case it helps in your case.
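
If you want to try the cache-clearing part, something like this should drop the query, request and fielddata caches across all indices (adjust the host to one of your nodes):

curl -XPOST 'localhost:9200/_cache/clear'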

Can you correlate any searches or aggregations to the time the spike occurs?
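
If you haven't already, one way to correlate them might be to temporarily enable the search slow log; the thresholds below are just examples, and I believe these index settings can be changed dynamically:

curl -XPUT 'localhost:9200/_all/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'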

@AndreCimander: Yes, the heap spike happens on all three search nodes; the master nodes are unaffected. I've attached the logs of all three nodes up until the crash here:

@mstruve: We're going to test a reduced query cache setting of 3% on one node as soon as possible. Other than that I'm reluctant to test out multiple changes at once on production (our stage environment doesn't behave the same way as production because there's no load :frowning: ).
We haven't been able to pin the behaviour to anything specific that happens at the start of the incident. The amount and type of requests to the cluster seem to be normal.

@AirOnSkin What type of requests are you making to your cluster the most? Those might be helpful to see; even though their type and number haven't changed, the way ES processes them probably has.

@mstruve: We're mainly doing searches.

Our cluster froze again yesterday evening at around 18:10, this time without a memory peak.

Earlier yesterday I configured a query cache size of 3% for our first node (search01) and restarted ES. I didn't touch the other nodes, though. The cluster suddenly froze without any clear indication of something (heap/old/young/cache size/...) growing abnormally or reaching max.

I tried to create a heap dump as soon as we realized we were down, but jmap wouldn't complete with the cluster being unresponsive.
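
As a fallback I'm considering letting the JVM write the dump itself the next time a node actually runs out of memory. If I read the docs correctly, these standard HotSpot flags can go into config/jvm.options (the dump path is just an example, and this only triggers on a real OutOfMemoryError, not on plain GC thrashing):

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch/heapdump.hprof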

The logs show the restart of the first node (as mentioned above) at 16:53; the next log entry on all three nodes is at 18:09, when the GC suddenly starts to take a long time. A few minutes later the cluster froze.

This situation is pretty bad because we have no idea how to counter it. We've been lucky up until now because the crashes only happened during work hours and we were able to react pretty quickly, but if the cluster goes down during the night or on a weekend our reaction times will increase.

We're considering upgrading to 5.3.0, mainly because we're out of ideas (even if it is unsatisfactory).

Is the misbehaving data node maxing out one CPU? I've seen ES instances that refuse to shut down via SIGTERM, keeping a 99% CPU process around until I SIGKILL it, but for us that was always tied to heap exhaustion.
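
If the stuck node still answers HTTP at that point (ours usually doesn't), the hot threads API might show what is burning the CPU; roughly:

curl -s 'localhost:9200/_nodes/es-big-15/hot_threads?threads=3'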

Good luck with 5.3, I hope it fixes your problems, keep us posted :slight_smile: We are holding out for Lucene 6.5.1, hopefully arriving with 5.4.

I have pretty much the same issue on a similar setup:

  • ElasticSearch 5.2.2
  • Java 8u121
  • 31 GB heap

Here is what I tried to fix it:

  • added circuit breakers:

indices.breaker.total.limit : 70%
network.breaker.inflight_requests.limit : 65%
indices.breaker.request.limit : 30%

  • limited field data cache:

indices.fielddata.cache.size : 30%

  • limited max index buffer size:

indices.memory.max_index_buffer_size : 6gb

None of that really helped. The biggest pain is that it seems to happen for no apparent reason; it's apparently not caused by load. Here is what heap/CPU utilization looks like:

It doesn't recover by itself. I had to put this node out of its misery to prevent the whole cluster from being frozen.
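
In case it's useful to anyone: to check whether the circuit breakers above ever actually trip, I believe the per-breaker counters (including the "tripped" count) are exposed via the node stats API:

curl -s 'localhost:9200/_nodes/stats/breaker?pretty'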

Currently there are two things that help me:

  • a minimal failure-detection timeout, which helps to exclude the problem node as soon as it stops being responsive:
    discovery.zen.fd.ping_timeout: 1s
    discovery.zen.fd.ping_retries: 1

  • a monit rule that detects the misbehaving process and kills it:
    check process elasticsearch with pidfile /var/run/elasticsearch/elasticsearch.pid
    start program = "/etc/init.d/elasticsearch start"
    if total cpu >= 60% for 7 times within 10 cycles then exec "/usr/bin/killall -9 java"

I'm going to try to upgrade my ES cluster to version 5.3.0 and hope that it will help.

The upgrade to ES 5.3.0 has not resolved the issue.

It looks like the upgrade to ES 5.3.1 has resolved the issue. I think that's because ES 5.3.1 uses Lucene 6.4.2.

Ivan,

Did you roll back the configuration changes you mentioned in your first post after the 5.3.1 upgrade?

We have not had any more crashes since my last post, but that could have to do with the fact that we restarted the cluster quite often due to work we were doing on it. We upgraded to 5.3.0 last week, and this week we've started to reduce the heap size from 30GB to 12GB. I don't have any numbers, but it feels like the reduced heap size also improved the performance of our cluster.
We're going to keep reducing the heap size until we see a drop in performance.
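
For anyone wanting to try the same: we just change the two heap flags in config/jvm.options on each node (both set to the same value) and do a rolling restart; in our case roughly:

-Xms12g
-Xmx12g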

Anyways, I'll report back on our success with 5.3.1.

Cheers.

Hi Stefan,

No, I didn't roll back the configuration changes, but I suspect the cluster will stay stable even if I do. I will certainly experiment with that: I'm going to roll them back one by one and see what happens.

Also I noticed that after the upgrade, ES started working significantly faster for search queries:

p99 is the 99th-percentile latency, i.e. only the slowest 1% of requests take longer than this.

Our cluster has been stable since the upgrade to 5.3.2 and the heap size reduction from 30GB to 12GB. The graph in my first post clearly shows that the GC had too much garbage to collect; now the data that stays in memory after a GC run is around 5GB to 6GB. There was no drop in performance on our cluster due to the smaller heap.
I can only recommend that anyone consider sizing their heap as large as needed, but as small as possible.

Cheers

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.