5.5.3 Node stalls for no apparent reason

(Q) #1

We're experiencing an intermittent issue for weeks/months since upgrading to elasticsearch 5.5.3 (from 5.2.x)

In short, one or both elasticsearch nodes will suddenly and without no apparent warning stop accepting search requests. In fact it appears as if it literally refuses to do any work at all. The world stops. Thats more than just poetic. Maybe GC is stopping the world?!?!?

We run a 2-node cluster (with minimum master nodes set to 1) on Ubuntu 16.04 LTS. Each node running on its own server. Each node has a 16G heap

Servers are very capable bare metal Dell blade servers, with SSDs, 180+ GB memory, 80 cpu cores etc. #justsayin

We run a few other things on that server and if its relevant I'll disclose but basically the server is way under-utilized all day

The outage happens any time of day. Even in the wee hours of the morning when website traffic is at an all-time low. The outage will last 5-20mins. Sometimes the stalled node mysteriously self-recovers and starts accepting requests once more. But more often than not we manually restart the stalled elasticsearch node (the node, not the server).

Noteworthy: The Kibana/X-Pack monitoring graph for the stalled node mysteriously goes blank for the duration of the outage. No data. Nothing

Here is excerpts of the elasticsearch log file(s). Too large to gist!

app1 is the elastic node that stalls. so its log file is incredibly noisy and verbose

app2 is the other node which of course observes the issue from a different but useful viewpoint

Some commentary on app1 log:

  • The outage appears to start around 2018-01-03T18:50
  • Some ParsingException's are thrown before that but i think/feel its unrelated
  • A build up of "monitoring execution is skipped until previous execution terminated" log lines around 2018-01-03T18:51. Seems to suggest something awry?
  • By 2018-01-03T18:52:36, the outage is in full swing. Numerous EsRejectedExecutionException exceptions thrown

Commentary on app2 log:

  • at 2018-01-03T18:55:08, app2 learns that app1 has left the cluster and it all goes downhill from there

We've been running elasticsearch for 3/4 years. Starting with 1.7, then 2.0, then 5.2, and now 5.5.x. Its been stable as a rock until we upgraded to 5.5

We're out of ideas on what the problem may be. Early on we thought the problem might be GC related and so we progressively increased heap size from 8G to 16G but the problem persists

We tried googling over and over again to find someone else on the intertubes with the same problem but came up empty

Please help!?!?!

(Christian Dahlqvist) #2

If you have 2 nodes that are both master eligible, you should have minimum_master_nodes set to 2 in order to avoid split-brain scenarios. Can you change this to see if it makes a difference?

(Q) #3

Thanks for quick reply. Worth trying out

(Christian Dahlqvist) #4

If you are looking to be able to handle one node going down, you can install one dedicated master node (which can be a lot smaller than the data nodes) on one of the hosts. This will allow a master too be elected even if one of the nodes is not available, which will improve resiliency. If you lose the host with 2 nodes, your cluster will however be in a red state just as with the current configuration. This is why we always recommend 3 master eligible nodes on separate hosts for production.

(Q) #5

thanks again

is adding a 3rd data-only node as simple as setting node.data to false for that particular node?


(Christian Dahlqvist) #6

Yes. How to set up a dedicated master node is described a bit higher up on that page.

(Q) #7

Me again

We've changed our minimum_master_nodes setting from 1 to 2

(Also, we've just realized that our network.host setting is probably not right. Our nodes are co-ordinating over a public network segment and not over a private one as we intended. We fixed that too)

Nodes no longer stall but we're still observing the occasional flurry of EsRejectedExecutionExceptions


[2018-02-16T12:54:27,664][DEBUG][o.e.a.s.TransportSearchAction] [redacted_host2] [9947854] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [redacted_host1][][indices:data/read/search[phase/fetch/id]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.transport.TcpTransport$RequestHandler@300401b7 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6d6bf90a[Running, pool size = 121, active threads = 121, queued tasks = 1000, completed tasks = 14100175]]

This lasts for a few mins and then subsides as if nothing happened

Any ideas?

(Christian Dahlqvist) #8

How many indices/shards do you have in the cluster?

(Q) #9

6 open indices (excluding system indices). 5 primary shards per index

(Only 3 indices are actively indexed and searched on. The rest is backups/rollovers)

The biggest and busiest index (by a long way) has 1.6M docs. Search Rate is about 400/s

(Christian Dahlqvist) #10

What is the size of the indices/shards?

(Q) #11

largest index is 2.8GB. The unused (no searches, no indexing) stale copy of that index is 2.7GB

The rest of the indices collectively less than 20MB

(Christian Dahlqvist) #12

2.8GB is not very large for a shard, so you may want to bring the number of primary shards down to 1 for the indices using the shrink index API. This will ease the pressure on the queues and may affect latencies (positive or negative), so it may be worth benchmarking it first...

(Q) #13

ok. will try that thanks

can it also help to add more nodes to the cluster?

(Christian Dahlqvist) #14

Yes, that could help too.

(Q) #15

I've reduced the num shards from 5 to 1

Will let you know how this goes

Have a great weekend and thanks for all the assistance! Amazing

(Q) #16

Since reducing shards from 5 to 1 we're seeing new behavior

Just 20 mins ago, one of the nodes appeared to stall again. Search and GET requests suddenly taking a long time to run. Our client (official elasticsearch js client) times out requests after 5 seconds. Our monitoring system picked up hundreds of timeouts

This time around, no EsRejectedExecutionException entries in logs

Elasticsearch logs set to DEBUG level. No WARN or ERROR entries for the duration of the stall. Just these...

[2018-02-17T09:50:05,287][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [filters.occasion.option_value.raw][13] took [240.2micros]
[2018-02-17T09:50:05,287][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [filters.lense_type.agg_key][6] took [184.6micros]
[2018-02-17T09:50:05,287][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [filters.length_detail-bottoms.agg_key][6] took [165.3micros]
[2018-02-17T09:50:05,287][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [filters.leg_length.option_value.raw][3] took [111.5micros]
[2018-02-17T09:50:05,287][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [filters.fit.agg_key][21] took [237.8micros]
[2018-02-17T09:50:05,287][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [filters.style.agg_key][565] took [1.6ms]
[2018-02-17T09:50:05,287][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [filters.fabric.option_value.raw][4] took [149micros]
[2018-02-17T09:50:05,287][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [filters.strap_material.agg_key][9] took [237.6micros]
[2018-02-17T09:50:05,287][DEBUG][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [filters.size.agg_key][3] took [193.6micros]


Search rate across the cluster was 40-50/s. Relatively low traffic (for us)

(Q) #17

One more worthy observation, this time around its the REPLICA node that is acting up. So far its two occasions in a row. The MASTER node seems to be more stable

(Q) #18

Just an update. We've reduced heap size from 16G to 4G and seeing a significant reduction in extreme p99 latencies

We did that progressively ie. 16, 12, 10, 8, 6, 4

(Q) #19

Update. Reverted heap size to 8GB as we regularly perform large scan & scroll queries which needs a bit more heap else we run into OLD GC runs