Repeated cluster failures in multi-node cluster

Hi All,

Long time reader, very few posts over the years...

I've recently implemented a new multi-node cluster of the following topology, all on Ubuntu 18.04 LTS servers running v7.5.2 ES/LS/Kibana etc.:

ES-Node01 = Elasticsearch Master Eligible & Data Node
ES-Node02 = Elasticsearch Master Eligible & Data Node
ES-Node03 = Elasticsearch Master Eligible & Data Node
ES-IngestNode01 = Logstash and Elasticsearch Ingest Node (No master, no data)
ES-VisualNode01 = Kibana and Elasticsearch Coordinating Node (No master, no data)

The main requirement was around keeping the ES indices safe and available across multiple nodes, while I know the ingest pipelines and visualisation elements are vulnerable to a node failure at the moment.

The nodes are all in a single ES cluster, which was working really well when first set up, with indices/shards balancing evenly across the data nodes as desired. What I've noticed is that I've been getting repeated "crashes" of the cluster, where Kibana suddenly cannot connect and a cluster health check via the API fails at the same time. When I try to check by hand, I get authentication failures (the cluster is HTTPS-only with security logins enabled) and cannot connect with any account. If I check the Ubuntu servers, I find that the ES services on the Linux OS are running OK at that point.
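For reference, the OS-level checks are along these lines (standard systemd unit name for the .deb packages; the log path is the default one, with the cluster name as the file name):

# confirm the service is still running and look at its recent output
sudo systemctl status elasticsearch.service
sudo journalctl -u elasticsearch.service --since "1 hour ago"

# tail the cluster log directly (default location for the .deb install)
sudo tail -n 200 /var/log/elasticsearch/<cluster_name>.log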

Shard/index-wise, all data comes in through a Logstash pipeline, and all my indices are under ILM policies; each index is configured with 3 primary shards and 1 replica. There are currently around 550 shards in the cluster.
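(For completeness, the shard counts themselves are set via the index templates rather than the ILM policy itself — something roughly like this, where the template, pattern and policy names are just placeholders:)

curl -u <user> --cacert /path/to/ca.crt -H 'Content-Type: application/json' \
  -X PUT "https://es-node01.example.local:9200/_template/logs-template" -d '
{
  "index_patterns": ["winlogbeat-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index.lifecycle.name": "logs-policy"
  }
}'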

The only solution I have found so far is to restart the ES services, which clears the fault, but the problem often comes back a few days later and I'm struggling to find the root cause. I'm sure my topology is not too abnormal, so I'm hoping someone might be able to offer some troubleshooting advice to bring me back to normal operation. I've been using walkthroughs and forum posts to try to get to the bottom of it, including this one as an example: https://thoughts.t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode-ce196e20ba95

Thanks in advance and happy to add more to assist!

Hi @Carv,

Can you share exactly how you're trying to check that things work by hand, and the exact response you get? Also, can you confirm that this exact check does work when the cluster is running normally?

Can you also share the logs (ideally from all nodes) from around the time of the failure? Use https://gist.github.com/ since there'll be too much to include inline here.

Hi @DavidTurner

Thanks for the reply. I've created a gist with one node's log output and will add the others soon: https://gist.github.com/Carv01/2765a8090c050f2eed961676c8d82430 . It's truncated to fit as the whole file failed to upload, but is an indicative snapshot at least!

When I'm doing any checks, whether it's checking cluster health via a browser using "/_cluster/health?pretty" or the equivalent via curl, I get the same login failure error shown in that gist output. I've not tried much beyond checking allocation, cluster health, shard status, etc. via curl or a web browser, as they all fail with the same security login error.
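For example, the sort of check I'm running by hand looks like this (hostname, user and CA path are placeholders for my real ones):

curl -u elastic --cacert /path/to/ca.crt \
  "https://es-node01.example.local:9200/_cluster/health?pretty"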

Thanks again!

Thanks, I think this would explain why you can't authenticate, at least not with the native realm:

org.elasticsearch.action.UnavailableShardsException: at least one primary shard for the index [.security-7] is unavailable

Now the question is why that shard is red. Do you see any log messages about that, probably from the start of the outage? Can you set up a file realm user to fall back on when the native realm fails? This would let you use APIs like cluster health and allocation explain to investigate further.
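Something along these lines should do it, assuming the default .deb install paths (pick your own username and password):

# on each node, create an emergency file-realm superuser
sudo /usr/share/elasticsearch/bin/elasticsearch-users useradd emergency_admin -p 'changeme1' -r superuser

# then, even with .security-7 red, you can still run e.g.
curl -u emergency_admin --cacert /path/to/ca.crt \
  "https://es-node01.example.local:9200/_cluster/health?pretty"
curl -u emergency_admin --cacert /path/to/ca.crt \
  "https://es-node01.example.local:9200/_cluster/allocation/explain?pretty"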

Hi @DavidTurner

I've set up a file realm user as an emergency backup, which is working so far, thank you for that. I've output a load of base information using the CAT APIs and a few other bits, and updated that same gist with a new file containing the health check outputs: https://gist.github.com/Carv01/2765a8090c050f2eed961676c8d82430. When checked just now, the indices are still being initialised as the shards are brought online, currently around 88%. When I restarted the cluster yesterday during our conversation, however, it was not getting this far: it kept dropping to a low percentage and then starting to initialise again, which made me wonder if I had a corrupt shard at that point in time. That might not explain why the cluster goes from working fine to occasionally crashing and stopping working, though.
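(For reference, the outputs in the gist came from commands along these lines, using the file realm user from above — name and paths are placeholders:)

curl -u emergency_admin --cacert /path/to/ca.crt "https://es-node01.example.local:9200/_cat/health?v"
curl -u emergency_admin --cacert /path/to/ca.crt "https://es-node01.example.local:9200/_cat/nodes?v"
curl -u emergency_admin --cacert /path/to/ca.crt "https://es-node01.example.local:9200/_cat/indices?v"
curl -u emergency_admin --cacert /path/to/ca.crt "https://es-node01.example.local:9200/_cat/shards?v"
curl -u emergency_admin --cacert /path/to/ca.crt "https://es-node01.example.local:9200/_cat/allocation?v"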

Still, progress. Thanks again! If you need any further outputs then please let me know.

The missing information is this:

The logs you've shared so far are swamped by messages about the .security-7 index being unavailable, but they don't tell us why. We need to see earlier logs (and/or logs with these messages filtered out) to find out why that index is unavailable.

Hi David, I've had a trawl through a lot of log files and have found the following and updated the gist:

2020-01-28-1 log = .security-7 index created by API
No mention of .security-7 index/shards in 2020-01-29 or 2020-01-30 log files.
2020-01-31-1 log = .security-7 index failed to obtain shard lock, shard is marked as failed

I'm not sure if the log indicates what exactly caused the lock failure or not, but it might help guide us. Thanks :slight_smile:

P.S. I'm not sure, but reading this issue https://github.com/elastic/elasticsearch/issues/23199 it looks very similar to what I'm experiencing.

No, sorry, that only really tells us that the shard already failed. It's not really possible to help much without seeing a much more complete set of logs. You can get rid of the ones saying "at least one primary shard for the index [.security-7] is unavailable" if you want to save a bit of space, and we only really care about the messages from around the start of the outage (say ±30min), but individual log messages are rarely useful.
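Something like this would do for the filtering, assuming the default .deb log location (adjust the file name to your cluster name or the rolled log you're looking at):

grep -v 'at least one primary shard for the index \[.security-7\]' \
  /var/log/elasticsearch/<cluster_name>.log > filtered.log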

Hi David, OK no worries, I've updated the gist (https://gist.github.com/Carv01/2765a8090c050f2eed961676c8d82430) with a fuller log output for that day: before the issue occurred, during, and for a few hours afterwards.
Thanks.

Thanks, that helps a bit. This node spent 33 seconds in a GC pause so the master figured it had failed and removed it from the cluster:

[2020-01-31T10:36:50,412][WARN ][o.e.m.j.JvmGcMonitorService] [me-es-es02.medschl.private.cam.ac.uk] [gc][old][251438][10] duration [33.3s], collections [1]/[36.6s], total [33.3s]/[5m], memory [2.1gb]->[297.4mb]/[2.9gb], all_pools {[young] [43.5mb]->[109.8mb]/[133.1mb]}{[survivor] [11.8mb]->[11.8mb]/[16.6mb]}{[old] [2.1gb]->[175.7mb]/[2.8gb]}
[2020-01-31T10:36:50,511][WARN ][o.e.m.j.JvmGcMonitorService] [me-es-es02.medschl.private.cam.ac.uk] [gc][251438] overhead, spent [33.3s] collecting in the last [36.6s]
[2020-01-31T10:36:53,241][INFO ][o.e.c.c.Coordinator      ] [me-es-es02.medschl.private.cam.ac.uk] master node [{me-es-es01.medschl.private.cam.ac.uk}{ne8V9ue7TYmXSqgN_cTnpg}{TeQeL9e8S8C1kzwv5bqslA}{172.27.118.107}{172.27.118.107:9300}{dim}{xpack.installed=true}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{me-es-es01.medschl.private.cam.ac.uk}{ne8V9ue7TYmXSqgN_cTnpg}{TeQeL9e8S8C1kzwv5bqslA}{172.27.118.107}{172.27.118.107:9300}{dim}{xpack.installed=true}] failed [3] consecutive checks

Then it recovers. An hour or so later it nearly suffers the same fate, but just scrapes through:

[2020-01-31T11:42:03,453][WARN ][o.e.m.j.JvmGcMonitorService] [me-es-es02.medschl.private.cam.ac.uk] [gc][old][255302][11] duration [27.8s], collections [1]/[29.9s], total [27.8s]/[5.5m], memory [2.2gb]->[821.9mb]/[2.9gb], all_pools {[young] [70.1mb]->[48.3mb]/[133.1mb]}{[survivor] [11.3mb]->[4.9mb]/[16.6mb]}{[old] [2.1gb]->[220.9mb]/[2.8gb]}

40 minutes later it sees another GC pause for ≥30 sec and drops out of the cluster again:

[2020-01-31T12:22:09,183][WARN ][o.e.m.j.JvmGcMonitorService] [me-es-es02.medschl.private.cam.ac.uk] [gc][257611] overhead, spent [32.5s] collecting in the last [34.1s]

Were you doing anything of note to the cluster at these times, e.g. a very large or expensive search? It looks like it has quite a small heap, so there's a limit to how hard you can push it before it starts to struggle.

Also have you disabled swap? It shouldn't take 30+ seconds to GC such a small heap, but it might if swap is enabled.
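For reference, on Ubuntu that's usually something like this (or, alternatively, lock the heap in memory):

# turn swap off now, and keep it off by commenting out any swap lines in /etc/fstab
sudo swapoff -a

# alternative: in /etc/elasticsearch/elasticsearch.yml
#   bootstrap.memory_lock: true
# (with systemd you also need LimitMEMLOCK=infinity in a service override for this to take effect)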

Hi David,

Thanks for the feedback, that's really great. I can't say for certain what the cluster was being used for at that moment, unfortunately, as we have multiple users with access who could be querying for data. We do get a fairly regular stream of incoming data from 4 domain controllers (Windows winlogbeat), 4 mail servers (Exchange logs via winlogbeat and filebeat) and a syslog source. I wonder if the environment is not properly sized to handle the load of additional queries on top of that, and whether I could make some changes?

The 3 ES data/master nodes were configured with 2 vCPU and 4GB RAM, with 3GB assigned to the ES heap. I've just upped the spec on all 3 nodes to 4 vCPU and 6GB RAM, with 4GB assigned to the ES heap. Would I be right in saying that the ES JVM heap size influences the number of shards that are supportable, so a low heap with a high shard count will perform poorly or maybe even grind to a halt?

With regard to swap, this was enabled by default on my VMware VM Ubuntu template. I've disabled this now and restarted all the VMs to bring them back to in-memory only.

Thanks again, really appreciate it.

P.S. For reference, in case the cluster stats output (with shard counts) helps inform the sizing discussion:

{
  "_nodes" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "cluster_name" : "es-prdcluster",
  "cluster_uuid" : "Gtx-g6EBS6WSSqkCgpxIKw",
  "timestamp" : 1583337133416,
  "status" : "yellow",
  "indices" : {
    "count" : 86,
    "shards" : {
      "total" : 449,
      "primaries" : 248,
      "replication" : 0.8104838709677419,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 6,
          "avg" : 5.22093023255814
        },
        "primaries" : {
          "min" : 1,
          "max" : 3,
          "avg" : 2.883720930232558
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.8023255813953488
        }
      }
    },
    "docs" : {
      "count" : 237410357,
      "deleted" : 3165647
    },
    "store" : {
      "size" : "475.4gb",
      "size_in_bytes" : 510474672954
    },
    "fielddata" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "total_count" : 0,
      "hit_count" : 0,
      "miss_count" : 0,
      "cache_size" : 0,
      "cache_count" : 0,
      "evictions" : 0
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 6729,
      "memory" : "499.4mb",
      "memory_in_bytes" : 523679947,
      "terms_memory" : "225.8mb",
      "terms_memory_in_bytes" : 236805394,
      "stored_fields_memory" : "229.4mb",
      "stored_fields_memory_in_bytes" : 240621760,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "3.9kb",
      "norms_memory_in_bytes" : 4032,
      "points_memory" : "22.8mb",
      "points_memory_in_bytes" : 23988045,
      "doc_values_memory" : "21.2mb",
      "doc_values_memory_in_bytes" : 22260716,
      "index_writer_memory" : "92.9mb",
      "index_writer_memory_in_bytes" : 97469900,
      "version_map_memory" : "2.6mb",
      "version_map_memory_in_bytes" : 2782834,
      "fixed_bit_set" : "2.5mb",
      "fixed_bit_set_memory_in_bytes" : 2676928,
      "max_unsafe_auto_id_timestamp" : 1583336925190,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 5,
      "coordinating_only" : 1,
      "data" : 3,
      "ingest" : 2,
      "master" : 3,
      "ml" : 0,
      "voting_only" : 0
    },
    "versions" : [
      "7.5.2"
    ],
    "os" : {
      "available_processors" : 16,
      "allocated_processors" : 16,
      "names" : [
        {
          "name" : "Linux",
          "count" : 5
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 18.04.2 LTS",
          "count" : 5
        }
      ],
      "mem" : {
        "total" : "27.1gb",
        "total_in_bytes" : 29138182144,
        "free" : "702.6mb",
        "free_in_bytes" : 736747520,
        "used" : "26.4gb",
        "used_in_bytes" : 28401434624,
        "free_percent" : 3,
        "used_percent" : 97
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 9
      },
      "open_file_descriptors" : {
        "min" : 309,
        "max" : 42490,
        "avg" : 18780
      }
    },
    "jvm" : {
      "max_uptime" : "5.5h",
      "max_uptime_in_millis" : 19991310,
      "versions" : [
        {
          "version" : "13.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "13.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 5
        }
      ],
      "mem" : {
        "heap_used" : "3.2gb",
        "heap_used_in_bytes" : 3486837528,
        "heap_max" : "16.8gb",
        "heap_max_in_bytes" : 18114150400
      },
      "threads" : 292
    },
    "fs" : {
      "total" : "976.8gb",
      "total_in_bytes" : 1048904966144,
      "free" : "422.3gb",
      "free_in_bytes" : 453460488192,
      "available" : "375.3gb",
      "available_in_bytes" : 403061383168
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 5
      },
      "http_types" : {
        "security4" : 5
      }
    },
    "discovery_types" : {
      "zen" : 5
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 5
      }
    ]
  }
}

You should not set the heap to more than 50% of the available RAM. Ideally less, so you have more space for the filesystem cache.
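i.e. in /etc/elasticsearch/jvm.options, something like this on a 6GB node (min and max set to the same value, at no more than half the RAM):

-Xms3g
-Xmx3g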

Each shard consumes some heap, so there's a limit to how many shards a given node can support. Fewer, larger shards are often better in terms of memory usage, and 500 shards on a 5x2GB cluster is more than twice the usual recommended limit. It all depends on exactly how you're using it, of course.

That said, I suspect that swapping was the main problem here. This doesn't look much like a running-out-of-heap-space failure.

Hi David,

Thanks, this is my first "proper" cluster, so I'm not surprised that I've missed something key. This is very useful reading!

I'm lucky that my ESXi hosts are quite memory-rich, so I can increase the memory again to be more in line with that sizing guidance. I was going to ask about the rule of thumb mentioned there, which you've alluded to:

A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured

With my cluster consisting of 3 data/master nodes and 2 ancillary nodes that hold neither data nor master roles, I imagine I should focus my efforts on the set of 3. My shards are quite small, but I tried to ensure I had multiple copies to tolerate a data node failure (3 primary, 1 replica, as set by the ILM policy). If I follow that rule at 500 shards, I'd be looking at 3 nodes with roughly 8.3GB (read 9GB!) of heap each, which would need around 16-18GB of VM RAM per node to satisfy the other condition of heap being at most 50% of available RAM. I wonder if I would be better served reducing my shard count as well as increasing RAM/heap, given my small cluster and shard storage size, as this seems like quite a lot of memory (though I accept it is the recommended best practice). Would you have any thoughts on that at all? I don't want to cause more problems by further mis-sizing, of course.

I too am hoping that it was swapping causing a large GC lag, like you say. I've noticed the cluster appears much more performant now, so already I am seeing a benefit.

Thanks!

If you're ingesting logs from just 10 sources then 3 primaries for each index is probably 2.9 too many :grin: I expect you'll get benefits from dropping down to 1 primary and shrinking the existing indices.

How much data are you managing here, and at what rate are you indexing new documents? What's your ILM rollover config? Are you generating shards with sizes in the 10s-of-GBs range?

Hi David,

Alas, in trying to make the cluster more resilient I might have made things worse!

I store one index per day for each beat type (Exchange IIS, Exchange Win logs, AD Win Logs, Syslog)

Largest index = 43GB (IIS logs), up to 22 million documents stored.

Average index size for IIS logs = 26GB approx.
Average index size for Exchange Winlogbeat = 17.2GB approx.
Average index size for AD Winlogbeat = 5.6GB approx.
Average index size for Syslog source = 14MB approx.

Retaining all indices for 41 days, then they are purged as they age out. All indices are configured for 3 primary shards and 1 replica.

I lost some monitoring data a while ago due to a separate issue, so I'm not sure about the peak index rate for all indices. I do know that when I restart my domain controller winlogbeat agents I get a spike of just under 2.5k docs/s across all shards (1.2k docs/s on primary shards), and then it settles down to a regular flow of 24-60 docs/s total (12-32 docs/s on primaries). I imagine the IIS logs from my Exchange environment are a little busier than that. I've currently got my Exchange server beats pointing back to an older stand-alone ES 6.x instance without monitoring during this repair/recovery work, so I can't say what they run at unfortunately.

So from that and the extra reading I've done, it might be prudent to reduce the primary shard count for the AD Winlogbeat and Syslog sources to 1 primary, 1 replica (for resiliency?), and reduce the Exchange IIS and Winlogbeat sources to 2 primary, 1 replica as they are a little larger in overall index size?

Thanks again :sweat_smile:

I think 1 primary on all these indices would be fine, and yes, 1 replica is needed for resilience.

Do you really need to retain things for exactly 41 days, or can you tolerate retaining some logs for a bit longer? If there's wiggle room, consider rolling them over at a lower frequency than daily too. Could you try a max size of 50GB and a max age of 7 days? This'll roll over your IIS logs every day or two but will combine ~3 days of Exchange logs and ~7 days of AD logs and syslogs giving you more reasonably-sized shards. Larger max-ages would be even better, but will further decrease the accuracy of your retention policy.

By my reckoning this should give you ~100 shards in total. You might even be able to fit that on nodes with a 2GB heap.
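For reference, a policy along these lines would do it; the policy name here is just an example, and the delete min_age is something you'd tune to your ~41-day retention:

curl -u <user> --cacert /path/to/ca.crt -H 'Content-Type: application/json' \
  -X PUT "https://es-node01.example.local:9200/_ilm/policy/logs-rollover" -d '
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "7d" }
        }
      },
      "delete": {
        "min_age": "41d",
        "actions": { "delete": {} }
      }
    }
  }
}'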

Hi David,

I'm essentially retaining one month of data, but then giving myself some overhead, so I'm not wedded to 41 days exactly, no. I'd be more interested in system stability and load than precise daily index rollover.

Currently I'm rolling over based on the Logstash pipeline using a date variable to specify which index the beat logs go into, so it naturally creates a new one per day. Would this need to change, i.e. remove the date variable and keep the indices named consistently, and shift to an ILM policy based on size rollover instead to control it as you discussed: https://www.elastic.co/guide/en/elasticsearch/reference/current/using-policies-rollover.html ?

Thanks again for all your help, I've bumped the cluster data/master nodes to 12GB RAM and 6GB heap per node for ES, with swap disabled still of course, and the whole thing appears much more stable and faster to query now. :cowboy_hat_face:

Yes, this is basically the point of the rollover feature. It's hard to pick a date-based naming scheme for indices that generates good-sized shards, and that tends to result in clusters like yours with more shards than they need. ILM addresses this.
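In the Logstash elasticsearch output that usually means dropping the date variable from the index name and letting the plugin's ILM settings drive rollover — roughly like this, where the host, alias, pattern and policy names are just examples:

output {
  elasticsearch {
    hosts => ["https://es-ingest01.example.local:9200"]
    # user/password/cacert settings as you have them already
    ilm_enabled => true
    ilm_rollover_alias => "winlogbeat-ad"
    ilm_pattern => "{now/d}-000001"
    ilm_policy => "logs-rollover"
  }
}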

Hi David,

Awesome, thanks. I'd only been using ILM for phase transitions beforehand. I've reconfigured a test index to use ILM for rollover and can see it working now, so I'll spec that into the design for the production indices, reconfigure for rollover on a bigger index size/longer date range, and enjoy a lower shard count from there. :partying_face: :tada:

I'll try and stop bothering you now, but it's been great so thank you for your advice :smiley:
