Forcemerge of segments cause cluster to RED status

Hi all. I have 7 node cluster with elasticsearch version 7.1.1 which was upgraded from 6.7.2. Each node has 31gb heap size. Cluster contains 75 daily indexes 7 shards each and 1 replica shard. Each index has 500gb data. If I try to forcemerge segments into single one for any index - cluster stops to indexing data and after some time nodes begin to leave cluster and it fails to RED status. It returns to normal only when forcemerge operation aborted or cluster fully restarted. Problem is appeared only after upgrade from 6.7.2 to 7.1.1 via full cluster upgrade. Thanks.

Welcome! :slight_smile:

Do you have logs from the nodes that are leaving the cluster?

Yes, this is the log from the first node which faced problems:

[2019-06-06T19:44:01,660][INFO ][o.e.c.s.ClusterApplierService] [elk4] master node changed {previous [{elk4}{9wBIqy8MS3irgQFkzX724g}{JVOYWz2oSR6s4hSxQKGeIA}{AAA.AAA.AAA.AAA}{AAA.AAA.AAA.AAA:9300}{xpack.installed=true}], current []}, term: 6868, version: 7630, reason: becoming candidate: Publication.onCompletion(false)
[2019-06-06T19:44:01,660][WARN ][o.e.c.s.MasterService    ] [elk4] failing [create-index-template [.management-beats], cause [api]]: failed to commit cluster state version [7631]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
	at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1354) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1294) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:88) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1261) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.1.1.jar:7.1.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.ElasticsearchException: publication cancelled before committing: timed out after 3m
	at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:85) ~[elasticsearch-7.1.1.jar:7.1.1]
	... 5 more
[2019-06-06T19:44:01,698][DEBUG][o.e.a.a.i.t.p.TransportPutIndexTemplateAction] [elk4] master could not publish cluster state or stepped down before publishing action [indices:admin/template/put], scheduling a retry
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
	at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$3.onFailure(Coordinator.java:1354) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:101) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:54) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1294) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:124) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:88) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1261) ~[elasticsearch-7.1.1.jar:7.1.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.1.1.jar:7.1.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.ElasticsearchException: publication cancelled before committing: timed out after 3m
	at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:85) ~[elasticsearch-7.1.1.jar:7.1.1]
	... 5 more
[2019-06-06T19:44:02,728][WARN ][o.e.t.OutboundHandler    ] [elk4] send message failed [channel: Netty4TcpChannel{localAddress=/AAA.AAA.AAA.AAA:43928, remoteAddress=BBB.BBB.BBB.BBB/BBB.BBB.BBB.BBB:9300}]
javax.net.ssl.SSLException: SSLEngine closed already
	at io.netty.handler.ssl.SslHandler.wrap(...)(Unknown Source) ~[?:?]
[2019-06-06T19:44:05,217][INFO ][o.e.c.c.JoinHelper       ] [elk4] failed to join {elk4}{9wBIqy8MS3irgQFkzX724g}{JVOYWz2oSR6s4hSxQKGeIA}{AAA.AAA.AAA.AAA}{AAA.AAA.AAA.AAA:9300}{xpack.installed=true} with JoinRequest{sourceNode={elk4}{9wBIqy8MS3irgQFkzX724g}{JVOYWz2oSR6s4hSxQKGeIA}{AAA.AAA.AAA.AAA}{AAA.AAA.AAA.AAA:9300}{xpack.installed=true}, optionalJoin=Optional[Join{term=6869, lastAcceptedTerm=6868, lastAcceptedVersion=7631, sourceNode={elk4}{9wBIqy8MS3irgQFkzX724g}{JVOYWz2oSR6s4hSxQKGeIA}{AAA.AAA.AAA.AAA}{AAA.AAA.AAA.AAA:9300}{xpack.installed=true}, targetNode={elk4}{9wBIqy8MS3irgQFkzX724g}{JVOYWz2oSR6s4hSxQKGeIA}{AAA.AAA.AAA.AAA}{AAA.AAA.AAA.AAA:9300}{xpack.installed=true}}]}
org.elasticsearch.transport.RemoteTransportException: [elk4][AAA.AAA.AAA.AAA:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: received a newer join from {elk4}{9wBIqy8MS3irgQFkzX724g}{JVOYWz2oSR6s4hSxQKGeIA}{AAA.AAA.AAA.AAA}{AAA.AAA.AAA.AAA:9300}{xpack.installed=true}

The numbers do not add up for me. 75 indices each with 7 primary and 7 replica shards gives 1050 shards per day. If each index is 500GB in size you are generating 75 * 500GB = 37.5TB of indexed data per day across these 7 nodes, which to me sounds a bit unlikely. If this is accurate I would be interested in learning how the nodes are configured and what the specification of the hosts are.

It looks like the cluster is having problems publishing the cluster state, which could indicate that it is very large and/or that nodes are heavily loaded. If you have not already, I would recommend you have a look at this blog post.

If you can provide us with the full output of the cluster stats API, we should be able to determine the correct amount of data, indices and shards in the cluster and get a better feeling for how it is coping.

1 Like

I think this indicates that you have adjusted some of the settings listed here under the following warning:

If you adjust these settings then your cluster may not form correctly or may become unstable or intolerant of certain failures.

It also strongly suggests that your master nodes are overloaded, and possibly that you're not using dedicated masters. If you're indexing TB of data per day you almost certainly need to isolate your master nodes from your data nodes. From the docs:

To ensure that your master node is stable and not under pressure, it is a good idea in a bigger cluster to split the roles between dedicated master-eligible nodes and dedicated data nodes [...] It is important for the stability of the cluster that master-eligible nodes do as little work as possible.

2 Likes

75 daily indexes - it's overall count. One day = one index. So I have 75 indexes in the cluster and 7 primary shards and 1 replica shard for each index. It's 75 * 7 (primary) * 2 (replica) = 1050 in total for all cluster. Curator deletes indexes every day to keep only 75 indexes in the cluster. I can't say that cluster is overloaded: heap utilization us something about 27Gb on each node and processor is utilized only on half. I had to increase cluster.publish.timeout up to 180 seconds (currently it's 90 seconds) because cluster wasn't able to form (I saw errors in logs about node joins fails because of timeout 60 seconds). The cluster is working good: I'm able to search quickly, it's recovering well if one of the nodes has network issue or hard drive failed or even node shutdown. And merge of segments worked fine until the upgrade. I don't think that my cluster is so big to separate master nodes from data nodes because I don't have any problems with the election in any case. Each node has 10 hard drives which are mounted as separate data paths. Which information should I provide to find the problem with merge segments?

There are some inconsistencies here.

This strongly suggests that your cluster is overloaded. So does the fact that a force-merge causes the master to stand down. More specifically, these symptoms suggest that IO latency is sometimes in the region of seconds-to-minutes rather than milliseconds, which is high enough for master-eligible nodes to be considered faulty. A possible cause is insufficient IO bandwidth, and this could be the case even if CPU and heap look reasonable. It also indicates that this isn't quite true:

Elections and node joins are intimately related processes, so timeouts during node joining are effectively a problem with elections.

It is perhaps worth noting that the master-eligible nodes in Elasticsearch 7 are more sensitive to IO latency, because they are no longer vulnerable to a power-safety issue that existed in earlier versions. I think this explains the difference in behaviour you're seeing since your upgrade.

We can confirm this theory, or otherwise, if you were to set the following settings ...

logger.org.elasticsearch.cluster.service: TRACE
logger.org.elasticsearch.cluster.coordination: TRACE
logger.org.elasticsearch.gateway.MetaStateService: TRACE

... and then share logs from all of the nodes from an outage caused by a force-merge as you describe. Or you could add dedicated masters and see if the problem persists.

1 Like

Will SSD drives for the entire cluster solve the problem? Also is it possible just to add several master nodes and move current nodes to the data role e.g. I will add 3 nodes with the master role and then switch current nodes only to data?

Possibly, yes.

Yes. Add more master nodes first, then do a rolling restart of the other nodes, restarting each one in turn and setting node.master: false while it's shut down.