Master role does not switch automatically

Hello everyone,

I've recently been having an issue moving the master role from node to node.

I have 3 nodes in my cluster and all 3 are master-eligible:
[screenshot of the three nodes]

I first disabled shard allocation on my second node and let Elasticsearch move all the shards across the 2 other nodes.
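
For reference, the drain looked roughly like this (172.31.27.9 is my second node; the curl user is a placeholder for whatever account you use):

# exclude the node so its shards relocate to the other two; set the value back to null once the node rejoins
curl -u elastic -X PUT 'http://172.31.27.9:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": { "cluster.routing.allocation.exclude._ip": "172.31.27.9" }
}'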

I then shut down Elasticsearch on the second node to perform a disk change on the VM and add storage (while changing the mount point names).

But whenever I shut down the master node, the role never gets transferred to another node and I'm stuck with a cluster without any available entry point.

Logs from the other nodes:

[2020-01-08T11:43:38,012][WARN ][r.suppressed             ] [hostname] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
        at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:189) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:534) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:415) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:568) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:598) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.1.jar:7.5.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-01-08T11:43:38,020][WARN ][r.suppressed             ] [hostname] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
        at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:189) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:534) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:415) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:568) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:598) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.1.jar:7.5.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-01-08T11:43:38,141][WARN ][r.suppressed             ] [hostname] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
        at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:189) ~[elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:534) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:415) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:568) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:598) [elasticsearch-7.5.1.jar:7.5.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.1.jar:7.5.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]

My config file (the same on all 3 nodes except the IP):

cluster.name : MyCluster
node.name : ${HOSTNAME}
path.data : /data
path.logs : /var/log/elasticsearch
network.host : my_ip
http.port : 9200
discovery.seed_hosts : ["ip_node1","ip_node2","ip_node3"]
discovery.zen.minimum_master_nodes : 1
cluster.max_shards_per_node: 2000
xpack.monitoring.enabled: false
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: cert/path
xpack.security.transport.ssl.truststore.path: cert/path

If I restart the node that was previously master, the master role gets switched the moment the node restarts.

It's not the first time I've had this issue, and I've never found an explanation of how to properly correct it, or whether there is a misconfiguration on my side...

Hi @The-oo,

I believe the log files from the other 2 nodes should contain a message from the ClusterFormationFailureHelper that hopefully explains why a new master could not be elected. Please see if you can find that message and include it here.
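
If it helps, something like this should pull those messages out (log path and cluster name taken from the config you posted):

grep ClusterFormationFailureHelper /var/log/elasticsearch/MyCluster.log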

Here are the logs from the moment I turned off my master (I removed all the Java errors to avoid pasting thousands of lines).
In this situation, the elected master was 172.31.27.9. I first moved every shard from this node to the 2 others and then simply shut down the service:

systemctl stop elasticsearch

Logs from 172.31.27.8 (cluster formation messages only):

[2020-01-08T11:40:13,782][WARN ][o.e.c.c.ClusterFormationFailureHelper] [cyres-elas03a] master not discovered or elected yet, an election requires at least 2 nodes with ids from [GmYLyNdCT-OS3-vHYFhIdg, icISibvhTd-hGnJm9_TH6A, 5DMctAzETmS90Rkz0EWRqg], have discovered [{cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [172.31.27.9:9300, 172.31.27.10:9300] from hosts providers and [{cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}, {cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03b}{5DMctAzETmS90Rkz0EWRqg}{TZCHaT2DQTuO_tzssL4dHw}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 14, last-accepted version 31920 in term 14
[2020-01-08T11:40:23,783][WARN ][o.e.c.c.ClusterFormationFailureHelper] [cyres-elas03a] master not discovered or elected yet, an election requires at least 2 nodes with ids from [GmYLyNdCT-OS3-vHYFhIdg, icISibvhTd-hGnJm9_TH6A, 5DMctAzETmS90Rkz0EWRqg], have discovered [{cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [172.31.27.9:9300, 172.31.27.10:9300] from hosts providers and [{cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}, {cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03b}{5DMctAzETmS90Rkz0EWRqg}{TZCHaT2DQTuO_tzssL4dHw}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 14, last-accepted version 31920 in term 14
[2020-01-08T11:41:08,577][WARN ][o.e.c.c.ClusterFormationFailureHelper] [cyres-elas03a] master not discovered or elected yet, an election requires at least 2 nodes with ids from [GmYLyNdCT-OS3-vHYFhIdg, icISibvhTd-hGnJm9_TH6A, 5DMctAzETmS90Rkz0EWRqg], have discovered [{cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [172.31.27.9:9300, 172.31.27.10:9300] from hosts providers and [{cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}, {cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03b}{5DMctAzETmS90Rkz0EWRqg}{TZCHaT2DQTuO_tzssL4dHw}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 15, last-accepted version 31921 in term 15
[2020-01-08T11:41:18,578][WARN ][o.e.c.c.ClusterFormationFailureHelper] [cyres-elas03a] master not discovered or elected yet, an election requires at least 2 nodes with ids from [GmYLyNdCT-OS3-vHYFhIdg, icISibvhTd-hGnJm9_TH6A, 5DMctAzETmS90Rkz0EWRqg], have discovered [{cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [172.31.27.9:9300, 172.31.27.10:9300] from hosts providers and [{cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}, {cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03b}{5DMctAzETmS90Rkz0EWRqg}{TZCHaT2DQTuO_tzssL4dHw}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 15, last-accepted version 31921 in term 15
[2020-01-08T11:42:03,470][WARN ][o.e.c.c.ClusterFormationFailureHelper] [cyres-elas03a] master not discovered or elected yet, an election requires at least 2 nodes with ids from [GmYLyNdCT-OS3-vHYFhIdg, icISibvhTd-hGnJm9_TH6A, 5DMctAzETmS90Rkz0EWRqg], have discovered [{cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [172.31.27.9:9300, 172.31.27.10:9300] from hosts providers and [{cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}, {cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03b}{5DMctAzETmS90Rkz0EWRqg}{TZCHaT2DQTuO_tzssL4dHw}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 16, last-accepted version 31922 in term 16
[2020-01-08T11:42:13,471][WARN ][o.e.c.c.ClusterFormationFailureHelper] [cyres-elas03a] master not discovered or elected yet, an election requires at least 2 nodes with ids from [GmYLyNdCT-OS3-vHYFhIdg, icISibvhTd-hGnJm9_TH6A, 5DMctAzETmS90Rkz0EWRqg], have discovered [{cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [172.31.27.9:9300, 172.31.27.10:9300] from hosts providers and [{cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}, {cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03b}{5DMctAzETmS90Rkz0EWRqg}{TZCHaT2DQTuO_tzssL4dHw}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 16, last-accepted version 31922 in term 16
[2020-01-08T11:42:23,473][WARN ][o.e.c.c.ClusterFormationFailureHelper] [cyres-elas03a] master not discovered or elected yet, an election requires at least 2 nodes with ids from [GmYLyNdCT-OS3-vHYFhIdg, icISibvhTd-hGnJm9_TH6A, 5DMctAzETmS90Rkz0EWRqg], have discovered [{cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [172.31.27.9:9300, 172.31.27.10:9300] from hosts providers and [{cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}, {cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03b}{5DMctAzETmS90Rkz0EWRqg}{TZCHaT2DQTuO_tzssL4dHw}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 16, last-accepted version 31922 in term 16
[2020-01-08T11:42:33,474][WARN ][o.e.c.c.ClusterFormationFailureHelper] [cyres-elas03a] master not discovered or elected yet, an election requires at least 2 nodes with ids from [GmYLyNdCT-OS3-vHYFhIdg, icISibvhTd-hGnJm9_TH6A, 5DMctAzETmS90Rkz0EWRqg], have discovered [{cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}] which is a quorum; discovery will continue using [172.31.27.9:9300, 172.31.27.10:9300] from hosts providers and [{cyres-elas03c}{GmYLyNdCT-OS3-vHYFhIdg}{9TWZOlkvRIu6h1dJoHyXKQ}{172.31.27.10}{172.31.27.10:9300}{dilm}{ml.machine_memory=33717133312, ml.max_open_jobs=20, xpack.installed=true}, {cyres-elas03a}{icISibvhTd-hGnJm9_TH6A}{Z4xeN5cYSTCGAth1A382og}{172.31.27.8}{172.31.27.8:9300}{dilm}{ml.machine_memory=25280589824, xpack.installed=true, ml.max_open_jobs=20}, {cyres-elas03b}{5DMctAzETmS90Rkz0EWRqg}{TZCHaT2DQTuO_tzssL4dHw}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 16, last-accepted version 31922 in term 16

Please could you share all the logs from both remaining nodes? Use https://gist.github.com since they will be too large to share in full here.

Sorry, the logs are lengthy :confused:
I formatted them a bit to expose the issue in the first lines. 11:38:35 is the time when I shut down the Elasticsearch service on node 03b. Before this, there was no issue whatsoever.

Logs from node a

Logs from node c

Thanks. The vital clue is here:

[2020-01-08T11:39:33,488][WARN ][o.e.g.IncrementalClusterStateWriter] [cyres-elas03c] writing cluster state took [57674ms] which is above the warn threshold of [10s]; wrote metadata for [693] indices and skipped [0] unchanged indices

It is taking almost a minute to write out the cluster metadata after the master is elected, which the master considers to be unreasonably slow, so it considers itself faulty and stands down. It's kinda right; that's a long time to write out a fairly small amount of data, but with ~700 indices it's not completely unreasonable if your disks aren't that quick.

Can you move to faster disks? If not, I suggest lengthening these timeouts to account for your environment:

cluster.publish.timeout: 90s
cluster.join.timeout: 90s

Work is in progress to trim down the sensitivity to slow disks in https://github.com/elastic/elasticsearch/issues/48701, hopefully to be released soon.

Thank you for the clear explanation. I did indeed see this error a while ago when I first encountered this issue, but I didn't think it was the actual cause.

That's weird, because I was on a RAID 6 with 12 drives, so I shouldn't have been getting low write and read speeds.
Now I have 20 drives per server, each configured as a RAID 0 (I couldn't set them up as plain hardware disks due to the physical limitations of my environment), so I shouldn't be getting write and read speeds as slow as before.

I'm still going to add these 2 options to my elasticsearch.yml; I still have one more node to move after this.

Thanks again, I won't make this mistake again in the future :smiley:

I will mark your answer as the solution after I've moved my other node, just to be sure ^^

That worked out perfectly! Thank you for your help :smiley:

NVM, I'm still getting issues switching my master role.
Again today, after shutting down the master, the role never got transferred and I'm left facing a non-working cluster.

I'm not able to pinpoint the exact issue right now, and my cluster is not starting anymore.

The nodes can't see each other and the cluster never forms.

Even after setting the timeout option to 90s, there must be an issue with the services communicating between them. I've disabled the firewall on both VMs, and they are running on 2 separate servers with dual Xeons and 60 GB of RAM (32 GB allocated to ES), with 24 drives, each of which is mounted separately on both VMs. I don't think it's a hardware issue; it's more likely a configuration or service issue. I have no filtering between the two VMs.
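
To rule out a plain network problem, I can at least check that the transport port (9300) is reachable between the nodes, for example from node 03a towards the others (skipping whichever one I've intentionally stopped):

# IPs are the other two nodes from my logs above
nc -zv 172.31.27.9 9300
nc -zv 172.31.27.10 9300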

Are you still seeing warnings from o.e.g.IncrementalClusterStateWriter about slow persistence? What do they say now?

Why do you think this is a communication issue? Can you provide log messages to support that?

I'm still seeing the same issue as before, even after setting the options you gave me above:

[2020-01-20T09:24:17,211][INFO ][o.e.c.c.Coordinator      ] [cyres-elas03a] master node [{cyres-elas03b}{7LBB8nyDSyeUTi_pS_GSmw}{winKz2iDSfqaTWteQLjU1w}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery
org.elasticsearch.transport.NodeDisconnectedException: [cyres-elas03b][172.31.27.9:9300][disconnected] disconnected
[2020-01-20T09:24:17,217][INFO ][o.e.c.s.ClusterApplierService] [cyres-elas03a] master node changed {previous [{cyres-elas03b}{7LBB8nyDSyeUTi_pS_GSmw}{winKz2iDSfqaTWteQLjU1w}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 27, version: 46674, reason: becoming candidate: onLeaderFailure
[2020-01-20T09:24:17,439][WARN ][o.e.c.NodeConnectionsService] [cyres-elas03a] failed to connect to {cyres-elas03b}{7LBB8nyDSyeUTi_pS_GSmw}{winKz2iDSfqaTWteQLjU1w}{172.31.27.9}{172.31.27.9:9300}{dilm}{ml.machine_memory=25280589824, ml.max_open_jobs=20, xpack.installed=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [cyres-elas03b][172.31.27.9:9300] connect_exception
[2020-01-20T09:26:26,812][WARN ][o.e.g.IncrementalClusterStateWriter] [cyres-elas03a] writing cluster state took [128489ms] which is above the warn threshold of [10s]; wrote metadata for [1149] indices and skipped [0] unchanged indices

And after that I'm getting the usual "Connection refused / NO_MASTER_AVAILABLE".

The lines I added to my config file (I even allowed more time than suggested for the cluster to switch the role):

cluster.publish.timeout: 180s
cluster.join.timeout: 180s

I see that there is another threshold set to 10s, but it doesn't seem to be affected by the options above.

Your disks are still dreadfully slow, and now you have twice as many indices as before so there's twice as much metadata to write out.

On reflection it's possible that an election might need to allow 2x the time to write the state out since the election might start while a write is already in progress; given that it's currently taking 130s to write the state I think you should set the timeouts to double that plus a generous margin. Try 300s for instance.

I will have to look at the disks' write speed. Given the context, I've arranged my drives this way:

  • Out of the 24, I've set up a RAID 5 on 4 disks for the VM drives.
  • With the 20 remaining, each disk has its own RAID 0, so the Elasticsearch VM has 20 drives mounted separately.
    This way, the main disk of the VM runs on the RAID 5 and the data disks are each in a separate RAID 0.

In theory, I shouldn't be getting slow write and read speeds on the RAID 5. But you never know, maybe there is an issue with the servers I'm using. I will keep track of this...
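
To put an actual number on it, I'll start with a crude sequential write test on one of the data mounts (not a proper benchmark, just a sanity check; the path is from my setup):

# write 1 GiB with synchronous writes, then clean up the test file
dd if=/dev/zero of=/data/data1/dd-test bs=1M count=1024 oflag=dsync
rm /data/data1/dd-test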

Do you mean that you have multiple entries in each node's path.data config? If so, how many?

Yes, I have 20 drives in my Elasticsearch config, like this:

path.data : /data/data1,/data/data2,/data/data3,/data/data4,/data/data5,/data/data6,/data/data7,/data/data8,/data/data9,/data/data10,/data/data11,/data/data12,/data/data13,/data/data14,/data/data15,/data/data16,/data/data17,/data/data18,/data/data19,/data/data20

Ok I think that'd explain it. Master-eligible nodes write the metadata to every single data path, so that's 20x the writes on each cluster state update. I recommend only having a single data path on each master-eligible node. There are a couple of ways you could do that: either add three dedicated master nodes, each with a single data path, or combine these drives together. The three dedicated master nodes can all run on the same hosts.

Combining the drives is a good idea anyway since Elasticsearch doesn't have a mechanism to balance data between data paths on the same node. Larger clusters work ok with more data paths since we still balance across paths when moving data between nodes, but if you only have three nodes then there won't be enough inter-node movement to keep things balanced.
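
Once you've reconfigured, something like this should confirm which nodes are master-eligible and which one is the elected master (adjust the user and address to your setup):

curl -u elastic 'http://172.31.27.8:9200/_cat/nodes?v&h=ip,name,node.role,master'

Master-eligible nodes have an m in the node.role column, and the elected master is marked with a * in the master column.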

Ok, that explains it clearly.
I think the solution for my infrastructure is to add 2 more master nodes (to have an odd number of master nodes) and remove the master role from both of my data nodes. This way, I won't have this issue any more and I will still have a clean cluster status :slight_smile:
