My cluster is always yellow

Hi,
version: 5.1.1
My cluster is always yellow, with a lot of unassigned shards. The logs show a lot of "master left" messages.

[2018-02-06T05:16:04,873][WARN ][o.e.i.c.IndicesClusterStateService] [node-15] [[xxx_log_201802][150]] marking and sending shard failed due to [shard failure, reason [primary shard [[xxx_log_201802][150], node[llX2oMHOT6aMFTpfjvXilg], [P], s[STARTED], a[id=GnaqByImR5y06I6bn5G0MQ]] was demoted while failing replica shard]]
org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [10] did not match current primary term [11]
at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:280) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:581) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:920) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
[2018-02-06T05:16:05,556][INFO ][o.e.d.z.ZenDiscovery ] [node-15] master_left [{node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-02-06T05:16:05,557][WARN ][o.e.d.z.ZenDiscovery ] [node-15] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{node-9}{UsskhSP7StKKpuD_GcAcGw}{MYFs7ktOQ5KZsuaFpjW_0g}{10.4.71.67}{10.4.71.67:9300}
{node-10}{MaoCFIexSneolK-a9DTtgQ}{DdsgyhqbQ7ipE4rlV2LHyQ}{10.4.71.68}{10.4.71.68:9300}
{node-4}{NPO_-oC7RGikhqcmwc6JxA}{wGKtE55VT8uhp1mI5QaOlA}{10.4.71.33}{10.4.71.33:9300}
{node-5}{qUDCyodwSq63x3Z3s5LlUw}{StzL-uBsTIug6gWmhzB7qA}{10.4.71.34}{10.4.71.34:9300}
{node-8}{lTyb_Y1CQRapvztkK-Uz1g}{wLHuYpgDQMaHisDWWslO_Q}{10.4.71.66}{10.4.71.66:9300}
{node-3}{KrYOYlfmRC6yS0uvNIBSHA}{X-glH3VmT-2c9_XMQsTtpw}{10.4.71.32}{10.4.71.32:9300}
{node-7}{ABmukxbzSQGgLJ29DBUnjA}{ppo9n1vgSw-LYq-0BO_Dmw}{10.4.71.36}{10.4.71.36:9300}
{node-11}{oSYHx9hgTZqNoI0gIqO6Rw}{BJa9EDLpSje4865hn0bXlw}{10.4.71.69}{10.4.71.69:9300}
{node-6}{T7J4E5H4RqmM2DpJl4J3bA}{22uVp89gTvewPzhR6buhAA}{10.4.71.35}{10.4.71.35:9300}
{node-14}{dPEClHXZR2iFpaiLw-I-nQ}{jkBVFbbiRs2d48jjoDAdAQ}{10.4.71.72}{10.4.71.72:9300}
{node-13}{w82I1FcSQm6bqZAypk1P-g}{JZEc57UTQ6q90SsgV9IJRg}{10.4.71.71}{10.4.71.71:9300}
{node-15}{llX2oMHOT6aMFTpfjvXilg}{VXe3JphuT6aCbbAVRoB9CQ}{10.4.71.73}{10.4.71.73:9300}, local
{node-12}{VnLoFQVhTragjMlTo_3TzA}{aAhZplHLSISNczETVnimPw}{10.4.71.70}{10.4.71.70:9300}
{node-2}{qdhLZ9OPREiXE7L-G1GWlg}{QtCiGLrHS5eVuSYaIyhiGw}{10.4.71.31}{10.4.71.31:9300}

[2018-02-06T05:16:05,557][INFO ][o.e.c.s.ClusterService ] [node-15] removed {{node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300},}, reason: master_failed ({node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300})
[2018-02-06T05:16:08,830][INFO ][o.e.c.s.ClusterService ] [node-15] detected_master {node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300}, added {{node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300},}, reason: zen-disco-receive(from master [master {node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300} committed version [77351]])
[2018-02-06T05:16:58,138][INFO ][o.e.m.j.JvmGcMonitorService] [node-15] [gc][1861024] overhead, spent [433ms] collecting in the last [1s]

My cluster health:
{
"cluster_name": "xxxxxxx",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 15,
"number_of_data_nodes": 15,
"active_primary_shards": 1465,
"active_shards": 2524,
"relocating_shards": 0,
"initializing_shards": 5,
"unassigned_shards": 164,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 93.72447085035277
}

My config:
cluster.name: xxxxxx
node.name: node-15
path.data: /data1/elasticsearch/data
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["10.4.71.30", "10.4.71.31", "10.4.71.32", "10.4.71.33", "10.4.71.34", "10.4.71.35", "10.4.71.36", "10.4.71.66", "10.4.71.67", "10.4.71.68", "10.4.71.69", "10.4.71.70","10.4.71.71","10.4.71.72","10.4.71.73"]
reindex.remote.whitelist: ["10.5.24.139:9200"]
node.master: true
node.data: true

bootstrap.memory_lock: true

Please don't post pictures of text; they are difficult to read, and some people may not even be able to see them. :slight_smile:

Oh, sorry. I cannot copy the text out easily right now. I will post the text soon.

How much heap do you have assigned to the nodes? How many of your nodes are master eligible? Are you using the default value for discovery.zen.minimum_master_nodes (it should be set as described here)?

How much heap do you have assigned to the nodes?

30GB heap size on each node. I set:
-Xms30g
-Xmx30g
in jvm.options.

How many of your nodes are master eligible?

Every node can be master. I applied the config shown in my question to all nodes; only node.name differs.

Are you using the default value for the discovery.zen.minimum_master_nodes

Yes, I have not set discovery.zen.minimum_master_nodes.

That is not good. Not good at all. You need to set this to the correct value as you can otherwise suffer from network partitions and data loss.
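
For reference, with 15 master-eligible nodes the quorum is (15 / 2) + 1 = 8, so each node's elasticsearch.yml would need roughly this (a sketch; recalculate if the number of master-eligible nodes ever changes):

# quorum of master-eligible nodes: (15 / 2) + 1 = 8
discovery.zen.minimum_master_nodes: 8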

OK, I will do that immediately. Thank you~~~~ :grinning:

You may already have network partitions and inconsistencies within your cluster, so you could potentially run into conflicts and lose data when fixing this.

That doesn't matter.
However, I have now set discovery.zen.minimum_master_nodes: 8 (I have 15 nodes), and the "master left" problem is still happening.
Within 30 minutes, 3 nodes logged "master left" and lots of unassigned shards appeared.
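
I applied it in elasticsearch.yml and restarted each node. As far as I understand, the setting is also dynamic, so it could be pushed to a running cluster with something like this (host is a placeholder for any node in the cluster):

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 8
  }
}'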

If your cluster is under reasonably heavy load and suffering from long GC, you are probably better off introducing 3 smaller, dedicated master nodes. That will provide better stability and make it easier to scale out the cluster, since minimum_master_nodes will not need to be adjusted.
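
As a rough sketch (exact values depend on your setup), each of the 3 dedicated master nodes would carry something like this in elasticsearch.yml, while the existing 15 nodes would set node.master: false and keep node.data: true:

# dedicated master-eligible node: coordinates the cluster but holds no data
node.master: true
node.data: false
node.ingest: false

# with 3 dedicated master-eligible nodes, quorum is (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2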

OK, I will follow your suggestion. Thanks a lot. :grinning:

I found the root cause: ES 5.1.1: Cluster loses a node randomly every few hours. Error: Message not fully read (response) for requestId

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.