My cluster is always yellow


(Sekaiga) #1

Hi,
Version: 5.1.1
My cluster is always yellow, with a lot of unassigned shards. The logs show a lot of "master left" messages.

[2018-02-06T05:16:04,873][WARN ][o.e.i.c.IndicesClusterStateService] [node-15] [[xxx_log_201802][150]] marking and sending shard failed due to [shard failure, reason [primary shard [[xxx_log_201802][150], node[llX2oMHOT6aMFTpfjvXilg], [P], s[STARTED], a[id=GnaqByImR5y06I6bn5G0MQ]] was demoted while failing replica shard]]
org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [10] did not match current primary term [11]
at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:280) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:581) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:920) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
[2018-02-06T05:16:05,556][INFO ][o.e.d.z.ZenDiscovery ] [node-15] master_left [{node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-02-06T05:16:05,557][WARN ][o.e.d.z.ZenDiscovery ] [node-15] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{node-9}{UsskhSP7StKKpuD_GcAcGw}{MYFs7ktOQ5KZsuaFpjW_0g}{10.4.71.67}{10.4.71.67:9300}
{node-10}{MaoCFIexSneolK-a9DTtgQ}{DdsgyhqbQ7ipE4rlV2LHyQ}{10.4.71.68}{10.4.71.68:9300}
{node-4}{NPO_-oC7RGikhqcmwc6JxA}{wGKtE55VT8uhp1mI5QaOlA}{10.4.71.33}{10.4.71.33:9300}
{node-5}{qUDCyodwSq63x3Z3s5LlUw}{StzL-uBsTIug6gWmhzB7qA}{10.4.71.34}{10.4.71.34:9300}
{node-8}{lTyb_Y1CQRapvztkK-Uz1g}{wLHuYpgDQMaHisDWWslO_Q}{10.4.71.66}{10.4.71.66:9300}
{node-3}{KrYOYlfmRC6yS0uvNIBSHA}{X-glH3VmT-2c9_XMQsTtpw}{10.4.71.32}{10.4.71.32:9300}
{node-7}{ABmukxbzSQGgLJ29DBUnjA}{ppo9n1vgSw-LYq-0BO_Dmw}{10.4.71.36}{10.4.71.36:9300}
{node-11}{oSYHx9hgTZqNoI0gIqO6Rw}{BJa9EDLpSje4865hn0bXlw}{10.4.71.69}{10.4.71.69:9300}
{node-6}{T7J4E5H4RqmM2DpJl4J3bA}{22uVp89gTvewPzhR6buhAA}{10.4.71.35}{10.4.71.35:9300}
{node-14}{dPEClHXZR2iFpaiLw-I-nQ}{jkBVFbbiRs2d48jjoDAdAQ}{10.4.71.72}{10.4.71.72:9300}
{node-13}{w82I1FcSQm6bqZAypk1P-g}{JZEc57UTQ6q90SsgV9IJRg}{10.4.71.71}{10.4.71.71:9300}
{node-15}{llX2oMHOT6aMFTpfjvXilg}{VXe3JphuT6aCbbAVRoB9CQ}{10.4.71.73}{10.4.71.73:9300}, local
{node-12}{VnLoFQVhTragjMlTo_3TzA}{aAhZplHLSISNczETVnimPw}{10.4.71.70}{10.4.71.70:9300}
{node-2}{qdhLZ9OPREiXE7L-G1GWlg}{QtCiGLrHS5eVuSYaIyhiGw}{10.4.71.31}{10.4.71.31:9300}

[2018-02-06T05:16:05,557][INFO ][o.e.c.s.ClusterService ] [node-15] removed {{node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300},}, reason: master_failed ({node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300})
[2018-02-06T05:16:08,830][INFO ][o.e.c.s.ClusterService ] [node-15] detected_master {node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300}, added {{node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300},}, reason: zen-disco-receive(from master [master {node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{10.4.71.30}{10.4.71.30:9300} committed version [77351]])
[2018-02-06T05:16:58,138][INFO ][o.e.m.j.JvmGcMonitorService] [node-15] [gc][1861024] overhead, spent [433ms] collecting in the last [1s]

My cluster health:
{
"cluster_name": "xxxxxxx",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 15,
"number_of_data_nodes": 15,
"active_primary_shards": 1465,
"active_shards": 2524,
"relocating_shards": 0,
"initializing_shards": 5,
"unassigned_shards": 164,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 93.72447085035277
}

My config:
cluster.name: xxxxxx
node.name: node-15
path.data: /data1/elasticsearch/data
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["10.4.71.30", "10.4.71.31", "10.4.71.32", "10.4.71.33", "10.4.71.34", "10.4.71.35", "10.4.71.36", "10.4.71.66", "10.4.71.67", "10.4.71.68", "10.4.71.69", "10.4.71.70","10.4.71.71","10.4.71.72","10.4.71.73"]
reindex.remote.whitelist: ["10.5.24.139:9200"]
node.master: true
node.data: true

bootstrap.memory_lock: true


(Mark Walkom) #2

Please don't post pictures of text; they are difficult to read and some people may not even be able to see them :slight_smile:


(Sekaiga) #3

Oh, sorry. It's not easy for me to copy the text right now; I will post it as text soon.


(Christian Dahlqvist) #4

How much heap do you have assigned to the nodes? How many of your nodes are master eligible? Are you using the default value for discovery.zen.minimum_master_nodes (it should be set as described here)?


(Sekaiga) #5

How much heap do you have assigned to the nodes?

A 30GB heap on each node. I set
-Xms30g
-Xmx30g
in jvm.options

How many of your nodes are master eligible?

Every node is master eligible. I applied the config shown in my question to all nodes; only node.name differs.

Are you using the default value for discovery.zen.minimum_master_nodes

Yes, I have not set discovery.zen.minimum_master_nodes.


(Christian Dahlqvist) #6

That is not good. Not good at all. You need to set this to the correct value, as otherwise a network partition can lead to a split brain and data loss.
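For reference, with 15 master-eligible nodes the quorum is (15 / 2) + 1 = 8, so a minimal sketch of the addition to every node's elasticsearch.yml (assuming all 15 nodes stay master eligible) would be:

# quorum of master-eligible nodes: (15 / 2) + 1 = 8
discovery.zen.minimum_master_nodes: 8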


(Sekaiga) #7

OK, I will do that immediately. Thank you! :grinning:


(Christian Dahlqvist) #8

You may already have network partitions and inconsistencies within your cluster, so you could potentially run into conflicts and lose data when fixing this.


(Sekaiga) #9

That's fine. However, I have now set discovery.zen.minimum_master_nodes: 8 (I have 15 nodes), and the "master left" problem is still happening.
Within 30 minutes, 3 nodes logged "master left" and lots of unassigned shards appeared.


(Christian Dahlqvist) #10

If your cluster is under reasonably heavy load and suffering from long GC pauses, you are probably better off introducing 3 smaller, dedicated master nodes. That will provide better stability and make it easier to scale out the cluster, since minimum_master_nodes will not need to be adjusted.
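A minimal sketch of that layout, assuming 3 new dedicated master-only nodes are added and the existing 15 nodes become data-only:

# elasticsearch.yml on the 3 dedicated master-only nodes
node.master: true
node.data: false

# elasticsearch.yml on the 15 data-only nodes
node.master: false
node.data: true

# on every node: quorum of the 3 master-eligible nodes is (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2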


(Sekaiga) #11

OK, I will follow your suggestion. Thanks a lot! :grinning:


(Sekaiga) #12

I found the root cause: ES 5.1.1: Cluster loses a node randomly every few hours. Error: Message not fully read (response) for requestId


(system) #13

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.