My cluster is always yellow

(Sekaiga) #1

version: 5.1.1
My cluster is always yellow, with a lot of unassigned shards. The logs show a lot of "master left" messages:

[2018-02-06T05:16:04,873][WARN ][o.e.i.c.IndicesClusterStateService] [node-15] [[xxx_log_201802][150]] marking and sending shard failed due to [shard failure, reason [primary shard [[xxx_log_201802][150], node[llX2oMHOT6aMFTpfjvXilg], [P], s[STARTED], a[id=GnaqByImR5y06I6bn5G0MQ]] was demoted while failing replica shard]]
org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [10] did not match current primary term [11]
at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.cluster.service.ClusterService$ ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean( ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$ ~[elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker( [?:1.8.0_92]
at java.util.concurrent.ThreadPoolExecutor$ [?:1.8.0_92]
at [?:1.8.0_92]
[2018-02-06T05:16:05,556][INFO ][o.e.d.z.ZenDiscovery ] [node-15] master_left [{node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{}{}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-02-06T05:16:05,557][WARN ][o.e.d.z.ZenDiscovery ] [node-15] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{node-15}{llX2oMHOT6aMFTpfjvXilg}{VXe3JphuT6aCbbAVRoB9CQ}{}{}, local

[2018-02-06T05:16:05,557][INFO ][o.e.c.s.ClusterService ] [node-15] removed {{node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{}{},}, reason: master_failed ({node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{}{})
[2018-02-06T05:16:08,830][INFO ][o.e.c.s.ClusterService ] [node-15] detected_master {node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{}{}, added {{node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{}{},}, reason: zen-disco-receive(from master [master {node-1}{zzJVtCQDQFaNm_jNx2YrjA}{3vrBZme0R_aCZwqZHuoJDQ}{}{} committed version [77351]])
[2018-02-06T05:16:58,138][INFO ][o.e.m.j.JvmGcMonitorService] [node-15] [gc][1861024] overhead, spent [433ms] collecting in the last [1s]

My cluster health:

{
  "cluster_name": "xxxxxxx",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 15,
  "number_of_data_nodes": 15,
  "active_primary_shards": 1465,
  "active_shards": 2524,
  "relocating_shards": 0,
  "initializing_shards": 5,
  "unassigned_shards": 164,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 93.72447085035277
}

My config (elasticsearch.yml):

cluster.name: xxxxxx
node.name: node-15
path.data: /data1/elasticsearch/data
discovery.zen.ping.unicast.hosts: ["", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
reindex.remote.whitelist: [""]
node.master: true
bootstrap.memory_lock: true
(Mark Walkom) #2

Please don't post pictures of text; they are difficult to read, and some people may not even be able to see them :slight_smile:

(Sekaiga) #3

Oh, sorry. I cannot easily copy the text; I will post it as text soon.

(Christian Dahlqvist) #4

How much heap do you have assigned to the nodes? How many of your nodes are master-eligible? Are you using the default value for discovery.zen.minimum_master_nodes (it should be set as described here)?

(Sekaiga) #5

How much heap do you have assigned to the nodes?

30GB heap on each node. I set it in jvm.options.

How many of your nodes are master eligible?

Every node can be master. I applied the config shown in my question to all nodes; only the node name differs.

Are you using the default value for the discovery.zen.minimum_master_nodes

Yes, I have not set discovery.zen.minimum_master_nodes.

(Christian Dahlqvist) #6

That is not good. Not good at all. You need to set this to the correct value as you can otherwise suffer from network partitions and data loss.
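
For reference, the recommended value is a quorum of the master-eligible nodes: (master_eligible_nodes / 2) + 1, rounded down. A minimal elasticsearch.yml sketch for this cluster, assuming all 15 nodes remain master-eligible:

```yaml
# Quorum of master-eligible nodes: floor(15 / 2) + 1 = 8.
# With a quorum set, neither side of a network partition with fewer
# than 8 master-eligible nodes can elect its own master (split brain).
discovery.zen.minimum_master_nodes: 8
```

Note this value must be updated whenever the number of master-eligible nodes changes.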

(Sekaiga) #7

OK, I will do that immediately. Thank you! :grinning:

(Christian Dahlqvist) #8

You may already have network partitions and inconsistencies within your cluster, so you could potentially see conflicts and lose data when fixing this.

(Sekaiga) #9

That doesn't matter.
But I have now set discovery.zen.minimum_master_nodes: 8 (I have 15 nodes), and the "master left" problem is still happening.
Within 30 minutes, 3 nodes logged "master left" and lots of unassigned shards appeared.

(Christian Dahlqvist) #10

If your cluster is under reasonably heavy load and suffering from long GC, you are probably better off introducing 3 smaller, dedicated master nodes as that will provide better stability and make it easier to scale out the cluster as minimum_master_nodes will not need to be adjusted.
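
A sketch of what the suggested role split could look like (hypothetical elasticsearch.yml fragments; with 3 dedicated master-eligible nodes, the quorum is fixed at 2 regardless of how many data nodes are added later):

```yaml
# On each of the 3 dedicated master nodes: manage cluster state only,
# so heavy indexing/query load and its GC pauses cannot stall the master.
node.master: true
node.data: false
node.ingest: false

# On every data node, the inverse: hold data, never stand for election.
#   node.master: false
#   node.data: true

# Quorum over the 3 master-eligible nodes: floor(3 / 2) + 1 = 2.
discovery.zen.minimum_master_nodes: 2
```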

(Sekaiga) #11

OK, I will follow your suggestion. Thanks a lot! :grinning:

(Sekaiga) #12

I found the root cause: ES 5.1.1: Cluster loses a node randomly every few hours. Error: Message not fully read (response) for requestId

(system) #13

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
