I am using a 6-node ES cluster with 3 masters and 3 data nodes: one master and one data node on each VM, for a total of 3 machines.
Replication factor = 3, number of shards = 3, and the node-left delay is set to 5m.
When one node goes down, meaning one master and one data node go down, ES goes into a red state with shards in the NODE_LEFT state, and it takes a long time (~10 min) to get back to yellow. Is it expected to take this long?
Later we reduced the number of replicas to 2 to check whether that would reduce the time.
During this time all reads and writes fail, so high availability is lost. Is there a way to promote replicas to primary quickly and make the ES cluster available sooner? Delayed replica assignment could happen in the background; currently replica assignment seems to happen synchronously and the cluster is unresponsive.
If all nodes hold all the data, which seems to be the case as you have 1 primary and 2 replicas of every shard, why have you set the node left delay to 5 minutes? What happens if you instead run the cluster with the default settings?
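For reference, the node-left delay being discussed is the `index.unassigned.node_left.delayed_timeout` index setting. A hedged sketch of resetting it to its 1m default across all indices (assumes a cluster reachable at `localhost:9200`; the `_all` target is illustrative):

```shell
# Reset the node-left delay to the 1m default on every index.
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "1m"
  }
}'
```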
If I run with the default settings it takes a long time to come back to a healthy state. Sometimes an index is left in the NODE_LEFT state and never comes back until a reroute is done, so manual intervention is required.
When I left the node-left delay at the default and brought up another node, the indices were in the state below and the cluster was red:
metrics-2018.11-10 1 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11-10 1 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11-10 1 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.08.11 1 p UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.08.11 1 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.08.11 1 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.08.11 2 p UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.08.11 2 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.08.11 2 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.08.11 0 p UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.08.11 0 r UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.08.11 0 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.09 1 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.09 1 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.09 1 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.09 2 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.09 2 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.09 2 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.09 0 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.09 0 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.09 0 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.01 2 p UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.01 2 r UNASSIGNED CLUSTER_RECOVERED
metrics-2018.11.01 2 r UNASSIGNED CLUSTER_RECOVERED
Now, even though all nodes are available, the cluster does not recover by itself and stays red.
Please guide me on ways to keep ES always available, no matter whether a node goes down and later comes back.
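Since the shards above stayed UNASSIGNED until a manual reroute, one way to nudge allocation along (the `retry_failed` flag and the allocation-explain API exist from ES 5.x onward; on older versions an empty reroute body has a similar effect) might look like:

```shell
# Ask the master to retry shards whose allocation previously failed.
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"

# Inspect why a shard is unassigned (ES 5.x+).
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
```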
This status is seen on the first master node elected after a cluster restart. What is the setting for discovery.zen.minimum_master_nodes in the elasticsearch.yml config file on each node?
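For a cluster with 3 master-eligible nodes, the usual quorum value is a majority, i.e. (3 / 2) + 1 = 2. A sketch of the relevant line, assuming a pre-7.x version where this setting still exists:

```yaml
# elasticsearch.yml on each master-eligible node:
# majority of 3 master-eligible nodes = 2
discovery.zen.minimum_master_nodes: 2
```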
Okay. I also want to ask one more question: what happens when I configure the property below?
index.routing.allocation.total_shards_per_node=1
Suppose I have:
index0 -> pri01 (3 shards), rep01 (3 shards) and rep02 (3 shards)
index1 -> pri11, rep11 and rep12
What will happen? The config says only one shard per node, and I have only 3 nodes but 9 shards for each index. Will only 3 out of the 9 be allocated? Or will 3 shards (1 primary, 1 from rep1 and 1 from rep2) land on each node, so that 3 shards of the same index can be allocated per node?
You almost certainly don't want this setting. If you have three data nodes and every index has one primary and two replicas then every shard will be allocated to every node without this setting. If you set index.routing.allocation.total_shards_per_node to 1 then only the primaries will be allocated, and your cluster health will be yellow.
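If the setting has already been applied, a hedged sketch of clearing it again (setting a dynamic index setting to null restores its default; the index name `index0` is illustrative):

```shell
# Remove the per-node shard cap from an index.
curl -X PUT "localhost:9200/index0/_settings" -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.total_shards_per_node": null
}'
```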
This thread mostly sleeps, and has been hidden from the default hot-threads output since Elasticsearch 1.5. Either you're using a really really old version of Elasticsearch or else you're looking at idle threads. In any case it seems unlikely that it's consuming any appreciable fraction of your CPU.
Ah, ok, client-side. It still seems strange that a sleep() could consume so much CPU, but 300 threads is a lot. I do not have a development environment set up for 2.4, but in more recent versions it looks like there is one of these threads for every client object that has been created and not yet closed. Assuming nothing has changed since 2.4, I guess that your client program is creating a lot of TransportClient instances and not closing them when it's done. These things are quite expensive to create, are expected to last a long time, and can be shared across threads, so could you try making fewer of them and/or cleaning them up properly when they're no longer needed?