High availability of ES

Hi

I am using a 6-node ES cluster with 3 master nodes and 3 data nodes: one master and one data node on each VM, across 3 machines in total.
Replication factor = 3, number of shards = 3, and the node-left delay is set to 5m.
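
For reference, the node-left delay I mention is the per-index delayed-allocation setting, applied roughly like this (localhost stands in for our actual host):

curl -XPUT 'localhost:9200/_all/_settings' -d '
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}'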

When one machine goes down (meaning one master and one data node go down), ES goes to a red state with shards in the NODE_LEFT state, and it takes a long time (~10 min) to come back to yellow. Is it expected to take this much time?

Later we reduced the number of replicas to 2 to check whether that would reduce the time.

During this time all reads and writes fail, so high availability is lost. Is there a way to promote replicas to primaries quickly and make the ES cluster available again sooner? Delayed replica assignment could happen in the background; currently it seems replica assignment happens synchronously and the cluster is unresponsive.

If all nodes hold all the data, which seems to be the case as you have 1 primary and 2 replicas of every shard, why have you set the node left delay to 5 minutes? What happens if you instead run the cluster with the default settings?

If I run with the default settings it takes a long time to come back to a healthy state. Sometimes an index is left in the NODE_LEFT state and never comes back until a reroute is done, so manual intervention is required.
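
The manual step I run is an empty reroute call, roughly like this (localhost stands in for our actual host):

curl -XPOST 'localhost:9200/_cluster/reroute'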

What is the output of the cluster health API?
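
For example (adjust the host for your setup):

curl 'localhost:9200/_cluster/health?pretty'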

When I left the node-left delay at its default and brought another node back up, the indices were in the state below and the cluster was red:

metrics-2018.11-10 1 p UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11-10 1 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11-10 1 r UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.08.11 1 p UNASSIGNED CLUSTER_RECOVERED
logs-2018.11.08.11 1 r UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.08.11 1 r UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.08.11 2 p UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.08.11 2 r UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.08.11 2 r UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.08.11 0 p UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.08.11 0 r UNASSIGNED CLUSTER_RECOVERED 
logs-2018.11.08.11 0 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.09 1 p UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.09 1 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.09 1 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.09 2 p UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.09 2 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.09 2 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.09 0 p UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.09 0 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.09 0 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.01 2 p UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.01 2 r UNASSIGNED CLUSTER_RECOVERED 
metrics-2018.11.01 2 r UNASSIGNED CLUSTER_RECOVERED

Now, even though all nodes are available, the cluster does not recover by itself and stays red.

Please guide me on ways to keep ES always available, even when a node goes down and later comes back.

{
  "cluster_name" : "metrics-datastore",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 10,
  "active_shards" : 30,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 24,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 55.55555555555556
}

This is the state after restarts happened.

It remains the same when one node goes down as well. It also takes a long time for the number of nodes to be updated to 4 after one node goes down.

This status is seen on the first master node elected after a cluster restart. What is the setting for discovery.zen.minimum_master_nodes in the elasticsearch.yml config file on each node?

discovery.zen.minimum_master_nodes: 1

That would explain it. It should be 2 if you have 3 master-eligible nodes. Your cluster is suffering from split-brain.
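
That is, on each of your three master-eligible nodes, elasticsearch.yml should have (the usual rule is (number of master-eligible nodes / 2) + 1, so 2 for 3 masters):

discovery.zen.minimum_master_nodes: 2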

OK, thanks for the guidance.

But another doubt: the cluster state shows that it has 6 nodes, of which 3 are masters and 3 are data nodes.

I don't understand the question. This sounds correct.

Okay. I also want to ask one more question. When I configure the property below:

index.routing.allocation.total_shards_per_node=1

Suppose I have:

index0 -> pri01 - 3 shards, rep01 - 3 shards, and rep02 - 3 shards
index1 -> pri11, rep11, and rep12

What will happen? The configuration says only one shard per node, and I have only 3 nodes but 9 shards for each index. Will only 3 out of 9 be allocated? Or will 3 (1 pri, 1 rep1, and 1 rep2) go on one node, so that 3 shards of the same index can be allocated per node?

You almost certainly don't want this setting. If you have three data nodes and every index has one primary and two replicas then every shard will be allocated to every node without this setting. If you set index.routing.allocation.total_shards_per_node to 1 then only the primaries will be allocated, and your cluster health will be yellow.
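
For reference, if you ever do need the limit: with 3 data nodes and 9 shard copies per index (3 primaries + 6 replicas), it would have to be at least 3. It is a per-index dynamic setting, something like (host and index name just as an example):

curl -XPUT 'localhost:9200/logs-2018.11.08.11/_settings' -d '
{
  "index.routing.allocation.total_shards_per_node": 3
}'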

OK thanks, understood.

I am seeing one more issue.

I use the Java client to push logs to Elasticsearch.

When I profiled to find the cause of the very high load average, I saw the call below consuming 86% of the CPU:

org.elasticsearch.threadpool.ThreadPool$EstimatedTimeThread.run() ThreadPool.java:747 1521711

What is it and how can I avoid this?

This thread mostly sleeps, and has been hidden from the default hot-threads output since Elasticsearch 1.5. Either you're using a really really old version of Elasticsearch or else you're looking at idle threads. In any case it seems unlikely that it's consuming any appreciable fraction of your CPU.
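
If you want to see what the server-side threads are actually busy with, the hot threads API is the place to look (host is a placeholder):

curl 'localhost:9200/_nodes/hot_threads'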

I use ES 2.4. It's on the client side; from profiling I see this sleep among the top consumers of CPU. There are ~300 such threads.

Ah, ok, client-side. It still seems strange that a sleep() could consume so much CPU, but 300 threads is a lot. I do not have a development environment set up for 2.4, but in more recent versions it looks like there's one of these threads for every client object that's been created and hasn't yet been closed. Assuming nothing's changed since 2.4, I guess that your client program is making a lot of TransportClient instances and not closing them when it's done. These things are quite expensive to create, expected to last a long time, and can be shared across threads, so could you try making fewer of them and/or cleaning them up properly when they're no longer needed?
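
Something along these lines for the 2.x transport client — a minimal sketch, assuming a single shared client (the host and port are placeholders; the cluster name is taken from your health output above):

import java.net.InetAddress;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class EsClientHolder {

    // One long-lived client shared by all application threads.
    private static final TransportClient CLIENT = buildClient();

    private static TransportClient buildClient() {
        try {
            Settings settings = Settings.settingsBuilder()
                    .put("cluster.name", "metrics-datastore")
                    .build();
            return TransportClient.builder()
                    .settings(settings)
                    .build()
                    .addTransportAddress(new InetSocketTransportAddress(
                            InetAddress.getByName("127.0.0.1"), 9300)); // placeholder host/port
        } catch (Exception e) {
            throw new RuntimeException("Failed to build Elasticsearch client", e);
        }
    }

    public static TransportClient client() {
        return CLIENT;
    }

    // Call once on application shutdown so the client's threads are released.
    public static void shutdown() {
        CLIENT.close();
    }
}

Everything that talks to the cluster would then call EsClientHolder.client() instead of building its own TransportClient, and shutdown() would be called once when the application exits.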

How can this happen in a one-node setup?

[root@metrics-datastore-0 esutilities]# sh check_cluster.sh
{
  "cluster_name" : "metrics-datastore",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 4,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 66.66666666666666
}
curl 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
metrics-2018.11-10 1 p UNASSIGNED NODE_LEFT
metrics-2018.11.09 1 p UNASSIGNED NODE_LEFT

It cannot.