ES stuck in red state even though all nodes are in the cluster: "ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]"

Our 231-node cloud ES cluster is stuck in a "red" state.
Cluster health:

{
  "cluster_name" : "exabeam-es",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 231,
  "number_of_data_nodes" : 230,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : "NaN"
}

Here is our configuration:

discovery.zen.ping.unicast.hosts: [/* 185 hosts in the list */]
discovery.zen.minimum_master_nodes:   "93"
discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: "60s"
transport.tcp.connect_timeout: "60s"

We are getting this exception:

[2018-11-29T21:30:31,603][WARN ][r.suppressed             ] path: /index-migrations, params: {index=index-migrations}
org.elasticsearch.transport.RemoteTransportException: [host46-2][10.50.61.136:9300][indices:admin/create]
Caused by: org.elasticsearch.discovery.MasterNotDiscoveredException: ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:209) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:311) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:238) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.service.ClusterService$NotifyTimeout.run(ClusterService.java:1057) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
Caused by: org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
at org.elasticsearch.cluster.block.ClusterBlocks.indexBlockedException(ClusterBlocks.java:182) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.admin.indices.create.TransportCreateIndexAction.checkBlock(TransportCreateIndexAction.java:64) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.admin.indices.create.TransportCreateIndexAction.checkBlock(TransportCreateIndexAction.java:39) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:134) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.start(TransportMasterNodeAction.java:126) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:104) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:54) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:64) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:54) ~[elasticsearch-5.4.0.jar:5.4.0]
at com.floragunn.searchguard.ssl.transport.SearchGuardSSLRequestHandler.messageReceivedDecorate(SearchGuardSSLRequestHandler.java:177) ~[?:?]
at com.floragunn.searchguard.ssl.transport.SearchGuardSSLRequestHandler.messageReceived(SearchGuardSSLRequestHandler.java:139) ~[?:?]

Here is an excerpt from 'https://localhost:9200/_cluster/state'

  "cluster_name" : "exabeam-es",
  "version" : 17,
  "state_uuid" : "XJim7LIaRxiTWw-9vE1ItQ",
  "master_node" : "-YfA-UIpQ4yVXcFjW0F7YQ",
  "blocks" : {
    "global" : {
      "1" : {
        "description" : "state not recovered / initialized",
        "retryable" : true,
        "disable_state_persistence" : true,
        "levels" : [
          "read",
          "write",
          "metadata_read",
          "metadata_write"
        ]
      }
    }
  },

What version are you on?
Is this just one node or the entire cluster (i.e. do other nodes show the same)?

This is pretty excessive and likely going to cause you heartache. You really only need 3 dedicated, master-eligible nodes. Then you put the details of those 3 into discovery.zen.ping.unicast.hosts.
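
As a rough sketch (the es-master-* hostnames below are placeholders, not your actual hosts), the elasticsearch.yml for that layout would look something like:

# on the 3 dedicated master-eligible nodes
node.master: true
node.data: false

# on every node, list only the dedicated masters for discovery
discovery.zen.ping.unicast.hosts: ["es-master-1", "es-master-2", "es-master-3"]

# quorum for 3 master-eligible nodes
discovery.zen.minimum_master_nodes: 2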

Thanks for responding @warkolm,

Agreed, we would like to tune down the master node count as well. We can pursue that in parallel. However, we would still like to understand the root cause of this issue if possible.

Version information:

{
  "name" : "host1-0",
  "cluster_name" : "exabeam-es",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "5.4.0",
    "build_hash" : "780f8c4",
    "build_date" : "2017-04-28T17:43:27.229Z",
    "build_snapshot" : false,
    "lucene_version" : "6.5.0"
  },
  "tagline" : "You Know, for Search"
}

What's the status of the current master?

Output of the _cat/master API ('https://localhost:9200/_cat/master')

-YfA-UIpQ4yVXcFjW0F7YQ elasticsearch-host46-c 10.50.61.136 host46-2

Output of the _nodes/host46-2 API ('https://localhost:9200/_nodes/host46-2?pretty')

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "exabeam-es",
  "nodes" : {
    "-YfA-UIpQ4yVXcFjW0F7YQ" : {
      "name" : "host46-2",
      "transport_address" : "10.50.61.136:9300",
      "host" : "elasticsearch-host46-c",
      "ip" : "10.50.61.136",
      "version" : "5.4.0",
      "build_hash" : "780f8c4",
      "total_indexing_buffer" : 1181116006,
      "roles" : [
        "master",
        "data"
      ],
      "attributes" : {
        "rack_id" : "host46",
        "box_type" : "warm"
      },
      "settings" : {
        "cluster" : {
          "name" : "exabeam-es",
          "routing" : {
            "allocation" : {
              "awareness" : {
                "attributes" : "rack_id"
              },
              "node_concurrent_recoveries" : "50",
              "node_initial_primaries_recoveries" : "50"
            }
          }
        },
        "node" : {
          "name" : "host46-2",
          "attr" : {
            "rack_id" : "host46",
            "box_type" : "warm"
          },
          "data" : "true",
          "ingest" : "false",
          "master" : "true"
        },
        "path" : {
          "data" : [
            "/opt/exabeam/data/lms/elasticsearch/1",
            "/opt/exabeam/data/lms/elasticsearch/5"
          ],
          "logs" : "/opt/elasticsearch/logs",
          "home" : "/opt/elasticsearch"
        },
        "indices" : {
          "query" : {
            "bool" : {
              "max_clause_count" : "10050"
            }
          }
        },
        "discovery" : {
          "zen" : {
            "minimum_master_nodes" : "93",
            "fd" : {
              "ping_interval" : "5s",
              "ping_timeout" : "60s"
            },
            "ping" : {
              "unicast" : {
                "hosts" : []
              }
            }
          }
        },
        "thread_pool" : {
          "search" : {
            "size" : "2"
          }
        },
        "client" : {
          "type" : "node"
        },
        "http" : {
          "type" : "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyHttpServerTransport"
        },
        "index" : {
          "store" : {
            "type" : "niofs"
          }
        },
        "bootstrap" : {
          "memory_lock" : "true"
        },
        "transport" : {
          "tcp" : {
            "connect_timeout" : "60s"
          },
          "type" : "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport"
        },
        "network" : {
          "host" : "elasticsearch-host46-c",
          "bind_host" : "0.0.0.0"
        }
      },
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "name" : "Linux",
        "arch" : "amd64",
        "version" : "3.10.0-862.9.1.el7.x86_64",
        "available_processors" : 16,
        "allocated_processors" : 16
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 12,
        "mlockall" : true
      },
      "jvm" : {
        "pid" : 12,
        "version" : "1.8.0_91",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.91-b14",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1543461055638,
        "mem" : {
          "heap_init_in_bytes" : 11811160064,
          "heap_max_in_bytes" : 11811160064,
          "non_heap_init_in_bytes" : 2555904,
          "non_heap_max_in_bytes" : 0,
          "direct_max_in_bytes" : 11811160064
        },
        "gc_collectors" : [
          "G1 Young Generation",
          "G1 Old Generation"
        ],
        "memory_pools" : [
          "Code Cache",
          "Metaspace",
          "Compressed Class Space",
          "G1 Eden Space",
          "G1 Survivor Space",
          "G1 Old Gen"
        ],
        "using_compressed_ordinary_object_pointers" : "true"
      }
    }
  }
}

(had to remove some info from the response to fit within the forum's maximum post length)

Interestingly, this API throws an error:

[exabeam@ip-10-20-242-43 ~]$ curl -k 'https://localhost:9200/_cat/shards'
{"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"}],"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"},"status":503}

@warkolm Do you have any documentation by any chance explaining why a high master node count would be detrimental and what the performance tradeoffs are?

@warkolm Any other info I can grab while the system is in this state?

We will probably look into remediation steps soon (such as restarting the elected ES master node).

I'm happy to jump on a call w/ Elastic support/engineering.

Unfortunately we only offer that for customers.

I asked those few questions as quick knowledge gathering, but I don't have time to help further at this stage; maybe I can get back to this later today. Someone else might be able to jump in in the meantime, though.

Any tips on shrinking the number of master nodes in the cluster?

I'm concerned that we could hit a split-brain issue while shrinking the number of master nodes (we technically have only 1 dedicated master node; the rest are data+master nodes. By "shrinking", I mean marking 95% of our data nodes as data-only).

I think you are already susceptible to a split-brain issue:

The number of master-eligible nodes must be strictly less than twice the discovery.zen.minimum_master_nodes setting to avoid split brains, but you have at least 230 master-eligible nodes and 230 ≥ 2 × 93 = 186, so that condition is violated. The number of nodes in discovery.zen.ping.unicast.hosts is irrelevant here - those nodes are only used for discovery; voting rights are determined by the node.master setting on each node.
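
As a rough worked example of the quorum rule (taking ~230 master-eligible nodes as given above):

quorum = floor(master_eligible_nodes / 2) + 1
current cluster:      floor(230 / 2) + 1 = 116   (currently set to 93)
3 dedicated masters:  floor(3 / 2) + 1   = 2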

Managing minimum_master_nodes as a cluster grows or shrinks is a pain, and this pain can be avoided using a small set of dedicated masters, which is one good reason for this architecture recommendation.

Cluster formation happens in roughly two phases: first a master is elected, and second the data nodes join the elected master. I think having so many master-eligible nodes is disrupting the first phase. The election process is somewhat similar to consensus, and like consensus it is provably impossible to do reliably, so the best we can do is design for a high probability of success in common cases. 200+ master-eligible nodes is not a common case, and although I don't have access to 200+ nodes to try it, I can speculate that there are multiple elections all running concurrently, discovering the other inconsistent elections, and ultimately failing.

Additionally I think there are only a limited number of threads available for forming connections between master-eligible nodes, so it's going to be easy to run out when you have so many.

Apart from "there should be 3" the only other tip I can think of now is that you might want to consider using gateway.recover_after_nodes to make the elected master wait for enough of the data nodes to join the cluster before starting to allocate the shards.

With that many nodes, it is also good practice to have dedicated masters, especially if you have lots of shards (primaries and replicas). I would think you would need to shut everything down. Pick 3 servers to be masters; node.master should be set only on those 3 nodes. There can only be one elected master at a given time, and it is a waste of CPU and memory to replicate the cluster state to 90+ servers, have them all participate in elections, and still effectively elect a master during times of patching/rebooting. Start those masters up first, then change the node config on all the rest so they are not master-eligible, as sketched below. Depending on how many shards and replicas you have, you will also need to set the gateway "recover after X nodes" settings; pick something relevant to your setup. If the number is too low, your nodes will constantly be moving and shifting shards around, with lots of disk I/O. Good luck.
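
On the remaining data nodes, the change would be along the lines of:

# data-only nodes no longer take part in master elections
node.master: false
node.data: true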

All nodes hold the cluster state though.

Hello, the minimum_master_nodes setting in an Elasticsearch cluster is calculated like this:
the number of master-eligible nodes divided by two, plus one.

If you don't configure the master nodes correctly, you will face a problem called "split brain".

Thanks for the response and explanation @DavidTurner.

The current plan is to stop all nodes in the cluster, make the config change, then restart all nodes in the cluster.

Will we run into any issues with that plan?

The obvious issue will be that the cluster is unavailable until restarted, but if you can tolerate this then a full cluster restart is normally a good way to proceed. I'll reiterate the advice to set gateway.recover_after_nodes so that the cluster doesn't start recovering before enough of the nodes are back up, and it's also a good idea to follow the whole full cluster restart process: disabling allocation, stopping all indexing traffic, and performing a synced flush beforehand.
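
For reference, a sketch of those pre-restart steps with curl (adjust the URL and authentication for your Search Guard setup):

# 1. disable shard allocation so shards don't start shuffling as nodes stop
curl -k -X PUT 'https://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"persistent": {"cluster.routing.allocation.enable": "none"}}'

# 2. stop indexing traffic, then perform a synced flush (safe to retry if some shards report failures)
curl -k -X POST 'https://localhost:9200/_flush/synced'

# ... full cluster restart, wait for all nodes to rejoin ...

# 3. re-enable shard allocation
curl -k -X PUT 'https://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'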

Thanks @DavidTurner!

