ES stuck in red state despite all nodes being in the cluster: "ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]"


(Robert Blankenship) #1

Our 231-node cloud ES cluster is stuck in a "red" state.
Cluster health:

{
  "cluster_name" : "exabeam-es",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 231,
  "number_of_data_nodes" : 230,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : "NaN"
}

Here is our configuration:

discovery.zen.ping.unicast.hosts: [/* 185 hosts in the list */]
discovery.zen.minimum_master_nodes:   "93"
discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: "60s"
transport.tcp.connect_timeout: "60s"

We are getting this exception:

[2018-11-29T21:30:31,603][WARN ][r.suppressed             ] path: /index-migrations, params: {index=index-migrations}
org.elasticsearch.transport.RemoteTransportException: [host46-2][10.50.61.136:9300][indices:admin/create]
Caused by: org.elasticsearch.discovery.MasterNotDiscoveredException: ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:209) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:311) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:238) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.cluster.service.ClusterService$NotifyTimeout.run(ClusterService.java:1057) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
Caused by: org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
at org.elasticsearch.cluster.block.ClusterBlocks.indexBlockedException(ClusterBlocks.java:182) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.admin.indices.create.TransportCreateIndexAction.checkBlock(TransportCreateIndexAction.java:64) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.admin.indices.create.TransportCreateIndexAction.checkBlock(TransportCreateIndexAction.java:39) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:134) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.start(TransportMasterNodeAction.java:126) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:104) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:54) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:64) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:54) ~[elasticsearch-5.4.0.jar:5.4.0]
at com.floragunn.searchguard.ssl.transport.SearchGuardSSLRequestHandler.messageReceivedDecorate(SearchGuardSSLRequestHandler.java:177) ~[?:?]
at com.floragunn.searchguard.ssl.transport.SearchGuardSSLRequestHandler.messageReceived(SearchGuardSSLRequestHandler.java:139) ~[?:?]

Here is an excerpt from 'https://localhost:9200/_cluster/state'

  "cluster_name" : "exabeam-es",
  "version" : 17,
  "state_uuid" : "XJim7LIaRxiTWw-9vE1ItQ",
  "master_node" : "-YfA-UIpQ4yVXcFjW0F7YQ",
  "blocks" : {
    "global" : {
      "1" : {
        "description" : "state not recovered / initialized",
        "retryable" : true,
        "disable_state_persistence" : true,
        "levels" : [
          "read",
          "write",
          "metadata_read",
          "metadata_write"
        ]
      }
    }
  },
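
(For reference, the blocks section above can also be fetched on its own via the cluster state metric filter: curl -k 'https://localhost:9200/_cluster/state/blocks?pretty')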

(Mark Walkom) #2

What version are you on?
Is this happening on just one node, or across the entire cluster (i.e. on other nodes too)?

That unicast hosts list is pretty excessive and likely going to cause you heartache. You really only need 3 dedicated, master-eligible nodes; then you put the details of those 3 into discovery.zen.ping.unicast.hosts.
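
As a rough sketch (the hostnames here are placeholders), the 3 dedicated masters would each run with something like:

node.master: true
node.data: false
discovery.zen.ping.unicast.hosts: ["master-1", "master-2", "master-3"]
discovery.zen.minimum_master_nodes: 2

and every other node would set node.master: false while pointing at the same 3 unicast hosts.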


(Robert Blankenship) #3

Thanks for responding @warkolm,

Agreed, we would like to tune down the master node count as well; we can pursue that in parallel. We would still like to know the root cause of this issue, if possible.

Version information:

{
  "name" : "host1-0",
  "cluster_name" : "exabeam-es",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "5.4.0",
    "build_hash" : "780f8c4",
    "build_date" : "2017-04-28T17:43:27.229Z",
    "build_snapshot" : false,
    "lucene_version" : "6.5.0"
  },
  "tagline" : "You Know, for Search"
}

(Mark Walkom) #4

What's the status of the current master?


(Robert Blankenship) #5

Output of the _cat/master API ('https://localhost:9200/_cat/master')

-YfA-UIpQ4yVXcFjW0F7YQ elasticsearch-host46-c 10.50.61.136 host46-2

Output of the _nodes/host46-2 API ('https://localhost:9200/_nodes/host46-2?pretty')

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "exabeam-es",
  "nodes" : {
    "-YfA-UIpQ4yVXcFjW0F7YQ" : {
      "name" : "host46-2",
      "transport_address" : "10.50.61.136:9300",
      "host" : "elasticsearch-host46-c",
      "ip" : "10.50.61.136",
      "version" : "5.4.0",
      "build_hash" : "780f8c4",
      "total_indexing_buffer" : 1181116006,
      "roles" : [
        "master",
        "data"
      ],
      "attributes" : {
        "rack_id" : "host46",
        "box_type" : "warm"
      },
      "settings" : {
        "cluster" : {
          "name" : "exabeam-es",
          "routing" : {
            "allocation" : {
              "awareness" : {
                "attributes" : "rack_id"
              },
              "node_concurrent_recoveries" : "50",
              "node_initial_primaries_recoveries" : "50"
            }
          }
        },
        "node" : {
          "name" : "host46-2",
          "attr" : {
            "rack_id" : "host46",
            "box_type" : "warm"
          },
          "data" : "true",
          "ingest" : "false",
          "master" : "true"
        },
        "path" : {
          "data" : [
            "/opt/exabeam/data/lms/elasticsearch/1",
            "/opt/exabeam/data/lms/elasticsearch/5"
          ],
          "logs" : "/opt/elasticsearch/logs",
          "home" : "/opt/elasticsearch"
        },
        "indices" : {
          "query" : {
            "bool" : {
              "max_clause_count" : "10050"
            }
          }
        },
        "discovery" : {
          "zen" : {
            "minimum_master_nodes" : "93",
            "fd" : {
              "ping_interval" : "5s",
              "ping_timeout" : "60s"
            },
            "ping" : {
              "unicast" : {
                "hosts" : []
              }
            }
          }
        },
        "thread_pool" : {
          "search" : {
            "size" : "2"
          }
        },
        "client" : {
          "type" : "node"
        },
        "http" : {
          "type" : "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyHttpServerTransport"
        },
        "index" : {
          "store" : {
            "type" : "niofs"
          }
        },
        "bootstrap" : {
          "memory_lock" : "true"
        },
        "transport" : {
          "tcp" : {
            "connect_timeout" : "60s"
          },
          "type" : "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport"
        },
        "network" : {
          "host" : "elasticsearch-host46-c",
          "bind_host" : "0.0.0.0"
        }
      },
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "name" : "Linux",
        "arch" : "amd64",
        "version" : "3.10.0-862.9.1.el7.x86_64",
        "available_processors" : 16,
        "allocated_processors" : 16
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 12,
        "mlockall" : true
      },
      "jvm" : {
        "pid" : 12,
        "version" : "1.8.0_91",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.91-b14",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1543461055638,
        "mem" : {
          "heap_init_in_bytes" : 11811160064,
          "heap_max_in_bytes" : 11811160064,
          "non_heap_init_in_bytes" : 2555904,
          "non_heap_max_in_bytes" : 0,
          "direct_max_in_bytes" : 11811160064
        },
        "gc_collectors" : [
          "G1 Young Generation",
          "G1 Old Generation"
        ],
        "memory_pools" : [
          "Code Cache",
          "Metaspace",
          "Compressed Class Space",
          "G1 Eden Space",
          "G1 Survivor Space",
          "G1 Old Gen"
        ],
        "using_compressed_ordinary_object_pointers" : "true"
      }
    }
  }
}

(had to remove some info from the response to stay within the post character limit)


(Robert Blankenship) #6

Interestingly, this API throws an error:

[exabeam@ip-10-20-242-43 ~]$ curl -k 'https://localhost:9200/_cat/shards'
{"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"}],"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"},"status":503}

(Robert Blankenship) #7

@warkolm Do you by any chance have any documentation explaining why a high master-eligible node count is detrimental, and what the performance trade-offs are?


(Robert Blankenship) #8

@warkolm Any other info I can grab while the system is in this state?

We will probably look into remediation steps soon (such as restarting the elected ES master node).

I'm happy to jump on a call w/ Elastic support/engineering.


(Mark Walkom) #9

Unfortunately we only offer that for customers.

I asked those few questions as a quick knowledge-gathering exercise, but I don't have time to help at this stage; maybe I can get back to this later today. Someone else might be able to jump in in the meantime, though.


(Robert Blankenship) #10

Any tips on shrinking the number of master nodes in the cluster?

I'm concerned that we could hit a split-brain issue while shrinking the number of master-eligible nodes (technically we only have 1 dedicated master node; the rest are data+master nodes). By "shrinking" I mean marking ~95% of our master+data nodes as data-only.


(David Turner) #11

I think you are already susceptible to a split-brain issue:

The number of master-eligible nodes must be strictly less than twice the discovery.zen.minimum_master_nodes setting to avoid split brains, but here you have 231 - 1 = 230 master-eligible nodes, whereas 2 × 93 = 186, and 230 ≥ 186. The number of nodes in discovery.zen.ping.unicast.hosts is irrelevant: those nodes are only used for discovery, while voting rights are determined by the node.master setting on each node.
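
To make that arithmetic concrete: a safe value is a strict majority of the master-eligible nodes, i.e. floor(N/2) + 1. A quick shell check:

$ echo $((230 / 2 + 1))    # majority of 230 master-eligible nodes
116

So with ~230 master-eligible nodes you would need minimum_master_nodes of 116, not 93, to rule out two disjoint majorities.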

Managing minimum_master_nodes as a cluster grows or shrinks is a pain, and this pain can be avoided using a small set of dedicated masters, which is one good reason for this architecture recommendation.

Cluster formation happens in roughly two phases: first a master is elected, and second the data nodes join the elected master. I think having so many master-eligible nodes is disrupting the first phase. The election process is somewhat similar to consensus, and like consensus it is provably impossible to do reliably, so the best we can do is design for a high probability of success in common cases. 200+ master-eligible nodes is not a common case, and although I don't have access to 200+ nodes to try it, I can speculate that there are multiple elections all running concurrently, discovering the other inconsistent elections, and ultimately failing.

Additionally, I think there are only a limited number of threads available for forming connections between master-eligible nodes, so it's going to be easy to run out when you have so many.

Apart from "there should be 3" the only other tip I can think of now is that you might want to consider using gateway.recover_after_nodes to make the elected master wait for enough of the data nodes to join the cluster before starting to allocate the shards.


(Bryan Stuhlsatz) #12

With that many nodes it is also good practice to have dedicated masters, especially if you have lots of shards (primaries and replicas). I would think you would need to shut everything down. Pick 3 servers to be the masters, with node.master set only on those 3. There can only be one elected master at a given time, and it is a waste of CPU and memory to replicate the cluster state to 90+ servers, have them all participate in elections, and still expect to elect a master effectively during times of patching/rebooting. Then start those masters up. Next, change the node config on all the rest so they are not master-eligible. Depending on how many shards and replicas you have, you will also need to set the gateway recover-after-x-nodes settings, picking something relevant to your setup; if the number is too low, your nodes will constantly be moving and shifting shards around with lots of disk I/O. Good luck.
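
Once everything is back up, a quick way to check that the roles ended up where you expect is the cat nodes API; node.role shows the m/d/i flags per node and the master column marks the elected master with a *:

curl -k 'https://localhost:9200/_cat/nodes?v&h=name,node.role,master'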


(Mark Walkom) #13

All nodes hold the cluster state though.


(Rahal Aymen) #14

Hello, the minimum_master_nodes setting in an Elasticsearch cluster is calculated like this:
the number of master-eligible nodes divided by two, plus 1.


(Rahal Aymen) #15

If you don't configure the master-eligible nodes correctly, you will face a problem called "split brain".