ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized]

Antoine_Brun · September 5, 2016, 9:14am

Hello,

I hope someone can help here.
I have a lab cluster running 1.7.5.
It has been running fine for months. It's a lab, so we can delete the indices, the data, etc.

Last week I realized that the cluster state was red, and I got the following status:

curl 'http://localhost:9200/_cluster/_status?pretty'
{
  "error" : "ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]",
  "status" : 503
}

There were no errors in the ES log files of the nodes (1 client, 2 data and 2 master).
I tried restarting several times. I even deleted all the data, first with the delete index API and then directly from the filesystem on each node, and restarted, but the status stayed red.

It's as if the status is permanently stored as RED, and I could not find a way to reset it.

Does anyone know how to do this?

Antoine

The state is:

curl -XGET 'http://localhost:9200/_cluster/state?pretty'
{
  "cluster_name" : "DEVES17",
  "version" : 4,
  "master_node" : "pXv111crTtutpuYe3ROXkw",
  "blocks" : {
    "global" : {
      "1" : {
        "description" : "state not recovered / initialized",
        "retryable" : true,
        "disable_state_persistence" : true,
        "levels" : [ "read", "write", "metadata_read", "metadata_write" ]
      }
    }
  },
  "nodes" : {
    "bk3vIkpHSxqcxRGFibjtJQ" : {
      "name" : "DEVES17_NODE_CLIENT1",
      "transport_address" : "inet[/10.30.19.95:9300]",
      "attributes" : {
        "tag" : "CLIENT1",
        "data" : "false",
        "master" : "false"
      }
    },
    "QMiMDvdqQN2q6_E6EqGpBA" : {
      "name" : "DEVES17_NODE_DATA1",
      "transport_address" : "inet[/10.30.19.104:9300]",
      "attributes" : {
        "tag" : "DATA1",
        "master" : "false"
      }
    },
    "XsZU2pJdSqqYgJzbrjL4Aw" : {
      "name" : "DEVES17_NODE_DATA2",
      "transport_address" : "inet[/10.30.19.106:9300]",
      "attributes" : {
        "tag" : "DATA2",
        "master" : "false"
      }
    },
    "qHJaL7XvTuaBt5KPntjArg" : {
      "name" : "DEVES17_NODE_MASTER1",
      "transport_address" : "inet[/10.30.19.105:9300]",
      "attributes" : {
        "tag" : "MASTER1",
        "data" : "false",
        "master" : "true"
      }
    },
    "pXv111crTtutpuYe3ROXkw" : {
      "name" : "DEVES17_NODE_MASTER2",
      "transport_address" : "inet[/10.30.19.107:9300]",
      "attributes" : {
        "tag" : "MASTER2",
        "data" : "false",
        "master" : "true"
      }
    }
  },
  "metadata" : {
    "templates" : { },
    "indices" : { }
  },
  "routing_table" : {
    "indices" : { }
  },
  "routing_nodes" : {
    "unassigned" : [ ],
    "nodes" : {
      "XsZU2pJdSqqYgJzbrjL4Aw" : [ ],
      "QMiMDvdqQN2q6_E6EqGpBA" : [ ]
    }
  },
  "allocations" : [ ]
}
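The key part of this output is the "blocks" section: a global block with id 1 covers reads, writes and metadata operations, which is why every request returns a 503. As a minimal sketch (the helper name `global_blocks` is hypothetical, and the input is a trimmed copy of the response above), the blocks can be pulled out of a saved cluster state like this:

```python
import json


def global_blocks(cluster_state):
    """Return (id, description, levels) for every global block in a cluster state dict."""
    blocks = cluster_state.get("blocks", {}).get("global", {})
    return [
        (block_id, block.get("description", ""), block.get("levels", []))
        for block_id, block in sorted(blocks.items())
    ]


# Trimmed copy of the "blocks" section from the response above.
sample_state = json.loads("""
{
  "cluster_name": "DEVES17",
  "blocks": {
    "global": {
      "1": {
        "description": "state not recovered / initialized",
        "retryable": true,
        "disable_state_persistence": true,
        "levels": ["read", "write", "metadata_read", "metadata_write"]
      }
    }
  }
}
""")

for block_id, description, levels in global_blocks(sample_state):
    print(f"global block {block_id}: {description} (blocks: {', '.join(levels)})")
```

An empty result would mean no cluster-wide block is active; here the single entry matches the text of the ClusterBlockException.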

Antoine_Brun · September 6, 2016, 8:30am

Hello,

Does anyone know how to properly clean up a cluster status?
Is there any file to delete?

Any help would be appreciated because I'm really stuck here.

Antoine

tanguy · September 6, 2016, 9:09am

Hi,

Can you check your logs again? You should have a message indicating why the cluster state turned RED.

Antoine_Brun · September 6, 2016, 10:30am

I have lots of exceptions like the ones below, but I'm not sure they're related, and I can't find the exact time when the cluster turned red.

But anyway, what if I want to completely reset the cluster status?
Let's assume that I don't need to recover my data: what should I do to totally reset ES?

[2016-08-19 16:08:02,168][WARN ][http.netty               ] [DEV-ES17_NODE_MASTER1] Caught exception while handling client http traffic, closing connection [id: 0xfa39ca56, /127.0.0.1:58336 => /127.0.0.1:9200]
org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: An HTTP line is larger than 4096 bytes.
        at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.readLine(HttpMessageDecoder.java:670)
        at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.decode(HttpMessageDecoder.java:184)
        at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.decode(HttpMessageDecoder.java:102)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2016-08-19 16:08:02,196][WARN ][http.netty               ] [DEV-ES17_NODE_MASTER1] Caught exception while handling client http traffic, closing connection [id: 0xfa39ca56, /127.0.0.1:58336 :> /127.0.0.1:9200]
java.lang.IllegalArgumentException: empty text
        at org.elasticsearch.common.netty.handler.codec.http.HttpVersion.<init>(HttpVersion.java:89)
        at org.elasticsearch.common.netty.handler.codec.http.HttpVersion.valueOf(HttpVersion.java:62)
        at org.elasticsearch.common.netty.handler.codec.http.HttpRequestDecoder.createMessage(HttpRequestDecoder.java:75)
        at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.decode(HttpMessageDecoder.java:191)
        at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.decode(HttpMessageDecoder.java:102)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:554)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.channelDisconnected(FrameDecoder.java:365)

tanguy · September 6, 2016, 10:45am

You can shut down all your nodes, delete the data directory on each node and then restart each node. To find where the data directory is located, see the "Directory layout" table in the Elasticsearch Guide, which lists all the default locations.

Antoine_Brun · September 6, 2016, 10:55am

I already did this and restarted the cluster.
The master node logs look OK:

[2016-09-06 12:51:42,848][INFO ][node                     ] [DEVES17_NODE_MASTER1] version[1.7.5], pid[18593], build[00f95f4/2016-02-02T09:55:30Z]
[2016-09-06 12:51:42,852][INFO ][node                     ] [DEVES17_NODE_MASTER1] initializing ...
[2016-09-06 12:51:43,049][INFO ][plugins                  ] [DEVES17_NODE_MASTER1] loaded [], sites [HQ, kopf, head]
[2016-09-06 12:51:43,144][INFO ][env                      ] [DEVES17_NODE_MASTER1] using [1] data paths, mounts [[/ (/dev/mapper/VolGroup-lv_root)]], net usable_space [19.2gb], net total_space [24.4gb], types [ext4]
[2016-09-06 12:51:51,695][INFO ][node                     ] [DEVES17_NODE_MASTER1] initialized
[2016-09-06 12:51:51,696][INFO ][node                     ] [DEVES17_NODE_MASTER1] starting ...
[2016-09-06 12:51:52,223][INFO ][transport                ] [DEVES17_NODE_MASTER1] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.30.19.105:9300]}
[2016-09-06 12:51:52,344][INFO ][discovery                ] [DEVES17_NODE_MASTER1] DEVES17/hBH9uH2kTmiigTf-qTKS8g
[2016-09-06 12:51:56,178][INFO ][discovery.zen            ] [DEVES17_NODE_MASTER1] failed to send join request to master [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}], reason [RemoteTransportException[[DEVES17_NODE_MASTER2][inet[/10.30.19.107:9300]][internal:discovery/zen/join]]; nested: ElasticsearchIllegalStateException[Node [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}] not master for join request from [[DEVES17_NODE_MASTER1][hBH9uH2kTmiigTf-qTKS8g][DEV-ES1-MASTER1][inet[/10.30.19.105:9300]]{tag=MASTER1, data=false, master=true}]]; ], tried [3] times
[2016-09-06 12:51:59,267][INFO ][cluster.service          ] [DEVES17_NODE_MASTER1] detected_master [DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}, added {[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true},}, reason: zen-disco-receive(from master [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}])
[2016-09-06 12:51:59,287][INFO ][http                     ] [DEVES17_NODE_MASTER1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.30.19.105:9200]}
[2016-09-06 12:51:59,289][INFO ][node                     ] [DEVES17_NODE_MASTER1] started
[2016-09-06 12:51:59,567][INFO ][cluster.service          ] [DEVES17_NODE_MASTER1] added {[DEVES17_NODE_CLIENT1][2p2O2S3ITjC4s3KRmZmKAQ][DEV-ES1-CLIENT1][inet[/10.30.19.95:9300]]{tag=CLIENT1, data=false, master=false},}, reason: zen-disco-receive(from master [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}])
[2016-09-06 12:52:00,264][INFO ][cluster.service          ] [DEVES17_NODE_MASTER1] added {[DEVES17_NODE_DATA2][hktw08S6RH67acfIeBxvyA][DEV-ES1-DATA2][inet[/10.30.19.106:9300]]{tag=DATA2, master=false},}, reason: zen-disco-receive(from master [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}])
[2016-09-06 12:52:00,372][INFO ][cluster.service          ] [DEVES17_NODE_MASTER1] added {[DEVES17_NODE_DATA1][TstwYyBsSBW4W-0u1NIetw][DEV-ES1-DATA1][inet[/10.30.19.104:9300]]{tag=DATA1, master=false},}, reason: zen-disco-receive(from master [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}])

But the cluster state stays red:
curl 'http://localhost:9200/_cluster/_status?pretty'
{
  "error" : "ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]",
  "status" : 503
}

tanguy · September 6, 2016, 11:33am

What does your configuration file look like (for the master nodes)? Did you configure anything special for the local gateway, such as 'recover_after_*' or 'gateway.expected_nodes'? Or maybe minimum master nodes?

Antoine_Brun · September 6, 2016, 12:27pm

Thank you, that was the issue! Still, I wonder why the state became red even though we didn't make any configuration changes recently.

Anyway, thanks a lot.

Antoine

tanguy · September 6, 2016, 12:29pm

Glad you solved your issue. Can you please tell me more about it?

Antoine_Brun · September 6, 2016, 12:32pm

gateway.recover_after_nodes was set to 5, but there were only 4 nodes (2 data + 2 master).

What is strange is that this setting was wrong for weeks, but the cluster only became red recently.
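To make the mismatch concrete, here is a rough sketch of the gate as I understand it, not the actual Elasticsearch code. It assumes that only master-eligible and data nodes count towards gateway.recover_after_nodes (so the client node is ignored), that a missing "data" or "master" attribute means the default of true, and that -1 disables the check. The function names are hypothetical; the node attributes are copied from the cluster state posted earlier.

```python
def is_master_or_data(attributes):
    """True if the node is master-eligible or holds data.

    Assumption: an absent "data" or "master" attribute means the default (true).
    """
    holds_data = attributes.get("data", "true") != "false"
    master_eligible = attributes.get("master", "true") != "false"
    return holds_data or master_eligible


def recovery_blocked(nodes, recover_after_nodes):
    """True if state recovery would wait because too few nodes have joined.

    Assumption: -1 means the setting is not configured.
    """
    if recover_after_nodes == -1:
        return False
    counted = sum(
        1 for node in nodes.values() if is_master_or_data(node.get("attributes", {}))
    )
    return counted < recover_after_nodes


# Node attributes as shown in the cluster state earlier in this thread.
nodes = {
    "DEVES17_NODE_CLIENT1": {"attributes": {"tag": "CLIENT1", "data": "false", "master": "false"}},
    "DEVES17_NODE_DATA1": {"attributes": {"tag": "DATA1", "master": "false"}},
    "DEVES17_NODE_DATA2": {"attributes": {"tag": "DATA2", "master": "false"}},
    "DEVES17_NODE_MASTER1": {"attributes": {"tag": "MASTER1", "data": "false", "master": "true"}},
    "DEVES17_NODE_MASTER2": {"attributes": {"tag": "MASTER2", "data": "false", "master": "true"}},
}

print(recovery_blocked(nodes, 5))  # True: only 4 nodes count, so recovery waits
print(recovery_blocked(nodes, 4))  # False: the threshold is met
```

If the check is only evaluated while the "state not recovered" block is present, which is my reading of the 1.x behaviour and not something confirmed in this thread, a wrong value could go unnoticed until the next full restart of the cluster.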

Now I keep getting:

[2016-09-06 14:31:01,946][WARN ][http.netty               ] [DEVES17_NODE_MASTER1] Caught exception while handling client http traffic, closing connection [id: 0x434ccaae, /127.0.0.1:48852 => /127.0.0.1:9200]
org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: An HTTP line is larger than 4096 bytes.
    at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.readLine(HttpMessageDecoder.java:670)

Do you have any suggestions on how to troubleshoot this?

Antoine

tanguy · September 6, 2016, 12:35pm

Ok, thanks.

It looks like there's a local process that sends an incorrect HTTP request to the DEVES17_NODE_MASTER1 node.
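One way to narrow this down is to check whether any local client builds very long URLs, for example a request that lists many index names in the path. The sketch below approximates the size of the first line of an HTTP request and compares it to the 4096-byte limit quoted in the exception. The helper names and the index names are hypothetical, and the comparison is only approximate since the exact boundary handling inside the HTTP decoder may differ by a byte or two.

```python
MAX_INITIAL_LINE_LENGTH = 4096  # limit quoted in the TooLongFrameException message


def request_line(method, path_and_query, version="HTTP/1.1"):
    """Build the first line of an HTTP request, e.g. 'GET /_search HTTP/1.1'."""
    return f"{method} {path_and_query} {version}"


def exceeds_limit(method, path_and_query, limit=MAX_INITIAL_LINE_LENGTH):
    """Approximate check: is the request line longer than the limit in bytes?"""
    return len(request_line(method, path_and_query).encode("utf-8")) > limit


# A short request is fine.
print(exceeds_limit("GET", "/_cluster/state?pretty"))  # False

# Hypothetical client that queries a year of daily indices in a single URL.
daily_indices = [f"logs-2016.{month:02d}.{day:02d}"
                 for month in range(1, 13) for day in range(1, 29)]
long_path = "/" + ",".join(daily_indices) + "/_search"
print(exceeds_limit("GET", long_path))  # True
```

If such a client turns out to be legitimate, sending the long parameters in the request body is one option. I believe the limit itself is governed by the http.max_initial_line_length setting, but please verify that against the documentation for your version before changing it.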
