ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized]

Hello,

I hope someone can help here.
I have a lab cluster running 1.7.5.
It has been running fine for months. It's a lab, so we can freely delete the indices, the data, and so on.

Last week I realized that the cluster state was red and I had the following status:

curl 'http://localhost:9200/_cluster/_status?pretty'
{
  "error" : "ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]",
  "status" : 503
}

There was no error in the ES log files of the nodes (1 client, 2 data and 2 master).
I tried several restarts. I even deleted all the data, first with the index delete API and then directly from the filesystem on each node, and restarted, but the status stayed red.

It's as if the status is permanently stored as RED and I cannot find a way to reset it.

Does anyone know how to do this?

Antoine

The state is:

curl -XGET 'http://localhost:9200/_cluster/state?pretty'
{
  "cluster_name" : "DEVES17",
  "version" : 4,
  "master_node" : "pXv111crTtutpuYe3ROXkw",
  "blocks" : {
    "global" : {
      "1" : {
        "description" : "state not recovered / initialized",
        "retryable" : true,
        "disable_state_persistence" : true,
        "levels" : [ "read", "write", "metadata_read", "metadata_write" ]
      }
    }
  },
  "nodes" : {
    "bk3vIkpHSxqcxRGFibjtJQ" : {
      "name" : "DEVES17_NODE_CLIENT1",
      "transport_address" : "inet[/10.30.19.95:9300]",
      "attributes" : {
        "tag" : "CLIENT1",
        "data" : "false",
        "master" : "false"
      }
    },
    "QMiMDvdqQN2q6_E6EqGpBA" : {
      "name" : "DEVES17_NODE_DATA1",
      "transport_address" : "inet[/10.30.19.104:9300]",
      "attributes" : {
        "tag" : "DATA1",
        "master" : "false"
      }
    },
    "XsZU2pJdSqqYgJzbrjL4Aw" : {
      "name" : "DEVES17_NODE_DATA2",
      "transport_address" : "inet[/10.30.19.106:9300]",
      "attributes" : {
        "tag" : "DATA2",
        "master" : "false"
      }
    },
    "qHJaL7XvTuaBt5KPntjArg" : {
      "name" : "DEVES17_NODE_MASTER1",
      "transport_address" : "inet[/10.30.19.105:9300]",
      "attributes" : {
        "tag" : "MASTER1",
        "data" : "false",
        "master" : "true"
      }
    },
    "pXv111crTtutpuYe3ROXkw" : {
      "name" : "DEVES17_NODE_MASTER2",
      "transport_address" : "inet[/10.30.19.107:9300]",
      "attributes" : {
        "tag" : "MASTER2",
        "data" : "false",
        "master" : "true"
      }
    }
  },
  "metadata" : {
    "templates" : { },
    "indices" : { }
  },
  "routing_table" : {
    "indices" : { }
  },
  "routing_nodes" : {
    "unassigned" : [ ],
    "nodes" : {
      "XsZU2pJdSqqYgJzbrjL4Aw" : [ ],
      "QMiMDvdqQN2q6_E6EqGpBA" : [ ]
    }
  },
  "allocations" : [ ]
}

Hello,

Does anyone know how to properly clean up the cluster state?
Is there any file to be deleted?

Any help would be appreciated, because I'm really stuck here.

Antoine

Hi,

Can you check your logs again? You should have a message indicating why the cluster state turned RED.

I have lots of exceptions like the one below, but I'm not sure they're related, and I can't find the exact time when the cluster turned red.

But anyway, what if I want to completely reset the cluster state?
Let's assume that I don't need to recover my data; what should I do to totally reset ES?

[2016-08-19 16:08:02,168][WARN ][http.netty               ] [DEV-ES17_NODE_MASTER1] Caught exception while handling client http traffic, closing connection [id: 0xfa39ca56, /127.0.0.1:58336 => /127.0.0.1:9200]
org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: An HTTP line is larger than 4096 bytes.
        at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.readLine(HttpMessageDecoder.java:670)
        at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.decode(HttpMessageDecoder.java:184)
        at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.decode(HttpMessageDecoder.java:102)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2016-08-19 16:08:02,196][WARN ][http.netty               ] [DEV-ES17_NODE_MASTER1] Caught exception while handling client http traffic, closing connection [id: 0xfa39ca56, /127.0.0.1:58336 :> /127.0.0.1:9200]
java.lang.IllegalArgumentException: empty text
        at org.elasticsearch.common.netty.handler.codec.http.HttpVersion.<init>(HttpVersion.java:89)
        at org.elasticsearch.common.netty.handler.codec.http.HttpVersion.valueOf(HttpVersion.java:62)
        at org.elasticsearch.common.netty.handler.codec.http.HttpRequestDecoder.createMessage(HttpRequestDecoder.java:75)
        at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.decode(HttpMessageDecoder.java:191)
        at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.decode(HttpMessageDecoder.java:102)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:554)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.channelDisconnected(FrameDecoder.java:365)

You can shut down all your nodes, delete the data directory on each node, and then restart each node. To find where the data directory is located, this table (Directory layout | Elasticsearch Guide [8.11] | Elastic) shows all the default locations.
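A minimal sketch of that procedure, assuming a package-based install where Elasticsearch runs as a service and the data directory is the default /var/lib/elasticsearch (check path.data in your elasticsearch.yml first and adjust the paths accordingly):

# Run on each node: stop the service, wipe the data directory, then restart.
sudo service elasticsearch stop
sudo rm -rf /var/lib/elasticsearch/*
sudo service elasticsearch start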

I already did this and restarted the cluster.
The master node logs look OK:

[2016-09-06 12:51:42,848][INFO ][node                     ] [DEVES17_NODE_MASTER1] version[1.7.5], pid[18593], build[00f95f4/2016-02-02T09:55:30Z]
[2016-09-06 12:51:42,852][INFO ][node                     ] [DEVES17_NODE_MASTER1] initializing ...
[2016-09-06 12:51:43,049][INFO ][plugins                  ] [DEVES17_NODE_MASTER1] loaded [], sites [HQ, kopf, head]
[2016-09-06 12:51:43,144][INFO ][env                      ] [DEVES17_NODE_MASTER1] using [1] data paths, mounts [[/ (/dev/mapper/VolGroup-lv_root)]], net usable_space [19.2gb], net total_space [24.4gb], types [ext4]
[2016-09-06 12:51:51,695][INFO ][node                     ] [DEVES17_NODE_MASTER1] initialized
[2016-09-06 12:51:51,696][INFO ][node                     ] [DEVES17_NODE_MASTER1] starting ...
[2016-09-06 12:51:52,223][INFO ][transport                ] [DEVES17_NODE_MASTER1] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.30.19.105:9300]}
[2016-09-06 12:51:52,344][INFO ][discovery                ] [DEVES17_NODE_MASTER1] DEVES17/hBH9uH2kTmiigTf-qTKS8g
[2016-09-06 12:51:56,178][INFO ][discovery.zen            ] [DEVES17_NODE_MASTER1] failed to send join request to master [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}], reason [RemoteTransportException[[DEVES17_NODE_MASTER2][inet[/10.30.19.107:9300]][internal:discovery/zen/join]]; nested: ElasticsearchIllegalStateException[Node [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}] not master for join request from [[DEVES17_NODE_MASTER1][hBH9uH2kTmiigTf-qTKS8g][DEV-ES1-MASTER1][inet[/10.30.19.105:9300]]{tag=MASTER1, data=false, master=true}]]; ], tried [3] times
[2016-09-06 12:51:59,267][INFO ][cluster.service          ] [DEVES17_NODE_MASTER1] detected_master [DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}, added {[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true},}, reason: zen-disco-receive(from master [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}])
[2016-09-06 12:51:59,287][INFO ][http                     ] [DEVES17_NODE_MASTER1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.30.19.105:9200]}
[2016-09-06 12:51:59,289][INFO ][node                     ] [DEVES17_NODE_MASTER1] started
[2016-09-06 12:51:59,567][INFO ][cluster.service          ] [DEVES17_NODE_MASTER1] added {[DEVES17_NODE_CLIENT1][2p2O2S3ITjC4s3KRmZmKAQ][DEV-ES1-CLIENT1][inet[/10.30.19.95:9300]]{tag=CLIENT1, data=false, master=false},}, reason: zen-disco-receive(from master [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}])
[2016-09-06 12:52:00,264][INFO ][cluster.service          ] [DEVES17_NODE_MASTER1] added {[DEVES17_NODE_DATA2][hktw08S6RH67acfIeBxvyA][DEV-ES1-DATA2][inet[/10.30.19.106:9300]]{tag=DATA2, master=false},}, reason: zen-disco-receive(from master [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}])
[2016-09-06 12:52:00,372][INFO ][cluster.service          ] [DEVES17_NODE_MASTER1] added {[DEVES17_NODE_DATA1][TstwYyBsSBW4W-0u1NIetw][DEV-ES1-DATA1][inet[/10.30.19.104:9300]]{tag=DATA1, master=false},}, reason: zen-disco-receive(from master [[DEVES17_NODE_MASTER2][RrrZPWfbQrqsxwORXWbLvA][DEV-ES1-MASTER2][inet[/10.30.19.107:9300]]{tag=MASTER2, data=false, master=true}])

But the cluster state stays red:
curl 'http://localhost:9200/_cluster/_status?pretty'
{
  "error" : "ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]",
  "status" : 503
}

What does your configuration file look like (for the master nodes)? Did you configure something special for the local gateway, like 'recover_after_*' or 'gateway.expected_nodes'? Or maybe minimum master nodes?
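For example, one way to check those settings on each master-eligible node is to grep the node configuration (the path assumes a package install; adjust to your layout):

# Show any local gateway recovery or minimum master node settings.
grep -E 'gateway\.(recover_after|expected)|minimum_master_nodes' /etc/elasticsearch/elasticsearch.yml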


Thank you, that was the issue! Still, I wonder why the state became red even though we didn't make any configuration change recently.

Anyway, thanks a lot.

Antoine

Glad you solved your issue. Can you please tell me more about it?

gateway.recover_after_nodes was set to 5, but there were only 4 nodes (2 data + 2 master).

What is strange is that this setting was wrong for weeks, but it's only recently that the cluster became red.
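For context, a quick sanity check for this kind of mismatch is to count the nodes that have actually joined and compare that with the configured threshold; a rough sketch (_cat/nodes is a standard API, the config path is an assumption):

# Number of nodes currently in the cluster (skip the header line).
curl -s 'http://localhost:9200/_cat/nodes?v' | tail -n +2 | wc -l
# Configured recovery threshold on this node.
grep 'gateway.recover_after_nodes' /etc/elasticsearch/elasticsearch.yml

If the configured value is higher than the number of nodes that can ever join, the local gateway never recovers the cluster state and the block stays in place.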

Now I keep getting:

[2016-09-06 14:31:01,946][WARN ][http.netty               ] [DEVES17_NODE_MASTER1] Caught exception while handling client http traffic, closing connection [id: 0x434ccaae, /127.0.0.1:48852 => /127.0.0.1:9200]
org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: An HTTP line is larger than 4096 bytes.
    at org.elasticsearch.common.netty.handler.codec.http.HttpMessageDecoder.readLine(HttpMessageDecoder.java:670)

Do you have any suggestions on how to troubleshoot this?

Antoine


Ok, thanks.

It looks like there's a local process that sends an incorrect HTTP request to the DEVES17_NODE_MASTER1 node.
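One way to track that process down (a sketch; requires root and assumes lsof and tcpdump are installed) is to see which local process holds connections to port 9200 and to capture what it actually sends:

# List local processes with established connections to the HTTP port.
sudo lsof -iTCP:9200 -sTCP:ESTABLISHED
# Capture the offending request on the loopback interface.
sudo tcpdump -A -i lo 'tcp port 9200'

If the client turns out to be legitimate and simply sends very long URLs, the limit can also be raised with the http.max_initial_line_length setting, though I would verify that it is supported in your 1.7.x version before relying on it.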