Cluster goes into red, some shards in initializing state


(Omar Al Zabir) #1

My cluster goes red several times a day while shards are initializing on an existing index.

I have a 3-node EC cluster with replication 1.

Here's an example of a shard which is in initializing state:

"id": 0, "type": "STORE", "stage": "TRANSLOG", "primary": true, "start_time": "2015-12-21T11:53:28.619Z", "start_time_in_millis": 1450698808619, "total_time": "10.7m", "total_time_in_millis": 647180, "source": { "id": "RvmYqMzKTsavJAn3z271NQ", "host": "10.35.132.143", "transport_address": "10.35.132.143:9300", "ip": "10.35.132.143", "name": "ec-dyl09026app04" }, "target": { "id": "RvmYqMzKTsavJAn3z271NQ", "host": "10.35.132.143", "transport_address": "10.35.132.143:9300", "ip": "10.35.132.143", "name": "ec-dyl09026app04" }, "index": { "size": { "total": "130b", "total_in_bytes": 130, "reused": "130b", "reused_in_bytes": 130, "recovered": "0b", "recovered_in_bytes": 0, "percent": "100.0%" }, "files": { "total": 1, "reused": 1, "recovered": 0, "percent": "100.0%" }, "total_time": "1ms", "total_time_in_millis": 1, "source_throttle_time": "-1", "source_throttle_time_in_millis": 0, "target_throttle_time": "-1", "target_throttle_time_in_millis": 0 }, "translog": { "recovered": 40392, "total": -1, "percent": "-1.0%", "total_on_start": -1, "total_time": "10.7m", "total_time_in_millis": 647178 }, "verify_index": { "check_index_time": "0s", "check_index_time_in_millis": 0, "total_time": "0s", "total_time_in_millis": 0 } }


(Mark Walkom) #2

Do you have new indices being created during the day? If so then this is expected.
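To add some context: a newly created index reports red until its primaries finish initializing, and `GET /_cluster/health?level=indices` shows which indices are responsible. A small sketch of scanning that response (the `health` dict below is a hand-made stand-in, not real output):

```python
def red_indices(health):
    """Return names of indices reported red in a
    _cluster/health?level=indices response."""
    return [name for name, info in health.get("indices", {}).items()
            if info.get("status") == "red"]

# Hand-made stand-in for a /_cluster/health?level=indices response:
health = {
    "status": "red",
    "indices": {
        "topbeat-2015.12.22": {"status": "red"},
        "filebeat-2015.12.22": {"status": "green"},
    },
}
print(red_indices(health))  # ['topbeat-2015.12.22']
```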


(Omar Al Zabir) #3

New indices get created at midnight by filebeat and topbeat.

I have seen this in the logs:

[2015-12-22 11:30:52,979][INFO ][discovery.zen            ] [ec-dyl09026app02] master_left [{ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300}], reason [transport disconnected]
[2015-12-22 11:30:52,979][WARN ][discovery.zen            ] [ec-dyl09026app02] master left (reason = transport disconnected), current nodes: {{ec-dyl09026app02}{DsZuppk3Tl67GuJvISU00w}{10.35.76.37}{10.35.76.37:9300},{ec-dyl09026app04}{RvmYqMzKTsavJAn3z271NQ}{10.35.132.143}{10.35.132.143:9300},}
[2015-12-22 11:30:52,979][INFO ][cluster.service          ] [ec-dyl09026app02] removed {{ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-master_failed ({ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300})
[2015-12-22 11:30:52,995][WARN ][discovery.zen.ping.unicast] [ec-dyl09026app02] failed to send ping to [{#zen_unicast_2#}{10.187.147.59}{10.187.147.59:9300}]

I see a node suddenly drop from the cluster and then come back. Then indexes start initializing. Almost all the indices initialize immediately, except the topbeat index for today. That one takes a very long time to initialize, and I can see there's always one shard where it gets stuck:

{
  "id": 1,
  "type": "STORE",
  "stage": "TRANSLOG",
  "primary": true,
  "start_time": "2015-12-22T11:32:36.629Z",
  "start_time_in_millis": 1450783956629,
  "total_time": "6.8m",
  "total_time_in_millis": 408657,
  "source": {
    "id": "RvmYqMzKTsavJAn3z271NQ",
    "host": "10.35.132.143",
    "transport_address": "10.35.132.143:9300",
    "ip": "10.35.132.143",
    "name": "ec-dyl09026app04"
  },
  "target": {
    "id": "RvmYqMzKTsavJAn3z271NQ",
    "host": "10.35.132.143",
    "transport_address": "10.35.132.143:9300",
    "ip": "10.35.132.143",
    "name": "ec-dyl09026app04"
  },
  "index": {
    "size": {
      "total": "130b",
      "total_in_bytes": 130,
      "reused": "130b",
      "reused_in_bytes": 130,
      "recovered": "0b",
      "recovered_in_bytes": 0,
      "percent": "100.0%"
    },
    "files": {
      "total": 1,
      "reused": 1,
      "recovered": 0,
      "percent": "100.0%"
    },
    "total_time": "1ms",
    "total_time_in_millis": 1,
    "source_throttle_time": "-1",
    "source_throttle_time_in_millis": 0,
    "target_throttle_time": "-1",
    "target_throttle_time_in_millis": 0
  },
  "translog": {
    "recovered": 25136,
    "total": -1,
    "percent": "-1.0%",
    "total_on_start": -1,
    "total_time": "6.8m",
    "total_time_in_millis": 408654
  },
  "verify_index": {
    "check_index_time": "0s",
    "check_index_time_in_millis": 0,
    "total_time": "0s",
    "total_time_in_millis": 0
  }
}
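Note what the numbers above say: the file copy itself is trivial (1 file, 130 bytes, done in 1 ms), so virtually all of the 6.8 minutes is spent replaying translog operations. Back-of-the-envelope, that is a very slow replay rate (a sketch using the figures from the output above):

```python
# Translog replay rate from the recovery stats above.
recovered_ops = 25136   # "translog.recovered"
total_time_ms = 408654  # "translog.total_time_in_millis"

ops_per_second = recovered_ops / (total_time_ms / 1000.0)
print(f"{ops_per_second:.1f} translog ops/sec")  # roughly 61 ops/sec
```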

(Omar Al Zabir) #4

More clues: I can see that the shard that has problems recovering got this exception, where it failed to delete a translog temp file:

[2015-12-22 11:31:47,605][DEBUG][action.admin.indices.get ] [ec-dyl09026app04] no known master node, scheduling a retry
[2015-12-22 11:31:49,725][INFO ][cluster.service          ] [ec-dyl09026app04] detected_master {ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300}, added {{ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-receive(from master [{ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300}])
[2015-12-22 11:32:33,637][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-dyl09026app02}{DsZuppk3Tl67GuJvISU00w}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-receive(from master [{ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300}])
[2015-12-22 11:32:36,653][WARN ][index.translog           ] [ec-dyl09026app04] [topbeat-2015.12.22][1] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/topbeat-2015.12.22/1/translog/translog-6032227611099380263.tlog
java.nio.file.NoSuchFileException: /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/topbeat-2015.12.22/1/translog/translog-6032227611099380263.tlog

(Omar Al Zabir) #5

Even more log entries on that particular shard:

[2015-12-22 11:31:52,936][WARN ][action.bulk              ] [ec-dyl09026app02] [topbeat-2015.12.22][1] failed to perform indices:data/write/bulk[s] on node {ec-dyl09026app04}{RvmYqMzKTsavJAn3z271NQ}{10.35.132.143}{10.35.132.143:9300}
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][r]]]; nested: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected];

(Mark Walkom) #6

What version are you on?


(Omar Al Zabir) #7

ES 2.1, LS 2.1 in all 3 nodes.

I have found some more clues. I see that every hour, the master node gets dropped. Immediately after that, I get the translog-cannot-be-deleted exception. For example:

[2015-12-23 05:03:41,890][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}, reason [transport disconnected]
[2015-12-23 05:03:41,892][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-
[2015-12-23 05:03:41,892][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06},}, reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 05:03:43,932][INFO ][cluster.service          ] [ec-dyl09026app04] detected_master {ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}{ added {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, 
[2015-12-23 05:04:12,441][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},}, reason: zen-disco-receive(from master [{ec-rdl04910app06}
[2015-12-23 05:04:17,234][WARN ][index.translog           ] [ec-dyl09026app04] [topbeat-2015.12.23][1] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/topbeat-2015.12.23/1/translog/translog-7465629763806507405.tlog

This is at 5:03. Then at 6:03, same story:

[2015-12-23 06:03:43,619][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [transport disconnected]
[2015-12-23 06:03:43,620][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q}{10.35.76.37}{10.35.76.37:9300},{ec-dyl09026app04}
[2015-12-23 06:03:43,621][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw} reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 06:03:59,567][INFO ][rest.suppressed          ] /_bulk Params: {}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]

Then shortly after, the same problem:

[2015-12-23 06:07:06,407][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2015-12-23 06:07:06,408][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},{ec-dyl09026app04}{p2QuXlyDR6CNvykHSeD2wA}
[2015-12-23 06:07:06,409][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, reason: zen-disco-master_failed ({ec-rdl04910app06}

The master node is in another data center. So I assume some firewall is dropping the connection after 60 minutes of inactivity. It looks like zen discovery keeps a connection open, that connection gets dropped, and then the cluster thinks the master node has left.
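If an idle-timeout firewall between the data centers is indeed the cause, two settings are worth trying (a sketch for `elasticsearch.yml` on 2.x; the exact values are illustrative, not tuned): enable the transport-level application ping so inter-node connections never look idle, and with 3 master-eligible nodes require a quorum of 2 so a dropped link cannot elect a second master.

```yaml
# Illustrative elasticsearch.yml fragment (ES 2.x setting names).

# Send a lightweight ping on node-to-node connections so an
# idle-timeout firewall does not silently drop them:
transport.ping_schedule: 5s

# With 3 master-eligible nodes, require a quorum of 2 for master
# election to avoid split-brain when a node drops out:
discovery.zen.minimum_master_nodes: 2

# Optionally make fault detection more tolerant of brief stalls:
discovery.zen.fd.ping_retries: 6
```

Lowering the OS TCP keepalive interval (e.g. `net.ipv4.tcp_keepalive_time` on Linux) below the firewall's idle timeout is another common workaround for the same symptom.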

On the master node, here are the logs at that time, for example at 5 AM:

[2015-12-23 05:03:43,903][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:323)
  Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:1096
[2015-12-23 05:03:44,779][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [logstash-vasfulf
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
[2015-12-23 05:04:12,404][INFO ][cluster.service          ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 05:04:47,260][INFO ][cluster.service          ] [ec-rdl04910app06] added {{ec-dyl090

Same at 6 AM, only with another index having the problem:

[2015-12-23 06:04:53,543][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app02][10.35.76.37:9300][indices:data/write/bulk[s][r]
Caused by: NodeNotConnectedException[[ec-dyl09026app02][10.35.76.37:9300] Node not connected]
[2015-12-23 06:07:04,900][INFO ][cluster.service          ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 06:07:04,987][DEBUG][action.admin.cluster.node.info] [ec-rdl04910app06] failed to ex
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][cluster:monitor/nodes/info[

(Mark Walkom) #8

Then I am going to keep replying here - Node disconnects every hour

