Cluster goes into red, some shards in initializing state

My cluster goes red several times a day while shards are initializing on an existing index.

I have a 3-node EC cluster with replication set to 1.

Here's an example of a shard which is in initializing state:

"id": 0, "type": "STORE", "stage": "TRANSLOG", "primary": true, "start_time": "2015-12-21T11:53:28.619Z", "start_time_in_millis": 1450698808619, "total_time": "10.7m", "total_time_in_millis": 647180, "source": { "id": "RvmYqMzKTsavJAn3z271NQ", "host": "10.35.132.143", "transport_address": "10.35.132.143:9300", "ip": "10.35.132.143", "name": "ec-dyl09026app04" }, "target": { "id": "RvmYqMzKTsavJAn3z271NQ", "host": "10.35.132.143", "transport_address": "10.35.132.143:9300", "ip": "10.35.132.143", "name": "ec-dyl09026app04" }, "index": { "size": { "total": "130b", "total_in_bytes": 130, "reused": "130b", "reused_in_bytes": 130, "recovered": "0b", "recovered_in_bytes": 0, "percent": "100.0%" }, "files": { "total": 1, "reused": 1, "recovered": 0, "percent": "100.0%" }, "total_time": "1ms", "total_time_in_millis": 1, "source_throttle_time": "-1", "source_throttle_time_in_millis": 0, "target_throttle_time": "-1", "target_throttle_time_in_millis": 0 }, "translog": { "recovered": 40392, "total": -1, "percent": "-1.0%", "total_on_start": -1, "total_time": "10.7m", "total_time_in_millis": 647178 }, "verify_index": { "check_index_time": "0s", "check_index_time_in_millis": 0, "total_time": "0s", "total_time_in_millis": 0 } }

Do you have new indices being created during the day? If so, then this is expected.

A new index gets created at midnight by filebeat and topbeat.

I have seen this in the logs:

[2015-12-22 11:30:52,979][INFO ][discovery.zen            ] [ec-dyl09026app02] master_left [{ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300}], reason [transport disconnected]
[2015-12-22 11:30:52,979][WARN ][discovery.zen            ] [ec-dyl09026app02] master left (reason = transport disconnected), current nodes: {{ec-dyl09026app02}{DsZuppk3Tl67GuJvISU00w}{10.35.76.37}{10.35.76.37:9300},{ec-dyl09026app04}{RvmYqMzKTsavJAn3z271NQ}{10.35.132.143}{10.35.132.143:9300},}
[2015-12-22 11:30:52,979][INFO ][cluster.service          ] [ec-dyl09026app02] removed {{ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-master_failed ({ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300})
[2015-12-22 11:30:52,995][WARN ][discovery.zen.ping.unicast] [ec-dyl09026app02] failed to send ping to [{#zen_unicast_2#}{10.187.147.59}{10.187.147.59:9300}]

I see a node suddenly drop from the cluster and then come back. Then the indices start initializing. Almost all of them initialize immediately, except today's topbeat index, which takes a very long time. And I can see there is always one shard where it gets stuck:

{
  "id": 1,
  "type": "STORE",
  "stage": "TRANSLOG",
  "primary": true,
  "start_time": "2015-12-22T11:32:36.629Z",
  "start_time_in_millis": 1450783956629,
  "total_time": "6.8m",
  "total_time_in_millis": 408657,
  "source": {
    "id": "RvmYqMzKTsavJAn3z271NQ",
    "host": "10.35.132.143",
    "transport_address": "10.35.132.143:9300",
    "ip": "10.35.132.143",
    "name": "ec-dyl09026app04"
  },
  "target": {
    "id": "RvmYqMzKTsavJAn3z271NQ",
    "host": "10.35.132.143",
    "transport_address": "10.35.132.143:9300",
    "ip": "10.35.132.143",
    "name": "ec-dyl09026app04"
  },
  "index": {
    "size": {
      "total": "130b",
      "total_in_bytes": 130,
      "reused": "130b",
      "reused_in_bytes": 130,
      "recovered": "0b",
      "recovered_in_bytes": 0,
      "percent": "100.0%"
    },
    "files": {
      "total": 1,
      "reused": 1,
      "recovered": 0,
      "percent": "100.0%"
    },
    "total_time": "1ms",
    "total_time_in_millis": 1,
    "source_throttle_time": "-1",
    "source_throttle_time_in_millis": 0,
    "target_throttle_time": "-1",
    "target_throttle_time_in_millis": 0
  },
  "translog": {
    "recovered": 25136,
    "total": -1,
    "percent": "-1.0%",
    "total_on_start": -1,
    "total_time": "6.8m",
    "total_time_in_millis": 408654
  },
  "verify_index": {
    "check_index_time": "0s",
    "check_index_time_in_millis": 0,
    "total_time": "0s",
    "total_time_in_millis": 0
  }
}
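When the cluster goes red like this, the quickest way I know to find the stuck shard is the cat APIs (assuming the nodes answer on localhost:9200):

# cluster colour and counts of initializing/unassigned shards
curl -s 'http://localhost:9200/_cat/health?v'
# every shard that is not STARTED, with the node it sits on
curl -s 'http://localhost:9200/_cat/shards?v' | grep -iv started
# only the recoveries that are still running, cluster-wide
curl -s 'http://localhost:9200/_recovery?active_only=true&human&pretty'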

Another clue: I can see that the shard that has trouble recovering got this exception, where it failed to delete a translog temp file:

[2015-12-22 11:31:47,605][DEBUG][action.admin.indices.get ] [ec-dyl09026app04] no known master node, scheduling a retry
[2015-12-22 11:31:49,725][INFO ][cluster.service          ] [ec-dyl09026app04] detected_master {ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300}, added {{ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300},}, reason: zen-disco-receive(from master [{ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300}])
[2015-12-22 11:32:33,637][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-dyl09026app02}{DsZuppk3Tl67GuJvISU00w}{10.35.76.37}{10.35.76.37:9300},}, reason: zen-disco-receive(from master [{ec-rdl04910app06}{JrIyKDE5SVmOYLrp1rGs8Q}{10.187.147.59}{10.187.147.59:9300}])
[2015-12-22 11:32:36,653][WARN ][index.translog           ] [ec-dyl09026app04] [topbeat-2015.12.22][1] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/topbeat-2015.12.22/1/translog/translog-6032227611099380263.tlog
java.nio.file.NoSuchFileException: /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/topbeat-2015.12.22/1/translog/translog-6032227611099380263.tlog

Even more log entries on that particular shard:

[2015-12-22 11:31:52,936][WARN ][action.bulk              ] [ec-dyl09026app02] [topbeat-2015.12.22][1] failed to perform indices:data/write/bulk[s] on node {ec-dyl09026app04}{RvmYqMzKTsavJAn3z271NQ}{10.35.132.143}{10.35.132.143:9300}
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][r]]]; nested: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected];

What version are you on?

ES 2.1 and LS 2.1 on all 3 nodes.

I have found some more clues. Every hour, the master node gets dropped. Immediately after that, I get the "failed to delete translog temp file" exception. For example:

[2015-12-23 05:03:41,890][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}, reason [transport disconnected]
[2015-12-23 05:03:41,892][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-
[2015-12-23 05:03:41,892][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06},}, reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 05:03:43,932][INFO ][cluster.service          ] [ec-dyl09026app04] detected_master {ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}{ added {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, 
[2015-12-23 05:04:12,441][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},}, reason: zen-disco-receive(from master [{ec-rdl04910app06}
[2015-12-23 05:04:17,234][WARN ][index.translog           ] [ec-dyl09026app04] [topbeat-2015.12.23][1] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/topbeat-2015.12.23/1/translog/translog-7465629763806507405.tlog

This is at 5:03. Then at 6:03, same story:

[2015-12-23 06:03:43,619][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [transport disconnected]
[2015-12-23 06:03:43,620][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q}{10.35.76.37}{10.35.76.37:9300},{ec-dyl09026app04}
[2015-12-23 06:03:43,621][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw} reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 06:03:59,567][INFO ][rest.suppressed          ] /_bulk Params: {}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]

Then it happens again, the same problem:

[2015-12-23 06:07:06,407][INFO ][discovery.zen            ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2015-12-23 06:07:06,408][WARN ][discovery.zen            ] [ec-dyl09026app04] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},{ec-dyl09026app04}{p2QuXlyDR6CNvykHSeD2wA}
[2015-12-23 06:07:06,409][INFO ][cluster.service          ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, reason: zen-disco-master_failed ({ec-rdl04910app06}
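To confirm the hourly pattern, grepping for master_left across the node logs lines the timestamps up. The log path here is just where I assume this install writes its logs (logs/<cluster name>.log under the ES home directory); adjust to yours:

grep 'master_left' /opt/softwares/elasticsearch-2.1.0/logs/ec-cluster.log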

The master node is in a different data center, so I assume some firewall is dropping the connection after 60 minutes of (in)activity. It looks like zen discovery keeps a connection open, that connection gets dropped, and then the cluster thinks the master node has left.
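If it really is an idle-connection timeout on the firewall, one thing I may try is lowering the OS TCP keepalive settings on all three nodes so the kernel probes idle transport connections well before the 60-minute mark (as far as I know Elasticsearch enables SO_KEEPALIVE on transport sockets by default). The values below are just a sketch:

# send the first keepalive probe after 10 minutes of idle time (Linux default is 2 hours)
sysctl -w net.ipv4.tcp_keepalive_time=600
# then probe every 60 seconds
sysctl -w net.ipv4.tcp_keepalive_intvl=60
# give up after 5 failed probes
sysctl -w net.ipv4.tcp_keepalive_probes=5

If that helps, the same values would go into /etc/sysctl.conf so they survive a reboot.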

On the master node, here are the logs from that time, for example at 5 AM:

[2015-12-23 05:03:43,903][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:323)
  Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:1096
[2015-12-23 05:03:44,779][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [logstash-vasfulf
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
[2015-12-23 05:04:12,404][INFO ][cluster.service          ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 05:04:47,260][INFO ][cluster.service          ] [ec-rdl04910app06] added {{ec-dyl090

Same at 6 AM, only with a different index having the problem:

[2015-12-23 06:04:53,543][WARN ][cluster.action.shard     ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app02][10.35.76.37:9300][indices:data/write/bulk[s][r]
Caused by: NodeNotConnectedException[[ec-dyl09026app02][10.35.76.37:9300] Node not connected]
[2015-12-23 06:07:04,900][INFO ][cluster.service          ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 06:07:04,987][DEBUG][action.admin.cluster.node.info] [ec-rdl04910app06] failed to ex
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][cluster:monitor/nodes/info[

I am going to keep replying in this other topic instead: Node disconnects every hour