I have found another clue. Every hour, the master node gets dropped. Immediately after that, I get the "translog cannot be deleted" exception. For example:
[2015-12-23 05:03:41,890][INFO ][discovery.zen ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}, reason [transport disconnected]
[2015-12-23 05:03:41,892][WARN ][discovery.zen ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-
[2015-12-23 05:03:41,892][INFO ][cluster.service ] [ec-dyl09026app04] removed {{ec-rdl04910app06},}, reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 05:03:43,932][INFO ][cluster.service ] [ec-dyl09026app04] detected_master {ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}{ added {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},},
[2015-12-23 05:04:12,441][INFO ][cluster.service ] [ec-dyl09026app04] removed {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},}, reason: zen-disco-receive(from master [{ec-rdl04910app06}
[2015-12-23 05:04:17,234][WARN ][index.translog ] [ec-dyl09026app04] [topbeat-2015.12.23][1] failed to delete temp file /opt/softwares/elasticsearch-2.1.0/data/ec-cluster/nodes/0/indices/topbeat-2015.12.23/1/translog/translog-7465629763806507405.tlog
This is at 5:03. Then at 6:03, same story:
[2015-12-23 06:03:43,619][INFO ][discovery.zen ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [transport disconnected]
[2015-12-23 06:03:43,620][WARN ][discovery.zen ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q}{10.35.76.37}{10.35.76.37:9300},{ec-dyl09026app04}
[2015-12-23 06:03:43,621][INFO ][cluster.service ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw} reason: zen-disco-master_failed ({ec-rdl04910app06}
[2015-12-23 06:03:59,567][INFO ][rest.suppressed ] /_bulk Params: {}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]
Then at 6:07, same problem:
[2015-12-23 06:07:06,407][INFO ][discovery.zen ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2015-12-23 06:07:06,408][WARN ][discovery.zen ] [ec-dyl09026app04] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{ec-dyl09026app02}{jgY3MDv_TI6I6gbccdPi1Q},{ec-dyl09026app04}{p2QuXlyDR6CNvykHSeD2wA}
[2015-12-23 06:07:06,409][INFO ][cluster.service ] [ec-dyl09026app04] removed {{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw},}, reason: zen-disco-master_failed ({ec-rdl04910app06}
The master node is in a different data center, so I assume some firewall is dropping the connection after 60 minutes, whether or not it is idle. It looks like zen discovery keeps a connection open, and that connection gets dropped. The cluster then thinks the master node has left.
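A quick way to check the interval between drops is to pull the master_left timestamps out of the node log and print the gaps. This is only a rough sketch: the helper names master_left_times and gaps_in_minutes are made up for illustration, and the sample lines are copied from the excerpts above.

```python
import re
from datetime import datetime

# Capture the timestamp (to the second) of any line that reports master_left.
MASTER_LEFT = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\].*\bmaster_left\b"
)

def master_left_times(lines):
    """Return the timestamps of all master_left events, in log order."""
    times = []
    for line in lines:
        m = MASTER_LEFT.match(line)
        if m:
            times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return times

def gaps_in_minutes(times):
    """Return the gap between consecutive events, in minutes (1 decimal)."""
    return [round((b - a).total_seconds() / 60, 1) for a, b in zip(times, times[1:])]

# Sample lines taken from the node log above (truncated as pasted).
sample = [
    "[2015-12-23 05:03:41,890][INFO ][discovery.zen ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}, reason [transport disconnected]",
    "[2015-12-23 05:03:41,892][WARN ][discovery.zen ] [ec-dyl09026app04] master left (reason = transport disconnected), current nodes: {{ec-",
    "[2015-12-23 06:03:43,619][INFO ][discovery.zen ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [transport disconnected]",
    "[2015-12-23 06:07:06,407][INFO ][discovery.zen ] [ec-dyl09026app04] master_left [{ec-rdl04910app06}{h_2EPL-dSe6tnWllbIXNhw}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]",
]

print(gaps_in_minutes(master_left_times(sample)))  # [60.0, 3.4]
```

Running the same thing over a full day of logs should show whether the roughly 60 minute gap holds for every drop or only for the ones with reason "transport disconnected".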
On the master node, here are the logs from the same times, for example at 5am:
[2015-12-23 05:03:43,903][WARN ][cluster.action.shard ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:323)
Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:1096
[2015-12-23 05:03:44,779][WARN ][cluster.action.shard ] [ec-rdl04910app06] [logstash-vasfulf
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][indices:data/write/bulk[s][
Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]
[2015-12-23 05:04:12,404][INFO ][cluster.service ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 05:04:47,260][INFO ][cluster.service ] [ec-rdl04910app06] added {{ec-dyl090
Same at 6am, only with another index having the problem:
[2015-12-23 06:04:53,543][WARN ][cluster.action.shard ] [ec-rdl04910app06] [topbeat-2015.12.
SendRequestTransportException[[ec-dyl09026app02][10.35.76.37:9300][indices:data/write/bulk[s][r]
Caused by: NodeNotConnectedException[[ec-dyl09026app02][10.35.76.37:9300] Node not connected]
[2015-12-23 06:07:04,900][INFO ][cluster.service ] [ec-rdl04910app06] removed {{ec-dyl0
[2015-12-23 06:07:04,987][DEBUG][action.admin.cluster.node.info] [ec-rdl04910app06] failed to ex
SendRequestTransportException[[ec-dyl09026app04][10.35.132.143:9300][cluster:monitor/nodes/info[
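To see which nodes the master lost its connection to, the NodeNotConnectedException lines can be tallied per node. Again just a sketch: count_disconnected is a hypothetical helper, and the sample lines are the ones from the master log above.

```python
import re
from collections import Counter

# Capture the node name and address from a NodeNotConnectedException line.
NOT_CONNECTED = re.compile(r"NodeNotConnectedException\[\[([^\]]+)\]\[([^\]]+)\]")

def count_disconnected(lines):
    """Count NodeNotConnectedException occurrences per (node, address)."""
    counts = Counter()
    for line in lines:
        m = NOT_CONNECTED.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

# Sample lines taken from the master log above.
master_sample = [
    "Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]",
    "Caused by: NodeNotConnectedException[[ec-dyl09026app04][10.35.132.143:9300] Node not connected]",
    "Caused by: NodeNotConnectedException[[ec-dyl09026app02][10.35.76.37:9300] Node not connected]",
]

for (node, address), n in count_disconnected(master_sample).most_common():
    print(node, address, n)
```

If every node in the other data center shows up here around the same minute of each hour, that would fit the firewall theory better than a problem with a single node.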