ELK restart fails with exception "Status changed from red to red"

Hi Team,

We are running ELK 2.4.1, deployed in HA mode with 3 instances. Elasticsearch restarts every two hours, and we observe the exception below.
Elasticsearch API response:
{
  "name" : "node-10.10.0.12",
  "cluster_name" : "ad5be3b4-5f80-5589-b0b2-50fd38592089",
  "cluster_uuid" : "_sr-yB7ISF2vE2_DpjhUuA",
  "version" : {
    "number" : "2.4.1",
    "build_hash" : "c67dc32e24162035d18d6fe1e952c4cbcbe79d16",
    "build_timestamp" : "2016-09-27T18:57:55Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.2"
  },
  "tagline" : "You Know, for Search"
}

Logs:
May 8 06:52:09 cbnd18-414-1-server kibana: {"type":"log","@timestamp":"2018-05-08T06:52:09Z","tags":["status","plugin:elasticsearch@1.0.0","error"],"pid":618,"state":"red","message":"Status changed from red to red - [master_not_discovered_exception] null","prevState":"red","prevMsg":"Service Unavailable"}
failed ({node-10.10.0.12}{6G60lvenQYKPucRh380elA}{10.10.0.12}{10.10.0.12:9300})
May 8 06:52:20 cbnd18-414-1-server kibana: {"type":"response","@timestamp":"2018-05-08T06:52:20Z","tags":[],"pid":618,"method":"get","statusCode":200,"req":{"url":"/","method":"get","headers":{},"remoteAddress":"10.10.0.13","userAgent":"10.10.0.13"},"res":{"statusCode":200,"responseTime":2,"contentLength":9},"message":"GET / 200 2ms - 9.0B"}
May 8 06:52:21 cbnd18-414-1-server kibana: {"type":"response","@timestamp":"2018-05-08T06:52:21Z","tags":[],"pid":618,"method":"get","statusCode":200,"req":{"url":"/","method":"get","headers":{},"remoteAddress":"10.10.0.12","userAgent":"10.10.0.12"},"res":{"statusCode":200,"responseTime":2,"contentLength":9},"message":"GET / 200 2ms - 9.0B"}
May 8 06:52:22 cbnd18-414-1-server elasticsearch: [2018-05-08 06:52:22,062][WARN ][rest.suppressed ] path: /_bulk, params: {}
May 8 06:52:22 cbnd18-414-1-server elasticsearch: ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]
May 8 06:52:22 cbnd18-414-1-server elasticsearch: at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:158)

On further investigation, I found a GitHub issue with a similar exception: https://github.com/elastic/elasticsearch/issues/11202
The solution below works for us after making the TCP changes. Is this the right solution?

ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]","status":503}
Solution:
Set tcp_keepalive_time to a suitable value (the default is 7200 seconds). For example, change the value to
tcp_keepalive_time=300

Our concern is that a system-wide TCP change will also affect other TCP connections from various components on the host. The settings in question:
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
net.ipv4.tcp_keepalive_time = 600
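
For reference, here is a rough sketch of what these three values mean together. It assumes the usual Linux behaviour: the first probe is sent after tcp_keepalive_time seconds of idleness, further probes follow every tcp_keepalive_intvl seconds, and the connection is dropped after tcp_keepalive_probes unanswered probes. The function name is made up for illustration, and the "default" intvl and probes values (75 and 9) are the stock Linux defaults, not something taken from this cluster.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Approximate seconds until the kernel gives up on an idle, dead peer.

    time_s  -> net.ipv4.tcp_keepalive_time   (idle time before the first probe)
    intvl_s -> net.ipv4.tcp_keepalive_intvl  (gap between unanswered probes)
    probes  -> net.ipv4.tcp_keepalive_probes (unanswered probes before giving up)
    """
    return time_s + intvl_s * probes


# Values from the post above.
proposed = keepalive_detection_seconds(600, 60, 20)

# Stock Linux defaults (assumed here, not read from the affected hosts).
linux_default = keepalive_detection_seconds(7200, 75, 9)

print(f"proposed settings: first probe after 600 s, dead peer detected after ~{proposed} s")
print(f"Linux defaults:    first probe after 7200 s, dead peer detected after ~{linux_default} s")
```

With the default of 7200 seconds, an idle connection sees no keepalive traffic for two hours, which is the same interval as the restarts described above. That may be a coincidence, but it is consistent with something between the nodes dropping idle connections before the first probe is sent.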

Could you please confirm whether this is the right solution, or whether Elasticsearch provides another one?
Please also let me know whether this is resolved by default in newer Elasticsearch versions.

Thanks in advance.
