Cluster becomes unresponsive after receiving data for some time using EC2 discovery

Hi,

I am using Elasticsearch 5.0.1 with EC2 discovery. I am running a two-node cluster; both nodes join the cluster without any problem, and I am able to receive data from Logstash for some time. But after a while the cluster becomes unresponsive, the master becomes unavailable, and it never recovers even if I stop sending data.

Below is my elasticsearch.yml config for both nodes:

cluster.name: test
network:
    host: _ec2:privateIpv4_
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 60s
discovery.zen.join_timeout: 60s

indices.fielddata.cache.size: 25%

#Default
script.painless.regex.enabled: true
thread_pool.bulk.queue_size: 500

discovery:
    type: ec2

discovery.ec2.host_type: private_ip

cloud:
    aws:
        access_key: <access key>
        secret_key: <secret>
        region: ap-south-1
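
For reference, this is roughly how I verify from inside the VPC that both nodes have joined and which one is currently the master. It is just a quick sketch in Python; the private IP and the HTTP port 9200 are from my setup, so adjust as needed:

# Quick check of cluster membership and the elected master from inside the VPC.
# Assumes the HTTP port (9200) is reachable on the node's private IP.
import json
import urllib.request

ES_HOST = "http://172.31.30.60:9200"  # private IP of one of my nodes

def get(path):
    with urllib.request.urlopen(ES_HOST + path, timeout=10) as resp:
        return resp.read().decode("utf-8")

# Cluster health: number of nodes and overall status
print(json.dumps(json.loads(get("/_cluster/health")), indent=2))

# Which nodes are in the cluster and which one is elected master
print(get("/_cat/nodes?v"))
print(get("/_cat/master?v"))

Right after startup both nodes show up there, which matches what I described above; the problem only starts later.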

Below are the logs and the stack trace I am getting:

[2018-02-21T06:04:22,691][WARN ][o.e.d.z.ZenDiscovery     ] [0U577Jt] not enough master nodes, current nodes: {{0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{o8k1rmg_SIm0bi9mImlJ-Q}{172.31.30.60}{172.31.30.60:9300},}
[2018-02-21T06:04:22,692][INFO ][o.e.c.s.ClusterService   ] [0U577Jt] removed {{en5NzDU}{en5NzDUIR-KCAnfGNro4vw}{bGyxwN3nT463ixk1w6KD2A}{172.31.16.6}{172.31.16.6:9300},}, reason: zen-disco-node-failed({en5NzDU}{en5NzDUIR-KCAnfGNro4vw}{bGyxwN3nT463ixk1w6KD2A}{172.31.16.6}{172.31.16.6:9300}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{en5NzDU}{en5NzDUIR-KCAnfGNro4vw}{bGyxwN3nT463ixk1w6KD2A}{172.31.16.6}{172.31.16.6:9300} failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-02-21T06:04:57,935][WARN ][r.suppressed             ] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:161) ~[elasticsearch-5.0.1.jar:5.0.1]
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:147) ~[elasticsearch-5.0.1.jar:5.0.1]
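
The bulk traffic comes from Logstash, but the same block should be reproducible with a bare bulk request by hand. A rough sketch of that check (the index name test-index and the document are made up for the test; the IP is one of my nodes):

# Minimal bulk request, similar in shape to what Logstash sends.
# Once the master is lost, this should return the same ClusterBlockException
# (SERVICE_UNAVAILABLE/2/no master) that shows up in the log above.
import urllib.error
import urllib.request

ES_HOST = "http://172.31.30.60:9200"  # private IP of one node

# Hypothetical index/type names, only for the test
body = (
    '{"index":{"_index":"test-index","_type":"doc"}}\n'
    '{"message":"hello from a manual bulk request"}\n'
)
req = urllib.request.Request(
    ES_HOST + "/_bulk",
    data=body.encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.read().decode("utf-8"))
except urllib.error.HTTPError as err:
    # Expect a 503 with "blocked by: [SERVICE_UNAVAILABLE/2/no master]" once the master is gone
    print(err.code, err.read().decode("utf-8"))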

I also ran a three-node cluster. The other nodes are responding well, but one of the nodes is unable to join the cluster. Logs from that node:

[2018-02-21T07:25:50,591][INFO ][o.e.c.s.ClusterService   ] [en5NzDU] detected_master {0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300}, added {{0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300},}, reason: zen-disco-receive(from master [master {0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300} committed version [1]])
[2018-02-21T07:25:59,804][INFO ][o.e.c.s.ClusterService   ] [en5NzDU] added {{eNOS2YK}{eNOS2YKpT7-fc7sFMUpadw}{IIVZfJ8_QCGQvRcHnmIsBA}{172.31.30.241}{172.31.30.241:9300},}, reason: zen-disco-receive(from master [master {0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300} committed version [26]])
[2018-02-21T07:30:41,264][INFO ][o.e.c.s.ClusterService   ] [en5NzDU] removed {{eNOS2YK}{eNOS2YKpT7-fc7sFMUpadw}{IIVZfJ8_QCGQvRcHnmIsBA}{172.31.30.241}{172.31.30.241:9300},}, reason: zen-disco-receive(from master [master {0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300} committed version [36]])
[2018-02-21T07:38:41,886][INFO ][o.e.c.s.ClusterService   ] [en5NzDU] added {{eNOS2YK}{eNOS2YKpT7-fc7sFMUpadw}{AWmU3TIkRQ-n6t7vlJMUxw}{172.31.30.241}{172.31.30.241:9300},}, reason: zen-disco-receive(from master [master {0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300} committed version [48]])
[2018-02-21T07:48:10,129][INFO ][o.e.d.z.ZenDiscovery     ] [en5NzDU] master_left [{0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-02-21T07:48:10,130][WARN ][o.e.d.z.ZenDiscovery     ] [en5NzDU] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{en5NzDU}{en5NzDUIR-KCAnfGNro4vw}{wz-MMrf2SDuJeGGSnjPzbQ}{172.31.16.6}{172.31.16.6:9300},{eNOS2YK}{eNOS2YKpT7-fc7sFMUpadw}{AWmU3TIkRQ-n6t7vlJMUxw}{172.31.30.241}{172.31.30.241:9300},}
[2018-02-21T07:48:10,130][INFO ][o.e.c.s.ClusterService   ] [en5NzDU] removed {{0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300},}, reason: master_failed ({0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300})
[2018-02-21T07:49:25,360][WARN ][o.e.d.z.p.u.UnicastZenPing] [en5NzDU] failed to send ping to [{#cloud-i-0d0e8454f516edd58-0}{Cl9dtsPdQqSbUJItjZMQPQ}{172.31.30.241}{172.31.30.241:9300}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [eNOS2YK][172.31.30.241:9300][internal:discovery/zen/unicast] request_id [1908] timed out after [75001ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:840) [elasticsearch-5.0.1.jar:5.0.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:451) [elasticsearch-5.0.1.jar:5.0.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-02-21T07:49:55,497][WARN ][o.e.d.z.p.u.UnicastZenPing] [en5NzDU] failed to send ping to [{#cloud-i-0d0e8454f516edd58-0}{S1gpMFOuTIafxHdxXIFMkQ}{172.31.30.241}{172.31.30.241:9300}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [eNOS2YK][172.31.30.241:9300][internal:discovery/zen/unicast] request_id [1912] timed out after [75000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:840) [elasticsearch-5.0.1.jar:5.0.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:451) [elasticsearch-5.0.1.jar:5.0.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-02-21T07:50:25,619][WARN ][o.e.d.z.p.u.UnicastZenPing] [en5NzDU] failed to send ping to [{#cloud-i-0d0e8454f516edd58-0}{DKRtX3ZITY2onGn01BLVlQ}{172.31.30.241}{172.31.30.241:9300}]
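
Since every failure above is a ping timeout on the transport port, a quick way to rule out a plain network or security-group problem is a raw TCP check of port 9300 between the instances. A small sketch of that check, using the private IPs from the logs above:

# Raw TCP check of the transport port (9300) between the EC2 instances,
# to rule out security-group / network issues behind the ping timeouts.
import socket

NODES = ["172.31.30.60", "172.31.16.6", "172.31.30.241"]  # private IPs from the logs

for ip in NODES:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((ip, 9300))
        print(ip, "port 9300 reachable")
    except OSError as err:
        print(ip, "port 9300 NOT reachable:", err)
    finally:
        s.close()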
