Cluster becomes unresponsive after receiving data for some time using EC2 discovery

Hi,

I am using Elasticsearch 5.0.1 with EC2 discovery. I am running a two-node cluster; both nodes join the cluster without any problem, and I am able to receive data from Logstash for some time. But after a while the cluster becomes unresponsive, the master becomes unavailable, and it never recovers even if I stop sending data.

Below is my elasticsearch.yml config for both nodes:

cluster.name: test
network:
    host: _ec2:privateIpv4_
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 60s
discovery.zen.join_timeout: 60s

indices.fielddata.cache.size: 25%

#Default
script.painless.regex.enabled: true
thread_pool.bulk.queue_size: 500

discovery:
    type: ec2

discovery.ec2.host_type: private_ip

cloud:
    aws:
        access_key: <access key>
        secret_key: <secret>
        region: ap-south-1
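
For reference, this is roughly how I verify from inside the VPC that both nodes have joined and which one is currently the master. It is just a quick sketch in Python; the private IP and the HTTP port 9200 are from my setup, so adjust as needed:

# Quick check of cluster membership and the elected master from inside the VPC.
# Assumes the HTTP port (9200) is reachable on the node's private IP.
import json
import urllib.request

ES_HOST = "http://172.31.30.60:9200"  # private IP of one of my nodes

def get(path):
    with urllib.request.urlopen(ES_HOST + path, timeout=10) as resp:
        return resp.read().decode("utf-8")

# Cluster health: number of nodes and overall status
print(json.dumps(json.loads(get("/_cluster/health")), indent=2))

# Which nodes are in the cluster and which one is elected master
print(get("/_cat/nodes?v"))
print(get("/_cat/master?v"))

Right after startup both nodes show up there, which matches what I described above; the problem only starts later.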

Below are the logs and the stack trace I am getting:

[2018-02-21T06:04:22,691][WARN ][o.e.d.z.ZenDiscovery     ] [0U577Jt] not enough master nodes, current nodes: {{0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{o8k1rmg_SIm0bi9mImlJ-Q}{172.31.30.60}{172.31.30.60:9300},}
[2018-02-21T06:04:22,692][INFO ][o.e.c.s.ClusterService   ] [0U577Jt] removed {{en5NzDU}{en5NzDUIR-KCAnfGNro4vw}{bGyxwN3nT463ixk1w6KD2A}{172.31.16.6}{172.31.16.6:9300},}, reason: zen-disco-node-failed({en5NzDU}{en5NzDUIR-KCAnfGNro4vw}{bGyxwN3nT463ixk1w6KD2A}{172.31.16.6}{172.31.16.6:9300}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{en5NzDU}{en5NzDUIR-KCAnfGNro4vw}{bGyxwN3nT463ixk1w6KD2A}{172.31.16.6}{172.31.16.6:9300} failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-02-21T06:04:57,935][WARN ][r.suppressed             ] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:161) ~[elasticsearch-5.0.1.jar:5.0.1]
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:147) ~[elasticsearch-5.0.1.jar:5.0.1]
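
The bulk traffic comes from Logstash, but the same block should be reproducible with a bare bulk request by hand. A rough sketch of that check (the index name test-index and the document are made up for the test; the IP is one of my nodes):

# Minimal bulk request, similar in shape to what Logstash sends.
# Once the master is lost, this should return the same ClusterBlockException
# (SERVICE_UNAVAILABLE/2/no master) that shows up in the log above.
import urllib.error
import urllib.request

ES_HOST = "http://172.31.30.60:9200"  # private IP of one node

# Hypothetical index/type names, only for the test
body = (
    '{"index":{"_index":"test-index","_type":"doc"}}\n'
    '{"message":"hello from a manual bulk request"}\n'
)
req = urllib.request.Request(
    ES_HOST + "/_bulk",
    data=body.encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.read().decode("utf-8"))
except urllib.error.HTTPError as err:
    # Expect a 503 with "blocked by: [SERVICE_UNAVAILABLE/2/no master]" once the master is gone
    print(err.code, err.read().decode("utf-8"))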

I also ran a three-node cluster. The other nodes are responding well, but one of the nodes is unable to join the cluster. Logs from that node:

[2018-02-21T07:25:50,591][INFO ][o.e.c.s.ClusterService   ] [en5NzDU] detected_master {0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300}, added {{0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300},}, reason: zen-disco-receive(from master [master {0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300} committed version [1]])
[2018-02-21T07:25:59,804][INFO ][o.e.c.s.ClusterService   ] [en5NzDU] added {{eNOS2YK}{eNOS2YKpT7-fc7sFMUpadw}{IIVZfJ8_QCGQvRcHnmIsBA}{172.31.30.241}{172.31.30.241:9300},}, reason: zen-disco-receive(from master [master {0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300} committed version [26]])
[2018-02-21T07:30:41,264][INFO ][o.e.c.s.ClusterService   ] [en5NzDU] removed {{eNOS2YK}{eNOS2YKpT7-fc7sFMUpadw}{IIVZfJ8_QCGQvRcHnmIsBA}{172.31.30.241}{172.31.30.241:9300},}, reason: zen-disco-receive(from master [master {0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300} committed version [36]])
[2018-02-21T07:38:41,886][INFO ][o.e.c.s.ClusterService   ] [en5NzDU] added {{eNOS2YK}{eNOS2YKpT7-fc7sFMUpadw}{AWmU3TIkRQ-n6t7vlJMUxw}{172.31.30.241}{172.31.30.241:9300},}, reason: zen-disco-receive(from master [master {0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300} committed version [48]])
[2018-02-21T07:48:10,129][INFO ][o.e.d.z.ZenDiscovery     ] [en5NzDU] master_left [{0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-02-21T07:48:10,130][WARN ][o.e.d.z.ZenDiscovery     ] [en5NzDU] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{en5NzDU}{en5NzDUIR-KCAnfGNro4vw}{wz-MMrf2SDuJeGGSnjPzbQ}{172.31.16.6}{172.31.16.6:9300},{eNOS2YK}{eNOS2YKpT7-fc7sFMUpadw}{AWmU3TIkRQ-n6t7vlJMUxw}{172.31.30.241}{172.31.30.241:9300},}
[2018-02-21T07:48:10,130][INFO ][o.e.c.s.ClusterService   ] [en5NzDU] removed {{0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300},}, reason: master_failed ({0U577Jt}{0U577JtbTDuGl4d3rw6Blg}{Tp-1hFUjRp6z964Z63DTew}{172.31.30.60}{172.31.30.60:9300})
[2018-02-21T07:49:25,360][WARN ][o.e.d.z.p.u.UnicastZenPing] [en5NzDU] failed to send ping to [{#cloud-i-0d0e8454f516edd58-0}{Cl9dtsPdQqSbUJItjZMQPQ}{172.31.30.241}{172.31.30.241:9300}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [eNOS2YK][172.31.30.241:9300][internal:discovery/zen/unicast] request_id [1908] timed out after [75001ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:840) [elasticsearch-5.0.1.jar:5.0.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:451) [elasticsearch-5.0.1.jar:5.0.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-02-21T07:49:55,497][WARN ][o.e.d.z.p.u.UnicastZenPing] [en5NzDU] failed to send ping to [{#cloud-i-0d0e8454f516edd58-0}{S1gpMFOuTIafxHdxXIFMkQ}{172.31.30.241}{172.31.30.241:9300}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [eNOS2YK][172.31.30.241:9300][internal:discovery/zen/unicast] request_id [1912] timed out after [75000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:840) [elasticsearch-5.0.1.jar:5.0.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:451) [elasticsearch-5.0.1.jar:5.0.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-02-21T07:50:25,619][WARN ][o.e.d.z.p.u.UnicastZenPing] [en5NzDU] failed to send ping to [{#cloud-i-0d0e8454f516edd58-0}{DKRtX3ZITY2onGn01BLVlQ}{172.31.30.241}{172.31.30.241:9300}]
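
Since every failure above is a ping timeout on the transport port, a quick way to rule out a plain network or security-group problem is a raw TCP check of port 9300 between the instances. A small sketch of that check, using the private IPs from the logs above:

# Raw TCP check of the transport port (9300) between the EC2 instances,
# to rule out security-group / network issues behind the ping timeouts.
import socket

NODES = ["172.31.30.60", "172.31.16.6", "172.31.30.241"]  # private IPs from the logs

for ip in NODES:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((ip, 9300))
        print(ip, "port 9300 reachable")
    except OSError as err:
        print(ip, "port 9300 NOT reachable:", err)
    finally:
        s.close()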
