Hi folks,
I have been struggling for almost 3 months with an issue on my warm nodes. The cluster architecture and the issue are described below:
Architecture: 3 master nodes, 5 hot data nodes, 2 warm nodes (spinning disk), 1 coordinating node
Hot node H/W: 32 cores, 64 GB memory, 3.2 TB SSD (bare metal)
Warm node H/W: 30 cores, 60 GB memory, 20 TB spinning disk
Hot node Elasticsearch configuration:
node.name: Cluster1
path.data: /elasticsearch/data
path.logs: /elasticsearch/logs
bootstrap.memory_lock: true
network.host: XXXX
node.data: true
node.master: false
discovery.zen.ping.unicast.hosts: ["XXXX", "XXXX", "XXXX"]
discovery.zen.minimum_master_nodes: 2
xpack.security.enabled: false
xpack.watcher.enabled: true
action.destructive_requires_name: true
thread_pool:
  index:
    size: 12
    queue_size: 2000
thread_pool.bulk.queue_size: 7000
bootstrap.system_call_filter: false
node.attr.box_type: hot
JVM: 30 GB heap with CMS
Warm node configuration:
cluster.name: Wallet-ELK
node.name: XXXX
path.data: /elasticsearch/data
path.logs: /elasticsearch/logs
bootstrap.memory_lock: true
network.host: XXXX
node.data: true
node.master: false
discovery.zen.ping.unicast.hosts: XXXX
discovery.zen.minimum_master_nodes: 2
xpack.security.enabled: false
xpack.watcher.enabled: true
xpack.ml.enabled: false
action.destructive_requires_name: true
thread_pool:
  index:
    size: 12
    queue_size: 100
  search:
    size: 5
    queue_size: 50
    target_response_time: 1s
bootstrap.system_call_filter: false
node.attr.box_type: warm
JVM: 30 GB heap with CMS (also tried Java 10 with G1 GC)
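For reference, trying G1 on the warm nodes means swapping the default CMS flags in jvm.options for G1 ones, roughly like this (illustrative fragment, not my exact file):

```
# jvm.options fragment -- illustrative G1 settings for a 30 GB heap.
-Xms30g
-Xmx30g
# Default CMS flags, commented out:
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
```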
Index details:
Shards: 5
Replicas: 1 (0 on warm nodes)
ELK version: 6.3.2
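For completeness, new indices are pinned to the hot tier via the box_type node attribute; a minimal index template sketch along these lines (template name and index pattern here are illustrative, not my actual ones):

```shell
# Illustrative template: route new indices to nodes tagged box_type=hot.
curl -XPUT 'http://localhost:9200/_template/hot-logs' \
  -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["event-logs-*"],
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "index.routing.allocation.require.box_type": "hot"
  }
}'
```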
Issue:
I am using a hot/warm architecture: every 10 days I move data to the warm nodes via Curator, setting replicas to 0 and force-merging segments down to 1 per shard.
At random intervals my warm nodes stop responding and the cluster becomes unresponsive: I cannot even run _cat/nodes, Kibana goes into a red state with "Elasticsearch connection reset" errors in the panel, yet cluster health stays green. I keep only 20 days of data on the warm nodes (the rest is deleted), and the data size stays close to 8 TB on each warm node. Below are the errors I get whenever this problem occurs:
collector [node_stats] timed out when collecting data
[2019-03-31T07:57:22,382][DEBUG][o.e.x.m.MonitoringService] [WA-ELK-WARM-ES9] monitoring execution is skipped until previous execution terminate
[2019-03-31T07:57:22,382][ERROR][o.e.x.m.c.n.NodeStatsCollector] [WA-ELK-WARM-ES9] collector [node_stats] timed out when collecting data
[2019-03-31T07:57:27,947][DEBUG][o.a.h.i.c.PoolingHttpClientConnectionManager] Closing expired connections
[2019-03-31T08:41:07,092][WARN ][o.e.t.n.Netty4Transport ] [WA-ELK-WARM-ES9] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.140.31.252:56134}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-03-31T08:41:07,092][WARN ][o.e.t.n.Netty4Transport ] [WA-ELK-WARM-ES9] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/XXXX:44492}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-03-31T08:41:07,579][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [WA-ELK-WARM-ES9] failed to execute on node [qaZZKgPwTImlpz7K7TbjwA]
org.elasticsearch.transport.TransportException: transport stopped, action: cluster:monitor/nodes/stats[n]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:267) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:725) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.2.jar:6.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
[2019-03-31T08:41:07,580][INFO ][o.e.i.s.GlobalCheckpointSyncAction] [WA-ELK-WARM-ES9] [event-logs-2019.03.19][1] global checkpoint sync failed
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:267) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:725) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.2.jar:6.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
[2019-03-31T08:41:07,580][INFO ][o.e.i.s.GlobalCheckpointSyncAction] [WA-ELK-WARM-ES9] [event-logs-2019.03.18][2] global checkpoint sync failed
org.elasticsearch.transport.TransportException: transport stopped, action: indices:admin/seq_no/global_checkpoint_sync[p]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:267) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:725) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.2.jar:6.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
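For reference, the Curator migration described above is equivalent to these manual API calls (the index name is illustrative; my actual run is driven by a Curator action file):

```shell
IDX="event-logs-2019.03.19"   # illustrative index name

# 1. Drop the replica first, so only one copy relocates to the warm tier.
curl -XPUT "http://localhost:9200/$IDX/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.number_of_replicas": 0}'

# 2. Re-pin the index to the warm nodes via the box_type attribute.
curl -XPUT "http://localhost:9200/$IDX/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.routing.allocation.require.box_type": "warm"}'

# 3. Once relocation finishes, force-merge down to 1 segment per shard.
curl -XPOST "http://localhost:9200/$IDX/_forcemerge?max_num_segments=1"
```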