Elasticsearch Warm Node Issue

Hi Folks,

It's been almost 3 months that I've been struggling with an issue on my warm nodes. The cluster architecture and the issue are described below:

Architecture: 3 master nodes, 5 hot data nodes, 2 warm nodes (spinning disk), 1 coordinating node

Hot nodes H/W: 32 cores, 64 GB memory, 3.2 TB SSD (bare metal)
Warm nodes H/W: 30 cores, 60 GB memory, 20 TB spinning disk

Hot Node Elasticsearch Configuration:

node.name: Cluster1
path.data: /elasticsearch/data
path.logs: /elasticsearch/logs
bootstrap.memory_lock: true
network.host: XXXX
node.data: true
node.master: false
discovery.zen.ping.unicast.hosts: ["XXXX", "XXXX", "XXXX"]
discovery.zen.minimum_master_nodes: 2
xpack.security.enabled: false
xpack.watcher.enabled: true
action.destructive_requires_name: true
thread_pool:
  index:
    size: 12
    queue_size: 2000
thread_pool.bulk.queue_size: 7000
bootstrap.system_call_filter: false
node.attr.box_type: hot

JVM: 30 GB heap with CMS

Warm Node Elasticsearch Configuration:

cluster.name: Wallet-ELK
node.name: XXXX
path.data: /elasticsearch/data
path.logs: /elasticsearch/logs
bootstrap.memory_lock: true
network.host: XXXX
node.data: true
node.master: false
discovery.zen.ping.unicast.hosts: XXXX
discovery.zen.minimum_master_nodes: 2
xpack.security.enabled: false
xpack.watcher.enabled: true
xpack.ml.enabled: false
action.destructive_requires_name: true
thread_pool:
  index:
    size: 12
    queue_size: 100
  search:
    size: 5
    queue_size: 50
    target_response_time: 1s
bootstrap.system_call_filter: false
node.attr.box_type: warm

JVM: 30 GB heap with CMS (tried Java 10 with G1 GC)

INDEX details:
SHARDS: 5
Replicas: 1 (0 on the warm nodes)
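
For reference, those shard and replica settings together with the hot routing would roughly correspond to an index template like the one below (the template name and index pattern are only placeholders, not taken from the real setup):

PUT _template/event-logs
{
  "index_patterns": ["event-logs-*"],
  "settings": {
    "index.number_of_shards": 5,
    "index.number_of_replicas": 1,
    "index.routing.allocation.require.box_type": "hot"
  }
}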

ELK VERSION: 6.3.2

ISSUE:
I am using a hot/warm architecture: every 10 days I move data to the warm nodes with Curator, setting replicas to 0 and force-merging the segments down to 1.
At random times my warm nodes stop responding and the whole cluster becomes unresponsive: I cannot run _cat/nodes or anything else, Kibana goes into a red state and shows an Elasticsearch connection reset error in its panels, yet the ES cluster stays green while being unresponsive. I keep only 20 days of data on the warm nodes (the rest gets deleted), and the data size stays close to 8 TB on each warm node. Below is the error I get whenever this problem occurs:
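
For context, the Curator steps described above boil down to roughly these per-index calls (the index name is only an example taken from the log excerpt below; the actual Curator action file and filters are not shown):

PUT event-logs-2019.03.19/_settings
{
  "index.routing.allocation.require.box_type": "warm",
  "index.number_of_replicas": 0
}

POST event-logs-2019.03.19/_forcemerge?max_num_segments=1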

collector [node_stats] timed out when collecting data
[2019-03-31T07:57:22,382][DEBUG][o.e.x.m.MonitoringService] [WA-ELK-WARM-ES9] monitoring execution is skipped until previous execution terminate
[2019-03-31T07:57:22,382][ERROR][o.e.x.m.c.n.NodeStatsCollector] [WA-ELK-WARM-ES9] collector [node_stats] timed out when collecting data
[2019-03-31T07:57:27,947][DEBUG][o.a.h.i.c.PoolingHttpClientConnectionManager] Closing expired connections
[2019-03-31T08:41:07,092][WARN ][o.e.t.n.Netty4Transport ] [WA-ELK-WARM-ES9] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.140.31.252:56134}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-03-31T08:41:07,092][WARN ][o.e.t.n.Netty4Transport ] [WA-ELK-WARM-ES9] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/XXXX:44492}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-03-31T08:41:07,579][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [WA-ELK-WARM-ES9] failed to execute on node [qaZZKgPwTImlpz7K7TbjwA]
org.elasticsearch.transport.TransportException: transport stopped, action: cluster:monitor/nodes/stats[n]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:267) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:725) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.2.jar:6.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
[2019-03-31T08:41:07,580][INFO ][o.e.i.s.GlobalCheckpointSyncAction] [WA-ELK-WARM-ES9] [event-logs-2019.03.19][1] global checkpoint sync failed
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:267) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:725) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.2.jar:6.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
[2019-03-31T08:41:07,580][INFO ][o.e.i.s.GlobalCheckpointSyncAction] [WA-ELK-WARM-ES9] [event-logs-2019.03.18][2] global checkpoint sync failed
org.elasticsearch.transport.TransportException: transport stopped, action: indices:admin/seq_no/global_checkpoint_sync[p]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:267) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:725) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.2.jar:6.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]

Do you have monitoring installed? What does heap usage look like on the warm nodes? What is your average shard size?

If you find that you are suffering from heap pressure, which certainly can happen at the data volumes you mentioned, this webinar might be useful as it discusses how to optimize storage and heap usage.
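
For reference, heap usage per node and per-shard sizes can be pulled with something along these lines (the column lists are just a suggestion):

GET _cat/nodes?v&h=name,node.role,heap.percent,heap.max,ram.percent
GET _cat/shards?v&h=index,shard,prirep,store,node&s=store:desc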

Attaching the affected node's monitoring stats; with that, the average shard size is close to 80 GB.

Is there anything about long or frequent GC in the logs?

Nothing significant found in the GC logs either.
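
For completeness, GC counters could also be cross-checked from the API side rather than only from the gc logs, for example:

GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc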
