Hi folks,
I have been struggling for almost 3 months with an issue on my warm nodes. The cluster architecture and the issue are described below:
Architecture: 3 master nodes, 5 hot data nodes, 2 warm nodes (spinning disk), 1 coordinating node
Hot node H/W: 32 cores, 64 GB memory, 3.2 TB SSD (bare metal)
Warm node H/W: 30 cores, 60 GB memory, 20 TB spinning disk
Hot node Elasticsearch configuration:
node.name: Cluster1
path.data: /elasticsearch/data
path.logs: /elasticsearch/logs
bootstrap.memory_lock: true
network.host: XXXX
node.data: true
node.master: false
discovery.zen.ping.unicast.hosts: ["XXXX", "XXXX", "XXXX"]
discovery.zen.minimum_master_nodes: 2
xpack.security.enabled: false
xpack.watcher.enabled: true
action.destructive_requires_name: true
thread_pool:
  index:
    size: 12
    queue_size: 2000
thread_pool.bulk.queue_size: 7000
bootstrap.system_call_filter: false
node.attr.box_type: hot
JVM: 30 GB heap with CMS
Warm node configuration:
cluster.name: Wallet-ELK
node.name: XXXX
path.data: /elasticsearch/data
path.logs: /elasticsearch/logs
bootstrap.memory_lock: true
network.host: XXXX
node.data: true
node.master: false
discovery.zen.ping.unicast.hosts: XXXX
discovery.zen.minimum_master_nodes: 2
xpack.security.enabled: false
xpack.watcher.enabled: true
xpack.ml.enabled: false
action.destructive_requires_name: true
thread_pool:
  index:
    size: 12
    queue_size: 100
  search:
    size: 5
    queue_size: 50
    target_response_time: 1s
bootstrap.system_call_filter: false
node.attr.box_type: warm
JVM: 30 GB heap with CMS (also tried Java 10 with G1 GC)
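For reference, trying G1 on the warm nodes means swapping the default CMS flags in jvm.options for G1 ones, roughly like this (illustrative fragment, not my exact file):

```
# jvm.options fragment -- illustrative G1 settings for a 30 GB heap.
-Xms30g
-Xmx30g
# Default CMS flags, commented out:
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
```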
Index details:
Shards: 5
Replicas: 1 (0 on warm nodes)
ELK version: 6.3.2
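For completeness, new indices are pinned to the hot tier via the box_type node attribute; a minimal index template sketch along these lines (template name and index pattern here are illustrative, not my actual ones):

```shell
# Illustrative template: route new indices to nodes tagged box_type=hot.
curl -XPUT 'http://localhost:9200/_template/hot-logs' \
  -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["event-logs-*"],
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "index.routing.allocation.require.box_type": "hot"
  }
}'
```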
Issue:
I am using a hot/warm architecture: every 10 days I move data to the warm nodes via Curator, setting replicas to 0 and force-merging segments down to 1 per shard.
At random intervals my warm nodes stop responding and the cluster becomes unresponsive: I cannot even run _cat/nodes, Kibana goes into a red state with "Elasticsearch connection reset" errors in the panel, yet cluster health stays green. I keep only 20 days of data on the warm nodes (the rest is deleted), and the data size stays close to 8 TB on each warm node. Below are the errors I get whenever this problem occurs:
collector [node_stats] timed out when collecting data
[2019-03-31T07:57:22,382][DEBUG][o.e.x.m.MonitoringService] [WA-ELK-WARM-ES9] monitoring execution is skipped until previous execution terminate
[2019-03-31T07:57:22,382][ERROR][o.e.x.m.c.n.NodeStatsCollector] [WA-ELK-WARM-ES9] collector [node_stats] timed out when collecting data
[2019-03-31T07:57:27,947][DEBUG][o.a.h.i.c.PoolingHttpClientConnectionManager] Closing expired connections
[2019-03-31T08:41:07,092][WARN ][o.e.t.n.Netty4Transport ] [WA-ELK-WARM-ES9] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.140.31.252:56134}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-03-31T08:41:07,092][WARN ][o.e.t.n.Netty4Transport ] [WA-ELK-WARM-ES9] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/XXXX:44492}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-03-31T08:41:07,579][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [WA-ELK-WARM-ES9] failed to execute on node [qaZZKgPwTImlpz7K7TbjwA]
org.elasticsearch.transport.TransportException: transport stopped, action: cluster:monitor/nodes/stats[n]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:267) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:725) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.2.jar:6.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
[2019-03-31T08:41:07,580][INFO ][o.e.i.s.GlobalCheckpointSyncAction] [WA-ELK-WARM-ES9] [event-logs-2019.03.19][1] global checkpoint sync failed
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:267) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:725) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.2.jar:6.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
[2019-03-31T08:41:07,580][INFO ][o.e.i.s.GlobalCheckpointSyncAction] [WA-ELK-WARM-ES9] [event-logs-2019.03.18][2] global checkpoint sync failed
org.elasticsearch.transport.TransportException: transport stopped, action: indices:admin/seq_no/global_checkpoint_sync[p]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:267) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:725) [elasticsearch-6.3.2.jar:6.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.2.jar:6.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
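For reference, the Curator migration described above is equivalent to these manual API calls (the index name is illustrative; my actual run is driven by a Curator action file):

```shell
IDX="event-logs-2019.03.19"   # illustrative index name

# 1. Drop the replica first, so only one copy relocates to the warm tier.
curl -XPUT "http://localhost:9200/$IDX/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.number_of_replicas": 0}'

# 2. Re-pin the index to the warm nodes via the box_type attribute.
curl -XPUT "http://localhost:9200/$IDX/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.routing.allocation.require.box_type": "warm"}'

# 3. Once relocation finishes, force-merge down to 1 segment per shard.
curl -XPOST "http://localhost:9200/$IDX/_forcemerge?max_num_segments=1"
```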