Hello,
I'm running a 27-node cluster (21 data nodes, 3 dedicated masters, 3 Kibana nodes) that stores around 60 TB in roughly 10k shards.
Data and Kibana nodes run with a 30 GB heap, masters with 8 GB.
The cluster was installed from RPMs on physical CentOS 7 servers.
Elasticsearch 6.6.0 with JDK 11.0.2.
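For reference, heap sizes are set in /etc/elasticsearch/jvm.options on each node (the default path for an RPM install). A minimal sketch for a data node (masters use 8g instead):
# /etc/elasticsearch/jvm.options (data node)
-Xms30g
-Xmx30g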
I'm experiencing poor performance with the cluster: nodes go into excessive GC, load average climbs, and Kibana returns 503s because the data nodes' APIs hang in endless GC cycles. That leaves me no option but to kill the ES process and start it again on every hung node.
Are there any recommendations, best practices, or suggestions that could improve the cluster's stability?
I'm really suffering from this and spend most of the day reviving the cluster.
I will provide any data needed to determine whether the cluster's scale, sizing, or configuration is wrong.
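For example, I can share the output of calls along these lines (a sketch; happy to adjust columns or add more):
# overall cluster sizing
curl -s 'localhost:9200/_cluster/stats?human&pretty'
# per-node heap, memory, CPU and load
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,ram.percent,cpu,load_1m'
# index and shard layout, largest indices first
curl -s 'localhost:9200/_cat/indices?v&s=store.size:desc'
# hot threads, captured while a node is degrading
curl -s 'localhost:9200/_nodes/hot_threads'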
Since the data nodes' heaps are very large, I tried G1GC instead of CMS, following a suggestion from Elastic at KubeCon 2019. Unfortunately it made cluster behavior even worse: the data node logs showed GC runs taking too long, and nodes still hung.
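For context, the switch amounted to replacing the default CMS flags in jvm.options with G1 ones, roughly as below (illustrative values; I can post my exact jvm.options if it helps):
# removed the CMS defaults
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly
# added G1
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30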
For example, this is the ES log from one of the data nodes when it hung (roughly the last 50 lines, trimmed to fit the 7000-character limit):
[2019-06-08T10:02:02,573][WARN ][o.e.x.m.MonitoringService] [us-elkdb20] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:95) ~[x-pack-monitoring-6.6.0.jar:6.6.0]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:93) [x-pack-monitoring-6.6.0.jar:6.6.0]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:206) [x-pack-monitoring-6.6.0.jar:6.6.0]
.....................
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:156) [x-pack-monitoring-6.6.0.jar:6.6.0]
... 39 more
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:124) ~[?:?]
... 37 more
[2019-06-08T10:02:04,661][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][young][142045][115397] duration [1.6s], collections [1]/[2.3s], total [1.6s]/[1.8h], memory [21.2gb]->[21.5gb]/[29.8gb], all_pools {[young] [434.2mb]->[124.2mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [20.6gb]->[21.2gb]/[28.1gb]}
[2019-06-08T10:02:04,661][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142045] overhead, spent [1.6s] collecting in the last [2.3s]
[2019-06-08T10:02:06,082][INFO ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][young][142046][115398] duration [956ms], collections [1]/[1.4s], total [956ms]/[1.8h], memory [21.5gb]->[21.9gb]/[29.8gb], all_pools {[young] [124.2mb]->[8.3mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [21.2gb]->[21.7gb]/[28.1gb]}
[2019-06-08T10:02:06,082][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142046] overhead, spent [956ms] collecting in the last [1.4s]
[2019-06-08T10:02:07,083][INFO ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142047] overhead, spent [420ms] collecting in the last [1s]
[2019-06-08T10:02:08,731][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142048] overhead, spent [1.1s] collecting in the last [1.6s]
.......
[2019-06-08T10:02:14,958][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142053] overhead, spent [1s] collecting in the last [1.5s]
[2019-06-08T10:02:15,959][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142054] overhead, spent [632ms] collecting in the last [1s]
[2019-06-08T10:02:37,580][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][old][142055][511] duration [20.5s], collections [1]/[1s], total [20.5s]/[7.4m], memory [27.3gb]->[29.3gb]/[29.8gb], all_pools {[young] [176.8mb]->[610.5mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [27gb]->[19.5gb]/[28.1gb]}
[2019-06-08T10:02:37,582][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142055] overhead, spent [21s] collecting in the last [1s]
[2019-06-08T10:02:39,671][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142057] overhead, spent [541ms] collecting in the last [1s]
......
[2019-06-08T10:02:48,498][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142064] overhead, spent [903ms] collecting in the last [1.4s]
[2019-06-08T10:03:16,662][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][old][142065][513] duration [27.2s], collections [2]/[28.1s], total [27.2s]/[7.8m], memory [27.3gb]->[26gb]/[29.8gb], all_pools {[young] [34.1mb]->[141.3mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [27.1gb]->[25.8gb]/[28.1gb]}
[2019-06-08T10:03:16,662][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142065] overhead, spent [27.5s] collecting in the last [28.1s]
[2019-06-08T10:03:21,725][INFO ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142070] overhead, spent [334ms] collecting in the last [1s]
[2019-06-08T10:03:42,855][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][old][142071][514] duration [20.9s], collections [1]/[21.1s], total [20.9s]/[8.2m], memory [28.3gb]->[26.8gb]/[29.8gb], all_pools {[young] [215mb]->[318.2mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [27.9gb]->[26.5gb]/[28.1gb]}
[2019-06-08T10:03:42,856][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142071] overhead, spent [20.9s] collecting in the last [21.1s]
[2019-06-08T10:03:43,856][INFO ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142072] overhead, spent [288ms] collecting in the last [1s]
[2019-06-08T10:03:44,857][INFO ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142073] overhead, spent [308ms] collecting in the last [1s]
[2019-06-08T10:04:11,303][ERROR][o.e.x.m.c.n.NodeStatsCollector] [us-elkdb20] collector [node_stats] timed out when collecting data
[2019-06-08T10:04:11,479][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][old][142075][515] duration [25.2s], collections [1]/[25.5s], total [25.2s]/[8.6m], memory [28.2gb]->[28.3gb]/[29.8gb], all_pools {[young] [13.7mb]->[684.5mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [28gb]->[27.6gb]/[28.1gb]}
[2019-06-08T10:04:11,479][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142075] overhead, spent [25.2s] collecting in the last [25.5s]
[2019-06-08T10:04:34,523][ERROR][o.e.x.m.c.n.NodeStatsCollector] [us-elkdb20] collector [node_stats] timed out when collecting data
I'd appreciate any help,
Thanks in advance,
Lior