Hello,
I'm running a 27-node cluster (21 data nodes, 3 dedicated masters, 3 Kibana nodes) that stores around 60 TB in roughly 10k shards.
Data and Kibana nodes run with a 30 GB heap, masters with 8 GB.
The cluster was installed from RPMs on physical CentOS 7 servers.
Elasticsearch 6.6.0 with JDK 11.0.2.
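For reference, heap sizes are set in /etc/elasticsearch/jvm.options on each node (the default path for an RPM install). A minimal sketch for a data node (masters use 8g instead):
# /etc/elasticsearch/jvm.options (data node)
-Xms30g
-Xmx30g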
I'm experiencing poor performance with the cluster: nodes go into excessive GC, load average climbs, and Kibana returns 503s because the data nodes' APIs hang in endless GC cycles. That leaves me no option but to kill the ES process and start it again on every hung node.
Are there any recommendations, best practices, or suggestions that could improve the cluster's stability?
I'm really suffering from this and spend most of the day reviving the cluster.
I will provide any data needed to determine whether the cluster's scale, sizing, or configuration is wrong.
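For example, I can share the output of calls along these lines (a sketch; happy to adjust columns or add more):
# overall cluster sizing
curl -s 'localhost:9200/_cluster/stats?human&pretty'
# per-node heap, memory, CPU and load
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,ram.percent,cpu,load_1m'
# index and shard layout, largest indices first
curl -s 'localhost:9200/_cat/indices?v&s=store.size:desc'
# hot threads, captured while a node is degrading
curl -s 'localhost:9200/_nodes/hot_threads'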
Since the data nodes' heaps are very large, I tried G1GC instead of CMS, following a suggestion from Elastic at KubeCon 2019. Unfortunately it made cluster behavior even worse: the data node logs showed GC runs taking too long, and nodes still hung.
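For context, the switch amounted to replacing the default CMS flags in jvm.options with G1 ones, roughly as below (illustrative values; I can post my exact jvm.options if it helps):
# removed the CMS defaults
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly
# added G1
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30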
For example, this is the ES log from one of the data nodes when it hung (roughly the last 50 lines, trimmed to fit the 7000-character limit):
[2019-06-08T10:02:02,573][WARN ][o.e.x.m.MonitoringService] [us-elkdb20] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:95) ~[x-pack-monitoring-6.6.0.jar:6.6.0]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:93) [x-pack-monitoring-6.6.0.jar:6.6.0]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:206) [x-pack-monitoring-6.6.0.jar:6.6.0]
.....................
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:156) [x-pack-monitoring-6.6.0.jar:6.6.0]
... 39 more
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:124) ~[?:?]
... 37 more
[2019-06-08T10:02:04,661][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][young][142045][115397] duration [1.6s], collections [1]/[2.3s], total [1.6s]/[1.8h], memory [21.2gb]->[21.5gb]/[29.8gb], all_pools {[young] [434.2mb]->[124.2mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [20.6gb]->[21.2gb]/[28.1gb]}
[2019-06-08T10:02:04,661][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142045] overhead, spent [1.6s] collecting in the last [2.3s]
[2019-06-08T10:02:06,082][INFO ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][young][142046][115398] duration [956ms], collections [1]/[1.4s], total [956ms]/[1.8h], memory [21.5gb]->[21.9gb]/[29.8gb], all_pools {[young] [124.2mb]->[8.3mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [21.2gb]->[21.7gb]/[28.1gb]}
[2019-06-08T10:02:06,082][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142046] overhead, spent [956ms] collecting in the last [1.4s]
[2019-06-08T10:02:07,083][INFO ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142047] overhead, spent [420ms] collecting in the last [1s]
[2019-06-08T10:02:08,731][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142048] overhead, spent [1.1s] collecting in the last [1.6s]
.......
[2019-06-08T10:02:14,958][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142053] overhead, spent [1s] collecting in the last [1.5s]
[2019-06-08T10:02:15,959][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142054] overhead, spent [632ms] collecting in the last [1s]
[2019-06-08T10:02:37,580][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][old][142055][511] duration [20.5s], collections [1]/[1s], total [20.5s]/[7.4m], memory [27.3gb]->[29.3gb]/[29.8gb], all_pools {[young] [176.8mb]->[610.5mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [27gb]->[19.5gb]/[28.1gb]}
[2019-06-08T10:02:37,582][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142055] overhead, spent [21s] collecting in the last [1s]
[2019-06-08T10:02:39,671][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142057] overhead, spent [541ms] collecting in the last [1s]
......
[2019-06-08T10:02:48,498][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142064] overhead, spent [903ms] collecting in the last [1.4s]
[2019-06-08T10:03:16,662][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][old][142065][513] duration [27.2s], collections [2]/[28.1s], total [27.2s]/[7.8m], memory [27.3gb]->[26gb]/[29.8gb], all_pools {[young] [34.1mb]->[141.3mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [27.1gb]->[25.8gb]/[28.1gb]}
[2019-06-08T10:03:16,662][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142065] overhead, spent [27.5s] collecting in the last [28.1s]
[2019-06-08T10:03:21,725][INFO ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142070] overhead, spent [334ms] collecting in the last [1s]
[2019-06-08T10:03:42,855][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][old][142071][514] duration [20.9s], collections [1]/[21.1s], total [20.9s]/[8.2m], memory [28.3gb]->[26.8gb]/[29.8gb], all_pools {[young] [215mb]->[318.2mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [27.9gb]->[26.5gb]/[28.1gb]}
[2019-06-08T10:03:42,856][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142071] overhead, spent [20.9s] collecting in the last [21.1s]
[2019-06-08T10:03:43,856][INFO ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142072] overhead, spent [288ms] collecting in the last [1s]
[2019-06-08T10:03:44,857][INFO ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142073] overhead, spent [308ms] collecting in the last [1s]
[2019-06-08T10:04:11,303][ERROR][o.e.x.m.c.n.NodeStatsCollector] [us-elkdb20] collector [node_stats] timed out when collecting data
[2019-06-08T10:04:11,479][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][old][142075][515] duration [25.2s], collections [1]/[25.5s], total [25.2s]/[8.6m], memory [28.2gb]->[28.3gb]/[29.8gb], all_pools {[young] [13.7mb]->[684.5mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [28gb]->[27.6gb]/[28.1gb]}
[2019-06-08T10:04:11,479][WARN ][o.e.m.j.JvmGcMonitorService] [us-elkdb20] [gc][142075] overhead, spent [25.2s] collecting in the last [25.5s]
[2019-06-08T10:04:34,523][ERROR][o.e.x.m.c.n.NodeStatsCollector] [us-elkdb20] collector [node_stats] timed out when collecting data
I'd appreciate any help,
Thanks in advance,
Lior