Our Elasticsearch cluster is experiencing frequent garbage collection problems and "stop the world" pauses.
My questions:
- Is there any benefit to using Oracle Java vs. OpenJDK? I am using OpenJDK.
- Can each node run a different type of garbage collector? (See the check sketched right after this list.)
- Do I need more nodes? (I think so, but...)
- Should I be using more shards? I kept the count low due to the low number of nodes.
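For reference, this is how I have been checking what each node is actually running. The node info API reports the JVM vendor/version and the collectors each node was started with, since the collector is a per-node JVM setting rather than a cluster-wide one (localhost:9200 is a placeholder for whichever node you query):

```
# Placeholder host/port; point it at any node in the cluster.
# Lists, per node, the JVM vendor and version plus the active GC collectors.
curl -s 'http://localhost:9200/_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.vm_vendor,nodes.*.jvm.version,nodes.*.jvm.gc_collectors&pretty'
```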
I have a 7-node cluster running Elasticsearch 5.9.6:
- 3 master nodes: 2 CPUs, 10 GB RAM, 4 GB heap
- 2 data nodes: 16 CPUs, 64 GB RAM, 45 TB ZFS data storage, 28 GB heap
- 1 data node: 16 CPUs, 32 GB RAM, 45 TB ZFS data storage, 20 GB heap
- 1 API node: 2 CPUs, 10 GB RAM, 4 GB heap
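To see how that maps onto actual heap pressure, the cat nodes API gives a quick per-node view (placeholder host again); if heap.percent sits near 100 on the data nodes, that matches the GC logs further down:

```
# Placeholder host. Per-node role, current/max heap, heap utilisation, and RAM.
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.current,heap.max,heap.percent,ram.max'
```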
The cluster is running behind a Graylog cluster. The data is as follows:
- 1,168 Indexes
- 10.8 TB data
- 15,633,087,989 documents
- 14 shards per index
Indexes are rotated daily. Each index is around 500 GB in size.
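For scale, 1,168 indexes at 14 shards per index is roughly 16,350 shards (before replicas) spread across the three data nodes. These are the calls I use to sanity-check the per-index layout and the total shard count (placeholder host):

```
# Placeholder host. One row per index: primary count, replica count, docs, size on disk.
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'

# Total number of shard copies currently allocated in the cluster.
curl -s 'http://localhost:9200/_cat/shards?h=index' | wc -l
```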
All nodes are currently using the CMS GC except for one data node, which is using G1GC.
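The switch is just the collector section of config/jvm.options. Roughly, the two variants look like this (the stock Elasticsearch 5.x CMS lines vs. a plain G1 flag, shown as an example of what I changed rather than a recommendation):

```
## jvm.options on the CMS nodes (stock settings)
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## jvm.options on the G1GC data node (CMS lines removed)
-XX:+UseG1GC
```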
I am constantly seeing one or more data nodes start to garbage collect heavily. This causes long timeouts, and eventually the node runs out of memory.
```
[2019-01-29T17:30:04,764][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6355] overhead, spent [17.7s] collecting in the last [18.2s]
[2019-01-29T17:30:21,462][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][old][6358][494] duration [13.6s], collections [1]/[14.4s], total [13.6s]/[1.4h], memory [25.7gb]->[24.7gb]/[25.8gb], all_pools {[young] [1.4gb]->[592.7mb]/[1.4gb]}{[survivor] [102.9mb]->[0b]/[191.3mb]}{[old] [24.1gb]->[24.1gb]/[24.1gb]}
[2019-01-29T17:30:21,463][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6358] overhead, spent [13.6s] collecting in the last [14.4s]
[2019-01-29T17:30:36,172][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][old][6361][495] duration [11.7s], collections [1]/[12.4s], total [11.7s]/[1.4h], memory [25.7gb]->[24.7gb]/[25.8gb], all_pools {[young] [1.4gb]->[623mb]/[1.4gb]}{[survivor] [116.4mb]->[0b]/[191.3mb]}{[old] [24.1gb]->[24.1gb]/[24.1gb]}
[2019-01-29T17:30:36,174][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6361] overhead, spent [11.7s] collecting in the last [12.4s]
[2019-01-29T17:30:55,435][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][old][6363][496] duration [17.4s], collections [1]/[18.2s], total [17.4s]/[1.4h], memory [25.7gb]->[24.7gb]/[25.8gb], all_pools {[young] [1.4gb]->[611.3mb]/[1.4gb]}{[survivor] [75.6mb]->[0b]/[191.3mb]}{[old] [24.1gb]->[24.1gb]/[24.1gb]}
[2019-01-29T17:30:55,436][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6363] overhead, spent [17.4s] collecting in the last [18.2s]
[2019-01-29T17:30:56,442][INFO ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6364] overhead, spent [276ms] collecting in the last [1s]
[2019-01-29T17:31:15,753][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][old][6366][497] duration [17.6s], collections [1]/[18.3s], total [17.6s]/[1.4h], memory [25.6gb]->[24.7gb]/[25.8gb], all_pools {[young] [1.4gb]->[634.3mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [24.1gb]->[24.1gb]/[24.1gb]}
[2019-01-29T17:31:15,754][WARN ][o.e.m.j.JvmGcMonitorService] [graylog4] [gc][6366] overhead, spent [17.6s] collecting in the last [18.3s]
```
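Reading those entries, the old generation pool is pegged at its maximum (24.1 GB used of 24.1 GB) and full collections reclaim almost nothing, so the node spends nearly all of its time in GC. This is what I run to watch old-gen usage and cumulative old-GC counts while it happens (placeholder host):

```
# Placeholder host. Per-node old-gen pool usage plus cumulative old-GC counts and times.
curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.pools.old,nodes.*.jvm.gc.collectors.old&pretty'
```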
I can post the cluster stats if that helps.
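(That would be the output of the cluster stats API; the second call below trims it to the JVM and shard sections if the full dump is too long to post. The host is again a placeholder.)

```
# Placeholder host. Full cluster stats, or just the JVM and shard summaries.
curl -s 'http://localhost:9200/_cluster/stats?pretty'
curl -s 'http://localhost:9200/_cluster/stats?filter_path=nodes.jvm,indices.shards&pretty'
```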