High node JVM heap causes ES cluster to almost stop working

Hi Everyone,

We are hitting a JVM memory issue on 3 of our warm nodes: the JVM heap of these nodes keeps increasing until it reaches 99%, and then the following error messages are reported in the node log file. After that the cluster seems to stop working, and Kibana also reports that its connections to ES time out.

The situation is not resolved until we force-restart the ES service on that node.

We are currently running ES 6.4.2 in Docker containers with about 31 GB of JVM heap per node; as I understand it, we cannot give a node more JVM heap than that.

Can anybody help with this?

```
[2019-01-27T05:01:14,534][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:01:14,535][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5497] overhead, spent [26.8s] collecting in the last [27.8s]
[2019-01-27T05:01:45,021][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5498][465] duration [29.4s], collections [1]/[30.4s], total [29.4s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [994.1mb]->[993.9mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.8gb]/[27.9gb]}
[2019-01-27T05:01:45,020][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:01:45,021][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5498] overhead, spent [29.4s] collecting in the last [30.4s]
[2019-01-27T05:02:14,244][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:02:14,618][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5499][466] duration [28.6s], collections [1]/[29.5s], total [28.6s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [993.9mb]->[995.2mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.8gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:02:14,618][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5499] overhead, spent [28.6s] collecting in the last [29.5s]
[2019-01-27T05:02:44,673][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:02:44,674][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5500][467] duration [29.1s], collections [1]/[30s], total [29.1s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [995.2mb]->[998.7mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:02:44,674][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5500] overhead, spent [29.1s] collecting in the last [30s]
[2019-01-27T05:03:14,461][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:03:14,462][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5501][468] duration [28.8s], collections [1]/[29.7s], total [28.8s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [998.7mb]->[992.9mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:03:14,462][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5501] overhead, spent [28.8s] collecting in the last [29.7s]
[2019-01-27T05:03:45,099][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:03:45,100][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5502][469] duration [29.7s], collections [1]/[30.6s], total [29.7s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [992.9mb]->[994.7mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:03:45,100][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5502] overhead, spent [29.7s] collecting in the last [30.6s]
[2019-01-27T05:04:13,513][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5503][470] duration [27.4s], collections [1]/[28.4s], total [27.4s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [994.7mb]->[997.2mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:04:13,513][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5503] overhead, spent [27.4s] collecting in the last [28.4s]
[2019-01-27T05:04:13,513][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:04:44,178][ERROR][o.e.x.m.c.n.NodeStatsCollector] [172.18.130.15-instance1] collector [node_stats] timed out when collecting data
[2019-01-27T05:04:44,179][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][old][5504][471] duration [29.7s], collections [1]/[30.6s], total [29.7s]/[2.1h], memory [28.8gb]->[28.8gb]/[30.6gb], all_pools {[young] [997.2mb]->[1002.9mb]/[2.4gb]}{[survivor] [0b]->[0b]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2019-01-27T05:04:44,179][WARN ][o.e.m.j.JvmGcMonitorService] [172.18.130.15-instance1] [gc][5504] overhead, spent [29.7s] collecting in the last [30.6s]
```
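
When a node is stuck in back-to-back old-generation collections like this, the standard JDK tools can show what is actually filling the heap before you restart it. A rough sketch, assuming the container image ships `jps`/`jstat`/`jmap`; the container name `es-warm1` and `<pid>` are placeholders:

```
# Find the Elasticsearch PID inside the container (often 1 in the official image)
docker exec es-warm1 jps -l

# Per-second GC stats: an old generation (O) pinned near 100% confirms heap exhaustion
docker exec es-warm1 jstat -gcutil <pid> 1000 10

# Histogram of live objects, i.e. what is occupying the heap
# (note: -histo:live itself triggers a full GC, so it may stall on a saturated node)
docker exec es-warm1 jmap -histo:live <pid> | head -n 40
```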

How many nodes, indices, and shards do you have?
What is the total size of data (primary shards only)?
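
If you are not sure, the `_cat` APIs will give you these numbers directly. A quick sketch; replace `<es-host>` with any node that has HTTP enabled:

```
# Node count, roles and heap usage
curl 'http://<es-host>:9200/_cat/nodes?v&h=name,node.role,heap.percent,heap.max'

# Per-index primary/replica shard counts, doc counts and primary store size
curl 'http://<es-host>:9200/_cat/indices?v&h=index,pri,rep,docs.count,pri.store.size'

# Shard count and disk usage per node
curl 'http://<es-host>:9200/_cat/allocation?v'

# Overall cluster summary (status, shard totals)
curl 'http://<es-host>:9200/_cluster/health?pretty'
```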

Please format your code, logs, or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

If you are not using markdown format, use the </> icon in the editor toolbar.

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

Sorry for the lack of some information.
Currently we have 14 nodes in the whole cluster, including 3 dedicated master nodes on 3 physical servers, 8 hot nodes on 4 physical servers, and 3 warm nodes on 3 physical servers.

For the hot nodes we use SSD disks and a 31 GB JVM heap per node; 2 nodes run on the same physical box.
For the warm nodes we use SATA disks and a 31 GB JVM heap; each warm node is on its own physical box.

The current data size is as follows:

Nodes: 14

Indices: 556

Memory: 215.3 GB / 429.7 GB

Total Shards: 16078

Unassigned Shards: 6452

Documents: 8,991,033,826

Data: 14.2 TB

The unassigned shards are there because a warm node ran out of JVM heap and we had to force-restart the service on that node, so you can ignore them; when the cluster is in its normal state there should be no unassigned shards.
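
If they do not recover on their own after the restart, the allocation-explain API should say why, and a reroute retry can kick allocation off again. A rough sketch (`<es-host>` is a placeholder for a node with HTTP enabled):

```
# Explain why the first unassigned shard is not being allocated
curl 'http://<es-host>:9200/_cluster/allocation/explain?pretty'

# Retry shards whose allocation previously failed too many times (e.g. after a node crash/restart)
curl -XPOST 'http://<es-host>:9200/_cluster/reroute?retry_failed=true&pretty'
```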

I have also posted the current configuration for a warm node below:

```
discovery.zen.ping.unicast.hosts:
- 172.18.130.19:9300
- 172.18.130.19:9301
- 172.18.130.15:9300
- 172.18.130.14:9300
- 172.18.130.14:9301
- 172.18.130.20:9300
- 172.18.130.33:9300
- 172.18.130.21:9300
- 172.18.130.35:9300
- 172.18.130.35:9301
- 172.18.130.34:9300
- 172.18.130.36:9300
- 172.18.130.32:9300
- 172.18.130.32:9301
discovery.zen.minimum_master_nodes: 2
#discovery.zen.ping.multicast.enabled: false
#discovery.zen.hosts_provider: file

node.name: 172.18.130.15-instance1
node.data: true

cluster.name: xxxxxx
#bootstrap.memory_lock: true

node.master: false
http.enabled: false
transport.tcp.port: 9300

network.host: 172.18.130.15


node.attr.box_type: warm

################################## threadpool #################################
thread_pool:
  bulk:
    size: 32
    queue_size: 3000
  index:
    size: 32
    queue_size: 3000
  force_merge:
    size: 8


indices.memory.index_buffer_size: 50%
# index.translog.flush_threshold_ops: 50000

################################### gateway ###################################
gateway.expected_master_nodes: 2
gateway.expected_data_nodes: 8
gateway.recover_after_time: 3m
gateway.recover_after_master_nodes: 2
gateway.recover_after_data_nodes: 8

#################################### Paths ####################################
# Path to directory containing configuration (this file and logging.yml):
path.data: /usr/share/elasticsearch/data/
path.logs: /usr/share/elasticsearch/logs/

##############################  x-pack plugin #################################
xpack.security.enabled: false
xpack.monitoring.history.duration: 28d
```

Again:

Please format your code, logs, or configuration files using the </> icon or markdown code fences, as explained above, and not the citation button. It will make your post more readable.

Please update your post.

You probably have too many shards per node.

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

And https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right
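
As a concrete example of reducing the shard count without adding nodes, old read-only indices can be shrunk down to fewer primary shards with the `_shrink` API (available in your 6.4 version). A rough sketch with an example index name (`logs-2019.01.01`) and one of your warm node names; check the shrink prerequisites in the documentation before trying this:

```
# 1. Move a copy of every shard of the source index onto one node and block writes
#    (both are prerequisites for _shrink; the index name is an example)
curl -XPUT 'http://<es-host>:9200/logs-2019.01.01/_settings' -H 'Content-Type: application/json' -d '{
  "settings": {
    "index.routing.allocation.require._name": "172.18.130.15-instance1",
    "index.blocks.write": true
  }
}'

# 2. Shrink to a single primary shard, clearing the temporary settings on the target
curl -XPOST 'http://<es-host>:9200/logs-2019.01.01/_shrink/logs-2019.01.01-shrunk' -H 'Content-Type: application/json' -d '{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}'
```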

Thanks a lot for your advice; I have updated the previous post.

I understand that adding more nodes may solve the issue, but do we have any other options?

The 32 GB JVM heap limitation is a big problem for modern servers. We have 128 GB of RAM in that physical box which cannot be used by ES; on the other hand, the ES node itself just keeps eating through its JVM heap. It seems that JVM garbage collection cannot keep up with the memory usage.
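
As I understand it, the point of capping the heap at ~31 GB is to keep compressed ordinary object pointers (oops) enabled; above roughly 32 GB the JVM switches to uncompressed 64-bit object pointers, so a slightly larger heap can actually hold less. A quick way to double-check that a 31 GB heap still uses compressed oops (the exact java binary and log path depend on the container image):

```
# Ask the JVM directly whether a 31 GB heap still uses compressed oops
java -Xmx31g -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version

# Elasticsearch also logs this at startup, e.g.
#   heap size [30.6gb], compressed ordinary object pointers [true]
grep 'compressed ordinary object pointers' /usr/share/elasticsearch/logs/*.log
```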

As it seems you are suffering from heap pressure, this webinar may also be useful. I do however agree with David in that you seem to have far, far too many shards given the total data volume, which will be affecting performance.
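
To put the shard count in perspective: roughly 16,000 shards for 14.2 TB of data works out to an average shard well under 1 GB, and every shard carries a fixed heap overhead on the node that hosts it. Besides shrinking or reindexing existing indices, you can stop the count from growing by lowering `number_of_shards` for new indices with an index template. A minimal sketch with a hypothetical template name and index pattern:

```
# Give new matching indices 1 primary shard and 1 replica instead of the 6.x default of 5 primaries
curl -XPUT 'http://<es-host>:9200/_template/low_shard_count' -H 'Content-Type: application/json' -d '{
  "index_patterns": ["logs-*"],
  "order": 1,
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'
```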

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.