Long GC pauses on data nodes

Hi,

We are seeing hot nodes drop out of the cluster frequently (at least once a day) with the error below, and we have to restart the service manually to add the node back to the cluster.

We have 8 hot nodes. Most of the time a node goes down with heap memory consumption above 90% and long GC pauses (sometimes minutes long).

```
[gc][69753] overhead, spent [54s] collecting in the last [54.9s]
[gc][old][69754][152] duration [42s], collections [1]/[42.5s], total [42s]/[3.5m], memory [30.3gb]->[30.7gb]/[30.9gb], all_pools {[young] [38.9mb]->[389.8mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [30.3gb]->[30.3gb]/[30.3gb]}

ERROR Recovering from [gc][old][69754][152] duration [42s], collections [1]/[42.5s], total [42s]/[3.5m], memory [30.3gb]->[30.7gb]/[30.9gb], all_pools {[young] [38.9mb]->[389.8mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [30.3gb]->[30.3gb]/[30.3gb]}
```

Around 76 TB of data is stored in our cluster, with around 10 TB of it on the hot nodes (1.7 TB of disk is allocated on each hot node, with around 300-400 GB of free space remaining).

We have allocated 31 GB of heap on each hot node, and around 350 shards currently exist on each hot node.

Is there any way we can check what is causing the high memory consumption on the hot nodes? I was thinking ingestion might be one of the reasons.
Is there anything we can check specifically to find the root cause?

Most of the time we face this issue only on the hot nodes.
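
As a starting point, would calls roughly like the following be the right place to look for per-node heap and memory breakdown? (Just a sketch; the host is a placeholder for our actual endpoint.)

```
# Sketch only: quick per-node heap overview (host is a placeholder)
curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,heap.max,ram.percent"

# Sketch only: per-node indices memory; the segments, fielddata and query_cache
# sections are the parts we would inspect
curl -s "localhost:9200/_nodes/stats/indices?human&pretty"
```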

ES version: 6.2

Any help here is much appreciated.

Thanks,
Aravind

Do you have any non-default settings in your Elasticsearch config? What is heap pressure looking like on the data nodes and how much data do these hold? Are you sending queries to all nodes in the cluster or just to hot nodes? What type of data are you ingesting? What is your ingest rate?
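
For reference, the output of requests along these lines would cover most of that (a sketch; the host is a placeholder):

```
# Sketch: dynamic cluster-level settings currently applied
curl -s "localhost:9200/_cluster/settings?pretty"

# Sketch: per-node heap pressure and on-node data volume
curl -s "localhost:9200/_nodes/stats/jvm,fs?human&pretty"
```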

"cluster": {
"routing": {
"use_adaptive_replica_selection": "true",
"allocation": {
"allow_rebalance": "always",
"cluster_concurrent_rebalance": "5",
"node_concurrent_recoveries": "5",
"disk": {
"watermark": {
"low": "80gb",
"flood_stage": "10gb",
"high": "50gb"
}
},
"node_initial_primaries_recoveries": "5"
}
},
"info": {
"update": {
"interval": "1m"
}
}
},
"indices": {
"recovery": {
"max_bytes_per_sec": "200mb"
}
},
"search": {
"default_search_timeout": "90s"
},

Data is ingested to all 8 hot nodes. Most of the time heap memory consumption stays below 75%.
There is a spike at least once a day where it crosses 90%, and we see long GC pauses (sometimes in minutes) in the log files.

We have allocated 1.7 TB of disk on each hot node, and each hot node holds around 1-1.3 TB of data. We suspected expensive search queries might be causing this issue, so the search timeout parameter is set to 90 seconds.
But we are still seeing the same issue.
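
Is watching the task management API while the heap is spiking, along these lines, a reasonable way to catch such expensive searches? (A sketch; the host is a placeholder.)

```
# Sketch: list currently running search tasks and how long they have been running
curl -s "localhost:9200/_tasks?actions=*search&detailed=true&pretty"
```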

Around 1 to 1.2 TB of data is ingested per day.

I was looking for settings in the elasticsearch.yml file, e.g. whether you have increased the bulk queue size or similar. Have you tried moving data off the hot nodes a bit earlier, so they store less data? What does the cluster stats API return?

We have a curator job which is scheduled to move data from hot to warm nodes daily.
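
Per index, the job effectively just re-tags the allocation attribute so the shards relocate to the warm tier, roughly like this (a sketch; the index name and the box_type attribute/value are placeholders for our actual naming):

```
# Sketch: require an index's shards to live on nodes tagged box_type=warm
# (index name and attribute are placeholders)
curl -s -X PUT "localhost:9200/logstash-2019.03.20/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index.routing.allocation.require.box_type": "warm" }'
```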

Please find the memory-related details below.

"nodes": {
"count": {
"total": 31,
"data": 25,
"coordinating_only": 3,
"master": 3,
"ingest": 11
},
"versions": [
"6.2.0"
],
"os": {
"available_processors": 260,
"allocated_processors": 260,

"mem": {
"total": "1.6tb",
"total_in_bytes": 1804223500288,
"free": "124.2gb",
"free_in_bytes": 133380161536,
"used": "1.5tb",
"used_in_bytes": 1670843338752,
"free_percent": 7,
"used_percent": 93
}
},
"process": {
"cpu": {
"percent": 1002
},
"open_file_descriptors": {
"min": 929,
"max": 8385,
"avg": 3197
}
},
"jvm": {
"max_uptime": "22d",
"max_uptime_in_millis": 1909155466,
"versions": [
{
"version": "1.8.0_192",
"vm_name": "Java HotSpot(TM) 64-Bit Server VM",
"vm_version": "25.192-b12",
"vm_vendor": "Oracle Corporation",
"count": 31
}
],
"mem": {
"heap_used": "397.1gb",
"heap_used_in_bytes": 426438577024,
"heap_max": "840.9gb",
"heap_max_in_bytes": 902976372736
},
"threads": 6498
},

Can you please post the full output? I am primarily interested in the sections showing memory usage related to indices and shards.
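
That is, the full response from something like this (the host is a placeholder):

```
# Sketch: full cluster stats; the indices.segments, indices.fielddata and
# indices.query_cache sections show memory attributable to shards
curl -s "localhost:9200/_cluster/stats?human&pretty"
```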

```
{
  "_nodes": {
    "total": 31,
    "successful": 31,
    "failed": 0
  },
  "cluster_name": "<cluster_name>",
  "timestamp": 1553152785205,
  "status": "green",
  "indices": {
    "count": 6958,
    "shards": {
      "total": 20039,
      "primaries": 10009,
      "replication": 1.0020981116994705,
      "index": {
        "shards": {
          "min": 1,
          "max": 25,
          "avg": 2.8799942512216155
        },
        "primaries": {
          "min": 1,
          "max": 5,
          "avg": 1.438488071284852
        },
        "replication": {
          "min": 0,
          "max": 24,
          "avg": 1.0030181086519114
        }
      }
    },
    "docs": {
      "count": 133847115321,
      "deleted": 17695178
    },
    "store": {
      "size": "75.8tb",
      "size_in_bytes": 83358461265286
    },
    "fielddata": {
      "memory_size": "3.8gb",
      "memory_size_in_bytes": 4085858256,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "4.1gb",
      "memory_size_in_bytes": 4407021212,
      "total_count": 12359903,
      "hit_count": 328527,
      "miss_count": 12031376,
      "cache_size": 43538,
      "cache_count": 44066,
      "evictions": 528
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 384832,
      "memory": "147.4gb",
      "memory_in_bytes": 158319407915,
      "terms_memory": "110.9gb",
      "terms_memory_in_bytes": 119111301300,
      "stored_fields_memory": "31.4gb",
      "stored_fields_memory_in_bytes": 33765067912,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "320.6mb",
      "norms_memory_in_bytes": 336177088,
      "points_memory": "4.4gb",
      "points_memory_in_bytes": 4827562359,
      "doc_values_memory": "266.3mb",
      "doc_values_memory_in_bytes": 279299256,
      "index_writer_memory": "459.1mb",
      "index_writer_memory_in_bytes": 481473526,
      "version_map_memory": "233.3kb",
      "version_map_memory_in_bytes": 238900,
      "fixed_bit_set": "184.5kb",
      "fixed_bit_set_memory_in_bytes": 189008,
      "max_unsafe_auto_id_timestamp": 1553152036160,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 31,
      "data": 25,
      "coordinating_only": 3,
      "master": 3,
      "ingest": 11
    },
    "versions": [
      "6.2.0"
    ],
    "os": {
      "available_processors": 260,
      "allocated_processors": 260,
      "names": [
        {
          "name": "Linux",
          "count": 31
        }
      ],
      "mem": {
        "total": "1.6tb",
        "total_in_bytes": 1804223500288,
        "free": "124.2gb",
        "free_in_bytes": 133380161536,
        "used": "1.5tb",
        "used_in_bytes": 1670843338752,
        "free_percent": 7,
        "used_percent": 93
      }
    },
    "process": {
      "cpu": {
        "percent": 1002
      },
      "open_file_descriptors": {
        "min": 929,
        "max": 8385,
        "avg": 3197
      }
    },
    "jvm": {
      "max_uptime": "22d",
      "max_uptime_in_millis": 1909155466,
      "versions": [
        {
          "version": "1.8.0_192",
          "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version": "25.192-b12",
          "vm_vendor": "Oracle Corporation",
          "count": 31
        }
      ],
      "mem": {
        "heap_used": "397.1gb",
        "heap_used_in_bytes": 426438577024,
        "heap_max": "840.9gb",
        "heap_max_in_bytes": 902976372736
      },
      "threads": 6498
    },
    "fs": {
      "total": "195.6tb",
      "total_in_bytes": 215152809619456,
      "free": "119.2tb",
      "free_in_bytes": 131149386911744,
      "available": "109.4tb",
      "available_in_bytes": 120300727922688
    },
```

Can anyone please provide some input on how to proceed with this issue?

Read this and specifically the "Also be patient" part.

It's fine to answer on your own thread after 2 or 3 days (not including weekends) if you don't have an answer.

Also, please format your code, logs, or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format:

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.