ES JVM pressure spike and garbage collection issue

Hi,
We have been using AWS ES for the past year. For the past one or two weeks we have been seeing spikes in JVM pressure (and no garbage collection), resulting in writes being blocked. I need help understanding why garbage collection is not happening.
I have tried the following instance types: r4.xlarge and r4.large.
The only notable change over the past few weeks is an increase in searchable documents from 400 million to 500 million.
Before the increase in the number of documents we were not seeing these JVM pressure spikes.
Document Count: 509883131
Size: 362.26 GB
Elasticsearch version: 5.1


Can you provide the full output of the cluster stats API to give us a better view of the status of the cluster?
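In case it helps, here is a minimal sketch of pulling that output with Python and the requests library; the endpoint URL is a placeholder for your AWS ES domain, and curl or any other HTTP client works just as well:

# Minimal sketch: fetch cluster stats from an Elasticsearch 5.x domain.
# ES_ENDPOINT is a placeholder; substitute your AWS ES domain endpoint.
import json
import requests

ES_ENDPOINT = "https://your-domain.region.es.amazonaws.com"

# ?human adds readable size/time strings alongside the raw byte values.
resp = requests.get(ES_ENDPOINT + "/_cluster/stats?human")
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))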

Hey Christian,
Thank you for replying.
This is the full output of the cluster stats API right now (when the JVM pressure is 61.9). I will post it again when the JVM pressure shoots past 75.

{
  "_nodes": {
    "total": 4,
    "successful": 4,
    "failed": 0
  },
  "cluster_name": "cluster name",
  "timestamp": 1510910673574,
  "status": "green",
  "indices": {
    "count": 2,
    "shards": {
      "total": 42,
      "primaries": 21,
      "replication": 1,
      "index": {
        "shards": {
          "min": 2,
          "max": 40,
          "avg": 21
        },
        "primaries": {
          "min": 1,
          "max": 20,
          "avg": 10.5
        },
        "replication": {
          "min": 1,
          "max": 1,
          "avg": 1
        }
      }
    },
    "docs": {
      "count": 510196018,
      "deleted": 228488630
    },
    "store": {
      "size": "730gb",
      "size_in_bytes": 783917412586,
      "throttle_time": "0s",
      "throttle_time_in_millis": 0
    },
    "fielddata": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "45.1mb",
      "memory_size_in_bytes": 47325120,
      "total_count": 1077798,
      "hit_count": 526476,
      "miss_count": 551322,
      "cache_size": 30328,
      "cache_count": 30328,
      "evictions": 0
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 1150,
      "memory": "6.1gb",
      "memory_in_bytes": 6577349073,
      "terms_memory": "6gb",
      "terms_memory_in_bytes": 6508660713,
      "stored_fields_memory": "57mb",
      "stored_fields_memory_in_bytes": 59838352,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "2.1mb",
      "norms_memory_in_bytes": 2219968,
      "points_memory": "5.3mb",
      "points_memory_in_bytes": 5633344,
      "doc_values_memory": "973.3kb",
      "doc_values_memory_in_bytes": 996696,
      "index_writer_memory": "0b",
      "index_writer_memory_in_bytes": 0,
      "version_map_memory": "20kb",
      "version_map_memory_in_bytes": 20550,
      "fixed_bit_set": "176.1mb",
      "fixed_bit_set_memory_in_bytes": 184690496,
      "max_unsafe_auto_id_timestamp": -1,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 4,
      "data": 4,
      "coordinating_only": 0,
      "master": 4,
      "ingest": 4
    },
    "versions": [
      "5.1.1"
    ],
    "os": {
      "available_processors": 16,
      "allocated_processors": 16,
      "names": [
        {
          "count": 4
        }
      ],
      "mem": {
        "total": "62.6gb",
        "total_in_bytes": 67314974720,
        "free": "607.1mb",
        "free_in_bytes": 636645376,
        "used": "62gb",
        "used_in_bytes": 66678329344,
        "free_percent": 1,
        "used_percent": 99
      }
    },
    "process": {
      "cpu": {
        "percent": 10
      },
      "open_file_descriptors": {
        "min": 645,
        "max": 650,
        "avg": 647
      }
    },
    "jvm": {
      "max_uptime": "2.3h",
      "max_uptime_in_millis": 8411962,
      "mem": {
        "heap_used": "19.1gb",
        "heap_used_in_bytes": 20580010792,
        "heap_max": "31.8gb",
        "heap_max_in_bytes": 34220277760
      },
      "threads": 328
    },
    "fs": {
      "total": "1.1tb",
      "total_in_bytes": 1246598791168,
      "free": "430.1gb",
      "free_in_bytes": 461827280896,
      "available": "371gb",
      "available_in_bytes": 398409404416
    },
    "network_types": {
      "transport_types": {
        "netty4": 4
      },
      "http_types": {
        "filter-jetty": 4
      }
    }
  }
}

Thanks,
Tushar.

As the graph shows, it looks like the nodes are suffering from heap pressure. A significant portion, about 3GB per node, is taken up by segment and terms memory, but that does not fully explain it. What does your use-case and workload look like? Are you using nested documents and/or parent/child? It looks like you may be updating documents quite a lot - if so, how is this done?

If I am reading it correctly, it looks like all nodes are now configured with an 8GB heap. Did you set the heap to 16GB when you tried the r4.xlarge instances? Did this make a difference?

Do you have any non-default configuration settings that could impact heap usage?
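On the heap question, a rough sketch for checking per-node heap usage via the nodes stats API (the endpoint is a placeholder and this assumes the domain exposes that API; it only reads values the cluster already reports):

# Rough sketch: report heap used vs. max per node from the nodes stats API.
# ES_ENDPOINT is a placeholder for the AWS ES domain endpoint.
import requests

ES_ENDPOINT = "https://your-domain.region.es.amazonaws.com"

stats = requests.get(ES_ENDPOINT + "/_nodes/stats/jvm").json()
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    print("{}: heap {}% used ({} of {} bytes)".format(
        node.get("name", node_id),
        mem["heap_used_percent"],
        mem["heap_used_in_bytes"],
        mem["heap_max_in_bytes"],
    ))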

We are using nested documents.
As far as updating documents is concerned, we add ~40,000 documents an hour, and that is done using the bulk API with a limit of 100 updates per request.
I did not set the heap size to 16GB when I used r4.xlarge; I was not sure whether that was even possible with AWS Elasticsearch.
We don't have any non-default configuration settings as far as I know. I'll still double-check this and update here.
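As an illustration of the batching described above, a minimal bulk-indexing sketch with the elasticsearch-py client; the endpoint, index, type, and document fields are placeholders rather than the actual schema:

# Minimal sketch of batched bulk indexing (all names are placeholders).
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["https://your-domain.region.es.amazonaws.com"])

docs_to_index = [{"id": 1, "title": "example"}]  # placeholder documents

def actions(docs):
    # Each document becomes an index action; an existing _id is overwritten.
    for doc in docs:
        yield {
            "_op_type": "index",
            "_index": "my_index",
            "_type": "my_type",   # ES 5.x still uses mapping types
            "_id": doc["id"],
            "_source": doc,
        }

# chunk_size=100 mirrors the "100 updates at a time" batching mentioned above.
helpers.bulk(es, actions(docs_to_index), chunk_size=100)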


Hey @Christian_Dahlqvist,
I might be updating a lot more documents than I thought I was.
What I am doing is calling the bulk API to add/update documents and then checking the counts on ES with the count API, and I'm getting a count mismatch. How long does it take for the bulk API updates to be reflected, so that the count API returns the correct count?
Figuring this out will help me avoid redundant updates.

Thanks,
Tushar

When using nested documents, each nested document is stored as a separate document in Elasticsearch, so a deeply nested or large document can correspond to a large number of Lucene documents. When such a document is updated, all of these documents are reindexed, even the parts that have not changed, which ensures that the entire document still resides in a single segment. This can therefore get expensive from a performance perspective when documents are updated frequently.
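To make that concrete, here is a hypothetical mapping with one nested field (the index, type, and field names are made up for illustration, not taken from this cluster):

# Hypothetical example of a nested mapping (illustration only).
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://your-domain.region.es.amazonaws.com"])

es.indices.create(index="nested_example", body={
    "mappings": {
        "my_type": {                      # ES 5.x mapping type
            "properties": {
                "title": {"type": "text"},
                "comments": {
                    "type": "nested",     # each entry is its own Lucene doc
                    "properties": {
                        "author": {"type": "keyword"},
                        "body": {"type": "text"},
                    },
                },
            }
        }
    }
})

# A parent document with 10 comments is stored as 11 Lucene documents; updating
# the title, or a single comment, reindexes the parent and all 10 nested docs.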

How frequently do you update each nested document? What does your data model look like?

Hey @Christian_Dahlqvist,
The problem is solved now.
I was doing too many unnecessary document updates.
In the bulk API call I was sending refresh=false and immediately checking the counts, and the counts did not match.
If the counts did not match, I had a failsafe in place which re-indexed the missing/left-out documents.
I solved this by sending refresh=wait_for in the bulk API call. Now the counts match after the bulk update and the failsafe is not triggered.
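For anyone hitting the same count mismatch, a minimal sketch of the change (placeholder index and documents; elasticsearch-py client assumed):

# Minimal sketch: make the bulk request wait for a refresh so the count API
# sees the new documents immediately (placeholder names throughout).
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["https://your-domain.region.es.amazonaws.com"])

docs = [{"id": 1, "title": "example"}]
actions = ({
    "_op_type": "index",
    "_index": "my_index",
    "_type": "my_type",
    "_id": d["id"],
    "_source": d,
} for d in docs)

# refresh="wait_for" blocks until the indexed documents are searchable, so the
# follow-up count no longer looks short and the failsafe re-index stays quiet.
helpers.bulk(es, actions, chunk_size=100, refresh="wait_for")

print(es.count(index="my_index")["count"])

Compared with refresh=true, wait_for does not force extra refreshes; it simply waits for the next scheduled one, which is gentler on a cluster that is already under indexing load.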

Thanks for helping me out!

Tushar

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.