ES JVM pressure spike and garbage collection issue

Hi,
We have been using AWS ES for the past year. For the past one or two weeks we have been seeing spikes in JVM pressure (and no garbage collection), resulting in writes being blocked. I need help understanding why garbage collection is not happening.
I have tried the following instance types: r4.xlarge and r4.large.
The only notable change over the past few weeks is an increase in searchable documents from 400 million to 500 million.
Before the increase in the number of documents we were not seeing these JVM pressure spikes.
Document Count: 509883131
Size: 362.26 GB
Elasticsearch version: 5.1


Can you provide the full output of the cluster stats API to give us a better view of the status of the cluster?
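In case it helps, here is a minimal sketch of pulling that output with Python and the requests library; the endpoint URL is a placeholder for your AWS ES domain, and curl or any other HTTP client works just as well:

# Minimal sketch: fetch cluster stats from an Elasticsearch 5.x domain.
# ES_ENDPOINT is a placeholder; substitute your AWS ES domain endpoint.
import json
import requests

ES_ENDPOINT = "https://your-domain.region.es.amazonaws.com"

# ?human adds readable size/time strings alongside the raw byte values.
resp = requests.get(ES_ENDPOINT + "/_cluster/stats?human")
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))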

Hey Christian,
Thank you for replying.
This is the full output of the cluster stats API right now (when the JVM pressure is 61.9). I will post it again when the JVM pressure shoots past 75.

{
  "_nodes": {
    "total": 4,
    "successful": 4,
    "failed": 0
  },
  "cluster_name": "cluster name",
  "timestamp": 1510910673574,
  "status": "green",
  "indices": {
    "count": 2,
    "shards": {
      "total": 42,
      "primaries": 21,
      "replication": 1,
      "index": {
        "shards": {
          "min": 2,
          "max": 40,
          "avg": 21
        },
        "primaries": {
          "min": 1,
          "max": 20,
          "avg": 10.5
        },
        "replication": {
          "min": 1,
          "max": 1,
          "avg": 1
        }
      }
    },
    "docs": {
      "count": 510196018,
      "deleted": 228488630
    },
    "store": {
      "size": "730gb",
      "size_in_bytes": 783917412586,
      "throttle_time": "0s",
      "throttle_time_in_millis": 0
    },
    "fielddata": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "45.1mb",
      "memory_size_in_bytes": 47325120,
      "total_count": 1077798,
      "hit_count": 526476,
      "miss_count": 551322,
      "cache_size": 30328,
      "cache_count": 30328,
      "evictions": 0
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 1150,
      "memory": "6.1gb",
      "memory_in_bytes": 6577349073,
      "terms_memory": "6gb",
      "terms_memory_in_bytes": 6508660713,
      "stored_fields_memory": "57mb",
      "stored_fields_memory_in_bytes": 59838352,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "2.1mb",
      "norms_memory_in_bytes": 2219968,
      "points_memory": "5.3mb",
      "points_memory_in_bytes": 5633344,
      "doc_values_memory": "973.3kb",
      "doc_values_memory_in_bytes": 996696,
      "index_writer_memory": "0b",
      "index_writer_memory_in_bytes": 0,
      "version_map_memory": "20kb",
      "version_map_memory_in_bytes": 20550,
      "fixed_bit_set": "176.1mb",
      "fixed_bit_set_memory_in_bytes": 184690496,
      "max_unsafe_auto_id_timestamp": -1,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 4,
      "data": 4,
      "coordinating_only": 0,
      "master": 4,
      "ingest": 4
    },
    "versions": [
      "5.1.1"
    ],
    "os": {
      "available_processors": 16,
      "allocated_processors": 16,
      "names": [
        {
          "count": 4
        }
      ],
      "mem": {
        "total": "62.6gb",
        "total_in_bytes": 67314974720,
        "free": "607.1mb",
        "free_in_bytes": 636645376,
        "used": "62gb",
        "used_in_bytes": 66678329344,
        "free_percent": 1,
        "used_percent": 99
      }
    },
    "process": {
      "cpu": {
        "percent": 10
      },
      "open_file_descriptors": {
        "min": 645,
        "max": 650,
        "avg": 647
      }
    },
    "jvm": {
      "max_uptime": "2.3h",
      "max_uptime_in_millis": 8411962,
      "mem": {
        "heap_used": "19.1gb",
        "heap_used_in_bytes": 20580010792,
        "heap_max": "31.8gb",
        "heap_max_in_bytes": 34220277760
      },
      "threads": 328
    },
    "fs": {
      "total": "1.1tb",
      "total_in_bytes": 1246598791168,
      "free": "430.1gb",
      "free_in_bytes": 461827280896,
      "available": "371gb",
      "available_in_bytes": 398409404416
    },
    "network_types": {
      "transport_types": {
        "netty4": 4
      },
      "http_types": {
        "filter-jetty": 4
      }
    }
  }
}

Thanks,
Tushar.

As the graph shows, it looks like the nodes are suffering from heap pressure. A significant portion, about 3GB per node, is taken up by segment and terms memory, but that does not fully explain it. What does your use-case and workload look like? Are you using nested documents and/or parent/child? It looks like you may be updating documents quite a lot - if so, how is this done?

If I am reading it correctly, it looks like all nodes are now configured with an 8GB heap. Did you set the heap to 16GB when you tried the r4.xlarge instances? Did this make a difference?

Do you have any non-default configuration settings that could impact heap usage?
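On the heap question, a rough sketch for checking per-node heap usage via the nodes stats API (the endpoint is a placeholder and this assumes the domain exposes that API; it only reads values the cluster already reports):

# Rough sketch: report heap used vs. max per node from the nodes stats API.
# ES_ENDPOINT is a placeholder for the AWS ES domain endpoint.
import requests

ES_ENDPOINT = "https://your-domain.region.es.amazonaws.com"

stats = requests.get(ES_ENDPOINT + "/_nodes/stats/jvm").json()
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    print("{}: heap {}% used ({} of {} bytes)".format(
        node.get("name", node_id),
        mem["heap_used_percent"],
        mem["heap_used_in_bytes"],
        mem["heap_max_in_bytes"],
    ))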

We are using nested documents.
As far as updating documents is concerned, we add ~40,000 documents an hour, and that is done using the bulk API with a limit of 100 updates per request.
I did not set the heap size to 16GB when I used r4.xlarge; I was not sure whether that was even possible with AWS Elasticsearch.
We don't have any non-default configuration settings as far as I know. I'll still double-check this and update here.
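As an illustration of the batching described above, a minimal bulk-indexing sketch with the elasticsearch-py client; the endpoint, index, type, and document fields are placeholders rather than the actual schema:

# Minimal sketch of batched bulk indexing (all names are placeholders).
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["https://your-domain.region.es.amazonaws.com"])

docs_to_index = [{"id": 1, "title": "example"}]  # placeholder documents

def actions(docs):
    # Each document becomes an index action; an existing _id is overwritten.
    for doc in docs:
        yield {
            "_op_type": "index",
            "_index": "my_index",
            "_type": "my_type",   # ES 5.x still uses mapping types
            "_id": doc["id"],
            "_source": doc,
        }

# chunk_size=100 mirrors the "100 updates at a time" batching mentioned above.
helpers.bulk(es, actions(docs_to_index), chunk_size=100)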


Hey @Christian_Dahlqvist,
I might be updating a lot more documents than I thought I was.
What I am doing is calling the bulk API to add/update documents and then checking the counts on ES with the count API, and I'm getting a count mismatch. How long does it take for the bulk API updates to be reflected, so that the count API returns the correct count?
Figuring this out will help me avoid redundant updates.

Thanks,
Tushar

When using nested documents, each nested document is stored as a separate document in Elasticsearch, so a deeply nested or large document can correspond to a large number of Lucene documents. When such a document is updated, all of these documents are reindexed, even the parts that have not changed, which ensures that the entire document still resides in a single segment. This can therefore get expensive from a performance perspective when documents are updated frequently.
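To make that concrete, here is a hypothetical mapping with one nested field (the index, type, and field names are made up for illustration, not taken from this cluster):

# Hypothetical example of a nested mapping (illustration only).
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://your-domain.region.es.amazonaws.com"])

es.indices.create(index="nested_example", body={
    "mappings": {
        "my_type": {                      # ES 5.x mapping type
            "properties": {
                "title": {"type": "text"},
                "comments": {
                    "type": "nested",     # each entry is its own Lucene doc
                    "properties": {
                        "author": {"type": "keyword"},
                        "body": {"type": "text"},
                    },
                },
            }
        }
    }
})

# A parent document with 10 comments is stored as 11 Lucene documents; updating
# the title, or a single comment, reindexes the parent and all 10 nested docs.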

How frequently do you update each nested document? What does your data model look like?

Hey @Christian_Dahlqvist,
The problem is solved now.
I was doing too many unnecessary document updates.
In the bulk API call I was sending refresh=false and immediately checking the counts, and the counts did not match.
If the counts did not match, I had a failsafe in place which re-indexed the missing/left-out documents.
I solved this by sending refresh=wait_for in the bulk API call. Now the counts match after the bulk update and the failsafe is not triggered.
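For anyone hitting the same count mismatch, a minimal sketch of the change (placeholder index and documents; elasticsearch-py client assumed):

# Minimal sketch: make the bulk request wait for a refresh so the count API
# sees the new documents immediately (placeholder names throughout).
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["https://your-domain.region.es.amazonaws.com"])

docs = [{"id": 1, "title": "example"}]
actions = ({
    "_op_type": "index",
    "_index": "my_index",
    "_type": "my_type",
    "_id": d["id"],
    "_source": d,
} for d in docs)

# refresh="wait_for" blocks until the indexed documents are searchable, so the
# follow-up count no longer looks short and the failsafe re-index stays quiet.
helpers.bulk(es, actions, chunk_size=100, refresh="wait_for")

print(es.count(index="my_index")["count"])

Compared with refresh=true, wait_for does not force extra refreshes; it simply waits for the next scheduled one, which is gentler on a cluster that is already under indexing load.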

Thanks for helping me out!

Tushar

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.