ES 7.4 GC keeps reclaiming less memory on each pass

Hello, I recently upgraded to 7.4 and I'm having a problem where the heap sawtooth's floor keeps rising: initially GC takes heap usage from 75% down to 20%, but with each additional reclamation the node only gets down to 30%, then 40%, and so on.

  1. Is there a query that will show how much heap each index is using?
  2. How would I go about debugging this? (One starting point is sketched after this list.)
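
One low-tech way to attack (2), assuming the cluster's HTTP endpoint is reachable without request signing, is to poll the nodes stats API and watch whether the post-GC troughs keep creeping upward. A minimal sketch (the base URL is a placeholder):

```python
# Minimal sketch: poll per-node heap usage to watch the sawtooth over time.
# Assumes the cluster endpoint (placeholder below) is reachable without
# request signing; adjust ES_URL for your setup.
import time

import requests

ES_URL = "http://localhost:9200"  # placeholder endpoint

while True:
    stats = requests.get(f"{ES_URL}/_nodes/stats/jvm").json()
    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        print(node["name"],
              f"{mem['heap_used_percent']}% heap used",
              f"({mem['heap_used_in_bytes']} / {mem['heap_max_in_bytes']} bytes)")
    time.sleep(60)  # sample once a minute; watch whether the troughs keep rising
```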

Are you using the monitoring functionality on your cluster?

Not the standard monitoring tools. I'm using AWS's managed Elasticsearch but I'm preparing to move off of it over the next few days. Being unable to get a heap dump has really stymied my debugging.

Ahh yeah, they do have somewhat limited tools in that area unfortunately.

You can try out elastic.co/cloud, as it includes good monitoring out of the box. It'll show things like query and indexing rates, resource usage, and more. You won't get a heap dump though, as it's *aaS, so there's no host access.

However, some other questions that might help (a quick way to pull these is sketched after the list):

  • what size heap?
  • how many nodes, shards, indices?
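
If you can reach the HTTP API, the _cat endpoints will answer these quickly. A sketch (the base URL is a placeholder):

```python
# Sketch: heap per node plus shard/index layout via the _cat APIs.
# ES_URL is a placeholder; adjust for your cluster.
import requests

ES_URL = "http://localhost:9200"

# One row per node: name, current heap %, configured max heap.
print(requests.get(f"{ES_URL}/_cat/nodes?v&h=name,heap.percent,heap.max").text)

# One row per index: primary/replica shard counts and on-disk size.
print(requests.get(f"{ES_URL}/_cat/indices?v&h=index,pri,rep,store.size").text)
```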

Generally 5 nodes (but I've tried 3 as well), 10 shards per index, 18 indexes.

I've tried everything from 5x8 GB nodes (I assume heap is set to 40-50% of that) to 5x32 GB nodes. The larger nodes last longer, I suspect simply due to the bigger heap. However, I have to keep triggering Amazon rollovers every day or so before the cluster crashes (even with twice the machine power I had before the upgrade).

The strange detail is that the total index size is not large (~25 GB) and CPU usage generally remains fairly low (max around 35%). It's a real mystery.

What sort of queries are you running, against what sort of data structure(s)?

The queries are all over the map, at around 2,500 per minute: a lot of aggregations and filter operations (there are 471 distinct Elasticsearch calls). Indexing averages a couple hundred per minute.

The documents are generally fairly small (although there are millions of them) and average a couple dozen fields each (with the exception of one index that has almost a hundred).

I've looked at copious amounts of memory output from the various Elasticsearch stats endpoints and everything looks fine besides the heap-used number. The query cache is only in the megabytes, and fielddata is almost zero.
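
A sketch of one way to pull those cache numbers, for anyone checking the same thing (the base URL is a placeholder):

```python
# Sketch: query cache and fielddata memory per node, to confirm they're small.
# ES_URL is a placeholder.
import requests

ES_URL = "http://localhost:9200"

stats = requests.get(f"{ES_URL}/_nodes/stats/indices/query_cache,fielddata").json()
for node in stats["nodes"].values():
    idx = node["indices"]
    print(node["name"],
          "query_cache:", idx["query_cache"]["memory_size_in_bytes"], "bytes;",
          "fielddata:", idx["fielddata"]["memory_size_in_bytes"], "bytes")
```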

It would be nice if Elasticsearch had a "data associated with this index uses 5 GB of heap" endpoint, but I haven't found one; the nearest proxy seems to be per-index segment memory, sketched below. I'm hopeful there's a single problem index, which would help isolate the issue.
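
A sketch of ranking indices by segment memory (on 7.4, these Lucene segment structures largely live on heap; the base URL is a placeholder):

```python
# Sketch: rank indices by segment memory as a rough per-index heap proxy.
# Covers Lucene segment structures only, not caches or request overhead.
# ES_URL is a placeholder.
import requests

ES_URL = "http://localhost:9200"

stats = requests.get(f"{ES_URL}/_stats/segments").json()
ranked = sorted(
    stats["indices"].items(),
    key=lambda kv: kv[1]["total"]["segments"]["memory_in_bytes"],
    reverse=True,
)
for name, idx in ranked:
    mib = idx["total"]["segments"]["memory_in_bytes"] / (1024 * 1024)
    print(f"{name}: {mib:.1f} MiB of segment memory")
```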

I'm spinning up my own cluster. Will report back on results.

I've been running my own cluster the entire day, side by side with the Amazon cluster, forwarding traffic to both. My cluster is running perfectly while the Amazon cluster can't stay up for more than 5 hours. My cluster also costs over 75% less.

I bet Amazon messed up their jvm options.
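
For anyone wanting to compare, the JVM flags a node actually started with show up in the nodes info API on a cluster you control. A sketch (placeholder base URL; a managed service may not expose this endpoint):

```python
# Sketch: dump the JVM arguments each node started with, to compare GC
# settings between clusters. ES_URL is a placeholder; AWS's managed
# service may not expose this endpoint.
import requests

ES_URL = "http://localhost:9200"

info = requests.get(f"{ES_URL}/_nodes/jvm").json()
for node in info["nodes"].values():
    print(node["name"], node["jvm"]["input_arguments"])
```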

