Troubleshooting high heap usage

I have looked at many posts in the discussion forum and googled quite a bit as well. Either those posts are for older versions, or nothing in them matched the issue I am seeing. Hence, posting it one more time. Apologies if there is something really obvious that I am overlooking.

There are 5 data nodes in our cluster and data seems to be distributed evenly, but one or two data nodes consistently show high heap usage (85-90%). Can someone help me out with troubleshooting steps? A few things I have looked at (the corresponding API calls are sketched after this list):

  1. Cluster health / unassigned shards: 0
  2. Number of pending tasks: 0
  3. Script stats: look normal
  4. Field data: between 10-15% of memory
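For completeness, these are roughly the requests behind each of those checks, run with curl against one node (assuming the default port 9200; adjust host/port for your setup):

```bash
# 1. Cluster health, including the unassigned shard count
curl -s 'localhost:9200/_cluster/health?pretty'

# 2. Pending cluster-level tasks
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'

# 3. Per-node script compilation and cache stats
curl -s 'localhost:9200/_nodes/stats/script?pretty'

# 4. Fielddata memory per node
curl -s 'localhost:9200/_cat/fielddata?v'
```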

Data node configuration: 16 cores, 30 GB RAM; 15 GB heap for ES, with the remaining 15 GB left to the OS for Lucene's filesystem cache
ES version: 5.3.2
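In case it matters, the 15 GB heap is configured the standard way for 5.x, via jvm.options (the path below assumes a package install; yours may differ):

```bash
# Show the heap settings; -Xms and -Xmx are kept equal as recommended
grep -E '^-Xm[sx]' /etc/elasticsearch/jvm.options
# -Xms15g
# -Xmx15g
```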

I couldn't interpret anything meaningful from the other stats. Please see the node stats in my OneDrive.

We don't have X-Pack, and installing it in production is out of scope as of now. Let me know if you need more info.

What is the full output of the cluster stats API?
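(That is, something like the following, assuming the default host/port:)

```bash
curl -s 'localhost:9200/_cluster/stats?human&pretty'
```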

I had to reboot the node. Please find the cluster stats output here.

We are working on reducing the number of shards, but that does not seem to be the issue, since the other nodes are holding up fine (we have a cluster with over 25k shards that is also doing alright).
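For what it's worth, this is how we eyeballed the per-node shard and disk distribution (default port assumed):

```bash
# Shard count, disk used, and disk available for each data node
curl -s 'localhost:9200/_cat/allocation?v'
```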

Are you doing a lot of updates to existing documents? I see a lot of deleted docs reported in your cluster stats.
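A quick way to see where the deleted docs are concentrated, if you want to check (default port assumed):

```bash
# Live vs. deleted doc counts per index
curl -s 'localhost:9200/_cat/indices?v&h=index,docs.count,docs.deleted'
```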

I ran into a similar issue recently where one or two data nodes would have high heap usage, along with the high CPU that comes with constant garbage collection.

The culprit was lots of updates to existing documents.

@loren We update in batches (bulk). Our updates essentially overwrite the whole document (no partial updates). Since ES handles an update as a delete followed by an index, I guess we do end up with a lot of deleted documents.
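For context, our batches look roughly like this (the index name, type, and IDs here are made up):

```bash
# bulk.ndjson -- each "index" action fully replaces the document with that _id:
#   {"index":{"_index":"myindex","_type":"doc","_id":"1"}}
#   {"field1":"new value","field2":42}
curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'localhost:9200/_bulk' --data-binary @bulk.ndjson
```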

In your thread you seemed to be using X-Pack. Are there particular performance counters / metrics that you looked at? We don't use X-Pack, but we have installed the Telegraf plugin.

In my case X-Pack was not much help in diagnosing the problem, as it doesn't graph per-node merge rates. I used iostat on the busy node, saw that its write rate was 4x that of the other nodes, and took a guess that the problem was frequent merges. I reduced the number of updates by a wide margin, and that solved my problem.
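In case it's useful, the checks were along these lines (device names and intervals will differ on your hardware):

```bash
# Extended device stats every 5 seconds -- compare the write rate (wkB/s)
# on the busy node against the quiet ones
iostat -x 5

# Per-node index stats; the "merges" section shows merge counts and time
curl -s 'localhost:9200/_nodes/stats/indices?pretty'
```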

Good luck!
