Do 1-2 nodes with consistently high(er) heap usage indicate a problem?

Hi,

We often see 1-2 nodes that have consistently high heap usage in our cluster. The other nodes typically follow a "saw tooth" pattern where heap usage climbs for a period of time until it is garbage collected.

For example, the heap usage of the pink node in the picture below never has a meaningful decline:

Is this something to be concerned about? If so, are there other metrics I can look at to determine the cause of this?

We're usually not close to hitting the JVM limit. It just seems odd that we usually have an outlier or two.

Yes, I would say that is something to look into, as you want all nodes to show a healthy saw-tooth pattern. Try to identify what sets these nodes apart from the others with respect to configuration, data, hardware, and load.
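
If it helps, a quick way to line the nodes up side by side is the _cat/nodes API. Here is a minimal sketch in Python with the requests library (the http://localhost:9200 address is a placeholder for your cluster, and the choice of Python is just for illustration; any HTTP client against the same endpoint works):

```python
import requests

# Placeholder address; point this at any node in your cluster.
ES = "http://localhost:9200"

# _cat/nodes lets you pick columns; these cover heap, RAM, CPU and load per node.
cols = "name,heap.percent,heap.max,ram.percent,cpu,load_5m,node.role,master"
resp = requests.get(
    f"{ES}/_cat/nodes",
    params={"v": "true", "h": cols, "s": "heap.percent:desc"},
)
resp.raise_for_status()
print(resp.text)
```

Sorting by heap.percent makes it easy to see whether the same node is always at the top, and whether it differs from the others in role or load.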

Thanks.

I looked a little closer and it looks like two of the machines that are usually high share some primary/replica shards. I excluded the other one from the graph above to make the graph a bit clearer.

Do you have any recommended strategies for narrowing down the problem index/data?

My current plan is to swap some of the shards with shards on less busy machines to try to isolate which index/shard is causing the problem.
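
Concretely, this is roughly what I had in mind for moving a suspect shard, using the _cluster/reroute move command (a sketch only; the index name, shard number, and node names below are placeholders, not our real ones):

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Move one suspect shard off the busy node. The index name, shard number and
# node names are hypothetical; real values come from _cat/shards.
body = {
    "commands": [
        {
            "move": {
                "index": "my-suspect-index",
                "shard": 0,
                "from_node": "busy-node-1",
                "to_node": "quiet-node-3",
            }
        }
    ]
}
resp = requests.post(f"{ES}/_cluster/reroute", json=body)
resp.raise_for_status()
print(resp.json().get("acknowledged"))
```

I'd check _cat/shards afterwards to confirm where everything actually ended up.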

If you have monitoring enabled this may give additional clues, especially if there are some indices receiving more load than others. Manually rearranging shards that you suspect could be affecting this could also be a good way to narrow it down.
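
One cheap check along those lines is to pull the per-index totals from the _stats API and see whether any single index dominates indexing or search volume. A rough sketch (again Python + requests with a placeholder address, purely for illustration):

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Per-index totals: indexing/search volume plus memory-resident caches.
indices = requests.get(f"{ES}/_stats").json()["indices"]

rows = []
for name, stats in indices.items():
    total = stats["total"]
    rows.append((
        name,
        total["indexing"]["index_total"],
        total["search"]["query_total"],
        total["fielddata"]["memory_size_in_bytes"],
        total["query_cache"]["memory_size_in_bytes"],
    ))

# Busiest indices by search volume first.
for name, idx_total, query_total, fielddata, query_cache in sorted(
    rows, key=lambda r: r[2], reverse=True
):
    print(f"{name:<30} indexing={idx_total:<12} searches={query_total:<12} "
          f"fielddata_bytes={fielddata:<10} query_cache_bytes={query_cache}")
```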

We do have monitoring set up; however, I haven't been able to identify anything unusual there (except for slightly higher CPU on one of the problem nodes for several weeks).

I'll see if it's possible for me to move the shards around.

What is your opinion on restarting the problem node? I've heard that this is a "bad idea" because we would have to start over with an empty cache.

Two more questions before I rearrange the shards, since I'm not sure what I would look at next after that:

Do you have any recommendations on which metrics I should look at in monitoring?

So far I've compared shards along dimensions like these and haven't found anything that stands out for the problem shard (see the sketch after this list for how I'm pulling the numbers):

  • Elasticsearch search fetch time over time
  • Total searches over time
  • Total indexing requests over time
  • Cache size (query + request) over time
  • # of segments over time
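
For reference, this is roughly how I've been pulling the per-shard numbers (a minimal sketch; the address is a placeholder, and the exact set of _cat/shards columns available may depend on your Elasticsearch version):

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Shard-level view: which shards hold the most memory-resident data
# (segment metadata, fielddata, query cache) and which node they live on.
cols = (
    "index,shard,prirep,node,docs,store,"
    "segments.count,segments.memory,fielddata.memory_size,query_cache.memory_size"
)
resp = requests.get(
    f"{ES}/_cat/shards",
    params={"v": "true", "h": cols, "s": "segments.memory:desc"},
)
resp.raise_for_status()
print(resp.text)
```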

Search load by request count is not higher on one node than the others, because our client picks a node at random.

Could the node have gotten stuck in a "bad state"?

How likely do you think it is that something like an expensive/large indexing request put the node in a "bad state" and it hasn't recovered?

The reason I ask is that the memory is constantly high (at both peak and minimum user traffic), and I don't think we're putting excessive load on that particular shard.

I've read that large indexing requests can cause an expensive GC because the young generation fills up, so lots of objects get promoted to the old generation and a major GC is eventually required. However, that's the kind of thing that should resolve itself within minutes or hours, right?
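
For what it's worth, this is how I was planning to check whether old-gen GC is actually struggling on that node (a sketch against the _nodes/stats/jvm endpoint, with the same placeholder address and Python/requests assumption as above):

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# Per-node JVM stats: heap usage plus young/old GC counts and times.
nodes = requests.get(f"{ES}/_nodes/stats/jvm").json()["nodes"]

for node_id, stats in nodes.items():
    jvm = stats["jvm"]
    old = jvm["gc"]["collectors"]["old"]
    young = jvm["gc"]["collectors"]["young"]
    print(f"{stats['name']}: heap_used={jvm['mem']['heap_used_percent']}% "
          f"old_gc={old['collection_count']} in {old['collection_time_in_millis']} ms, "
          f"young_gc={young['collection_count']} in {young['collection_time_in_millis']} ms")
```

If the problem node shows a much higher old-gen collection time than the others, that would point toward the promotion scenario described above.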
