Do 1-2 nodes with consistently high(er) heap usage indicate a problem?

maxshortzp · January 23, 2019, 2:09am

Hi,

We often see 1-2 nodes that have consistently high heap usage in our cluster. The other nodes typically follow a "saw tooth" pattern where heap usage climbs for a period of time until it is garbage collected.

For example, the heap usage of the pink node in the picture below never has a meaningful decline:

Is this something to be concerned about? If so, are there other metrics I can look at to determine the cause of this?

We're usually not close to hitting the JVM limit. It just seems odd that we usually have an outlier or two.
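One way to make "never has a meaningful decline" concrete is to measure the largest drop in each node's heap-usage series and flag nodes where that drop stays small. The following is only a minimal sketch: the function names, the sample numbers, and the 15-point threshold are assumptions for illustration, not values from this cluster.

```python
from typing import Dict, List


def largest_drop(samples: List[float]) -> float:
    """Largest fall from a running peak to any later sample (in percentage points)."""
    peak = float("-inf")
    best = 0.0
    for value in samples:
        peak = max(peak, value)
        best = max(best, peak - value)
    return best


def flat_heap_nodes(series_by_node: Dict[str, List[float]],
                    min_drop: float = 15.0) -> List[str]:
    """Names of nodes whose heap usage never drops by at least min_drop points."""
    return sorted(
        node for node, samples in series_by_node.items()
        if largest_drop(samples) < min_drop
    )


# Made-up heap_used_percent samples, purely illustrative.
example_series = {
    "node-a": [40, 55, 70, 35, 50, 68, 33],     # saw-tooth
    "node-pink": [70, 72, 71, 73, 72, 74],      # never really declines
}

if __name__ == "__main__":
    print(flat_heap_nodes(example_series))
```

The threshold would need tuning against what a normal collection looks like on the healthy nodes.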

Christian_Dahlqvist · January 23, 2019, 2:50am

Yes, I would say that is something to look into, as you want all nodes to have a healthy saw-tooth pattern. Try to identify what sets these nodes apart from the others with respect to configuration, data, hardware and load.

maxshortzp · January 24, 2019, 3:08am

Thanks.

I looked a little closer, and it looks like two of the machines that are usually high share some primary/replica shards. I excluded the other one from the graph above to make the graph a bit clearer.

Do you have any recommended strategies for narrowing down the problem index/data?

My current plan is to swap some of the shards with shards on less busy machines to try to isolate which index/shard is causing the problem.
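For reference, a manual shard move goes through the cluster reroute API (`POST /_cluster/reroute`) with a `move` command. Below is a minimal sketch that only builds the request body; the index and node names are hypothetical placeholders, and the helper function is not part of any client library.

```python
import json
from typing import Any, Dict


def build_move_command(index: str, shard: int,
                       from_node: str, to_node: str) -> Dict[str, Any]:
    """Build a request body for POST /_cluster/reroute that moves one shard copy."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


# Hypothetical names, for illustration only.
body = build_move_command("my-index", 0, "node-busy", "node-quiet")

if __name__ == "__main__":
    print(json.dumps(body, indent=2))
```

Sending the same body with the `dry_run` query parameter first should show the resulting cluster state without applying it. Note that the cluster may rebalance shards again afterwards, so the move is worth verifying once it completes.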

Christian_Dahlqvist · January 24, 2019, 6:34am

If you have monitoring enabled, it may give additional clues, especially if some indices are receiving more load than others. Manually rearranging the shards that you suspect are involved could also be a good way to narrow it down.

maxshortzp · January 24, 2019, 6:38pm

We do have monitoring set up; however, I haven't been able to identify anything unusual there (except for slightly higher CPU on one of the problem nodes for several weeks).

I'll see if it's possible for me to move the shards around.

What is your opinion on restarting the problem node? I've heard that this is a "bad idea" as we would have to start over with an empty cache.

maxshortzp · January 24, 2019, 6:54pm

Two more questions before I rearrange the shards, as I'm not sure what I would look at next afterwards:

Do you have any recommendations on what metrics I should look at on monitoring?

So far I've compared shards along the following dimensions, and nothing stands out for the problem shard:

Elasticsearch search fetch time over time
Total searches over time
Total indexing requests over time
Cache size (query + request) over time
# of segments over time

Search load by request count is not higher on one node than on the others, because our client picks a node at random.
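Beyond the per-shard graphs listed above, the node stats API (`GET _nodes/stats/indices,jvm`) reports several heap consumers per node, which can be lined up side by side. The following is a rough sketch that summarizes an already-fetched response. The field paths are the ones I believe the node stats response uses (they may differ between versions, so treat them as assumptions), and the sample data is made up.

```python
from typing import Any, Dict, List

# Label -> dotted path inside each node's stats object (assumed field names).
FIELDS = {
    "heap_used_percent": "jvm.mem.heap_used_percent",
    "segments_bytes": "indices.segments.memory_in_bytes",
    "fielddata_bytes": "indices.fielddata.memory_size_in_bytes",
    "query_cache_bytes": "indices.query_cache.memory_size_in_bytes",
    "request_cache_bytes": "indices.request_cache.memory_size_in_bytes",
}


def get_path(doc: Any, path: str, default: Any = 0) -> Any:
    """Follow a dotted path through nested dicts, returning default if missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def summarize_heap_consumers(stats: Dict[str, Any]) -> List[Dict[str, Any]]:
    """One row per node, sorted by heap_used_percent, highest first."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        row = {"node": node.get("name", node_id)}
        for label, path in FIELDS.items():
            row[label] = get_path(node, path)
        rows.append(row)
    return sorted(rows, key=lambda r: r["heap_used_percent"], reverse=True)


# Made-up response fragment, for illustration only.
sample_stats = {
    "nodes": {
        "id1": {"name": "node-a",
                "jvm": {"mem": {"heap_used_percent": 48}},
                "indices": {"segments": {"memory_in_bytes": 900},
                            "fielddata": {"memory_size_in_bytes": 100}}},
        "id2": {"name": "node-pink",
                "jvm": {"mem": {"heap_used_percent": 74}},
                "indices": {"segments": {"memory_in_bytes": 950},
                            "fielddata": {"memory_size_in_bytes": 4000}}},
    }
}

if __name__ == "__main__":
    for row in summarize_heap_consumers(sample_stats):
        print(row)
```

If one of these columns is far larger on the outlier nodes than elsewhere, that narrows down which kind of data structure is holding the heap.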

Could the node have gotten stuck in a "bad state?"

How likely do you think it is that something like an expensive/large indexing request put the node in a "bad state" and it hasn't recovered?

The reason I ask is that the memory is constantly high (at both maximum and minimum user traffic) and I don't think that we're putting an excessive load on that particular shard.

I've read that large indexing requests can cause an expensive GC: the young generation fills up, so lots of objects are promoted to the old generation, and then a major GC is required. However, that's the kind of thing that should resolve itself in a matter of minutes or hours, right?
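One way to check whether old-generation collections are actually running on the problem node is to compare the GC counters from two node stats snapshots (`GET _nodes/stats/jvm`) taken some time apart. Here is a minimal sketch, assuming the per-node object exposes `jvm.gc.collectors.<name>.collection_count` and `collection_time_in_millis`; the snapshot values below are invented.

```python
from typing import Any, Dict


def gc_delta(before: Dict[str, Any], after: Dict[str, Any],
             collector: str = "old") -> Dict[str, float]:
    """Collections run and time spent by one GC collector between two snapshots."""
    b = before["jvm"]["gc"]["collectors"][collector]
    a = after["jvm"]["gc"]["collectors"][collector]
    count = a["collection_count"] - b["collection_count"]
    millis = a["collection_time_in_millis"] - b["collection_time_in_millis"]
    return {
        "collections": count,
        "time_ms": millis,
        "avg_ms": millis / count if count else 0.0,
    }


def snapshot(count: int, millis: int) -> Dict[str, Any]:
    """Helper that builds a made-up per-node stats fragment for the example."""
    return {"jvm": {"gc": {"collectors": {
        "old": {"collection_count": count, "collection_time_in_millis": millis}
    }}}}


if __name__ == "__main__":
    # Invented numbers: four old-gen collections taking 2000 ms in total.
    print(gc_delta(snapshot(10, 5000), snapshot(14, 7000)))
```

A node with high heap and no old-generation collections over a long window is in a different situation from one that collects frequently but reclaims little, so the two cases are worth telling apart.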

system · February 21, 2019, 6:54pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
