One node in cluster is using (a lot) more heap space and cpu

(Paul) #1

We have an ES (version 1.2.4) cluster with 5 nodes. During stress tests on the cluster I notice that one node uses much more (33-50%) heap space and cpu than the other nodes. Eventually this node uses almost all of its heap, causing slow queries, while the other nodes are still fine.

As we use a load balancer in front of ES, all requests are spread evenly across the nodes (checked with access logs).

Has anyone seen such behaviour before, or could you maybe point me in the right direction as to what to investigate?

(Nik Everett) #2

Sure. This can be caused by uneven shard distribution. Elasticsearch will try to spread the shards out for each index reasonably evenly, but it can't always do it. Check the _cat/shards api and see if the allocation is unbalanced. You can then control allocation somewhat by setting the allocation settings described here. You can also control the number of shards at index creation time.
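To make that check concrete, here is a small sketch that counts shards per node. The `_cat/shards` sample below is made up; real output has the same index/shard/prirep/state/docs/store/ip/node columns:

```shell
# Hypothetical capture of: curl 'localhost:9200/_cat/shards'
# Columns: index shard prirep state docs store ip node
cat <<'EOF' > /tmp/shards.txt
logs 0 p STARTED 1000 10mb 10.0.0.1 node1
logs 1 p STARTED 1000 10mb 10.0.0.1 node1
logs 2 p STARTED 1000 10mb 10.0.0.2 node2
EOF

# Count shards per node; a skewed count points at unbalanced allocation
awk '{count[$NF]++} END {for (n in count) print n, count[n]}' /tmp/shards.txt | sort
```

If one node ends up with noticeably more shards than the rest, the allocation settings (or the per-index shard count) are worth revisiting.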

This could also be caused by something (usually _routing) pushing more data to one shard than to the others. If one shard is much larger than the others it will consume more resources. You can check that same _cat/shards api for relative shard sizes. Since queries can also be routed, you should check to make sure you aren't doing that. Or if you are, then that is probably what is going on.
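Relative shard sizes come from the same output. As a sketch (again with made-up sample data), the store column makes a shard that routing has overloaded stand out immediately:

```shell
# Hypothetical capture of: curl 'localhost:9200/_cat/shards'
# Columns: index shard prirep state docs store ip node
cat <<'EOF' > /tmp/shards.txt
logs 0 p STARTED 9000 90mb 10.0.0.1 node1
logs 1 p STARTED 1000 10mb 10.0.0.2 node2
EOF

# Print shard number and on-disk size; shard 0 is clearly the hot one here
awk '{print $2, $6}' /tmp/shards.txt

# A routed search only hits the shard its routing value hashes to, e.g.:
#   curl 'localhost:9200/logs/_search?routing=user42' -d '{"query":{"match_all":{}}}'
# (the index name and routing value are illustrative)
```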

(Kyle Lahnakoski) #3

I have seen this happen during bulk loading [1]: one node has the majority of primary shards and will crash as a result.

[1] Reducing Memory Consumption while Bulk Indexing

(Paul) #4

I've checked the shards: they are spread evenly across the nodes and are all roughly the same size.

However, I've been testing with random search terms. Now I ran some tests with just one term every time, and I see that it indeed depends on the term which node uses more cpu/heap. There were also a few search terms for which the heap didn't grow excessively.

What still troubles me, though, is that the heap can grow a lot for just one search term. Shouldn't ES be able to reuse its resources for such a term? It shouldn't need to cache more for subsequent requests, right?

Also I see that when I stop the stress test, heap usage drops somewhat, but it stays large, and most of it sits in the 'old' generation.
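One common cause of exactly this pattern on ES 1.x is fielddata: sorting or faceting on a field loads its values onto the heap and keeps them cached, so the heap stays high (and tenured into the old generation) after the load stops. As a sketch (the sample output below is made up), per-node fielddata usage can be read from `_cat/fielddata`:

```shell
# Hypothetical capture of: curl 'localhost:9200/_cat/fielddata?v'
# Columns: id host ip node total field size
cat <<'EOF' > /tmp/fielddata.txt
abc host1 10.0.0.1 node1 1.2gb title 1.2gb
def host2 10.0.0.2 node2 80mb title 80mb
EOF

# Node and total fielddata held on heap; a lopsided total matches a lopsided heap
awk '{print $4, $5}' /tmp/fielddata.txt
```

If this is the culprit, a cap such as `indices.fielddata.cache.size: 40%` in elasticsearch.yml bounds the cache, at the cost of evictions.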

I should've been clearer in my first post: the stress test is for search only, no indexing is happening at this time (we also haven't seen any issues with our bulk indexing so far).

(system) #5