Shards per node, Heap, 100% CPU....help please

Hi, I have a 2 node cluster with 1500 indices. Indices persist over time. This was set up by a third party... but now I am having the problem that the master node becomes unresponsive (100% CPU) and the only way to recover it is by restarting the elasticsearch service.

Each index has 5 shards and 1 replica.... The cluster has worked very well for the last 5 years... it is around 50GB in size...

I was reading about shards and came across a post recommending 20 shards per 1GB of heap on a node. Now I realise my setup far exceeds this.... (13000 shards on 2 nodes) Each node has a 4GB heap.
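Just to spell out the arithmetic (using the 20-shards-per-GB rule of thumb and the numbers above; the exact figures may differ slightly from your observed count since not every index may have identical settings):

```python
# Rough shard-budget arithmetic based on the numbers in this thread.
heap_gb_per_node = 4        # each node has a 4 GB heap
nodes = 2
shards_per_gb_heap = 20     # commonly cited rule of thumb

recommended_max = heap_gb_per_node * shards_per_gb_heap * nodes
print(recommended_max)      # 160 shards for the whole cluster

indices = 1500
primaries_per_index = 5
replicas = 1
actual = indices * primaries_per_index * (1 + replicas)
print(actual)               # 15000 shards (primaries + replicas)
print(actual / recommended_max)  # ~94x over the guideline
```

So the cluster is holding roughly two orders of magnitude more shards than the guideline suggests for this heap size.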

The frequency of the cluster becoming unresponsive is slowly increasing. I am in the process of migrating to the latest version of Elasticsearch in the hope that it will help resolve the problem... although I also really need to upgrade to stay within the support periods... I'm on 1.7 at the moment.

So I have been reading about setups/config etc and trying to work out how I can improve the current setup...

a) Could the number of shards relative to heap size be causing the problem I am seeing, given that more data and indices are being added over time?
b) Would decreasing the number of shards per index be the way to go?

Any help would be appreciated....

Yes. Having lots of very small shards is very inefficient and can cause performance issues as well as severe stability problems.

Yes, that will reduce the rate at which the shard count increases. You do however also need to dramatically reduce the existing shard count, and as you are on such an old version your options are, as far as I can remember, limited to deleting indices or reindexing them into fewer larger ones.
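On a modern cluster the "reindex into fewer larger indices" part can be done server-side with the `_reindex` API (available from Elasticsearch 2.3 onwards, so it works on the new 7.13 cluster but not on 1.7). A sketch with hypothetical index names:

```shell
# Sketch only: consolidate several small indices matching a pattern
# into one larger index. "logs-2021-*" and "logs-2021" are hypothetical
# names -- substitute your own. Requires Elasticsearch 2.3+.
curl -X POST "localhost:9200/_reindex?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "source": { "index": "logs-2021-*" },
    "dest":   { "index": "logs-2021" }
  }'
```

On 1.7 the equivalent has to be done client-side, e.g. scroll the old indices and bulk-index into the new ones, which matches the external export/re-import approach described later in this thread.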

Hi Christian, thank you for the very quick reply...

What would you suggest is the correct number of shards per index in my scenario... or at least your best guess? I assume I want at least 2? But should it be 3/4?

It is also worth noting that although the templates of the indices are all the same and the type of data they store is the same... some are virtually empty and others are "very" large... i.e. their sizes are not in any way consistent... is there some strategy I should deploy here?
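Since the move is to 7.13 anyway, one common approach is to default new indices to a single primary shard (which is the 7.x default) via an index template, and only raise the shard count for the indices known to grow large. A sketch with hypothetical template and pattern names (the composable `_index_template` API exists from 7.8 onwards):

```shell
# Sketch: give new indices matching a pattern 1 primary shard and 1 replica.
# "small-indices" and "myapp-*" are hypothetical names -- substitute your own.
curl -X PUT "localhost:9200/_index_template/small-indices?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "index_patterns": ["myapp-*"],
    "template": {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }
  }'
```

A single shard comfortably handles tens of GB of data, so with indices in this size range most of them likely need only one primary shard.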

And finally, I am exporting the current data from 1.7.1, restructuring (manually re-indexing) the docs externally, and then importing it into 7.13... so that should work well for this. I will leave the old cluster as is... but set up the new one with the "more appropriate" settings.

Thanks again!

Hi @kk123

Here is a great set of documents that talks about shard sizes and how to fix an oversharded cluster


Thanks @stephenb I'll certainly give that a read....