I have a cluster of around 21 nodes (18 data nodes). I have 9 shards and 1 replica for each index. Whenever there is a network issue (for example, 3 data nodes go down) I get a huge performance drop because of shard relocation, and the load goes to the remaining 15 data nodes. So I end up closing the indexes to make the cluster healthy again.
So I am planning to increase the number of shards to 50. Is that a bad idea? Or is there another way to avoid this performance drop in this scenario? Any suggestions would be really helpful.
I would imagine it won't help. Someone correct me if I am wrong, but the way I understand it is that one of the points of sharding is that it lets you spread the load across multiple servers, so each data node can contribute to indexing and searching data. But if you have far more shards than data nodes, the shards double up on nodes and there is no further performance gain. For example, an index with one 10 GB shard per node and an index with ten 1 GB shards per node would perform similarly, since you are not really spreading the load any further: each node still holds 10 GB of that one index, and splitting it into more chunks doesn't make the same amount of data faster to work with. Having too many shards can actually slow things down, and shards that are too large can slow things down as well. I believe yours are under the normal threshold, though.
With 9 shards and 1 replica, your index has 18 total shards. With 18 data nodes, that means the shards should be spread evenly, one per node, so each data node holds one 4.5 GB shard. Search and indexing requests will be split among all the data nodes at once.
With 50 shards and one replica, each index will have 100 total shards, which means each data node will hold 5-6 shards of that index. Each shard will be smaller, but the total size on each data node will still be the same: you still have 81 GB of data spread across 18 nodes. Increasing the shard count doesn't change how much total data sits on each node, just the size of each individual shard. So if a data node goes down, the same amount of data still needs to be moved to another server.
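To make that arithmetic concrete, here is a quick sketch (the sizes are the rough figures from this thread, not measured values):

```python
# Per-node shard count and data volume for both layouts discussed above.
def per_node(primaries, replicas, total_data_gb, data_nodes):
    total_shards = primaries * (1 + replicas)
    shard_size_gb = total_data_gb / total_shards
    shards_per_node = total_shards / data_nodes
    return shards_per_node, shards_per_node * shard_size_gb

for primaries in (9, 50):
    shards, gb = per_node(primaries, replicas=1, total_data_gb=81, data_nodes=18)
    print(f"{primaries} primaries: ~{shards:.1f} shards/node, ~{gb:.1f} GB/node")

# Both layouts come out to ~4.5 GB per node, so a node failure still forces
# the same volume of data to relocate.
```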
So assuming you don't have extremely large shards, if you have more data nodes than shards, increasing the shard count may help performance since it spreads the index across more nodes. But once you have at least as many shards as data nodes, increasing the shard count probably won't help.
An average shard size of 4.5 GB seems like a good number, and given that you already have 3000 shards in the cluster, I would not recommend increasing the number of shards per index. Increasing the number of shards in the cluster does not reduce the amount of data that needs to be recovered when nodes fail.
Are these network issues common? How long do they typically last?
I would recommend upgrading to at least version 1.7.x, as you could benefit from synced flush to speed up recovery. If your network issues are usually resolved reasonably quickly, you might also be able to reduce unnecessary reallocation by delaying shard allocation when a node leaves.
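As a rough sketch of what that looks like (assuming a node reachable at http://localhost:9200 and the Python requests library; the delayed-allocation setting is the one introduced alongside 1.7):

```python
import requests

ES = "http://localhost:9200"  # assumption: any reachable client/data node

# Synced flush writes a sync_id marker so unchanged shard copies can be reused
# on recovery instead of being copied over the network again.
resp = requests.post(f"{ES}/_flush/synced")
print(resp.json())

# Delay reallocation of shards from a departed node. If the node comes back
# within the timeout, its local shard copies are simply reused.
resp = requests.put(
    f"{ES}/_all/_settings",
    json={"settings": {"index.unassigned.node_left.delayed_timeout": "5m"}},
)
print(resp.json())
```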
Be aware that delaying recovery could cause problems if additional nodes were to go down at a later stage since you would be operating with a reduced set of replicas for a period of time. Increasing the number of replicas could help avoid this, but will naturally also increase the amount of data held on each node.
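If you did decide to add a replica, the count can be changed on a live index; a minimal sketch, using a placeholder index name and the same assumed endpoint as above:

```python
import requests

ES = "http://localhost:9200"

# More copies survive a multi-node outage, at the cost of more disk per node.
# "my-index" is a placeholder; apply to whichever indexes matter most.
resp = requests.put(
    f"{ES}/my-index/_settings",
    json={"settings": {"index.number_of_replicas": 2}},
)
print(resp.json())
```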