I have an index with 21 primary shards and 2 replicas that has grown to the point where each shard is between 90 and 120 GB. It is spread across 9 nodes, which currently sit at around 37-40% disk usage.
I attempted a split operation that increased the number of primary shards to 168 (a factor of 8), with the goal of reducing shard size to around 10-15 GB. The split created and started all 168 primary shards, but each one was still the size of its original shard. The replica shards were never allocated, and the shards did not shrink. I reduced the number of replicas to 0, and the cluster then began moving the large shards around to rebalance. I could see that this was going to take forever given the shard sizes, and that it could potentially make the cluster run out of disk space, since I was now copying all of that data.
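For reference, here is the back-of-the-envelope arithmetic behind those numbers, as a minimal sketch. It assumes documents end up evenly distributed across the new shards, and the function names are just mine for illustration.

```python
def size_after_split(min_gb, max_gb, factor):
    """Per-shard size range once documents that belong elsewhere are gone.

    Assumes documents are spread evenly across the new shards.
    """
    return (min_gb / factor, max_gb / factor)


def apparent_primary_total(primaries, min_gb, max_gb, factor):
    """Total reported primary size if every new shard still shows the
    size of the shard it was split from (what I saw right after the split)."""
    new_primaries = primaries * factor
    return (new_primaries * min_gb, new_primaries * max_gb)


PRIMARIES = 21          # primary shards before the split
MIN_GB, MAX_GB = 90, 120  # current shard size range
FACTOR = 8              # split factor I attempted

target = size_after_split(MIN_GB, MAX_GB, FACTOR)
apparent = apparent_primary_total(PRIMARIES, MIN_GB, MAX_GB, FACTOR)

print(f"primaries after split: {PRIMARIES * FACTOR}")
print(f"target shard size: {target[0]:.2f}-{target[1]:.2f} GB")
print(f"primary data before split: {PRIMARIES * MIN_GB}-{PRIMARIES * MAX_GB} GB")
print(f"reported primary data right after split: {apparent[0]}-{apparent[1]} GB")
```

The last figure is what worries me: if every one of those full-size shards really has to be copied between nodes, that is roughly 8 times the original primary data on the move.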
I believe the shard size is expected, since the split must be creating duplicates of the data and then deleting the documents that aren't needed in each shard. However, I don't think my cluster has enough resources to handle a split by a factor of 8. Does anyone know whether I would have more luck splitting by a factor of 2 and then repeating that a couple more times, each time after merging has cleaned up the extra duplicate documents?
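To make the staged approach concrete, here is a rough sketch of what I have in mind. It only computes the plan and builds the REST requests as plain tuples; it does not send anything. The index names (`my_index`, `my_index_x2`, ...) are hypothetical, the size estimates assume an even document distribution, and using a force merge with `only_expunge_deletes` as the cleanup step between splits is my assumption, not something I have verified helps.

```python
def staged_split_plan(primaries, factor_per_step, steps, min_gb, max_gb):
    """Return (primaries, min_gb, max_gb) after each step, assuming the
    deleted documents have been merged away before the next split."""
    plan = []
    for _ in range(steps):
        primaries *= factor_per_step
        min_gb /= factor_per_step
        max_gb /= factor_per_step
        plan.append((primaries, min_gb, max_gb))
    return plan


def split_requests(source, target, target_primaries):
    """Build (method, path, body) tuples for one split step."""
    return [
        # The source index must be blocked for writes before it can be split.
        ("PUT", f"/{source}/_settings",
         {"settings": {"index.blocks.write": True}}),
        # Split into a new index with the larger primary shard count.
        ("POST", f"/{source}/_split/{target}",
         {"settings": {"index.number_of_shards": target_primaries}}),
        # Assumption: ask for deleted documents to be expunged before the next step.
        ("POST", f"/{target}/_forcemerge?only_expunge_deletes=true", None),
    ]


plan = staged_split_plan(primaries=21, factor_per_step=2, steps=3,
                         min_gb=90, max_gb=120)

source = "my_index"  # hypothetical index name
for step, (count, lo_gb, hi_gb) in enumerate(plan, start=1):
    target = f"my_index_x{2 ** step}"  # hypothetical target index name
    print(f"step {step}: {count} primaries, about {lo_gb}-{hi_gb} GB each")
    for method, path, body in split_requests(source, target, count):
        print(f"  {method} {path} {body if body is not None else ''}")
    source = target
```

That would go 21 to 42 to 84 to 168 primaries, with each step only doubling the number of full-size shards instead of multiplying it by 8 at once.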