Hey, I'm trying to stabilize an old cluster.
We are running on 7.10, and plan to upgrade, but first need to stabilize the situation.
Our current configuration is running 57 shards, replication 1, on 4 nodes + 3 master nodes.
Shard sizes average ~65gb, where some of the shards are at 230gb.
The hotspots are caused by custom routing, which is the customer_id.
In short, it's an app for many users. Each user can have X number of documents. I guess the initial idea was to limit number of shards per query, since all queries limit to single customer_id. It apparently worked fine for many years, but now, we have couple of big customers, where the number of documents is in millions.
The routing_partition_size is set to 3, but we would have to increase it significantly.
With ~5tb of data, I was thinking about reindexing to ~168 shards, routing_partition_size 24, and 8 nodes. This way we distribute the big customer data where id doesn't create such hot spots.
Question is - is this the recommended approach? Is there any better solution for this?
We do not delete data, ever. So I'm not sure if rollover index would be a good choice.
I also don't want to manually move big customers to separate indexes, as this, well - is manual work.
Other approach would be to change routing to default doc_id, but then each query would have to hit 160+ nodes.
I'll be performing experiments, but just looking for some suggestions.
Thanks