I am building a large cross-cluster search setup: several equally sized underlying clusters, plus a single smaller dedicated cluster that is used to query all of them through cross-cluster search.
We ran a PoC and load testing, and it works quite well, but we are unsure how best to distribute the data.
One idea is to use deterministic round robin to load data across all clusters equally. Every cluster would end up with the same set of indexes and about the same amount of data. Queries would need to run on all clusters to get a complete view, Kibana views would need to be based on *:indexname-* to be complete, and so on. The main advantages of this setup are easier management, truly equal load distribution between clusters, and better query performance (if I understand the internals correctly).
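To make the first option concrete, here is a minimal sketch of what "deterministic round robin" could look like on the ingest side. The cluster aliases and function names are hypothetical and not part of our actual pipeline; it only illustrates two ways of picking a target cluster and the index expression the search cluster would then need.

```python
import hashlib

# Hypothetical remote cluster aliases, as they might be registered on the search cluster.
CLUSTERS = ["cluster_a", "cluster_b", "cluster_c"]


def round_robin(batch_number, clusters=CLUSTERS):
    """Strict round robin driven by a monotonically increasing batch counter."""
    return clusters[batch_number % len(clusters)]


def pick_cluster(doc_id, clusters=CLUSTERS):
    """Stable-hash variant: the same document ID always maps to the same cluster."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return clusters[int.from_bytes(digest[:8], "big") % len(clusters)]


def search_pattern(index_prefix):
    """Cross-cluster index expression that fans out to every remote cluster."""
    return f"*:{index_prefix}-*"
```

With either variant, every query has to use the wildcard cluster form, for example search_pattern("indexname") gives *:indexname-*, because any cluster may hold part of any dataset.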
The other idea is to distribute complete sets of data across different clusters, so we don't run into a scenario where a single cluster having issues affects completeness for every dataset.
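For comparison, a rough sketch of the second option, assuming a static mapping from dataset to owning cluster. The dataset names and aliases are made up for illustration.

```python
# Hypothetical mapping of dataset (index prefix) to the cluster alias that owns it.
DATASET_HOME = {
    "weblogs": "cluster_a",
    "metrics": "cluster_b",
    "audit": "cluster_c",
}


def dataset_pattern(dataset):
    """Cross-cluster index expression that targets only the owning cluster."""
    return f"{DATASET_HOME[dataset]}:{dataset}-*"


def affected_datasets(failed_cluster):
    """Datasets that become unavailable when one cluster has issues."""
    return sorted(d for d, c in DATASET_HOME.items() if c == failed_cluster)
```

Here a problem on cluster_b only takes out "metrics" (entirely), whereas with round robin the same problem would leave every dataset partially incomplete.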
We do have an active/active DR setup planned (with data replicated either by duplicating the ingestion pipeline or by cross-cluster replication), so I am leaning towards the first option: we could set up automatic failover of the seed pool to the equivalent DR cluster if it comes to that, while we resolve the issues and resume indexing on the currently active cluster.
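The seed pool failover I have in mind could be sketched roughly as below. This assumes the remote clusters are registered in sniff mode through the cluster.remote.<alias>.seeds setting; the host names and the SEEDS table are hypothetical, and the function only builds the request body that would be sent to the cluster settings API (PUT _cluster/settings) on the search cluster.

```python
import json

# Hypothetical seed pools (transport addresses) for the active and DR site of each alias.
SEEDS = {
    "cluster_a": {
        "active": ["a1.example.internal:9300", "a2.example.internal:9300"],
        "dr": ["a1.dr.example.internal:9300", "a2.dr.example.internal:9300"],
    },
}


def failover_settings(alias, site):
    """Build a cluster settings body that points a remote alias at the chosen site."""
    return {"persistent": {"cluster": {"remote": {alias: {"seeds": SEEDS[alias][site]}}}}}


if __name__ == "__main__":
    # Print the body that would switch cluster_a over to its DR seeds.
    print(json.dumps(failover_settings("cluster_a", "dr"), indent=2))
```

Since the alias stays the same, queries and Kibana views using *:indexname-* would not need to change after the switch.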
So I am asking the community for educated opinions on the pros and cons of each approach.