Elasticsearch Cross-site cluster

Hi,

Has anyone got experience with running a cluster across Data Centers? I'm interested in latency and bandwidth required between data centers. I've seen some information that 10Gbps / 10ms is required for a reliable cluster.

I do understand that it depends on how much data is indexed, but is there any minimum that anyone would recommend?

You are better off using CCR and CCS than running a cluster across a WAN link.

I believe CCR has some limitations around replicating some of the Kibana indices. Or is that incorrect?

I’m looking to make this highly available across data centers without having to build additional functionality to fail over Kibana and all the production data that is not being replicated with CCR. I want a failure to be zero touch.

The other issue is that with a hot standby, I’m just wasting compute sitting there when I could be using it.

I’ve been doing some testing by pushing about 50k eps on primary shards with around 30ms latency between sites (I purposely picked 30ms to see how sensitive it was to latency). On the surface it appeared to absorb that fairly well, with no cluster issues at all. Of course I understand that performance will be reduced, but how bad is it?
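One rough way to reason about the latency cost: an indexing request is only acknowledged once the in-sync replica copies have processed it, so a replica in the other site adds at least one inter-site round trip to every bulk request. The sketch below uses Little's law to estimate how many bulk requests must be in flight to hold a given event rate. The bulk size of 1000 events and the 20ms of local processing time are made-up illustration values, not measurements from this test.

```python
def bulk_requests_in_flight(events_per_sec, bulk_size, local_ms, rtt_ms):
    """Estimate concurrent bulk requests needed to sustain events_per_sec.

    Little's law: in-flight = arrival rate * time per request.
    Each bulk request is assumed to take local_ms of processing plus
    one inter-site round trip (rtt_ms) for the remote replica.
    """
    per_request_ms = local_ms + rtt_ms
    numerator = events_per_sec * per_request_ms
    denominator = 1000 * bulk_size  # ms per second * events per request
    return -(-numerator // denominator)  # integer ceiling division


# Hypothetical numbers: 1000-event bulks, 20 ms local processing.
same_site = bulk_requests_in_flight(50_000, 1000, local_ms=20, rtt_ms=0)
stretched = bulk_requests_in_flight(50_000, 1000, local_ms=20, rtt_ms=30)
print(same_site, stretched)
```

Under these assumptions the stretched cluster needs more concurrent bulk requests for the same rate, which may be why the test looked fine: the extra latency is hidden as long as the clients have spare concurrency, and it shows up as higher per-request latency rather than as cluster instability.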

If I said that this cluster is not ingesting anything but local system indices and is just focusing on CCS, is spanning it across two DCs that bad of an idea?

Spanning across a WAN is a bad idea.
What happens if you lose the link, which set of nodes would take leadership in the cluster? And what would happen to data being ingested on the "other" side of the cluster while one side kept going?

I am not aware of any such CCR limitation.

Set up CCS, and have a Kibana instance talking directly to each cluster. Then you can put something on top of the Kibana instances that detects when one cluster is down and points users to the other one.
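A minimal sketch of that "something on top" idea, assuming each Kibana answers on its `/api/status` endpoint without authentication (with security enabled you would likely need credentials). The URLs in the usage example are placeholders, and a real setup would more likely do this in a load balancer health check than in a script.

```python
import urllib.error
import urllib.request


def kibana_is_up(base_url, timeout=3.0):
    """Return True if Kibana's status endpoint answers with HTTP 200."""
    url = base_url.rstrip("/") + "/api/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or HTTP error status.
        return False


def pick_kibana(endpoints, is_healthy=kibana_is_up):
    """Return the first healthy endpoint in priority order, or None."""
    for base_url in endpoints:
        if is_healthy(base_url):
            return base_url
    return None
```

Usage would look something like `pick_kibana(["https://kibana-dc1.example", "https://kibana-dc2.example"])`, with the primary site listed first. Note that this only redirects users; it does nothing to keep rules or saved objects in sync between the two clusters.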

That's a nice goal, but I am not sure if it's possible without a fair bit of investment.

What happens if you lose the link, which set of nodes would take leadership in the cluster? And what would happen to data being ingested on the "other" side of the cluster while one side kept going?

A third site with additional master-only nodes would act as a tie breaker. The cluster should continue to run at the other site. That is no different from having a number of nodes go down in a single-site cluster.
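To illustrate the tie-breaker argument, here is a simplified majority-vote model. It only counts master-eligible votes per site and checks which group of still-connected sites holds a strict majority; it ignores how Elasticsearch adjusts its voting configuration over time, and the site names and node counts are made up.

```python
def electable_partition(voters_per_site, partitions):
    """Return the group of sites that can still elect a master, or None.

    voters_per_site: dict mapping site name -> master-eligible node count.
    partitions: iterable of site groups that can still talk to each other.
    A group wins only with a strict majority of all votes.
    """
    total = sum(voters_per_site.values())
    for group in partitions:
        votes = sum(voters_per_site[site] for site in group)
        if votes * 2 > total:
            return frozenset(group)
    return None


# Two sites only: losing the link leaves neither side with a majority.
two_sites = {"dc1": 1, "dc2": 1}
print(electable_partition(two_sites, [{"dc1"}, {"dc2"}]))

# With a tie-breaker site: if dc1 is cut off, dc2 + dc3 keep the cluster up.
three_sites = {"dc1": 1, "dc2": 1, "dc3": 1}
print(electable_partition(three_sites, [{"dc1"}, {"dc2", "dc3"}]))
```

In this model the isolated side can never win, so it cannot elect its own master or accept writes, which is the answer to the "which set of nodes would take leadership" question: whichever side can still reach the tie breaker.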

We want to use this as a central point for a SIEM where we use CCS to search remote clusters directly. We're running hundreds of queries a minute through CCS, so all the SIEM rules and exceptions need to be in sync between the Kibana instances. Having a separate cluster as a backup would require that all the rules, exceptions, etc. stay in sync so that in the case of a failure, the other site is ready to run those rules. I am not entirely sure that CCR actually replicates all SIEM rules and exceptions (I need to test this), and I would need to activate those rules on the backup site if the primary site fails. There is likely to be a window where no rules are running until the backup site takes over.

A stretched cluster that is really only doing CCS would make this a zero touch failure scenario. I can understand that if you are ingesting a heavy load that maybe there could be some significant drawbacks in a stretched cluster, however if the ingest load is very light and ingest performance is not really a concern then those drawbacks are possibly not a problem.

Standalone clusters would need a fair bit of custom work to have a no touch failover.