First of all, I know all the guidance against cross-data centre clusters. However, I have successfully implemented cross-DC ELK before, just not at multi-region scale, hence this post.
Our setup:
Multiple worldwide regions, each with 2 DCs. The DCs in each region are physically close to each other. We want a cluster in each region, ideally with cross-DC resiliency. The clusters will all be storing the same types of data, with the same field names etc. The data is just logging for now, so while it isn't critical to operations it would be very helpful to have it with DC failover.
The indices would have X primary shards, each with 1 replica. The primary and replica copies of any given shard would not be located in the same DC. We are looking at tens of GBs a day in each region, but haven't got an accurate estimate yet.
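To make the "primary and replica in different DCs" requirement concrete, here is a minimal sketch of how I understand it would be expressed with shard allocation awareness. The attribute name `dc` and the values `dc1`/`dc2` are placeholders I made up, the `node.attr.*` form assumes 5.x-style settings, and the helper functions are purely illustrative; they only print the lines that would go into each node's elasticsearch.yml.

```python
def awareness_settings(dc_name, all_dcs):
    """Build elasticsearch.yml-style settings for one node as a dict.

    'dc' is a hypothetical custom node attribute; any name would do.
    """
    return {
        # Tag this node with the data centre it lives in.
        "node.attr.dc": dc_name,
        # Ask the allocator to spread shard copies across 'dc' values.
        "cluster.routing.allocation.awareness.attributes": "dc",
        # Forced awareness: do not pile both copies into the surviving DC
        # when the other one is down.
        "cluster.routing.allocation.awareness.force.dc.values": ",".join(all_dcs),
    }


def render_yaml(settings):
    """Render the flat settings dict as simple 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render_yaml(awareness_settings("dc1", ["dc1", "dc2"])))
```

Each node in the second DC would use the same settings with `dc2` as its own attribute value. Whether forced awareness is desirable here is part of what I am unsure about, since it leaves replicas unassigned during a DC outage.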
We will want to be able to query and chart the information in each region using Kibana. Currently no ELK deployment exists, so we are free to use the latest versions of the ELK stack. The servers are all high-end, with SSDs and heaps of memory.
I am trying to work out the best way to structure the clusters and to query them using Kibana. Some things to address:
- What would be the maximum acceptable latency between nodes in a single cluster? Currently the worst ping time between servers in different DCs in the same region is 6.8 ms. In the other regions, pings between DCs are 1 or 2 ms.
- There is the split-brain issue. Since a majority of the master-eligible nodes must be available to elect a master and prevent split brain, you can't solve the issue properly with just two data centres, as either could go down. In each region there is no 3rd location where a quorum-making node could sit in the event of a DC failure. A dodgy solution would be to put a cluster's additional master-eligible node in a different region. I haven't seen a way to give a master-eligible node a lower priority for becoming the active master, which would perhaps have made this solution a bit less insane.
- Kibana. The initial thought would be to have a Kibana instance in each region, just for that region's cluster. However, we'd want all the Kibana instances to have the same charts and dashboards, so this setup would require duplicating any changes everywhere. Since the dev/support teams are only located in one region, the idea was raised to just have Kibana running in that location. In the upcoming version 5.5, Kibana will support cross-cluster search. We could then perform all the analysis through the Kibana instance in that one region, without any duplication issue.
- Quick non-ES question: Am I correct in my understanding that Redis is no longer needed as a queue buffer between Filebeat and the Logstash indexers, since Logstash now has queuing built in?
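To illustrate the split-brain point above, here is a rough sketch of the quorum arithmetic as I understand it. It assumes the usual rule that a majority is (master-eligible nodes / 2) + 1, which is what `discovery.zen.minimum_master_nodes` is normally set to; the site names and node counts are made up for the example.

```python
def quorum(master_eligible):
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible // 2 + 1


def survives_site_loss(masters_per_site, lost_site):
    """True if the remaining master-eligible nodes still form a majority."""
    total = sum(masters_per_site.values())
    remaining = total - masters_per_site[lost_site]
    return remaining >= quorum(total)


if __name__ == "__main__":
    # Hypothetical layouts: two DCs only, versus a tiebreaker elsewhere.
    two_dcs = {"dc1": 2, "dc2": 1}
    with_tiebreaker = {"dc1": 1, "dc2": 1, "other_region": 1}

    for layout in (two_dcs, with_tiebreaker):
        for site in layout:
            ok = survives_site_loss(layout, site)
            print(f"{layout}: lose {site} -> master can be elected: {ok}")
```

With only two sites, whichever DC holds the larger share of master-eligible nodes becomes a single point of failure for master election, whereas the three-site layout tolerates the loss of any one site. That is why the only option I can see is the tiebreaker node in another region.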
If it all sounds utterly terrible, the MVP would be a cluster in each region, confined to a single DC, with an instance of Kibana in each region. But if the above plan is workable then it would be ideal.
Any comments on any of the points are welcome.