Question on CCR feature on Elasticsearch 6.7

Hello,

I have a quick question on the CCR feature, which is now GA in 6.7. Based on the documentation & presentations available, we can now run two different clusters in two different regions, so consider the following setup.

Suppose I set up a cluster of 3 masters & 3 data nodes in the Azure (US-East) region, which I would term the leader cluster (assuming all the indices in this cluster are leaders), and another cluster of 3 masters & 3 data nodes in Azure (US-West), which I would term the follower.

Now that everything is set up & the follower is accurately replicating the leader, I am able to perform index / search operations on the leader.

If for some reason the region holding my leader cluster goes down, I would be left with the following DR mechanism.

  1. Update all the indices in the follower cluster to stop following using the "unfollow" API. As this is an index-level API, I would have to run it against every index in the cluster.
  2. Using the "unfollow" API now makes this cluster a "leader" eligible cluster, which means I can send indexing requests to this cluster.
  3. The "unfollow" API has a drawback: "Converting a follower index to a regular index is an irreversible operation." So when my original leader region is back up, I am left with no other option but to discard everything indexed on that cluster & set up a new cluster altogether, which would now be a follower cluster. Per the documentation: "Currently cross-cluster replication does not support converting an existing regular index to a follower index".
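
The steps above can be sketched as the sequence of REST calls issued against the follower cluster. This is a minimal sketch, not a tested runbook. It assumes each follower index has to be paused and closed before it can be unfollowed, as the 6.7 CCR documentation describes. The index names and the `failover_requests` helper are hypothetical, and the code only builds the request list; sending the requests is left to whichever HTTP client you use.

```python
from typing import List, Tuple

Request = Tuple[str, str]  # (HTTP method, path)


def failover_requests(follower_indices: List[str]) -> List[Request]:
    """Build the per-index calls that turn followers into regular indices."""
    requests: List[Request] = []
    for index in follower_indices:
        requests += [
            ("POST", f"/{index}/_ccr/pause_follow"),  # stop pulling from the leader
            ("POST", f"/{index}/_close"),             # unfollow needs a closed index
            ("POST", f"/{index}/_ccr/unfollow"),      # irreversible: now a regular index
            ("POST", f"/{index}/_open"),              # reopen so it accepts writes
        ]
    return requests


if __name__ == "__main__":
    # Hypothetical index names, for illustration only.
    for method, path in failover_requests(["orders", "trades"]):
        print(method, path)
```

Because the calls are per index, the list grows by four requests for every follower index in the cluster.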

Could you please let me know what the appropriate action would be in the above scenario using CCR, or whether there are any planned features which could fill the gap mentioned above?

Thanks,
Vikas.

There is no need to set up a whole new cluster, but you will need to create new follower indices after using the unfollow API.

The problem in the process you describe above is that you have created divergent histories in the two clusters and there is no way to reconcile them again. This is not an omission in CCR so much as a fundamental problem in this way of handling the temporary failure of the leader cluster.

I think it'd be better not to unfollow the leader when it becomes unavailable. The follower cluster remains available for searches, and when the leader cluster comes back it will carry on replicating data to the follower again.

Hey David,

Thanks for the quick response,

In my case the data is expected to flow into the indices continuously, i.e. the indexing operation runs as & when transactions occur within our application, & searches should be performed on the latest available data. You could compare it to a stock market live feed pushing data to Elastic; in such scenarios, performing searches on stale data would be problematic at times.

Hence the thought process was to perform an "unfollow" on the follower index & use that cluster as the leader which will receive the latest data. Once the original leader region is back up, the original leader would then be used as a follower.

What do you suggest in such a use case?

Thanks,
Vikas.

If your application is also running in the US-East region then it seems reasonable to assume that it won't be indexing data while the US-East region is unavailable.

More generally, it's probably best for your application to index data into a cluster within the same region. You can then use CCR in both directions to replicate each region's data to the other region.

Could you please elaborate on this?

Thanks,
Vikas.

Sure. The leader/follower relationship in CCR is an index-level relationship. This means that each cluster can contain some indices which are CCR leaders and some others that are CCR followers.

Thus in the US-East region cluster you could have a data-us-east index, and similarly in the US-West region cluster a data-us-west index. In the US-East region your application indexes into the data-us-east index, and in the US-West region your application indexes into the data-us-west one.

Then in the US-East region you can set up a CCR follower called data-us-west-replica which follows the data-us-west leader index in the US-West region, and conversely in the US-West region you can set up a CCR follower called data-us-east-replica that follows the data-us-east leader index in the US-East region.

Any searches you need to do can run over both indices in the local cluster: GET /data-us-east,data-us-west-replica/_search.

If the US-East region fails then the application running in US-West can still index into the data-us-west index, and will still serve searches with its most recent data. When the US-East region recovers then each -replica index will catch up with its leader again.
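
As an illustration of that layout, the sketch below builds the follow request each cluster would need and the search path for the local cluster. It is a sketch under assumptions rather than a tested setup: it assumes each cluster has already registered the other as a remote cluster under an alias equal to the region name (`us-east`, `us-west`), and the helper names `follow_requests` and `search_path` are hypothetical. The index names are the ones used in the description above.

```python
import json
from typing import Dict, List, Sequence, Tuple

Request = Tuple[str, str, Dict[str, str]]  # (HTTP method, path, JSON body)

REGIONS = ("us-east", "us-west")


def follow_requests(local: str, regions: Sequence[str] = REGIONS) -> List[Request]:
    """Follower indices to create in the cluster of the `local` region."""
    return [
        (
            "PUT",
            f"/data-{remote}-replica/_ccr/follow",
            # Assumption: the remote cluster alias equals the region name.
            {"remote_cluster": remote, "leader_index": f"data-{remote}"},
        )
        for remote in regions
        if remote != local
    ]


def search_path(local: str, regions: Sequence[str] = REGIONS) -> str:
    """Search the local leader plus every local replica of a remote leader."""
    names = [f"data-{local}"]
    names += [f"data-{remote}-replica" for remote in regions if remote != local]
    return "/" + ",".join(names) + "/_search"


if __name__ == "__main__":
    for region in REGIONS:
        print(f"# run against the {region} cluster")
        for method, path, body in follow_requests(region):
            print(method, path, json.dumps(body))
        print("GET", search_path(region))
```

Each cluster ends up with one writable leader index and one read-only follower per remote region, and the search path covers both.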

Thanks for that information David,

In my use case there is only a single application (a web app), which has the flexibility of being deployed into multiple regions. We are using Azure's "SQL Server Always On" feature, so the data is first pushed into SQL and then to Elastic. In case of a region outage, the Always On feature helps in storing the latest data in a single data store which is active and available.

The problem I see is that if we scatter the write-enabled indices across regions, then during an outage those indices would never receive the latest data, & converting a follower index into a leader seems to be an expensive effort (thinking of creating a follower now in the other region etc.).

I would like to know if there are any better approaches to this along similar lines to Azure's SQL Always On feature.

Thanks,
Vikas.

The bidirectional-CCR setup I described seems very similar to the "active/active" setup in the page to which you linked.

The setup that you originally described seems similar to the "active/passive with hot standby" one, with similar problems around failover and rebuilding replicas when a failed region comes back online.

You are correct, your explanation is similar to an "active/active" setup. The major difference here is that the indices on either one of the nodes are read-only, so during an outage the indexing operations on the indices present in that region will have to be suspended, which is a concern here.

Came up with this example to understand how we would handle / what are our DR options during an outage.

Are there any future plans to support / convert an existing index to become a follower?

I could see a discussion on converting an existing index to become a follower, but couldn't see any further updates on this topic. Can you shed some light on this please?

Sure. The problem of divergent histories that I mentioned above is the obstacle to that idea. Once you've detached a follower from a leader they will contain different documents, and there's no way to reconcile this without risking data loss. You have to build new followers.

I don't understand this concern. If a region is down then the application won't be running in that region either, so it doesn't matter that the indices from that region are read-only.

Consider this scenario:

A single web app is deployed in two different regions on Azure, with Azure Traffic Manager routing incoming browser requests to the appropriate endpoint (the web app hosted in each region), and with a SQL DB hosted in both regions handling active reads/writes with replication enabled.

If in the above scenario a particular region isn't available, the traffic manager will detect this & start routing all future requests to the region which is available. The SQL DB will behave in the same manner.

But in Elastic, all the indices which fall in the outage region will remain read-only.

So the app would still be live along with the SQL DB, but the indexing operations won't.

Hope I could explain it.

Thanks for your responses again David.

Vikas.

I think there is some misunderstanding here. The proposed architecture is to write to the indices corresponding with the region in which each request is being handled. For example if a request is handled in the US-West region then the corresponding indexing traffic goes to the data-us-west index. If the US-West region is unavailable then the data-us-west index is indeed read-only, but there are no requests being handled in that region so there's no indexing traffic for this index and therefore it doesn't matter that it's read only.
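
To make that routing rule concrete, here is a small sketch of how the write index could be chosen. It is an illustration only: `handling_region` is a hypothetical stand-in for whatever the traffic manager does, and the only assumption about index naming is the data-<region> convention used earlier in this thread.

```python
from typing import Sequence


def handling_region(preferred: str, available: Sequence[str]) -> str:
    """Hypothetical stand-in for the traffic manager's routing decision."""
    if preferred in available:
        return preferred
    if not available:
        raise RuntimeError("no region is available to handle the request")
    return available[0]  # fail over to the first region that is still up


def write_index(preferred: str, available: Sequence[str]) -> str:
    """Index into the leader of the region that actually handles the request."""
    return f"data-{handling_region(preferred, available)}"


if __name__ == "__main__":
    # Both regions up: each request writes to its own region's leader.
    print(write_index("us-east", ["us-east", "us-west"]))
    # US-East down: its users are handled in US-West and write to data-us-west.
    print(write_index("us-east", ["us-west"]))
```

With US-East down, nothing ever selects data-us-east as a write target, so the fact that its replica in US-West is read-only does not block any indexing.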
