Should I use ElasticSearch across two datacenters for convenient data replication?

AEvar_Arnfjord_Bjarm · September 17, 2011, 9:31am

I currently have 3x ElasticSearch nodes in a single cluster in a
single datacenter.

They have 1x active index at any time (there's an alias that I move
around) which is miniscule by most standards, it has around 4 million
documents and is ~1GB on disk and is updated daily.

There's 1 shard for the index and the replica configuration is such
that there's always a complete copy of the index on every machine, and
when I make queries I use "preference: local" to ask ElasticSearch not
to forward the query to other machines.

I'm about to add 3x more machines in a second datacenter, and I'm
wondering whether to set them all up in one cluster or have two
clusters.

The reasons I'd set it up in one cluster:

Easy to monitor the entire installation with one instance of
ElasticSearch Head / other monitoring tools.
Easier to maintain it, I just have to run one cron script per day
that talks to one cluster, although running it against N clusters
would also be trivial.
I don't really care about the slowness of the inter-DC link. It
doesn't really matter if it takes 10 minutes or 1 hour to populate
the index. While it's being populated I have yesterday's index
around serving requests.

Reasons not to do it:

Perhaps even with my setup of having a complete copy of the index
on each machine and using "preference: local" there will be some
situations (e.g. a resource exhaustion on one node) that'll cause
an ElasticSearch node in one datacenter to forward requests it
can't handle to a node in another datacenter, at which point the
inter-DC link would become a very painful bottleneck.

But I have no idea whether that actually happens in practice, and I
couldn't find anything documented about this.
If I ever start using indexes that need to be more synchronous I'd
have to change the configuration to use two clusters.

I'll probably just set it up as two clusters because it's easy to run
two cronjobs & to have the chance to start using more synchronous
indexes.

But I thought I'd post this here in case it generates some interesting
discussion, in particular I'd be interested in more details about when
ES nodes decide to forward requests to other nodes.

ppearcy · September 20, 2011, 10:05pm

Hey,
From my experience, I'd strongly discourage this. The biggest pain
point I've had with ES is network partitions within a datacenter
causing things to blow up. The 0.16 release has improved that greatly,
but I've still had some issues (I'm hoping 0.17 addresses these).

Across DCs, you're much more susceptible to partitions.

There are some newer features that might mitigate most of the issues,
eg minimum number of masters, the quorum support, but I haven't
started using these.

The approach I'd recommend for multiple data centers is to have
elasticsearch sit on top of a cross-DC data store (eg, couchdb) and
use the datastore for cross DC replication.

Hope this helps.

Best Regards,
Paul

On Sep 17, 3:31 am, Ævar Arnfjörð Bjarmason ava...@gmail.com wrote:

I currently have 3x Elasticsearch nodes in a single cluster in a
single datacenter.

They have 1x active index at any time (there's an alias that I move
around) which is miniscule by most standards, it has around 4 million
documents and is ~1GB on disk and is updated daily.

There's 1 shard for the index and the replica configuration is such
that there's always a complete copy of the index on every machine, and
when I make queries I use "preference: local" to ask Elasticsearch not
to forward the query to other machines.

I'm about to add 3x more machines in a second datacenter, and I'm
wondering whether to set them all up in one cluster or havetwo
clusters.

The reasons I'd set it up in one cluster:

Easy to monitor the entire installation with one instance of
Elasticsearch Head / other monitoring tools.

Easier to maintain it, I just have to run one cron script per day
that talks to one cluster, although running it against N clusters
would also be trivial.

I don't really care about the slowness of the inter-DC link. It
doesn't really matter if it takes 10 minutes or 1 hour to populate
the index. While it's being populated I have yesterday's index
around serving requests.

Reasons not to do it:

Perhaps even with my setup of having a complete copy of the index
on each machine and using "preference: local" there will be some
situations (e.g. a resource exhaustion on one node) that'll cause
an Elasticsearch node in one datacenter to forward requests it
can't handle to a node in another datacenter, at which point the
inter-DC link would become a very painful bottleneck.

But I have no idea whether that actually happens in practice, and I
couldn't find anything documented about this.

If I ever start using indexes that need to be more synchronous I'd
have to change the configuration to usetwoclusters.

I'll probably just set it up astwoclusters because it's easy to runtwocronjobs & to have the chance to start using more synchronous
indexes.

But I thought I'd post this here in case it generates some interesting
discussion, in particular I'd be interested in more details about when
ES nodes decide to forward requests to other nodes.

Topic		Replies	Views
ElasticSearch Setup to store data from multiple datacenters Elasticsearch	6	400	July 6, 2017
Problem with keeping in sync Elasticsearch across two data centers Elasticsearch	9	2178	July 6, 2017
ES and multiple datacenters Elasticsearch	11	1096	July 6, 2017
Multi DC cluster or separate cluster per DC? Elasticsearch	6	1373	July 6, 2017
Async Replication and Multi DC Elasticsearch	1	383	July 6, 2017

Should I use ElasticSearch across two datacenters for convenient data replication?

Related topics