Should I use ElasticSearch across two datacenters for convenient data replication?

I currently have 3x ElasticSearch nodes in a single cluster in a
single datacenter.

They have 1x active index at any time (there's an alias that I move
around), which is minuscule by most standards: around 4 million
documents, ~1GB on disk, updated daily.
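
The alias move itself is just the usual atomic swap through the
_aliases endpoint. Roughly like this (index/alias names are made up,
and this is a sketch with Python + requests rather than my actual cron
script):

    import json
    import requests

    ES = "http://localhost:9200"  # any node in the cluster

    def swap_alias(alias, old_index, new_index):
        # Remove the alias from yesterday's index and add it to today's
        # in a single _aliases call, so readers never see a missing alias.
        actions = {
            "actions": [
                {"remove": {"index": old_index, "alias": alias}},
                {"add": {"index": new_index, "alias": alias}},
            ]
        }
        r = requests.post(ES + "/_aliases", data=json.dumps(actions))
        r.raise_for_status()

    # After the daily build has finished indexing into the new index:
    swap_alias("docs", "docs-2011-09-16", "docs-2011-09-17")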

There's 1 shard for the index, and the replica configuration is such
that there's always a complete copy of the index on every machine.
When I make queries I use "preference: local" to ask ElasticSearch not
to forward the query to other machines.
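
Concretely the setup looks something like this (a sketch, not my exact
config; on a 3-node cluster, 2 replicas of the single shard means a
full copy on every node, and the literal value the preference
parameter takes is "_local"):

    import json
    import requests

    ES = "http://localhost:9200"

    # 1 shard + 2 replicas on 3 nodes => a complete copy on every machine.
    # (index.auto_expand_replicas: "0-all" is another way to get the same
    # effect without hardcoding the node count.)
    settings = {"settings": {"number_of_shards": 1, "number_of_replicas": 2}}
    requests.put(ES + "/docs-2011-09-17", data=json.dumps(settings))

    # Search through the alias with preference=_local so the node we're
    # talking to answers from its own shard copy instead of forwarding.
    query = {"query": {"match_all": {}}}
    r = requests.post(ES + "/docs/_search?preference=_local",
                      data=json.dumps(query))
    print(r.json()["hits"]["total"])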

I'm about to add 3x more machines in a second datacenter, and I'm
wondering whether to set them all up in one cluster or have two
clusters.

The reasons I'd set it up in one cluster:

  • Easy to monitor the entire installation with one instance of
    ElasticSearch Head / other monitoring tools.

  • Easier to maintain: I just have to run one cron script per day
    that talks to one cluster, although running it against N clusters
    would also be trivial.

  • I don't really care about the slowness of the inter-DC link. It
    doesn't really matter if it takes 10 minutes or 1 hour to populate
    the index. While it's being populated I have yesterday's index
    around serving requests.

Reasons not to do it:

  • Perhaps even with my setup of having a complete copy of the index
    on each machine and using "preference: local" there will be some
    situations (e.g. resource exhaustion on one node) that'll cause
    an ElasticSearch node in one datacenter to forward requests it
    can't handle to a node in another datacenter, at which point the
    inter-DC link would become a very painful bottleneck.

    But I have no idea whether that actually happens in practice, and I
    couldn't find anything documented about this.

  • If I ever start using indexes that need to be more synchronous I'd
    have to change the configuration to use two clusters.

I'll probably just set it up as two clusters, because running two
cronjobs is easy and it leaves me the option of starting to use more
synchronous indexes later.

But I thought I'd post this here in case it generates some interesting
discussion; in particular, I'd be interested in more details about
when ES nodes decide to forward requests to other nodes.

Hey,
In my experience, I'd strongly discourage this. The biggest pain
point I've had with ES is network partitions within a datacenter
causing things to blow up. The 0.16 release improved that greatly,
but I've still had some issues (I'm hoping 0.17 addresses these).

Across DCs, you're much more susceptible to partitions.

There are some newer features that might mitigate most of the issues,
e.g. the minimum-number-of-masters setting and the quorum support, but
I haven't started using these.
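
(For reference, the master setting is discovery.zen.minimum_master_nodes
in elasticsearch.yml. A sketch for a 6-node, two-DC cluster, though as
I said I haven't run with it myself:

    # elasticsearch.yml on every node (sketch): require a majority of the
    # 6 master-eligible nodes, i.e. (6 / 2) + 1 = 4, so a partitioned
    # minority, e.g. one isolated DC with 3 nodes, can't elect its own
    # master and split-brain the cluster.
    discovery.zen.minimum_master_nodes: 4

Note that with 3 nodes per DC, a clean DC-to-DC split leaves neither
side with 4 nodes, so neither side can elect a master while the link is
down; that avoids split-brain but is another argument for two separate
clusters.)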

The approach I'd recommend for multiple datacenters is to have
ElasticSearch sit on top of a cross-DC data store (e.g. CouchDB) and
use the data store for cross-DC replication.
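
(Concretely, that could look something like the following: each DC runs
its own ES cluster plus a local CouchDB, CouchDB's own replication
carries the data across the inter-DC link, and each cluster indexes
from its local CouchDB via the CouchDB river plugin. Hosts and names
below are made up; it's a sketch from memory, not a config I'm running:

    import json
    import requests

    # Register the CouchDB river on the local cluster (needs the
    # elasticsearch-river-couchdb plugin installed); the other DC does
    # the same against its own local CouchDB node.
    river = {
        "type": "couchdb",
        "couchdb": {"host": "couch.dc1.example.com", "port": 5984,
                    "db": "docs"},
        "index": {"index": "docs", "type": "docs"},
    }
    requests.put("http://localhost:9200/_river/docs_river/_meta",
                 data=json.dumps(river))
)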

Hope this helps.

Best Regards,
Paul
