Cluster(s) scale and failure design (ES 5.4)

Hi guys,
I decided to post a question about ES cluster(s) design because I think my actual design sucks a bit.

My goal is to set up ES infrastructure which will:

  • be able to index/update 10 000 - 100 000 (and more) daily, there are also data (100 000 - 1000 000) which will be used for reading only
  • be failure resistant (I understand there are many levels to achieve this, but let's talk about a basic failure resistant design)

I can run 8 nodes at this moment in total (my resources limitation).

My design #1:

  • 2 clusters, run on separated machines
  • each cluster contains 2 master-only-eligible nodes, 2 data nodes
  • usage of design: cluster1 and cluster2 contain mirrored data, when update/index is needed I will tell my application to use cluster1 and I will update cluster2, if everything is OK, I will tell application to use cluster2 (and do synchronization of cluster2->cluster1)
  • goal is to have a data backup and do not limit application when large index/update is running
  • problems: even number of data nodes (i know about split brain problem), quite complicated data migration

My design #2:

  • 1 cluster
  • 3 master-only-eligible-nodes, 5 data nodes, nodes run on separated machines
  • here comes my questions, how to design update/index operations scenario, but do not throttle or limit application in background and still have a backup
  • would you suggest using aliases with backup&restore function of ES ?

(I have accidentally sent a non completed post, sorry).

Thank you.

That's bad, see Important Configuration Changes | Elasticsearch: The Definitive Guide [2.x] | Elastic

Not sure what that means.

I have accidentally sent a non completed post, sorry. Post is updated.

What do you mean by that?

Consider scenario that I need to do large index/update/upsert operation for a index. This index is used by a web application for search purposes. If I start doing large index/update/upsert operation, my search result speed may throttle during that. Also, if index/update/upsert operations ends in incorrect state, I still need the original data.
I am considering to mirror index in cluster (design #2) and use aliases to point to updated version of index, but I am not sure about that.

Define large?

Also, you should use 5.6 as that is latest :slight_smile:

I do not know exact numbers, but I think at start there could be 100 000 - 1 000 000 and more updates daily, it could be growing in time.
Yep I know about 5.6 version, but I optimized a lot of things for 5.4, I need to check changes etc.

That load does not seem very high to me but I'm not updating any documents just indexing new ones...

We are running 3 different clusters for log analysing and the mid sized one runs on two physical servers with 5 ES nodes on each machine. Looking in Kibana, we're getting 15M logs per day at the moment. And the machines are not really under any high load but that is up to the HW resources of course. Also probably how complex the structure of your documents are...

These machines are 5 years old with spinning disks. Just to give some context.
24x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz / 125.90 GiB RAM

Each ES node has a dedicated drive.

Some other design possibilities.
Dedicate 3 machines as master nodes if you really want dedicated master nodes and run two master nodes per machine, one for each cluster. That would leave you with 5 machines for data. Maybe one primary cluster with 3 machines and one secondary with two...

Or do something like what I do and run multiple nodes per machine. That of course depends a lot on what sort of hardware you have...

My setup is not really "production grade" but in my case that is acceptable. Sounds like you will have higher demands on availability.

Good luck. And I would also recommend to use 5.5 or 5.6.


Thank for your context.
My problem are hardware resources. I do not have enough resources to run 2 clusters with 3 master-only-eligible nodes + data nodes.
I think I can make my index/update and backup logic with one cluster using aliases. My idea is, as I wrote above, to have 3 master-only-eligible nodes and 5 data nodes. Nodes would run on separated machines to ensure stability against hardware failure. My index/update logic must ensure that web application (search features only) using ES is not throttled during indexing/updating:

  • 2 mirrored indices, alias using index #1
  • when update is needed, I start updating index #2 and make snapshot of updated index
  • if everything is fine I set alias to use index #2 and restore snapshot to index #1

Am I missing something? I know that this scenario has to be tested, but in general - could it work ?

Thank you.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.