Elasticsearch cluster design thoughts

Hi there,

I have a cluster running smoothly on AWS (self-managed, not the hosted service), composed of 6x m5d instances.

So we have local SSD storage (we got rid of EBS since it caused us trouble), 2 CPU cores, and some RAM per node.

For cost reasons, each node is master-eligible, a data node, and a client node, and we have a load balancer in front of them.

The total cost is about $587 for the instances.

Our cluster now contains 90 indices, each with 6 shards and 2 replicas, and about 190M documents.

We are adding a new instance every two months because we tend to run out of storage, and costs keep going up.

So I'm wondering whether this setup is sound, or whether it's time for a redesign.

Cluster needs to meet those points:

  • "100% uptime" (realistically: "as reliable as possible")
  • Can lose 2 nodes at any time without disruption or data loss
  • Cheap to run & effective

Current search rate is about 200/s and indexing rate is between 900 and 1500/s.

Edit:

Oh, and we do bulk indexing and can have some complex queries.

Any thoughts?

Bump,

Any advice?

Hi,

Still looking for advice :slight_smile:

Hi,
Still interested in any advice/feedback from anyone running a similar setup or use case.

Hello,

I would advise you to drop a replica on all of your indices, now and going forward.

In your heavily populated cluster, having two replicas of each shard increases the average indexing load on every node by 50% over a single-replica setup, and increases the storage and segment memory demand on every node by 50%. As for the greater guarantee against data loss from losing nodes: those additional burdens on your resources actually increase the risk of node instability, and hence cluster instability.
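To put rough numbers on that overhead, using the figures from this thread (90 indices, 6 primaries each; this is only back-of-the-envelope arithmetic, not a measurement):

```python
# Replica overhead, based on the cluster described in this thread.
indices = 90
primaries_per_index = 6

def total_shard_copies(replicas):
    """Every primary plus each of its replicas is a full shard copy
    that must be written to and stored."""
    return indices * primaries_per_index * (1 + replicas)

copies_2r = total_shard_copies(replicas=2)   # current setup
copies_1r = total_shard_copies(replicas=1)   # proposed setup

print(copies_2r)                 # 1620 shard copies cluster-wide
print(copies_1r)                 # 1080 shard copies cluster-wide
print(copies_2r / copies_1r - 1) # 0.5 -> 50% more write/storage work
```

Every document indexed is written to each copy of its shard, which is why the copy count maps directly onto indexing load and disk usage.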

As for the hypothesis that an extra copy of each shard improves search responsiveness (that is, when a search needs a copy of my-index shard #2, each copy receives about 33% of those requests instead of 50%): that's nullified by the fact that each node now hosts 50% more shards and, on average, still receives the same number of searches to process with the same physical resources. And as noted above, those resources are already over-taxed by the extra indexing and storage.

Sometimes that's a difficult sell in an organization that has decided on a policy of being able to lose 2 nodes without losing data.

One way to start dealing with that: if you are in a time-series scenario where only the current day's (or week's or month's) indices receive writes, and you aren't already snapshotting, set up a repository in S3 and start doing so. Then, if you drop a replica after an index has been snapshotted with its final contents, the loss of a single node still leaves at least one copy of every shard in the cluster, and the loss of two nodes leaves at least one copy of every shard in either the cluster or the snapshots.
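A sketch of the two requests that sequence involves (index and snapshot names here are placeholders, and the bodies are shown as Python dicts rather than live calls):

```python
import json

# Step 1: snapshot the index once its contents are final.
# Sent as: PUT _snapshot/my_repo/snap-example
create_snapshot = {
    "indices": "my-index",
    "include_global_state": False,  # only the data, not cluster state
}

# Step 2: once the snapshot succeeds, drop a replica.
# Sent as: PUT my-index/_settings
drop_replica = {"index": {"number_of_replicas": 1}}

print(json.dumps(create_snapshot))
print(json.dumps(drop_replica))
```

The ordering matters: the replica is only dropped after the snapshot exists, so there is never a moment with fewer total copies than before.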

If you can share more information about whether you are doing daily indices and what the sizes of those indices are, additional advice about your sharding strategy can be provided.


I’d have a look at using i3.large, as they offer 475GB internal NVMe drives which work very well for your indices.

We usually run a two- or three-node layout and are able to store billions of docs with relatively good performance.

Cheers - geo


Hello,

Thanks for your answers :slight_smile:

We don't rotate the indices much (to be precise, there is almost no rotation); our applications are realtime/near-realtime (we have users connected to our apps that rely on Elasticsearch for some features).

We're trying to achieve high availability (hence the "two nodes can be lost at any moment" requirement).
If the cluster goes down or turns red, "the world stops" for our users; that is not an option, and avoiding it is my first priority.
I am tied to a budget that is already too big for my management's liking, so my options are quite limited.

We already have daily backups to S3 at midnight. We tried hourly backups, but I discovered that snapshotting to S3 consumes RAM and puts too much pressure on the nodes, exceeding 90% JVM heap utilization and putting a node at risk of an OOM.
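One knob that may help with that snapshot pressure (bucket and repository names below are placeholders): S3 repositories accept a max_snapshot_bytes_per_sec setting that throttles how fast each node writes snapshot data, trading longer snapshot duration for a gentler load.

```python
import json

# Hypothetical repository registration with a snapshot write throttle.
# Sent as: PUT _snapshot/my_s3_repo
repo_body = {
    "type": "s3",
    "settings": {
        "bucket": "my-es-snapshots",
        # Per-node snapshot write rate cap (the default is 40mb).
        "max_snapshot_bytes_per_sec": "20mb",
    },
}
print(json.dumps(repo_body))
```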

Please find below a capture from our Kibana; I ordered the indices by disk usage (I can't reveal index names here).

As for i3 instances, the disk performance is great but the CPU performance lags behind the m5d instances we are using (though I think Glen_Smith is right: with 2 replicas we are putting more stress on the cluster :confused: ).
Plus, 4 i3.large instances cost almost the same as 6 m5d.large instances, without the ability to lose two nodes at that price.

I've seen cluster designs with something like 3 master nodes on t3 instances, data nodes on i3 instances, and client nodes on c5 instances. That sounds like a well-balanced cluster, but the cost is in another world.

So here I am, trying my best to solve the impossible puzzle ^^
Again thanks for your help!

I gather, then, that of the 90 indices, you have 80 that are less than 12GB. You're currently pushing 300 shards per node, which is a large number for these resources.
If you reduce those to a single primary shard per index, searches will consume less CPU. But that isn't going to help your storage capacity.
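The per-node figure follows from the numbers earlier in the thread (90 indices x 6 primaries x 2 replicas, spread over 6 nodes); a quick check:

```python
# Shards-per-node arithmetic for the cluster described in this thread.
indices, primaries, replicas, nodes = 90, 6, 2, 6

total_shard_copies = indices * primaries * (1 + replicas)
per_node = total_shard_copies // nodes
print(total_shard_copies, per_node)   # 1620 total, 270 per node

# With a single primary per index and one replica instead:
reduced_per_node = indices * 1 * (1 + 1) // nodes
print(reduced_per_node)               # 30 per node
```

Each shard is a full Lucene index with its own segment memory and file handles, which is why the per-node count matters more than the raw document count.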

Have you put much work into rationalizing your mappings yet? Field by field, determine whether you really need to search or aggregate on that field, and set up analysis (or the lack thereof) appropriately. You could reduce both your indexing CPU load and your storage. Are you using best_compression?
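As a sketch of what that kind of rationalization looks like (index layout and field names here are made up for illustration):

```python
import json

# Hypothetical index definition illustrating the ideas above.
index_body = {
    "settings": {
        # best_compression trades a little CPU at merge time for
        # noticeably smaller stored fields on disk.
        "index.codec": "best_compression",
    },
    "mappings": {
        "properties": {
            # A field you only filter/aggregate on: keyword, no analysis.
            "status": {"type": "keyword"},
            # A field you never search or aggregate on: don't index it,
            # don't build doc_values for it.
            "raw_payload": {"type": "keyword", "index": False,
                            "doc_values": False},
            # Free text you actually search: keep it analyzed, but skip
            # the extra keyword sub-field if you never aggregate on it.
            "message": {"type": "text"},
        }
    },
}
print(json.dumps(index_body, indent=2))
```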

That reminds me, have a look at this blog post. It's got strategies applicable to more than just Filebeat indices.

That's a good point, I will ask the developers about that.

So far, I get that:

  • We are running too many shards per instance
  • We are running too many replicas per instance
  • We are probably not doing great with our mappings (not confirmed)

Reducing the number of shards should be possible; is reducing primary shards from 10 to 5 a wise move?

For the replicas, I'm stuck: sure, we can reduce the number of replicas, which is a simple and very effective solution, but doing so puts the cluster at risk if multiple nodes fail at the same time. I'm really wondering how everyone who runs a mission-critical cluster achieves HA.

As for mapping optimizations, we will check what can be done.
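For halving primary shards on an existing index, the _shrink API is the usual route; a sketch of the flow (index and node names are placeholders, bodies shown as Python dicts):

```python
import json

# Step 1: put one copy of every shard on a single node and block writes.
# Sent as: PUT my-index/_settings
prepare = {
    "settings": {
        "index.routing.allocation.require._name": "node-1",
        "index.blocks.write": True,
    }
}

# Step 2: shrink into a new index with fewer primaries. The target
# count must be a factor of the source count (e.g. 10 -> 5).
# Sent as: POST my-index/_shrink/my-index-shrunk
shrink = {
    "settings": {
        "index.number_of_shards": 5,
        "index.number_of_replicas": 1,
    }
}

print(json.dumps(prepare))
print(json.dumps(shrink))
```

Since shrinking requires blocking writes, it fits indices that are no longer actively indexed; for the realtime ones, changing the shard count means reindexing into a new index.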