How to Set Up an ES Cluster for Specific Short Spikes

We are using ES and have a basic load that is easily covered by a few nodes with 4 cores each. A few times a day we have very short, high spikes (about 15 minutes) where we have to process lots of new data. Even when we change the instance size to 32 cores, we sometimes get 429 rejects because the CPU cannot keep up with the bulk requests (disk, memory, etc. are not the bottleneck). The spikes cannot be planned for (they depend on user actions), and full performance should be available within a minute of us knowing a spike is coming.

We can deal with this by sizing the cluster for the maximum load, but that drives costs up by a large factor. Are there any recommended patterns for dynamically scaling a cluster like this up, based on CPU usage, within seconds?

The quickest option I can come up with (untested) is to shut down one node of the cluster, bring it back up with the same data disks on a new machine with more resources, wait until it has recovered its cluster state (which should not take too long, as it was only gone for a minute), and then do the same for the next node. When the load drops, do the same in reverse (a rough sketch follows the note below).

  • Platform is GCE, but the same pattern should work for any cloud deployment.
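
A minimal sketch of that rolling resize, assuming the gcloud CLI is installed and the cluster is reachable over HTTP. The instance names, zone, machine type, endpoint, and the allocation toggling are illustrative placeholders, not a tested recipe:

```python
#!/usr/bin/env python3
"""Sketch: resize ES data nodes one at a time by stopping the GCE instance,
changing its machine type, starting it again and waiting for recovery.
All names below (zone, instances, machine type, endpoint) are placeholders."""

import subprocess
import time

import requests

ES_URL = "http://localhost:9200"                 # any reachable node in the cluster
ZONE = "europe-west1-b"                          # placeholder zone
NODES = ["es-data-1", "es-data-2", "es-data-3"]  # placeholder instance names
BIG_MACHINE_TYPE = "n1-standard-32"              # target size for the spike


def gcloud(*args):
    """Run a gcloud compute instances command and fail loudly on error."""
    subprocess.run(["gcloud", "compute", "instances", *args, "--zone", ZONE],
                   check=True)


def set_allocation(value):
    """Rolling-restart style trick: limit shard reallocation while a node is down."""
    requests.put(f"{ES_URL}/_cluster/settings",
                 json={"transient": {"cluster.routing.allocation.enable": value}},
                 timeout=30).raise_for_status()


def wait_for_recovery(timeout_s=600):
    """Block until the cluster is at least yellow with no relocating shards."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        health = requests.get(f"{ES_URL}/_cluster/health", timeout=30).json()
        if health["status"] in ("yellow", "green") and health["relocating_shards"] == 0:
            return
        time.sleep(5)
    raise TimeoutError("cluster did not recover in time")


def resize_node(name, machine_type):
    set_allocation("primaries")                  # keep existing shards where they are
    gcloud("stop", name)
    gcloud("set-machine-type", name, "--machine-type", machine_type)
    gcloud("start", name)
    set_allocation("all")                        # let the returning node pick its shards back up
    wait_for_recovery()


if __name__ == "__main__":
    # One node at a time; run the same loop with the small machine type to scale back down.
    for node in NODES:
        resize_node(node, BIG_MACHINE_TYPE)
```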

How are you sending data to your cluster? Can you shunt the data to a queue (buffer) and then feed your cluster from the queue at a rate that it can handle?
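A minimal sketch of that kind of buffering, assuming a simple in-process queue in front of the plain `_bulk` endpoint; the endpoint, index name, batch size, and pacing are placeholders:

```python
import json
import queue
import threading
import time

import requests

ES_URL = "http://localhost:9200"   # placeholder endpoint
INDEX = "events"                   # placeholder index name
BATCH_SIZE = 500                   # documents per bulk request
PAUSE_S = 1.0                      # pacing between bulk requests

buffer = queue.Queue()             # producers enqueue documents here


def drain_forever():
    """Pull documents off the queue and index them at a rate the cluster can absorb."""
    while True:
        batch = []
        try:
            while len(batch) < BATCH_SIZE:
                batch.append(buffer.get(timeout=PAUSE_S))
        except queue.Empty:
            pass
        if not batch:
            continue
        # Newline-delimited bulk body: one action line plus one source line per document.
        body = "".join(
            json.dumps({"index": {"_index": INDEX}}) + "\n" + json.dumps(doc) + "\n"
            for doc in batch
        )
        resp = requests.post(f"{ES_URL}/_bulk", data=body,
                             headers={"Content-Type": "application/x-ndjson"})
        if resp.status_code == 429:   # cluster is saturated: put the batch back and back off
            for doc in batch:
                buffer.put(doc)
            time.sleep(5)
        time.sleep(PAUSE_S)


threading.Thread(target=drain_forever, daemon=True).start()

# Producers just enqueue; the drain thread controls the indexing rate.
buffer.put({"user": "someone", "value": 42})
time.sleep(2 * PAUSE_S)            # give the sketch's drain thread a chance to run
```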

Tin

Data is sent as it is calculated; a queue is not an option (there actually is a queue right now to work around this problem), because the data should be available as fast as possible: people trigger an action, we do a recalculation, and the results should show up as quickly as possible. The calculation side we can scale dynamically within a few seconds; I'm looking for a similar way to do it for ES.

I don't know if you will get there within seconds, though. ES needs some time for nodes to join and then to rebalance the load, etc.

Autoscaling Elasticsearch, and most other data stores, is generally difficult, as restarting nodes to scale up takes time and a lot of data needs to be transferred when scaling out. Both options also cause extra work for the cluster at exactly the wrong time. If there are any patterns to these user-triggered actions, e.g. they tend to occur during specific periods, it would be better to scale out the cluster in advance so that extra capacity is available during that period and then scale back afterwards.

You may also be able to increase indexing throughput more quickly by temporarily setting the number of replicas for the index to 0, but that trades resilience for performance, which may or may not be an acceptable tradeoff.
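For reference, a minimal sketch of toggling that setting around a heavy load using the index settings API; the endpoint and index name are placeholders:

```python
import requests

ES_URL = "http://localhost:9200"   # placeholder endpoint
INDEX = "events"                   # placeholder index name


def set_replicas(count):
    """Change the replica count for one index via the index settings API."""
    resp = requests.put(
        f"{ES_URL}/{INDEX}/_settings",
        json={"index": {"number_of_replicas": count}},
    )
    resp.raise_for_status()


# Before the heavy bulk load: trade resilience for indexing throughput.
set_replicas(0)

# ... run the bulk load ...

# Afterwards: restore replicas and let the cluster copy the shards back out.
set_replicas(1)
```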

Thanks for your replies. The "within seconds" may be overstating it, but it should happen within the time it takes for the load to pick up. Let's say without scaling it takes 4 hours, and with the higher CPU count it takes 15 minutes. If it now takes 5 or 10 minutes for nodes to join and become fully available, it still brings it down from 4 hours to about 1 hour. But the faster and more dynamic, the better, of course.

Regarding the replicas: those times are already without replicas. On large operations we disable replicas and re-enable them after the initial load is done; if a node goes down during this process, we can always retrigger it.