We are using Elasticsearch with a baseline workload that is easily covered by a few nodes with 4 cores each. A few times a day we get very short, high spikes (~15 minutes) during which we have to process lots of new data. Even when we increase the instance size to 32 cores, we sometimes get 429 rejects because the CPU cannot keep up with the bulk requests (disk, memory, etc. are not the bottleneck). The spikes are not predictable (they depend on user actions), and full performance should be available within a minute of us knowing a spike has started.
We could deal with this by permanently sizing the cluster for peak load, but that drives costs up by a large factor. Are there any recommended patterns for dynamically scaling a cluster like this up (and back down) within seconds, based on CPU usage?
The quickest option I can come up with (untested) is to shut down one node of the cluster, bring it back up with the same data disks on a machine with more resources, wait until it has recovered its cluster state (this should not take too long, since the node was only gone for about a minute), and then repeat for the next node. When load drops, do the same in the other direction. See the sketches below.
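One thing that should help the recovery step: Elasticsearch can be told to delay shard reallocation when a node leaves, so a short restart does not trigger a full shard reshuffle. A minimal sketch using the REST API via `requests`; the endpoint URL is a placeholder:

```python
import requests

ES = "http://localhost:9200"  # placeholder, adjust to your cluster

# Wait 5 minutes before reallocating shards of a node that left the
# cluster, so a quick restart does not kick off a shard reshuffle.
requests.put(
    f"{ES}/_all/_settings",
    json={"settings": {"index.unassigned.node_left.delayed_timeout": "5m"}},
)
```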
- Platform is GCE, but the same pattern should work for any cloud deployment.
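To make the idea concrete, here is a rough, untested sketch of one resize round on GCE, driven from Python via the `gcloud` CLI and the ES REST API. The instance names, zone, machine type, and endpoint are all placeholders, and the allocation-setting dance is just one way to keep the cluster calm while each node is down:

```python
import subprocess
import requests

ES = "http://localhost:9200"  # placeholder coordinating-node endpoint

def resize_node(instance: str, zone: str, machine_type: str) -> None:
    """Stop one data node, change its machine type in place, start it again.
    The persistent disks stay attached, so the node rejoins with its data."""
    # Restrict allocation so the cluster does not start rebalancing
    # while the node is briefly gone.
    requests.put(f"{ES}/_cluster/settings",
                 json={"transient": {"cluster.routing.allocation.enable": "primaries"}})

    subprocess.run(["gcloud", "compute", "instances", "stop", instance,
                    "--zone", zone], check=True)
    # set-machine-type only works on a stopped instance.
    subprocess.run(["gcloud", "compute", "instances", "set-machine-type", instance,
                    "--zone", zone, "--machine-type", machine_type], check=True)
    subprocess.run(["gcloud", "compute", "instances", "start", instance,
                    "--zone", zone], check=True)

    # Re-enable allocation and wait until the cluster is green again
    # before touching the next node.
    requests.put(f"{ES}/_cluster/settings",
                 json={"transient": {"cluster.routing.allocation.enable": "all"}})
    requests.get(f"{ES}/_cluster/health",
                 params={"wait_for_status": "green", "timeout": "120s"})

# Roll through the data nodes one at a time (hypothetical names).
for node in ["es-data-1", "es-data-2", "es-data-3"]:
    resize_node(node, "europe-west1-b", "n1-standard-32")
```

The main open question with this approach is whether stop/resize/start cycles are fast enough to hit the one-minute target, or whether the cluster would need to be overprovisioned slightly so it can absorb the start of a spike while the resize is still in flight.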