Why is first index creation each day slower?

Hello

On various Elasticsearch deployments I am finding that the first time I create a document in a new index each day (or at least the first time after a long idle period), the response time is much higher. In these cases the total response time (measured from the client) is often 5-10 seconds, compared to a few hundred milliseconds when creating a document in an existing index.

If I consecutively create documents in multiple new indexes, I only see increased response times in the first case. If my first document creation of the day is to an existing index it appears normal, but when I first create a new index I still see the increased response time.

I am using AWS's Elasticsearch service with a mix of "t2.small.elasticsearch" and "t3.small.elasticsearch" instances, but the CPU load is consistently low, so I would not expect any increased latency from that. I am seeing this behaviour even on newly created instances with very low document and index counts, albeit to a lesser degree.

Is this expected behaviour? If so, what causes it to only occur once a day?

When you create a new index, the cluster state needs to be updated and propagated to all nodes. t2 and t3 instances have a very limited CPU allocation and can at times get starved, which may slow this process down. The cluster state also needs to be persisted to disk, so slow storage could have an impact as well. The last factor is the size of the cluster state: if you have a lot of indices and shards in the cluster, this can become a bottleneck.

5-10 seconds is, however, very slow, so it might indicate that your cluster is not optimised or is overloaded or under-resourced.

Hi Christian, thanks for the response.

In terms of CPU, my metrics show that it occasionally spikes a little above 20% but is usually far below that, averaging more like 3%. I'm using gp2 SSD disks, which I'd hope should be OK. In the largest case my cluster has around 40 indices and 200 shards.

I'm running the clusters with only a single node in each, which I realise probably isn't ideal, but I had hoped it would be ok since these are quite low volume for the most part.

I've been able to replicate these slow responses in cases where the cluster is completely idle except for my single request to create a document in a new index.

The part that I find most odd is that this only happens on the first index creation, and afterwards it's quite fast. Is there anything in the way that Elasticsearch works internally that could lead to this?

What size are your gp2 EBS volumes? These get IOPS proportional to their size, so small volumes can be very slow and can become the bottleneck.

To try to identify whether the instances or the storage is the bottleneck, I would recommend trying a small m5 instance or faster storage to see which, if either, makes a difference.
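When comparing instance types or volume sizes, it can help to measure the first request against the following ones in the same way each time. The following is only a minimal sketch of such a client-side timing harness: `time_index_requests` and `first_vs_rest` are hypothetical helper names, and `send` stands in for whatever function posts a single document to a given index (it is not part of any Elasticsearch client).

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_index_requests(
    send: Callable[[str], None],
    index_names: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> List[Tuple[str, float]]:
    """Call send(index_name) once per index and record the elapsed seconds."""
    timings = []
    for name in index_names:
        start = clock()
        send(name)  # hypothetical: e.g. an HTTP POST of one document into `name`
        timings.append((name, clock() - start))
    return timings


def first_vs_rest(timings: List[Tuple[str, float]]) -> Tuple[float, float]:
    """Return (latency of the first request, mean latency of the others)."""
    if len(timings) < 2:
        raise ValueError("need at least two timings to compare")
    first = timings[0][1]
    rest = [seconds for _, seconds in timings[1:]]
    return first, sum(rest) / len(rest)
```

Running this against several brand-new index names on each configuration (t3 versus m5, 10GB versus 100GB) gives one pair of numbers per configuration, which makes it easier to see which change moves the first-request latency.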

I believe the volumes are currently 10GB, as the data storage used is not expected to go over that.

Thanks for the suggestions, they sound like good ideas. I will try them over the next couple of days and see what happens.

That would give you 30 IOPS (3 per GB) which is very, very low. I would be surprised if that was not a contributing factor to your problems.
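The arithmetic behind that figure can be sketched as below. The 3 IOPS per GB ratio comes from the reply above; the 100 IOPS floor is an assumption based on AWS's gp2 documentation (which also describes burst credits, not modelled here), so check both constants against the current AWS docs before relying on them. `gp2_baseline_iops` is a hypothetical helper name.

```python
from typing import Tuple


def gp2_baseline_iops(size_gb: int, per_gb: int = 3, floor: int = 100) -> Tuple[int, int]:
    """Return (size-proportional IOPS, baseline after applying the floor).

    per_gb: the 3 IOPS per GB ratio quoted in the thread.
    floor:  assumed minimum baseline for gp2; verify against AWS docs.
    """
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    proportional = size_gb * per_gb
    return proportional, max(proportional, floor)


# The 10GB volume from the thread, and the 100GB volume tried later.
small = gp2_baseline_iops(10)
large = gp2_baseline_iops(100)
```

Either way, a 10GB volume sits at the very bottom of the range, while a 100GB volume gets a baseline of 300 IOPS under the same ratio.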

Based on some more testing I've done, it seems that larger volumes (e.g. 100GB) may help a bit, but not very significantly. Using m5 instances rather than t3 did make a big difference. It's strange, since I don't see anything in particular in the metrics suggesting that the clusters are CPU or memory constrained, but the latencies dropped significantly after changing instance type. However, I'm not sure that it will be practical to use m5 instances in every case where I'm currently using t3.

As a workaround, I think I can avoid using daily indices for these documents. There are only a small number per day, so even if I use a single index I don't expect it to get particularly large. It looks likely that this will stop index creation from slowing down the first request of the day, but I am still seeing some long-ish latencies, probably because I am still using t3 instances.
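On the client side, that change amounts to choosing a coarser index name. A minimal sketch, assuming the index name is built from a prefix and the document date; `index_name`, the `events` prefix, and the date format are hypothetical and not taken from my actual setup:

```python
from datetime import date


def index_name(prefix: str, day: date, granularity: str = "single") -> str:
    """Build the target index name for a document dated `day`.

    "daily" creates a new index every day, "monthly" once a month,
    and "single" always reuses the same index.
    """
    if granularity == "daily":
        return f"{prefix}-{day:%Y.%m.%d}"
    if granularity == "monthly":
        return f"{prefix}-{day:%Y.%m}"
    if granularity == "single":
        return prefix
    raise ValueError(f"unknown granularity: {granularity!r}")


example_day = date(2020, 3, 5)
daily = index_name("events", example_day, "daily")
single = index_name("events", example_day)
```

With "single" (or "monthly"), the index already exists for almost every request, so the index-creation cost is paid rarely rather than once a day.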

If there's anything else I might be missing, I'd welcome more advice! Otherwise I'm going to press on with what I've described above.