Cluster taking a long time to open closed indices

I'm running a 3 x m5.xlarge cluster (all nodes data & master) at version 6.4.2 on EC2, with 9000 baseline IOPS GP2 storage and essentially default config (apart from the obligatory discovery config for EC2). We have a lot of small indices: approximately 43,000 currently, 99.99% of which have a single primary shard + 1 replica and are less than 100 MB in size. However, the vast majority of these remain closed at any given time (as I write this only 150 are open, i.e. approximately 300 active shards); they get opened on demand and then closed after inactivity.

The problem we have is that indices used to open within 2 - 3 seconds (this being the time taken for at least one shard to be allocated), but are now taking about 7.5 - 8 seconds (whether the change was gradual or abrupt we aren't sure). Load on the system from querying and/or indexing is not a factor, because these timings are taken from periods of zero user activity. Initially I thought that the active shard count might be a factor, as at peak times we've had up to 2400 active shards (which I know is a bit high for a 3-node 8GB cluster, but memory pressure has not been an issue at all); however, the index open time remains the same regardless of how many or few shards are active.
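
For reference, the open we're timing is essentially the following (host and index name are placeholders for one of our nodes and one of our indices), measured until at least one shard is active:

    time curl -s -X POST \
      'http://localhost:9200/analysis-12345/_open?wait_for_active_shards=1'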

I have a separate environment with a lower-specced cluster in which I also activated up to 2500 shards, but where the index open time remained steady at around 2 - 3 seconds. The only difference between the two environments that I would consider potentially significant is the total number of indices (12,000 in the smaller environment vs 43,000 in the larger one). However, it seems very odd/unlikely that the total number of indices would affect how long it takes to open a single index when there is no other activity on the system (0% CPU load), as that would imply that opening an index requires iterating over all indices in some way.

Can anyone shed some light on the factors that influence how long it takes to open a closed index and how/why that time might blow out? Is there any logging we could enable to find out exactly where the time is spent?
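
For example, would turning up something like the cluster service logger (I'm guessing at the name here) surface timings for whatever runs during the open?

    # Raise the log level dynamically, on the hunch that the cluster service
    # logs how long its updates take (logger name is my guess)
    curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
      -H 'Content-Type: application/json' -d '
    {
      "transient": { "logger.org.elasticsearch.cluster.service": "DEBUG" }
    }'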

As a side note, I know it's somewhat unusual to have so many indices (closed or otherwise), but that's not up for debate unless it is the sheer quantity of them that is causing degradation in performance.

My first thought is that, because of the number of indices you have, the cluster state is super big, which makes cluster state updates a bit slow.
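
A rough way to gauge it (the JSON is larger than the compressed state that actually gets shipped between nodes, but it's indicative; host is a placeholder):

    # Approximate size of the cluster state serialized as JSON
    curl -s 'http://localhost:9200/_cluster/state' | wc -c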

That could explain the difference between the two platforms.

Something that you might try from 6.6 is frozen indices. That might help (maybe)... https://www.elastic.co/guide/en/elasticsearch/reference/6.6/frozen-indices.html
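
Once on 6.6+, it would look something like this (index name is a placeholder):

    # Freeze an index: it stays open but holds almost no heap
    curl -s -X POST 'http://localhost:9200/analysis-12345/_freeze'

    # Frozen indices are skipped by default; opt in per search request
    curl -s 'http://localhost:9200/analysis-12345/_search?ignore_throttled=false' \
      -H 'Content-Type: application/json' -d '{ "query": { "match_all": {} } }'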

I agree with David that the sheer size of the cluster state almost certainly explains why it takes so long to open and close indices. The cluster state holds information about mappings, indices and shards, and updates to it are run in a single thread to ensure consistency. Each change must also be replicated to and applied on the other nodes in the cluster before it completes, which is why the large number of indices and shards has an impact.
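
If you want to see that queueing for yourself, you can look at the pending cluster state update tasks on the master while an open is in flight (host is a placeholder):

    # Tasks waiting for the single cluster state update thread
    curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'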

Upgrading to version 6.6 and converting all indices from closed to frozen would likely improve this, at least as long as the indices are read-only. You would also no longer need to open/close indices, which should make management much easier.

If you want to make changes to the indices you will need to unfreeze them and then freeze them again, which again will require the cluster state to be updated and might be slow.

If this is still a problem, using a larger number of smaller clusters may help as that will reduce the cluster state size for each cluster.

Scaling a multi-tenant cluster by having a dedicated index per tenant rarely scales well. What is the use case? Why can't you have multiple users share indices to make this more manageable?

Many thanks for the quick reply, Christian & David. I hadn't realised that opening an index required a full replication of the cluster state (and that the state includes all the closed-index information), although on reflection that does make some sense.

I had a quick look at the index freezing feature last week, and as you say I think it may be a good fit for our use case. Changes are possible but rare (and the biggest ones involve re-indexing anyway), so unfreezing is somewhat reasonable, although if we continue to accumulate more and more indices it will take progressively longer to do so.

The reason we have so many indices is that customers will take some input data and run a potentially unique analysis on it that produces a potentially unique set of fields, hence two such analyses can't necessarily share an index mapping. There is also the possibility that a given analysis may need to be deleted, which is doable with bulk delete, but easier if we just need to delete the whole index. This approach was obviously taken on the naive assumption that the number of indices in the cluster would not affect performance, which is now proven false.

We'll likely experiment with 6.6 and also consider our own form of 'freezing', as these indices can be re-created from master data (that exists elsewhere) at any time, and having all of them in ES all of the time is an unnecessary luxury (that we likely can't afford).

While it's true that freezing the indices will make management a lot easier than opening/closing them explicitly, it does mean that they are a little heavier than closed indices. For instance, they appear in the shard routing table and are actively replicated (i.e. if a node fails then we rebuild frozen indices on other nodes just as with normal indices). I think with 43,000 indices this might present a problem.
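
You can check this yourself: frozen shards are listed in the routing table alongside normal ones, whereas closed indices don't show up there at all (host is a placeholder):

    # Frozen indices still have entries here, unlike closed ones
    curl -s 'http://localhost:9200/_cat/shards?v'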

I think it's possible that Elasticsearch does iterate over all indices at some point in the process of opening a closed index. We don't really expect that many indices in a cluster, so we don't try to optimise for it very hard. Closed indices are cheap, but not totally free.

One possibility could be to freeze them more deeply: get the unneeded indices completely out of the cluster by taking a snapshot, and then restore them on demand.
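
Sketching it out, assuming an S3 repository via the repository-s3 plugin (repository, bucket and index names are placeholders):

    # Register a snapshot repository once
    curl -s -X PUT 'http://localhost:9200/_snapshot/analysis_archive' \
      -H 'Content-Type: application/json' -d '
    { "type": "s3", "settings": { "bucket": "my-analysis-archive" } }'

    # Snapshot an index you no longer need, then delete it from the cluster
    curl -s -X PUT 'http://localhost:9200/_snapshot/analysis_archive/analysis-12345?wait_for_completion=true' \
      -H 'Content-Type: application/json' -d '{ "indices": "analysis-12345" }'
    curl -s -X DELETE 'http://localhost:9200/analysis-12345'

    # Restore it on demand later
    curl -s -X POST 'http://localhost:9200/_snapshot/analysis_archive/analysis-12345/_restore?wait_for_completion=true'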

Thanks for the clarification David. Indeed it seems freezing would be useful to us only as an alternative to closed indices, and likely only up to a total index count of around 10,000, beyond which the unfreeze/open duration starts getting a bit unacceptable for us.

I'd completely forgotten about snapshots - that could indeed be a very suitable option for us. Thanks for the tip.
