I'm running a 3 x m5.xlarge cluster (all nodes are both data and master nodes) on Elasticsearch 6.4.2 on EC2, with GP2 storage providing 9000 baseline IOPS and an essentially default config (apart from the obligatory discovery config for EC2). We have a lot of small indices: approximately 43,000 currently, 99.99% of which have only a single primary shard plus 1 replica and are smaller than 100 MB. However, the vast majority of these remain closed at any given time (as I write this only 150 are open, i.e. approximately 300 active shards); they are opened on demand and closed again after a period of inactivity.
The problem is that indices used to open within 2-3 seconds (measured as the time until at least 1 shard has been allocated), but now take about 7.5-8 seconds to open (we aren't sure whether the change was gradual or abrupt). Load from querying and/or indexing can be ruled out as a factor, because these timings were taken during periods of zero user activity. Initially I thought that the active shard count might be a factor, as at peak times we've had up to 2400 active shards (which I know is a bit high for a 3-node 8GB cluster, although memory pressure has not been an issue at all). However, the index open time remains the same regardless of how many or how few shards are active.
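For anyone who wants to reproduce the measurement, here is a minimal sketch of how the open time could be captured. It assumes the cluster is reachable at `http://localhost:9200` without authentication, and it treats "yellow" index health as the point at which the primary shard has been allocated. The names `time_index_open`, `http_send`, and the injectable `send`/`clock` parameters are illustrative only and not part of any client library.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: adjust to your cluster endpoint


def http_send(method, path):
    """Send a bodiless request to the cluster and return the parsed JSON reply."""
    req = urllib.request.Request(BASE_URL + path, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def time_index_open(index, send=http_send, clock=time.monotonic):
    """Open `index` and time how long it takes to reach at least yellow health.

    Yellow health means the primary shard is allocated, which corresponds to
    "at least 1 shard allocated" for a single-primary index.
    """
    start = clock()
    send("POST", "/%s/_open" % index)
    acked = clock()
    # Block server-side until the index is yellow, or until the timeout expires.
    health = send(
        "GET",
        "/_cluster/health/%s?wait_for_status=yellow&timeout=30s" % index,
    )
    done = clock()
    return {
        "open_ack_s": acked - start,      # time until the open request returned
        "until_yellow_s": done - start,   # time until the primary was allocated
        "timed_out": bool(health.get("timed_out")),
    }
```

Splitting the total into the acknowledgement time and the time until yellow should at least show whether the extra seconds are spent before the open request returns or afterwards, during shard allocation and recovery.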
I have a separate environment with a lower-spec cluster in which I also activated up to 2500 shards, and there the index open time remained steady at around 2-3 seconds. The only difference between the two environments that I would consider potentially significant is the total number of indices (12,000 in the smaller environment vs 43,000 in the other one). However, it seems very odd and unlikely that the total number of indices would affect how long it takes to open a single index when there is no other activity on the system (0% CPU load), as that would imply that opening an index requires iterating over all indices in some way.
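One way to put numbers on that difference: opening an index is a cluster state update, and the cluster state holds metadata for every index, closed ones included. The sketch below summarises a parsed `GET /_cluster/state/metadata` response so the two environments can be compared side by side. The function name `summarize_metadata` is made up for illustration, and the JSON byte count is only a rough proxy for what the nodes actually serialise. Whether cluster state size explains the slowdown is just a hypothesis at this point.

```python
import json


def summarize_metadata(cluster_state):
    """Count open/closed indices in a parsed GET /_cluster/state/metadata reply
    and report the size of its JSON form as a rough proxy for metadata volume."""
    indices = cluster_state.get("metadata", {}).get("indices", {})
    counts = {"open": 0, "close": 0}
    for meta in indices.values():
        # Each index entry carries a "state" field of "open" or "close".
        state = meta.get("state", "open")
        counts[state] = counts.get(state, 0) + 1
    return {
        "total_indices": len(indices),
        "open": counts["open"],
        "closed": counts["close"],
        "approx_json_bytes": len(json.dumps(cluster_state).encode("utf-8")),
    }
```

Note that with tens of thousands of indices the metadata response itself can be large, so it is probably best fetched once during a quiet period.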
Can anyone shed some light on the factors that influence how long it takes to open a closed index and how/why that time might blow out? Is there any logging we could enable to find out exactly where the time is spent?
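On the logging side, one untested idea is to raise the log level for the cluster service package and lower the slow cluster state task threshold through the cluster settings API. The sketch below builds and sends such a request. The logger name `org.elasticsearch.cluster.service` and the setting `cluster.service.slow_task_logging_threshold` are my understanding of what 6.x exposes, so treat both as assumptions to verify. The endpoint URL and the function names `build_debug_settings` and `apply_settings` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: adjust to your cluster endpoint


def build_debug_settings(slow_task_threshold="1s", level="TRACE"):
    """Build a transient cluster settings body for more verbose logging of
    cluster state updates (setting names are assumptions for 6.x)."""
    return {
        "transient": {
            # Dynamic logger level for the cluster service package.
            "logger.org.elasticsearch.cluster.service": level,
            # Log a warning for cluster state tasks slower than this threshold.
            "cluster.service.slow_task_logging_threshold": slow_task_threshold,
        }
    }


def apply_settings(body, base_url=BASE_URL):
    """PUT the settings body to the cluster and return the parsed response."""
    req = urllib.request.Request(
        base_url + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the settings are transient they do not survive a full cluster restart, and sending the same keys with `null` values should restore the defaults once the investigation is over.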
As a side note, I know it's somewhat unusual to have so many indices (closed or otherwise), but that's not up for debate unless it is the sheer quantity of them that is causing degradation in performance.