Hello,
I recently upgraded my cluster from 5.5 to 5.6 (I have the same issues on a 6.1 cluster too). Soon after, I ran into two issues:
- DateHistogramAggregation queries took a very long time (related to https://github.com/elastic/elasticsearch/issues/28727); a sketch of the kind of query involved is shown right after this list
- I lost monitoring; one node kept logging "collector [index-stats] timed out when collecting data"
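
To be concrete, here is a minimal sketch of the kind of query that is slow for me. It is not my exact query: the `logs-*` index pattern, the `@timestamp` field, and the host are placeholders.

```python
# Minimal sketch of the slow query type (placeholder index pattern, field and host).
import json
import requests

ES = "http://localhost:9200"

query = {
    "size": 0,
    "aggs": {
        "per_day": {
            "date_histogram": {
                "field": "@timestamp",
                "interval": "1d",            # 5.x/6.x syntax
                "time_zone": "Europe/Paris"  # I query with my local time zone
            }
        }
    }
}

resp = requests.post(f"{ES}/logs-*/_search", json=query, timeout=30)
print(json.dumps(resp.json().get("aggregations", {}), indent=2))
```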
I suspected that I had messed something up with X-Pack during the upgrade, so yesterday I did a full cluster restart. Not only did monitoring come back, but DateHistogramAggregation queries were also much faster (from timing out at 30s down to ~6s).
This morning I checked again: DateHistogramAggregation queries were very slow once more (timing out even on small amounts of data) and monitoring was lost again, with the same logs. It happened around 7 A.M. (Europe/Paris), not long after some index mapping updates.
I dug a little and indeed the cluster-stats API was slow: 1m44s. I read that this could be caused by oversharding, and I was indeed oversharding: 8000 shards across a few hundred time-based indices, on 6 servers with 16GB of heap each. I was way over the "good rule of thumb" of fewer than 25 shards per GB of heap. Because of a mess-up with index templates, even the .monitoring-es-6* indices had several shards.
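
For reference, this is roughly how I measured it (a sketch only; the host is a placeholder and the heap numbers are mine):

```python
# Time the cluster-stats call and compare the shard count to the rule of thumb.
import time
import requests

ES = "http://localhost:9200"

start = time.monotonic()
stats = requests.get(f"{ES}/_cluster/stats", timeout=300).json()
print(f"_cluster/stats took {time.monotonic() - start:.2f}s")

# Total shard count as reported by the cluster itself.
print("total shards:", stats["indices"]["shards"]["total"])

# Rule of thumb: ~25 shards per GB of heap across the cluster.
heap_gb = 6 * 16  # 6 data nodes with a 16 GB heap each, in my case
print("rule-of-thumb budget:", 25 * heap_gb, "shards")
```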
Now I am at fewer than 2000 shards, which is still a lot but well below 25 shards per GB of heap (25 * 96 GB = 2400). I removed the old monitoring indices to start clean.
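
For what it's worth, the kind of index-template fix I applied looks roughly like this (template name and index pattern are made up; 6.x uses "index_patterns" where 5.x used "template"):

```python
# Sketch of a template forcing 1 primary shard for new time-based indices.
import requests

ES = "http://localhost:9200"

template = {
    "index_patterns": ["myapp-logs-*"],  # hypothetical pattern
    "order": 10,
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1
    }
}

resp = requests.put(f"{ES}/_template/myapp-logs", json=template)
print(resp.status_code, resp.text)
```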
=> The cluster-stats API now takes 12s. That's better than 1m44s, but still far more than on my 5.1.6 cluster with 320 shards (0.01s).
And the slowness of DateHistogramAggregations came back.
It's going to be difficult for me to reduce the number of shards any further (I could get to ~1500, but not much less), so the question is: what can I do? Is there a hard number of shards I must not exceed (I have not read of one)? Some tuning? Something to monitor?
Thank you very much,
Regards,
Grégoire