Hello everybody.
A few weeks ago I set up a new Elasticsearch cluster to hold logs coming from our Kubernetes cluster.
Since it was a brand-new 8.17 install, we decided to give the new logsdb format a go.
We started with a basic setup: one week of data retention, the standard rollover at 50GB, 1 primary shard and 1 replica, pushing data to a data stream.
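As a rough sketch of the setup described above (not the exact configuration we used), the ILM policy and index template bodies could look like this. The names `k8s-logs-7d` and `k8s-logs-*` are placeholders, and it is an assumption that the 50GB rollover is expressed as `max_primary_shard_size`; passing `mode=None` gives the standard index format instead of logsdb.

```python
import json


def build_ilm_policy(rollover_size="50gb", retention="7d"):
    """Body for PUT _ilm/policy/<name>: roll over at a size, delete after a retention period."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Assumption: the 50GB threshold applies to the primary shard.
                        "rollover": {"max_primary_shard_size": rollover_size}
                    }
                },
                "delete": {
                    "min_age": retention,
                    "actions": {"delete": {}},
                },
            }
        }
    }


def build_index_template(policy_name, pattern, mode="logsdb"):
    """Body for PUT _index_template/<name>: a data stream with 1 primary shard and 1 replica."""
    settings = {
        "index.lifecycle.name": policy_name,
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1,
    }
    if mode is not None:
        # "logsdb" enables the logsdb index mode; omit the setting for a standard index.
        settings["index.mode"] = mode
    return {
        "index_patterns": [pattern],
        "data_stream": {},
        "template": {"settings": settings},
    }


if __name__ == "__main__":
    # Placeholder names, chosen only for illustration.
    print(json.dumps(build_ilm_policy(), indent=2))
    print(json.dumps(build_index_template("k8s-logs-7d", "k8s-logs-*"), indent=2))
```

Switching back to the standard format then amounts to rebuilding the template with `mode=None` and rolling the data stream over so that new backing indices pick up the change.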
Everything seemed to go pretty smoothly at first, but after a couple of days we started to see some kind of task storm whenever an index was being rolled over. The node holding the primary shard of that index would go to very high CPU (about 80%), the disks would hit 100% utilization, and everything would stop completely for many hours, usually 8 to 10, during which the cluster stopped ingesting and stopped answering any query.
We tried tweaking various settings (disabling dynamic indexing, reducing the number of indexed fields, removing the replica shard...) without any noticeable improvement.
After about a week we gave up and switched back from logsdb to the standard index format. The cluster started working as it should, with CPU never going over 15% and the disks behaving... well... normally.
This is a production cluster, so we can't really change settings to reproduce the issue again, but I can share some of the things we saw while the disaster was happening.
The cluster was overwhelmed with pending _tasks, with every node reporting 8k to 10k pending tasks, mostly related to index commits (sorry, I can't remember the exact task description).
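For anyone who hits the same thing and wants to capture which tasks are piling up, here is a minimal sketch that tallies a `GET _tasks` response by action name. It assumes the default response layout grouped by node (`nodes` -> node id -> `tasks` -> task id -> `action`); the sample data below is made up for illustration and is not output from our cluster.

```python
from collections import Counter


def count_tasks_by_action(tasks_response):
    """Count tasks per action in a GET _tasks response grouped by node."""
    counts = Counter()
    for node in tasks_response.get("nodes", {}).values():
        for task in node.get("tasks", {}).values():
            counts[task.get("action", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample in the assumed response shape, for illustration only.
    sample = {
        "nodes": {
            "node-a": {
                "tasks": {
                    "node-a:1": {"action": "indices:data/write/bulk"},
                    "node-a:2": {"action": "indices:admin/flush"},
                }
            },
            "node-b": {
                "tasks": {
                    "node-b:7": {"action": "indices:data/write/bulk"},
                }
            },
        }
    }
    for action, n in count_tasks_by_action(sample).most_common():
        print(f"{n:6d}  {action}")
```

Saving this kind of summary during an incident would give the exact task descriptions that are missing from this report.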
The root cause may be that the cluster doesn't have very fast disks, just commodity ones - yet the impact we saw (8 hours to roll an index) seems out of proportion, given that with a standard index the disks never go above 50% utilization.
The biggest problem, I think, is that the whole cluster stopped working while this activity was going on (even for other indexes) - no ingestion, no search.
Has anyone had a similar experience?
Can we expect some kind of improvement on this in upcoming releases?
Thanks.