I have a small Elasticsearch cluster with 3 master nodes (2 core, 2 GB t3.small EC2) and 2 data nodes, data-0 and data-1 (2 core, 8 GB m6a.large EC2, 4 GB max heap on the data nodes). The cluster runs on EKS. It has one index (40 primaries, 1 replica) currently holding around 920 million docs with an index size of around 1.9 TB.
The cluster receives continuous live indexing traffic 24/7 at an average of 60 calls/sec and a search query rate of 2 calls/sec.
I upgraded the cluster from 8.3.1 to 8.13.4, and right after the upgrade I saw a huge number of document deletes, which led to a huge number of segment merges. This drove CPU usage to 100% for a good amount of time, during which the cluster could not serve search traffic or index documents properly. I cannot figure out why so many deletes happened.
Here is the sequence of events:
- 16:18: performed `POST /_flush` and `PUT _cluster/settings {"persistent": {"cluster.routing.allocation.enable": "primaries"}}` (live indexing was still going on)
- 16:20: restarted node data-1
- 16:22: data-1 came up, but it took a considerably long time for its 40 shards to get assigned one by one (I have done restarts in the past and shard assignment usually completed in 2-3 seconds, but this time it took around 20-30 minutes). Several delete-rate spikes on the order of >500K docs/sec happened, only on data-1. I didn't notice this at the time, as I was expecting a smooth upgrade since I had already done it in a test environment.
- 16:41: data-0 was restarted
- 16:43: data-0 came back up. The behaviour on data-0 was the same as on data-1: shards took a long time to get assigned and deletes on the order of 500K docs/sec were seen.
- 16:52: CPU on data-1 almost reached 100%
- 17:15: CPU on data-0 also touched almost 100%
- 17:30: restarted data-1 in order to upgrade the instance to 4 core / 16 GB. As soon as it came back up, the high delete spikes continued. The 4-core instance also eventually hit 100% CPU.
- 18:06: we stopped sending indexing traffic to reduce load.
- 18:18: all the deletion spikes ended, and immediately afterwards document merge-rate spikes were seen, touching 300K docs/sec.
- 18:45: merges finished and CPU on both nodes went down immediately.
I reset the `cluster.routing.allocation.enable` setting after the node restarts.
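For reference, these are the calls I made around each restart (the standard rolling-restart steps as far as I know, nothing custom on our side):

```
# before stopping a data node: only allow primaries to be allocated
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# flush so shards have a recent commit point to recover from
POST /_flush

# ... restart the node and wait for it to rejoin ...

# after the node is back: reset allocation to the default
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}
```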
I am assuming the high number of segment merges was due to the high number of deletes, as segments might have crossed the deleted-documents percentage threshold that triggers a merge (we have not configured this percentage manually; everything is at the defaults).
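I haven't verified this yet; next time I plan to check it with something like the following (our merge settings are untouched, and `my-index` stands in for our real index name):

```
# effective merge policy settings, including defaults
GET /my-index/_settings?include_defaults=true&filter_path=**.merge

# live vs deleted docs per segment
GET _cat/segments/my-index?v&h=shard,segment,docs.count,docs.deleted,size

# index-level doc, merge and segment totals (docs.deleted in particular)
GET /my-index/_stats/docs,merge,segments
```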
For disk we use gp3 EBS volumes. In the EBS metrics I can see read throughput touching its maximum threshold (125 MB/s), while IOPS were well under the limits.
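These numbers come from the EBS CloudWatch metrics; I haven't yet checked what Elasticsearch itself reports for disk I/O, or whether peer recovery was the big reader. Roughly what I would look at next time (we run with default recovery settings):

```
# filesystem and disk I/O stats as seen by Elasticsearch (io_stats is Linux-only)
GET _nodes/data-0,data-1/stats/fs

# effective peer-recovery settings, e.g. indices.recovery.max_bytes_per_sec
GET _cluster/settings?include_defaults=true&filter_path=**.indices.recovery
```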
I am not able to figure out:
- Why did so many deletes get triggered? Over the whole month of April we barely saw a delete rate of 500 docs/sec and a merge rate of 3.5K docs/sec, but right after the upgrade we saw around 500K deletes/sec and then 350K merges/sec. I couldn't find anything related to this in my searches or in the breaking changes from 8.3.1 to 8.13.4. Is this something normal that happens regularly? This was my first Elasticsearch upgrade and I am new to a lot of things here. (I have sketched the stats calls I plan to capture next time right after this list.)
- Why did shards take so much time to get assigned after the node restarts? I have restarted nodes in the past during EKS cluster version upgrades and shard assignment happened within seconds. This time it took around 30-40 minutes for each node after the upgrade, and the same again during the instance upgrade from 2 core to 4 core.
- Why did the disk reach its maximum read throughput after the node restarts on the new version? Was that also due to the high number of deletes, or does the upgrade itself involve some large I/O?
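In case it is useful, this is what I plan to capture next time so I have hard numbers for the points above (`my-index` and the node names are ours; the APIs are the standard stats/cat/allocation-explain ones):

```
# per-node indexing (including deletes) and merge counters
GET _nodes/data-0,data-1/stats/indices?filter_path=nodes.*.name,nodes.*.indices.indexing,nodes.*.indices.merges

# explanation for a specific shard that is unassigned or slow to assign
# (shard 0 replica shown here as an example)
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": false
}

# progress of ongoing shard recoveries
GET _cat/recovery/my-index?v&active_only=true&h=index,shard,time,type,stage,source_node,target_node,bytes_percent,translog_ops_percent
```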
Hoping someone here can answer my doubts above or point me to resources where I can find the answers, so I can be prepared with solutions before performing another upgrade.
I wanted to add multiple Grafana screenshots, but I am combining them all into one image as I am not allowed to embed more than one.