Intermittently slow queries after migrating from self-hosted ES6 to ECK 8.7.1 on GCP

Hey all,

We ran Elasticsearch 6 for a few years with good performance, but we decided to upgrade to Elasticsearch 8 using the Elasticsearch operator and the ECK stack. The node machine is the same as before (16 CPUs).

We are noticing that some queries are intermittently slow, and we can't correlate the slowdowns with anything on our dashboards. Our internal Grafana dashboards show some queries taking over 1 second to return, whereas before the migration we had consistently good response times of ~50ms.

One thing we have noticed is that heap usage jumps rapidly between 5% and 60%.
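
In case it helps with comparing numbers, here is a minimal sketch of how the heap figure could be sampled directly from the cluster rather than read off Grafana. It assumes the body of `GET _nodes/stats/jvm` is available as a parsed dict. The function names and the URL passed on the command line are hypothetical, and auth/TLS (which ECK enables by default) is not handled here.

```python
import json
import sys
import urllib.request


def fetch_json(url):
    """GET a URL and parse the JSON body (no auth or TLS handling)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def heap_summary(node_stats):
    """Map node name -> (heap_used_percent, young GC collection count).

    Expects the parsed body of a `_nodes/stats/jvm` response.
    """
    summary = {}
    for node in node_stats.get("nodes", {}).values():
        jvm = node["jvm"]
        # The young-generation collector entry may be absent on some setups.
        young = jvm["gc"]["collectors"].get("young", {})
        summary[node["name"]] = (
            jvm["mem"]["heap_used_percent"],
            young.get("collection_count", 0),
        )
    return summary


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python heap.py http://localhost:9200
    stats = fetch_json(sys.argv[1].rstrip("/") + "/_nodes/stats/jvm")
    for name, (heap_pct, young_gcs) in sorted(heap_summary(stats).items()):
        print(f"{name}: heap={heap_pct}% young_gcs={young_gcs}")
```

Sampling this every few seconds and watching whether the young GC count increases each time the heap percentage drops would show whether the jumps line up with garbage collections.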

From an application perspective, nothing has changed. The write throughput is the same (~4k per second) and the search throughput is the same (~5 rps).

We have also noticed that, periodically, a Lucene Merge Thread takes 100% of the CPU on a node, though we're not sure whether this is related.
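
To check whether the merge activity lines up with the slow queries, one option is to diff two snapshots of `GET _nodes/stats/indices/merges` taken a known interval apart. The following is only a sketch under that assumption: `merge_activity` is a hypothetical helper name, and it takes the two parsed response bodies as plain dicts.

```python
def merge_activity(before, after):
    """Per-node merge activity between two `_nodes/stats/indices/merges` snapshots.

    Both arguments are parsed response bodies; `before` is the older sample.
    """
    result = {}
    previous_nodes = before.get("nodes", {})
    for node_id, node in after.get("nodes", {}).items():
        prev = previous_nodes.get(node_id)
        if prev is None:
            continue  # node was not present in the earlier sample
        now_m = node["indices"]["merges"]
        old_m = prev["indices"]["merges"]
        result[node["name"]] = {
            # Merges that finished between the two samples.
            "merges_completed": now_m["total"] - old_m["total"],
            # Cumulative merge time accrued between the two samples.
            "merge_millis": now_m["total_time_in_millis"] - old_m["total_time_in_millis"],
            # Time merges spent throttled between the two samples.
            "throttled_millis": (
                now_m["total_throttled_time_in_millis"]
                - old_m["total_throttled_time_in_millis"]
            ),
            # Merges still running at the time of the later sample.
            "running_now": now_m["current"],
        }
    return result
```

If the slow queries cluster in intervals where `merge_millis` is high on the node serving them, that would suggest the two are related. The output of `GET _nodes/hot_threads` captured during a slow period would be another useful data point.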

I hope you can help and I can provide any more details that might assist with solving this interesting problem!