Increased Read IOPS usage after upgrade from 8.18.0 to 9.0.1

A reminder of what the two settings here are.

One is the memory size you are allocating to the container.

The other is just the JVM heap size, which is only a slice of that container memory.
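
To make the distinction concrete, here is a minimal sketch, assuming an ECK-style Kubernetes manifest (the structure and numbers are illustrative, not your actual config):

```yaml
# Two separate knobs (illustrative values, ECK-style nodeSet):
nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        containers:
          - name: elasticsearch
            resources:
              requests:
                memory: 28Gi            # 1. memory allocated to the container
              limits:
                memory: 28Gi
            env:
              - name: ES_JAVA_OPTS
                value: "-Xms14g -Xmx14g"  # 2. JVM heap, a slice of the above
```

The heap is only part of the container memory; the rest goes to off-heap structures and, relevant to read IOPS, the OS page cache.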

There's no indication (in what you've shared) that you had or have any heap pressure.

There was indication of more general memory pressure, though. I'm still finding those host-level major page faults a bit strange, like we're missing some other factor here.

I've set the memory request/limit to 40Gi, which is already 12Gi more than before. I have also stopped manually specifying the Xms/Xmx values.
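
(For reference, the change described above might look something like this in an ECK-style manifest; illustrative, not the actual config. With no explicit Xms/Xmx, Elasticsearch since 7.11 auto-sizes the heap from the node's roles and visible memory, which for a data node should land around half of the 40Gi limit.)

```yaml
# Sketch of the described change (illustrative ECK-style manifest).
# No ES_JAVA_OPTS / -Xms / -Xmx: Elasticsearch auto-sizes the heap itself.
podTemplate:
  spec:
    containers:
      - name: elasticsearch
        resources:
          requests:
            memory: 40Gi
          limits:
            memory: 40Gi
```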

Your values are going to double our cloud costs for this monitoring cluster, so we're driving with mine.

Our monitoring clusters have all been working fine before the update.

Due to the fact that:

  • the update cannot easily be rolled back
  • we'd need to check each upcoming release to see whether the issue has been fixed (which I doubt due to the lack of a developer/manager having raised a hand here)
  • we're not really able to find a root cause here

we're slowly phasing out ES after many years of use for logging.

I would recommend setting the heap size to at most 30GB, as you will then benefit from compressed object pointers, which is likely more efficient. If your cluster works well with a 20GB heap and you do not see any evidence of slow or frequent GC in the logs (which seems to be the case), increasing the heap size will not improve performance. This is why I suggested you leave it at 20GB.
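
As an illustration of that suggestion (a sketch, assuming the same ECK-style manifest shape as above), pinning the heap at 20GB inside the 40Gi container would look like this. After startup, compressed oops can be confirmed via GET _nodes?filter_path=nodes.*.jvm.using_compressed_ordinary_object_pointers.

```yaml
# Pin the heap at 20GB as suggested; the remaining ~20Gi of the container
# is left for off-heap usage and the OS page cache. Staying under the ~30GB
# threshold keeps compressed ordinary object pointers enabled.
podTemplate:
  spec:
    containers:
      - name: elasticsearch
        env:
          - name: ES_JAVA_OPTS
            value: "-Xms20g -Xmx20g"
        resources:
          requests:
            memory: 40Gi
          limits:
            memory: 40Gi
```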

Sebastian: there are bits of this thread that are not making sense, some even a little bit irrational.

This (volunteer) forum has a lot of threads, but not many are intellectually that interesting. Yours is one of the few. The intellectually interesting puzzles tend to go one of two ways:

  • quite a lot of back and forth, eventually reaching an understanding, which is maybe a bug, maybe a misunderstanding of how stuff works, sometimes user error, sometimes even "working consistently with its own documentation/expectation", ...
  • the poster gets frustrated and gives up

I don't blame anyone for giving up, btw; we're not walking in their shoes, so we don't know the pressures they are under and/or the choices they have to make. It is disappointing, though, like a mystery novel missing its last few pages.

BUT, a while ago you wrote:

So both the original settings and the self-decided 4x-increased settings went against the clearly documented guidelines. I recall reading that and thinking wow! You were concerned about cost before, and surely that size has (directly or indirectly) significantly more cost. But we were exploring options, so you changed stuff, fine.

Christian then made a suggestion, given after you had decided to allocate 100GB:

to which you responded:

Eh? I was on my hols. I come back to:

His values? I'm not following: 25GB --> 40GB is 15GB, i.e. 60% more. But his input was to set the heap size appropriately given whatever you choose for the memory allocation (which I suspect is the biggest driver of cloud cost).

No-one disputes it's your bus and you are driving. And Christian has a thick skin, thicker than mine, so he again just responds politely, trying to explain why he suggested what he did.

None of that brings us closer to understanding the core issue. I've suggested a few ways you can delve deeper.

But several observations:

  • You did the 8.x --> 9.x major version upgrade, not us. I'm not blaming you for not catching in pre-production testing the problem you seem to have found, but ... in the end you're driving that bus!
  • "Our monitoring clusters" --> plural? I had understood we were considering a single cluster?
  • A lot of people who know Elastic wayyyyy better than me will have read this thread. None of them have weighed in with anything like "Ahhh, looks like you might have hit this bug" or similar. This suggests to me there is a purely local factor at play here. The fact you had the unusual JVM heap-size ideas lends support to that.
  • Many of the Elastic staff who follow here will have access to the paying-customer support tickets; if this is a known issue that other customers have raised, I would hope they would share that. I have no reason to think they wouldn't have done so, since they have shared info on similar situations in the past.
  • "due to the lack of a developer/manager having raised a hand here" <-- This is a bit cheap, probably borne out of frustration, but still. If you are a paying customer, you should open a support ticket. This is a volunteer forum.
  • "Our monitoring clusters have all been working fine before the update" <-- I asked about this before, and you said they were still working fine after the upgrade.