Haha, true that, but how do I confirm that it's actually an IOPS issue?
Well, if the initial claim:
is not true, then IOPS is automatically a bottleneck.
The rest of my comments were based more on my experience, which is that, in a corporate environment specifically, or from a vendor, you often get told things that are marketing, or specs, or ... but not actually true in reality.
There seems to me to be evidence that you are in fact IO bound.
You wrote in that PM exchange (slightly edited):
"Increasing cores from 20 to 32 didn’t help at all, we still capped at 10k upserts per second. Although, reducing the writes load by 90%, we were able to reduce the duration of the cpu spike. Previously the cpu used to be spiked up for hours, but now it down to a couple of minutes, but for us, even those 5-10 minutes become critical."
Maybe I am wrong, but this smells IO bound to me.
The other big brains on the forum might have a different view.
The query is an upsert, if that is what you intend to ask. We currently support a maximum of 10k writes per second; however, on reaching 10k we hit 95% CPU on the hot shards/nodes, which causes a drop in further writes along with higher read latency. What we aim to achieve is a consistent 10k writes with no CPU spikes, similar to what we experience on the non-hot nodes.
For hot nodes, 10k writes is the peak.
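For reference, here is a minimal sketch of what a routed bulk upsert like that can look like; the index name, document id, routing value and field below are made-up placeholders, not the actual ones from this cluster.

```bash
# Hypothetical example: a single _bulk "update" action with doc_as_upsert,
# so Elasticsearch updates the document if the id exists and inserts it otherwise.
# The routing value in the action line decides which shard receives the write.
curl -s "localhost:9200/my-index/_bulk" \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary $'{"update":{"_id":"doc-1","routing":"customer-42"}}\n{"doc":{"counter":7},"doc_as_upsert":true}\n'
```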
That sounds about right, but to be frank I have very little knowledge about IOPS at the moment; I'll need to study up on that. In the meantime, can you please help me with some debugging steps I can run on this cluster to confirm that it's an IOPS issue?
Could you please answer my questions around the query load? I just want to check that the routing, which is primarily designed to help at query time, is actually necessary.
I didn't completely understand your question. Are you asking if we have a specific use case for this custom routing key?
I am asking for details about how you query and use the data in the cluster, not how you are updating it. This goes back to diving a bit deeper into the issue that was raised early on (and which, as far as I can see, might have been dropped).
Routing is primarily used to make querying more efficient, as only a limited number of shards need to be queried to serve each request. As this was set up a long time ago, I was wondering whether there is still a strong argument supporting this design decision.
We have a composite aggregation read query use case and a scroll/slice use case.
If we don't use custom routing, the requests will go to all the shards, which can lead to too many open scroll contexts per node; hence we require routing.
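For context, this is roughly what the difference looks like on the query side; the index name, routing value and field are made-up placeholders.

```bash
# Hypothetical example: with ?routing=... only the shard(s) that the routing
# value hashes to are searched; drop the parameter and every shard of the
# index gets queried (and, for scroll/slice, holds a search context).
curl -s "localhost:9200/my-index/_search?routing=customer-42" \
  -H 'Content-Type: application/json' \
  -d '{"query":{"term":{"customer_id":"customer-42"}}}'
```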
@RainTown / @Christian_Dahlqvist could you please help me with this?
If there were a /usr/bin/do-i-have-an-iops-issue command, I'd have suggested you run it ages ago. And even if it existed, it might return "No", or even "Are you asking the right question?"!
I am waiting to see the iostat data from when you have "95% CPU usage", covering a semi-extended period, like a couple of minutes.
Not just @Christian_Dahlqvist but others too are welcome to weigh in with new/different/better ideas.
Sure, I’ll add that iostat data as soon as I get the window.
I personally think breaking out the hot IDs into a separate index, with a good number of primary shards and no custom routing, on a separate subset of nodes in the cluster would be the best way to resolve the issue. As you have however ruled that out, I have a hard time seeing anything but storage performance being the bottleneck. Maybe others have some suggestions though.
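For what it's worth, a rough sketch of that kind of split is below; the index name, shard counts and node attribute are assumptions (the attribute would have to be set as node.attr.hot_ids: true on the chosen subset of nodes), not settings from this cluster.

```bash
# Hypothetical sketch: a dedicated index for the hot IDs with more primaries,
# default hash-on-_id routing (no custom routing key), and allocation pinned
# to nodes started with node.attr.hot_ids=true.
curl -s -X PUT "localhost:9200/hot-ids-v1" \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "number_of_shards": 15,
      "number_of_replicas": 1,
      "index.routing.allocation.require.hot_ids": "true"
    }
  }'
```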
The output of iostat -x during a period of heavy load, when you are experiencing the issues, is exactly what we want to see. Please note though that every storage system and node has some limit, and if your storage is indeed very fast and performant there may not be an easy solution that avoids rearchitecting the sharding.
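Something along these lines would capture that window, assuming sysstat is installed; the interval, sample count and output path are just suggestions.

```bash
# Hypothetical sketch: extended device stats (-x) with timestamps (-t),
# sampled every 5 seconds for 120 samples (~10 minutes), written to a file
# so the portion covering the CPU spike can be shared here afterwards.
iostat -xt 5 120 > /tmp/iostat_$(hostname)_$(date +%Y%m%d%H%M).log
```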
Cool. While we wait, you could also enquire about the specifics of the I/O device that shows up as vdb in your Linux VMs. There was this exchange:
One possibility is that it's a virtual disk carved from a large/huge VMware storage pool, backed by a big-brand storage vendor, which is usually good. But maybe all 60 of your data node VMs have a virtual disk from exactly the same storage pool (just speculation).
Anyways, your 15 hot VMs are all writing to their "looks like a local disk" disk at pretty much the same time, and the other 45 too, but less so due to the skew. Depending on what that storage actually is, this might be stressing the storage, or stressing something else. It will certainly be "doing the best it can", but it's possible you need to know the specific details here. So, if there is a "Storage Management" team, or a "VMware team", ask them for specifics and tell them your cluster is very IOPS dependent. If they say "yeah, we noticed!" that would tell us something in itself.
EDIT: Now I recall that /dev/vda and /dev/vdb probably mean it's using a paravirtualized device, which then maps to a native (local) device on the host. So this may all be moot. The output of lshw -C disk would be nice to know, if lshw is installed.
Sure, I've re-run the command; hopefully we catch the spike this time.
*-virtio2
     description: Virtual I/O device
     physical id: 0
     bus info: virtio@2
     logical name: /dev/vda
     size: 10GiB (10GB)
     capabilities: gpt-1.00 partitioned partitioned:gpt
     configuration: driver=virtio_blk guid=3cf094a2-5168-9444-b870-5ddad3b6327a logicalsectorsize=512 sectorsize=4096
*-virtio3
     description: Virtual I/O device
     physical id: 0
     bus info: virtio@3
     logical name: /dev/vdb
     size: 446GiB (478GB)
     capabilities: partitioned partitioned:dos
     configuration: driver=virtio_blk logicalsectorsize=4096 sectorsize=4096
@RainTown / @Christian_Dahlqvist this is the drive we use: https://download.semiconductor.samsung.com/resources/brochure/PM1733%20NVMe%20SSD.pdf
I might be wrong, but don't you think that if it were a "noisy neighbour" problem, all 60 nodes would have experienced a CPU spike, and not only the 15?
Since the write IOPS figure is approximately 135,000 for this drive, and our total indexing rate is 50k at peak, don't you think we should be good with respect to IOPS?
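One hedged caveat: documents indexed per second and device write IOPS are not the same number, since translog fsyncs, segment merges and replica writes all add I/O on top of the raw document count. So the w/s and %util columns that iostat reports for vdb during the spike are the figures worth holding up against that spec sheet. A minimal way to watch just that device, again assuming sysstat is installed:

```bash
# Hypothetical sketch: device-only (-d) extended stats (-x) for vdb, every 5 seconds;
# compare the observed w/s (write IOPS) and %util against the drive's rated figure.
iostat -dx vdb 5
```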
Good morning
Should be good - yes.
Are good - TBD.
Yes, but this depends on the skew. If 15 nodes need 10k IOPS and the other 45 nodes need 100 IOPS, then the other 45 won't spike. But that was based on a false conjecture, so it's moot now.
So it's not direct pass-through to the actual device? I'm no VMware expert, but is this the best way to do it (someone else might know)?