If you have 15 hot shards and you know which ones they are, you might benefit from monitoring their distribution and rerouting them via a script when required, to ensure they stay spread out (a rough sketch of such a script is below). It is not elegant, but it might reduce overload on individual nodes.
This only means that if you happened to have 3 hot shards on the same node, your IO might be more efficient, and you'll see more raw IOPS on that node as the node can use different threads.
But the main issue will still boil down to the skewed routing keys, which means only 15 threads (across the whole cluster) are doing most of the IO. It's this limited concurrency that creates the bottleneck; the iotune settings probably just bound that bottleneck further. If you are able to remove or increase those limits, I hope you will see some, maybe substantial, improvement. The deduping helps, as lots of updates is pretty much the worst-case scenario here, so it's important to validate (as per @Christian_Dahlqvist) that it's doing what you think it's doing.
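For the monitoring + reroute idea, a rough sketch of what such a script could look like, purely as an illustration (the index name, threshold and the idea of keeping a "hot index" list by hand are assumptions, and nodes holding zero hot primaries won't show up in the counts):

```python
# Rough sketch only: spot hot-shard pile-ups and print a suggested reroute.
from collections import Counter
import requests

ES = "http://localhost:9200"
HOT_INDICES = ["listings"]  # placeholder: indices whose shards you consider "hot"

def hot_primary_locations():
    """Return (index, shard, node) for started primaries of the hot indices."""
    resp = requests.get(
        f"{ES}/_cat/shards/{','.join(HOT_INDICES)}",
        params={"format": "json", "h": "index,shard,prirep,state,node"},
    )
    resp.raise_for_status()
    return [
        (s["index"], int(s["shard"]), s["node"])
        for s in resp.json()
        if s["prirep"] == "p" and s["state"] == "STARTED"
    ]

shards = hot_primary_locations()
per_node = Counter(node for _, _, node in shards)
print("Hot primaries per node:", dict(per_node))

# If one node holds noticeably more hot primaries than another, print a
# _cluster/reroute "move" body to review rather than applying it blindly.
busiest, busiest_count = per_node.most_common(1)[0]
quietest = min(per_node, key=per_node.get)
if busiest_count - per_node[quietest] > 1:
    index, shard, _ = next(s for s in shards if s[2] == busiest)
    print("Suggested POST _cluster/reroute body:", {
        "commands": [{"move": {
            "index": index, "shard": shard,
            "from_node": busiest, "to_node": quietest,
        }}]
    })
```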
Yeah, I've not been able to generate the load and verify that using the commands specified by Christian; however, there are other metrics from which I can say with 99% confidence that the de-dup layer is working fine. I've compared the number of writes that used to happen before the de-dup layer and after: there is a 90% reduction in the number of writes, and the number of high-CPU instances has also dropped from 8-10 times daily to at most 1-2 occurrences a day.
However, most likely, by tomorrow I should have the load test ready along with the iostat results.
If the current deduplication solution reduced the write/update load by 90%, that is indeed a great reduction and a sign that it is working, at least to some extent. It does not necessarily prove that it is working as you specified, though. If there were a mistake in the logic or implementation and a correct implementation could have reduced the writes by 95% (instead of 90%), that would have halved the number of writes you are experiencing now, so it is in my opinion worth verifying.
Damn, that's a very fair point. As an additional step, I'll also revisit our de-duplication logic to see if we can fine-tune that layer as well… I didn't realise that every additional % we get beyond 90 would have such a major impact.
If you have a hot document that receives 10 updates per second that are sent to the deduplication layer, a 90% reduction still results in one update being processed per second. If instead the logic worked and only one write were generated per 30 seconds, that would be a reduction of 99.667% (300 updates reduced to 1). The thing is, I believe you only need one hot document like this per shard to cause the problem, as that is enough to trigger new small segments being created, so even if the logic worked for most documents it could still have an impact.
You are right, and this is how our logic works. When I said 90% for the 30-second window, I meant reducing the total number of writes (cold + hot) by 90%. What this essentially means is that the total writes per document are at most 1 in 30 seconds, and since cold documents don't receive multiple duplicate updates, their % reduction does not play a major role here. It's the hot documents that are contributing to this 90% goodness factor, so we can say that 99.667% is what we get for the hot data.
If I have to explain the de-dup layer in the simplest terms, it's just: "At most 1 update per document can pass through this layer per 30 seconds, even if we receive 500 updates in 30 seconds for that document."
@RainTown / @Christian_Dahlqvist I was thinking of creating 2 VMs, one with a 320 GB SSD and another with a 640 GB SSD. Given that we get 100 IOPS per GB, I'll try to simulate load on both of them and, using fio (Flexible I/O tester), see if we are hitting the IOPS limit. What do you guys think? (If the same load passes through the 640 GB machine with a smaller CPU % spike, then we can say that it's an IOPS issue and we need faster disks?)
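Roughly what I have in mind for the comparison, just to be concrete (the file path, job size, runtime and the expected ceilings below are placeholders, not a tuned benchmark):

```python
# Rough sketch: run the same fio job on each VM and compare achieved write IOPS.
import json
import subprocess

def run_fio(test_file: str) -> float:
    """Run a small random-write fio job and return the achieved write IOPS."""
    cmd = [
        "fio", "--name=iops-check",
        f"--filename={test_file}",
        "--rw=randwrite", "--bs=4k", "--iodepth=32",
        "--ioengine=libaio", "--direct=1",
        "--size=4G", "--runtime=60", "--time_based",
        "--output-format=json",
    ]
    out = subprocess.run(cmd, capture_output=True, check=True, text=True).stdout
    return json.loads(out)["jobs"][0]["write"]["iops"]

# Run this separately on the 320 GB and the 640 GB VM: if the 100 IOPS/GB cap
# is really what bites, the 640 GB VM should top out roughly twice as high
# (~64k vs ~32k) under the same job.
print("write IOPS:", run_fio("/data/fio-testfile"))
```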
I would try to verify that the deduplication logic is working as intended first. It may have a seemingly low probability of being the cause of the issue, but if it did turn out to be incorrect, fixing it could have an order of magnitude larger gain than perhaps doubling I/O performance.
I am not saying I believe it is an issue, but I prefer to tackle low-hanging fruit with potentially large impact first.
Feel free to explain it in a more complex way!
The important thing is that it's accurate.
My simple understanding says: if you change that 30s window to a 60s window, you'd go from 2 writes/60s to 1 write/60s for "often updated" documents. If that's almost all the updates, then that's a decent win.
But actually I thought you said you were doing the writes every second, with a sliding 30s "lookback window", or similar? If so, a hot doc that gets an update every single second would still get updated every single second, but with refresh set to 30s a lot of those updates are never even seen by a search.
If I remember correctly there was a change a number of major versions ago where the write path was simplified. I recall this leading to a refresh being performed whenever a document in the transaction log is updated, which is why frequent updates are expensive. This overrides the refresh interval.
If this is no longer the case or if I recall incorrectly I am sure someone will correct me.
The coarse model, a certain number of IOPS per GB, doesn't really fit your use case.
It would, I guess, be interesting to see fio numbers for the "320GB" and "640GB" cases, to see if that kvm/qemu setup section actually "works as designed". But, IMHO, the easier test is to just remove that section entirely and test that. Try driving with the handbrake off.
Definitely, I have that on my checklist. As soon as I'm able to successfully generate the load, this is one of the things I'll make sure to verify.
Okay, I see there is a lot of confusion around how the de-dup layer is implemented; I guess I'm not explaining it well enough. Let me re-explain the whole dedup layer, this time in full depth.
This is the design - here listingId is our _id in Elasticsearch as well as the key in Redis on which de-dup happens. (Note that the design doc says 60 seconds, but we went ahead with 30 seconds because we cannot breach the SLAs.)
- Let's assume 3 listingIds, LST_A, LST_B, LST_C, for which we are getting an update request every 1 second. Let's say the initial requests were received at time T1.
- The polling time for these entries will be set to T1 + 30 seconds. Any updates coming for these listings between T1 and T1 + 30 will not cause an update in Redis.
- At T1 + 30 seconds, the spout will pick the entries from Redis Queue 1 and put them into the processing Redis Queue 2. After that, any updates coming for LST_A, LST_B, LST_C will be given a polling time of 30 seconds from then, and this process continues.
- So in a nutshell, every 30 seconds we can have at most 1 update per listingId (_id in Elasticsearch). A rough sketch of this in code follows below.
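To make the mechanics concrete, here is a minimal sketch of the scheduling part only, assuming a Redis sorted set for Queue 1 and a plain list for Queue 2 (the key names are made up, and payload handling plus the actual Elasticsearch write are deliberately left out):

```python
# Minimal sketch of the dedup scheduling: at most one pass-through per
# listingId per 30-second window. Key names are illustrative only.
import time
import redis

r = redis.Redis()
WINDOW = 30  # seconds

def on_update(listing_id: str) -> None:
    """Called for every incoming update. Only the FIRST update in a window
    schedules a polling time; later updates within the window change nothing."""
    poll_at = time.time() + WINDOW
    # ZADD ... NX: adds only if the listing is not already scheduled (Queue 1).
    r.zadd("dedup:queue1", {listing_id: poll_at}, nx=True)

def spout_tick() -> None:
    """Move every listing whose polling time has passed from Queue 1 to the
    processing queue (Queue 2); downstream workers consume Queue 2 and issue
    at most one Elasticsearch update per listing per window."""
    now = time.time()
    for listing_id in r.zrangebyscore("dedup:queue1", 0, now):
        r.rpush("dedup:queue2", listing_id)
        r.zrem("dedup:queue1", listing_id)
        # The next update for this listing goes through on_update() again and
        # gets a fresh polling time of now + 30s, so the cycle continues.
```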
Since you mentioned versions, just to point out, we are currently on ES 7.17… do you reckon there is any goodness to be had if we go to version 8.x or 9.x?
I don't think we can remove the 100 IOPS/GB configuration. (Hope you are talking about this config, or is it something else?)
I am not aware of any changes that would have an impact on this scenario, but would still recommend you upgrade to a supported version before you fall too far behind.
This forum is for elasticsearch issues. That aspect of your problem, or rather of troubleshooting your problem, is not an elasticsearch issue; it's a company policy issue.
Request a VM with the maximum-size (3.8TB?) virtual disk. That way you get the maximum "IOPS" that the specific rule allows. In fact, better still would be to use the SR-IOV tech the drive expressly supports, but that option is also not open to you. Those are organizational issues.
The test with a massive drive, hence max IOPS, is useful. You might see fantastic numbers with fio. But remember you likely have a single hot IO thread per hot shard. I looked for ways within kvm/qemu to confirm/demonstrate this inside the VM, but could not find anything definitive. I am presuming you don't have access to the host.
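The closest thing I could suggest from inside the guest is plain per-thread IO accounting from /proc; a rough sketch (assuming task IO accounting is enabled and you can read the Elasticsearch process's /proc entries, e.g. as root) that would at least show whether a handful of write threads dominate:

```python
# Rough sketch: list a process's busiest threads by cumulative bytes written,
# using /proc/<pid>/task/<tid>/io and /proc/<pid>/task/<tid>/comm.
import os
import sys

def thread_write_bytes(pid: int):
    results = []
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/comm") as f:
                name = f.read().strip()
            with open(f"/proc/{pid}/task/{tid}/io") as f:
                io = dict(line.split(": ") for line in f.read().splitlines())
            results.append((int(io["write_bytes"]), name, tid))
        except (FileNotFoundError, PermissionError):
            continue  # thread exited, or we lack permission
    return sorted(results, reverse=True)

# Usage: python3 thread_io.py <elasticsearch-pid>
for written, name, tid in thread_write_bytes(int(sys.argv[1]))[:10]:
    print(f"{written:>15}  {name}  (tid {tid})")
```

These counters are cumulative, so taking two snapshots a minute apart and diffing them says more than a single read.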
EDIT1: when poking around trying to understand, I came across a few posts saying to use virtio-scsi, not virtio-blk, though without clear consensus. The "why" boiled down to virtio-blk == "here's a disk, send it requests" vs virtio-scsi == "here's a controller, send it commands".
EDIT2: Another test you could do is actually reduce the vCPUs in a VM. Going 20 -> 32 made no significant difference, right? I expect 20 -> 10 might make little real-world performance difference either.
You've read my mind, aha. This was one of the steps I wrote down: to check whether we actually require 20 cores or are just wasting 10 of them.
You're suggesting the same, right? Going to 10 cores won't give performance benefits, but it will reduce the number of cores we might be wasting, correct? Or am I getting it wrong, and moving to 10 cores can somehow improve performance?
