Any update on this? Did it give the expected results?
How did you go about verifying it?
If I was doing this I would consider creating a query that retrieves a set of documents that are expected to have a high update frequency and periodically, e.g. every N minutes, run it through a script using cron. The script would run the query with a fixed sort order (to ensure the same set of documents is retrieved where possible) while returning the document _id and _version. It would store this on disk so that version numbers for the documents can be compared between invocations, and whenever it finds that a version number has increased by more than 2*N compared to the previous invocation it would report this in a log file.
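Something like this minimal sketch, assuming the elasticsearch-py client; the endpoint, index name, query, sort field, paths and threshold are all illustrative placeholders, not anything from your setup:

```python
#!/usr/bin/env python3
"""Periodic version-check sketch, intended to be run every N minutes via cron."""
import json
import logging
from pathlib import Path
from elasticsearch import Elasticsearch

N_MINUTES = 5                                   # cron interval (placeholder)
STATE_FILE = Path("/var/tmp/es_version_check.json")
logging.basicConfig(filename="/var/tmp/es_version_check.log", level=logging.INFO)

es = Elasticsearch("http://localhost:9200")     # placeholder endpoint

# Fixed sort order so the same documents come back on every invocation.
resp = es.search(
    index="my-index",                           # placeholder index
    query={"term": {"routing_key": "C"}},       # docs expected to update frequently
    sort=[{"created_at": "asc"}],               # placeholder stable sort field
    version=True,                               # include _version in the hits
    size=100,
)
current = {hit["_id"]: hit["_version"] for hit in resp["hits"]["hits"]}

previous = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

for doc_id, version in current.items():
    prev = previous.get(doc_id)
    # More than 2*N version increments between invocations suggests unexpected updates.
    if prev is not None and version - prev > 2 * N_MINUTES:
        logging.warning("doc %s: version jumped from %d to %d", doc_id, prev, version)

STATE_FILE.write_text(json.dumps(current))
```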
Hey guys, we carried out one more experiment…we installed the XFS filesystem on one of the machines and ext4 on the other, and the results were almost identical except for the “search latency” occurring during writes. Here the XFS filesystem shows nearly 0.5ms latency compared to 5ms latency on the ext4 cluster (roughly 10 times better, it seems).
Any thoughts?
I’ve also collected the iostat and sar command outputs for all of these stress tests if you wish to deep dive.
On one of the production nodes, or one of the 2 test systems?
It is an interesting find, though I'm a bit surprised by a 10x difference. Check the mount options on both. If you are just comparing with synthetic load on test systems then ... it would be better to test with real load.
It is certainly true that XFS and ext4 are different filesystems with different internals, and they will have different performance characteristics, even more so under some specific load patterns. Maybe you have hit a sweet spot for XFS over ext4. ext4 and XFS both journal metadata, but XFS originally came from SGI and IIRC it (XFS) was designed with large SMP systems in mind.
The sar/iostat output would not help me particularly to confirm anything here.
Yeah, well, this type of testing is pretty hard to do, and often the hardest bit is to have something which is truly representative of your real world. IT is also full of horror stories which have an "it worked perfectly, and great, in staging/pre-production" near the start of the story.
Gaining deeper insight would likely require low-level instrumentation/tools, eBPF?, to observe IO behavior across the full stack. Given the additional complexity introduced by virtualization, it’s not obvious to me that such an investigation would be worth the effort.
XFS vs ext4 for elasticsearch is a nice topic, maybe someone from Elastic would be interested to do a deep dive, on a blog or similar. It did not take me long to find a 3rd party blog claiming XFS was better than ext4 for elasticsearch, and others suggesting to avoid XFS due to risks of kernel deadlocks/other issues. Admittedly both were from years ago.
BUT as I've said a few times in this thread, your design choice still seems your main limiter.
As I stated earlier, I do not think there are any major gains to be had from tuning the storage layer as you seem to have good performance, so I will stay away from that line of investigation. If there are no issues in your indexing layer (see my previous post), I suspect you will need to make significant changes in order to alleviate or resolve the issue.
You are right, we are done with this sort of deep analysis and beating around the bush…it’s now time to change the design!
Was considering exploring a few options -
Increasing the routing partition size so that data for the same routing key is split across a larger number of shards. This will help prevent concentrated writes during peak load, however it might increase read latencies (will test this out; see the sketch after this list).
Implementing salt routing on the highly skewed routing keys.
Creating a separate index altogether for the highly skewed routing keys.
I will try to evaluate the feasibility of each of these approaches, and how we would manage if, say, another routing key becomes hot tomorrow.
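For the first option, a minimal sketch of the relevant setting, assuming the Python client (index name and values are illustrative): `index.routing_partition_size` is a static setting, so it has to be set at index creation, must be greater than 1 and less than `number_of_shards`, and the index then requires `_routing` to be marked as required in the mappings.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")     # placeholder endpoint

# Values are illustrative: each routing key is spread over 4 of the 12 shards.
es.indices.create(
    index="my-index-v2",                        # placeholder name
    settings={
        "index.number_of_shards": 12,
        "index.routing_partition_size": 4,
    },
    mappings={"_routing": {"required": True}},  # required for partitioned routing
)
```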
@Christian_Dahlqvist / @RainTown Since you guys have much more technical knowledge than me, I’ll be happy to hear your thoughts and insights on the above approaches, and also whether you have any other approaches in mind. Thanks!
This approach spreads the data of all IDs across a larger number of shards, while it is really only the busy ones that need this. You stated you experienced issues with open contexts when no routing was in place, and I suspect this would increase the risk of that reoccurring while likely just moving the needle a bit. This sounds like a band-aid solution to me.
I am not sure how this would help with the situation. Could you please clarify and elaborate?
I think this is the best approach as it allows you to tune access and partitions separately for the two categories. It may be possible to reduce the routing partition size for the current index holding the non-busy IDs, and thereby reduce the number of open contexts, while giving the busy index a larger number of primary shards and avoiding routing for it completely.
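A hedged sketch of what that split could look like with the Python client; the index names and shard counts are purely illustrative, not a recommendation:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")     # placeholder endpoint

# Non-busy IDs: keep routing by key so each key touches fewer shards
# and fewer search contexts are opened per query.
es.indices.create(
    index="data-regular",                       # placeholder name
    settings={"index.number_of_shards": 6},
    mappings={"_routing": {"required": True}},
)

# Busy IDs: more primary shards and no custom routing at all, so writes
# for a hot key are spread across every shard.
es.indices.create(
    index="data-hot-keys",                      # placeholder name
    settings={"index.number_of_shards": 12},
)
```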
I did not see any update on the testing of the accuracy of your update “deduplication”. It is worth noting that if this was an issue it would affect all of the options you are exploring as well as the current solution.
You’re right, what if we also implement PIT with search_after instead of scrolling? That should take care of the open scroll context issue?
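A minimal sketch of PIT with search_after replacing a scroll, assuming the Python client; the index, query and page size are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")     # placeholder endpoint

# Open a point in time instead of a scroll context.
pit = es.open_point_in_time(index="my-index", keep_alive="1m")   # placeholder index
pit_id = pit["id"]

search_after = None
while True:
    kwargs = {"search_after": search_after} if search_after is not None else {}
    resp = es.search(
        size=1000,
        query={"term": {"routing_key": "C"}},   # placeholder query
        pit={"id": pit_id, "keep_alive": "1m"},
        # _shard_doc is an efficient tiebreaker sort available with a PIT.
        sort=[{"_shard_doc": "asc"}],
        **kwargs,
    )
    hits = resp["hits"]["hits"]
    if not hits:
        break
    # ... process hits ...
    search_after = hits[-1]["sort"]
    pit_id = resp.get("pit_id", pit_id)         # the PIT id can change between calls

es.close_point_in_time(id=pit_id)
```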
So I’ll elaborate on this with the help of an example. Let’s consider there are 3 routing IDs, “A”, “B” and “C”, of which “C” is the hot routing key causing disruption. We then maintain a config map which stores “C”, and during incoming writes it bifurcates “C” into “C1”, “C2”…“C10” and then routes the data to ES accordingly. While firing get queries, we make use of a similar concept.
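A minimal sketch of that write/read bifurcation in the application layer; the key names, partition count and the salting scheme (a deterministic hash of the document id, so single-document gets can recompute the same routing) are assumptions for illustration, not your actual design:

```python
import zlib

# Config map of hot routing keys and how many "salted" partitions each gets.
HOT_KEYS = {"C": 10}                            # illustrative

def write_routing(key: str, doc_id: str) -> str:
    """Routing value to use when indexing (or getting) a single document."""
    parts = HOT_KEYS.get(key)
    if parts is None:
        return key                              # normal keys route as before
    # Deterministic salt from the doc id, so gets can recompute it.
    salt = (zlib.crc32(doc_id.encode()) % parts) + 1
    return f"{key}{salt}"                       # e.g. "C" -> "C7"

def read_routings(key: str) -> str:
    """Comma-separated routing list to pass on search requests for a key."""
    parts = HOT_KEYS.get(key)
    if parts is None:
        return key
    return ",".join(f"{key}{i}" for i in range(1, parts + 1))   # "C1,...,C10"

# Usage sketch with the Python client (index and document are placeholders):
# es.index(index="my-index", id=doc_id, routing=write_routing("C", doc_id), document=doc)
# es.search(index="my-index", routing=read_routings("C"), query={...})
```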
True, do you see any challenges in this? And any insights you would like to offer around this design?
The thread is getting long, so if I missed some post please let me know. As far as I can tell you have only stated that it should work, and you seem very reluctant to actually verify that this is the case even though it could have a significant impact on performance. I have only ever deployed solutions into production that, in theory and according to tests, should work, but that has unfortunately not completely prevented bugs from occasionally creeping in over the years. Some of these have been subtle concurrency bugs that were hard to detect in testing and had no impact until a high-volume production scenario.
I have provided the feedback I can on this as I do not know much about the actual use case.