It's not 1% if you look closely, it's much more. Of course, I understand that ES isn't a DB, but why is it so slow ((
A lot more than 1%? Jeez.
So, if you do a simple dedup outside of ES, it'll likely be much faster at indexing. But you're still limited by how many of your index commands are actually updates to existing documents.
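To make that idea concrete, here is a minimal sketch of what an in-client dedup could look like, assuming the documents carry a user-supplied `_id` and the last occurrence of an `_id` should win (the field names and batch shape below are made up for illustration):

```python
# Minimal sketch: collapse duplicate _ids within a batch before sending it to _bulk.
# Assumes the LAST occurrence of an _id should win; adjust if earlier versions matter.
def dedup_batch(docs):
    """docs: iterable of dicts, each with an '_id' key and a 'doc' payload."""
    latest = {}
    for d in docs:
        latest[d["_id"]] = d          # later duplicates overwrite earlier ones
    return list(latest.values())

batch = [
    {"_id": "a1", "doc": {"status": "new"}},
    {"_id": "a1", "doc": {"status": "updated"}},   # duplicate _id, only this one survives
    {"_id": "b2", "doc": {"status": "new"}},
]
print(dedup_batch(batch))  # 2 actions instead of 3
```

With duplicates collapsed this way, each bulk request carries only one action per `_id`, so Elasticsearch never has to apply several successive updates to the same document within a single request.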
I'm not sure what else to say.
It wouldn't actually be too hard to write a little benchmark tool to see the effect of higher values of X against different bulk sizes. Maybe I'm overestimating it; others can weigh in if they think so.
If you look at the guide for optimizing for indexing throughput, you are going against quite a few of the recommendations:
- You are using external document IDs, which adds overhead
- You are limiting the number of concurrent bulk requests you send, which prevents you from reaching the maximum indexing capacity of the cluster (see the sketch after this list)
- You are using nested mappings, which make indexing and updates more expensive
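As an illustration of the first two points, a minimal sketch using the Python client's bulk helpers; everything here (host, index name, document shape, `thread_count`, `chunk_size`) is a placeholder rather than a recommendation, and the poster's own client may look quite different:

```python
# Sketch: omit "_id" so Elasticsearch auto-generates IDs (the append-only fast path),
# and use parallel_bulk to keep several bulk requests in flight at once.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")

def actions():
    for i in range(100_000):
        # no "_id" field -> Elasticsearch assigns one, avoiding the update path
        yield {"_index": "my-index", "_source": {"message": f"event {i}"}}

for ok, item in parallel_bulk(es, actions(), thread_count=4, chunk_size=5000):
    if not ok:
        print("bulk item failed:", item)
```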
Frequent updates to documents are one of the main causes of poor performance, but the advice to avoid them is unfortunately missing from the documentation. I will see if we can get it added.
I suspect a lot of, if not most, data stores would exhibit poor performance if you went against a lot of their best practices. It is clear that Elasticsearch may not be optimised for your use case, or at least for how you are currently using it, but I hope some of the workarounds provided will help improve the situation.
I'm struggling with the idea of bulk indexing in which a lot of the requested indexing is already KNOWN to be pointless. If some client code is doing that indexing, improve that code; it doesn't seem that hard to me.
If it's a 3rd party tool, then look to insert a bit of sense in between that tool and ES.
I don't think that's a particularly helpful thing to say @RainTown. It's definitely a good idea to move this update work out of ES, but unless you know a lot more about the OP's environment than they've shared so far you cannot possibly know how hard or easy that might be for them.
Fair observation.
Agreed that I simply don't & can't know how much work anything is within an unknown environment. But that's why I used "seems"; I didn't state it as a fact. And as explained in the thread, there's been a fair bit of engineering effort put into the flows already.
Whether it's helpful or not to encourage @ronwilson that it's worth at least exploring? Others can take their own view on that.
But if I implied it was trivial, then that is neither correct nor my intention, and I apologise.
Mmm. I wrote a while ago:
It wouldn't actually be too hard to write a little benchmark tool to see the effect of higher values of X against different bulk sizes. Maybe I'm overestimating it; others can weigh in if they think so.
Reminder: X = the number of documents per bulk that are updating the same _id.
Well, I wrote a simple python script to do that, at least approximately. And on my Mac, it makes very little difference to the _bulk ingest speed whether my documents contain a lot of "duplicate" _ids or all docs have a (user-supplied) unique _id.
Now, my docs are simple: just 3 fields, a timestamp and a couple of keyword fields.
Only if I set a low value for the chunk_size in my helpers.streaming_bulk call (low meaning 1000) does bulk ingest really suffer, and it does so independently of X.
I get ~42k-45k docs/second ingesting 500k docs (incl. any repeats) in chunks of anything bigger than around 5k. I get close to 50k docs/s with the same docs, aside from the _id, if I let ES generate the _id.
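For reference, a stripped-down sketch of roughly how such a test could be put together; this is not the actual script, and the host, index name, field names, duplicate fraction and chunk size are all illustrative:

```python
# Rough sketch of a duplicate-_id bulk benchmark; NOT the original script.
# Host, index name, field names, N, DUP_FRACTION and CHUNK are illustrative.
import random
import time

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")
N, DUP_FRACTION, CHUNK = 500_000, 0.10, 5000   # DUP_FRACTION plays the role of X

def actions():
    seen = []
    for i in range(N):
        if seen and random.random() < DUP_FRACTION:
            doc_id = random.choice(seen)       # reuse an earlier _id -> an update
        else:
            doc_id = f"doc-{i}"
            seen.append(doc_id)
        yield {
            "_index": "bulk-bench",
            "_id": doc_id,
            "_source": {"@timestamp": int(time.time() * 1000),
                        "k1": "alpha", "k2": "beta"},
        }

start = time.time()
sent = sum(1 for ok, _ in streaming_bulk(es, actions(), chunk_size=CHUNK))
elapsed = time.time() - start
print(f"{sent} actions in {elapsed:.1f}s ({sent / elapsed:,.0f} docs/s)")
```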
My results:
| X  | chunk | #docs  | time (s) | docs/s |
|---:|------:|-------:|---------:|-------:|
| 0  | 1000  | 500000 | 16.6     | 29982  |
| 0  | 5000  | 500000 | 11.4     | 43569  |
| 0  | 10000 | 500000 | 11.8     | 42297  |
| 0  | 25000 | 500000 | 11.4     | 43720  |
| 0  | 50000 | 500000 | 11.1     | 44878  |
| 10 | 1000  | 450130 | 15.6     | 32050  |
| 10 | 5000  | 450130 | 11.3     | 43904  |
| 10 | 10000 | 450130 | 11.2     | 44335  |
| 10 | 25000 | 450130 | 11.6     | 42843  |
| 10 | 50000 | 450130 | 12.2     | 40913  |
| 25 | 1000  | 374988 | 17.1     | 29079  |
| 25 | 5000  | 374988 | 11.7     | 42646  |
| 25 | 10000 | 374988 | 11.0     | 45113  |
| 25 | 25000 | 374988 | 11.5     | 43268  |
| 25 | 50000 | 374988 | 11.5     | 43153  |
| 50 | 1000  | 250147 | 16.9     | 29497  |
| 50 | 5000  | 250147 | 11.0     | 45168  |
| 50 | 10000 | 250147 | 10.6     | 46951  |
| 50 | 25000 | 250147 | 11.1     | 44714  |
| 50 | 50000 | 250147 | 11.2     | 44576  |
Not my case. The sample was given at the start of the topic. Is this running on localhost? It's not a real case; with near-empty documents like {} it of course looks very good ))
I assume you are sending bulk requests in a single thread without any parallelism?
I do, however, unfortunately not think this simple test is necessarily very relevant. You have run it on different hardware (which could be a factor) and with completely different data. He said that he estimated his average event size to be around 3kB or so, and we do not know for sure how accurate this is. His data also likely has more complex mappings than in your example, as a lot of the fields are text instead of keyword (I wonder if that may be a mistake), and the documents also contain 10 potentially large nested fields. In my experience indexing speed tends to go down as the document size grows, and the use of nested fields is likely to add some overhead when creating segments and updating documents.
I would not be surprised if the effect of duplicate IDs within a batch is larger under these conditions than in your test. How large remains to be seen. I do not rule out that there are also other factors that may affect performance.
I already wrote about this: we cannot send bulks in parallel, and I also wrote why.
I know. That was directed to Kevin as I wanted to verify that his test did not use parallelism.
No parallelism, just a 30-line python script written while I was watching some football, because I was curious. I'd share it, but it would simply distract us as I'm not a very good python programmer. Anyways:
- I never claimed it was a representative test of the problem reported by @ronwilson
- Obviously I simply do not have access to test with real data on the real system
- I used my Mac, standalone, single-node, which is obviously completely different hardware, and my script, and reported my results. Interpretation (the harder bit) I left to the reader, including how relevant it is.
- One can even debate how to count docs/s. If I index 500k docs in 10 seconds, and half of them are dup _ids, I end up with 250k docs in my index. Did I index at 25k docs/s, or 50k docs/s? I used the latter. But 250k unique-_id docs will still index a lot quicker than 500k with dups.
I wrote that the dup _id was likely an issue, and was slightly scolded for suggesting it would be a "simple fix". Well, whether simple or not, would it even be a fix? I thought yes.
I (and I mean me) thought the mere presence of a lot of duplicate _ids in the same chunk would make a significant difference, even for simple documents. Well, that seems just wrong; it doesn't, at least not in a very simple test case. It may still be a factor, and I can certainly see that as documents get more complex the impact would increase. If there's a simple python library to produce increasingly complex json documents, I would try it.
@Christian_Dahlqvist had already summarized a few ways the current solution is going against best practice for achieving indexing speed. Addressing a combination of those issues will likely help, but a lesson to learn (again) is that one must validate, not just theorize.
In the end @ronwilson has to decide how to proceed with his own setup.
tbh I was also a little bit surprised I got close to 50k (simple) docs/s on my ca. 4-year-old Mac Mini on a single thread, though I'd simply never measured it before.
There's an old math joke:
A mathematician, a physicist, and an astronomer are riding a train through Scotland. The astronomer looks out the window, sees a black sheep in the middle of a field, and exclaims, "How Interesting! All Scottish sheep are black!!"
The physicist looks out the window and corrects the astronomer, "No, No! Some Scottish sheep are black".
"No," says the mathematician, "All we know is that there is at least one sheep in Scotland, at least one side of which is black".
[ I am a mathematician AND Scottish ]
What setup? Parallel bulks? It's not a surprise. Please, read more about the environment and the actual situation.
OK, that's a bit rude.
Good luck going forward.
@ronwilson I am not sure I understand what you mean. Could you please clarify?
@RainTown ran a simple single threaded test on a local machine that is likely a lot slower than your server and tested with varying levels of duplicate IDs. He found that the duplication in his test did not affect throughput a lot. I pointed out that it is not representative as your data is very different as events are larger, more complex and use nested fields, which adds overhead. I did point this out as I have in the past seen users assume that they should be able to reach specific numbers without realizing that their conditions are completely different.
I think this is quite clear, but it seems it is you who have not bothered to read through what was written. Making rude comments like this when people are volunteering to help is a great way to get people to avoid answering your questions in the future, and maybe even mute you. At least that is what I am going to do.
No solution, just a lot of samples without any reading of the original problem, and I am the rude one?)
We have read and understood your problem and also provided several suggestions on how to further troubleshoot it to narrow down the impact of your different design decisions. We also provided some ideas for how to potentially improve performance.
The fact is, however, that you have designed your software to work against several of the best practices for optimal indexing performance. Given this, I do not necessarily think there is any magic solution to get greatly improved indexing throughput unless you are willing to change your design.
Yes, I would say so.
You want a solution? Maybe try to find a Solutions Architect?
Seriously, you can get in contact with Elastic and get someone from Professional Services to advise. They are very good. You can be rude to them too, but they'll be getting paid.
Ok, I got it, no solution. I was rude, ok-ok)
Finally, I found the reason and got a speed that was more or less satisfactory (up to 60,000 RPS). I'll write it here for other people who come across this wonderful system: doing parallel bulk requests on dotnet Tasks or Threads under Linux is a bad idea. I got the expected speed by rewriting the program to run in multi-process (fork) mode, with the parallel bulk requests created inside each process. I note that there were no locks in the old version, i.e. talking about locks within one process is pointless. So, I have written up the solution to help other people who encounter the same problem, without turning the conversation into strange assumptions about the participants in the discussion.
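For anyone who wants to try the same multi-process idea from Python rather than .NET, a rough analogue (not the poster's program; the host, index name, worker count and document shape are placeholders) might look like this:

```python
# Rough Python analogue of the multi-process approach described above:
# each worker is a separate OS process (fork on Linux) running its own
# bulk indexing loop against the cluster. All names/values are illustrative.
from multiprocessing import Process

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

WORKERS = 4
DOCS_PER_WORKER = 250_000

def worker(worker_id: int) -> None:
    es = Elasticsearch("http://localhost:9200")   # one client per process
    def actions():
        for i in range(DOCS_PER_WORKER):
            yield {"_index": "events",
                   "_id": f"w{worker_id}-{i}",
                   "_source": {"message": f"event {i} from worker {worker_id}"}}
    for ok, item in streaming_bulk(es, actions(), chunk_size=5000):
        if not ok:
            print("failed:", item)

if __name__ == "__main__":
    procs = [Process(target=worker, args=(w,)) for w in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

On Linux, multiprocessing starts workers with fork by default, so each process gets its own client and its own bulk loop, which mirrors the approach described above.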