It's not 1% if you look closely, it's much more. Of course, I understand that ES isn't a DB, but why is it so slow ((
A lot more than 1%? Jeez.
So, if you do a simple dedup outside of ES, it'll likely be much faster at indexing. But you're still limited by how many of your index commands are actually updates to existing documents.
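To make that idea concrete, here is a minimal sketch of what an in-client dedup could look like, assuming the documents carry a user-supplied `_id` and the last occurrence of an `_id` should win (the field names and batch shape below are made up for illustration):

```python
# Minimal sketch: collapse duplicate _ids within a batch before sending it to _bulk.
# Assumes the LAST occurrence of an _id should win; adjust if earlier versions matter.
def dedup_batch(docs):
    """docs: iterable of dicts, each with an '_id' key and a 'doc' payload."""
    latest = {}
    for d in docs:
        latest[d["_id"]] = d          # later duplicates overwrite earlier ones
    return list(latest.values())

batch = [
    {"_id": "a1", "doc": {"status": "new"}},
    {"_id": "a1", "doc": {"status": "updated"}},   # duplicate _id, only this one survives
    {"_id": "b2", "doc": {"status": "new"}},
]
print(dedup_batch(batch))  # 2 actions instead of 3
```

With duplicates collapsed this way, each bulk request carries only one action per `_id`, so Elasticsearch never has to apply several successive updates to the same document within a single request.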
I'm not sure what else to say.
It wouldn't actually be too hard to write a little benchmark tool to see the effect of higher values of X against different bulk sizes. Maybe I'm overestimating it; others can weigh in if they think so.
If you look at the guide for optimizing for indexing throughput, you are going against quite a few of the recommendations:
- You are using external document IDs, which adds overhead
- You are limiting the number of concurrent bulk requests you send, which prevents you from reaching the maximum indexing capacity of the cluster (see the sketch after this list)
- You are using nested mappings, which make indexing and updates more expensive
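As an illustration of the first two points, a minimal sketch using the Python client's bulk helpers; everything here (host, index name, document shape, `thread_count`, `chunk_size`) is a placeholder rather than a recommendation, and the poster's own client may look quite different:

```python
# Sketch: omit "_id" so Elasticsearch auto-generates IDs (the append-only fast path),
# and use parallel_bulk to keep several bulk requests in flight at once.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")

def actions():
    for i in range(100_000):
        # no "_id" field -> Elasticsearch assigns one, avoiding the update path
        yield {"_index": "my-index", "_source": {"message": f"event {i}"}}

for ok, item in parallel_bulk(es, actions(), thread_count=4, chunk_size=5000):
    if not ok:
        print("bulk item failed:", item)
```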
Frequent updates to documents are one of the main causes of poor performance, but the advice to avoid them is unfortunately missing from the documentation. I will see if we can get it added.
I suspect a lot of, if not most, data stores would exhibit poor performance if you went against a lot of their best practices. It is clear that Elasticsearch may not be optimised for your use case, or at least for how you are currently using it, but I hope some of the workarounds provided will help improve the situation.
I'm struggling with the idea of bulk indexing in which a lot of the requested indexing is already KNOWN to be pointless. If some client code is doing that indexing, improve that code; it doesn't seem that hard to me.
If it's a 3rd party tool, then look to insert a bit of sense in between that tool and ES.
I don't think that's a particularly helpful thing to say @RainTown. It's definitely a good idea to move this update work out of ES, but unless you know a lot more about the OP's environment than they've shared so far you cannot possibly know how hard or easy that might be for them.
Fair observation.
Agreed that I simply don't & can't know how much work anything is within an unknown environment. But that's why I used "seems"; I didn't state it as a fact. And as explained in the thread, there's been a fair bit of engineering effort put into the flows already.
Whether it's helpful or not to encourage @ronwilson that it's worth at least exploring? Others can take their own view on that.
But if I implied it was trivial, then that is neither correct nor my intention, and I apologise.
Mmm. I wrote a while ago:
It wouldn't actually be too hard to write a little benchmark tool to see the effect of higher values of X against different bulk sizes. Maybe I'm overestimating it; others can weigh in if they think so.
Reminder: X = the number of documents per bulk that are updating the same _id.
Well, I wrote a simple python script to do that, at least approximately. And on my Mac, it makes very little difference to the _bulk ingest speed whether my documents contain a lot of "duplicate" _ids or all docs have a (user-supplied) unique _id.
Now, my docs are simple: just 3 fields, a timestamp and a couple of keyword fields.
Only if I set a low value for the chunk_size in my helpers.streaming_bulk call (low meaning 1000) does bulk ingest really suffer, and it does so independently of X.
I get ~42k-45k docs/second ingesting 500k docs (incl. any repeats) in chunks of anything bigger than around 5k. I get close to 50k docs/s with the same docs, aside from the _id, if I let ES generate the _id.
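For reference, a stripped-down sketch of roughly how such a test could be put together; this is not the actual script, and the host, index name, field names, duplicate fraction and chunk size are all illustrative:

```python
# Rough sketch of a duplicate-_id bulk benchmark; NOT the original script.
# Host, index name, field names, N, DUP_FRACTION and CHUNK are illustrative.
import random
import time

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")
N, DUP_FRACTION, CHUNK = 500_000, 0.10, 5000   # DUP_FRACTION plays the role of X

def actions():
    seen = []
    for i in range(N):
        if seen and random.random() < DUP_FRACTION:
            doc_id = random.choice(seen)       # reuse an earlier _id -> an update
        else:
            doc_id = f"doc-{i}"
            seen.append(doc_id)
        yield {
            "_index": "bulk-bench",
            "_id": doc_id,
            "_source": {"@timestamp": int(time.time() * 1000),
                        "k1": "alpha", "k2": "beta"},
        }

start = time.time()
sent = sum(1 for ok, _ in streaming_bulk(es, actions(), chunk_size=CHUNK))
elapsed = time.time() - start
print(f"{sent} actions in {elapsed:.1f}s ({sent / elapsed:,.0f} docs/s)")
```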
My results:
| X  | chunk | #docs  | time (s) | docs/s |
|---:|------:|-------:|---------:|-------:|
| 0  | 1000  | 500000 | 16.6     | 29982  |
| 0  | 5000  | 500000 | 11.4     | 43569  |
| 0  | 10000 | 500000 | 11.8     | 42297  |
| 0  | 25000 | 500000 | 11.4     | 43720  |
| 0  | 50000 | 500000 | 11.1     | 44878  |
| 10 | 1000  | 450130 | 15.6     | 32050  |
| 10 | 5000  | 450130 | 11.3     | 43904  |
| 10 | 10000 | 450130 | 11.2     | 44335  |
| 10 | 25000 | 450130 | 11.6     | 42843  |
| 10 | 50000 | 450130 | 12.2     | 40913  |
| 25 | 1000  | 374988 | 17.1     | 29079  |
| 25 | 5000  | 374988 | 11.7     | 42646  |
| 25 | 10000 | 374988 | 11.0     | 45113  |
| 25 | 25000 | 374988 | 11.5     | 43268  |
| 25 | 50000 | 374988 | 11.5     | 43153  |
| 50 | 1000  | 250147 | 16.9     | 29497  |
| 50 | 5000  | 250147 | 11.0     | 45168  |
| 50 | 10000 | 250147 | 10.6     | 46951  |
| 50 | 25000 | 250147 | 11.1     | 44714  |
| 50 | 50000 | 250147 | 11.2     | 44576  |
Not my case. The sample was given at the start of the topic. Is this running on localhost? It's not a real case; with near-empty documents like {} it of course looks very good ))
I assume you are sending bulk requests in a single thread without any parallelism?
I do, however, unfortunately not think this simple test is necessarily very relevant. You have run it on different hardware (which could be a factor) and with completely different data. He said that he estimated his average event size to be around 3kB or so, and we do not know for sure how accurate this is. His data also likely has more complex mappings than in your example, as a lot of the fields are text instead of keyword (I wonder if that may be a mistake), and the documents also contain 10 potentially large nested fields. In my experience indexing speed tends to go down as the document size grows, and the use of nested fields is likely to add some overhead when creating segments and updating documents.
I would not be surprised if the effect of duplicate IDs within a batch is larger under these conditions than in your test. How large remains to be seen. I do not rule out that there are also other factors that may affect performance.
I already wrote about this: we cannot send bulks in parallel, and I also wrote why.
I know. That was directed to Kevin as I wanted to verify that his test did not use parallelism.
No parallelism, just a 30-line python script written while I was watching some football, because I was curious. I'd share it, but it would simply distract us as I'm not a very good python programmer. Anyways:
- I never claimed it was a representative test of the problem reported by @ronwilson
- Obviously I simply do not have access to test with real data on the real system
- I used my Mac, standalone, single-node, which is obviously completely different hardware, and my script, and reported my results. Interpretation (the harder bit) I left to the reader, including how relevant it is.
- One can even debate how to count docs/s. If I index 500k docs in 10 seconds, and half of them are dup _ids, I end up with 250k docs in my index. Did I index at 25k docs/s, or 50k docs/s? I used the latter. But 250k unique-_id docs will still index a lot quicker than 500k with dups.
I wrote that the dup _id was likely an issue, and was slightly scolded for suggesting it would be a "simple fix". Well, whether simple or not, would it even be a fix? I thought yes.
I (and I mean me) thought the mere presence of a lot of duplicate _ids in the same chunk would make a significant difference, even for simple documents. Well, that seems just wrong; it doesn't, at least not in a very simple test case. It may still be a factor, and I can certainly see that as documents get more complex the impact would increase. If there's a simple python library to produce increasingly complex json documents, I would try it.
@Christian_Dahlqvist had already summarized a few ways the current solution is going against best practice for achieving indexing speed. Addressing a combination of those issues will likely help, but a lesson to learn (again) is that one must validate, not just theorize.
In the end @ronwilson has to decide how to proceed with his own setup.
tbh I was also a little bit surprised I got close to 50k (simple) docs/s on my ca. 4-year-old Mac Mini on a single thread, though I'd simply never measured it before.
There's an old math joke:
A mathematician, a physicist, and an astronomer are riding a train through Scotland. The astronomer looks out the window, sees a black sheep in the middle of a field, and exclaims, "How Interesting! All Scottish sheep are black!!"
The physicist looks out the window and corrects the astronomer, "No, No! Some Scottish sheep are black".
"No," says the mathematician, "All we know is that there is at least one sheep in Scotland, at least one side of which is black".
[ I am a mathematician AND Scottish ]
What setup? Parallel bulks? It's not a surprise. Please, read more about the environment and the actual situation.
OK, that's a bit rude.
Good luck going forward.
@ronwilson I am not sure I understand what you mean. Could you please clarify?
@RainTown ran a simple single threaded test on a local machine that is likely a lot slower than your server and tested with varying levels of duplicate IDs. He found that the duplication in his test did not affect throughput a lot. I pointed out that it is not representative as your data is very different as events are larger, more complex and use nested fields, which adds overhead. I did point this out as I have in the past seen users assume that they should be able to reach specific numbers without realizing that their conditions are completely different.
I think this is quite clear, but it seems it is you who have not bothered to read through what was written. Making rude comments like this when people are volunteering to help is a great way to get people to avoid answering your questions in the future, and maybe even mute you. At least that is what I am going to do.
No solution, just a lot of samples without any reading of the original problem, and I am the rude one?)
We have read and understood your problem and also provided several suggestions on how to further troubleshoot it to narrow down the impact of your different design decisions. We also provided some ideas for how to potentially improve performance.
The fact is, however, that you have designed your software to work against several of the best practices for optimal indexing performance. Given this, I do not necessarily think there is any magic solution to get greatly improved indexing throughput unless you are willing to change your design.
Yes, I would say so.
You want a solution? Maybe try to find a Solutions Architect?
Seriously, you can get in contact with Elastic and get someone from Professional Services to advise. They are very good. You can be rude to them too, but they'll be getting paid.
Ok, I got it, no solution. I was rude, ok-ok)
Finally, I found the reason and got a speed that was more or less satisfactory (up to 60,000 RPS). I'll write it here for other people who come across this wonderful system: doing parallel bulk requests on dotnet Tasks or Threads under Linux is a bad idea. I got the expected speed by rewriting the program to run in multi-process (fork) mode, with the parallel bulk requests created inside each process. I note that there were no locks in the old version, i.e. talking about locks within one process is pointless. So, I have written up the solution to help other people who encounter the same problem, without turning the conversation into strange assumptions about the participants in the discussion.
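For anyone who wants to try the same multi-process idea from Python rather than .NET, a rough analogue (not the poster's program; the host, index name, worker count and document shape are placeholders) might look like this:

```python
# Rough Python analogue of the multi-process approach described above:
# each worker is a separate OS process (fork on Linux) running its own
# bulk indexing loop against the cluster. All names/values are illustrative.
from multiprocessing import Process

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

WORKERS = 4
DOCS_PER_WORKER = 250_000

def worker(worker_id: int) -> None:
    es = Elasticsearch("http://localhost:9200")   # one client per process
    def actions():
        for i in range(DOCS_PER_WORKER):
            yield {"_index": "events",
                   "_id": f"w{worker_id}-{i}",
                   "_source": {"message": f"event {i} from worker {worker_id}"}}
    for ok, item in streaming_bulk(es, actions(), chunk_size=5000):
        if not ok:
            print("failed:", item)

if __name__ == "__main__":
    procs = [Process(target=worker, args=(w,)) for w in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

On Linux, multiprocessing starts workers with fork by default, so each process gets its own client and its own bulk loop, which mirrors the approach described above.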