Why do we have so many bulk indexing tasks?

We ingest data into our Elasticsearch cluster solely through 16 Logstash nodes, each configured with the following settings:

pipeline.workers: 48
pipeline.batch.size: 4000
pipeline.batch.delay: 1000

These hosts have only 8 CPUs each but, as per the advice here, we found that we needed to scale the workers up to this level in order to saturate the CPU.

Now, when I run:

GET _cat/tasks?detailed&v&s=running_time:desc

I see that we have over 10,000 bulk indexing tasks. 3% of these have been running for over 10 seconds, 15% for over 5 seconds, and 60% for over 1 second. Firstly, these seem to be taking quite a long time. Is there anything I can do to troubleshoot what is causing this?

Secondly, as far as I understand from this output, we have around 600 "parent" bulk tasks (tasks with no parent_task_id). Most of these have requests[4000], or something close to that, in the description, but some of them have far fewer requests, for example 4, 40 or 1328. Any idea why some of these are so low?

Thirdly, why do we have so many child bulk requests (tasks with a parent_task_id)? I can see that some of our parent bulk tasks have indices[<around 30 different indices here>] in their description, and some of those indices have up to 40 shards. Is it simply that each bulk request hits a bunch of indices, and therefore a bunch of shards, and that this causes the multiplicative effect? What can I do to alleviate this?

Let me know if you need more info here.


Yes, that multiplicative effect is the cause. If you send a bulk request containing documents that are routed to many different shards, the task count will be high.
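To make the multiplication concrete, here is a rough Python sketch. It assumes one shard-level child task per distinct (index, shard) pair that a bulk request touches, and it uses a stand-in hash for routing rather than Elasticsearch's real routing function. The index names, document IDs and shard counts are hypothetical.

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    # Stand-in for routing: Elasticsearch applies its own hash to the routing
    # value, so this only illustrates the spread, not the real assignment.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def shards_touched(docs, primary_shards):
    """Return the set of (index, shard) pairs that one bulk request would hit."""
    return {
        (index, shard_for(doc_id, primary_shards[index]))
        for index, doc_id in docs
    }


# Hypothetical bulk request: 4000 documents spread over 30 indices,
# each index having 40 primary shards.
primary_shards = {f"logs-{i}": 40 for i in range(30)}
docs = [(f"logs-{n % 30}", f"doc-{n}") for n in range(4000)]

touched = shards_touched(docs, primary_shards)
print(f"{len(docs)} documents touch {len(touched)} distinct shards")
```

Under these assumptions a single request can fan out to hundreds of shards, and any replica copies would add further tasks on top. Sending the same documents to fewer indices, or to indices with fewer primaries, shrinks the set directly.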

What you can do about it depends on the data, how you index it and whether your data is immutable or not. If your bulk requests can contain data going to many different indices, it may be worthwhile splitting these indexing flows up so that bulk requests are more targeted. You may also be able to reduce the number of primary shards, but that depends heavily on the use case.

I'd look for evidence of bottlenecks in system-level metrics. That said, these numbers are not particularly worrying: if your goal is to saturate CPU (i.e. to maximise throughput) then a little extra latency is to be expected.
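If it helps to track these numbers over time, the following is a minimal Python sketch that buckets bulk tasks by running time. It assumes you have already fetched the JSON body of `GET _tasks?actions=*bulk*&detailed` and loaded it into a dict; the function name and the sample data are made up for illustration.

```python
def summarise_bulk_tasks(tasks_response, thresholds_s=(1, 5, 10)):
    """Count bulk tasks, split parents from children, and bucket by running time."""
    tasks = [
        task
        for node in tasks_response.get("nodes", {}).values()
        for task in node.get("tasks", {}).values()
        if "bulk" in task.get("action", "")
    ]
    # A task without a parent_task_id is a top-level ("parent") bulk request.
    parents = [t for t in tasks if "parent_task_id" not in t]
    summary = {
        "total": len(tasks),
        "parents": len(parents),
        "children": len(tasks) - len(parents),
    }
    for limit in thresholds_s:
        summary[f"over_{limit}s"] = sum(
            1
            for t in tasks
            if t.get("running_time_in_nanos", 0) > limit * 1_000_000_000
        )
    return summary


# Made-up sample in the shape of a _tasks response grouped by node.
sample = {
    "nodes": {
        "node-a": {
            "tasks": {
                "node-a:1": {
                    "action": "indices:data/write/bulk",
                    "running_time_in_nanos": 12_000_000_000,
                },
                "node-a:2": {
                    "action": "indices:data/write/bulk[s]",
                    "parent_task_id": "node-a:1",
                    "running_time_in_nanos": 6_000_000_000,
                },
                "node-a:3": {
                    "action": "indices:data/write/bulk[s][p]",
                    "parent_task_id": "node-a:2",
                    "running_time_in_nanos": 500_000_000,
                },
                "node-a:4": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 20_000_000_000,
                },
            }
        }
    }
}

print(summarise_bulk_tasks(sample))
```

Comparing these counts against CPU, disk and network metrics from the same period should show whether the long-running tasks line up with a saturated resource.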

These are whatever requests you're sending to ES, so you'd need to look upstream (i.e. to Logstash) if you want to know why some of its requests are smaller than others. I'd suggest a separate post in the Logstash forum if you need more help with tracking it down, but one possibility is that Logstash hit the pipeline.batch.delay timeout for these requests before it had enough work to send a full-sized batch of 4000. This isn't necessarily a problem, though.

Alleviate what exactly? That suggests you're seeing a problem here, but there's nothing wrong with having 10k+ tasks related to bulk indexing in a busy cluster.
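As a back-of-the-envelope check, the figures quoted in the question are consistent with each other. The sketch below assumes each Logstash pipeline worker has at most one bulk request in flight at a time, which is an assumption about this setup rather than something confirmed in the thread.

```python
# Figures taken from the question above.
logstash_nodes = 16
workers_per_node = 48
observed_parent_tasks = 600
observed_total_tasks = 10_000

# Assumption: one in-flight bulk request per pipeline worker.
max_in_flight = logstash_nodes * workers_per_node

# Average number of tasks (parent plus children) per parent bulk request.
tasks_per_parent = observed_total_tasks / observed_parent_tasks

print(f"Upper bound on concurrent bulk requests: {max_in_flight}")
print(f"Average tasks per parent request: {tasks_per_parent:.1f}")
```

So roughly 600 parent tasks is below the ceiling of 768 concurrent requests, and an average of about 17 tasks per request is modest for requests that can touch 30 indices.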

Sorry, I missed these responses. Thanks for the advice. We have already reduced the number of shards that we're indexing into, and that seemed to help. And, yes, we increased pipeline.batch.delay, and that helped with the request sizes too.