Regarding Bulk Indexing Requests

Hi,
Is there a way I can view the size (number of documents) of a bulk index request once it hits the Elasticsearch queue? I want to estimate how efficiently Logstash is batching the requests it pushes to Elasticsearch.

Thx
D

No, I do not think there is. Logstash works with a specific max pipeline batch size (defaults to 125), and as far as I know this is the largest bulk size that will be sent. If you have multiple Elasticsearch output plugins configured, controlled by conditionals, the bulk sizes can by default end up much smaller, which can indeed be inefficient.
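
To make the conditional-output point concrete, here is a minimal sketch of the pattern that fragments bulks (hosts, field names and index names are purely illustrative):

```
# Anti-pattern sketch: each conditional branch has its own elasticsearch
# output, so a single 125-event pipeline batch is split across the branches
# and each output ends up sending a much smaller bulk request.
output {
  if [type] == "apache" {
    elasticsearch { hosts => ["http://localhost:9200"] index => "apache-%{+YYYY.MM.dd}" }
  } else if [type] == "syslog" {
    elasticsearch { hosts => ["http://localhost:9200"] index => "syslog-%{+YYYY.MM.dd}" }
  } else {
    elasticsearch { hosts => ["http://localhost:9200"] index => "misc-%{+YYYY.MM.dd}" }
  }
}
```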

So how can we measure our pipeline efficiency? Sending lots of small, salami-sized bulks to Elasticsearch can make it look like Elasticsearch is the bottleneck when it may not be.

I would recommend the following:

  • Try to have reasonably focused pipelines with a single Elasticsearch output plugin each. Use multiple defined pipelines together with pipeline-to-pipeline communication to control the flow. Avoid having a large number of different Elasticsearch outputs controlled by conditionals. If you think you need the Elasticsearch output to send data to multiple indices, store the index name in the metadata and use this in a single Elasticsearch output (see the sketch after this list).
  • Increase the batch size to something larger, e.g. 1000 or 2000 (see the pipelines.yml example after this list). If you have focused pipelines there is rarely a need to go much higher.
  • Use Logstash monitoring metrics to see how much data is processed by different plugins.
  • Have a look at the Logstash performance tuning and monitoring guide if you have not already.
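
For the first point, a minimal sketch of the single-output pattern (field and index names are illustrative, not a recommendation for any specific setup):

```
# Route events by setting the target index in @metadata early on, then let a
# single elasticsearch output build full-sized bulks for all events.
filter {
  if [type] == "apache" {
    mutate { add_field => { "[@metadata][target_index]" => "apache-%{+YYYY.MM.dd}" } }
  } else {
    mutate { add_field => { "[@metadata][target_index]" => "misc-%{+YYYY.MM.dd}" } }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "%{[@metadata][target_index]}"
  }
}
```

For the second point, batch size and worker count are set per pipeline in pipelines.yml; the values below are examples only:

```
# pipelines.yml sketch -- pipeline id, path and values are illustrative
- pipeline.id: beats-ingest
  path.config: "/etc/logstash/conf.d/beats-ingest.conf"
  pipeline.workers: 4        # defaults to the number of CPU cores
  pipeline.batch.size: 1000  # default is 125
  pipeline.batch.delay: 50   # ms to wait for a batch to fill, default 50
```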

Am I correct in thinking that reading messages from message queues will saturate the Logstash pipeline more effectively than having large numbers of Beats, or any other agent types, fire their output directly at the input port? And that, assuming a sane config, it will therefore size bulk writes more efficiently?

Reading messages from a large message queue is likely to result in full batches of events being processed. I am not sure how the input plugins receiving data from Beats map this to batches, so I am not sure whether it would have any impact. That said, I do not think it would hurt performance.

I say this because I've seen much better scaling on the backend when reading from queues (pull) vs receiving directly from agents (push). But my thinking is being disputed internally because there aren't metrics which directly corroborate my stance.

I have not run any tests or benchmarks on it so do not know for sure. I do suspect using a message queue would help in a number of ways if you have a very large number of Beats feeding data, e.g. more evenly distributed load and added resiliency. If the pipelines that are receiving data are inefficient (as per the initial notes), adding a message queue is unlikely to improve performance much by itself.

Our pipelines are set up in the manner you recommend.

Have you overridden the default batch size?

Yes. We also run extra workers. Should we consider increasing the batch delay?

Not sure that will help much. Let's take a step back first.

How have you determined that Elasticsearch is not the limiting factor? What indexing throughput are you seeing? How many indices and shards are you actively indexing into?

What is the size and specification of your cluster? What kind of storage are you using?

Elasticsearch isn't the limiting factor because we've run tests where, for a given fixed topology (Elasticsearch/Logstash), we see higher sustained throughput with no loss when reading from queues. We see loss when pushing, and the ingestion rate is more erratic. Logstash isn't seeing 429s from the backend.
The shard layout is set at one primary and one replica per node.
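
(For anyone following along, one way to cross-check that on the Elasticsearch side is the cat thread pool API; the host below is a placeholder.)

```
# A growing "rejected" count on the write thread pool is what would surface
# as 429s in the Logstash logs.
curl -s "http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected"
```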

OK. Good to see that you have taken a systematic approach. Sounds like a reasonable conclusion.

What do your pipelines look like? Have you looked at the pipeline performance statistics to see if you have any slow/expensive filters configured?

That is a next step to take.

How can we determine, aside from cpu/heap utilisation, how/why logstash is choking on the frontend?

How many indices are you actively indexing into? Is it just one?

With respect to tuning Logstash I would look at the performance metrics and try to identify inefficient filters. I have on numerous occasions seen, e.g., inefficient grok filters slow down throughput a lot. There are also filters that rely on callouts that can severely limit throughput.
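
If it helps, the per-plugin numbers are exposed through the Logstash monitoring API (default port 9600); the per-pipeline filter statistics in the response are what I would look at first:

```
# Per-pipeline, per-plugin event counts and cumulative processing time.
# Filters whose events.duration_in_millis is large relative to events.out
# are the ones worth looking at first.
curl -s "http://localhost:9600/_node/stats/pipelines?pretty"
```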

Yes, just a single index. We don't actually do any heavy transforms on our data. We try to keep it relatively simple - just index the data as smoothly as possible and leave data formatting to the upstream.