How does logstash use pipeline.batch.size to execute pipeline?

rickfish · January 9, 2023, 9:28pm

I am looking for documentation on how Logstash processes a pipeline when pipeline.batch.size is more than 1.

This is my assumption:

The input plugin generates several events, not dictated by batch size
While there are events left, Logstash takes a batch at a time, and for each batch, gets events and calls filter() for each
if the output plugin implements multi_receive, then the output plugin is called once with the whole filtered batch.
If the output plugin implements receive(), then the output plugin is called for each filtered batch event, one at a time.

Is this correct? Is there documentation that discusses this?

Badger · January 9, 2023, 9:55pm

That matches my understanding. I do not think it is documented anywhere.

I believe an entire batch of events moves from plugin to plugin within the filter section. In all the years I have used logstash I have only been bitten by that once.

rickfish · January 9, 2023, 10:02pm

Thanks @Badger, I didn't even think about the possibility that the batch would be passed to each plugin within the filter section. I was envisioning each event going through all of the filter plugins before the next event would be processed. I appreciate the clarification.

Rios · January 10, 2023, 12:09am

There is some documentation.

Each input stage in the Logstash pipeline runs in its own thread. Inputs write events to a central queue that is either in memory (default) or on disk. Each pipeline worker thread takes a batch of events off this queue, runs the batch of events through the configured filters, and then runs the filtered events through any outputs.

If I am correct, 125 events (default) will be push to the filter, processed all 125 as a transaction and push to output as bulk. If there is 8 workers, there will be processed 1000 events at the same time inside filter+output.

The pipeline.batch.size setting defines the maximum number of events an individual worker thread collects before attempting to execute filters and outputs. Larger batch sizes are generally more efficient, but increase memory overhead. Some hardware configurations require you to increase JVM heap space in the jvm.options config file to avoid performance degradation. (See Logstash Configuration Files for more info.) Values in excess of the optimum range cause performance degradation due to frequent garbage collection or JVM crashes related to out-of-memory exceptions. Output plugins can process each batch as a logical unit. The Elasticsearch output, for example, issues bulk requests for each batch received. Tuning the pipeline.batch.size setting adjusts the size of bulk requests sent to Elasticsearch.

Here is an interesting pic. Queues are between input and filter, and inside filter there is the most likely a loop for processing, 125 received events.

Badger · January 10, 2023, 4:23am

Let me expand a little on the "push to output as bulk" bit.

The base output plugin is here. The code suggests that the plugin must define a receive method, but I suspect that that is not true. I think the receive_multi method of the plugin is called in preference to the receive method, so that exception never gets thrown if the plugin implements receive_multi_encoded or a receive_multi that actually processes the batch.

The base receive_multi will send the batch to receive_multi_encoded if it exists, otherwise it will loop over the batch calling receive for each event (and throwing that exception if the plugin does not implement it). Note that a plugin can override the base receive_multi if it wants access to the batch without encoding (and not that long ago with encoding was not even an option).

Some examples ... the http output defines multi_receive, because it wants to send the batch to elasticsearch in a single _bulk request if it can. The udp output just sends off each event in a UDP packet, so it defines receive. The s3 output defines receive_multi_encoded.

If I understand it correctly (which I may well not) then if the output plugin has a receive_multi_encoded method then the base output plugin will call the codec to encode each event before passing it to the output plugin. Looking at the PRs on github this appears to be a baby step towards the synchronization required for multi-threaded outputs. Maybe.

The old logstash 6.6 how-to on writing an output plugin no longer exists (which suggests that it is no longer accurate), but it can be found in the Wayback Machine here. That suggests that only multi_receive is required.

rickfish · January 10, 2023, 10:54am

Wow! Thanks to both of you so much for this great description. @Rios , would tou mind posting the URL where that doc is? I am having a tough time finding this type of info.

Rios · January 10, 2023, 1:36pm

How Logstash Works
Tuning and Profiling Logstash Performance
Logstash Persistent Queue

rickfish · January 10, 2023, 2:00pm

Thank you. Much appreciated.

system · February 7, 2023, 2:00pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch Output Plugin - Batching, How? Logstash	2	615	April 23, 2021
Commands Logstash	2	627	August 21, 2018
Logstash file input performance Logstash	5	608	July 15, 2021
I want to know what the flush_size and idle_flush_time has been changed to Logstash	5	3847	May 9, 2018
Collect multiple events for processing in the filter and send to output (bulk) Logstash	2	508	April 26, 2022

How does logstash use pipeline.batch.size to execute pipeline?

Related topics