That matches my understanding. I do not think it is documented anywhere.
I believe an entire batch of events moves from plugin to plugin within the filter section. In all the years I have used logstash I have only been bitten by that once.
Thanks @Badger, I didn't even think about the possibility that the batch would be passed to each plugin within the filter section. I was envisioning each event going through all of the filter plugins before the next event would be processed. I appreciate the clarification.
Each input stage in the Logstash pipeline runs in its own thread. Inputs write events to a central queue that is either in memory (default) or on disk. Each pipeline worker thread takes a batch of events off this queue, runs the batch of events through the configured filters, and then runs the filtered events through any outputs.
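To make that concrete, here is a rough sketch of what each worker thread does. This is illustrative Ruby pseudocode, not the actual logstash-core source, and names such as read_batch, multi_filter and multi_receive are only meant to be suggestive of the real internals.

```ruby
# Rough sketch of one pipeline worker thread, based on the description above.
# Illustrative pseudocode only, not logstash-core source; read_batch,
# multi_filter and multi_receive are assumptions about the internals.
# pipeline.workers of these threads run concurrently, each pulling up to
# pipeline.batch.size events from the central queue per iteration.
loop do
  batch = queue.read_batch(batch_size, batch_delay)   # e.g. up to 125 events

  # every filter sees the whole batch before the batch moves on
  filters.each { |f| batch = f.multi_filter(batch) }

  # the filtered batch is then handed to each output as a unit
  outputs.each { |o| o.multi_receive(batch) }
end
```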
If I am correct, 125 events (the default) will be pushed to the filters, all 125 will be processed as one transaction, and the batch will then be pushed to the output as a bulk operation. If there are 8 workers, up to 1000 events will be processed at the same time inside filter+output.
The pipeline.batch.size setting defines the maximum number of events an individual worker thread collects before attempting to execute filters and outputs. Larger batch sizes are generally more efficient, but increase memory overhead. Some hardware configurations require you to increase JVM heap space in the jvm.options config file to avoid performance degradation. (See Logstash Configuration Files for more info.) Values in excess of the optimum range cause performance degradation due to frequent garbage collection or JVM crashes related to out-of-memory exceptions. Output plugins can process each batch as a logical unit. The Elasticsearch output, for example, issues bulk requests for each batch received. Tuning the pipeline.batch.size setting adjusts the size of bulk requests sent to Elasticsearch.
Here is an interesting pic. Queues sit between input and filter, and inside the filter there is most likely a loop that processes the 125 received events.
Let me expand a little on the "push to output as bulk" bit.
The base output plugin is here. The code suggests that the plugin must define a receive method, but I suspect that is not true. I think the multi_receive method of the plugin is called in preference to the receive method, so that exception never gets thrown if the plugin implements multi_receive_encoded or a multi_receive that actually processes the batch.
The base multi_receive will send the batch to multi_receive_encoded if it exists; otherwise it will loop over the batch, calling receive for each event (and throwing that exception if the plugin does not implement it). Note that a plugin can override the base multi_receive if it wants access to the batch without encoding (and not that long ago the encoded variant was not even an option).
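In other words, something along these lines. This is just a paraphrase of the behaviour described above, not the actual LogStash::Outputs::Base code, and the way the codec is called is my assumption:

```ruby
# Paraphrase of the dispatch described above, not the actual
# LogStash::Outputs::Base source; how the codec is invoked here is
# an assumption for illustration.
def multi_receive(events)
  if respond_to?(:multi_receive_encoded)
    # pair each event with its codec-encoded form and hand the plugin the batch
    multi_receive_encoded(events.map { |e| [e, @codec.encode(e)] })
  else
    # one-at-a-time fallback; this is where the "must define receive"
    # exception comes from if the plugin implements neither method
    events.each { |event| receive(event) }
  end
end
```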
Some examples ... the elasticsearch output defines multi_receive, because it wants to send the batch to Elasticsearch in a single _bulk request if it can. The udp output just sends off each event in a UDP packet, so it defines receive. The s3 output defines multi_receive_encoded.
If I understand it correctly (which I may well not), then if the output plugin has a multi_receive_encoded method, the base output plugin will call the codec to encode each event before passing it to the output plugin. Looking at the PRs on GitHub, this appears to be a baby step towards the synchronization required for multi-threaded outputs. Maybe.
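So a plugin on the encoded path ends up with something shaped roughly like this (a hypothetical example loosely modelled on what the s3 output does; write_to_buffer is made up):

```ruby
# Hypothetical plugin method on the encoded path: by the time it is called,
# the codec has already run, so each entry is an [event, encoded] pair.
# write_to_buffer is a made-up helper, not part of any real plugin.
def multi_receive_encoded(events_and_encoded)
  events_and_encoded.each do |event, encoded|
    write_to_buffer(event, encoded)
  end
end
```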
The old logstash 6.6 how-to on writing an output plugin no longer exists (which suggests that it is no longer accurate), but it can be found in the Wayback Machine here. That suggests that only multi_receive is required.
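For what it is worth, the kind of skeleton that old how-to walks through boils down to roughly the following. The class name, config name and option here are invented, so treat it as a sketch rather than the doc's own example:

```ruby
# encoding: utf-8
require "logstash/outputs/base"

# Minimal skeleton along the lines the old how-to describes: only register
# and multi_receive are implemented. The class name, config name and the
# prefix option are made up for illustration.
class LogStash::Outputs::BatchLogger < LogStash::Outputs::Base
  config_name "batch_logger"

  config :prefix, :validate => :string, :default => ""

  def register
    # one-time setup (open files, connections, etc.) goes here
  end

  def multi_receive(events)
    # the whole batch arrives at once, so it can be handled as a unit
    events.each { |event| @logger.info("#{@prefix}#{event.to_json}") }
  end
end
```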
Wow! Thanks to both of you so much for this great description. @Rios, would you mind posting the URL where that doc is? I am having a tough time finding this type of info.