Solution:
A single Filebeat node can increase throughput by enabling load balancing and running multiple Logstash nodes.
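For reference, this kind of load balancing is configured in the Logstash output of filebeat.yml. The snippet below is a sketch only; the hostnames and worker count are illustrative placeholders:

```yaml
# filebeat.yml -- sketch of a load-balanced Logstash output;
# hostnames and counts are illustrative placeholders.
output.logstash:
  hosts: ["logstash1:5044", "logstash2:5044"]
  loadbalance: true   # distribute published batches across all listed hosts
  worker: 2           # publisher workers per configured host
```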
Refer to the upstream GitHub issue (a Filebeat enhancement request, opened 03 Apr 2019 UTC):
**Describe the enhancement:**
- This ER arose from a support ticket. The user opened a ticket to resolve intermittent problems where Filebeat processing simply seemed to stop for no identifiable reason. The configuration had Filebeat sending events to 2 Logstash processes, which were publishing to Kafka.
- What we eventually determined was that all Filebeat processing was being blocked by 1 of the Logstash processes, which was still processing earlier events sent by Filebeat. Logstash was in fact sending periodic message updates back to Filebeat, and these messages were being reported in the Filebeat log, although in a manner difficult for the user to understand.
- More significantly, we also found that a processing block in 1 Logstash process caused all of the Filebeat worker threads to block. This appears to have been true even though the 2nd Logstash process was available for processing.
As a result of this experience, the following enhancements to the Logstash "ACK" messaging to Beats have been requested:
(1) Where Beats is sending to multiple Logstash processes, modify the ACK protocol so that only the Beats worker threads connected to the blocking Logstash process are blocked, while the worker threads connected to non-blocked Logstash processes can continue working.
(2) Issue explicit messages in the Beats log file explaining that specific threads are blocked waiting on Logstash to complete processing of previously sent events.
**Developer Comments From Support Ticket**
> This [problem] was somewhat relaxed in Beats 6.x by making the publisher asynchronous and removing the spool as such. Still, Filebeat needs processing ACKs in order to keep the registry file in a sane state. That is, even Filebeat 6.x might suffer similar issues here.
> Filebeat MUST keep the order of events when writing the registry file. As Filebeat 6.x supports out-of-order batch publishing, all state updates need to be kept in memory in Filebeat.
> If a batch is never ACKed, then Filebeat would have to accumulate all state in memory, eventually going OOM. Buffering these updates is supported in the small, but at some point the queue of state updates is full. This is where even Filebeat 6.x would start to block.
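To make the ordering constraint described above concrete, here is a minimal Go sketch (not Filebeat's actual code; the names and buffer size are illustrative assumptions) of a bounded in-order ACK window: out-of-order ACKs are buffered in memory, the registry only advances past a contiguous ACKed prefix, and once the buffer fills the publisher must block.

```go
// Minimal sketch of a bounded in-order ACK window; not Filebeat's
// actual implementation. State for out-of-order ACKs is buffered in
// memory, and the buffer is bounded, so a batch that is never ACKed
// eventually stalls all publishing.
package main

import "fmt"

const maxPending = 4 // bounded buffer of un-persisted state updates

type window struct {
	nextToPersist int          // lowest batch ID not yet persisted
	pending       map[int]bool // ACKed batches waiting on earlier ones
}

// ack records an ACK; it returns false when the pending buffer is full,
// meaning the caller (the publisher) must block.
func (w *window) ack(batch int) bool {
	if len(w.pending) >= maxPending {
		return false // queue of state updates is full -> publisher blocks
	}
	w.pending[batch] = true
	// Persist registry state only for the contiguous ACKed prefix,
	// preserving the on-disk ordering of events.
	for w.pending[w.nextToPersist] {
		delete(w.pending, w.nextToPersist)
		fmt.Printf("registry advanced past batch %d\n", w.nextToPersist)
		w.nextToPersist++
	}
	return true
}

func main() {
	w := &window{pending: map[int]bool{}}
	// Batch 0 is stuck in a slow Logstash; later batches ACK out of order.
	for b := 1; ; b++ {
		if !w.ack(b) {
			fmt.Printf("batch %d: pending buffer full, publisher blocks\n", b)
			break
		}
	}
}
```

With batch 0 never ACKed, the pending buffer fills after a handful of later batches and the publisher blocks, which mirrors the stall seen in the support ticket.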
> One way to resolve it is:
> Add a timeout to output workers, duplicating batches to other idle workers once the timeout kicks in. Once a batch is ACKed by one output worker, the other outputs will be cancelled/reset, stopping processing of the current batch.
> The change in the LB algorithm creates a few events that must be logged:
> - resend timeout kicks in
> - one worker finally ACKing a duplicate batch
> - batch being cancelled
> - follow-up event: cancel success or ACK received (duplicate events)
> - The workers are already aware of the endpoint, so the actual worker+endpoint will be included in these log messages.
> This strategy guarantees progress and is often used to maximize throughput in a dynamically load-balanced system when some service sees a slowdown.
> The disadvantage is the potential for duplicates, but Filebeat already has at-least-once semantics, so this is no change in semantics at all. If an output misbehaves or an I/O error occurs, we have to send again.
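The timeout-and-duplicate strategy proposed above could look roughly like the following Go sketch (illustrative only, not the actual Beats implementation; worker names and timings are assumptions): the primary worker gets a resend timeout, the batch is duplicated to an idle worker when the timeout kicks in, and the first ACK cancels the outstanding duplicate.

```go
// Sketch of the proposed timeout-and-duplicate load-balancing strategy;
// not the actual Beats code. Worker names and delays are assumptions.
package main

import (
	"context"
	"fmt"
	"time"
)

// worker simulates a Logstash output; a long delay models a blocked endpoint.
func worker(ctx context.Context, name string, delay time.Duration, batch int, acks chan<- string) {
	select {
	case <-time.After(delay): // processing time for the batch
		acks <- name
	case <-ctx.Done(): // batch cancelled: another worker ACKed first
		fmt.Printf("%s: batch %d cancelled\n", name, batch)
	}
}

// publish sends the batch to the primary worker, then duplicates it to an
// idle worker once the resend timeout kicks in. The first ACK wins and the
// remaining duplicate is cancelled.
func publish(batch int, resendTimeout time.Duration) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	acks := make(chan string, 2)

	go worker(ctx, "logstash-1 (blocked)", 10*time.Second, batch, acks)

	select {
	case name := <-acks:
		fmt.Printf("batch %d ACKed by %s\n", batch, name)
		return
	case <-time.After(resendTimeout):
		fmt.Printf("batch %d: resend timeout, duplicating to idle worker\n", batch)
	}

	go worker(ctx, "logstash-2", 50*time.Millisecond, batch, acks)

	name := <-acks
	fmt.Printf("batch %d ACKed by %s\n", batch, name)
	cancel() // cancel the still-pending duplicate; progress is guaranteed
}

func main() {
	publish(1, 100*time.Millisecond)
	time.Sleep(200 * time.Millisecond) // let the cancellation message print
}
```

Because delivery is already at-least-once, the duplicate sent after the timeout changes nothing semantically; it only trades possible duplicate events for guaranteed progress.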
**Related Issues/Enhancement Requests**
https://github.com/elastic/apm-server/issues/1298
https://github.com/elastic/beats/issues/8080
https://github.com/elastic/beats/pull/7925