Thanks for the advice. A DLQ sounds like a good long-term solution and is exactly what I was after. I found there was an issue processing certain events; these are now dropped in Logstash, so I hope I'm okay going forward.
One side issue was that this also appeared to cause Filebeat to hang. Filebeat has load balancing enabled across 3 Logstash nodes; when this occurs, the other nodes stop receiving new events. As soon as I restart the one stuck Logstash node, the others start receiving events right away.
To help anyone else who may have a similar issue, the log message I was getting was:
[2020-06-18T18:01:58,884][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff {:code=>400, :url=>"https://server.name:9200/_bulk"}
This made it difficult to track down where the issue was. The following command helped a lot to get the per-pipeline stats: curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'
From the output I can find one of the pipelines logged this:
"bulk_requests": {
  "successes": 7,
  "responses": {
    "200": 7
  },
  "failures": 398
}
So, good news: I now know which pipeline it is occurring in. Within this pipeline's stats I also saw the following:
"events": {
"duration_in_millis": 3872,
"in": 4235,
"filtered": 2632,
"queue_push_duration_in_millis": 2,
"out": 2632
},
That is 1,603 events in flight (in minus out), and my max batch size is 1500 for this specific pipeline. (So perhaps that's what caused it to block?) Looking at my main pipeline I see this:
"main": {
"events": {
"duration_in_millis": 9031144,
"in": 404520,
"filtered": 340514,
"queue_push_duration_in_millis": 13119253,
"out": 340514
}
There appear to be about 64,006 events stuck in here (in minus out). My main pipeline has 4 workers and a batch size of 8000; Filebeat has 8 workers. I changed main's batch size to 6000 events, and this time it hangs at 48,000 events (8 Filebeat workers × 6000).
My feeling is that the main pipeline fills up to its maximum bulk size, which is Filebeat workers × the main pipeline's batch size. This causes Filebeat to hang: Logstash cannot accept any more events because one of its pipelines is stuck endlessly retrying a failing bulk request.
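To sanity-check that reasoning, here is a minimal Python sketch. The event counts are copied from the stats output above; the worker/batch limits are the settings I mentioned in prose (they are not read from the API, so treat them as assumptions):

```python
# Compute in-flight events per pipeline from Logstash node-stats numbers.
# Figures below are the ones pasted earlier in this post.
pipelines = {
    "stuck": {"events": {"in": 4235, "out": 2632}},
    "main": {"events": {"in": 404520, "out": 340514}},
}

# Rough ceiling per pipeline: pipeline.workers * pipeline.batch.size
# (assumed values, matching the settings described above)
limits = {
    "stuck": 1 * 1500,
    "main": 4 * 8000,
}

def in_flight(stats):
    """Events that entered the pipeline but never made it out."""
    return stats["events"]["in"] - stats["events"]["out"]

for name, stats in pipelines.items():
    backlog = in_flight(stats)
    flag = "  <-- at/over limit, likely blocking" if backlog >= limits[name] else ""
    print(f"{name}: {backlog} in flight (limit {limits[name]}){flag}")
```

Both pipelines come out at or over their rough ceiling, which fits the theory that everything upstream (including Filebeat) backs up behind the one stuck bulk request.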
I believe I can monitor this with curl -XGET 'localhost:9600/_node/stats/pipelines/main?pretty', but I'm not sure whether these stats get shipped to Metricbeat, where I could alert on them?
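On the Metricbeat question: Metricbeat ships a logstash module that polls this same stats API. A minimal config sketch, assuming the standard modules.d layout (metricset names and defaults may vary by version, so check the docs for yours):

```yaml
# modules.d/logstash.yml -- enable with: metricbeat modules enable logstash
- module: logstash
  metricsets: ["node", "node_stats"]
  period: 10s
  hosts: ["localhost:9600"]
```

With that shipping into Elasticsearch, an alert on the in-minus-out event gap per pipeline should catch this condition before Filebeat stalls.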