Logstash Hangs After Too Many 400s on Bulk Requests to ES

Intermittently I run into an issue where Logstash hangs. I think what is happening is that for some reason it starts logging "Encountered a retryable error. Will Retry with exponential backoff" and the code is 400.

My guess is that I am sending bad data to ES, but I'm having trouble narrowing it down. Is it possible Logstash is gathering up a large number of these events in its internal queue and cannot submit any of them, but keeps trying? And once the queue is full it cannot accept any new events?

I thought I saw documentation in the past on treating 400s as permanent failures? Is this perhaps something I can implement?

In my latest case with Logstash 7.5 the above error was printed about 820 times over roughly 14 hours before everything stopped.

Looking at the stalled stacks, most pipelines show a block in start workers. The main pipeline shows a block in push.

As the message says, the output will keep retrying the bulk request to Elasticsearch, so the thread hangs. You can use a DLQ to set such events aside.
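In case it helps, a rough sketch of how I would wire that up (paths and the pipeline id below are placeholders, adjust them to your install):

# logstash.yml - enable the dead letter queue; the elasticsearch output
# will write events rejected with a 400/404 mapping error to it
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 1024mb              # optional cap (this is the default)
path.dead_letter_queue: /var/lib/logstash/dlq    # optional, defaults under path.data

# dlq-review.conf - a separate pipeline to inspect what was set aside
input {
  dead_letter_queue {
    path => "/var/lib/logstash/dlq"   # must match path.dead_letter_queue
    pipeline_id => "main"             # the pipeline whose DLQ to read
    commit_offsets => true
  }
}
filter {
  # the rejection reason lives in @metadata, so copy it onto the event
  mutate {
    add_field => { "dlq_reason" => "%{[@metadata][dead_letter_queue][reason]}" }
  }
}
output {
  # dump the failed events to a file for review
  file {
    path => "/tmp/dlq-events.json"
    codec => json_lines
  }
}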


Thanks for the advice; the DLQ sounds like a good idea for the long term and is exactly what I was after. I have found there was an issue processing certain events; these are now dropped in Logstash, so I hope I'm okay going forward.
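For reference, the drop is just a conditional around a drop filter in that pipeline; something along these lines (the field name and value here are made up for illustration, not my actual rule):

filter {
  # hypothetical example: discard the events that were failing to index
  if [log_type] == "problem_source" {
    drop { }
  }
}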

One side issue was that this also appeared to cause Filebeat to hang. Filebeat has load balancing enabled across 3 different Logstash nodes, and when this occurs the other nodes stop getting new events. As soon as I restart this one Logstash node, the others get events right away.
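For context, the Filebeat output side looks roughly like this (host names are placeholders):

# filebeat.yml - load balancing across the three Logstash nodes
output.logstash:
  hosts: ["logstash-1:5044", "logstash-2:5044", "logstash-3:5044"]
  loadbalance: true
  worker: 8     # workers per host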

To help anyone else who may have a similar issue, the log message I was getting was:

[2020-06-18T18:01:58,884][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff  {:code=>400, :url=>"https://server.name:9200/_bulk"}

This made it difficult to track down where the issue was. The following command helped a lot to get the pipeline stats: curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'

From the output I can find one of the pipelines logged this:

"bulk_requests": {
    "successes": 7,
    "responses": {
        "200": 7
    },
    "failures": 398
}
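In case it helps anyone else, this one-liner pulls the bulk failure counts for every pipeline (a sketch assuming jq is installed and the 7.x stats layout):

# list bulk_requests failures per pipeline from the node stats API
curl -s 'localhost:9600/_node/stats/pipelines' |
  jq '.pipelines | to_entries[]
      | { pipeline: .key,
          bulk_failures: [ .value.plugins.outputs[]?
                           | select(.bulk_requests != null)
                           | .bulk_requests.failures ] }'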

So the good news is I know what pipeline it is occurring in. Within this pipeline's stats I also saw the following:

"events": {
"duration_in_millis": 3872,
"in": 4235,
"filtered": 2632,
"queue_push_duration_in_millis": 2,
"out": 2632
},

That is 1,690 events, and my max pipeline size is 1,500 for this specific pipeline. (So perhaps that's what caused it to block?) Looking at my main pipeline I see this:

"main": {
    "events": {
		"duration_in_millis": 9031144,
		"in": 404520,
		"filtered": 340514,
		"queue_push_duration_in_millis": 13119253,
		"out": 340514
    }

There appear to be about 64,000 events in flight here (404,520 in minus 340,514 out). My main pipeline has 4 workers and a batch size of 8,000. Filebeat has 8 workers. I changed main's batch size to 6,000, and this time it hangs at 48,000 events.

My feeling would be that the main pipeline fills up to its maximum bulk size, which is the number of Filebeat workers times the batch size of the main pipeline (8 × 8,000 = 64,000, which matches what I saw). This causes Filebeat to hang. Logstash cannot process any more events because one of the pipelines is hung, since it cannot complete its bulk request.

I believe I can monitor this with: curl -XGET 'localhost:9600/_node/stats/pipelines/main?pretty', but I'm not sure if this gets sent to Metricbeat, where I could be alerted on it?
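If the Metricbeat route works, my understanding is that its logstash module can pull these same node stats, roughly like this (host is a placeholder):

# metricbeat logstash module - ship Logstash node stats so they can be alerted on
- module: logstash
  metricsets: ["node", "node_stats"]
  period: 10s
  hosts: ["localhost:9600"]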
