Should A Full DLQ Affect Logstash Performance Significantly?

I'm sending logs from Filebeat -> Logstash -> ES with 7.10.2.

I have enabled DLQ and set it to 50 MB. Once something is in the DLQ I do not bother reprocessing it and I'm fine with the data loss.

Once I start seeing messages like "cannot write event to DLQ reached maxQueueSize of 52428800", my Logstash instances slow to a crawl and almost nothing gets processed.

If I look at the monitoring I see a very high failure rate for Filebeat. Is Filebeat perhaps sending these batches over and over again?

I increased the max size to 500 MB in logstash.yml and sent the SIGHUP signal, but maybe these settings do not reload. It seems like when I get into these situations my only recourse is to manually delete the files from the OS and reload the service.
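For reference, this is roughly what the DLQ section of my logstash.yml looks like (the values and the commented-out path are just illustrative):

    # logstash.yml -- dead letter queue settings (illustrative values)
    dead_letter_queue.enable: true
    # raised from the original 50 MB cap
    dead_letter_queue.max_bytes: 500mb
    # path.dead_letter_queue: /var/lib/logstash/dead_letter_queue   # defaults to <path.data>/dead_letter_queue

As far as I know, logstash.yml is only read at startup, and a SIGHUP only reloads the pipeline configuration, which would explain why the new max_bytes did not take effect until a full restart.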

Documentation notes: "The maximum size of each dead letter queue. Entries will be dropped if they would increase the size of the dead letter queue beyond this setting."

It doesn't seem like these entries are getting dropped though.

Is it best just to create a dead letter queue pipeline and redirect all DLQ entries to /dev/null?
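Something like this is what I have in mind (a minimal sketch using the dead_letter_queue input plugin; the path and pipeline_id here are assumptions for my setup):

    input {
      dead_letter_queue {
        # DLQ directory, normally <path.data>/dead_letter_queue (adjust for your install)
        path => "/var/lib/logstash/dead_letter_queue"
        pipeline_id => "main"
        commit_offsets => true
      }
    }

    output {
      # throw everything away; the file output will happily write to /dev/null
      file {
        path => "/dev/null"
      }
    }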

Just curious: if this is the case, why configure the DLQ at all? Why not just let the event drop? Or do you use it for investigation?

If I don’t configure the DLQ, my Logstash pipeline will keep sending the event to ES indefinitely, and ES will keep responding with a 400-series error. (This is why I initially configured the DLQ on 7.5.)

Interesting; retrying indefinitely is not my understanding.

From the 7.12 retry policy documentation:

The following document errors are handled as follows:

400 and 404 errors are sent to the dead letter queue (DLQ), if enabled. If a DLQ is not enabled, a log message will be emitted, and the event will be dropped. See DLQ Policy for more info.
409 errors (conflict) are logged as a warning and dropped.

That seems to be the same policy in 7.5 as well, so perhaps I am misunderstanding something.

When an event is sent to Elasticsearch there are two kinds of status code involved, and the retry handling for them is different.

If the _bulk request returns an error, unless it is a 429 it is retried indefinitely. In some cases this makes no sense (e.g. a 413 error will always recur if retried and thus it will shut down the flow of events).

Each document within the _bulk payload also returns a status. For these, a 400 or 404 sends the event to the DLQ if one is configured (otherwise it is logged and dropped), a 409 is dropped, and anything else is retried indefinitely (or "for everything else there is Mastercard", as the comment in the code says).

Logstash has an at-least-once delivery model, so it leans towards retries. The documentation says "If the DLQ is not enabled, and a mapping error happens, the problem is logged as a warning, and the event is dropped". That is specific to 409 errors; everything else is retried indefinitely.


@Badger
Thanks for the clarification. Then, back to his original question: do we know why everything slows down when the DLQ is full?

No. In the dlq_writer code the only thing I can see that could take any significant time is lock.lock(). I would suggest running the system into a state where throughput has significantly dropped and then getting a thread dump. It may then become apparent what the threads are waiting for.

Generally mutex operations are fast, but there are circumstances where lock contention can slow a system to a crawl.

I restarted the Logstash nodes, but my system still did not recover; hours later I also had to restart Filebeat. (Filebeat sends messages to 3 Logstash nodes.) During this time Filebeat appeared to send a very small number of events every few minutes, but monitoring also showed failures.

Filebeat has: queue.mem.events: 196608

Here are two snippets from the Filebeat monitoring logging in syslog:

				   "pipeline": {
                    "clients": 19,
                    "events": {
                        "active": 192537,
                        "published": 196608,
                        "retry": 129024,
                        "total": 196608
                    },

                "pipeline": {
                    "clients": 19,
                    "events": {
                        "active": 194585,
                        "published": 196608,
                        "retry": 114688,
                        "total": 196608
                    },
                    "queue": {
                        "acked": 2048
                    }

I'm not sure what each field really means, but it "feels" like Filebeat was waiting on some type of ACK for 95% or so of the events it sent to Logstash, but never got them. Because of this, perhaps it was only able to send a few thousand events each batch as opposed to utilizing the full queue size?

Every two minutes Filebeat would log this for the IP of each Logstash node:

2021-04-23T13:58:09.607Z        ERROR        [publisher_pipeline_output]        pipeline/output.go:180        failed to publish events: write tcp 1.4.3.11:53390->1.4.3.1:5044: write: connection reset by peer
