_node/stats - duration_in_millis vs queue_push_duration_in_millis

We recently upgraded Logstash to 5.4.1 and have been having some performance issues. To work out what is going on, I have been digging into the node stats API. However, not all of the fields are well documented.

In the results there is a duration_in_millis field and a queue_push_duration_in_millis field.

  "pipeline" : {
    "events" : {
      "duration_in_millis" : 3766620,
      "in" : 489102,
      "filtered" : 489088,
      "out" : 485088,
      "queue_push_duration_in_millis" : 6094242
    },

duration_in_millis shows up in the filters and outputs, and queue_push_duration_in_millis shows up in the inputs. Is there a difference between these? Or are they the same thing, just labeled differently depending on where the event is?
For the data above, could I say that events spent roughly twice as long in the input stage as in the filters and outputs? Or is that not how it should be read?
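In case it helps, here is a minimal sketch of how I am pulling these numbers out and turning them into rough per-event averages. It assumes the monitoring API is listening on the default localhost:9600 (adjust the URL for your setup); whether the two figures can be compared like this is exactly what I am asking.

    # Minimal sketch: pull the pipeline event counters from the node stats API
    # and compute rough per-event averages. Assumes the monitoring API is on
    # the default localhost:9600.
    import json
    from urllib.request import urlopen

    with urlopen("http://localhost:9600/_node/stats") as resp:
        stats = json.load(resp)

    events = stats["pipeline"]["events"]

    # Whether these two figures are directly comparable is exactly my question.
    push_ms = events["queue_push_duration_in_millis"] / max(events["in"], 1)
    work_ms = events["duration_in_millis"] / max(events["filtered"], 1)

    print("queue push:    %.2f ms/event" % push_ms)
    print("filter+output: %.2f ms/event" % work_ms)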


TL;DR
A queue_push_duration_in_millis that is higher than duration_in_millis means that your inputs are fast and your filters/outputs are slow.


The value of queue_push_duration_in_millis is a bit difficult to make sense of because it depends on whether you are using the memory queue or the persistent queue.

Some inputs are single-threaded, some use a fixed number of threads, and some use a variable number of threads (the TCP input, for example, creates a thread per connection).

The memory queue is a synchronous hand-off between an input thread (writer) and a worker thread (reader). A reader will wait for a writer thread and vice versa. A reader, when it has moved to the front of the waiting list, will try to acquire a "batch" of events by taking events from any waiting writers (or leave with a short batch if it times out).
If you have many writers and, say, 8 readers (one per core) that are slow, then the time spent waiting goes up and so does queue_push_duration_in_millis.
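To make this concrete, here is a toy model in plain Python (not Logstash's actual code) of a synchronous-style hand-off: the writer blocks in put() whenever the reader is still busy, and that blocked time is what queue_push_duration_in_millis accumulates.

    # Toy model of the memory queue hand-off (not Logstash internals).
    # A slow reader keeps the writer blocked in put(), inflating the
    # writer's "push duration".
    import queue
    import threading
    import time

    handoff = queue.Queue(maxsize=1)   # tiny buffer ~ synchronous hand-off
    push_millis = 0.0

    def writer(n_events):
        global push_millis
        for i in range(n_events):
            start = time.monotonic()
            handoff.put(i)                    # blocks until the reader catches up
            push_millis += (time.monotonic() - start) * 1000

    def slow_reader(n_events):
        for _ in range(n_events):
            handoff.get()
            time.sleep(0.005)                 # stand-in for slow filter/output work

    n = 200
    w = threading.Thread(target=writer, args=(n,))
    r = threading.Thread(target=slow_reader, args=(n,))
    w.start(); r.start(); w.join(); r.join()

    # Most of the writer's time is spent waiting on the slow reader.
    print("time blocked in put(): %.0f ms" % push_millis)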

The persistent queue is async. The readers and writers have no dependency on each other, but at the moment there is a queue-wide read/write lock, so readers and writers have to interleave their calls. If the readers are slow because they are off filtering or outputting, the writers only have to interleave with each other; if the writers are waiting for new input, the readers are free to drain the queue, interleaving only among themselves. (This is why fewer workers with bigger batches can be faster than more workers with smaller batches when the incoming event cadence is highly variable.)
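A rough sketch of that interleaving, again only an illustration and not the real persistent queue code: reads and writes share one queue-wide lock, but the expensive filter/output work happens outside it, so writers only contend with readers that are actually touching the queue.

    # Toy illustration of a queue-wide lock (not the real persistent queue).
    # Writers and readers interleave their calls on the lock, but a reader
    # that is off filtering/outputting holds nothing, so writers then only
    # contend with each other.
    import threading

    queue_lock = threading.Lock()
    buffer = []

    def write(event):
        with queue_lock:            # interleaves with any other lock holder
            buffer.append(event)

    def read_batch(max_size):
        with queue_lock:            # reads take the same queue-wide lock...
            batch = buffer[:max_size]
            del buffer[:max_size]
        # ...but filtering/outputting happens after the lock is released
        return batch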

To get a sense of duration_in_millis, think of it as the time spent doing things to the batch: from the instant the batch is "opened" (at the end of the batch acquisition described above) to the instant the batch is "closed", after the filter and output sections are done.
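As an illustration of where that clock starts and stops (a sketch, not Logstash internals):

    # Sketch of what duration_in_millis covers: the timer starts when a worker
    # opens a batch and stops once the filter and output stages are done with it.
    import time

    def run_filters(batch):
        # stand-in for the pipeline's filter section
        return [e for e in batch if e is not None]

    def send_output(batch):
        # stand-in for the pipeline's output section
        pass

    def process_batch(batch):
        start = time.monotonic()                  # batch "opened"
        send_output(run_filters(batch))
        return (time.monotonic() - start) * 1000  # batch "closed"

    # The per-batch elapsed time is what accumulates into duration_in_millis.
    print("%.3f ms for this batch" % process_batch([{"msg": "a"}, None, {"msg": "b"}]))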

The persistent queue section is included for the benefit of other readers.


We are using the memory queue with a Kafka input that supports multiple input threads.
So queue_push_duration_in_millis is basically the time the input spent waiting to hand off an event to a free worker?

Does this include the time the input took to retrieve/receive the event?
For example, let's say it took 1000ms for an event to get all the way through logstash. 200ms was spent receiving the data. 200ms was spent waiting for a free worker. 400ms was spent running filters. And 200ms was spent sending the event off elsewhere. Would queue_push_duration_in_millis show 200ms or 400ms? If 200ms, is there a way to see the other 200ms?

Thanks!

At this time, each input may or may not have metrics added, and each input is measured differently.

The common piece that we could easily roll out was timing the queue's write method.
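In other words, the timer wraps the write into the queue itself, so every input gets the same measurement no matter how the input receives its data. A sketch of the idea (not the actual Logstash code):

    # Sketch of the idea: time only the hand-off into the queue, so the metric
    # is common to every input regardless of how that input receives data.
    import time

    def timed_write(queue_write, event, counters):
        start = time.monotonic()
        queue_write(event)        # the actual push into the pipeline queue
        counters["queue_push_duration_in_millis"] += (time.monotonic() - start) * 1000

    # Hypothetical usage, with a plain list standing in for the queue:
    counters = {"queue_push_duration_in_millis": 0.0}
    pipeline_queue = []
    timed_write(pipeline_queue.append, {"msg": "hello"}, counters)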

Perhaps Kafka's consumer metrics can provide some insights.
