Aggregate filter plugin - timeout function bug

Hello everyone,
I have some problems with the Timeout command from the aggregate filter plugin. I want to create a script that calculates the meantime between two log entries. There are examples of such a function (https://www.elastic.co/guide/en/logstash/current/plugins-filter-aggregate.html#Plugins-filter-Aggregate-Example1). That has worked as far as possible.

However, in order to save resources, I would like to use a timeout so that long intermediate times are canceled and -1 is given back. In most cases that works too. However, there is partly the situation where the Timeout command did not work.

In the following I have an exemplary script for displaying the error:

filter {
    #match My TestInputs
    grok {
        match => [ "message", "%{NOTSPACE:id} - %{NOTSPACE:status}" ]
    }


	# the Startpoint
    if ([status] == "start")
    {
        aggregate {
            task_id => "%{id}"
            code => "map['duartion_map'] = (event.get('@timestamp').to_f*1000).to_i"
            map_action => "create"
			timeout => 1
        }
    }


	# the Endpoint
    if [status] == "end"
    {
        #Fill the field duration_field, so that is filled every time, even if the time difference is over the timeout
        mutate{
            add_field => {duration_field => "-1"}
        }
        aggregate 
		{
			task_id => "%{id}"
            code => "event.set('duration_field', (event.get('@timestamp').to_f*1000).to_i - map['duartion_map'])"
			map_action => "update"
            end_of_task => true
        }
    }
}

I have entered a successive of the following records, for testing:
1 - Start.
1 - end
2 - Start.
2 - end

However, when the script is issued, you can see that the timeout command has not worked for the second time, although nothing has been changed in the source code, and the logstash test server was not restarted. This problem can not be precisely reproduced and randomly occurs. I already had console outputs, which were about 1.5 seconds apart with the timeout worked.

Does anyone know the problem and knows an idea that I can try to solve the problem?

Kind regards

The timeout is implemented using the flush method, which is called by logstash every five seconds, so if you set timeout to 1, the effective value will be anywhere from 1 to 5 seconds.

Thank you badger. Unfortunately, I realized that this wasn't my solution. I tried to set the timeout to 10 seconds and what can I say. Sometimes it works and sometimes logstash just ignores my timeout.

For more information see code and log below:

filter {
    #match My TestInputs
    grok {
        match => [ "message", "%{NOTSPACE:id} - %{NOTSPACE:status}" ]
    }


	# the Startpoint
    if ([status] == "start")
    {
        aggregate {
            task_id => "%{id}"
            code => "map['duartion_map'] = (event.get('@timestamp').to_f*1000).to_i"
            map_action => "create"
			timeout => 10
        }
    }


	# the Endpoint
    if [status] == "end"
    {
        #Fill the field duration_field, so that is filled every time, even if the time difference is over the timeout
        mutate{
            add_field => {duration_field => "-1"}
        }
        aggregate 
		{
			task_id => "%{id}"
            code => "event.set('duration_field', (event.get('@timestamp').to_f*1000).to_i - map['duartion_map'])"
			map_action => "update"
            end_of_task => true
        }
    }
}

And my output looks like

What went wrong? Any idea?

Once again, the timeout is evaluated by a function that is called every five seconds. So a ten second timeout will actually be somewhere between 10 and 15 seconds.

Ok thanks I understand. Then an example from another system. In this case, sometimes it happens that the timeout command does not work.

Usually the timeout is set to 120 seconds. If this is exceeded, the "edi_basic_latency" field is set to 120000. This seems to work so far as the most common value by far is 120000, as shown below.
Top values

In the following example you can now see that there are some records that are well over 120 seconds long. In this case at 180 seconds!

So what is the problem in this case? It is much more than 120seconds plus 5 seconds.

If the aggregate timeout is set to 120 seconds I cannot think of a reason why it would take more than 125 seconds to fire.

Maybe it is possible to change the update time so that the flush method doesn't call every 5 seconds? And if yes how can i check/change the update time from the flush method?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.