Dealing with delays between message receipt and watcher

I'm trying to figure out how to deal with the delays between when a message is received by a collector, when it is finally indexed in Elasticsearch, and when Watcher finally runs the query that generates an alert.

As an example, a system generates two events that are received by Logstash with the following timestamps:
2016-06-16T00:02:26.091Z
2016-06-16T00:02:26.092Z

It takes an indeterminate amount of time (more than 10 seconds) for these documents to finally land in Elasticsearch.

When the configured watch runs, it runs its query with a range filter that should cover the last 10 seconds:

    "filter": {
      "range": {
        "@timestamp" : {
          "from": "{{ctx.trigger.scheduled_time}}||-10s",
          "to": "{{ctx.trigger.triggered_time}}"
        }
      }
    }

This results in a scheduled time of 2016-06-16T00:02:41.492Z; minus 10 seconds, the window starts at 2016-06-16T00:02:31.492Z.

But because my events were received by Logstash more than 10 seconds before the watch ran, and took a long time to get into Elasticsearch, they are missed by Watcher. The only way I'm able to catch them is to set a long offset in my query filter, but then I risk triggering duplicate alerts, since the latency between Logstash and Elasticsearch varies. How do I work around this?
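For example, something like the following is the only thing that reliably catches them right now; the 60-second offset is just a guess at my worst-case latency, not a number I can justify:

    "filter": {
      "range": {
        "@timestamp" : {
          "from": "{{ctx.trigger.scheduled_time}}||-60s",
          "to": "{{ctx.trigger.triggered_time}}"
        }
      }
    }

With a 10-second trigger interval, consecutive runs now overlap by roughly 50 seconds, which is exactly where the duplicate alerts come from.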

It would be nice if you could use the last time a watch ran as a value in your filter, something like {{ctx.trigger.last_run_time}}. Then you could guarantee that the watch catches everything that happened between runs. Maybe you can and I'm missing something? For reference, here is my full watch:

"trigger" : {
    "schedule" : { "interval" : "10s" }
  },
  "input" : {
    "search" : {
      "request" : {
        "indices" : [
          "myindex*"
        ],
        "body" : {
          "query": {
                  "query_string": {
                    "analyze_wildcard": true,
                    "query": "query"
          }
        },
        "filter": {
          "range": {
            "@timestamp" : {
              "from": "{{ctx.trigger.scheduled_time}}||-10s",
              "to": "{{ctx.trigger.triggered_time}}"
            }
          }
        }
      }
    }
  }
  },
  "condition" : {
    "compare" : {
      "ctx.payload.hits.total" : { 
        "gte" : 1
      }
    }
  },
  "actions" : {
    "email_admin" : {
      "email" : {
      "from":"from@domain.com",
      "to" : "to@domain.com",
      "subject" : "Email Subject",
      "body" : {
        "html" : "Message body"
        }
      }
    }
  }

Hey,

we do indeed plan to put information about the last couple of runs into the watch execution context (the ctx variable you are using), so you can easily extract the last execution time from there. However, it is not there yet.

That said, this does not solve your basic problem: if you index documents whose timestamp is older than the last run of a watch, you still risk missing them.

You should try to find out why those documents take so long. Could it be that the refresh_interval has been increased, so that documents only become available for search every five seconds instead of the default one second? Five seconds is the Logstash default.
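If you want to double-check, something along these lines shows the current value and resets it to one second; I'm using the myindex* pattern from your watch here as an assumption:

    GET /myindex*/_settings

    PUT /myindex*/_settings
    {
      "index" : {
        "refresh_interval" : "1s"
      }
    }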

Two solutions here: either you decrease the refresh interval, which will cause more IO and merging on the Elasticsearch side, or you increase your search interval.
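The second option could look roughly like this; the numbers are placeholders, and the window deliberately reaches a bit further back than the interval so that slowly indexed documents still get picked up:

    "trigger" : {
      "schedule" : { "interval" : "60s" }
    }

    "filter": {
      "range": {
        "@timestamp" : {
          "from": "{{ctx.trigger.scheduled_time}}||-70s",
          "to": "{{ctx.trigger.triggered_time}}"
        }
      }
    }

The further the window extends past the interval, the more indexing delay you tolerate, but also the more likely it is that two consecutive runs match the same document.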

Also, this is just a guess, but maybe getting the documents into Elasticsearch is already slow; if so, you should pinpoint that problem first.

Hope this helps!

--Alex

I haven't made any changes to the refresh_interval setting anywhere, so it should still be the default of one second.

I'm honestly not sure what I can do to reduce the latency, other than flattening my deployment, and if I do that I fear I'll lose availability and redundancy. There are a number of hops between a generated event and Elasticsearch.

My architecture looks something like this:

Beats -> Logstash -> RabbitMQ -> Logstash -> Elasticsearch

Hey,

then you should first measure how long it takes for an event to get from Beats to Elasticsearch, and where it hangs. Since the timestamp of an event reflects when it was generated, not when it was indexed, you might need to increase the time span you are searching over in order to catch the right events, completely independently of Watcher. Get numbers first, then adapt only if needed.
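As a very rough first check, you can compare the newest indexed @timestamp against the wall clock while events are flowing, for example:

    GET /myindex*/_search
    {
      "size" : 1,
      "sort" : [
        { "@timestamp" : { "order" : "desc" } }
      ]
    }

The gap between that timestamp and the current time gives you an upper bound on the end-to-end delay; if it is consistently larger than your watch window, you know where to start digging.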

--Alex