Wrong anomaly detection with X-Pack ML?

Hello,

I'm starting to run some tests with Machine Learning on my data, and I'm trying to understand how it works and why it flags an anomaly (an unexpected zero value) when the same query does not return zero.

The data I'm running the machine learning job on is firewall logs, and the job runs against a saved query.

For example, the machine learning job identified an unexpected zero value anomaly in the interval between 12:45 and 13:00, but if I look at the saved query in Discover, there is no zero value there.

(screenshot: Machine Learning job)

(screenshot: saved query in Discover)

The job bucket span is 15 minutes, the datafeed has a query delay of 90s, and the data is indexed almost in real time, with an index refresh interval of 30s.
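For reference, the relevant parts of the job and datafeed configuration look roughly like this (the job name, index pattern, time field, and query are placeholders for my real ones, which use the saved query):

PUT _xpack/ml/anomaly_detectors/firewall_event_count
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "count" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}

PUT _xpack/ml/datafeeds/datafeed-firewall_event_count
{
  "job_id": "firewall_event_count",
  "indices": [ "firewall-logs-*" ],
  "query_delay": "90s",
  "query": { "match_all": {} }
}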

What am I missing or doing wrong?

You must have caught a situation in which your ingest pipeline was slow enough so that when ML looked at that bucket of time, there really was nothing there (but now there is, of course).

Looks like you may need a larger query_delay.
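If it helps, one way to bump it is the datafeed update API. This is just a sketch - the datafeed ID and the 150s value are placeholders, and you may need to stop the datafeed before updating and restart it afterwards:

POST _xpack/ml/datafeeds/datafeed-<your_job_id>/_update
{
  "query_delay": "150s"
}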

This may not be an isolated event. You could confirm this by using timelion to plot the number of docs in the raw data index and compare that against the anomaly results index, summing up the event_count field (which is a bucket-by-bucket count of how many records the ML job saw when each bucket was processed). So, for example:

.es(index=farequote).label("raw data count"),
.es(index='.ml-anomalies-my_farequote_job', timefield='timestamp', metric='sum:event_count').label("count ML processed"),
.es(index=farequote).subtract(.es(index='.ml-anomalies-my_farequote_job', timefield='timestamp', metric='sum:event_count')).label("difference")

(screenshot: timelion chart of raw data count, count ML processed, and difference)

(the job name in this example is “my_farequote_job”, and the name of the index where the raw data exists is "farequote". Also, every job creates an index alias of .ml-anomalies-<job_id>.)

Here, we can see that the difference is zero, over time. If there were problems, the difference chart should show it.

Well, what does this mean?

Every 15 minutes (the job bucket span and aggregation interval) I have spikes in the "count ML processed" and "difference" series; also, the "raw data count" query does not return anything.

The anomaly happened in the interval where I do not have any spike.

Ha - I guess you cannot zoom in too far with timelion or you get weird-looking results. I see the same artifact if I zoom in too far:

However, it looks fine when zoomed out:

I think it's just because in timelion, you cannot set the aggregation interval for the points. It must make its own decisions on what time interval it aggregates the points over, depending on the time range. (ML's points are aggregated over an interval equal to bucket_span, obviously)

When you're zoomed out, you don't notice the artifacts because timelion ends up choosing an aggregation interval for the raw data that matches the ML data.

So, unless you look at a time range large enough to avoid this weird visual artifact, maybe it wasn't a useful exercise. Sorry about that.

But, to the original point, you may still occasionally be missing data if your query_delay is not big enough. Again, if timelion isn't the right tool to compare the two (what the raw data is versus how much data ML saw), you could just compare them manually via searches in Discover, for example.
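As a rough sketch of that manual comparison (the index name, job name, time field, and times are placeholders - substitute your own, e.g. the 12:45-13:00 bucket you mentioned), you could count the raw documents in one bucket and compare that to the event_count ML recorded for the same bucket:

GET firewall-logs-*/_count
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2018-04-04T12:45:00Z",
        "lt": "2018-04-04T13:00:00Z"
      }
    }
  }
}

GET .ml-anomalies-<your_job_id>/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "result_type": "bucket" } },
        { "range": { "timestamp": { "gte": "2018-04-04T12:45:00Z", "lt": "2018-04-04T13:00:00Z" } } }
      ]
    }
  },
  "aggs": {
    "count_ml_saw": { "sum": { "field": "event_count" } }
  }
}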

My data is being ingested almost in real time. I have an indexing rate of around 2600 events/s, and the data is available for search in Kibana after 30s (index.refresh_interval: 30s in the template for the index).
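(For reference, that setting in my index template is roughly the following - the template name and index pattern here are placeholders:)

PUT _template/firewall-logs
{
  "index_patterns": [ "firewall-logs-*" ],
  "settings": {
    "index": {
      "refresh_interval": "30s"
    }
  }
}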

I've set the query delay to 90s and was watching both the Discover search and the machine learning job (Single Metric Viewer), both with an auto-refresh of 1 minute. The machine learning job identified another zero value anomaly, but I had data in the Discover search.

I will try to increase the query delay and see if it solves the problem.

Curious as to why you've chosen to set index.refresh_interval to 30s. The default is 1s. Were you having performance issues with it at 1s? If so, perhaps you need more data nodes to handle the ingest.

I was having performance issues because of a problem in the underlying hardware of the cloud provider I'm using (Azure). The problem was solved, but I haven't changed the index.refresh_interval back yet; 30s is working fine for us.

Could this influence the Machine Learning job even if the query delay is larger than the refresh interval?

Hello,

I'm seeing the same wrong unexpected zero value on another job, running on another index with a different refresh interval (5s) and a query delay of 90s.

There are no zero values in the data. What can cause this error in the machine learning job? Is there any configuration that I need to change?

Well, it is hard to know. You have certainly made the query_delay larger than the index.refresh_interval, but that still doesn't rule out the possibility that something else is delaying the ingest of the data.

Perhaps you could use something other than ML to verify whether this is occurring - like a Watch that runs every 5 minutes (or whatever your bucket_span is) and, while also accounting for a query_delay, logs the number of docs in the index of your choosing. For example, here's a watch that reports the number of docs in an index for a 5-minute-wide window from "now-5m-90s" to "now-90s":

{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          ".monitoring-es*"
        ],
        "types": [],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "timestamp": {
                      "gte": "now-5m-90s",
                      "lte": "now-90s"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return true"
    }
  },
  "actions": {
    "my-logging-action": {
      "logging": {
        "level": "info",
        "text": "There are {{ctx.payload.hits.total}} documents in the index for the last 5 minutes (with a delay of 90s) - measured at {{ctx.execution_time}}"
      }
    }
  }
}

The logging output (to elasticsearch.log) would look something like:

There are 5250 documents in the index for the last 5 minutes (with a delay of 90s) - measured at 2018-04-04T19:40:16.027Z

Obviously, change the index to the one you care about, edit the time ranges appropriately, and let that run for a while...
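For instance, to line the watch up with a 15 minute bucket_span and a slightly larger query_delay of, say, 120s (both values are just illustrative - use whatever your job actually has, and swap "timestamp" for your time field), the trigger and range filter in the watch above would become:

"trigger": {
  "schedule": {
    "interval": "15m"
  }
},
...
"range": {
  "timestamp": {
    "gte": "now-15m-120s",
    "lte": "now-120s"
  }
}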

Have you discovered anything new, @leandrojmp?

Hello,

I haven't had the time to run other tests, but my data is ingested basically as a stream: a UDP input on Logstash receives log events from hundreds of devices, parses the data, and sends it to Elasticsearch. If I had zero documents in an interval of 5 minutes I would notice it much sooner, since it would create a gap in the other dashboards and in Discover, which we keep watching.

When ML detects the zero doc count, if I recreate the job over the same time period, the zero value is no longer there.

I will try to decrease the index refresh interval and increase the query delay, but the query delay was already 3 times the refresh interval (refresh interval of 30s and query delay of 90s).

Ok, great - keep us posted. What you observe is totally consistent with there not being enough delay between when data is ingested and becomes searchable and when ML asks Elasticsearch for that data.
