Wrong anomaly detection with X-Pack ML?

Hello,

I'm starting to run some tests with Machine Learning on my data, and I'm trying to understand how it works and why it flags an anomaly (an unexpected zero value) when the same query does not return zero.

The data I'm running the machine learning job on is firewall logs, and the job runs against a saved query.

For example, the machine learning job identified an unexpected zero value anomaly in the interval between 12:45 and 13:00, but if I look at the saved query in Discover, there is no zero value there.

(screenshot: Machine Learning job)

(screenshot: saved query in Discover)

The job bucket span is 15 minutes, the datafeed has a query delay of 90s, and the data is indexed almost in real time, with an index refresh interval of 30s.
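For reference, the relevant parts of the job and datafeed configuration look roughly like this (the job name, index pattern, time field, and query are placeholders for my real ones, which use the saved query):

PUT _xpack/ml/anomaly_detectors/firewall_event_count
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "count" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}

PUT _xpack/ml/datafeeds/datafeed-firewall_event_count
{
  "job_id": "firewall_event_count",
  "indices": [ "firewall-logs-*" ],
  "query_delay": "90s",
  "query": { "match_all": {} }
}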

What am I missing or doing wrong?

You must have caught a situation in which your ingest pipeline was slow enough so that when ML looked at that bucket of time, there really was nothing there (but now there is, of course).

Looks like you may need a larger query_delay.
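If it helps, one way to bump it is the datafeed update API. This is just a sketch - the datafeed ID and the 150s value are placeholders, and you may need to stop the datafeed before updating and restart it afterwards:

POST _xpack/ml/datafeeds/datafeed-<your_job_id>/_update
{
  "query_delay": "150s"
}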

This may not be an isolated event. You could confirm this by using timelion to plot the number of docs in the raw data index and compare that against the anomaly results index, summing up the event_count field (which is a bucket-by-bucket count of how many records the ML job saw when each bucket was processed). So, for example:

.es(index=farequote).label("raw data count"),
.es(index='.ml-anomalies-my_farequote_job', timefield='timestamp', metric='sum:event_count').label("count ML processed"),
.es(index=farequote).subtract(.es(index='.ml-anomalies-my_farequote_job', timefield='timestamp', metric='sum:event_count')).label("difference")

(screenshot: timelion chart of raw data count, count ML processed, and difference)

(the job name in this example is “my_farequote_job”, and the name of the index where the raw data exists is "farequote". Also, every job creates an index alias of .ml-anomalies-<job_id>.)

Here, we can see that the difference is zero, over time. If there were problems, the difference chart should show it.

Well, what does this mean?

Every 15 minutes (the job bucket span and aggregation interval) I have spikes in the "count ML processed" and "difference" series; also, the "raw data count" query does not return anything.

The anomaly happened in the interval where I do not have any spike.

Ha - I guess you cannot zoom in too far with timelion or you get weird-looking results. I see the same artifact if I zoom in too far:

However, it looks fine when zoomed out:

I think it's just because in timelion, you cannot set the aggregation interval for the points. It must make its own decisions on what time interval it aggregates the points over, depending on the time range. (ML's points are aggregated over an interval equal to bucket_span, obviously)

When you're zoomed out, you don't notice the artifacts because timelion ends up choosing an aggregation interval for the raw data that matches the ML data.

So, unless you look at a time range large enough to avoid this weird visual artifact, maybe it wasn't a useful exercise. Sorry about that.

But, to the original point, you may still occasionally be missing data if your query_delay is not big enough. Again, if timelion isn't the right tool to compare the two (what the raw data is versus how much data ML saw), you could just compare them manually via searches in Discover, for example.
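As a rough sketch of that manual comparison (the index name, job name, time field, and times are placeholders - substitute your own, e.g. the 12:45-13:00 bucket you mentioned), you could count the raw documents in one bucket and compare that to the event_count ML recorded for the same bucket:

GET firewall-logs-*/_count
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2018-04-04T12:45:00Z",
        "lt": "2018-04-04T13:00:00Z"
      }
    }
  }
}

GET .ml-anomalies-<your_job_id>/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "result_type": "bucket" } },
        { "range": { "timestamp": { "gte": "2018-04-04T12:45:00Z", "lt": "2018-04-04T13:00:00Z" } } }
      ]
    }
  },
  "aggs": {
    "count_ml_saw": { "sum": { "field": "event_count" } }
  }
}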

My data is being ingested almost in real time. I have an indexing rate of around 2600 events/s, and the data is available for search in Kibana after 30s (index.refresh_interval: 30s in the template for the index).
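(For reference, that setting in my index template is roughly the following - the template name and index pattern here are placeholders:)

PUT _template/firewall-logs
{
  "index_patterns": [ "firewall-logs-*" ],
  "settings": {
    "index": {
      "refresh_interval": "30s"
    }
  }
}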

I've set the query delay to 90s and was watching both the Discover search and the machine learning job (Single Metric Viewer), both with an auto-refresh of 1 minute. The machine learning job identified another zero value anomaly, but I had data in the Discover search.

I will try to increase the query delay and see if it solves the problem.

Curious as to why you've chosen to set index.refresh_interval to 30s. The default is 1s. Were you having performance issues with it at 1s? If so, perhaps you need more data nodes to handle the ingest.

I was having performance issues because of a problem in the underlying hardware of the cloud provider I'm using (Azure). The problem was solved, but I haven't changed the index.refresh_interval back yet; 30s is working fine for us.

Could this influence the Machine Learning job even if the query delay is larger than the refresh interval?

Hello,

I'm seeing the same wrong unexpected zero value on another job, running on another index with a different refresh interval (5s) and a query delay of 90s.

There are no zero values in the data. What can cause this error in the machine learning job? Is there any configuration that I need to change?

Well, it is hard to know. You have certainly made the query_delay larger than the index.refresh_interval, but that still doesn't rule out the possibility that something else is delaying the ingest of the data.

Perhaps you could use something other than ML to verify whether this is occurring - like a Watch that runs every 5 minutes (or whatever your bucket_span is) and, while also accounting for a query_delay, logs the number of docs in the index of your choosing. For example, here's a watch that reports the number of docs in an index for a 5-minute-wide window from "now-5m-90s" to "now-90s":

{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          ".monitoring-es*"
        ],
        "types": [],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "timestamp": {
                      "gte": "now-5m-90s",
                      "lte": "now-90s"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return true"
    }
  },
  "actions": {
    "my-logging-action": {
      "logging": {
        "level": "info",
        "text": "There are {{ctx.payload.hits.total}} documents in the index for the last 5 minutes (with a delay of 90s) - measured at {{ctx.execution_time}}"
      }
    }
  }
}

The logging output (to elasticsearch.log) would look something like:

There are 5250 documents in the index for the last 5 minutes (with a delay of 90s) - measured at 2018-04-04T19:40:16.027Z

Obviously, change the index to the one you care about, edit the time ranges appropriately, and let that run for a while...
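For instance, to line the watch up with a 15 minute bucket_span and a slightly larger query_delay of, say, 120s (both values are just illustrative - use whatever your job actually has, and swap "timestamp" for your time field), the trigger and range filter in the watch above would become:

"trigger": {
  "schedule": {
    "interval": "15m"
  }
},
...
"range": {
  "timestamp": {
    "gte": "now-15m-120s",
    "lte": "now-120s"
  }
}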

Have you discovered anything new, @leandrojmp?

Hello,

I haven't had the time to run other tests, but my data is ingested basically as a stream: a UDP input on Logstash receives log events from hundreds of devices, parses the data, and sends it to Elasticsearch. If I had zero documents in an interval of 5 minutes I would notice it much sooner, since it would create a gap in the other dashboards and in Discover, which we keep watching.

When ML detects the zero doc count, if I recreate the job over the same time period, the zero value is no longer there.

I will try to decrease the index refresh interval and increase the query delay, but the query delay was already 3 times the refresh interval (refresh interval of 30s and query delay of 90s).

Ok, great - keep us posted. What you observe is totally consistent with there not being enough delay between when data is ingested and becomes searchable and when ML asks Elasticsearch for that data.
