Variation between the data read by the machine learning module and the actual data present in the CSV file

Hi,

I was creating a multi metric job for sample data with two metrics using the machine learning module. I inserted anomalies manually for testing purposes by entering values of larger magnitude at random; for example, if all my metric values are in the range 1 to 50, I inserted values like 98765432. I have the sample dataset in a CSV file and inserted the data from it into the database using a custom script.

The ML module was successful in detecting the anomalies, but I found a variation between the values displayed in the 'Anomaly Explorer' window and the actual ones present in the CSV file. I did some checks and observed this behaviour only for values more than 6 digits long. Values up to 6 digits match perfectly.

Scenario example:

I have demo data on a per-day basis with two metrics in the range 1 to 10. I introduced anomaly values of '12345675' and '123459' in one of the metrics.

I created a multi metric job with a bucket span of one day for verification purposes. Upon running the analysis, I get the following result in the 'Anomaly Explorer' window:

[screenshot: Anomaly Explorer result]

As you can see, the value present in my CSV file is '12345675' but the value displayed in the explorer window is '12345700'. On the other hand, the value '123459' is displayed correctly in the explorer window, as below:

[screenshot: Anomaly Explorer showing '123459' displayed correctly]

I tried changing the data type formats but I get the same behaviour. Could anyone help me understand why this happens?

Please let me know if you need more details on this issue.

Thank you.

Hi,

Could you please paste the mappings of the index where you store the data?
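
You should be able to retrieve them with the get mapping API, for example with something like the following (replace <your_index> with the name of your index; localhost:9200 is just an assumption about where your cluster runs):

http://localhost:9200/<your_index>/_mapping?pretty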

{
  "sample": {
    "mappings": {
      "logs": {
        "properties": {
          "Date": {
            "type": "date",
            "format": "dd-MM-yyyy"
          },
          "Metric_1": {
            "type": "long"
          },
          "Metric_2": {
            "type": "long"
          },
          "Server_Name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Thank you for that. One more thing would be really helpful in understanding the issue.

Could you do a search against that index with body:

{
	"docvalue_fields": ["Metric_1", "Metric_2"]
}

In addition, you will need to narrow the time range in that search so it hits the documents that contain those anomalous values. You can do that by including a range query.
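
For example, a search body along these lines should do it (the dates below are placeholders; adjust them to the days where you inserted the anomalous values, and note the format matches the Date mapping you posted):

{
  "docvalue_fields": ["Metric_1", "Metric_2"],
  "query": {
    "range": {
      "Date": {
        "gte": "01-09-2017",
        "lte": "30-09-2017",
        "format": "dd-MM-yyyy"
      }
    }
  }
}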

If you post the response from that search, that will be a great help in figuring out the issue.

Thank you for the reply. Kindly find below the search results:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "sample",
        "_type": "logs",
        "_id": "AV9wi6Et0ecAB0dXaQii",
        "_score": 1,
        "_source": {
          "Date": "11-09-2017",
          "Metric_1": 123455,
          "Metric_2": 123459,
          "Server_Name": "Demo"
        },
        "fields": {
          "Metric_2": [
            123459
          ],
          "Metric_1": [
            123455
          ]
        }
      },
      {
        "_index": "sample",
        "_type": "logs",
        "_id": "AV9wi6Et0ecAB0dXaQi0",
        "_score": 1,
        "_source": {
          "Date": "29-09-2017",
          "Metric_1": 12345675,
          "Metric_2": 12345679,
          "Server_Name": "Demo"
        },
        "fields": {
          "Metric_2": [
            12345679
          ],
          "Metric_1": [
            12345675
          ]
        }
      }
    ]
  }
}

Hi,

Thank you for providing that search response. I have tried to reproduce the issue and I could not.

It seems that you are using a 30 minute bucket_span. My suspicion is that your data contain more than 1 document in each 30 minute bucket. You are using a sum function in the detector, which means that the actual value for the bucket will be the sum of the values for all documents within each 30 minute bucket. That could explain why the value is higher.

You could find that out by doing another search where you use a date_histogram aggregation with a 30m interval and then sum each metric. The search request body will look like:

{
  "size":0,
  "aggs": {
    "buckets": {
      "date_histogram": {
        "field":"Date",
        "interval":"30m"
      },
      "aggs": {
        "sum_1": {"sum":{"field":"Metric_1"}},
        "sum_2": {"sum":{"field":"Metric_2"}}
      }
    }
  }
}

Then, you can find the buckets in question in the response and check the values. Also, note the doc_count, which should be > 1 if my theory is right.

You could also create another job where instead of sum you use max and see what happens.
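
For reference, in the job's JSON that change roughly corresponds to a detector like the following (just a sketch with one of your metric fields; the rest of the job configuration is omitted):

{
  "function": "max",
  "field_name": "Metric_1"
}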

Let me know of your findings.

Hi,

The data I have is daily data containing 153 values. I used '1d' as my bucket span with the field aggregation set to 'Sum' so that the exact value is captured. With '1d' as the bucket span, I believe all aggregation functions will behave in a similar way and produce the same result. I tried creating another job with the 'Max' aggregation, but the issue still occurs.

I have run the search as you suggested, but with an interval of '1d' as that is the bucket span I used to create my job. Please find the relevant buckets from the result below. The 'doc_count' field value is 1, not greater than 1 as we expected.

    {
      "key_as_string": "29-09-2017",
      "key": 1506643200000,
      "doc_count": 1,
      "sum_2": {
        "value": 12345679
      },
      "sum_1": {
        "value": 12345675
      }
    },
    {
      "key_as_string": "11-09-2017",
      "key": 1505088000000,
      "doc_count": 1,
      "sum_2": {
        "value": 123459
      },
      "sum_1": {
        "value": 123455
      }
    },

Kindly let me know if you need any other details.

Hi,

Thank you very much for that. OK, so it seems what I suspected is not the issue. One more thing that would be helpful is to get the actual result by calling the ML API.

Could you call (you might have to edit the URL and job_id to match yours):

http://localhost:9200/_xpack/ml/anomaly_detectors/sample_1d/results/buckets?pretty&human&expand=true

You can use additional filters to narrow down to the anomalous buckets, as described in the get buckets API documentation.
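
For instance, a request body along these lines against the same URL should narrow it down (the start/end values are the epoch milliseconds for 28 and 30 September UTC, bracketing the bucket in question; adjust them if your time zone differs):

{
  "start": "1506556800000",
  "end": "1506729600000"
}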

Then, could you please paste the bucket for the 29th of September?

Please find the result below:

{
  "job_id": "sample_1d",
  "timestamp": 1506643200000,
  "anomaly_score": 72.8512,
  "bucket_span": 86400,
  "initial_anomaly_score": 58.9217,
  "event_count": 1,
  "is_interim": false,
  "bucket_influencers": [
    {
      "job_id": "sample_1d",
      "result_type": "bucket_influencer",
      "influencer_field_name": "Server_Name",
      "initial_anomaly_score": 58.9217,
      "anomaly_score": 72.8512,
      "raw_anomaly_score": 20.4013,
      "probability": 6.11665e-23,
      "timestamp": 1506643200000,
      "bucket_span": 86400,
      "is_interim": false
    },
    {
      "job_id": "sample_1d",
      "result_type": "bucket_influencer",
      "influencer_field_name": "bucket_time",
      "initial_anomaly_score": 58.9217,
      "anomaly_score": 72.8512,
      "raw_anomaly_score": 20.4013,
      "probability": 6.11665e-23,
      "timestamp": 1506643200000,
      "bucket_span": 86400,
      "is_interim": false
    }
  ],
  "processing_time_ms": 2,
  "result_type": "bucket"
},

Also, below are the steps I followed to hit this issue. It might be useful for you to recreate the same issue on your side.

  1. I took daily data of 153 values. The data contains two metrics. For example, the 153 daily values have both metrics ranging between 1 and 50.
  2. I introduced anomaly values randomly in both metrics, making sure the anomaly values are both under and over 6 digits in length, e.g. 123456 and 12345678.
  3. I feed this data from Excel into the database and create the mappings.
  4. I create a multi metric job with the 'Sum' aggregation for both metrics and use '1d' as the 'Bucket span' in order to see the individual values without any aggregation.
  5. Run the job and go to the Single Metric Viewer tab. Select any one metric and observe the anomaly values. Compare the actual value in the result with the one given in Excel.

Thank you but could you run it one more time and ensure you append expand=true? That will include the anomaly records in the buckets.

Hi,

Sorry for the delayed response. Please find the records below:

{
  "job_id": "sample_1d",
  "timestamp_string": "2017-09-29T00:00:00.000Z",
  "timestamp": 1506643200000,
  "anomaly_score": 72.8512,
  "bucket_span": 86400,
  "initial_anomaly_score": 58.9217,
  "records": [
    {
      "job_id": "sample_1d",
      "result_type": "record",
      "probability": 9.64497e-22,
      "record_score": 72.8512,
      "initial_record_score": 58.9217,
      "bucket_span": 86400,
      "detector_index": 1,
      "is_interim": false,
      "timestamp_string": "2017-09-29T00:00:00.000Z",
      "timestamp": 1506643200000,
      "partition_field_name": "Server_Name",
      "partition_field_value": "Demo",
      "function": "sum",
      "function_description": "sum",
      "typical": [
        10.3234
      ],
      "actual": [
        12345700
      ],
      "field_name": "Metric_2",
      "influencers": [
        {
          "influencer_field_name": "Server_Name",
          "influencer_field_values": [
            "Demo"
          ]
        }
      ],
      "Server_Name": [
        "Demo"
      ]
    },
    {
      "job_id": "sample_1d",
      "result_type": "record",
      "probability": 2.20071e-18,
      "record_score": 72.8512,
      "initial_record_score": 58.9217,
      "bucket_span": 86400,
      "detector_index": 0,
      "is_interim": false,
      "timestamp_string": "2017-09-29T00:00:00.000Z",
      "timestamp": 1506643200000,
      "partition_field_name": "Server_Name",
      "partition_field_value": "Demo",
      "function": "sum",
      "function_description": "sum",
      "typical": [
        6.11926
      ],
      "actual": [
        12345700
      ],
      "field_name": "Metric_1",
      "influencers": [
        {
          "influencer_field_name": "Server_Name",
          "influencer_field_values": [
            "Demo"
          ]
        }
      ],
      "Server_Name": [
        "Demo"
      ]
    }
  ],
  "event_count": 1,
  "is_interim": false,
  "bucket_influencers": [
    {
      "job_id": "sample_1d",
      "result_type": "bucket_influencer",
      "influencer_field_name": "Server_Name",
      "initial_anomaly_score": 58.9217,
      "anomaly_score": 72.8512,
      "raw_anomaly_score": 20.4013,
      "probability": 6.11665e-23,
      "timestamp_string": "2017-09-29T00:00:00.000Z",
      "timestamp": 1506643200000,
      "bucket_span": 86400,
      "is_interim": false
    },
    {
      "job_id": "sample_1d",
      "result_type": "bucket_influencer",
      "influencer_field_name": "bucket_time",
      "initial_anomaly_score": 58.9217,
      "anomaly_score": 72.8512,
      "raw_anomaly_score": 20.4013,
      "probability": 6.11665e-23,
      "timestamp_string": "2017-09-29T00:00:00.000Z",
      "timestamp": 1506643200000,
      "bucket_span": 86400,
      "is_interim": false
    }
  ],
  "processing_time_ms": 2,
  "result_type": "bucket"
},

Hi,

Please find below the value observed for the same document in the Kibana 'Discover' tab:

[screenshot: Kibana Discover view of the document]

Hi,

Thank you for this. It is a strange case indeed.

I would like to ask you to do one final thing before we consider whether it would be possible for you to pass us the data so we can try to reproduce with the exact dataset you have.

Could you call the Datafeed Preview API and paste the response?

You should be able to find the datafeed_id if you look in the JSON tab of your job.
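
For reference, the preview call looks something like the following (the datafeed_id 'datafeed-sample_1d' here is just an assumption based on the default naming; use whatever id the JSON tab shows):

http://localhost:9200/_xpack/ml/datafeeds/datafeed-sample_1d/_preview?pretty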

Hi,

Please find below the Datafeed preview as asked:

  {
    "Date": 1506643200000,
    "Metric_1": 12345675,
    "Metric_2": 12345679,
    "Server_Name": "Demo"
  },
  {
    "Date": 1505088000000,
    "Metric_1": 123455,
    "Metric_2": 123459,
    "Server_Name": "Demo"
  }

I have the data as a CSV file. Could you let me know how to pass it on to you?

Thanks.

Hi,

Maybe the below screenshot would help you:

Thanks.

What OS are you running on? And which version of Elasticsearch?

Hi,

Elasticsearch version: 5.6.3
OS: CentOS Linux release 7.0.1406

Thanks

I have now managed to replicate the issue in 5.6. Digging further into it, the way results are serialized and written back to Elasticsearch rounds values to 6 significant figures. This is why you see this: 12345675 rounded to 6 significant figures becomes 12345700, while 123459 already has only 6 significant digits and so is unchanged.

This issue is planned to be fixed in version 6.1.

Thank you for reporting this and all the quick responses to help us figure out the issue.

Hi Dimitris,

Nice to hear that. Thank you for your timely support.

Thanks.