Two sub-aggregations in a datafeed

machine-learning

#1

Hi, I am trying to compute a ratio in my datafeed. First I created a date_histogram with a max aggregation (because I got an error saying it is required), and then I do a sub-aggregation on attrs.src_ca_name and want to compute the ratio of successful calls. Everything looks fine and there are no errors, but the datafeed preview is empty. Can ML parse two sub-aggregations?

..."aggregations": {
"buckets": {
  "date_histogram": {
    "field": "@timestamp",
    "interval": "15m",
    "time_zone": "UTC"
  },
  "aggregations": {
      "@timestamp": {
        "max": {
          "field": "@timestamp"
        }
      },
    "by_src": {
      "terms": {
        "field": "attrs.src_ca_name",
        "size": 20,
        "order": {
          "_count": "desc"
        }
      },
      "aggregations": {
       "justattempts": { "filter": { "term": { "type": "call-attempt" } } },
       "ratio" : {
         "bucket_script" : {
           "buckets_path": {
              "atmptcnt": "justattempts>_count",
              "totalcnt": "_count"
           },
           "script" : "params.atmptcnt * 100 / params.totalcnt"
         }....

(rich collier) #2

Yes, the datafeed can use sub-aggregations. See this blog for insight: https://www.elastic.co/blog/custom-elasticsearch-aggregations-for-machine-learning-jobs
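Roughly, ML expects the nesting below (a minimal sketch with placeholder field names, not taken from your config): a single top-level date_histogram, a max on the time field inside it, and any further splits and metrics nested below that:

  "aggregations": {
    "buckets": {
      "date_histogram": { "field": "@timestamp", "interval": "15m" },
      "aggregations": {
        "@timestamp": { "max": { "field": "@timestamp" } },
        "my_split": {
          "terms": { "field": "some_keyword_field" },
          "aggregations": {
            "my_metric": { "avg": { "field": "some_numeric_field" } }
          }
        }
      }
    }
  }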


#3

But when I run the search directly, it returns the correct data:

{
  "took": 448,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1331042,
    "max_score": 1,
    "hits": [
      {
        "_index": "logstash-2018.06.19",
        "_type": "sbc_event",
        "_id": "AWQXBrJ3oA32JM6LbXea",
        "_score": 1,
        "_source": {
          "attrs": {
            "dst_ca_name": "AAAA"
          }
        }
      },
..........
  "aggregations": {
    "buckets": {
      "buckets": [
        {
          "key_as_string": "2018-06-19T00:00:00.000Z",
          "key": 1529366400000,
          "doc_count": 274,
          "@timestamp": {
            "value": 1529366699000,
            "value_as_string": "2018-06-19T00:04:59.000Z"
          },
          "by_src": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "BBBBB",
                "doc_count": 224,
                "justattempts": {
                  "doc_count": 217
                },
                "ratio": {
                  "value": 96.875
                }
              },

But when I run this in the ML job, it returns nothing.


#4

Here is my whole ML datafeed definition, datafeed-ratio:

PUT _xpack/ml/datafeeds/datafeed-ratio/
{
  "job_id": "ratio_ca",
  "indices": [
    "logstash-2018.06.19"
  ],
  "types": [
    "doc"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "terms": { "type": ["call-attempt", "call-end"] }
        }
      ],
      "must_not": []
    }
  },
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "15m",
        "time_zone": "UTC"
      },
      "aggregations": {
        "@timestamp": {
          "max": {
            "field": "@timestamp"
          }
        },
        "by_src": {
          "terms": {
            "field": "attrs.src_ca_name",
            "size": 20,
            "order": {
              "_count": "desc"
            }
          },
          "aggregations": {
            "justattempts": { "filter": { "term": { "type": "call-attempt" } } },
            "ratio": {
              "bucket_script": {
                "buckets_path": {
                  "atmptcnt": "justattempts>_count",
                  "totalcnt": "_count"
                },
                "script": "params.atmptcnt * 100 / params.totalcnt"
              }
            }
          }
        }
      }
    }
  }
}

(rich collier) #5

Can you paste the output from the following command in DevTools Console?

GET _xpack/ml/datafeeds/datafeed-ratio/_preview

#6

It's empty:

[]

(rich collier) #7

And please tell us which version of the Elastic Stack you are using...


#8

Version: 6.1.1


(rich collier) #9

OK - let me look more closely at your config and see if I can replicate the problem.


(rich collier) #10

Please show me the config of the ML job itself:

GET _xpack/ml/anomaly_detectors/ratio_ca?pretty


#11
{
  "count": 1,
  "jobs": [
    {
      "job_id": "ratio_ca",
      "job_type": "anomaly_detector",
      "job_version": "6.1.1",
      "description": "Ratio ca",
      "create_time": 1530537786467,
      "analysis_config": {
        "bucket_span": "15m",
        "summary_count_field_name": "doc_count",
        "detectors": [
          {
            "detector_description": "sum(ratio_ca)",
            "function": "sum",
            "field_name": "ratio_ca2",
            "detector_rules": [],
            "detector_index": 0
          }
        ],
        "influencers": []
      },
      "analysis_limits": {
        "model_memory_limit": "1024mb"
      },
      "data_description": {
        "time_field": "@timestamp",
        "time_format": "epoch_ms"
      },
      "model_plot_config": {
        "enabled": true
      },
      "model_snapshot_retention_days": 1,
      "results_index_name": "shared"
    }
  ]
}

(rich collier) #12

Hmm... your detector references the field_name ratio_ca2, but in your datafeed definition the calculated field is just called ratio.

They need to be the same.
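For example (a sketch aligned with the config posted above), the key that names the bucket_script aggregation in the datafeed must match the detector's field_name in the job:

Datafeed aggregation:

  "ratio": {
    "bucket_script": { ... }
  }

Job detector:

  {
    "function": "sum",
    "field_name": "ratio"
  }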


#13

Thanks, but it doesn't help:

{
  "count": 1,
  "jobs": [
    {
      "job_id": "ratio_ca",
      "job_type": "anomaly_detector",
      "job_version": "6.1.1",
      "description": "Ratio ca",
      "create_time": 1531144344777,
      "analysis_config": {
        "bucket_span": "15m",
        "summary_count_field_name": "doc_count",
        "detectors": [
          {
            "detector_description": "sum(ratio_ca)",
            "function": "sum",
            "field_name": "ratio",
            "detector_rules": [],
            "detector_index": 0
          }
        ],
        "influencers": []
      },
      "analysis_limits": {
        "model_memory_limit": "1024mb"
      },
      "data_description": {
        "time_field": "@timestamp",
        "time_format": "epoch_ms"
      },
      "model_plot_config": {
        "enabled": true
      },
      "model_snapshot_retention_days": 1,
      "results_index_name": "shared"
    }
  ]
}

GET _xpack/ml/datafeeds/datafeed-ratio/_preview is still empty.


(rich collier) #14

One other thing to notice: you do a terms aggregation in the datafeed, which implies you want separate analyses per by_src value, but your detector makes no reference to this split. You might want to make your job config something like:

  "analysis_config": {
    "bucket_span": "15m",
    "summary_count_field_name": "doc_count",
    "detectors": [
      {
        "detector_description": "sum(ratio)",
        "function": "sum",
        "field_name": "ratio",
        "partition_field_name" : "by_src",
        "detector_rules": [],
        "detector_index": 0
      }
    ],
    "influencers": [ "by_src"]
  },

#15

Thank you, I have set it, but it's still not working. :frowning:


(rich collier) #16

Sorry this is giving you trouble, but I cannot immediately see what your issue is and I cannot reproduce the problem given a similar situation - my setup works fine.

This hints at some subtle syntax error that's hard to spot.

May I suggest debugging by starting simple and progressing toward your desired end state. For example, define the ML job, then define the ML datafeed without the bucket_script aggregation - just the date_histogram, the max on @timestamp, and the terms aggregation.
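Concretely, that stripped-down aggregations block (a sketch of your posted config minus justattempts and ratio) would be roughly:

  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "15m",
        "time_zone": "UTC"
      },
      "aggregations": {
        "@timestamp": { "max": { "field": "@timestamp" } },
        "by_src": {
          "terms": { "field": "attrs.src_ca_name", "size": 20 }
        }
      }
    }
  }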

Then run the datafeed _preview to see what you get. You should just get a bucketized count for each by_src value, similar to:

[
  {
    "@timestamp": 1486426496000,
    "by_src": "AAA",
    "doc_count": 15
  },
  {
    "@timestamp": 1486426496000,
    "by_src": "BBB",
    "doc_count": 11
  },

If you can get that, then move back to adding the bucket_script aggregation.


#17

I have found the problem. It was this field:

  "types": [
   "doc"
  ]

Should this be the name of an aggregation? If I run the datafeed without this field, it works.


(rich collier) #18

It is an index property that is a hold-over from Elasticsearch versions before 6.x and will be removed fully in v7.x:

https://www.elastic.co/guide/en/elasticsearch/reference/6.2/ml-put-datafeed.html

types
(array) A list of types to search for within the specified indices. For example: []. This property is provided for backwards compatibility with releases earlier than 6.0.0. For more information, see Removal of mapping types.
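That also explains the empty preview: the search output earlier in the thread shows your documents have "_type": "sbc_event", so a datafeed restricted to "types": ["doc"] matched no documents at all. The simplest fix, as you found, is to omit the array. A sketch of the corrected start of the definition (query and aggregations unchanged from your posted config):

PUT _xpack/ml/datafeeds/datafeed-ratio/
{
  "job_id": "ratio_ca",
  "indices": [
    "logstash-2018.06.19"
  ],
  "query": { ... },
  "aggregations": { ... }
}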


#19

Thank you very much for your help!


(Mark Walkom) #20