Create jobs with field combinations

Hey, I have two fields, field1 and field2, in my data. Right now I'm filtering the data for some combinations of field1 and field2 and creating jobs for those saved searches. What modifications or configuration would make my job automatically filter the data for every combination of field1 and field2 and create a model for each such combination? Is this possible with a multi-metric job (or any other way), or does it have to be implemented via a language client?

One possibility would be to dynamically create a script_field that is the concatenation of field1 and field2:

PUT _xpack/ml/anomaly_detectors/my_job
{
    "analysis_config": {
        "bucket_span": "1h",
        "detectors": [{
            "detector_description": "count per method_status",
            "function": "count",
            "partition_field_name": "method_status"
        }],
        "influencers": ["method", "status"]
    },
    "data_description": {
        "time_field": "@timestamp"
    }
}
PUT _xpack/ml/datafeeds/datafeed-my_job/
{
  "job_id": "my_job",
  "indices": [
    "gallery-*"
  ],
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "method_status": {
      "script": {
        "source": "doc['method'].value + '_' + doc['status'].value",
        "lang": "painless"
      },
      "ignore_failure": false
    }
  }
}
GET _xpack/ml/datafeeds/datafeed-my_job/_preview/

...
 {
    "@timestamp" : 1483244920000,
    "method" : "POST",
    "method_status" : "POST_200",
    "status" : "200"
  },
  {
    "@timestamp" : 1483244949000,
    "method" : "GET",
    "method_status" : "GET_200",
    "status" : "200"
  },
  {
    "@timestamp" : 1483245000000,
    "method" : "GET",
    "method_status" : "GET_200",
    "status" : "200"
  },
...

I tried to create a job with the following request:

PUT /_ml/anomaly_detectors/my_job
{
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [{
            "detector_description": "count per method_status",
            "function": "low_count"
        }],
        "influencers": ["SHIPPERID", "CARRIERID"]
    },
    "data_description": {
        "time_field": "EVENTTIME"
    }
}

But I got an error as follows:

{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "This job would cause a mapping clash with existing field [CARRIERID] - avoid the clash by assigning a dedicated results index"
            }
        ],
        "type": "status_exception",
        "reason": "This job would cause a mapping clash with existing field [CARRIERID] - avoid the clash by assigning a dedicated results index",
        "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Can't merge a non object mapping [CARRIERID] with an object mapping [CARRIERID]"
        }
    },
    "status": 400
}

Can you explain the cause of this error and how to resolve it?

The destination index for the results of your jobs is a shared index called .ml-anomalies-shared. There is apparently already a field in that index (from some other job you've run) named CARRIERID, and its mapping (assignment to a data type) differs from the mapping the CARRIERID field would get from your new job. A single index cannot have two fields with the same name but different mapping types.

To avoid this, add the following to make a dedicated new results index just for that job:

  "results_index_name": "mynewresultsindexname"

for example:

PUT _xpack/ml/anomaly_detectors/my_job
{
    "analysis_config": {
        "bucket_span": "1h",
        "detectors": [{
            "detector_description": "count per method_status",
            "function": "count",
            "partition_field_name": "method_status"
        }],
        "influencers": ["method", "status"]
    },
    "data_description": {
        "time_field": "@timestamp"
    },
    "results_index_name": "mynewresultsindexname"
}

PUT _ml/datafeeds/datafeed-my_job/
{
  "job_id": "my_job",
  "indices": [
    "ab-*"
  ],
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "method_status": {
      "script": {
        "source": "doc['SHIPPERID'].value + '_' + doc['CARRIERID'].value",
        "lang": "painless"
      },
      "ignore_failure": false
    }
  }
}

In the above request, does Elasticsearch filter the documents for every combination of the given fields?
For example, if SHIPPERID="abcd" and CARRIERID="efgh", does it automatically filter the documents with those field values?
Does it create a separate model for every combination of the given fields? I need a separate model to be created for every collection of documents filtered by a combination of the given fields.

As of right now, I'm directly adding some combinations of the given fields as filters and creating individual single-metric jobs for each of the saved searches.

The datafeed you posted does not filter - its script_field simply creates a new field that is the concatenation of two other fields in each document:

method: GET
status:200
method_status: GET_200

It is the ML job configuration, specifically the:

            "partition_field_name": "method_status"

that creates an independent baseline analysis for every instance of method_status - that is, for every observed combination of those two fields.
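
Applied to your fields, the job config would look roughly like the sketch below - assuming your scripted field is named SHIPPERID_CARRIERID (the partition_field_name must match that name exactly) and reusing the dedicated results index from the earlier reply:

PUT _ml/anomaly_detectors/my_job
{
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [{
            "detector_description": "low_count per SHIPPERID_CARRIERID",
            "function": "low_count",
            "partition_field_name": "SHIPPERID_CARRIERID"
        }],
        "influencers": ["SHIPPERID", "CARRIERID"]
    },
    "data_description": {
        "time_field": "EVENTTIME"
    },
    "results_index_name": "mynewresultsindexname"
}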

PUT _ml/datafeeds/datafeed-my_job/
{
  "job_id": "my_job",
  "indices": [
    "ab-*"
  ],
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "method_status": {
      "script": {
        "source": "doc['SHIPPERID'].value + '_' + doc['CARRIERID'].value",
        "lang": "painless"
      },
      "ignore_failure": false
    }
  }
}

I get the following error:

"error": {
    "root_cause": [
        {
            "type": "status_exception",
            "reason": "[datafeed-my_job] cannot retrieve field [SHIPPERID_CARRIERID] because it has no mappings"
        }
    ],

Is the error produced because the index does not have the field SHIPPERID_CARRIERID, or is there some other reason?

It is because you called your scripted field method_status, not SHIPPERID_CARRIERID (the name that your job's configuration expects).

This is where you did that:

"script_fields": {
"method_status": {
"script": {

You can see what your datafeed returns by:

GET _xpack/ml/datafeeds/datafeed-my_job/_preview/


Can you suggest a blog that explains how to create jobs like these via requests on the console?

There is no specific blog on this, but the online API docs show everything.


Hey, as you suggested, I created the job and started the datafeed for it. However, I couldn't view my job under the Single Metric Viewer. Is it because open jobs can't be viewed in the Single Metric Viewer?

No, jobs can be viewed in Single Metric Viewer as long as they have results.

Check the Job Management page for your newly created job and look at its status there. I'm guessing that if you tried to start the job from the API, you may have started the datafeed but neglected to "open" the job first.

By the way, even if you set the config of the job/datafeed with the API, you can still use the Job Management UI to start/stop the job.

No, I first opened the job with the request:
POST _ml/anomaly_detectors/my_job/_open

and started Datafeeds with request:
POST _ml/datafeeds/datafeed-my_job/_start

This is how my job appears in the Job Management page, and as you can see, the Single Metric Viewer has been disabled.

Ah yes - I forgot. This is because the datafeed creates a script_field, which makes the Single Metric Viewer unable to reconstruct the query needed to paint the time series.

We'll support that in v7.2: https://github.com/elastic/kibana/pull/34079


Hey, I tried to fetch the anomaly results for a specific value of the field SHIPPERID_CARRIERID, but I'm still getting all the results. What needs to be corrected in the following query?

GET .ml-anomalies-.write-my_job_low_sum/_search
{
    "size": 10000,
    "query": {
            "bool": {
              "should": [
                {
                  "match": {
                    "SHIPPERID_CARRIERID": "abcd"
                  }
                }
              ], 
              "filter": [
                  { "term" :  { "result_type" : "record"}},
                  { "range" : { "record_score" : { "gte": "75" } } },
                  { "range" : { "multi_bucket_impact" : { "lt": "-4" } } }
                  ]
            }
    }
}

How do I get only the results from the job that satisfy "SHIPPERID_CARRIERID": "abcd"?

In the results index, the partition value is stored in the field partition_field_value, so filter on that instead:

GET .ml-anomalies-my_job_low_sum/_search
{
    "size": 10000,
    "query": {
            "bool": {
              "filter": [
                  { "term" :  { "result_type" : "record"}},
                  { "term" :  { "partition_field_value" : "abcd"}},
                  { "range" : { "record_score" : { "gte": "75" } } },
                  { "range" : { "multi_bucket_impact" : { "lt": "-4" } } }
                  ]
            }
    }
}

Is it possible to create a scripted job for particular values of another field alongside the SHIPPERID_CARRIERID combination? For example, I have a field field3 with the value "xyz", and I want to create a job with the field combinations of SHIPPERID and CARRIERID, but only for field3="xyz".

Not sure I fully understand. You want to continue to use your scripted field of SHIPPERID_CARRIERID but only analyze this for values of field3="xyz"?

If so, then in your datafeed you'd need to replace the match_all part with a query that limits it to only that field value, i.e. something like:

    "bool": {
        "filter": [
            { "term": { "field3": "xyz" } }
        ]
    }
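
Put together, the datafeed would then look roughly like this sketch (reusing the index pattern and scripted field names from earlier in the thread):

PUT _ml/datafeeds/datafeed-my_job/
{
  "job_id": "my_job",
  "indices": [
    "ab-*"
  ],
  "query": {
    "bool": {
      "filter": [
        { "term": { "field3": "xyz" } }
      ]
    }
  },
  "script_fields": {
    "SHIPPERID_CARRIERID": {
      "script": {
        "source": "doc['SHIPPERID'].value + '_' + doc['CARRIERID'].value",
        "lang": "painless"
      },
      "ignore_failure": false
    }
  }
}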

Hey, I have a question regarding anomalies. Will the anomaly scores reduce or change as more and more data is input to the job? Right now, I'm getting over 2000 anomalies for the scripted job I ran on data with different combinations of the two fields. Will the number of anomalies detected change over time?

In general, yes - the more data there is, the more mature the modeling of that data becomes, and the more accurate the anomaly detection results get.

Plus, keep in mind that not all anomalies are created equal - use the scoring ranges to rank the anomalies by severity and narrow the list down to the number of anomalies you want to deal with.
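
For example, the results query from earlier in the thread can be narrowed to just the higher-severity records and sorted by score - a sketch; adjust the index name and the threshold to your own job:

GET .ml-anomalies-my_job_low_sum/_search
{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "term": { "result_type": "record" } },
        { "range": { "record_score": { "gte": 75 } } }
      ]
    }
  },
  "sort": [
    { "record_score": "desc" }
  ]
}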