ML Job on Scripted field

machine-learning

(Sarvendra Singh) #1

Hi Team,
Need your advice on creating a machine learning job.
I am trying to create a single metric job to detect anomalies in the mean of a scripted field (built from normal index fields using Lucene scripting), but we are getting the error below while creating the job.

Could not start datafeed for sadads [status_exception] datafeed [datafeed-sadads] cannot retrieve field [Scripted field] because it has no mappings
Could not start datafeed: [status_exception] datafeed [datafeed-sadads] cannot retrieve field [Scripted field] because it has no mappings

Please confirm: can ML jobs only be created with normal (mapped) fields, and not with scripted fields?

Thanks in advance,
Sarvendra


(David Kyle) #2

Hi,

If you use scripted fields in the ML datafeed query, those fields can be used by the ML job. There is an example of using scripted fields in the ML documentation.


(Sarvendra Singh) #3

Hi David,

Thanks for the reply. I am creating the job from the ML UI, by selecting an aggregation from a drop-down and selecting the scripted field from a drop-down. If the scripted field appears in the drop-down, does that mean it is already in the datafeed?

How can I use this scripted field when creating a job via the UI?

Sorry for asking so many questions; I am new to Elastic.
Thanks in advance


(David Kyle) #4

Hi Sarvendra,

I'm a little confused, sorry. Can you confirm the steps you took, please? In ML you clicked 'Create new job', selected an index pattern, then clicked 'Single metric', and you were presented with two drop-down boxes: one for Aggregation and the other for Field. The Field drop-down contains a field called scripted field, so you literally have a field called scripted field in your index?

All fields that appear in the 'Field' drop-down come from the index pattern, so they must have a mapping; the error you see about a missing mapping is therefore confusing.

How was your scripted field created? Do you have a small sample of the data and index mapping you could share so I can try to reproduce the error?
Thanks


(Sarvendra Singh) #5

Hi David,

Thanks for the reply, and sorry for the delay in replying. I have now created an advanced job via the Edit JSON method; below is the JSON for the scripted field.
Here the scripted field name is Scripted_field.
var1, var2 and var3 are normal (non-scripted) fields used in the formula that calculates the scripted field.

But I am still getting the error message below when running the job after selecting a date range. The datafeed, JSON and job messages are below. Please suggest the root cause of the error.

Also, one unrelated question: is it true that single/multi metric jobs cannot be created using a scripted field?

Error: Job messages

demo-es-1	Datafeed lookback retrieved no data

2018-01-29 15:33:56 demo-es-1 Datafeed stopped
2018-01-29 15:33:56 demo-es-1 Job is closing
2018-01-29 15:38:06 demo-es-1 Loading model snapshot [N/A], job latest_record_timestamp [N/A]
2018-01-29 15:38:06 demo-es-1 Opening job on node [{demo-es-1}{EkYGvk4xSrGXw4SlJHhIeg}{QEvMEcUuRf6YwdVwQiyYzg}{172.19.1.69}{172.19.1.69:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-01-29 15:38:07 demo-es-1 Starting datafeed [datafeed-test2] on node [{demo-es-1}{EkYGvk4xSrGXw4SlJHhIeg}{QEvMEcUuRf6YwdVwQiyYzg}{172.19.1.69}{172.19.1.69:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-01-29 15:38:07 demo-es-1 Datafeed started (from: 2017-11-30T18:30:00.000Z to: real-time)
2018-01-29 15:38:09 demo-es-1 Datafeed stopped
2018-01-29 15:38:21 demo-es-1 Starting datafeed [datafeed-test2] on node [{demo-es-1}{EkYGvk4xSrGXw4SlJHhIeg}{QEvMEcUuRf6YwdVwQiyYzg}{172.19.1.69}{172.19.1.69:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-01-29 15:38:21 demo-es-1 Datafeed started (from: 2017-11-05T18:30:00.000Z to: real-time)
2018-01-29 15:43:17 demo-es-1 Datafeed stopped
2018-01-29 15:43:31 demo-es-1 Starting datafeed [datafeed-test2] on node [{demo-es-1}{EkYGvk4xSrGXw4SlJHhIeg}{QEvMEcUuRf6YwdVwQiyYzg}{172.19.1.69}{172.19.1.69:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-01-29 15:43:31 demo-es-1 Datafeed started (from: 2017-10-31T18:30:00.000Z to: real-time)

JSON

{
  "job_id": "test2",
  "job_type": "anomaly_detector",
  "job_version": "5.5.2",
  "description": "test2",
  "create_time": 1517219553362,
  "finished_time": 1517220236980,
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "high_mean(Scripted_field)",
        "function": "high_mean",
        "field_name": "Scripted_field",
        "detector_rules": [],
        "detector_index": 0
      }
    ],
    "influencers": []
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  },
  "model_snapshot_retention_days": 1,
  "model_snapshot_id": "1517220750",
  "results_index_name": "custom-test2",
  "state": "opened",
  "data_counts": {
    "job_id": "test2",
    "processed_record_count": 0,
    "processed_field_count": 0,
    "input_bytes": 0,
    "input_field_count": 0,
    "invalid_date_count": 0,
    "missing_field_count": 0,
    "out_of_order_timestamp_count": 0,
    "empty_bucket_count": 0,
    "sparse_bucket_count": 0,
    "bucket_count": 0,
    "input_record_count": 0
  },
  "model_size_stats": {
    "job_id": "test2",
    "result_type": "model_size_stats",
    "model_bytes": 0,
    "total_by_field_count": 0,
    "total_over_field_count": 0,
    "total_partition_field_count": 0,
    "bucket_allocation_failures_count": 0,
    "memory_status": "ok",
    "log_time": 1517220236000,
    "timestamp": -300000
  },
  "open_time": "493s",
  "datafeed_config": {
    "datafeed_id": "datafeed-test2",
    "job_id": "test2",
    "query_delay": "60s",
    "frequency": "150s",
    "indices": [
      "master_neo"
    ],
    "types": [
      " "
    ],
    "query": {
      "match_all": {
        "boost": 1
      }
    },
    "script_fields": {
      "Scripted_field": {
        "script": {
          "inline": "(doc['var1'].sum().5)(-doc['var2'].avg()0.1)(doc['var3'].avg()*10)*0.0001",
          "lang": "expression"
        },
        "ignore_failure": false
      }
    },
    "scroll_size": 1000,
    "chunking_config": {
      "mode": "auto"
    },
    "state": "started"
  }
}

Datafeed

datafeed_id: datafeed-test2
job_id: test2
query_delay: 60s
frequency: 150s
indices: master_neo
types:
query: {"match_all":{"boost":1}}
script_fields: {"Scripted_field":{"script":{"inline":"(doc['var1'].sum().5)(-doc['var2'].avg()0.1)(doc['var3'].avg()*10)*0.0001","lang":"expression"},"ignore_failure":false}}
scroll_size: 1000
chunking_config: {"mode":"auto"}
state: started


(rich collier) #6

To ensure your datafeed is working properly, and the scripted field is coming through (being calculated properly), use the _preview feature of the datafeed:

GET _xpack/ml/datafeeds/datafeed-test2/_preview

Run the above and let us know what that looks like...


(Sarvendra Singh) #7

Hi Rich,
Thanks for the reply.

I executed this preview command in the Kibana Dev Tools console and received only the below. What does this suggest?

Result: []


(rich collier) #8

It means that your scripted field isn't being created properly by your query/datafeed configuration. As such, no information is getting passed to the ML job and the errors that you're getting from ML are justified.

You'll need to debug the way you're creating the scripted field. May I suggest that you just work on getting the following to work in console:

GET master_neo/_search
{
      "query": {
        "match_all": {
          "boost": 1
        }
      },
      "script_fields": {
        "Scripted_field": {
          "script": {
            "source": "(doc['var1'].sum().5)(-doc['var2'].avg()0.1)(doc['var3'].avg()*10)*0.0001",
            "lang": "painless"
          },
          "ignore_failure": false
        }
      }
}

(notice that source is now used instead of inline)

Once you've debugged this, take the changes you had to make and apply those to the datafeed config. Then, again, test the datafeed config with the _preview option and ensure everything's working before starting the ML job.


(Sarvendra Singh) #9

Thanks.
A few queries: in GET master_neo/_search, master_neo should be the name of the index, right?
And Scripted_field should be replaced by the scripted field name used in the Edit JSON tab of the job.

With these two assumptions I executed this in the console and got the error below.

Error

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "[script] unknown field [source], parser not found"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "[script] unknown field [source], parser not found"
  },
  "status": 400
}

One more query: what is meant by the line "after debugging, apply those to the datafeed config"?

I have made changes in the Edit JSON tab of the job, as mentioned in the comments above, so does "datafeed config" mean the same thing here?

Sorry for asking so much, as I am new to Elastic.

Thanks
Sarvendra


(rich collier) #10

Yes, that's what I assume the name of your index is, based upon other info you've posted.

OK, it is possible that you're using an older version of Elasticsearch, in which case you'll need to revert to using inline instead of source. What version of Elasticsearch are you using?

If you are new to Elastic, how did you come upon this very complicated use case of using a scripted field - this is well beyond the basics! :slight_smile:

And yes: whatever fixes you make while debugging the plain Elasticsearch query to get your scripted field correct, you'll need to apply those same edits to your datafeed config. If you are using the ML UI, that means manually editing the JSON in the Edit JSON tab.


(Sarvendra Singh) #11

Actually, I am from a machine learning background with exposure to R/SAS. I have now been given an assignment to do R&D on Elastic ML X-Pack jobs, and one of our indices has a scripted field for which they want an anomaly detection job :slight_smile:

The Elasticsearch version we are using is 5.5.2. The query below is working in Kibana Dev Tools and giving correct results. The only change we made is "lang": "expression" instead of the "painless" you suggested above; with painless it gives an error.

I then used the same script in the Edit JSON tab of the job in the UI. So what could be the issue behind
Datafeed lookback retrieved no data

GET index_name/_search
{
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "Scripted_field": {
      "script": {
        "inline": "(doc['var1'].sum()0.5)(-doc['var2'].avg() * 0.1)*(doc['var3'].avg() * 10) * 0.0001",
        "lang": "expression"
      },
      "ignore_failure": false
    }
  },
  "size": 200
}

JSON

{
  "job_id": "test2",
  "job_type": "anomaly_detector",
  "job_version": "5.5.2",
  "description": "test2",
  "create_time": 1517219553362,
  "finished_time": 1517220236980,
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "high_mean(Scripted_field)",
        "function": "high_mean",
        "field_name": "Scripted_field",
        "detector_rules": [],
        "detector_index": 0
      }
    ],
    "influencers": []
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  },
  "model_snapshot_retention_days": 1,
  "model_snapshot_id": "1517220750",
  "results_index_name": "custom-test2",
  "state": "opened",
  "data_counts": {
    "job_id": "test2",
    "processed_record_count": 0,
    "processed_field_count": 0,
    "input_bytes": 0,
    "input_field_count": 0,
    "invalid_date_count": 0,
    "missing_field_count": 0,
    "out_of_order_timestamp_count": 0,
    "empty_bucket_count": 0,
    "sparse_bucket_count": 0,
    "bucket_count": 0,
    "input_record_count": 0
  },
  "model_size_stats": {
    "job_id": "test2",
    "result_type": "model_size_stats",
    "model_bytes": 0,
    "total_by_field_count": 0,
    "total_over_field_count": 0,
    "total_partition_field_count": 0,
    "bucket_allocation_failures_count": 0,
    "memory_status": "ok",
    "log_time": 1517220236000,
    "timestamp": -300000
  },
  "open_time": "493s",
  "datafeed_config": {
    "datafeed_id": "datafeed-test2",
    "job_id": "test2",
    "query_delay": "60s",
    "frequency": "150s",
    "indices": [
      "master_neo"
    ],
    "types": [
      " "
    ],
    "query": {
      "match_all": {
        "boost": 1
      }
    },
    "script_fields": {
      "Scripted_field": {
        "script": {
          "inline": "(doc['var1'].sum().5)(-doc['var2'].avg()0.1)(doc['var3'].avg()*10)*0.0001",
          "lang": "expression"
        },
        "ignore_failure": false
      }
    },
    "scroll_size": 1000,
    "chunking_config": {
      "mode": "auto"
    },
    "state": "started"
  }
}


(rich collier) #12

Can you please post the output for the following?

GET index_name/_search
{
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "Scripted_field": {
      "script": {
        "inline": "(doc['var1'].sum()0.5)(-doc['var2'].avg() * 0.1)*(doc['var3'].avg() * 10) * 0.0001",
        "lang": "expression"
      },
      "ignore_failure": false
    }
  },
  "size": 200
}

I'm suspecting you have a syntax error in there, but I would appreciate being proved wrong! :smiley:


(Sarvendra Singh) #13

Very sorry for the late reply; I was out of town for an emergency.

In the line below from your last post, the multiplication operator (*) is missing:

"inline": "(doc['var1'].sum()0.5)(-doc['var2'].avg() * 0.1)*(doc['var3'].avg() * 10) * 0.0001",

As mentioned in my last post, the query below worked and executed without error.

GET master_neo/_search
{
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "country_stab_score": {
      "script": {
        "inline": "(doc['numarticles'].sum()0.5)(-doc['avgtone'].avg() * 0.1)*(doc['goldsteinscale'].avg() * 10) * 0.0001",
        "lang": "expression"
      },
      "ignore_failure": false
    }
  },
  "size": 200
}

Please suggest how to get a successful job run with this scripted field.

Thanks
Sarvendra


(rich collier) #14

Hello Sarvendra,

Yes, I had suspected there was a syntax error, which is why I asked you to run the query and show me the output. The example query I posted was a copy/paste of the script that you posted - in other words, I didn't introduce the syntax error. Again, I wanted to see the OUTPUT of the query to see what the calculated value of the scripted field looks like - because I had suspected the syntax error.

Anyway, once you've verified that the syntax of your scripted field is correct, you can start to build your ML job. In your case, because of the scripted field definition in the ML job's datafeed, it might be easiest to do this via the Dev Tools Console:

First, define the job:

PUT _xpack/ml/anomaly_detectors/my_job
{
      "analysis_config": {
        "bucket_span": "5m",
        "detectors": [
            {
            "detector_description": "high_mean(Scripted_field)",
            "function": "high_mean",
            "field_name": "Scripted_field"
          }
        ],
        "influencers": [ ]
      },
      "data_description": {
        "time_field": "@timestamp"
      }
}

Then, define the datafeed:

PUT _xpack/ml/datafeeds/datafeed-my_job/
{
  "job_id": "my_job",
  "indices": [
    "master_neo"
  ],
      "query": {
        "match_all": {
          "boost": 1
        }
      },
      "script_fields": {
        "Scripted_field": {
          "script": {
            "inline": "(doc['numarticles'].sum()0.5)(-doc['avgtone'].avg() * 0.1)*(doc['goldsteinscale'].avg() * 10) * 0.0001",
            "lang": "expression"
          },
          "ignore_failure": false
        }
      }
}

(Again, I suspect you're still missing a * between sum() and 0.5 but you need to validate this)
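As an aside, the effect of that missing operator can be illustrated with an analogous arithmetic expression in Python (an analogy only; the datafeed script itself is parsed by Elasticsearch's expression language, not Python):

```python
# A number immediately following a closing parenthesis, with no operator
# in between, is a parse error in most expression grammars. Python shows
# the same behaviour, so it makes a convenient stand-in here.
bad = "(2.0)0.5"    # analogous to doc['var1'].sum()0.5  (operator missing)
good = "(2.0)*0.5"  # analogous to doc['var1'].sum()*0.5 (explicit multiply)

try:
    eval(bad)
    bad_parses = True
except SyntaxError:
    bad_parses = False

print(bad_parses)   # False: the operator-less form does not even parse
print(eval(good))   # 1.0
```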

Then finally, preview the datafeed to make sure the that everything looks right.

GET _xpack/ml/datafeeds/datafeed-my_job/_preview/

It SHOULD look something like:

[
  {
    "@timestamp": 1486425600000,
    "Scripted_field": 264.4092102050781
  },
  {
    "@timestamp": 1486425600000,
    "Scripted_field": 1980.9256591796875
  },
...

(I'm just pretending I know the possible values of Scripted_field above)
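As a side note, the @timestamp values in a preview like this are epoch milliseconds (matching the job's "time_format": "epoch_ms" setting), so they can be sanity-checked with a quick conversion; a minimal sketch in Python:

```python
from datetime import datetime, timezone

# Datafeed previews report @timestamp in epoch milliseconds, so divide
# by 1000 before converting to a human-readable UTC time.
ts_ms = 1486425600000  # taken from the example preview output above
ts = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
print(ts.isoformat())  # 2017-02-07T00:00:00+00:00
```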

Good luck!


(Sarvendra Singh) #15

Thanks for the detailed reply. Sorry, I missed the * in my post :slight_smile:

Defining the job and defining the datafeed both completed with no errors.

However, this query gives only the result below:
GET _xpack/ml/datafeeds/datafeed-my_job/_preview/

[]

When checked in the UI, my_job is created with zero processed records, and this time there is no error in the Job messages tab either, but the job status is closed and the datafeed state is stopped.

Please suggest why zero records were processed. Also, when creating the job by query in Dev Tools, I did not get an option to select a date range for job processing.

Thanks


(rich collier) #16

Hi,

I think I've just about hit the limit of how I can help you. I've asked you a few times for the output of the query in the console, and you haven't proven that your scripted field syntax works outside of the ML job; several times you've posted things with syntax errors. This makes it hard to help you.

I've shown you front to back how to do this - I think at this point you just need to simplify your scripted field and prove to yourself that you can get it to work. Just define your scripted field as something simple like:

      "script_fields": {
        "Scripted_field": {
          "script": {
            "inline": "doc['numarticles'].value * 2",
            "lang": "expression"
          },
          "ignore_failure": false
        }
      }

If you are using v6.1 or v6.2, I believe the syntax has changed a little bit:

      "script_fields": {
        "Scripted_field": {
          "script": {
            "source": "doc['numarticles'].value *2",
            "lang": "painless"
          },
          "ignore_failure": false
        }
      }

And if you can get that to work, then move up to your more complex scripted field definition.

By the way, the datafeed _preview will return results regardless of timeframe - basically it just does a top N results.

Good luck!


(Sarvendra Singh) #17

Very sorry for not providing the query result; please accept my apology.
Please find the correct query and result below.

I tried the same steps with a simple formula for the scripted field calculation, but I still get only the same reply from the preview command below, and the job closes with zero data records processed.

GET _xpack/ml/datafeeds/datafeed-my_job1/_preview/

[]

Query

GET index_name/_search
{
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "Scripted_field": {
      "script": {
        "inline": "(doc['var1'].sum()0.5)(-doc['var2'].avg() * 0.1)*(doc['var3'].avg() * 10) * 0.0001",
        "lang": "expression"
      },
      "ignore_failure": false
    }
  },
  "size": 200
}

Query result:

"took": 225,
"timed_out": false,
"_shards": {
  "total": 5,
  "successful": 5,
  "failed": 0
},
"hits": {
  "total": 43931077,
  "max_score": 1,
  "hits": [
    {
      "_index": "index_name",
      "_type": "xxxx",
      "_id": "504661097",
      "_score": 1,
      "fields": {
        "Scripted_field": [
          -0.005797101449275361
        ]
      }
    },
    {
      "_index": "index_name",
      "_type": "xxxx",
      "_id": "504661086",
      "_score": 1,
      "fields": {
        "Scripted_field": [
          -0.004616895874263265
        ]
      }
    },
    {
      "_index": "index_name",
      "_type": "xxxx",
      "_id": "504661068",
      "_score": 1,
      "fields": {
        "Scripted_field": [
          -0.0019340974212034504
        ]
      }
    }


(rich collier) #18

Well, I must say, I have no idea how you got your query to work. I can only get your expression to work if I correct your syntax errors (adding in the two missing * characters) and make it the following:

      "script_fields": {
        "Scripted_field": {
          "script": {
            "source": "(doc['responsetime'].sum()*0.5)*(-doc['responsetime'].avg()*0.1)*(doc['responsetime'].avg()*10)*0.0001",
            "lang": "expression"
          },
          "ignore_failure": false
        }
      }

I don't have 3 fields in my data (I have one) but the calculation is correct, as seen by the output:

[
  {
    "@timestamp": 1486425600000,
    "Scripted_field": -115.53398521657792,
    "airline": "AAL",
    "responsetime": 132.20460510253906
  },
  {
    "@timestamp": 1486425600000,
    "Scripted_field": -48583.02470747559,
    "airline": "JZA",
    "responsetime": 990.4628295898438
  },
...

By the way, the following expression is also equivalent:

       "script": {
            "source": "(doc['responsetime'].value*0.5)*(-doc['responsetime'].value*0.1)*(doc['responsetime'].value*10)*0.0001",
            "lang": "expression"
          },
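Since the expression language is doing ordinary arithmetic, the corrected per-document form can be cross-checked outside Elasticsearch; a minimal sketch in Python, reusing the responsetime value from the first preview document above:

```python
import math

# Re-compute the corrected scripted-field formula in plain Python to
# cross-check the value the "expression" script produced. v stands in
# for doc['responsetime'].value on a single document.
v = 132.20460510253906  # responsetime from the first preview document

scripted_field = (v * 0.5) * (-v * 0.1) * (v * 10) * 0.0001

# Algebraically the formula collapses to -5e-5 * v**3, which is why the
# result is negative for any positive responsetime.
print(scripted_field)  # approximately -115.534, as in the preview output
print(math.isclose(scripted_field, -5e-5 * v ** 3))  # True
```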

(Sarvendra Singh) #19

Even when I tried a simplified scripted field calculation formula, putting only one field inside the inline script, the same error still occurs.

PUT _xpack/ml/anomaly_detectors/my_job3
{
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "high_mean(test1)",
        "function": "high_mean",
        "field_name": "test1"
      }
    ],
    "influencers": [ ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}

Datafeed command:

PUT _xpack/ml/datafeeds/datafeed-my_job3/
{
  "job_id": "my_job3",
  "indices": [
    "master_neo"
  ],
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "test1": {
      "script": {
        "inline": "doc['var1'].value",
        "lang": "expression"
      },
      "ignore_failure": false
    }
  }
}

Query:

GET index_name/_search
{
  "query": {
    "match_all": {
      "boost": 1
    }
  },
  "script_fields": {
    "test1": {
      "script": {
        "inline": "doc['var1'].value",
        "lang": "expression"
      },
      "ignore_failure": false
    }
  },
  "size": 200
}

Query result:
{
  "took": 157,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 44036533,
    "max_score": 1,
    "hits": [
      {
        "_index": "index_name",
        "_type": "xx",
        "_id": "504661097",
        "_score": 1,
        "fields": {
          "test1": [
            -10
          ]
        }
      },
      {
        "_index": "index_name",
        "_type": "xx",
        "_id": "504661086",
        "_score": 1,
        "fields": {
          "test1": [
            -10
          ]
        }
      },

GET _xpack/ml/datafeeds/datafeed-my_job3/_preview/

result:
[]

Job UI error on starting the datafeed manually by selecting a time range:

Job messages log: as soon as I click the start button (datafeed start), the job immediately closes with zero records processed. The logs below are shown in the Job messages tab of the job.

2018-02-12 11:26:02 demo-es-2 Loading model snapshot [N/A], job latest_record_timestamp [N/A]
2018-02-12 11:26:03 demo-es-2 Datafeed started (from: 2017-12-31T18:30:00.000Z to: 2018-02-12T05:56:09.001Z)
2018-02-12 11:26:03 demo-es-3 Opening job on node [{demo-es-2}{w0sui-XBTEadW5l9YMWx2Q}{E7NGdgwGS0acir8euLHwDQ}{172.19.1.36}{172.19.1.36:9300}{ml.max_open_jobs=10, ml.enabled=true}]
2018-02-12 11:26:03 demo-es-2 Datafeed lookback retrieved no data
2018-02-12 11:26:03 demo-es-2 Datafeed stopped
2018-02-12 11:26:03 demo-es-2 Job is closing
2018-02-12 11:26:03 demo-es-3 Starting datafeed [datafeed-my_job] on node [{demo-es-2}{w0sui-XBTEadW5l9YMWx2Q}{E7NGdgwGS0acir8euLHwDQ}{172.19.1.36}{172.19.1.36:9300}{ml.max_open_jobs=10, ml.enabled=true}]

Meanwhile, with the same field I am able to create a job manually through the UI, and that job processes records with a successful datafeed.

Thanks
Sarvendra


(rich collier) #20

Thanks for the detailed information. I'm still a little puzzled and am now wondering if something else is fundamentally wrong and that I'm just overlooking something that should otherwise be obvious.

To that end, do you think you can send the output of the following query? It will show us the range of time for the data.

GET index_name/_search?size=0
{
    "aggs" : {
        "min_time" : { "min" : { "field" : "@timestamp" } },
        "max_time" : { "max" : { "field" : "@timestamp" } },
        "docscount": { "terms" : { "field" : "@timestamp" , "size": 1} }
    }
}

thanks

p.s. In the future, use code formatting (or enclose the code in triple backticks ```, which is Markdown syntax) to make the JSON easier to read (like I did above)