Is data frame analytics (ML) possible on scripted fields?

I am currently experimenting with Machine Learning in Kibana, and I have learned how to use scripted fields in anomaly detection by editing the job JSON.

Now I want to use data frame analytics, specifically regression. I opened the advanced editor because scripted fields do not appear in the dependent variable dropdown.

So I tried adding a "script_fields" block to the JSON, using the query from

It works for anomaly detection but does not work for data frame analytics.

It says: "an error occurred creating the data frame analytics job, Bad request [request.body.script_fields] definition for this key is missing"

Is it possible to use data frame analytics on a scripted field?

Thank you.

Great question @MLsuper ,

The answer is no: script_fields are not supported in data frame analytics. However, since version 7.13.0, data frame analytics has supported runtime_mappings, which can be used similarly to script_fields.

Example:

{
  "source": {
    "index": [
      "kibana_sample_data_flights"
    ],
    "runtime_mappings": {
      "day_of_week": {
        "type": "keyword",
        "script": {
          "source": "emit(doc['@timestamp'].value.dayOfWeekEnum.getDisplayName(TextStyle.FULL, Locale.ROOT))"
        }
      }
    },
    "query": {
      "range": {
        "DistanceKilometers": {
          "gt": 0
        }
      }
    },
    "_source": {
      "includes": [],
      "excludes": [
        "FlightDelay",
        "FlightDelayType"
      ]
    }
  },
  "dest": {
    "index": "df-flight-delays",
    "results_field": "ml-results"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "FlightDelayMin",
      "training_percent": 90
    }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": [
      "FlightNum"
    ]
  },
  "model_memory_limit": "100mb"
}

Here, day_of_week is a categorical feature indicating the day of the week of the flight. It is calculated at runtime from the @timestamp field via a script.


I see. How is that runtime mapping used in this regression model? What does it do?

So currently I cannot use regression on my scripted fields?

Since it is not supported, my plan is now to use the Elasticsearch library in Python to pull the data from ES into Python, manipulate it, and then run the regression with an ML library such as sklearn. Would that work?

Is it possible to get scripted field data from ES into Python? Or will I have to compute it myself in Python, since scripted fields only exist at run time?

EDIT:

Right now my simple Python code can grab data from ES using es.search. Fetching the index pattern and playing around with it gives me the regular fields but not the scripted fields. So if this is not possible, I would have to loop over all the data and recompute the scripted field for every document in Python, which seems like a difficult task.

Thank you.

@MLsuper

You can retrieve runtime fields at query time by supplying runtime_mappings in your Elasticsearch query and then supplying the fields option, listing all the fields you want to retrieve (including the runtime ones).
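As a rough sketch of what that could look like from Python: the snippet below only builds the search body, so it runs without a cluster. The helper name build_runtime_search_body is hypothetical, the day_of_week definition is copied from the example earlier in this thread, and the commented-out es.search call assumes a client object named es from the official elasticsearch package.

```python
def build_runtime_search_body(runtime_mappings, fields, size=100):
    """Build a search body that defines runtime fields and asks for them back.

    Hypothetical helper for illustration; not part of the Elasticsearch client.
    """
    return {
        "size": size,
        # Runtime fields are defined per request, next to the query.
        "runtime_mappings": runtime_mappings,
        # Runtime fields are returned via "fields", not via "_source".
        "fields": list(fields),
        "_source": False,
    }


# Runtime field definition copied from the example job config above.
day_of_week = {
    "day_of_week": {
        "type": "keyword",
        "script": {
            "source": "emit(doc['@timestamp'].value.dayOfWeekEnum"
                      ".getDisplayName(TextStyle.FULL, Locale.ROOT))"
        },
    }
}

body = build_runtime_search_body(day_of_week, ["day_of_week", "FlightDelayMin"])

# With a live cluster (assumption: `es` is an elasticsearch.Elasticsearch client):
#   resp = es.search(index="kibana_sample_data_flights", body=body)
#   for hit in resp["hits"]["hits"]:
#       print(hit["fields"])  # each requested field comes back as a list of values
```

Each hit then carries the computed values under its fields key, so there is no need to recompute the script in Python for every document.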

How do runtime_mappings fields not satisfy your scripted field use case? They can be used as features in data frame analytics, and the scripts are simply Painless scripts, just like scripted fields.

I see, I will try this. Can I use this field as a dependent variable?

I don't see anything in the documentation saying that its use as the dependent variable is restricted.

So my expectation is that runtime_mappings fields can be used as a dependent variable. :slight_smile:


@BenTrent When I try the above, it says:

An error occurred creating the data frame analytics job:

Bad Request: [request body.runtime_mappings]: definition for this key is missing

{
  "description": "",
  "source": {
    "index": "rotate-data"
  },
  
   "runtime_mappings": {
    "TimeDiff": {
      "type": "keyword",
      "script": {
        "source": "doc['@timestamp.max'].value.getMillis() - doc['@timestamp.min'].value.getMillis()"
      }
    }
    },

  "dest": {
    "index": "results-data"
  },
  "analyzed_fields": {
    "excludes": [

    ]
  },
  "analysis": {
    "regression": {
      "dependent_variable": "TimeDiff",
      "num_top_feature_importance_values": 2,
      "training_percent": 80
    }
  },
  "model_memory_limit": "73mb"
}

You must use emit to expose the script value to the runtime field.

Also, for regression the dependent variable must be numeric, so the output type should be long.

Also, runtime_mappings needs to be inside the source object.
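To make those three points concrete, here is a small hypothetical checker (check_dfa_config is not an Elasticsearch API, just an illustration) that flags each of them in a job config held as a Python dict. It assumes that long and double are the numeric runtime field types.

```python
def check_dfa_config(config):
    """Return a list of problems in a data frame analytics config dict.

    Illustrative only: it covers just the three issues mentioned above.
    """
    problems = []

    # 1. runtime_mappings belongs inside "source", not at the top level.
    if "runtime_mappings" in config:
        problems.append("runtime_mappings must be nested inside 'source'")

    source = config.get("source", {})
    mappings = source.get("runtime_mappings", config.get("runtime_mappings", {}))

    # 2. Every runtime field script has to call emit(...).
    for name, field in mappings.items():
        script = field.get("script", {})
        src = script.get("source", "") if isinstance(script, dict) else str(script)
        if "emit(" not in src:
            problems.append(f"{name}: script does not call emit(...)")

    # 3. A regression dependent variable must be numeric.
    dep = config.get("analysis", {}).get("regression", {}).get("dependent_variable")
    if dep in mappings and mappings[dep].get("type") not in ("long", "double"):
        problems.append(f"{dep}: regression dependent variable must be long or double")

    return problems


# The config from the post above, reduced to the relevant parts.
broken = {
    "source": {"index": "rotate-data"},
    "runtime_mappings": {
        "TimeDiff": {
            "type": "keyword",
            "script": {
                "source": "doc['@timestamp.max'].value.getMillis() - "
                          "doc['@timestamp.min'].value.getMillis()"
            },
        }
    },
    "analysis": {"regression": {"dependent_variable": "TimeDiff"}},
}

for problem in check_dfa_config(broken):
    print(problem)
```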

If this still doesn't work, please try via the Kibana Dev Console with the following request:

PUT _ml/data_frame_analytics/my_analytics
{
  "description": "",
  "source": {
    "index": "rotate-data",
    "runtime_mappings": {
      "TimeDiff": {
        "type": "long",
        "script": {
          "source": "emit(doc['@timestamp.max'].value.getMillis() - doc['@timestamp.min'].value.getMillis())"
        }
      }
    }
  },
  "dest": {
    "index": "results-data"
  },
  "analyzed_fields": {
    "excludes": [
      "transaction_id.keyword",
      "transaction_id.keyword.keyword"
    ]
  },
  "analysis": {
    "regression": {
      "dependent_variable": "TimeDiff",
      "num_top_feature_importance_values": 2,
      "training_percent": 80
    }
  },
  "model_memory_limit": "73mb"
}

I get this error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "invalid_index_name_exception",
        "reason" : "Invalid index name [_ml], must not start with '_', '-', or '+'",
        "index_uuid" : "_na_",
        "index" : "_ml"
      }
    ],
    "type" : "invalid_index_name_exception",
    "reason" : "Invalid index name [_ml], must not start with '_', '-', or '+'",
    "index_uuid" : "_na_",
    "index" : "_ml"
  },
  "status" : 400
}

And if I remove the _ in front of ml, I get:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Rejecting mapping update to [ml] as the final mapping would have more than 1 type: [_doc, data_frame_analytics]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Rejecting mapping update to [ml] as the final mapping would have more than 1 type: [_doc, data_frame_analytics]"
  },
  "status" : 400
}

This is what I get when I try the PUT in the Kibana Dev Console. In the data frame analytics editor it still does not work, with the same error as before.

Whoops! I gave you the wrong URL. See: Create data frame analytics jobs API | Elasticsearch Guide [7.15] | Elastic

PUT _ml/data_frame/analytics/<data_frame_analytics_id>

Now I get this error

{
  "error" : {
    "root_cause" : [
      {
        "type" : "x_content_parse_exception",
        "reason" : "[5:1] [data_frame_analytics_source] unknown field [runtime_mappings]"
      }
    ],
    "type" : "x_content_parse_exception",
    "reason" : "[5:21] [data_frame_analytics_config] failed to parse field [source]",
    "caused_by" : {
      "type" : "x_content_parse_exception",
      "reason" : "[5:1] [data_frame_analytics_source] unknown field [runtime_mappings]"
    }
  },
  "status" : 400
}

I also tried replacing runtime_mappings with mappings and "dynamic": "runtime", like this:

PUT _ml/data_frame/analytics/<data_frame_analytics_id>
{
  "description": "",
  "source": {
    "index": "-qa",
"mappings": {
    "dynamic": "runtime",
    "TimeDiff": {
      "type": "long",
      "script": {
        "source": "emit(doc['@timestamp.max'].value.getMillis() - doc['@timestamp.min'].value.getMillis())"
      }
    }
    }
  },
  "dest": {
    "index": "results-regression"
  },
  "analyzed_fields": {
    "excludes": [
    ]
  },
  "analysis": {
    "regression": {
      "dependent_variable": "TimeDiff",
      "num_top_feature_importance_values": 2,
      "training_percent": 80
    }
  },
  "model_memory_limit": "73mb"
}

You have to have at least Elasticsearch version 7.13: Create data frame analytics jobs API | Elasticsearch Guide [7.13] | Elastic

What version are you running?

Running GET / gives me version 7.8.1, but the minimum wire compatibility version is 6.8.0 and the minimum index compatibility version is 6.0.0-beta1.

Ah, well, the only way to get scripted fields for data frame analytics is to have version 7.13+ (and use runtime_mappings).
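If you want to check this from a script before creating the job, a minimal sketch could look like the following. The function name supports_dfa_runtime_mappings is made up for illustration, the 7.13 threshold is the one stated above, and the commented-out es.info() call assumes a client from the official elasticsearch Python package.

```python
def supports_dfa_runtime_mappings(version_number):
    """Return True if the version string is 7.13 or later.

    Accepts strings like "7.8.1" or "6.0.0-beta1", as reported by GET /.
    """
    # Drop any pre-release suffix such as "-beta1", then compare major.minor.
    parts = version_number.split("-")[0].split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) >= (7, 13)


print(supports_dfa_runtime_mappings("7.8.1"))   # the version reported above
print(supports_dfa_runtime_mappings("7.13.0"))

# With a live cluster (assumption: `es` is an elasticsearch.Elasticsearch client):
#   version = es.info()["version"]["number"]
#   print(supports_dfa_runtime_mappings(version))
```

Note that the minimum wire and index compatibility versions from GET / do not matter here; only the version number itself does.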


Ahh, I see, thank you. I will try this after upgrading.

Hello Ben, the cluster is upgraded, and while setting up regression I get the error below. Is it possible to use float in runtime mappings, i.e. type: float? (It says invalid runtime mappings.) If I use long, I get this error:

cannot merge [properties] mappings because of differences for field [@timestamp]; mapped as [{properties={max={type=date}, min={type=date}, value_count={type=long}}}] in index [p200]; mapped as [{properties={max={type=date}, min={type=date}, value_count={type=float}}}] in index [p200-2021.10]

Yes, you can use a floating-point type in your runtime mapping if you wish, although runtime fields use double rather than float for that. The key is that the field type has to be the same for ALL indices matching your index pattern, or the job may error.
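One way to spot such mismatches up front is the field capabilities API, which lists each type a field has across the matched indices. Below is a hedged sketch that works on a response-shaped dict, so it runs without a cluster. The function name conflicting_fields and the sample data (modeled on the error in this thread) are made up for illustration.

```python
def conflicting_fields(field_caps_response):
    """Map each field that has more than one type to {type: [indices]}.

    Expects the shape returned by the _field_caps API:
    {"fields": {field_name: {type_name: {"indices": [...], ...}}}}
    """
    conflicts = {}
    for name, by_type in field_caps_response.get("fields", {}).items():
        if len(by_type) > 1:
            conflicts[name] = {
                type_name: caps.get("indices", [])
                for type_name, caps in by_type.items()
            }
    return conflicts


# Hand-written sample modeled on the mapping error quoted in this thread.
sample = {
    "fields": {
        "@timestamp.value_count": {
            "long": {"type": "long", "indices": ["p200"]},
            "float": {"type": "float", "indices": ["p200-2021.10"]},
        },
        "@timestamp.max": {"date": {"type": "date"}},
    }
}

print(conflicting_fields(sample))

# With a live cluster (assumption: `es` is an elasticsearch.Elasticsearch client):
#   resp = es.field_caps(index="p200*", fields="*")
#   print(conflicting_fields(resp))
```

Any field reported here needs to be reindexed or remapped to one consistent type before the job can use the whole index pattern.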

I see, because when I enter "float" it doesn't let me click Apply changes, but if I change it to "double" or "long" it does.

cannot merge [properties] mappings because of differences for field [@timestamp]; mapped as [{properties={max={type=date}, min={type=date}, value_count={type=long}}}] in index [p200]; mapped as [{properties={max={type=date}, min={type=date}, value_count={type=float}}}] in index [p200-2021.10]

So in this error, value_count is mapped as long in one index and as float in the other, but it has to be the same, right?
