Error creating Machine Learning job

Hi,

I'm trying to create a multi-metric Machine Learning job. At the last step, when I click the Create Job button, I get the error below in a red box at the top of the page.

Save failed: [remote_transport_exception][someinstance][anIP and port][cluster:admin/xpack/ml/job/put]

What do I do with this information?
How do I fix this?

thanks,
Tim

Hi Tim, what's the full error message seen in elasticsearch.log when this error occurs? There might be better clues in there. Also, briefly describe your setup (are you running a test cluster on your laptop, etc.?)

Hi Rich,

I'm running this in my production cluster; the actual error is this:
Save failed: [remote_transport_exception] [ldxx90elk16-isgeis][204.54.165.114:9300][cluster:admin/xpack/ml/job/put]

I have looked on this server and I do not see an error logged. Is there somewhere else I should look for an error?

This appears to be a problem related to how much data the job is looking at. In this example, I'm looking at stats from all queues on an IBM MQ queue manager. If I reduce my initial query to look at one specific queue, it works well. So, is this something with my ES cluster, and if so, what do I need to do to make it work?

thanks,
Tim

Tim,

There should indeed be a similar error listed in elasticsearch.log; I'm not sure why you aren't seeing it.

Anyway, if you're successfully running a job with one specific queue but it isn't working for more than one queue, then I think something must be wrong with either:

  1. the configuration of the job itself
  2. the query for the raw data for the job (a.k.a. the "datafeed")

To help me understand things, and to get information on 1 and 2 above, please execute and post the results of the following two commands:

  1. curl -u elastic:changeme -XGET 'localhost:9200/_xpack/ml/anomaly_detectors/jobname?pretty'
  2. curl -u elastic:changeme -XGET 'localhost:9200/_xpack/ml/datafeeds/datafeed-jobname?pretty'

replacing "jobname" with the name of the job that isn't working for you.

Post the response JSON (redact any private info as necessary) and then we'll take a look.

Rich,
The problem is that the job isn't even being created; the "Save failed" error appears at the top of the page.

I have included the job output you requested, but it's for the job that works. That one is only looking at one queue out of hundreds; I'm trying to create a job that looks at all queues. I have a suspicion that the data set size is what's causing this issue. When you say to look at the log, which system would that be on? I have looked on the Kibana servers, in their ES instance, and on the one in the error message, and haven't found any activity.

http://isgeis-logcentral.dx.deere.com:9200/_xpack/ml/anomaly_detectors/ibm_mq_jdlink_cs_ssa_stats?pretty
{
  "count" : 1,
  "jobs" : [
    {
      "job_id" : "ibm_mq_jdlink_cs_ssa_stats",
      "job_type" : "anomaly_detector",
      "description" : "IBM MQ JDLINK SSA Stats",
      "create_time" : 1495637414056,
      "finished_time" : 1495637724762,
      "analysis_config" : {
        "bucket_span" : "5m",
        "detectors" : [
          {
            "detector_description" : "mean(curdepth)",
            "function" : "mean",
            "field_name" : "curdepth",
            "partition_field_name" : "qmgr.raw",
            "detector_rules" : [ ]
          },
          {
            "detector_description" : "mean(ipprocs)",
            "function" : "mean",
            "field_name" : "ipprocs",
            "partition_field_name" : "qmgr.raw",
            "detector_rules" : [ ]
          },
          {
            "detector_description" : "mean(opprocs)",
            "function" : "mean",
            "field_name" : "opprocs",
            "partition_field_name" : "qmgr.raw",
            "detector_rules" : [ ]
          },
          {
            "detector_description" : "mean(dequeue)",
            "function" : "mean",
            "field_name" : "dequeue",
            "partition_field_name" : "qmgr.raw",
            "detector_rules" : [ ]
          },
          {
            "detector_description" : "mean(enqueue)",
            "function" : "mean",
            "field_name" : "enqueue",
            "partition_field_name" : "qmgr.raw",
            "detector_rules" : [ ]
          }
        ],
        "influencers" : [
          "qmgr.raw"
        ]
      },
      "data_description" : {
        "time_field" : "@timestamp",
        "time_format" : "epoch_ms"
      },
      "model_snapshot_retention_days" : 1,
      "model_snapshot_id" : "1495637722",
      "results_index_name" : "shared"
    }
  ]
}

http://isgeis-logcentral.dx.deere.com:9200/_xpack/ml/datafeeds/datafeed-ibm_mq_jdlink_cs_ssa_stats?pretty

{
  "count" : 1,
  "datafeeds" : [
    {
      "datafeed_id" : "datafeed-ibm_mq_jdlink_cs_ssa_stats",
      "job_id" : "ibm_mq_jdlink_cs_ssa_stats",
      "query_delay" : "60s",
      "frequency" : "150s",
      "indexes" : [
        "logstash-mq-*"
      ],
      "types" : [
        "default",
        "mq-channel-stats",
        "mq-queue-stats",
        "mq-error",
        "mq-channel",
        "mq-system",
        "mq-event"
      ],
      "query" : {
        "match_all" : {
          "boost" : 1.0
        }
      },
      "scroll_size" : 1000,
      "chunking_config" : {
        "mode" : "auto"
      }
    }
  ]
}

Tim,

I've done a little digging and have discovered that in v5.4.0, if a job is misconfigured, you'll get this very uninformative error. When v5.4.1 is released, this problem will be corrected and the user will see the "real" error that is driving the exception.

The two times this problem has been raised with support, it was related to the job name the user was attempting to use. Job names must be formed from letters, numbers, and underscores only; the underscores cannot be at the beginning or end of the job name, and there can be no spaces in the name either.
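
For example (illustrative names only):

  ok:      ibmmq_all_queues_stats
  not ok:  _ibmmq_queue_stats_   (leading and trailing underscores)
  not ok:  ibm mq queue stats    (contains spaces)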

Other than that, there may be something else configuration-wise that you might not be getting right on that job. Please let me know what you were trying to configure, specifically:

jobname
detector configuration
query inside the datafeed
bucket_span
any other items you add to the UI config

That detail might help diagnose where you might be going wrong in the config.

In the job that is working, you mention that you're selecting metrics for only "one queue out of hundreds." However, I see nothing in your config that aligns with that. The index you're querying (logstash-mq-*) seems generic, and the query itself (match_all) is obviously not selecting any specific data. On top of that, your detectors are partitioning using

"partition_field_name" : "qmgr.raw"

So, depending on what that field contains, it looks as if you're really running the analysis across all values of qmgr.raw.
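
If you really did want to restrict the analysis to a single queue, the datafeed query would need to select it explicitly rather than use match_all. A rough sketch, assuming queue.raw is the not-analyzed (keyword) version of your queue-name field, and using a placeholder queue name:

  "query" : {
    "term" : {
      "queue.raw" : "SOME.QUEUE.NAME"
    }
  }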

Please clarify

This is what I'm putting in the Machine Learning create-job page:
Based on the Discover saved query "IBM MQ Queue Stats";
this is essentially logstash-mq-* and type:"mq-queue-stats".

From this index, checking the fields curdepth, dequeue, enqueue, ipprocs, opprocs, and uncommit (mean) for all.

Split by qmgr.raw (this is the logical grouping of queues; that is, they belong to a given queue manager).

No key fields selected

Name: ibmmq_all_queues_stats
Description: IBM MQ All queue stats

OK... now this worked. I was documenting this as I went, and now it saved without issue.
????

Well, glad it is working! Must have been a slight error in information entry the first time. Like I said, v5.4.1 will make the error reporting more sensible in this kind of situation.

Hi Rich,

After it has been running for a while, I'm not seeing any data in the Anomaly Explorer. I started the collection from April 10th to now, and it just doesn't look like that is happening.

Counts look like this:
  job_id                        ibmmq_all_queues_stats
  processed_record_count        222,987
  processed_field_count         920,162
  input_bytes                   18.6 MB
  input_field_count             920,162
  invalid_date_count            0
  missing_field_count           640,747
  out_of_order_timestamp_count  0
  empty_bucket_count            0
  sparse_bucket_count           0
  bucket_count                  237
  earliest_record_timestamp     2017-05-25 10:45:51
  latest_record_timestamp       2017-05-25 12:19:59
  last_data_time                2017-05-25 12:22:32
  input_record_count            222,987

Stats
  job_id                            ibmmq_all_queues_stats
  result_type                       model_size_stats
  model_bytes                       1.2 MB
  total_by_field_count              44
  total_over_field_count            0
  total_partition_field_count       43
  bucket_allocation_failures_count  0
  memory_status                     ok
  log_time                          2017-05-25 12:10:00
  timestamp                         2017-05-25 12:05:00

That missing_field_count of 640,747 is troubling.

It means that most of the documents encountered are missing the fields that are expected by the analysis. By the way, when creating a job (an advanced one), after you define a set of detectors you can click on the "Data Preview" tab to make sure the fields you expect in the analysis are actually coming through in the search.
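
If you prefer the API over the UI, I believe there is an equivalent datafeed preview endpoint that returns a sample of the documents the datafeed would hand to the analysis, so you can confirm the expected fields are present. Roughly (substituting your own job name):

  curl -u elastic:changeme -XGET 'localhost:9200/_xpack/ml/datafeeds/datafeed-ibmmq_all_queues_stats/_preview?pretty'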

Also, in the datafeed tab, make sure that you pick ONLY the _type that holds the data you want, and not extraneous types. Looking at the info you posted before, I see:

"types" : [
"default",
"mq-channel-stats",
"mq-queue-stats",
"mq-error",
"mq-channel",
"mq-system",
"mq-event"
],

which, to me, looks like a lot of potentially extraneous information.

When I created these jobs, I used the "Multi-metric job" link, and I had no option to limit the _types. I have tried to redo the job as an Advanced job, but that fails to save with the same original problem.

If I could edit what is out there and remove the _types that are not needed, that would work. How can I do that in the Dev Console? The ML GUI doesn't give you the options for that kind of change.

--Tim

Tim,

I noticed that if I take the job JSON you were trying:

{
  "job_id": "mqgw_queue_stats",
  "job_type": "anomaly_detector",
  "description": "IBM MQ All queue stats",
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "mean(curdepth) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "curdepth",
        "partition_field_name": "qmgr",
        "detector_rules": [],
        "by_field_name": "queue"
      },
      {
        "detector_description": "mean(dequeue) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "dequeue",
        "partition_field_name": "qmgr",
        "detector_rules": [],
        "by_field_name": "queue"
      },
      {
        "detector_description": "mean(enqueue) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "enqueue",
        "partition_field_name": "qmgr",
        "detector_rules": [],
        "by_field_name": "queue"
      },
      {
        "detector_description": "mean(ipprocs) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "ipprocs",
        "partition_field_name": "qmgr",
        "detector_rules": [],
        "by_field_name": "queue"
      },
      {
       "detector_description": "mean(opprocs) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "opprocs",
        "partition_field_name": "qmgr",
        "detector_rules": [],
        "by_field_name": "queue"
      },
      {
        "detector_description": "mean(uncommit)",
        "function": "mean",
        "field_name": "uncommit",
        "partition_field_name": "qmgr",
        "detector_rules": [],
        "by_field_name": "queue"
      }
    ],
    "influencers": [
      "qmgr.raw",
      "queue"
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "model_snapshot_retention_days": 1,
  "datafeed_config": {
    "query_delay": "60s",
    "frequency": "150s",
    "indexes": [
      "logstash-mq-*"
    ],
    "types": [
      "mq-queue-stats"
    ],
    "query": {
      "match_all": {
        "boost": 1
      }
    },
    "scroll_size": 1000,
    "chunking_config": {
      "mode": "auto"
    }
  }
}

and then try it locally, I can reproduce your problem.

However, if I simply remove the "qmgr.raw" specifier from the influencers, I can get the job to save:
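
i.e., with the influencers section reduced to just:

  "influencers" : [
    "queue"
  ]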

Might be something to try on your end
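
On your earlier question about removing the unneeded _types from an existing datafeed via the Dev Console: I believe there is an update-datafeed endpoint (_xpack/ml/datafeeds/<datafeed_id>/_update) that accepts the same fields as the create call, though the datafeed has to be stopped before you update it. A rough, untested sketch, assuming the default datafeed-<job_id> naming you used for the other job:

  POST _xpack/ml/datafeeds/datafeed-ibmmq_all_queues_stats/_update
  {
    "types" : [ "mq-queue-stats" ]
  }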

Tim,

I've determined that the proper config, given your mappings, is to use queue.raw and qmgr.raw as the fields for by_field_name, partition_field_name, and the influencers, because those fields are of type keyword.

Additionally, if you're still having problems, check the "use dedicated index" option, as this also prevents the situation where other jobs have already defined fields with competing mapping types in the shared .ml-anomalies-shared results index.

The JSON below both uses the .raw versions of the fields and specifies a dedicated results index for the job.

{
  "job_id": "mqgw_queue_stats",
  "job_type": "anomaly_detector",
  "description": "IBM MQ All queue stats",
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "mean(curdepth) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "curdepth",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      },
      {
        "detector_description": "mean(dequeue) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "dequeue",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      },
      {
        "detector_description": "mean(enqueue) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "enqueue",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      },
      {
        "detector_description": "mean(ipprocs) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "ipprocs",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      },
      {
       "detector_description": "mean(opprocs) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "opprocs",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      },
      {
        "detector_description": "mean(uncommit)",
        "function": "mean",
        "field_name": "uncommit",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      }
    ],
    "influencers": [
      "qmgr.raw",
      "queue.raw"
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "model_snapshot_retention_days": 1,
  "results_index_name": "custom-mqgw_queue_stats",
  "datafeed_config": {
    "query_delay": "60s",
    "frequency": "150s",
    "indexes": [
      "logstash-mq-*"
    ],
    "types": [
      "mq-queue-stats"
    ],
    "query": {
      "match_all": {
        "boost": 1
      }
    },
    "scroll_size": 1000,
    "chunking_config": {
      "mode": "auto"
    }
  }
}

Hi Rich,

I'm still seeing save failures when I select the dedicated index option.
I did have success saving when I did this:

  1. Advanced job.
  2. Picked my stats without .raw (the advanced GUI does not make the .raw fields from the indices available).
  3. Save worked.
  4. When I scheduled it, it ran into an error because it couldn't invert on the field environment (this is where I would need to specify .raw), but I'm not able to change that after it's saved.

I have time today until 4:00 if you want to try something else.

Thanks,
Tim

Tim,

What if you do the following:

  1. Click on Advanced job, pick an index, and click Next (it actually doesn't matter which one you pick, because of what happens in the next step).
  2. Immediately go to the "Edit JSON" tab in the UI and replace all of the existing text there with the JSON text from my last response (repeated below for clarity).
  3. Once pasted, save and schedule the job.

This process avoids the situation where the UI tries to "encourage" you to pick the non-.raw fields.

Job JSON:

{
  "job_id": "mqgw_queue_stats",
  "job_type": "anomaly_detector",
  "description": "IBM MQ All queue stats",
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "detector_description": "mean(curdepth) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "curdepth",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      },
      {
        "detector_description": "mean(dequeue) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "dequeue",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      },
      {
        "detector_description": "mean(enqueue) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "enqueue",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      },
      {
        "detector_description": "mean(ipprocs) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "ipprocs",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      },
      {
       "detector_description": "mean(opprocs) (ibmmq_all_queues_stats)",
        "function": "mean",
        "field_name": "opprocs",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      },
      {
        "detector_description": "mean(uncommit)",
        "function": "mean",
        "field_name": "uncommit",
        "partition_field_name": "qmgr.raw",
        "detector_rules": [],
        "by_field_name": "queue.raw"
      }
    ],
    "influencers": [
      "qmgr.raw",
      "queue.raw"
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "model_snapshot_retention_days": 1,
  "results_index_name": "custom-mqgw_queue_stats",
  "datafeed_config": {
    "query_delay": "60s",
    "frequency": "150s",
    "indexes": [
      "logstash-mq-*"
    ],
    "types": [
      "mq-queue-stats"
    ],
    "query": {
      "match_all": {
        "boost": 1
      }
    },
    "scroll_size": 1000,
    "chunking_config": {
      "mode": "auto"
    }
  }
}

That worked!

Great! The dev team is aware, by the way, of the issues raised in this thread. We will make progress towards a better configuration experience in the next version(s).

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.