ML throws errors as soon as I click on it

I've just upgraded to 5.4 so I can play with the ML features. When I click on Machine Learning in Kibana, I get three error banners at the top of the page: 'Jobs list could not be created', 'An internal server error occurred', and 'Job details could not be retrieved'.

In the kibana logs, I get the following output every 30 seconds:
error [21:22:16.932] [null_pointer_exception] null :: {"path":"/_xpack/ml/anomaly_detectors/_stats","query":{},"statusCode":500,"response":"{\"error\":{\"root_cause\":[{\"type\":\"null_pointer_exception\",\"reason\":null}],\"type\":\"null_pointer_exception\",\"reason\":null},\"status\":500}"}
at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:295:15)
at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:254:7)
at HttpConnector. (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:157:7)
at IncomingMessage.bound (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/dist/lodash.js:729:21)
at emitNone (events.js:91:20)
at IncomingMessage.emit (events.js:185:7)
at endReadableNT (_stream_readable.js:974:12)
at _combinedTickCallback (internal/process/next_tick.js:80:11)
at process._tickDomainCallback (internal/process/next_tick.js:128:9)

I've tried removing and re-installing Kibana and X-Pack from scratch, but it made no difference. Any suggestions?

Hi Steve,

Sorry to hear you're having trouble! Can you confirm that you have installed X-Pack on all Elasticsearch nodes (all must be 5.4) and have restarted the nodes in the cluster?

Have you changed any settings in your elasticsearch.yml related to ML (e.g. to set up dedicated ML nodes)?

Can you share the output of this request from Dev Tools -> Console?

GET /_xpack/usage

That should show that ML is available and enabled.

Hi Steve,

I've performed a rolling upgrade to 5.4 this evening, so all nodes have been restarted at 5.4. I haven't changed anything in the elasticsearch.yml. From what I read, all nodes are ML nodes by default, so I shouldn't need to change anything, or am I misunderstanding something here?

{
  "security": {
    "available": true,
    "enabled": true,
    "realms": {
      "file": {
        "available": true,
        "enabled": false
      },
      "ldap": {
        "load_balance_type": [
          "failover"
        ],
        "size": [
          0
        ],
        "name": [
          "ldap1"
        ],
        "available": true,
        "ssl": [
          false
        ],
        "enabled": true,
        "order": [
          1
        ],
        "user_search": [
          true
        ]
      },
      "native": {
        "name": [
          "native1"
        ],
        "available": true,
        "size": [
          0
        ],
        "enabled": true,
        "order": [
          0
        ]
      },
      "active_directory": {
        "available": true,
        "enabled": false
      },
      "pki": {
        "available": true,
        "enabled": false
      }
    },
    "roles": {
      "native": {
        "size": 2,
        "fls": true,
        "dls": false
      },
      "file": {
        "size": 0,
        "fls": false,
        "dls": false
      }
    },
    "ssl": {
      "http": {
        "enabled": false
      },
      "transport": {
        "enabled": false
      }
    },
    "audit": {
      "outputs": [
        "index",
        "logfile"
      ],
      "enabled": true
    },
    "ipfilter": {
      "http": false,
      "transport": false
    },
    "system_key": {
      "enabled": false
    },
    "anonymous": {
      "enabled": false
    }
  },
  "watcher": {
    "available": true,
    "enabled": true,
    "count": {
      "active": 0,
      "total": 0
    },
    "execution": {
      "actions": {
        "_all": {
          "total": 0,
          "total_time_in_ms": 0
        }
      }
    }
  },
  "monitoring": {
    "available": true,
    "enabled": true,
    "enabled_exporters": {
      "http": 1
    }
  },
  "graph": {
    "available": true,
    "enabled": true
  },
  "ml": {
    "available": true,
    "enabled": true,
    "jobs": {},
    "datafeeds": {}
  }
}

Hi Steve,

It sounds like the steps you have taken are correct. To enable ML you need to upgrade the cluster, as well as Kibana and the X-Pack plugins. Updates to elasticsearch.yml are not needed specifically for ML.

X-Pack usage shows that ML is enabled, which is good.

Next steps would be to try to create an ML job using the API, rather than the Kibana UI. In Dev Tools, please can you try this:

PUT _xpack/ml/anomaly_detectors/test-job
{
    "analysis_config" : {
        "bucket_span":"5m",
        "detectors" :[{"function":"count"}]
    },
    "data_description" : {
       "time_field":"time"
    }
}

This will create a basic job (without a datafeed), so let's see if that works.

You can then try to list available jobs using:

GET _xpack/ml/anomaly_detectors

Whilst I would expect a different error message, other aspects to consider are that ML requires a trial or platinum license. Also, with security enabled, you would need monitor_ml or monitor cluster privileges to get job info, and additionally manage_ml or manage cluster privileges to create a job. More info here https://www.elastic.co/guide/en/x-pack/current/security-privileges.html#privileges-list-cluster. If you are using the elastic superuser, then you should be ok however.
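
As a rough illustration of that privilege rule (these helper functions are hypothetical, just modelling the statement above; real X-Pack privilege resolution is richer, e.g. broader privileges like `all` imply the narrower ones):

```python
# Hypothetical helpers sketching the ML privilege rules described above.
# Simplified mental model only, not part of any Elastic client library.

def can_get_job_info(cluster_privileges):
    """Reading ML job info needs monitor_ml or the broader monitor."""
    return bool({"monitor_ml", "monitor"} & set(cluster_privileges))

def can_create_job(cluster_privileges):
    """Creating an ML job needs manage_ml or the broader manage."""
    return bool({"manage_ml", "manage"} & set(cluster_privileges))

print(can_get_job_info(["monitor_ml"]))  # True: can read job info
print(can_create_job(["monitor_ml"]))    # False: cannot create jobs
print(can_create_job(["manage"]))        # True
```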

Thanks

Hi Sophie,

The response I get is the same:

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[bds-esm-06][10.0.38.236:9300][cluster:admin/xpack/ml/job/put]"
      }
    ],
    "type": "null_pointer_exception",
    "reason": null
  },
  "status": 500
}

I am logged in as the 'elastic' superuser, and I have a 'Dev -> Platinum' license installed on this cluster.

P.S. Are you still moonlighting as an Uber driver? :wink:

Thanks,
Steve

Hi Steve,

Thanks for that, still a few more things to explore here.

Can you share any error messages that occur in the ES logs on the ES node that Kibana is pointing at, the Master node at the time, and bds-esm-06, if that isn't one of those two?

Are you running in Docker? If yes, what is the Base OS? Until the 5.4 release, our images used Alpine, which uses musl instead of glibc, and unfortunately musl is less mature and can cause some really odd issues.

Can you share the output of:

GET /_xpack/ml/anomaly_detectors

This should return something boring like the below, but will confirm that ML actions are being properly routed/executed:

{
  "count": 0,
  "jobs": []
}

Can you also share the output of

GET /_nodes/stats

And feel free to redact any sensitive info, like IP addresses, if you'd like.
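
If it helps, here's one possible way to scrub IPv4 addresses before posting (a quick sketch, not an official tool; the regex is naive and won't catch hostnames):

```python
import json
import re

# Naive IPv4 pattern; good enough for scrubbing pasted API output.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_ips(text):
    """Replace anything shaped like an IPv4 address with a placeholder."""
    return IPV4.sub("x.x.x.x", text)

# Example with a fragment shaped like _nodes/stats output:
stats = {"nodes": {"abc123": {"transport_address": "10.0.38.236:9300"}}}
print(redact_ips(json.dumps(stats)))
```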

Thanks,
Steve

Hi,

I'm having trouble pasting the output of these commands here as they are quite large, and apparently a post can only have 7000 characters, so they may be split over several posts.

So

GET /_xpack/ml/anomaly_detectors

Returns:

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[bds-esm-06][10.0.38.236:9300][cluster:monitor/xpack/ml/job/get]"
      }
    ],
    "type": "null_pointer_exception",
    "reason": null
  },
  "status": 500
}

I am not using Docker. The OS is Centos7.

BDS-ESM-06 is the management node.

The log output is quite large, but contains a lot of lines referencing 'security'. I can't share the full entry as it breaks the 7000 character limit.

java.lang.NullPointerException: null
at org.elasticsearch.xpack.ml.action.GetJobsStatsAction$TransportAction.doExecute(GetJobsStatsAction.java:384) ~[?:?]
at org.elasticsearch.xpack.ml.action.GetJobsStatsAction$TransportAction.doExecute(GetJobsStatsAction.java:362) ~[?:?]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$apply$1(SecurityActionFilter.java:128) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$authorizeRequest$4(SecurityActionFilter.java:203) ~[?:?]
at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.maybeRun(AuthorizationUtils.java:127) ~[?:?]
at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.setRunAsRoles(AuthorizationUtils.java:121) ~[?:?]
at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.authorize(AuthorizationUtils.java:109) ~[?:?]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.authorizeRequest(SecurityActionFilter.java:205) ~[?:?]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$3(SecurityActionFilter.java:181) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$0(AuthenticationService.java:192) ~[?:?]
at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$2(AuthenticationService.java:212) ~[?:?]
at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:224) ~[?:?]
at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:190) ~[?:?]
at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:147) ~[?:?]
at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:118) ~[?:?]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.applyInternal(SecurityActionFilter.java:180) ~[?:?]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:140) ~[?:?]

As I have a support license, I could raise an official support request and upload a cluster diagnostic to you that way. But since this feature is classed as 'beta', I thought the public forums would be the best place to discuss the issues. Let me know how you wish to proceed, as I can't share the log output here: a single entry is greater than 7000 characters, and the GET /_nodes/stats output is also very large.
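
For what it's worth, a quick sketch of splitting a large dump into posts that fit under the limit (naive character slicing, so the pieces need to be rejoined before parsing as JSON):

```python
def split_for_posts(text, limit=7000):
    """Return consecutive slices of `text`, each at most `limit` chars."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]

big_output = "x" * 20000  # stand-in for a large _nodes/stats response
chunks = split_for_posts(big_output)
print(len(chunks), max(len(c) for c in chunks))  # 3 7000
```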

Hi Steve,

That's great - glad to hear you're a customer! Why don't you open a ticket, so we can dig deeper into the diagnostics and figure out what's going on here quickly. Then we can update this thread with whatever summarized conclusions we discover.

Thanks,
Steve

Based on the stack trace you've posted, it looks like the output of

GET /_cluster/state?pretty&metric=metadata

could also be helpful in diagnosing the issue. Could you include that output when you raise the ticket?

Thanks.

Hi Steve,

So I've raised a support ticket (24 hours ago now), and it's still waiting to be looked at. It's frustrating that you can often get faster support on the free forums than with a paid subscription.

Hello, I just upgraded as well.

I also installed X-Pack on Logstash for the first time, which started to give errors. Long story short, to fix that I added the line below to my Elasticsearch config, and later realized it was messing with ML. So if you have a line like this, removing it might fix your issue; I had the same problem :slight_smile:

#action.auto_create_index: .security,.monitoring*,.watches,.triggered_watches,.watcher-history*

Hope it helps!

Thanks for the tip. I don't have anything like that in my config, though. It does raise the question: what new indices does ML create on install and expect to be present?

GET _cat/indices/.ml* responds with just these two on my cluster:

green open .ml-anomalies-shared 1yCT5983Sjm1gm9DtpxmVg 5 1 0 0 1.2kb 650b
green open .ml-notifications JU8cHu-2Qd2YyDFOk3_Xig 1 1 3 0 29.1kb 14.5kb

Compared to you, I also have an index called ".ml-state", which is empty :frowning:
P.S. I have more indexes, but they just hold results so I'm not mentioning them.

I also couldn't delete the jobs. I tried the commands below to list and delete them (it didn't work), but I was able to debug with the output in Kibana Dev Tools. Give them a try :slight_smile:

GET _xpack/ml/datafeeds/
DELETE _xpack/ml/datafeeds/feed_id

and

GET _xpack/ml/anomaly_detectors/
DELETE _xpack/ml/anomaly_detectors/job_id

For info, ML will create the following indices:

  • .ml-state - contains ML model state, stored in proprietary format
  • .ml-notifications - contains audit notifications e.g. datafeed started on node1
  • .ml-anomalies-shared - contains results, this is the default results index
  • .ml-anomalies-custom-foo - contains results when a custom results_index_name is specified in the job configuration

These indices are only created once a job has been created. Upon install, we load the ML index templates, and this requires the ability to automatically create indices beginning with .ml-.
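
As a loose sketch of why the commented-out setting mentioned earlier matters: action.auto_create_index takes a comma-separated pattern list, and a restrictive list that omits .ml-* patterns would block ML's indices from being auto-created. (This mock ignores the +/- include/exclude prefixes the real setting supports.)

```python
from fnmatch import fnmatch

def auto_create_allowed(index, setting):
    """True if `index` matches any pattern in the comma-separated setting."""
    return any(fnmatch(index, pat.strip()) for pat in setting.split(","))

setting = ".security,.monitoring*,.watches,.triggered_watches,.watcher-history*"
print(auto_create_allowed(".ml-state", setting))        # False: ML index blocked
print(auto_create_allowed(".monitoring-es-1", setting)) # True
```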

@crickes Your output implies that a job has been created at some point. Do you know when this happened? From the discussion so far, it appeared that none of the _xpack/ml APIs were working.

Another thought... is it possible for you to point Kibana at a master node? I'm not sure if you can, but this might avoid the remote_transport_exception and let you evaluate ML in the short term.

@sophie_chang

The only attempt to create a job was using the instruction in your earlier post:

PUT _xpack/ml/anomaly_detectors/test-job
{
    "analysis_config" : {
        "bucket_span":"5m",
        "detectors" :[{"function":"count"}]
    },
    "data_description" : {
       "time_field":"time"
    }
}

There is no difference in behaviour when pointing Kibana to a master node, I still get:

[2017-05-09T13:51:29,927][WARN ][r.suppressed ] path: /_xpack/ml/anomaly_detectors/_stats, params: {}
java.lang.NullPointerException: null

Followed by a massive entry in the log file, which is too big to paste on here.

Just to update for anyone else who comes across similar issues, they were resolved by restarting elasticsearch on all the master nodes.

This was caused by a bug in 5.4.0 that will be fixed in 5.4.1. Once 5.4.1 is available it will be possible to use ML after doing a rolling upgrade from an earlier version without having to restart the master nodes twice.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.