Kibana X-Pack Machine Learning: Job list could not be retrieved

Versions being used (all installed on a single node):
logstash-6.0.1
elasticsearch-6.1.1
kibana-6.1.1-linux-x86_64

I am getting the following error on the Machine Learning -> Job Management page:
Error: Job list could not be retrieved
[search_phase_execution_exception] all shards failed

Even if I try to create a new job, I get the same error.

When I run the following:
GET /_xpack/usage
the output is:

{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": []
  },
  "status": 503
}

Please let me know how I can debug this. I tried restarting all services, but that did not help. Thanks.

What errors are seen in elasticsearch.log when you try to save a job? I suspect the root cause of what's going on will be seen there...
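
For example, you can follow the log while reproducing the error in Kibana. This is only a sketch: it assumes a package install with the default log directory and a cluster name of "elasticsearch" (which determines the log file name), so adjust the path for your setup.

tail -f /var/log/elasticsearch/elasticsearch.log    # path is an assumption; use your own log location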

This is what I am seeing in elasticsearch.log when I try to save the job.

[2018-01-29T11:31:31,389][WARN ][r.suppressed             ] path: /_xpack/ml/anomaly_detectors/testing2, params: {job_id=testing2}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:274) ~[elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:132) ~[elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:243) ~[elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:107) ~[elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$4(InitialSearchPhase.java:205) ~[elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:184) ~[elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:637) [elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) [elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.1.jar:6.1.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-01-29T11:31:31,639][DEBUG][o.e.a.s.TransportSearchAction] [8QF1u_C] All shards failed for phase: [query]
[2018-01-29T11:31:31,641][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [8QF1u_C] collector [cluster_stats] failed to collect data
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:274) ~[elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:132) ~[elasticsearch-6.1.1.jar:6.1.1]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:243) ~[elasticsearch-6.1.1.jar:6.1.1]

[2018-01-29T11:31:31,647][DEBUG][o.e.a.s.TransportSearchAction] [8QF1u_C] All shards failed for phase: [query]
[2018-01-29T11:31:31,831][DEBUG][o.e.a.s.TransportSearchAction] [8QF1u_C] All shards failed for phase: [query]
[2018-01-29T11:31:31,832][DEBUG][o.e.a.s.TransportSearchAction] [8QF1u_C] All shards failed for phase: [query]
[2018-01-29T11:31:31,831][ERROR][o.e.x.m.c.m.JobStatsCollector] [8QF1u_C] collector [job_stats] failed to collect data
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed

Hmm... it seems to me that something fundamental to Elasticsearch is not working (not just ML). Are you actively using this cluster for other things, or did you stand it up just to test ML?

What do you see when you run the following in Console (Dev Tools)?

GET _cluster/health

We mainly want to explore the ML features, so we built this node and installed everything here. Regular Discover/Visualize use and ingestion of data using Logstash are working fine.

This is what I get when I run GET _cluster/health

{
  "cluster_name": "elasticsearch",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 824,
  "active_shards": 824,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 833,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 49.728424864212435
}

Glad to hear you're interested in ML!

Unfortunately, your cluster status is red and that will certainly hinder things...you'll need to figure out why that's the case before I can help. Perhaps restart elasticsearch and watch the logging for reasons why the cluster won't go into at least a yellow state.
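
One thing that might help narrow it down: the cluster allocation explain API reports why a shard is unassigned. A minimal sketch to run in Console; with no request body it should pick an arbitrary unassigned shard to explain, and you can also pass index, shard, and primary in the body to target a specific one.

GET _cluster/allocation/explain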

OK, thanks for your response. I suspect it could be because of the number of shards, as we have a single node. Any pointers on how to debug in that direction? How can I delete all the data (and indices/shards) and start from scratch?

Yes, you have a lot of shards, half of which are unassigned.
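
To see which shards are unassigned and roughly why, something like this in Console can be a starting point (a sketch; the unassigned.reason column is only a coarse code, with more detail in the Elasticsearch log):

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason

Keep in mind that on a single-node cluster replica shards can never be assigned, since a replica is not allowed on the same node as its primary. Unassigned replicas alone would only make the cluster yellow, so a red status means some primaries are unassigned as well.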

If you want to delete all the data and start again:

  1. stop elasticsearch
  2. rm -rf /path/to/elasticsearch/data
  3. start elasticsearch
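
On a package install managed by systemd, that might look roughly like the sketch below. The service name and data path are assumptions (the data directory is typically /var/lib/elasticsearch on deb/rpm installs), so adjust them for your setup, and note that this wipes everything, including Kibana's own .kibana index.

sudo systemctl stop elasticsearch
sudo rm -rf /path/to/elasticsearch/data    # replace with your actual data path
sudo systemctl start elasticsearch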

Thanks. For now this has fixed my issue.

You can also use curl -XDELETE HOST:9200/* to remove all indices. It's a bit safer than removing things from the filesystem as well 🙂
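
The Console (Dev Tools) equivalent would be along the lines of the request below, assuming wildcard deletes haven't been disabled via action.destructive_requires_name:

DELETE /*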

Thanks @warkolm, good to know.
