Elasticsearch errors after OutOfMemoryError

I restarted the Elasticsearch service after I saw OutOfMemoryError errors.
But in the Elasticsearch log I now see these errors:

[2020-07-07T09:43:52,654][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mPGPZUq] collector [index-stats] failed to collect data
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

Kibana logging:

Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"error","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"level":"error","error":{"message":"[search_phase_execution_exception] all shards failed","name":"Error","stack":"[search_phase_execution_exception] all shards failed :: {\"path\":\"/.kibana/_search\",\"query\":{\"size\":10000,\"ignore_unavailable\":true,\"filter_path\":\"hits.hits._source.canvas-workpad\"},\"body\":\"{\\\"query\\\":{\\\"bool\\\":{\\\"filter\\\":{\\\"term\\\":{\\\"type\\\":\\\"canvas-workpad\\\"}}}}}\",\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[],\\\"type\\\":\\\"search_phase_execution_exception\\\",\\\"reason\\\":\\\"all shards failed\\\",\\\"phase\\\":\\\"query\\\",\\\"grouped\\\":true,\\\"failed_shards\\\":[]},\\\"status\\\":503}\"}\n    at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:308:15)\n    at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:267:7)\n    at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)\n    at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4935:19)\n    at IncomingMessage.emit (events.js:194:15)\n    at endReadableNT (_stream_readable.js:1103:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)"},"message":"[search_phase_execution_exception] all shards failed"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"message":"Unable to fetch data from canvas collector"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"error","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"level":"error","error":{"message":"[search_phase_execution_exception] all shards failed","name":"Error","stack":"[search_phase_execution_exception] all shards failed :: {\"path\":\"/.kibana/_search\",\"query\":{\"ignore_unavailable\":true,\"filter_path\":\"aggregations.types.buckets\"},\"body\":\"{\\\"size\\\":0,\\\"query\\\":{\\\"terms\\\":{\\\"type\\\":[\\\"dashboard\\\",\\\"visualization\\\",\\\"search\\\",\\\"index-pattern\\\",\\\"graph-workspace\\\",\\\"timelion-sheet\\\"]}},\\\"aggs\\\":{\\\"types\\\":{\\\"terms\\\":{\\\"field\\\":\\\"type\\\",\\\"size\\\":6}}}}\",\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[],\\\"type\\\":\\\"search_phase_execution_exception\\\",\\\"reason\\\":\\\"all shards failed\\\",\\\"phase\\\":\\\"query\\\",\\\"grouped\\\":true,\\\"failed_shards\\\":[]},\\\"status\\\":503}\"}\n    at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:308:15)\n    at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:267:7)\n    at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)\n    at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4935:19)\n    at IncomingMessage.emit (events.js:194:15)\n    at endReadableNT (_stream_readable.js:1103:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)"},"message":"[search_phase_execution_exception] all shards failed"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"message":"Unable to fetch data from kibana collector"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"error","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"level":"error","error":{"message":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]]","name":"Error","stack":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]] :: {\"path\":\"/.kibana/doc/kql-telemetry%3Akql-telemetry\",\"query\":{},\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"no_shard_available_action_exception\\\",\\\"reason\\\":\\\"No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]]\\\"}],\\\"type\\\":\\\"no_shard_available_action_exception\\\",\\\"reason\\\":\\\"No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]]\\\"},\\\"status\\\":503}\"}\n    at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:308:15)\n    at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:267:7)\n    at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)\n    at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4935:19)\n    at IncomingMessage.emit (events.js:194:15)\n    at endReadableNT (_stream_readable.js:1103:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)"},"message":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]]"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"message":"Unable to fetch data from kql collector"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"error","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"level":"error","error":{"message":"[search_phase_execution_exception] all shards failed","name":"Error","stack":"[search_phase_execution_exception] all shards failed :: {\"path\":\"/.kibana/_search\",\"query\":{\"size\":1000,\"ignore_unavailable\":true,\"filter_path\":\"hits.hits._id\"},\"body\":\"{\\\"query\\\":{\\\"bool\\\":{\\\"filter\\\":{\\\"term\\\":{\\\"index-pattern.type\\\":\\\"rollup\\\"}}}}}\",\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[],\\\"type\\\":\\\"search_phase_execution_exception\\\",\\\"reason\\\":\\\"all shards failed\\\",\\\"phase\\\":\\\"query\\\",\\\"grouped\\\":true,\\\"failed_shards\\\":[]},\\\"status\\\":503}\"}\n    at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:308:15)\n    at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:267:7)\n    at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)\n    at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4935:19)\n    at IncomingMessage.emit (events.js:194:15)\n    at endReadableNT (_stream_readable.js:1103:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)"},"message":"[search_phase_execution_exception] all shards failed"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"message":"Unable to fetch data from rollups collector"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"error","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"level":"error","error":{"message":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][config:6.6.1]: routing [null]]","name":"Error","stack":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][config:6.6.1]: routing [null]] :: {\"path\":\"/.kibana/doc/config%3A6.6.1\",\"query\":{},\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"no_shard_available_action_exception\\\",\\\"reason\\\":\\\"No shard available for [get [.kibana][doc][config:6.6.1]: routing [null]]\\\"}],\\\"type\\\":\\\"no_shard_available_action_exception\\\",\\\"reason\\\":\\\"No shard available for [get [.kibana][doc][config:6.6.1]: routing [null]]\\\"},\\\"status\\\":503}\"}\n    at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:308:15)\n    at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:267:7)\n    at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)\n    at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4935:19)\n    at IncomingMessage.emit (events.js:194:15)\n    at endReadableNT (_stream_readable.js:1103:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)"},"message":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][config:6.6.1]: routing [null]]"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"message":"Unable to fetch data from kibana_settings collector"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["license","info","xpack"],"pid":11767,"message":"Imported license information from Elasticsearch for the [monitoring] cluster: mode: basic | status: active"}
Jul  7 09:09:47 elk-sr1-01 logstash[1235]: [2020-07-07T09:09:47,432][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[apache-access-2020.07.07][2] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[apache-access-2020.07.07][2]] containing [17] requests]"})
Jul  7 09:09:47 elk-sr1-01 logstash[1235]: [2020-07-07T09:09:47,432][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[apache-access-2020.07.07][2] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[apache-access-2020.07.07][2]] containing [17] requests]"})
Jul  7 09:09:47 elk-sr1-01 logstash[1235]: [2020-07-07T09:09:47,432][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[apache-access-2020.07.07][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[apache-access-2020.07.07][1]] containing [17] requests]"})
Jul  7 09:09:47 elk-sr1-01 logstash[1235]: [2020-07-07T09:09:47,432][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[apache-access-2020.07.07][2] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[apache-access-2020.07.07][2]] containing [17] requests]"})

Cluster status

 curl -X GET "localhost:9200/_cluster/health"
{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":4145,"active_shards":4145,"relocating_shards":0,"initializing_shards":4,"unassigned_shards":12173,"delayed_unassigned_shards":0,"number_of_pending_tasks":7,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":1049,"active_shards_percent_as_number":25.395172160274477}

I am new to Elasticsearch/Kibana (6.6.1). Can someone help me troubleshoot this?
Thanks in advance.

You have far, far too many shards in your cluster. You will need to reduce that dramatically, quite possibly by deleting data or closing indices.

How did you get so many shards? Try to merge or delete them if possible.
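
For example, you could list the indices by name and then close or delete the oldest daily ones. The index name below is only an example taken from your shard listing further down, so substitute whatever you actually want to get rid of:

# list all indices, sorted by name, to see what can go
curl -s "localhost:9200/_cat/indices?v&s=index"

# close an old daily index (keeps the data on disk but frees heap)
curl -X POST "localhost:9200/apache-error-2019.07.21/_close"

# or delete it outright if the data is no longer needed
curl -X DELETE "localhost:9200/apache-error-2019.07.21"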

If I run this command:

curl -XGET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

I see a lot of CLUSTER_RECOVERED as the unassigned.reason:

apache-error-2019.07.21         0 p UNASSIGNED   CLUSTER_RECOVERED
apache-error-2019.07.21         0 r UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   1 p UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   1 r UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   3 p UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   3 r UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   4 p UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   4 r UNASSIGNED   CLUSTER_RECOVERED
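
A quick per-reason count can be obtained by reusing the _cat/shards columns from the command above, for example:

curl -s "localhost:9200/_cat/shards?h=state,unassigned.reason" | grep UNASSIGNED | sort | uniq -c | sort -rn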

The following command:

curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"

gives

{
  "index" : "apache-access-2020.06.03",
  "shard" : 1,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2020-07-07T11:35:23.861Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "mPGPZUq5TrmM32Rc1SXYSQ",
      "node_name" : "mPGPZUq",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "67554115584",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[apache-access-2020.06.03][1], node[mPGPZUq5TrmM32Rc1SXYSQ], [P], s[STARTED], a[id=9dURKIHSTCC6USinN8gJlQ]]"
        },
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of incoming shard recoveries [4], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    }
  ]
}
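
Reading that output on a single-node cluster (the health output above shows number_of_nodes is 1): the same_shard decider means replica shards can never be assigned anywhere, and the throttling decider just means recovery of the remaining primaries is queued behind the ones already in flight. If replicas are not needed on a single node, a rough sketch would be to drop them and, optionally, let more recoveries run in parallel; endpoints and settings as in Elasticsearch 6.x, adjust to your setup:

# replicas cannot be allocated on a one-node cluster, so remove them
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'

# optionally allow more concurrent recoveries while the primaries initialize (transient setting)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.node_concurrent_recoveries": 4}}'

Dropping replicas removes the replica half of the unassigned shards immediately; the remaining primaries still have to initialize on their own.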
