Elasticsearch errors after OutOfMemoryError

I restarted the Elasticsearch service after I saw OutOfMemoryError errors.
But in the Elasticsearch log I now see these errors:

[2020-07-07T09:43:52,654][ERROR][o.e.x.m.c.i.IndexStatsCollector] [mPGPZUq] collector [index-stats] failed to collect data
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

Kibana logging:

Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"error","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"level":"error","error":{"message":"[search_phase_execution_exception] all shards failed","name":"Error","stack":"[search_phase_execution_exception] all shards failed :: {\"path\":\"/.kibana/_search\",\"query\":{\"size\":10000,\"ignore_unavailable\":true,\"filter_path\":\"hits.hits._source.canvas-workpad\"},\"body\":\"{\\\"query\\\":{\\\"bool\\\":{\\\"filter\\\":{\\\"term\\\":{\\\"type\\\":\\\"canvas-workpad\\\"}}}}}\",\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[],\\\"type\\\":\\\"search_phase_execution_exception\\\",\\\"reason\\\":\\\"all shards failed\\\",\\\"phase\\\":\\\"query\\\",\\\"grouped\\\":true,\\\"failed_shards\\\":[]},\\\"status\\\":503}\"}\n    at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:308:15)\n    at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:267:7)\n    at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)\n    at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4935:19)\n    at IncomingMessage.emit (events.js:194:15)\n    at endReadableNT (_stream_readable.js:1103:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)"},"message":"[search_phase_execution_exception] all shards failed"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"message":"Unable to fetch data from canvas collector"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"error","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"level":"error","error":{"message":"[search_phase_execution_exception] all shards failed","name":"Error","stack":"[search_phase_execution_exception] all shards failed :: {\"path\":\"/.kibana/_search\",\"query\":{\"ignore_unavailable\":true,\"filter_path\":\"aggregations.types.buckets\"},\"body\":\"{\\\"size\\\":0,\\\"query\\\":{\\\"terms\\\":{\\\"type\\\":[\\\"dashboard\\\",\\\"visualization\\\",\\\"search\\\",\\\"index-pattern\\\",\\\"graph-workspace\\\",\\\"timelion-sheet\\\"]}},\\\"aggs\\\":{\\\"types\\\":{\\\"terms\\\":{\\\"field\\\":\\\"type\\\",\\\"size\\\":6}}}}\",\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[],\\\"type\\\":\\\"search_phase_execution_exception\\\",\\\"reason\\\":\\\"all shards failed\\\",\\\"phase\\\":\\\"query\\\",\\\"grouped\\\":true,\\\"failed_shards\\\":[]},\\\"status\\\":503}\"}\n    at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:308:15)\n    at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:267:7)\n    at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)\n    at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4935:19)\n    at IncomingMessage.emit (events.js:194:15)\n    at endReadableNT (_stream_readable.js:1103:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)"},"message":"[search_phase_execution_exception] all shards failed"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"message":"Unable to fetch data from kibana collector"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"error","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"level":"error","error":{"message":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]]","name":"Error","stack":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]] :: {\"path\":\"/.kibana/doc/kql-telemetry%3Akql-telemetry\",\"query\":{},\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"no_shard_available_action_exception\\\",\\\"reason\\\":\\\"No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]]\\\"}],\\\"type\\\":\\\"no_shard_available_action_exception\\\",\\\"reason\\\":\\\"No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]]\\\"},\\\"status\\\":503}\"}\n    at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:308:15)\n    at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:267:7)\n    at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)\n    at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4935:19)\n    at IncomingMessage.emit (events.js:194:15)\n    at endReadableNT (_stream_readable.js:1103:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)"},"message":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][kql-telemetry:kql-telemetry]: routing [null]]"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"message":"Unable to fetch data from kql collector"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"error","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"level":"error","error":{"message":"[search_phase_execution_exception] all shards failed","name":"Error","stack":"[search_phase_execution_exception] all shards failed :: {\"path\":\"/.kibana/_search\",\"query\":{\"size\":1000,\"ignore_unavailable\":true,\"filter_path\":\"hits.hits._id\"},\"body\":\"{\\\"query\\\":{\\\"bool\\\":{\\\"filter\\\":{\\\"term\\\":{\\\"index-pattern.type\\\":\\\"rollup\\\"}}}}}\",\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[],\\\"type\\\":\\\"search_phase_execution_exception\\\",\\\"reason\\\":\\\"all shards failed\\\",\\\"phase\\\":\\\"query\\\",\\\"grouped\\\":true,\\\"failed_shards\\\":[]},\\\"status\\\":503}\"}\n    at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:308:15)\n    at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:267:7)\n    at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)\n    at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4935:19)\n    at IncomingMessage.emit (events.js:194:15)\n    at endReadableNT (_stream_readable.js:1103:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)"},"message":"[search_phase_execution_exception] all shards failed"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"message":"Unable to fetch data from rollups collector"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"error","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"level":"error","error":{"message":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][config:6.6.1]: routing [null]]","name":"Error","stack":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][config:6.6.1]: routing [null]] :: {\"path\":\"/.kibana/doc/config%3A6.6.1\",\"query\":{},\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"no_shard_available_action_exception\\\",\\\"reason\\\":\\\"No shard available for [get [.kibana][doc][config:6.6.1]: routing [null]]\\\"}],\\\"type\\\":\\\"no_shard_available_action_exception\\\",\\\"reason\\\":\\\"No shard available for [get [.kibana][doc][config:6.6.1]: routing [null]]\\\"},\\\"status\\\":503}\"}\n    at respond (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:308:15)\n    at checkRespForFailure (/usr/share/kibana/node_modules/elasticsearch/src/lib/transport.js:267:7)\n    at HttpConnector.<anonymous> (/usr/share/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:166:7)\n    at IncomingMessage.wrapper (/usr/share/kibana/node_modules/elasticsearch/node_modules/lodash/lodash.js:4935:19)\n    at IncomingMessage.emit (events.js:194:15)\n    at endReadableNT (_stream_readable.js:1103:12)\n    at process._tickCallback (internal/process/next_tick.js:63:19)"},"message":"[no_shard_available_action_exception] No shard available for [get [.kibana][doc][config:6.6.1]: routing [null]]"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["warning","stats-collection"],"pid":11767,"message":"Unable to fetch data from kibana_settings collector"}
Jul  7 09:09:43 elk-sr1-01 kibana[11767]: {"type":"log","@timestamp":"2020-07-07T09:09:43Z","tags":["license","info","xpack"],"pid":11767,"message":"Imported license information from Elasticsearch for the [monitoring] cluster: mode: basic | status: active"}
Jul  7 09:09:47 elk-sr1-01 logstash[1235]: [2020-07-07T09:09:47,432][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[apache-access-2020.07.07][2] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[apache-access-2020.07.07][2]] containing [17] requests]"})
Jul  7 09:09:47 elk-sr1-01 logstash[1235]: [2020-07-07T09:09:47,432][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[apache-access-2020.07.07][2] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[apache-access-2020.07.07][2]] containing [17] requests]"})
Jul  7 09:09:47 elk-sr1-01 logstash[1235]: [2020-07-07T09:09:47,432][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[apache-access-2020.07.07][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[apache-access-2020.07.07][1]] containing [17] requests]"})
Jul  7 09:09:47 elk-sr1-01 logstash[1235]: [2020-07-07T09:09:47,432][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[apache-access-2020.07.07][2] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[apache-access-2020.07.07][2]] containing [17] requests]"})

Cluster status

 curl -X GET "localhost:9200/_cluster/health"
{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":4145,"active_shards":4145,"relocating_shards":0,"initializing_shards":4,"unassigned_shards":12173,"delayed_unassigned_shards":0,"number_of_pending_tasks":7,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":1049,"active_shards_percent_as_number":25.395172160274477}

I am new to Elasticsearch/Kibana (6.6.1). Can someone help me troubleshoot this?
Thanks in advance.

You have far, far too many shards in your cluster. You will need to reduce that dramatically, quite possibly by deleting data or closing indices.

How did you get so many shards? Try to merge or delete them if possible.
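
For example, you could list the indices by name and then close or delete the oldest daily ones. The index name below is only an example taken from your shard listing further down, so substitute whatever you actually want to get rid of:

# list all indices, sorted by name, to see what can go
curl -s "localhost:9200/_cat/indices?v&s=index"

# close an old daily index (keeps the data on disk but frees heap)
curl -X POST "localhost:9200/apache-error-2019.07.21/_close"

# or delete it outright if the data is no longer needed
curl -X DELETE "localhost:9200/apache-error-2019.07.21"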

If I run this command:

curl -XGET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

I see a lot of CLUSTER_RECOVERED as the unassigned.reason:

apache-error-2019.07.21         0 p UNASSIGNED   CLUSTER_RECOVERED
apache-error-2019.07.21         0 r UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   1 p UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   1 r UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   3 p UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   3 r UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   4 p UNASSIGNED   CLUSTER_RECOVERED
app-webframework-6-2019.04.13   4 r UNASSIGNED   CLUSTER_RECOVERED
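
A quick per-reason count can be obtained by reusing the _cat/shards columns from the command above, for example:

curl -s "localhost:9200/_cat/shards?h=state,unassigned.reason" | grep UNASSIGNED | sort | uniq -c | sort -rn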

The following command:

curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"

gives

{
  "index" : "apache-access-2020.06.03",
  "shard" : 1,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2020-07-07T11:35:23.861Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "mPGPZUq5TrmM32Rc1SXYSQ",
      "node_name" : "mPGPZUq",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "67554115584",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[apache-access-2020.06.03][1], node[mPGPZUq5TrmM32Rc1SXYSQ], [P], s[STARTED], a[id=9dURKIHSTCC6USinN8gJlQ]]"
        },
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of incoming shard recoveries [4], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    }
  ]
}
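
Reading that output on a single-node cluster (the health output above shows number_of_nodes is 1): the same_shard decider means replica shards can never be assigned anywhere, and the throttling decider just means recovery of the remaining primaries is queued behind the ones already in flight. If replicas are not needed on a single node, a rough sketch would be to drop them and, optionally, let more recoveries run in parallel; endpoints and settings as in Elasticsearch 6.x, adjust to your setup:

# replicas cannot be allocated on a one-node cluster, so remove them
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'

# optionally allow more concurrent recoveries while the primaries initialize (transient setting)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.node_concurrent_recoveries": 4}}'

Dropping replicas removes the replica half of the unassigned shards immediately; the remaining primaries still have to initialize on their own.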
