Elasticsearch continually fails after upgrade

I recently upgraded from 6.3.2 to 6.7.1 and everything initially worked as expected. Now, though, Elasticsearch fails shortly after I restart it, and the Kibana page reports it isn't ready. I'm not exactly sure where the best place to start is, so I ran the cluster health command:
[root@ip-10-0-1-207 ec2-user]# curl -XGET 'http://localhost:9200/_cluster/health'
{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":1053,"active_shards":1053,"relocating_shards":0,"initializing_shards":4,"unassigned_shards":1367,"delayed_unassigned_shards":0,"number_of_pending_tasks":9,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":47898,"active_shards_percent_as_number":43.44059405940594}
[root@ip-10-0-1-207 ec2-user]#
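
I'm thinking the allocation explain API might show why so many shards are unassigned; is a call like this (same localhost:9200 endpoint as above) the right way to check?

# With no request body, this explains the first unassigned shard it finds.
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'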

Any ideas?

What do the logs show?

That's a lot of shards for a single node. How many indices? How much data?

This system is a single-host ELK setup. The clients are 75% Red Hat Linux 7.6 instances and 25% Windows Server 2016, all hosted in AWS. We collect several kinds of data via various Beats: Filebeat, Auditbeat, Metricbeat, Winlogbeat, etc. There are currently a lot of indices listed, and I'm wondering if I should purge those indices and start over.

Attached are the logs that I've found:
Jun 12 19:30:20 ip-10-0-1-207 kibana: {"type":"log","@timestamp":"2019-06-12T23:30:20Z","tags":["error","task_manager"],"pid":12326,"message":"Failed to poll for work: [search_phase_execution_exception] all shards failed :: {"path":"/.kibana_task_manager/_doc/_search","query":{"ignore_unavailable":true},"body":"{\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":\"task\"}},{\"bool\":{\"must\":[{\"terms\":{\"task.taskType\":[\"maps_telemetry\",\"vis_telemetry\"]}},{\"range\":{\"task.attempts\":{\"lte\":3}}},{\"range\":{\"task.runAt\":{\"lte\":\"now\"}}},{\"range\":{\"kibana.apiVersion\":{\"lte\":1}}}]}}]}},\"size\":10,\"sort\":{\"task.runAt\":{\"order\":\"asc\"}},\"seq_no_primary_term\":true}","statusCode":503,"response":"{\"error\":{\"root_cause\":,\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":},\"status\":503}"}"}
Jun 12 19:30:23 ip-10-0-1-207 kibana: {"type":"log","@timestamp":"2019-06-12T23:30:23Z","tags":["error","task_manager"],"pid":12326,"message":"Failed to poll for work: [search_phase_execution_exception] all shards failed :: {"path":"/.kibana_task_manager/_doc/_search","query":{"ignore_unavailable":true},"body":"{\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":\"task\"}},{\"bool\":{\"must\":[{\"terms\":{\"task.taskType\":[\"maps_telemetry\",\"vis_telemetry\"]}},{\"range\":{\"task.attempts\":{\"lte\":3}}},{\"range\":{\"task.runAt\":{\"lte\":\"now\"}}},{\"range\":{\"kibana.apiVersion\":{\"lte\":1}}}]}}]}},\"size\":10,\"sort\":{\"task.runAt\":{\"order\":\"asc\"}},\"seq_no_primary_term\":true}","statusCode":503,"response":"{\"error\":{\"root_cause\":,\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":},\"status\":503}"}"}
Jun 12 19:30:26 ip-10-0-1-207 kibana: {"type":"log","@timestamp":"2019-06-12T23:30:26Z","tags":["error","task_manager"],"pid":12326,"message":"Failed to poll for work: [search_phase_execution_exception] all shards failed :: {"path":"/.kibana_task_manager/_doc/_search","query":{"ignore_unavailable":true},"body":"{\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":\"task\"}},{\"bool\":{\"must\":[{\"terms\":{\"task.taskType\":[\"maps_telemetry\",\"vis_telemetry\"]}},{\"range\":{\"task.attempts\":{\"lte\":3}}},{\"range\":{\"task.runAt\":{\"lte\":\"now\"}}},{\"range\":{\"kibana.apiVersion\":{\"lte\":1}}}]}}]}},\"size\":10,\"sort\":{\"task.runAt\":{\"order\":\"asc\"}},\"seq_no_primary_term\":true}","statusCode":503,"response":"{\"error\":{\"root_cause\":,\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":},\"status\":503}"}"}

[2019-06-12T19:29:05,544][WARN ][logstash.outputs.elasticsearch] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://localhost:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://localhost:9200/][Manticore::SocketException] Connection refused (Connection refused)"}
[2019-06-12T19:29:10,556][WARN ][logstash.outputs.elasticsearch] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://localhost:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://localhost:9200/][Manticore::SocketException] Connection refused (Connection refused)"}
[2019-06-12T19:29:15,621][WARN ][logstash.outputs.elasticsearch] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://localhost:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::BadResponseCodeError, :error=>"Got response code '503' contacting Elasticsearch at URL 'http://localhost:9200/'"}
[2019-06-12T19:29:20,670][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://localhost:9200/"}

Jun 12 18:25:29 ip-10-0-1-207.ec2.internal systemd[1]: Started Elasticsearch.
Jun 12 19:15:28 ip-10-0-1-207.ec2.internal systemd[1]: elasticsearch.service: main process exited, code=exited, status=127/n/a
Jun 12 19:15:28 ip-10-0-1-207.ec2.internal systemd[1]: Unit elasticsearch.service entered failed state.
Jun 12 19:15:28 ip-10-0-1-207.ec2.internal systemd[1]: elasticsearch.service failed.
Jun 12 19:28:48 ip-10-0-1-207.ec2.internal systemd[1]: Started Elasticsearch.

I ran the following command and it lists 307 indices (101 of which contain data, ranging from 50 KB to 900 MB):
curl -X GET 'http://localhost:9200/_cat/indices?v'
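
If a breakdown of shard states helps, I can also run something like this against the same endpoint (it just counts shards by state):

# Counts STARTED / INITIALIZING / UNASSIGNED shards across the cluster.
curl -s 'http://localhost:9200/_cat/shards?h=state' | sort | uniq -c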

Please format your code/logs/config using the </> button, or markdown-style backticks. It makes things easier to read, which helps us help you. 🙂

Ok, you definitely need to reduce your shard count. Look at using the _shrink API to do that, or use the _reindex API to "merge" your daily indices into monthly ones.
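
Roughly, a _reindex merge looks like this; the index names below are just placeholders, so swap in whatever _cat/indices actually shows for you:

# Merge all June 2019 daily filebeat indices into one monthly index.
# The names/patterns here are examples only.
curl -XPOST 'http://localhost:9200/_reindex?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "source": { "index": "filebeat-6.7.1-2019.06.*" },
  "dest":   { "index": "filebeat-6.7.1-2019.06" }
}'
# The shard count only drops once the old daily indices are deleted,
# so verify the monthly index first, then remove the dailies.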

I'm pretty new to ELK, so I'll have to look into using the _shrink API.
I want to do something basic, which I thought could be accomplished with a single index, or maybe a single index per Beat type. I'm not yet sure where all the indices came from or what led to such a high shard count; I didn't configure a shard count anywhere. If I do a complete reinstall, is there a way for me to set the shard count?

Have a look at Kibana only display logo no content -elasticsearch error shard - #11 by warkolm to help with reindexing.

Indices are where the data is stored; they are made up of shards. In the 6.X releases we defaulted to 5 primary shards per index with 1 replica of each, so a total of 10 shards per index. We changed that in 7.X to 1 primary per index with 1 replica (i.e. 2 shards in total).
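
For example, if you stay on 6.X for now, an index template along these lines (the template name and pattern are just examples) makes any new filebeat-* index get 1 primary shard and 1 replica instead of the old default of 5 and 1:

# Example template; it only affects indices created after it is installed.
curl -XPUT 'http://localhost:9200/_template/filebeat-shards?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["filebeat-*"],
  "order": 1,
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'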

No need to do a complete reinstall! 🙂 Though I would suggest moving to the latest 7.X release, as it reduces the burden here.

Otherwise you can set the shard count in the Filebeat config file (filebeat.yml); see the sketch below.
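
A minimal sketch, assuming a default package install with the config at /etc/filebeat/filebeat.yml and no existing setup.template.settings block:

# Tell Filebeat to load its index template with 1 primary shard and
# 1 replica, then push the updated template to Elasticsearch.
cat >> /etc/filebeat/filebeat.yml <<'EOF'
setup.template.settings:
  index.number_of_shards: 1
  index.number_of_replicas: 1
EOF
filebeat setup --template

As with the template example above, this only applies to newly created indices; existing ones keep their current shard count.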

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.