Elasticsearch continually fails after upgrade

I recently upgraded from 6.3.2 to 6.7.1 and everything initially worked as expected. Now, though, Elasticsearch fails shortly after I restart it, and the Kibana page reports it isn't ready. I'm not exactly sure where the best place to start is, so I ran the cluster health command:
[root@ip-10-0-1-207 ec2-user]# curl -XGET 'http://localhost:9200/_cluster/health'
{"cluster_name":"elasticsearch","status":"red","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":1053,"active_shards":1053,"relocating_shards":0,"initializing_shards":4,"unassigned_shards":1367,"delayed_unassigned_shards":0,"number_of_pending_tasks":9,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":47898,"active_shards_percent_as_number":43.44059405940594}
[root@ip-10-0-1-207 ec2-user]#
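
I'm thinking the allocation explain API might show why so many shards are unassigned; is a call like this (same localhost:9200 endpoint as above) the right way to check?

# With no request body, this explains the first unassigned shard it finds.
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'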

Any ideas?

What do the logs show?

That's a lot of shards for a single node. How many indices? How much data?

This system is a single-host ELK setup. The clients are 75% Red Hat Linux 7.6 instances and 25% Windows Server 2016, all hosted in AWS. We collect several kinds of data via various Beats: Filebeat, Auditbeat, Metricbeat, Winlogbeat, etc. There are currently a lot of indices listed, and I'm wondering if I should purge those indices and start over.

Attached are the logs that I've found:
Jun 12 19:30:20 ip-10-0-1-207 kibana: {"type":"log","@timestamp":"2019-06-12T23:30:20Z","tags":["error","task_manager"],"pid":12326,"message":"Failed to poll for work: [search_phase_execution_exception] all shards failed :: {"path":"/.kibana_task_manager/_doc/_search","query":{"ignore_unavailable":true},"body":"{\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":\"task\"}},{\"bool\":{\"must\":[{\"terms\":{\"task.taskType\":[\"maps_telemetry\",\"vis_telemetry\"]}},{\"range\":{\"task.attempts\":{\"lte\":3}}},{\"range\":{\"task.runAt\":{\"lte\":\"now\"}}},{\"range\":{\"kibana.apiVersion\":{\"lte\":1}}}]}}]}},\"size\":10,\"sort\":{\"task.runAt\":{\"order\":\"asc\"}},\"seq_no_primary_term\":true}","statusCode":503,"response":"{\"error\":{\"root_cause\":,\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":},\"status\":503}"}"}
Jun 12 19:30:23 ip-10-0-1-207 kibana: {"type":"log","@timestamp":"2019-06-12T23:30:23Z","tags":["error","task_manager"],"pid":12326,"message":"Failed to poll for work: [search_phase_execution_exception] all shards failed :: {"path":"/.kibana_task_manager/_doc/_search","query":{"ignore_unavailable":true},"body":"{\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":\"task\"}},{\"bool\":{\"must\":[{\"terms\":{\"task.taskType\":[\"maps_telemetry\",\"vis_telemetry\"]}},{\"range\":{\"task.attempts\":{\"lte\":3}}},{\"range\":{\"task.runAt\":{\"lte\":\"now\"}}},{\"range\":{\"kibana.apiVersion\":{\"lte\":1}}}]}}]}},\"size\":10,\"sort\":{\"task.runAt\":{\"order\":\"asc\"}},\"seq_no_primary_term\":true}","statusCode":503,"response":"{\"error\":{\"root_cause\":,\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":},\"status\":503}"}"}
Jun 12 19:30:26 ip-10-0-1-207 kibana: {"type":"log","@timestamp":"2019-06-12T23:30:26Z","tags":["error","task_manager"],"pid":12326,"message":"Failed to poll for work: [search_phase_execution_exception] all shards failed :: {"path":"/.kibana_task_manager/_doc/_search","query":{"ignore_unavailable":true},"body":"{\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":\"task\"}},{\"bool\":{\"must\":[{\"terms\":{\"task.taskType\":[\"maps_telemetry\",\"vis_telemetry\"]}},{\"range\":{\"task.attempts\":{\"lte\":3}}},{\"range\":{\"task.runAt\":{\"lte\":\"now\"}}},{\"range\":{\"kibana.apiVersion\":{\"lte\":1}}}]}}]}},\"size\":10,\"sort\":{\"task.runAt\":{\"order\":\"asc\"}},\"seq_no_primary_term\":true}","statusCode":503,"response":"{\"error\":{\"root_cause\":,\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":},\"status\":503}"}"}

[2019-06-12T19:29:05,544][WARN ][logstash.outputs.elasticsearch] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://localhost:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://localhost:9200/][Manticore::SocketException] Connection refused (Connection refused)"}
[2019-06-12T19:29:10,556][WARN ][logstash.outputs.elasticsearch] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://localhost:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://localhost:9200/][Manticore::SocketException] Connection refused (Connection refused)"}
[2019-06-12T19:29:15,621][WARN ][logstash.outputs.elasticsearch] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://localhost:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::BadResponseCodeError, :error=>"Got response code '503' contacting Elasticsearch at URL 'http://localhost:9200/'"}
[2019-06-12T19:29:20,670][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://localhost:9200/"}

Jun 12 18:25:29 ip-10-0-1-207.ec2.internal systemd[1]: Started Elasticsearch.
Jun 12 19:15:28 ip-10-0-1-207.ec2.internal systemd[1]: elasticsearch.service: main process exited, code=exited, status=127/n/a
Jun 12 19:15:28 ip-10-0-1-207.ec2.internal systemd[1]: Unit elasticsearch.service entered failed state.
Jun 12 19:15:28 ip-10-0-1-207.ec2.internal systemd[1]: elasticsearch.service failed.
Jun 12 19:28:48 ip-10-0-1-207.ec2.internal systemd[1]: Started Elasticsearch.

I ran the following command and it lists 307 indices (101 of which contain data, ranging from 50 KB to 900 MB):
curl -X GET 'http://localhost:9200/_cat/indices?v'
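
If a breakdown of shard states helps, I can also run something like this against the same endpoint (it just counts shards by state):

# Counts STARTED / INITIALIZING / UNASSIGNED shards across the cluster.
curl -s 'http://localhost:9200/_cat/shards?h=state' | sort | uniq -c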

Please format your code/logs/config using the </> button, or markdown-style backticks. It makes things easier to read, which helps us help you. 🙂

Ok, you definitely need to reduce your shard count. Look at using the _shrink API to do that, or use the _reindex API to "merge" your daily indices into monthly ones.
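
Roughly, a _reindex merge looks like this; the index names below are just placeholders, so swap in whatever _cat/indices actually shows for you:

# Merge all June 2019 daily filebeat indices into one monthly index.
# The names/patterns here are examples only.
curl -XPOST 'http://localhost:9200/_reindex?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "source": { "index": "filebeat-6.7.1-2019.06.*" },
  "dest":   { "index": "filebeat-6.7.1-2019.06" }
}'
# The shard count only drops once the old daily indices are deleted,
# so verify the monthly index first, then remove the dailies.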

I'm pretty new to ELK, so I'll have to look into using the _shrink API.
I want to do something basic, which I thought could be accomplished with a single index, or maybe a single index per Beat type. I'm not yet sure where all the indices came from or what led to such a high shard count; I didn't configure a shard count anywhere. If I do a complete reinstall, is there a way for me to set the shard count?

Have a look at Kibana only display logo no content -elasticsearch error shard - #11 by warkolm to help with reindexing.

Indices are where the data is stored; they are made up of shards. In the 6.X releases we defaulted to 5 primary shards per index with 1 replica of each, so a total of 10 shards per index. We changed that in 7.X to 1 primary per index with 1 replica (i.e. 2 shards in total).
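
For example, if you stay on 6.X for now, an index template along these lines (the template name and pattern are just examples) makes any new filebeat-* index get 1 primary shard and 1 replica instead of the old default of 5 and 1:

# Example template; it only affects indices created after it is installed.
curl -XPUT 'http://localhost:9200/_template/filebeat-shards?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["filebeat-*"],
  "order": 1,
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'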

No need to do a complete reinstall! 🙂 Though I would suggest moving to the latest 7.X release, as it reduces the burden here.

Otherwise you can set the shard count in the Filebeat config file (filebeat.yml); see the sketch below.
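
A minimal sketch, assuming a default package install with the config at /etc/filebeat/filebeat.yml and no existing setup.template.settings block:

# Tell Filebeat to load its index template with 1 primary shard and
# 1 replica, then push the updated template to Elasticsearch.
cat >> /etc/filebeat/filebeat.yml <<'EOF'
setup.template.settings:
  index.number_of_shards: 1
  index.number_of_replicas: 1
EOF
filebeat setup --template

As with the template example above, this only applies to newly created indices; existing ones keep their current shard count.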

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.