Unassigned shards, crashed cluster recovery


(WARREN WEEDER) #1

Hello,

I am trying to get a production ELK stack working again. It is stuck on 'indexing' in kibana. I do not care about past log data at all. I just want it to work again for new data. Here are the details:

I recently had this ELK stack dropped in my lap to support; I did not set it up. It runs one elasticsearch master (es-master) and two data nodes (es-data01, es-data02), and also includes kafka.

This stack was already broken when I got into it. One of the elasticsearch nodes, es-data02, had been close to running out of disk, and another person had expanded the volume it was on. Elasticsearch was then unable to start on that node. That is all the history I have.

After some basic troubleshooting, I found that the entire /var/lib/elasticsearch/nodes directory was missing, which prevented elasticsearch from starting. I created a new empty node directory and it was able to start. However, kibana remains stuck in red and 'indexing'.

On logstash, I found it was throwing 'unavailable_shards_exception' errors for two indexes: vpc-logs and eb-logs.

I found that all shards for all indexes are in an UNASSIGNED state. I attempted to force-allocate shards for vpc-logs and eb-logs using the _cluster/reroute API with "allow_primary": true, but I received an "Unknown AllocationCommand [allocate]" response, and found a thread suggesting this functionality has been removed (https://github.com/elastic/elasticsearch/issues/18819).

I looked at the docs for Shard Allocation Filtering as an alternative but it is not clear to me how I could use this to solve my problem.

Please help me out. I just want kibana back into a working state with the same indexes it already has. I don't care at all about past log data, or which nodes are involved with any given shard.

I looked at _cat/shards and every single shard for every index is as follows:
UNASSIGNED CLUSTER_RECOVERED
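For reference, that state can be tallied with a small helper (a sketch; it assumes the default `_cat/shards` column order of index, shard, prirep, state, and the master address from this thread):

```shell
# Count shards per state from `_cat/shards` output on stdin.
# Assumes the default column order: index shard prirep state ...
shard_state_counts() {
  awk '{count[$4]++} END {for (s in count) printf "%s %d\n", s, count[s]}'
}

# Usage against the cluster:
# curl -s '10.10.53.10:9200/_cat/shards' | shard_state_counts
```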


root@es-master-1:/work# curl -XGET 10.10.53.10:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
red open cloudtrail-logs Lpuh176DRaSlepqZaMEkWA 5 1
red open eb-logs-1 NVGrsqPbQMSZHd38i1KWZA 5 1
red open kinesis-test1 el4fwJtvTPuG3EAAmglyEQ 1 1
red open .kibana ccysz1KaT2GMS1AJb-hzRw 1 1
red open s3-logs 8rRbxB7gSf6moU9xNtDTnw 2 1
red open elastalert_status wrPiv0IWThaen_xZZhCUmg 5 1
red open vpc-logs mfYgK5CxT3eoCl7X-UkZ4A 5 1
red open .watches t9BkX6n4TBmidPytJGVGYQ 1 1
red open elb-logs PfTxzluiSwuUUKxRhsfSHg 5 1


ubuntu@es-data-02:/$ curl -XGET 10.10.30.143:9200/_cluster/health?pretty
{
"cluster_name" : "jdp-production",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 60,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 0.0
}

From Kibana (permanently stuck state):

ui settings: Elasticsearch plugin is red
plugin:kibana@5.6.3 Ready
plugin:elasticsearch@5.6.3 Elasticsearch is still initializing the kibana index.
plugin:console@5.6.3 Ready
plugin:metrics@5.6.3 Ready
plugin:timelion@5.6.3 Ready

I also should have added:
curl -XGET 10.10.53.10:9200
{
"name" : "es-master-01",
"cluster_name" : "companyname-production",
"cluster_uuid" : "9v9dm7wIQJe53O4GWWYqkg",
"version" : {
"number" : "5.6.3",
"build_hash" : "1a2f265",
"build_date" : "2017-10-06T20:33:39.012Z",
"build_snapshot" : false,
"lucene_version" : "6.6.1"
},
"tagline" : "You Know, for Search"
}


(WARREN WEEDER) #2

Would deleting the indexes that logstash complains about be a viable option? I don't mind all index data being lost, but I wouldn't want the index to disappear altogether unless it would be automatically recreated somehow. If it is recreated empty, and just has to get new data added over time, that would be fine for me.


(andy_zhou) #3

Please show the cluster settings:
_cluster/settings


(Mark Walkom) #4

You can delete the missing data if you want; that will remove the indices but keep the template, if one exists, which will apply to any new indices.

You should check the mappings and templates first though :slight_smile:
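If you do go the deletion route, here is a sketch of how it could be scripted (the master address is the one from earlier in the thread; the awk field positions assume the default `_cat/indices` column order of health, status, index):

```shell
# Print the names of indices that `_cat/indices` reports as red.
# Assumes the default column order: health status index uuid ...
red_indices() {
  awk '$1 == "red" {print $3}'
}

# Usage (deletes every red index; any matching templates re-apply
# when logstash recreates the indices):
# curl -s '10.10.53.10:9200/_cat/indices' | red_indices | \
#   while read -r idx; do
#     curl -XDELETE "10.10.53.10:9200/$idx"
#   done
```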


(WARREN WEEDER) #5

curl -XGET 10.10.53.10:9200/_cluster/settings
{"persistent":{},"transient":{}}


(WARREN WEEDER) #6

Could you give some more details on what this process would look like? I would like to end with all the same indices existing with any current configuration in place, but I don't care if all data is lost. Just want to get this thing rolling again.


(andy_zhou) #7

Show the disk use.
Then change to this and try restarting the cluster:
curl -XPUT 'XXX.XXX.XXX.XXX:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
  "persistent" : {
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.rebalance.enable": "all"
  },
  "transient" : {
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.rebalance.enable": "all"
  }
}
'


(Mark Walkom) #8

GET _mapping
And
GET _template


(WARREN WEEDER) #9

Thanks for the help everyone.

I eventually got a handle on this by using allocate_empty_primary with _cluster/reroute for each shard of each index, as listed by _cat/shards:

curl -XPOST '10.10.53.10:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d '{
  "commands" : [
    {
      "allocate_empty_primary" : {
        "index" : "indexName",
        "shard" : 1,
        "node" : "es-data-02",
        "accept_data_loss" : true
      }
    }
  ]
}'
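Repeating that for every shard of every index can be automated with a sketch like the following (the node name es-data-02 and master address are the ones from this thread; replicas are skipped and left for the cluster to allocate on its own):

```shell
# Build one allocate_empty_primary reroute body per unassigned primary
# shard, reading `_cat/shards` output (index shard prirep state ...).
build_reroute_bodies() {
  awk '$3 == "p" && $4 == "UNASSIGNED" {
    printf "{\"commands\":[{\"allocate_empty_primary\":{\"index\":\"%s\",\"shard\":%s,\"node\":\"es-data-02\",\"accept_data_loss\":true}}]}\n", $1, $2
  }'
}

# Usage against the live cluster:
# curl -s '10.10.53.10:9200/_cat/shards' | build_reroute_bodies | \
#   while read -r body; do
#     curl -XPOST '10.10.53.10:9200/_cluster/reroute?pretty' \
#          -H 'Content-Type: application/json' -d "$body"
#   done
```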


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.