Fresh cluster: all shards are unavailable

Hi,

When my cluster is started, its status is yellow:

{
  "cluster_name" : "datawarehouse",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 0,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 9,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 0.0
}

Also, in parallel I check the Kibana logs (because the service can't start); the error is:

{"type":"log","@timestamp":"2019-12-02T15:58:39Z","tags":["security","error"],"pid":6,"message":"Error registering Kibana Privileges with Elasticsearch for kibana-.kibana: [unavailable_shards_exception] at least one primary shard for the index [.security-7] is unavailable"}

I check my shards:

curl -k -u "elastic:xxxxx" "https://datawarehouse-es-http:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" --silent
.security-7 0 p UNASSIGNED INDEX_CREATED
.kibana_task_manager_1 0 p UNASSIGNED INDEX_CREATED
.kibana_task_manager_1 0 r UNASSIGNED INDEX_CREATED
.kibana_1 0 p UNASSIGNED INDEX_CREATED
.kibana_1 0 r UNASSIGNED INDEX_CREATED
.apm-agent-configuration 0 p UNASSIGNED INDEX_CREATED
.apm-agent-configuration 0 r UNASSIGNED INDEX_CREATED

Then I check the cluster allocation:

curl -k -u "elastic:xxxxx" "https://datawarehouse-es-http:9200/_cluster/allocation/explain?pretty"
{
  "index" : ".kibana_1",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2019-12-02T15:34:31.969Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes"
}

I also found this error in the Elasticsearch logs: org.elasticsearch.action.UnavailableShardsException: at least one primary shard for the index [.security-7] is unavailable

I check the cluster settings:

curl -k -u "elastic:xxxxx" https://datawarehouse-es-http:9200/_cluster/settings?pretty
{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_name" : "none_excluded"
          }
        }
      }
    }
  }
}
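For what it's worth, that exclude entry with _name set to none_excluded does not look like it excludes any real node (it appears to be the placeholder ECK writes when nothing is excluded). If a genuine exclusion were ever the culprit, it could be cleared by setting it back to null, roughly like this (a sketch only, reusing the host and credentials from the commands above):

curl -k -u "elastic:xxxxx" -X PUT "https://datawarehouse-es-http:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{ "transient": { "cluster.routing.allocation.exclude._name": null } }'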

ECK version: 1.0.0-beta
Elasticsearch version: 7.4.2
Elasticsearch config: eck elastic config · GitHub (it's the config of one node)
cluster: 3 master & 3 data nodes

I tried to boot a new cluster; the problem persists.

If someone can help me understand / fix the issue, it would be awesome.

What is the status of the ES resources? kubectl get elasticsearch (or describe)

Also, the explain API can provide useful information for why shards are not allocated:

https://www.elastic.co/guide/en/elasticsearch/reference/6.0/cluster-allocation-explain.html#_explain_api_response
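For example, to ask about one specific shard from the output above, the explain API also accepts a request body (host and credentials as in the earlier commands, so adjust as needed):

curl -k -u "elastic:xxxxx" "https://datawarehouse-es-http:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "index": ".kibana_1", "shard": 0, "primary": true }'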

@dg_hivebrite are you using PersistentVolumes? Can you share the Elasticsearch YAML manifest?
I'm wondering if one of the volumes hosting your data has been lost.

@Anya_Sabo here is the kubectl get elasticsearch output:

NAME HEALTH NODES VERSION PHASE AGE
datawarehouse yellow 6 7.4.2 Ready 19h

And the describe: es_decribe.yaml · GitHub

@sebgl
Yes, I'm using persistent volumes. Actually the cluster is managed in GKE.
The output of my PVs:

NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-022edc16-1519-11ea-9b78-4201c0a8000a 10Gi RWO Delete Bound default/elasticsearch-data-datawarehouse-es-master-europe-west1-a-0 standard 19h
pvc-02dc04aa-1519-11ea-9b78-4201c0a8000a 10Gi RWO Delete Bound default/elasticsearch-data-datawarehouse-es-master-europe-west1-b-0 standard 19h
pvc-03745ee2-1519-11ea-9b78-4201c0a8000a 10Gi RWO Delete Bound default/elasticsearch-data-datawarehouse-es-master-europe-west1-c-0 standard 19h
pvc-03ed030a-1519-11ea-9b78-4201c0a8000a 10Gi RWO Delete Bound default/elasticsearch-data-datawarehouse-es-data-europe-west1-a-0 standard 19h
pvc-04639049-1519-11ea-9b78-4201c0a8000a 10Gi RWO Delete Bound default/elasticsearch-data-datawarehouse-es-data-europe-west1-b-0 standard 19h
pvc-04da5b0d-1519-11ea-9b78-4201c0a8000a 10Gi RWO Delete Bound default/elasticsearch-data-datawarehouse-es-data-europe-west1-c-0 standard 19h

Do you want the YAML file before it is sent to Kubernetes, or the output of a Kubernetes object?

I don't understand why there are no nodes in the allocation explain call:

curl "https://datawarehouse-es-http:9200/_cluster/alloc
ation/explain?pretty&include_disk_info=true&include_yes_decisions=true"
{
"index" : ".kibana_1",
"shard" : 0,
"primary" : true,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "INDEX_CREATED",
"at" : "2019-12-02T15:34:31.969Z",
"last_allocation_status" : "no_attempt"
},
"cluster_info" : {
"nodes" : { },
"shard_sizes" : { },
"shard_paths" : { }
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes"
}

In the ES config I can see:

cluster.routing.allocation.awareness.attributes:  all

The value here should match one of the existing node attributes. Based on the rest of the Elasticsearch spec I can see you're using the attribute zone to distinguish groups of nodes.
You probably need to change the configuration to:

cluster.routing.allocation.awareness.attributes:  zone
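In other words, each nodeSet's config should pair the attribute it reports with the awareness setting that references it, roughly like this (a sketch; the attribute name and zone value are taken from this thread):

      node.attr.zone:                                   europe-west1-a
      cluster.routing.allocation.awareness.attributes:  zone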

Ok, so I tried the change; it doesn't change anything.
Then I trashed my cluster and booted a new one without cluster.routing.allocation.awareness.attributes; the cluster is yellow with the same issue.

Looking at your cluster again: can you double check it has at least one data node?
I can see 3 master nodes, and another master node with:

 Config:
      cluster.routing.allocation.awareness.attributes:  all
      node.attr.zone:                                   europe-west1-a
      node.data:                                        false
      node.master:                                      true
    Count:                                              1
    Name:                                               data-europe-west1-a

which I guess was intended to have node.data: true looking at its name?
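For reference, a sketch of what the corrected config for that nodeSet could look like, in the same format as the describe output above. Flipping node.data to true is the actual fix; the awareness attribute set to zone follows the earlier suggestion, and node.master: false is only an assumption since the cluster already has three dedicated master nodes:

 Config:
      cluster.routing.allocation.awareness.attributes:  zone
      node.attr.zone:                                   europe-west1-a
      node.data:                                        true
      node.master:                                      false
    Count:                                              1
    Name:                                               data-europe-west1-a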

Oh yes, good catch, that was the issue. Thank you very much :slight_smile: