Unassigned shards in a 10-node cluster?

Hi,

I have an ELK setup with a 10-node Elasticsearch cluster (1 master, 1 coordinator/ingest node and 8 data nodes). It all works fine, but today I noticed that the cluster health is yellow and that there are unassigned shards. Why is this happening when there are plenty of nodes to replicate to? One of them is a replica of the .security index, which is only 27 KB in size.

    curl 0.0.0.0:9200/_cluster/health?pretty
{
  "cluster_name" : "elasticsearch-prod",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 10,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 247,
  "active_shards" : 498,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.6
}

    curl 0.0.0.0:9200/_cat/nodes?v   
ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.0.5            20          96   2    1.28    1.36     1.11 d         -      94cfee8e7588
10.0.0.9            54          99   3    1.07    0.64     0.46 d         -      ddf883c874de
10.0.0.34           40         100   5    3.12    3.50     4.03 d         -      0dce58fc735d
10.0.0.35           37         100   4    3.12    3.50     4.03 d         -      1597137362a5
10.0.0.33           71         100   7    3.12    3.50     4.03 d         -      cb7432ff0db2
10.0.0.39           33         100   7    3.12    3.50     4.03 d         -      bc0ef1d00d00
10.0.0.3            52          96   2    1.28    1.36     1.11 d         -      0c192ecb6a47
10.0.0.7            13          99   7    1.07    0.64     0.46 m         *      60bae37cb34c
10.0.0.37           53         100   5    3.12    3.50     4.03 d         -      3e969e0bf0a0
10.0.0.38            8         100   5    3.12    3.50     4.03 i         -      6d6defe36a72


    curl 0.0.0.0:9200/_cat/shards?v | grep UNASS
.security                       0     r      UNASSIGNED
prod-web-2016.12.21             2     r      UNASSIGNED

    curl 0.0.0.0:9200/_cat/indices?v | grep yellow
yellow open   prod-web-2016.12.21             m8vhFKhdSXmCITNaFH06mQ   5   1    5308060            0      5.5gb            3gb
yellow open   .security                       La_kerbmRAaMShzZVT5h3g   1   7          8            0    191.6kb         27.3kb
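
For anyone hitting the same thing: the cluster allocation explain API (available since 5.0) can report exactly why a shard stays unassigned. A minimal sketch against this cluster (without a request body it explains the first unassigned shard it finds):

    curl -XGET '0.0.0.0:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '{
      "index": ".security",
      "shard": 0,
      "primary": false
    }'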

What version are you on?

You should have at least 3 master-eligible nodes.
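
With three master-eligible nodes you would also set discovery.zen.minimum_master_nodes to 2, so that a master can only be elected with a quorum and a network partition can't produce two masters. A sketch using the dynamic cluster settings API, assuming three master-eligible nodes:

    curl -XPUT '0.0.0.0:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
      "persistent": { "discovery.zen.minimum_master_nodes": 2 }
    }'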

Everything is running version 5.1.1. Isn't only one master active at any given time? If I had 3, the other two would just be master-eligible standbys (like a hot backup). I don't see how this could influence the unassigned shards.

Please also note that all nodes run inside Docker in swarm mode, with volumes mounted outside of the containers. That is why I have only one master: if it fails, Docker Swarm will spawn a new one using the same external volume. However, the master wasn't down when the problem with the unassigned shards occurred.

The storage is a SAN mounted over iSCSI, and in the master logs I see the following, which pretty much shows that the file got corrupted:
[2016-12-27T08:05:37,424][WARN ][o.e.c.a.s.ShardStateAction] [60bae37cb34c] [.security][0] received shard failed for shard id [[.security][0]], allocation id [2y7uMXaOSX6NXz2FsTb5Xw], primary term [0], message [failed to create shard], failure [FileSystemException[/usr/share/elasticsearch/data/nodes/0/indices/La_kerbmRAaMShzZVT5h3g/0: Structure needs cleaning]]

I knew there might be some latency issues with the SAN, which would be tolerable, but corrupting the filesystem is definitely a problem.
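
For reference, "Structure needs cleaning" is the ext4 error for on-disk corruption, so the underlying volume needs an fsck. Elasticsearch 5.x also stops retrying an allocation after index.allocation.max_retries failed attempts (5 by default); once the filesystem is repaired, the failed allocations can be retried manually, if I understand the 5.x API correctly, with something like:

    curl -XPOST '0.0.0.0:9200/_cluster/reroute?retry_failed=true&pretty'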
