Unassigned shards due to FailedNodeException

Started a couple of simple 5.2.0 clusters of three identical nodes, where
each node is master-eligible and holds data. There are no restarts or
rebalancing going on, and the data set is relatively small.

The indexing process for a new index is the standard one: set the number of
replicas to 0 at the start, then raise it to the proper amount when indexing
is done. The index is over-allocated with 5 primary shards for the 3 nodes,
and the number of replicas is set to the number of nodes minus 1, i.e. 2.
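
For reference, the toggle is just the index settings API. A minimal sketch
with curl (the host and index name are placeholders, not the actual cluster):

# drop replicas before the bulk load
curl -XPUT 'http://localhost:9200/myindex/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 0 } }'

# ... bulk indexing ...

# restore replicas once indexing is done (3 nodes - 1 = 2)
curl -XPUT 'http://localhost:9200/myindex/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 2 } }'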

Upon increasing the number of replicas, most shards get their replica shards
initialized and assigned, except for 2 shards. Any attempt to set the number
of replicas back to 0 and then up to 2 again leaves the same shards
unreplicated:

Nodes:
ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
ip.ip.ip.1 25           98          3   0.03    0.02    0.05     mdi              host1
ip.ip.ip.2 54           99          1   0.05    0.03    0.05     mdi              host2
ip.ip.ip.3 32           98          2   0.01    0.02    0.05     mdi              host3

Two sample indices:
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   index1 6n2fqIziSe-vMYvrI_HXYQ 5   2   9957799    0            24.2gb     9.3gb
yellow open   index2 A_VHoPMxRj-OnRCa4tqA9g 5   2   9957799    0            24.2gb     9.3gb

Shard status for these indices:
index shard prirep state docs store ip node
index2 0 p STARTED 1990082 1.8gb ip.ip.ip.1 host1
index2 0 r STARTED 1990082 1.8gb ip.ip.ip.2 host2
index2 0 r STARTED 1990082 1.8gb ip.ip.ip.3 host3
index2 1 p STARTED 1990050 1.8gb ip.ip.ip.3 host3
index2 1 r STARTED 1990050 1.8gb ip.ip.ip.2 host2
index2 1 r STARTED 1990050 1.8gb ip.ip.ip.1 host1
index2 2 p STARTED 1996938 1.8gb ip.ip.ip.2 host2
index2 2 r UNASSIGNED
index2 2 r UNASSIGNED
index2 3 p STARTED 1989843 1.8gb ip.ip.ip.1 host1
index2 3 r STARTED 1989843 1.8gb ip.ip.ip.2 host2
index2 3 r STARTED 1989843 1.8gb ip.ip.ip.3 host3
index2 4 p STARTED 1990886 1.8gb ip.ip.ip.3 host3
index2 4 r STARTED 1990886 1.8gb ip.ip.ip.2 host2
index2 4 r STARTED 1990886 1.8gb ip.ip.ip.1 host1

index1 0 p STARTED 1990082 1.8gb ip.ip.ip.1 host1
index1 0 r STARTED 1990082 1.8gb ip.ip.ip.2 host2
index1 0 r STARTED 1990082 1.8gb ip.ip.ip.3 host3
index1 1 p STARTED 1990050 1.8gb ip.ip.ip.3 host3
index1 1 r STARTED 1990050 1.8gb ip.ip.ip.2 host2
index1 1 r STARTED 1990050 1.8gb ip.ip.ip.1 host1
index1 2 p STARTED 1996938 1.8gb ip.ip.ip.2 host2
index1 2 r UNASSIGNED
index1 2 r UNASSIGNED
index1 3 p STARTED 1989843 1.8gb ip.ip.ip.1 host1
index1 3 r STARTED 1989843 1.8gb ip.ip.ip.2 host2
index1 3 r STARTED 1989843 1.8gb ip.ip.ip.3 host3
index1 4 p STARTED 1990886 1.8gb ip.ip.ip.3 host3
index1 4 r STARTED 1990886 1.8gb ip.ip.ip.2 host2
index1 4 r STARTED 1990886 1.8gb ip.ip.ip.1 host1
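
For anyone following along: the allocation explain API (available since 5.0)
can report why a specific replica stays unassigned. A sketch against one of
the stuck replicas above (index2, shard 2); the host is a placeholder:

curl -XGET 'http://localhost:9200/_cluster/allocation/explain' -H 'Content-Type: application/json' -d '
{ "index": "index2", "shard": 2, "primary": false }'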

The most relevant stack trace in the logs is:
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store
metadata for shard [[index1][1]]
at
org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:114)
~[elasticsearch-5.2.0.jar:5.2.0]
...
Caused by: java.io.FileNotFoundException: no segments* file found in
store(mmapfs(/elasticsearchdata/nodes/0/indices/6n2fqIziSe-vMYvrI_HXYQ/1/index)):
files: [recovery.AVrepWc_SlfGaVklv4AX._0.cfe,
recovery.AVrepWc_SlfGaVklv4AX._0.cfs, recovery.AVrepWc_SlfGaVklv4AX._0.si,
recovery.AVrepWc_SlfGaVklv4AX._1.cfe, recovery.AVrepWc_SlfGaVklv4AX._1.si,
recovery.AVrepWc_SlfGaVklv4AX._2.cfe, recovery.AVrepWc_SlfGaVklv4AX._2.cfs,
recovery.AVrepWc_SlfGaVklv4AX._2.si, recovery.AVrepWc_SlfGaVklv4AX._3.cfe,
recovery.AVrepWc_SlfGaVklv4AX._3.cfs, recovery.AVrepWc_SlfGaVklv4AX._3.si,
recovery.AVrepWc_SlfGaVklv4AX._4.cfe, recovery.AVrepWc_SlfGaVklv4AX._4.cfs,
recovery.AVrepWc_SlfGaVklv4AX._4.si, recovery.AVrepWc_SlfGaVklv4AX._5.cfe,
recovery.AVrepWc_SlfGaVklv4AX._5.cfs, recovery.AVrepWc_SlfGaVklv4AX._5.si,
recovery.AVrepWc_SlfGaVklv4AX._6.cfe, recovery.AVrepWc_SlfGaVklv4AX._6.cfs,
recovery.AVrepWc_SlfGaVklv4AX._6.si, recovery.AVrepWc_SlfGaVklv4AX._7.cfe,
recovery.AVrepWc_SlfGaVklv4AX._7.cfs, recovery.AVrepWc_SlfGaVklv4AX._7.si,
recovery.AVrepWc_SlfGaVklv4AX._8.cfe, recovery.AVrepWc_SlfGaVklv4AX._8.cfs,
recovery.AVrepWc_SlfGaVklv4AX._8.si, recovery.AVrepWc_SlfGaVklv4AX._9.cfe,
recovery.AVrepWc_SlfGaVklv4AX._9.cfs, recovery.AVrepWc_SlfGaVklv4AX._9.si,
recovery.AVrepWc_SlfGaVklv4AX._a.cfe, recovery.AVrepWc_SlfGaVklv4AX._a.cfs,
recovery.AVrepWc_SlfGaVklv4AX._a.si, recovery.AVrepWc_SlfGaVklv4AX._b.cfe,
recovery.AVrepWc_SlfGaVklv4AX._b.cfs, recovery.AVrepWc_SlfGaVklv4AX._b.si,
recovery.AVrepWc_SlfGaVklv4AX._c.cfe, recovery.AVrepWc_SlfGaVklv4AX._c.si,
recovery.AVrepWc_SlfGaVklv4AX._d.cfe, recovery.AVrepWc_SlfGaVklv4AX._d.cfs,
recovery.AVrepWc_SlfGaVklv4AX._d.si, recovery.AVrepWc_SlfGaVklv4AX._e.cfe,
recovery.AVrepWc_SlfGaVklv4AX._e.cfs, recovery.AVrepWc_SlfGaVklv4AX._e.si,
recovery.AVrepWc_SlfGaVklv4AX.segments_4, write.lock]
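
A note for anyone who lands on the same trace: my working assumption (not
confirmed anywhere in this thread) is that the replica recovery failed
repeatedly, left those temporary recovery.* files behind, and the allocator
gave up after index.allocation.max_retries attempts (5 by default), leaving
the shard UNASSIGNED. A manual retry of previously failed allocations can be
triggered with:

curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'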

Larger stack trace and the above outputs:

This behavior occurs consistently on one of the clusters, for every index,
while the other cluster works as expected. Both clusters are provisioned
identically.

Cheers,

Ivan

Closed in https://github.com/elastic/elasticsearch/issues/23676

Yannick, thanks for helping and closing. I was not able to comment earlier
because the mailing list does not send posts back to the sender, so I had
nothing to reply to.

Ivan
