I have a three-node ES cluster running 5.6.5. Last week I restarted the servers in the cluster to make changes to their disk configuration. I did not follow any particular process for this, but after 12 hours the cluster and indexes were back in a green state. Yesterday I restarted one server, following the rolling upgrade process but without actually upgrading the server. Indexing (Logstash) is still stopped and no documents are being added to the system.
I ended up in a state with sync issues that appears to be identical to the one described in this post. I am also running a similarly antiquated version, a situation I am working to address.
I attempted to reroute the bad index, and then to remove and re-add replicas. This produced a wealth of failures and left us in the following state:
corp_b 0 r STARTED 125766000 131.5gb 162.211.235.20 pcorplog3
corp_b 0 p STARTED 125766000 131.5gb 162.211.235.11 pcorplog2
corp_b 1 p STARTED 125746770 128.6gb 162.211.235.11 pcorplog2
corp_b 1 r UNASSIGNED
corp_b 2 p STARTED 125765953 130.9gb 162.211.235.20 pcorplog3
corp_b 2 r STARTED 125765953 130.9gb 162.211.235.10 pcorplog1
corp_b 3 r STARTED 125765360 132.1gb 162.211.235.11 pcorplog2
corp_b 3 p STARTED 125765360 132.1gb 162.211.235.10 pcorplog1
corp_b 4 r STARTED 125752747 133.2gb 162.211.235.11 pcorplog2
corp_b 4 p STARTED 125752747 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED 125771560 128gb 162.211.235.20 pcorplog3
corp_b 5 r STARTED 125771560 128gb 162.211.235.11 pcorplog2
This report, from _cat/shards, indicates that the primary for shard 1 of this index is located on pcorplog2 and that its replica is unassigned. I see the same information when querying _cat/shards on all three cluster members.
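For anyone who wants to pick the problem shard out of a listing like this without reading it by eye, here is a minimal Python sketch. It assumes the column order shown above (index, shard, prirep, state, docs, store, ip, node) and that unassigned rows stop after the state column. The names `ShardRow`, `parse_cat_shards` and `unassigned` are my own and not part of any Elasticsearch client.

```python
from typing import List, NamedTuple, Optional


class ShardRow(NamedTuple):
    index: str
    shard: int
    prirep: str          # "p" for primary, "r" for replica
    state: str           # e.g. STARTED or UNASSIGNED
    node: Optional[str]  # None when the shard is not assigned to any node


def parse_cat_shards(text: str) -> List[ShardRow]:
    """Parse whitespace-separated _cat/shards output (assumed column order)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and a possible header row (shard column not numeric).
        if len(fields) < 4 or not fields[1].isdigit():
            continue
        index, shard, prirep, state = fields[:4]
        # Assigned rows carry docs, store, ip and node after the state column.
        node = fields[-1] if len(fields) >= 8 else None
        rows.append(ShardRow(index, int(shard), prirep, state, node))
    return rows


def unassigned(rows: List[ShardRow]) -> List[ShardRow]:
    """Return only the rows whose state is UNASSIGNED."""
    return [r for r in rows if r.state == "UNASSIGNED"]


if __name__ == "__main__":
    sample = (
        "corp_b 1 p STARTED 125746770 128.6gb 162.211.235.11 pcorplog2\n"
        "corp_b 1 r UNASSIGNED\n"
    )
    for row in unassigned(parse_cat_shards(sample)):
        print(row.index, row.shard, row.prirep, row.state)
```

Run against the two shard 1 rows above, this reports only the replica of corp_b shard 1, which matches what I read from the listing.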
However, when I look at _cluster/allocation/explain?pretty I seem to get a different view of which node holds the primary of this shard. I receive failure reports from the three nodes as follows:
pcorplog1: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)
pcorplog2: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)
pcorplog3: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)
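To make the mismatch in these messages easier to compare, here is a small Python sketch that pulls the index, shard, node names and document counts out of one such line and computes the difference. It also builds the JSON body that the allocation explain API accepts for asking about one specific shard copy (`index`, `shard`, `primary`). The regular expression is based only on the three messages above, and the function names are hypothetical helpers of mine, so treat this as an illustration rather than a general parser.

```python
import json
import re
from typing import Optional

# Pattern derived from the failure messages quoted above (an assumption, not an ES format spec).
MISMATCH = re.compile(
    r"\[(?P<index>[^\[\]]+)\]\[(?P<shard>\d+)\].*?"
    r"number of docs differ: (?P<primary_docs>\d+) \((?P<primary_node>[^,]+), primary\)"
    r" vs (?P<target_docs>\d+)\s*\((?P<target_node>[^)]+)\)"
)


def parse_mismatch(message: str) -> Optional[dict]:
    """Extract shard identity and doc counts from a sync id recovery failure."""
    m = MISMATCH.search(message)
    if m is None:
        return None
    primary_docs = int(m.group("primary_docs"))
    target_docs = int(m.group("target_docs"))
    return {
        "index": m.group("index"),
        "shard": int(m.group("shard")),
        "primary_node": m.group("primary_node"),
        "target_node": m.group("target_node"),
        "primary_docs": primary_docs,
        "target_docs": target_docs,
        "delta": primary_docs - target_docs,
    }


def explain_request_body(index: str, shard: int, primary: bool) -> str:
    """JSON body for GET _cluster/allocation/explain targeting one shard copy."""
    return json.dumps({"index": index, "shard": shard, "primary": primary})


if __name__ == "__main__":
    line = (
        "IllegalStateException[try to recover [corp_b][1] from primary shard "
        "with sync id but number of docs differ: 125852748 (pcorplog3, primary) "
        "vs 125852744(pcorplog1)"
    )
    print(parse_mismatch(line))
    print(explain_request_body("corp_b", 1, False))
```

For the messages above, the two copies differ by 4 documents, and the message names pcorplog3 as the primary even though _cat/shards lists the primary on pcorplog2.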
I'm very eager to get this issue solved. Thanks in advance for any assistance you can offer. The full allocation explain output follows in a second post.