index shard prirep state docs store ip node
my-index 3 p STARTED 9514635 29.2gb IP-1 prod
my-index 3 r STARTED 9498839 28.2gb IP-2 prod
my-index 2 r STARTED 9487703 29.1gb IP-3 prod
my-index 2 p STARTED 9500567 29gb IP-2 prod
my-index 4 p STARTED 9563803 29.5gb IP-3 prod
my-index 4 r STARTED 9548596 29.4gb IP-1 prod
my-index 1 p STARTED 9547593 28.4gb IP-3 prod
my-index 1 r STARTED 9529737 29.6gb IP-1 prod
my-index 0 p STARTED 9506551 29.1gb IP-1 prod
my-index 0 r STARTED 9490168 29gb IP-2 prod
I was expecting the primary and replica of each shard to report the same number of documents.
I can also confirm this by running a match_all query multiple times and getting a different doc count each time.
This is what I've done so far as part of debugging:
Confirmed that primary doc count > replica doc count
Found that the difference in doc count between primary and replica is decreasing over time
On inspecting a particular doc, I found that the primary has the right copy of the data, implying the higher number is correct.
Tried setting the search preference to primary only and replica only (sketch below). The observation was that all the extra docs are present in the primary but not in the replica.
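For reference, this is roughly how I compared the two copies (a sketch; the index name and the es client object are placeholders):
# Run the same match_all count against only primaries, then only replicas (sketch).
primary_total = es.search(index="my-index", body={"query": {"match_all": {}}, "size": 0}, preference="_primary")["hits"]["total"]
replica_total = es.search(index="my-index", body={"query": {"match_all": {}}, "size": 0}, preference="_replica")["hits"]["total"]
print(primary_total, replica_total)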
The system is used in a way that there are hourly updates on the same documents. Each hourly update within a day targets the same type under a given index. Before each bulk update (reindexing), we delete all the documents contained in a particular type (to clean up) using the following procedure:
1. Get all document ids contained in the given type (id_lst).
2. Issue a bulk delete operation on all ids retrieved in #1, as follows:
bulk_body = ['{{"delete": {{"_index": "{}", "_type": "{}", "_id": "{}"}}}}'.format(my_index, given_type, x) for x in id_lst]
es.bulk(bulk_body, request_timeout=50)
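For completeness, step 1 is done roughly like this (a sketch using elasticsearch-py's scan helper; my_index and given_type are the same placeholders as above):
from elasticsearch import helpers
# Collect all document ids in the given type without fetching their _source (sketch).
id_lst = [hit["_id"] for hit in helpers.scan(es, index=my_index, doc_type=given_type, query={"query": {"match_all": {}}}, _source=False)]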
I'm pretty sure the problem lies in the 2nd step - deleting the documents via the bulk operation. I think the delete happens on the primary and is then replayed on the replica, which creates the out-of-sync problem.
I'd still like to understand 2 things:
What could have caused this? Which step in the above procedure needs to be rectified?
If you are certain that the primary shards contain all the updates, a quick fix would be to recreate the replica shards for that index. First drop them by setting number_of_replicas to 0:
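Roughly like this with the Python client (a sketch; the index name and es client object are assumed from your snippets). Once the cluster is green again, set the value back to 1 so fresh replicas are rebuilt from the primaries:
# Drop all replica shards for the affected index (sketch).
es.indices.put_settings(index="my-index", body={"index": {"number_of_replicas": 0}})
# ...wait for the cluster to settle, then recreate the replicas from the primary copies.
es.indices.put_settings(index="my-index", body={"index": {"number_of_replicas": 1}})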
As for how you ended up in this strange situation, I have no idea. In my company we also do a lot of updates and deletes of documents in a large ES cluster with 600+ million documents (our shards typically contain 10-20 million documents), but I've never seen our replica shards go out of sync like that.
The only time I've ever managed to get invalid shards was when I accidentally took down a node while another node was still initializing, causing the primary shard to go offline while it was still updating its replica. The not-up-to-date replica was promoted to new primary, so when the original "primary" node came back online with its larger doc count, the index remained in a red state because Elasticsearch could not determine which shard should be the new primary; I ended up with a stale copy and had to force the issue with a forced allocation.
Given your replica situation I'm surprised your index is still green. Are you using an older version of Elasticsearch? Is the cluster stable or do the nodes drop out from time to time?
2.3.3 is fairly old by now, so I'm not sure how well it handles shard updating; you could try upgrading to a newer version of Elasticsearch and see if that solves your problem.
As for your cluster stability, you could have a look at the log files under ~/logs/ on the nodes; they will contain information about nodes joining the cluster (after a drop-out or restart), typically lines mentioning detected_master or master_left.
We're using all default settings per index. The refresh_interval should be at its default value of 1s.
The output of cat indices for this particular index is:
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open <my-index> 5 1 47728984 2898212 291.5gb 145.7gb
I can see logs of type detected_master around the time we restarted the ES server. These logs are followed by the entries below:
[2017-04-22 05:55:13,344][WARN ][discovery.zen ] [prod] master left (reason = shut_down), current nodes: {{prod}{Y4FT84HmRRKsf6DdnJQ51w}{IP-1}{IP-1:9300},}
[2017-04-22 05:53:35,088][INFO ][discovery.zen ] [prod] failed to send join request to master [{prod}{LNT-uKCKQ_yaW5rV-AxlMA}{<ip-3>}{<ip-3>:9300}], reason [NodeDisconnectedException[[prod][<ip-3>:9300][internal:discovery/zen/join] disconnected]]
[2017-04-22 05:53:39,003][INFO ][cluster.service ] [prod] detected_master {prod}{M40n64iQQjGZg86px-08XA}{<ip-2>}{<ip-2>:9300}, added {{prod}{M40n64iQQjGZg86px-08XA}{<ip-2>}{<ip-2>:9300},}, reason: zen-disco-receive(from master [{prod}{M40n64iQQjGZg86px-08XA}{<ip-2>}{<ip-2>:9300}])
[2017-04-22 05:53:39,148][INFO ][http ] [prod] publish_address {IP-1:9200}, bound_addresses {[::]:9200}
[2017-04-22 05:53:39,149][INFO ][node ] [prod] started
[2017-04-22 05:53:58,790][INFO ][discovery.zen ] [prod] master_left [{prod}{M40n64iQQjGZg86px-08XA}{<ip-2>}{<ip-2>:9300}], reason [shut_down]
[2017-04-22 05:53:58,798][INFO ][discovery.zen ] [prod] master_left [{prod}{M40n64iQQjGZg86px-08XA}{<ip-2>}{<ip-2>:9300}], reason [transport disconnected]
[2017-04-22 05:53:59,223][DEBUG][action.admin.cluster.state] [prod] connection exception while trying to forward request with action name [cluster:monitor/state] to master node [{prod}{M40n64iQQjGZg86px-08XA}{<ip-2>}{<ip-2>:9300}], scheduling a retry. Error: [SendRequestTransportException[[prod][<ip-2>:9300][cluster:monitor/state]]; nested: NodeNotConnectedException[[prod][<ip-2>:9300] Node not connected];; org.elasticsearch.transport.NodeNotConnectedException: [prod][<ip-2>:9300] Node not connected]
[2017-04-22 05:54:57,123][WARN ][rest.suppressed ] /_cat/nodes Params: {v=}
RemoteTransportException[[prod][<ip-2>:9300][cluster:monitor/state]]; nested: MasterNotDiscoveredException;
Caused by: MasterNotDiscoveredException[null]
Also, immediately after the server restart, I can see the following for one of the indices:
Caused by: RemoteTransportException[[prod][<ip-2>:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [my-index][2] from primary shard with sync id but number of docs differ: 2179956 (prod, primary) vs 2181894(prod)];
These logs are now making me think that something went wrong during the restart and led to data loss.
At the point when you got the IllegalStateException you already had a corrupt index, because the primary shard was out of sync (the replica had documents not found in the primary).
I have gotten this exception too, both by accidental shutdown of a node and after a split brain episode. The only way to solve it, unless you can rebuild the index from the original source, is to manually force one of the shards - either the old primary or the old replica - to become the new primary by using the reroute API and accept data loss.
Here's the command I ran to force the copy of shard number 3 of my_big_index, which was in an illegal state, on node-20 to become the new primary:
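Roughly, it is the reroute API's allocate command with allow_primary set (shown here as a sketch with the Python client rather than the exact invocation; this accepts the loss of any documents that only exist on the discarded copy):
# Force the copy of shard 3 of my_big_index on node-20 to become the new primary (ES 2.x, sketch).
es.cluster.reroute(body={"commands": [{"allocate": {"index": "my_big_index", "shard": 3, "node": "node-20", "allow_primary": True}}]})
On Elasticsearch 5.x and later the equivalent is the allocate_stale_primary command with accept_data_loss set to true.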
I have gotten this exception too, both by accidental shutdown of a node and after a split brain episode.
A node can shut itself down without prior notification for various reasons; it can simply run out of memory and reboot.
How did a split-brain episode start in your case?
I'm eager to know what caused the problem in the first place. If it's an unexpected node shutdown, the replica and primary shouldn't go out of sync. Otherwise, what's the point of maintaining 3 nodes in production?
Our split brain happened in an old cluster that was just used for archiving. Somehow we had forgotten to set discovery.zen.minimum_master_nodes, so when we briefly lost communication between our two data centers the nodes at the non-master location elected their own master. Which just goes to show how important it is to set discovery.zen.minimum_master_nodes to (number of master-eligible nodes / 2) + 1.
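For a three-node cluster where all nodes are master-eligible that works out to 2. The setting is dynamic, so it can be applied without a restart, for example with the Python client (a sketch):
# Require a majority (2 of 3 master-eligible nodes) before a master can be elected (sketch).
es.cluster.put_settings(body={"persistent": {"discovery.zen.minimum_master_nodes": 2}})
The same value can also be set statically in elasticsearch.yml on each node.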
That is correct.
What I referred to was a situation where I had restarted one node in a cluster (as part of a rolling upgrade) and didn't wait for the cluster state to become green before taking down the next one. The cluster was in a yellow state because shards on the restarted node were still being updated from primary shards on the other nodes. When I took down the next node for upgrading, I accidentally took down the node holding a primary shard that was still trying to sync its replica on the node I had just upgraded. When that happened the primary shard was shut down, and the cluster had no choice but to promote the not-up-to-date replica to the new primary shard. When I finished the upgrade and started the node, I ended up with an IllegalStateException because of the conflict between the former and the current primary shard.
So as long as only one node falls out of the cluster there shouldn't be any problem, but if a second node falls out shortly after the first one rejoins, while the cluster state is still yellow, it is possible for a shard to go out of sync.
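That is also why, during a rolling restart, it is worth blocking on cluster health before touching the next node, for example (a sketch with the Python client):
# Block until the cluster reports green again before restarting the next node (sketch).
es.cluster.health(wait_for_status="green", request_timeout=600)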
Honestly speaking, I don't even know what discovery.zen.minimum_master_nodes means. I should get myself educated.
The second part of your answer is a brilliant explanation, and my guess is that something similar happened in our case as well. Before I took over, we had a restart several months back and we restarted all the nodes manually, which explains our cluster's current situation.