We've been observing some strange behaviour on our ES cluster for the last few weeks: the primary of one shard is much larger than its replica.
Here's the output of the shard details API. Notice shard 2 in particular, where the primary is 32.5 GB while the replica is ~19 GB. The global checkpoint and local checkpoint also seem off for shard 2.
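In case it helps, this is roughly how I'm pulling the shard sizes and checkpoints (a minimal sketch using Python's requests; the cluster URL and index name are placeholders for our actual values):

```python
import requests

ES = "http://localhost:9200"   # placeholder for our cluster endpoint
INDEX = "my-index"             # placeholder for the affected index

# Human-readable listing of primary vs replica store size per shard.
print(requests.get(
    f"{ES}/_cat/shards/{INDEX}",
    params={"v": "true", "h": "index,shard,prirep,state,docs,store,node"},
).text)

# Shard-level stats include the sequence-number checkpoints.
stats = requests.get(f"{ES}/{INDEX}/_stats", params={"level": "shards"}).json()
for copy in stats["indices"][INDEX]["shards"]["2"]:
    seq_no = copy["seq_no"]
    print(
        "primary" if copy["routing"]["primary"] else "replica",
        "local_checkpoint:", seq_no["local_checkpoint"],
        "global_checkpoint:", seq_no["global_checkpoint"],
    )
```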
I also checked the index stats at the shard level and found the following for shard 2: the translog size is around 22 GB for both the primary and the replica, which seems far too high compared to the other shards, whose translogs are mostly in the MB range. Here's the full data for the shard-level stats, and below that a sketch of how I'm pulling the translog numbers.
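This is roughly how I'm checking the translog size per shard copy (same placeholder URL and index name as above); only shard 2 reports translog sizes in the GB range:

```python
import requests

ES = "http://localhost:9200"   # placeholder
INDEX = "my-index"             # placeholder

# Restrict the stats to the translog metric, broken down per shard copy.
stats = requests.get(
    f"{ES}/{INDEX}/_stats/translog", params={"level": "shards"}
).json()

for shard_id, copies in stats["indices"][INDEX]["shards"].items():
    for copy in copies:
        tl = copy["translog"]
        print(
            "shard", shard_id,
            "primary" if copy["routing"]["primary"] else "replica",
            "ops:", tl["operations"],
            "size_in_bytes:", tl["size_in_bytes"],
            "uncommitted_size_in_bytes:", tl["uncommitted_size_in_bytes"],
        )
```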
There are a couple of instances of RecoveryFailedException on the node that holds the replica of shard 2, and there is a spike in JVM memory pressure just before the problem started. The sketch below shows how I'm checking recovery state and heap usage.
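For reference, this is how I'm looking at the recovery state for the index and the per-node heap usage (again, URL and index name are placeholders):

```python
import requests

ES = "http://localhost:9200"   # placeholder
INDEX = "my-index"             # placeholder

# Recoveries for the index, including the failed/retrying replica of shard 2.
print(requests.get(
    f"{ES}/_cat/recovery/{INDEX}",
    params={"v": "true", "h": "index,shard,time,type,stage,source_node,target_node"},
).text)

# Heap usage per node, to correlate with the JVM memory pressure spike.
jvm = requests.get(f"{ES}/_nodes/stats/jvm").json()
for node in jvm["nodes"].values():
    print(node["name"], node["jvm"]["mem"]["heap_used_percent"], "% heap used")
```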
Here are some of the error logs I found from around the time the problem started.
Do let me know if any more information is needed.
Any help would be greatly appreciated!