{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"unknown value for [index.shard.check_on_startup] must be one of [true, false, checksum] but was: fix"}],"type":"illegal_argument_exception","reason":"unknown value for [index.shard.check_on_startup] must be one of [true, false, checksum] but was: fix"},"status":400}
Thanks Glen! Interestingly enough, I ran that command on the correct index / shard and it reported there were no errors. It then said I needed to run this command:
You should run the following command to allocate this shard:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"[allocate_stale_primary] allocating an empty primary for [rc_2019-11][0] can result in data loss. Please confirm by setting the accept_data_loss parameter to true"}],"type":"illegal_argument_exception","reason":"[allocate_stale_primary] allocating an empty primary for [rc_2019-11][0] can result in data loss. Please confirm by setting the accept_data_loss parameter to true"},"status":400}
Ahhh ... I didn't read far enough. I changed accept_data_loss from false to true. For some reason, the shard is still showing as unassigned, even though the tool said the data was intact.
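For anyone following along, here is a rough sketch of the kind of reroute body the error above is asking for. The index name and shard number come from the error message, and es3 is the node named in the recovery log further down this thread; treating it as the target node is an assumption. The helper name build_stale_primary_reroute is made up for illustration. The body would be sent with POST /_cluster/reroute, and only once losing data on that shard is acceptable.

```python
import json


def build_stale_primary_reroute(index, shard, node, accept_data_loss=False):
    """Build the JSON body for a cluster reroute request.

    Hypothetical helper: it only assembles the payload and sends nothing.
    accept_data_loss defaults to False so the caller has to opt in.
    """
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": accept_data_loss,
                }
            }
        ]
    }


# Values taken from this thread; the target node is an assumption.
body = build_stale_primary_reroute("rc_2019-11", 0, "es3", accept_data_loss=True)
print(json.dumps(body, indent=2))
```

Keeping the flag off by default mirrors what the API does: the request is rejected until the data loss is confirmed explicitly.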
Making more progress -- when I ran the reallocate command, I noticed this:
Recovery failed on {es3}{QfClkKNITaOU83ZPhl_DVw}{vNBPdcFhQniHn-6rzi5bDw}{192.168.1.205}{192.168.1.205:9300}{dilm}{ml.machine_memory=134832025600, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogException[failed to create new translog file]; nested: AccessDeniedException[/var/lib/elasticsearch/nodes/0/indices/RRrcNKjmRZSU4_xZEtBnbQ/0/translog/translog.ckp]
It appears that something else wrote to that file as root (I ran the tool as root, so that may be it). I will try to chown the files back to elasticsearch and retry the allocation.
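Before running chown, it can help to see which files actually ended up with the wrong owner. A minimal sketch, assuming the service runs as a local user named elasticsearch (as mentioned above) and that the check is run on the data node itself. The function names are my own, not part of any Elasticsearch tooling.

```python
import os
import pwd  # Unix-only; used to look up the service user's uid


def uid_for(user):
    """Return the numeric uid for a local user name."""
    return pwd.getpwnam(user).pw_uid


def find_files_not_owned_by(root, expected_uid):
    """Walk root and return every file or directory not owned by expected_uid."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat checks the entry itself rather than following symlinks
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return sorted(mismatched)


# Example call (not executed here), using the shard path from the log above:
# find_files_not_owned_by(
#     "/var/lib/elasticsearch/nodes/0/indices/RRrcNKjmRZSU4_xZEtBnbQ/0",
#     uid_for("elasticsearch"),
# )
```

Anything this returns is a candidate for chown back to the elasticsearch user before retrying the allocation.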
Perhaps the tool should be run under the elasticsearch user or something?
One last comment. This is a very powerful tool that should probably get a bit more exposure in the documentation. I know it is a tool of last resort, but in my situation, it was able to restore the index without any data loss.
I'd recommend slipping a reference / link to the tool in a few sections of the documentation that cover recovery options.
Also, for the record, the sequence of events was this: I ran out of drive space (I had turned off the watermark checks and forgot to re-enable them), and according to the logs a merge operation then failed. It looks like ES retried, gave up, and marked the shard as corrupted (there was a file starting with corrupt in the node's data directory for that index / shard).
My gut feeling is that ES is very liberal when it comes to marking a shard / index as corrupted (perhaps to prevent further data loss). Even though the shard was marked as corrupted / stale, the data itself was still intact, and the tool brought it back with no data loss.
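If it helps anyone checking for the same symptom, here is a small sketch that looks for those marker files. It assumes only what I saw: a file whose name starts with corrupt somewhere under the shard's directory. The function name and the default prefix are illustrative, not an official convention.

```python
import os


def find_corruption_markers(shard_dir, prefix="corrupt"):
    """Return paths of files under shard_dir whose names start with prefix."""
    markers = []
    for dirpath, _dirnames, filenames in os.walk(shard_dir):
        for name in filenames:
            if name.startswith(prefix):
                markers.append(os.path.join(dirpath, name))
    return sorted(markers)


# Example call (not executed here), using the shard path from the log above:
# find_corruption_markers(
#     "/var/lib/elasticsearch/nodes/0/indices/RRrcNKjmRZSU4_xZEtBnbQ/0"
# )
```

An empty result only means no marker file was found under that path; it says nothing about whether the shard data is healthy.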