Option missing "fix" on startup

There used to be a "fix" value for the index.shard.check_on_startup setting, but it no longer works with ES 7.x.

Has this functionality been removed or moved to a different type of endpoint?
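For context, the values the setting still accepts in 7.x are shown in the error below; a call like this (a sketch, and note that index.shard.check_on_startup is a static setting, so it may need to be applied at index creation or on a closed index) would pass validation:

```
curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/rc_2019-11/_settings' \
  -d '{"index.shard.check_on_startup": "checksum"}'
```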

root@es3:/var/tmp# curl -H 'content-type: application/json' -XPUT localhost:9200/rc_2019-11/_settings -d '{"index.shard":{"check_on_startup":"fix"}}'

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"unknown value for [index.shard.check_on_startup] must be one of [true, false, checksum] but was: fix"}],"type":"illegal_argument_exception","reason":"unknown value for [index.shard.check_on_startup] must be one of [true, false, checksum] but was: fix"},"status":400}

See the elasticsearch-shard command line tool.
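A rough invocation sketch (run on the node that holds the shard, with Elasticsearch stopped on that node; the exact path to the binary depends on your install layout):

```
bin/elasticsearch-shard remove-corrupted-data --index rc_2019-11 --shard-id 0
```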

In case you are interested, here is a discussion that led to this change.


Thanks Glen! Interestingly enough, I ran that tool on the affected index / shard and it reported no errors. It then said I needed to run this command:

You should run the following command to allocate this shard:

POST /_cluster/reroute
{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "rc_2019-11",
        "shard" : 0,
        "node" : "QfClkKNITaOU83ZPhl_DVw",
        "accept_data_loss" : false
      }
    }
  ]
}

However, running that command gave the error:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"[allocate_stale_primary] allocating an empty primary for [rc_2019-11][0] can result in data loss. Please confirm by setting the accept_data_loss parameter to true"}],"type":"illegal_argument_exception","reason":"[allocate_stale_primary] allocating an empty primary for [rc_2019-11][0] can result in data loss. Please confirm by setting the accept_data_loss parameter to true"},"status":400}

Ahhh ... I didn't read far enough. I changed accept_data_loss from false to true. For some reason, the shard is still showing as unassigned even though the tool reported the data was intact.
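For anyone following along, the accepted form of the reroute command is the one the tool printed, with the flag flipped:

```
POST /_cluster/reroute
{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "rc_2019-11",
        "shard" : 0,
        "node" : "QfClkKNITaOU83ZPhl_DVw",
        "accept_data_loss" : true
      }
    }
  ]
}
```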

Very strange.

Making more progress -- when I ran the reroute command again, I noticed this in the logs:

Recovery failed on {es3}{QfClkKNITaOU83ZPhl_DVw}{vNBPdcFhQniHn-6rzi5bDw}{192.168.1.205}{192.168.1.205:9300}{dilm}{ml.machine_memory=134832025600, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogException[failed to create new translog file]; nested: AccessDeniedException[/var/lib/elasticsearch/nodes/0/indices/RRrcNKjmRZSU4_xZEtBnbQ/0/translog/translog.ckp]

It appears the tool wrote to that file as root (I ran it as root, so that would explain it). I will chown the files back to the elasticsearch user and retry the allocation.
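Something like this, using the shard path from the log above (a sketch; adjust the user/group if your install runs Elasticsearch under a different account):

```
chown -R elasticsearch:elasticsearch \
  /var/lib/elasticsearch/nodes/0/indices/RRrcNKjmRZSU4_xZEtBnbQ/0/
```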

Perhaps the tool should be run under the elasticsearch user or something?

Success !!!

I owe you a beer or coffee Glen!


One last comment. This is a very powerful tool that should probably get a bit more exposure in the documentation. I know it is a tool of last resort, but in my situation, it was able to restore the index without any data loss.

I'd recommend slipping a reference / link to the tool in a few sections of the documentation that cover recovery options.

Also, for this situation, the sequence of events was that I ran out of drive space (I had turned off the watermark checks and forgot to re-enable them), and according to the logs a merge operation then failed. It looks like ES retried, then gave up and marked the shard as corrupted (there was a file whose name started with corrupt in the node's data directory for that index / shard).
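If, like me, you disabled the disk watermark checks, re-enabling them looks roughly like this (a sketch, assuming they were disabled via the threshold_enabled setting):

```
PUT /_cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.disk.threshold_enabled" : true
  }
}
```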

My gut feeling is that ES is fairly liberal about marking a shard / index as corrupted (perhaps to prevent further damage); even though it was marked as corrupted / stale, the underlying data was still fully intact when recovered with the tool.

Hope this helps others out there.

