Option missing "fix" on startup

pushshift · November 12, 2019, 7:58pm

There used to be a "fix" option for index.shard.check_on_startup, but ES 7.x no longer accepts it.

Has this functionality been removed or moved to a different type of endpoint?

root@es3:/var/tmp# curl -H 'content-type: application/json' -XPUT localhost:9200/rc_2019-11/_settings -d '{"index.shard":{"check_on_startup":"fix"}}'

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"unknown value for [index.shard.check_on_startup] must be one of [true, false, checksum] but was: fix"}],"type":"illegal_argument_exception","reason":"unknown value for [index.shard.check_on_startup] must be one of [true, false, checksum] but was: fix"},"status":400}
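For reference, the error message lists the values that are still accepted: true, false and checksum. None of them repairs a shard; they only control whether it is checked when it is opened. The following is a minimal sketch of applying one of them, not a tested procedure. It assumes the setting is static (so the index is closed first and reopened afterwards), that the node listens on localhost:9200 as in the post, and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumption: same host/port as in the curl call above.
ES_URL="${ES_URL:-localhost:9200}"

# Build the settings body; reject anything the error message does not list.
check_on_startup_body() {
  case "$1" in
    true|false|checksum) ;;
    *) echo "unsupported value: $1" >&2; return 1 ;;
  esac
  printf '{"index.shard.check_on_startup":"%s"}\n' "$1"
}

# Close the index, apply the setting, reopen the index.
# $1 = index name, $2 = one of true|false|checksum
set_check_on_startup() {
  local index="$1" body
  body="$(check_on_startup_body "$2")" || return 1
  curl -sS -XPOST "$ES_URL/$index/_close" &&
  curl -sS -H 'content-type: application/json' -XPUT "$ES_URL/$index/_settings" -d "$body" &&
  curl -sS -XPOST "$ES_URL/$index/_open"
}
```

Calling `set_check_on_startup rc_2019-11 checksum` would send the three requests in order; `check_on_startup_body fix` fails locally before anything is sent.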

Glen_Smith · November 12, 2019, 8:19pm

See the elasticsearch-shard command line tool.

In case you are interested, here is a discussion that led to this change.
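A rough sketch of what invoking the tool might look like for the index in this thread. The install path /usr/share/elasticsearch is an assumption (it depends on how ES was installed), and the helper only prints the command so it can be reviewed before running. The node has to be stopped before the tool is run against its data directory.

```shell
#!/usr/bin/env bash
# Assumption: package-style install location; override ES_HOME if different.
ES_HOME="${ES_HOME:-/usr/share/elasticsearch}"

# Print (not run) the elasticsearch-shard call for one shard.
# $1 = index name, $2 = shard number
shard_tool_cmd() {
  printf '%s/bin/elasticsearch-shard remove-corrupted-data --index %s --shard-id %s\n' \
    "$ES_HOME" "$1" "$2"
}
```

For example, `shard_tool_cmd rc_2019-11 0` prints the command for shard 0 of rc_2019-11. The tool is best run as the user that owns the data directory, not as root.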

pushshift · November 12, 2019, 8:43pm

Thanks Glen! Interestingly enough, I ran that command on the correct index / shard and it reported there were no errors. It then said I needed to run this command:

You should run the following command to allocate this shard:

POST /_cluster/reroute
{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "rc_2019-11",
        "shard" : 0,
        "node" : "QfClkKNITaOU83ZPhl_DVw",
        "accept_data_loss" : false
      }
    }
  ]
}

However, running that command gave the error:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"[allocate_stale_primary] allocating an empty primary for [rc_2019-11][0] can result in data loss. Please confirm by setting the accept_data_loss parameter to true"}],"type":"illegal_argument_exception","reason":"[allocate_stale_primary] allocating an empty primary for [rc_2019-11][0] can result in data loss. Please confirm by setting the accept_data_loss parameter to true"},"status":400}

pushshift · November 12, 2019, 8:54pm

Ahhh ... I didn't read far enough. I changed accept_data_loss from false to true and reran the command. For some reason, the shard is still showing as unassigned even though the tool said the data was intact.

Very strange.
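For anyone following along, here is a sketch of the reroute request with accept_data_loss set to true, plus a cluster allocation explain request that can help show why a shard stays unassigned. The index, shard and node ID are the ones quoted above; localhost:9200 and the function names are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Assumption: node reachable on localhost:9200.
ES_URL="${ES_URL:-localhost:9200}"

# Body for POST /_cluster/reroute. $1 = index, $2 = shard number, $3 = node ID
reroute_body() {
  printf '{"commands":[{"allocate_stale_primary":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}\n' \
    "$1" "$2" "$3"
}

# Body for the allocation explain API. $1 = index, $2 = shard number
explain_body() {
  printf '{"index":"%s","shard":%s,"primary":true}\n' "$1" "$2"
}

# Send the reroute command (this is the step that accepts possible data loss).
allocate_stale_primary() {
  curl -sS -H 'content-type: application/json' -XPOST \
    "$ES_URL/_cluster/reroute" -d "$(reroute_body "$@")"
}

# Ask the cluster why the primary is still unassigned.
explain_unassigned() {
  curl -sS -H 'content-type: application/json' -XPOST \
    "$ES_URL/_cluster/allocation/explain" -d "$(explain_body "$@")"
}
```

Usage would be `allocate_stale_primary rc_2019-11 0 QfClkKNITaOU83ZPhl_DVw`, followed by `explain_unassigned rc_2019-11 0` if the shard does not get assigned.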

pushshift · November 12, 2019, 8:58pm

Making more progress -- when I ran the reallocate command, I noticed this:

Recovery failed on {es3}{QfClkKNITaOU83ZPhl_DVw}{vNBPdcFhQniHn-6rzi5bDw}{192.168.1.205}{192.168.1.205:9300}{dilm}{ml.machine_memory=134832025600, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogException[failed to create new translog file]; nested: AccessDeniedException[/var/lib/elasticsearch/nodes/0/indices/RRrcNKjmRZSU4_xZEtBnbQ/0/translog/translog.ckp]

It appears that something else wrote to that file as root (I ran the tool as root, so that may be it). I will try to chown the files back to elasticsearch and retry the allocation.

Perhaps the tool should be run under the elasticsearch user or something?
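A small sketch of checking for and fixing the ownership problem described above. It assumes the service user and group are both named elasticsearch; the data path in the usage note is the one from the exception, and the function names are invented for this example.

```shell
#!/usr/bin/env bash
# Assumption: ES runs as user and group "elasticsearch".
ES_USER="${ES_USER:-elasticsearch}"

# List everything under a directory that is NOT owned by the given user.
# $1 = directory, $2 = user name or numeric uid
not_owned_by() {
  find "$1" ! -user "$2" -print
}

# Hand the whole tree back to the service user (needs root).
# $1 = directory
fix_ownership() {
  chown -R "$ES_USER:$ES_USER" "$1"
}
```

Running `not_owned_by /var/lib/elasticsearch/nodes/0/indices elasticsearch` should list files such as a root-owned translog.ckp; `fix_ownership` on the same path then restores them. Running the tool via `sudo -u elasticsearch` in the first place would likely avoid the problem.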

pushshift · November 12, 2019, 8:59pm

Success !!!

I owe you a beer or coffee Glen!

pushshift · November 12, 2019, 9:21pm

One last comment. This is a very powerful tool that should probably get a bit more exposure in the documentation. I know it is a tool of last resort, but in my situation, it was able to restore the index without any data loss.

I'd recommend slipping a reference / link to the tool in a few sections of the documentation that cover recovery options.

Also, for this situation, the sequence of events was that I ran out of drive space (I had turned off the watermark checks and forgot to re-enable them), and according to the logs a merge operation then failed. It looks like ES retried, gave up, and marked the shard as corrupted (there was a file starting with corrupt in the node's data directory for that index / shard).
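The post does not say how the watermark checks were turned off. One common way is the cluster setting cluster.routing.allocation.disk.threshold_enabled, so the sketch below assumes that setting, applied as a persistent setting, and localhost:9200. Treat it as an illustration only.

```shell
#!/usr/bin/env bash
# Assumption: node reachable on localhost:9200.
ES_URL="${ES_URL:-localhost:9200}"

# Body for PUT /_cluster/settings.
# $1 = true to enforce the disk watermarks, false to disable them
watermark_body() {
  printf '{"persistent":{"cluster.routing.allocation.disk.threshold_enabled":%s}}\n' "$1"
}

# Turn the disk watermark checks back on.
enable_watermarks() {
  curl -sS -H 'content-type: application/json' -XPUT \
    "$ES_URL/_cluster/settings" -d "$(watermark_body true)"
}
```

After `enable_watermarks`, a request to `_cat/allocation?v` shows per-node disk usage, which is a quick way to see how close each node is to the limits.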

My gut feeling is that ES is very liberal when it comes to marking a shard / index as corrupted (perhaps to prevent further data loss) and even though it was marked as corrupted / stale, the data itself was still preserved with no data loss when using the tool.
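To see which shard directories carry such a marker, something like the following could be used. It is a sketch: the name pattern corrupt* simply mirrors the description above, and the default data path is an assumption based on the path in the earlier exception.

```shell
#!/usr/bin/env bash
# Assumption: default data directory of a package install.
DATA_DIR="${DATA_DIR:-/var/lib/elasticsearch}"

# Print every regular file under a directory whose name starts with "corrupt".
# $1 = directory to search
find_corruption_markers() {
  find "$1" -type f -name 'corrupt*' -print
}
```

`find_corruption_markers "$DATA_DIR"` lists the marker files with their full paths, which include the index UUID and shard number.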

Hope this helps others out there.

