Replica shard unassigned while performing shrink action through ILM policy

stevesimpson · October 14, 2019, 10:39am

Hi

The health of my index has turned yellow, seemingly because a replica shard cannot be allocated. This index was created by the "shrink" action defined in my ILM policy. The desired result is for the index to be shrunk and moved to the "warm" data nodes, but I don't think this can complete until all shards are assigned.

I'll post the output of _cluster/allocation/explain, _cat/shards/<index>, ILM policy, Index template and Elasticsearch logs below. Please let me know if you have any idea at all why this replica is now unassigned.

_cat/shards/shrink-filebeat-haproxy-production-2019.10.08-000001?v

index                                                shard prirep state           docs  store ip          node
shrink-filebeat-haproxy-production-2019.10.08-000001 0     p      STARTED    132097605   90gb 10.0.16.212 es-dn-warm-3.core.ld5.phg.io
shrink-filebeat-haproxy-production-2019.10.08-000001 0     r      STARTED    132097605 90.1gb 10.0.16.210 es-dn-warm-1.core.ld5.phg.io
shrink-filebeat-haproxy-production-2019.10.08-000001 0     r      UNASSIGNED

_cluster/allocation/explain

Moved to a comment below because the post hit the maximum body size.

ILM policy:

{
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {
                    "rollover": {
                        "max_age": "30d",
                        "max_size": "90gb"
                    },
                    "set_priority": {
                        "priority": 100
                    }
                }
            },
            "warm": {
                "min_age": "30d",
                "actions": {
                    "allocate": {
                        "include": {},
                        "exclude": {},
                        "require": {
                            "data": "warm"
                        }
                    },
                    "forcemerge": {
                        "max_num_segments": 1
                    },
                    "set_priority": {
                        "priority": 50
                    },
                    "shrink": {
                        "number_of_shards": 1
                    }
                }
            }
        }
    }
}

Index Template (some fields removed):

{
  "settings": {
    "index": {
      "mapping": {
        "total_fields": {
          "limit": "10000"
        }
      },
      "refresh_interval": "5s",
      "blocks": {
        "write": "true"
      },
      "provided_name": "filebeat-haproxy-production-2019.10.08-000001",
      "query": {
      ...
      },
      "creation_date": "1570537372676",
      "priority": "50",
      "number_of_replicas": "2",
      "uuid": "***",
      "version": {
        "created": "7030099"
      },
      "lifecycle": {
        "name": "filebeat-haproxy-production-ilm-policy",
        "rollover_alias": "filebeat-haproxy-production-ilm-alias",
        "indexing_complete": "true"
      },
      "codec": "best_compression",
      "routing": {
        "allocation": {
          "require": {
            "data": "warm",
            "_id": "***"
          }
        }
      },
      "number_of_shards": "3",
      "shard": {
        "check_on_startup": "checksum"
      }
    }
  },

Elasticsearch logs (from the data node which is failing to allocate the replica)

Moved to a comment below because the post hit the maximum body size.

Things that I've tried:

Ran POST /_cluster/reroute?retry_failed=true to retry the shard allocation. The shard moves to the INITIALIZING state, then back to UNASSIGNED after a short period. The Elasticsearch log error posted below appears each time initialization fails.
Set "cluster.routing.allocation.enable": "none", tried to manually allocate the replica using the /_cluster/reroute API, then re-enabled shard allocation. Same failure.
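
For reference, the two attempts above map roughly onto the requests below. This is a minimal Python sketch that only builds request descriptions and sends nothing. The helper names are hypothetical, and two details are assumptions because the post does not state them: that the setting was applied as a transient cluster setting, and that the manual allocation used the `allocate_replica` reroute command.

```python
import json

INDEX = "shrink-filebeat-haproxy-production-2019.10.08-000001"


def retry_failed_request():
    """Ask the cluster to retry shards that exhausted their allocation retries."""
    return ("POST", "/_cluster/reroute?retry_failed=true", None)


def allocation_enable_request(value):
    """Set cluster.routing.allocation.enable (assumed transient here)."""
    body = {"transient": {"cluster.routing.allocation.enable": value}}
    return ("PUT", "/_cluster/settings", body)


def allocate_replica_request(index, shard, node):
    """Manually place one replica copy on a named node (assumed command)."""
    body = {
        "commands": [
            {"allocate_replica": {"index": index, "shard": shard, "node": node}}
        ]
    }
    return ("POST", "/_cluster/reroute", body)


if __name__ == "__main__":
    steps = [
        retry_failed_request(),
        allocation_enable_request("none"),
        allocate_replica_request(INDEX, 0, "es-dn-warm-2.core.ld5.phg.io"),
        allocation_enable_request("all"),
    ]
    for method, path, body in steps:
        print(method, path)
        if body is not None:
            print(json.dumps(body, indent=2))
```

Neither request changes why the recovery fails, so the shard can be expected to return to UNASSIGNED in the same way each time.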

Please let me know if you need any further logs/information

stevesimpson · October 14, 2019, 10:40am

Elasticsearch logs (from the data node which is failing to allocate the replica)

[2019-10-14T10:15:35,364][WARN ][o.e.i.c.IndicesClusterStateService] [es-dn-warm-2.core.ld5.phg.io][shrink-filebeat-haproxy-production-2019.10.08-000001][0] marking and sending shard failed due to [failed recovery]
    org.elasticsearch.indices.recovery.RecoveryFailedException: [shrink-filebeat-haproxy-production-2019.10.08-000001][0]: Recovery failed from {es-dn-warm-3.core.ld5.phg.io}{8jYdhpY3T5q4wkIwuNqUJg}{SIZsBs0CQIiA3d3WTiMqkg}{10.0.16.212}{10.0.16.212:9300}{d}{data=warm, xpack.installed=true} into {es-dn-warm-2.core.ld5.phg.io}{ovybOoRaQTm6wFa2oUXKwQ}{ak_Hnb5CTb6JhnBYkht0tg}{10.0.16.211}{10.0.16.211:9300}{d}{xpack.installed=true, data=warm}
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:249) [elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:294) [elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) [elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) [elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) [elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:246) [elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.0.jar:7.3.0]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    	at java.lang.Thread.run(Thread.java:835) [?:?]
    Caused by: org.elasticsearch.transport.RemoteTransportException: [es-dn-warm-3.core.ld5.phg.io][10.0.16.212:9300][internal:index/shard/recovery/start_recovery]
    Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
    	at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$prepareTargetForTranslog$23(RecoverySourceHandler.java:470) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:70) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1012) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    	at java.lang.Thread.run(Thread.java:835) ~[?:?]
    Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-dn-warm-2.core.ld5.phg.io][10.0.16.211:9300][internal:index/shard/recovery/prepare_translog] request_id [30457025] timed out after [899891ms]
    	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1013) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.3.0.jar:7.3.0]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    	at java.lang.Thread.run(Thread.java:835) ~[?:?]

stevesimpson · October 14, 2019, 10:44am

_cluster/allocation/explain

Some sections have been removed due to max body size limit.

{
      "index" : "shrink-filebeat-haproxy-production-2019.10.08-000001",
      "shard" : 0,
      "primary" : false,
      "current_state" : "unassigned",
      "unassigned_info" : {
        "reason" : "ALLOCATION_FAILED",
        "at" : "2019-10-14T10:15:35.442Z",
        "failed_allocation_attempts" : 1,
        "details" : "failed shard on node [ovybOoRaQTm6wFa2oUXKwQ]: failed recovery, failure RecoveryFailedException[[shrink-filebeat-haproxy-production-2019.10.08-000001][0]: Recovery failed from {es-dn-warm-3.core.ld5.phg.io}{8jYdhpY3T5q4wkIwuNqUJg}{SIZsBs0CQIiA3d3WTiMqkg}{10.0.16.212}{10.0.16.212:9300}{d}{data=warm, xpack.installed=true} into {es-dn-warm-2.core.ld5.phg.io}{ovybOoRaQTm6wFa2oUXKwQ}{ak_Hnb5CTb6JhnBYkht0tg}{10.0.16.211}{10.0.16.211:9300}{d}{xpack.installed=true, data=warm}]; nested: RemoteTransportException[[es-dn-warm-3.core.ld5.phg.io][10.0.16.212:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: ReceiveTimeoutTransportException[[es-dn-warm-2.core.ld5.phg.io][10.0.16.211:9300][internal:index/shard/recovery/prepare_translog] request_id [30457025] timed out after [899891ms]]; ",
        "last_allocation_status" : "no_attempt"
      },
      "can_allocate" : "no",
      "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
      "node_allocation_decisions" : [
        {
          "node_id" : "4FPfJmiAQsShkhkRs-Kkkw",
          "node_name" : "es-dn-warm-1.core.ld5.phg.io",
          "transport_address" : "10.0.16.210:9300",
          "node_attributes" : {
            "data" : "warm",
            "xpack.installed" : "true"
          },
          "node_decision" : "no",
          "deciders" : [
            {
              "decider" : "max_retry",
              "decision" : "NO",
              "explanation" : "shard has exceeded the maximum number of retries [1] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-14T10:15:35.442Z], failed_attempts[1], delayed=false, details[failed shard on node [ovybOoRaQTm6wFa2oUXKwQ]: failed recovery, failure RecoveryFailedException[[shrink-filebeat-haproxy-production-2019.10.08-000001][0]: Recovery failed from {es-dn-warm-3.core.ld5.phg.io}{8jYdhpY3T5q4wkIwuNqUJg}{SIZsBs0CQIiA3d3WTiMqkg}{10.0.16.212}{10.0.16.212:9300}{d}{data=warm, xpack.installed=true} into {es-dn-warm-2.core.ld5.phg.io}{ovybOoRaQTm6wFa2oUXKwQ}{ak_Hnb5CTb6JhnBYkht0tg}{10.0.16.211}{10.0.16.211:9300}{d}{xpack.installed=true, data=warm}]; nested: RemoteTransportException[[es-dn-warm-3.core.ld5.phg.io][10.0.16.212:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: ReceiveTimeoutTransportException[[es-dn-warm-2.core.ld5.phg.io][10.0.16.211:9300][internal:index/shard/recovery/prepare_translog] request_id [30457025] timed out after [899891ms]]; ], allocation_status[no_attempt]]]"
            },
            {
              "decider" : "same_shard",
              "decision" : "NO",
              "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[shrink-filebeat-haproxy-production-2019.10.08-000001][0], node[4FPfJmiAQsShkhkRs-Kkkw], [R], s[STARTED], a[id=pfwNFWNQQ26SWxgfp9kErg]]"
            }
          ]
        },
        {
          "node_id" : "ovybOoRaQTm6wFa2oUXKwQ",
          "node_name" : "es-dn-warm-2.core.ld5.phg.io",
          "transport_address" : "10.0.16.211:9300",
          "node_attributes" : {
            "data" : "warm",
            "xpack.installed" : "true"
          },
          "node_decision" : "no",
          "deciders" : [
            {
              "decider" : "max_retry",
              "decision" : "NO",
              "explanation" : "shard has exceeded the maximum number of retries [1] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-14T10:15:35.442Z], failed_attempts[1], delayed=false, details[failed shard on node [ovybOoRaQTm6wFa2oUXKwQ]: failed recovery, failure RecoveryFailedException[[shrink-filebeat-haproxy-production-2019.10.08-000001][0]: Recovery failed from {es-dn-warm-3.core.ld5.phg.io}{8jYdhpY3T5q4wkIwuNqUJg}{SIZsBs0CQIiA3d3WTiMqkg}{10.0.16.212}{10.0.16.212:9300}{d}{data=warm, xpack.installed=true} into {es-dn-warm-2.core.ld5.phg.io}{ovybOoRaQTm6wFa2oUXKwQ}{ak_Hnb5CTb6JhnBYkht0tg}{10.0.16.211}{10.0.16.211:9300}{d}{xpack.installed=true, data=warm}]; nested: RemoteTransportException[[es-dn-warm-3.core.ld5.phg.io][10.0.16.212:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: ReceiveTimeoutTransportException[[es-dn-warm-2.core.ld5.phg.io][10.0.16.211:9300][internal:index/shard/recovery/prepare_translog] request_id [30457025] timed out after [899891ms]]; ], allocation_status[no_attempt]]]"
            }
          ]
        }
      ]
    }
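
The explain output is easier to read once it is reduced to the deciders that said NO on each node. The sketch below does that in Python on a trimmed copy of the response above (the long explanation strings are omitted); the function name is hypothetical.

```python
def blocking_deciders(explain):
    """Return {node_name: [decider, ...]} for every decider that answered NO."""
    result = {}
    for node in explain.get("node_allocation_decisions", []):
        result[node["node_name"]] = [
            d["decider"]
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        ]
    return result


# Trimmed copy of the response posted above.
sample = {
    "index": "shrink-filebeat-haproxy-production-2019.10.08-000001",
    "shard": 0,
    "primary": False,
    "node_allocation_decisions": [
        {
            "node_name": "es-dn-warm-1.core.ld5.phg.io",
            "deciders": [
                {"decider": "max_retry", "decision": "NO"},
                {"decider": "same_shard", "decision": "NO"},
            ],
        },
        {
            "node_name": "es-dn-warm-2.core.ld5.phg.io",
            "deciders": [{"decider": "max_retry", "decision": "NO"}],
        },
    ],
}

if __name__ == "__main__":
    for name, deciders in blocking_deciders(sample).items():
        print(name, deciders)
```

In the portion of the output shown, es-dn-warm-2 is blocked only by max_retry, which matches the behaviour described: retry_failed=true lets the recovery start again, and it then fails for the underlying reason.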

DavidTurner · October 14, 2019, 11:39am

This recovery timed out after 15 minutes, while starting the engine. Are they all timing out after 15 minutes? Can you start another recovery, wait a few minutes, and then grab the hot threads to see what this node is busy doing?

GET _nodes/hot_threads?threads=999999

stevesimpson · October 14, 2019, 12:19pm

Hi @DavidTurner, thank you for the reply. Yes, they all seem to be failing around the 15 minute mark, although I've not recorded the exact time of each failure. Here are the hot_threads from the node that the shard is trying to initialize on:

   24.0% (119.8ms out of 500ms) cpu usage by thread 'elasticsearch[es-dn-warm-2.core.ld5.phg.io][generic][T#8]'
 10/10 snapshots sharing following 35 elements
   java.base@12.0.1/sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
   java.base@12.0.1/sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:54)
   java.base@12.0.1/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:274)
   java.base@12.0.1/sun.nio.ch.IOUtil.read(IOUtil.java:245)
   java.base@12.0.1/sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:811)
   java.base@12.0.1/sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:796)
   app//org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
   app//org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:160)
   app//org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:116)
   app//org.apache.lucene.store.BufferedChecksumIndexInput.readBytes(BufferedChecksumIndexInput.java:49)
   app//org.apache.lucene.store.DataInput.readBytes(DataInput.java:87)
   app//org.apache.lucene.store.DataInput.skipBytes(DataInput.java:317)
   app//org.apache.lucene.store.ChecksumIndexInput.seek(ChecksumIndexInput.java:52)
   app//org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:525)
   app//org.elasticsearch.index.store.Store.checkIntegrity(Store.java:536)
   app//org.elasticsearch.index.shard.IndexShard.doCheckIndex(IndexShard.java:2329)
   app//org.elasticsearch.index.shard.IndexShard.checkIndex(IndexShard.java:2305)
   app//org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1500)
   app//org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1488)
   app//org.elasticsearch.indices.recovery.RecoveryTarget.lambda$prepareForTranslogOperations$0(RecoveryTarget.java:291)
   app//org.elasticsearch.indices.recovery.RecoveryTarget$$Lambda$3472/0x0000000801cd2840.get(Unknown Source)
   app//org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:253)
   app//org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:289)
   app//org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:433)
   app//org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:427)
   org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257)
   app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315)
   app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
   app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:267)
   app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758)
   app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   java.base@12.0.1/java.lang.Thread.run(Thread.java:835)

stevesimpson · October 14, 2019, 12:23pm

I've run the command at various intervals and got the following:

6.6% (33.2ms out of 500ms) cpu usage by thread 'elasticsearch[es-dn-warm-2.core.ld5.phg.io][generic][T#8]'
 10/10 snapshots sharing following 26 elements
   app//org.apache.lucene.store.BufferedChecksumIndexInput.readBytes(BufferedChecksumIndexInput.java:49)
   app//org.apache.lucene.store.DataInput.readBytes(DataInput.java:87)
   app//org.apache.lucene.store.DataInput.skipBytes(DataInput.java:317)

and

::: {es-dn-warm-2.core.ld5.phg.io}{ovybOoRaQTm6wFa2oUXKwQ}{ak_Hnb5CTb6JhnBYkht0tg}{10.0.16.211}{10.0.16.211:9300}{d}{data=warm, xpack.installed=true}
   Hot threads at 2019-10-14T12:10:27.555Z, interval=500ms, busiestThreads=999999, ignoreIdleThreads=true:
   
   20.5% (102.3ms out of 500ms) cpu usage by thread 'elasticsearch[es-dn-warm-2.core.ld5.phg.io][generic][T#8]'
     2/10 snapshots sharing following 24 elements
       app//org.apache.lucene.store.DataInput.skipBytes(DataInput.java:319)
       app//org.apache.lucene.store.ChecksumIndexInput.seek(ChecksumIndexInput.java:52)
       app//org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:525)
       app//org.elasticsearch.index.store.Store.checkIntegrity(Store.java:536)
       app//org.elasticsearch.index.shard.IndexShard.doCheckIndex(IndexShard.java:2329)
       app//org.elasticsearch.index.shard.IndexShard.checkIndex(IndexShard.java:2305)
       app//org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1500)
       app//org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1488)
       app//org.elasticsearch.indices.recovery.RecoveryTarget.lambda$prepareForTranslogOperations$0(RecoveryTarget.java:291)
       app//org.elasticsearch.indices.recovery.RecoveryTarget$$Lambda$3472/0x0000000801cd2840.get(Unknown Source)
       app//org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:253)
       app//org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:289)
       app//org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:433)
       app//org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:427)
       org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315)
       app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
       app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:267)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@12.0.1/java.lang.Thread.run(Thread.java:835)

stevesimpson · October 14, 2019, 12:24pm

And yes, the timeout was after 15 minutes this time as well: request_id [30806196] timed out after [900063ms]
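
To check that both failures really sit at the 15 minute mark, the millisecond values can be pulled out of the log messages and converted. A small Python sketch, with a hypothetical helper name, using the two messages quoted in this thread:

```python
import re


def timeout_minutes(message):
    """Extract 'timed out after [<n>ms]' from a log line and return minutes."""
    match = re.search(r"timed out after \[(\d+)ms\]", message)
    if match is None:
        return None
    return int(match.group(1)) / 60000


messages = [
    "request_id [30457025] timed out after [899891ms]",
    "request_id [30806196] timed out after [900063ms]",
]

if __name__ == "__main__":
    for msg in messages:
        print(msg, "->", round(timeout_minutes(msg), 2), "minutes")
```

Both values come out within a fraction of a second of 15 minutes, which points at a fixed timeout being hit rather than a failure at a random point in the recovery.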

DavidTurner · October 14, 2019, 12:26pm

It looks like you have index.shard.check_on_startup configured. This performs a (very expensive) check of your index when it is starting up. I suggest removing this setting.
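
This matches the hot threads above, where the recovering shard spends its time in Store.checkIntegrity and CodecUtil.checksumEntireFile. One way to confirm whether an index carries the setting is to look it up in the settings response. The Python sketch below works on a trimmed copy of the settings posted earlier in the thread; the helper name is hypothetical.

```python
def get_nested(settings, dotted_key, default=None):
    """Walk a nested dict along a dotted path, e.g. 'index.shard.check_on_startup'."""
    node = settings
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Trimmed copy of the settings shown in the first post.
sample = {
    "settings": {
        "index": {
            "number_of_replicas": "2",
            "codec": "best_compression",
            "shard": {"check_on_startup": "checksum"},
        }
    }
}

if __name__ == "__main__":
    value = get_nested(sample, "settings.index.shard.check_on_startup", "not set")
    print("index.shard.check_on_startup =", value)
```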

stevesimpson · October 14, 2019, 1:16pm

@DavidTurner - Thank you! I did have index.shard.check_on_startup set to "checksum". For anyone reading this in future: I had to close the index, then run the following:

PUT /shrink-filebeat-haproxy-production-2019.10.08-000001/_settings
{
  "index": {
    "shard": {
      "check_on_startup": "false"
    }
  }
}

Then re-open the index and persist the change to the template using the PUT _template API.
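
The close, update and re-open sequence described above can be written out as an ordered list of requests. This Python sketch only builds the list and sends nothing; the function name is hypothetical, and the template update is left out because the template name is not given in the thread.

```python
import json


def disable_startup_check_steps(index):
    """Ordered requests: close the index, turn off the startup check, re-open it."""
    settings_body = {"index": {"shard": {"check_on_startup": "false"}}}
    return [
        ("POST", f"/{index}/_close", None),
        ("PUT", f"/{index}/_settings", settings_body),
        ("POST", f"/{index}/_open", None),
    ]


if __name__ == "__main__":
    index = "shrink-filebeat-haproxy-production-2019.10.08-000001"
    for method, path, body in disable_startup_check_steps(index):
        print(method, path)
        if body is not None:
            print(json.dumps(body, indent=2))
```

The index is closed first because, as described above, the setting could not be changed while the index was open.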

system · November 11, 2019, 1:16pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
