Shard replica won't initialise

(Michael Day) #1

Over the weekend my ES cluster went Yellow.

I'm running 6.2.3 in a kubernetes cluster in GKE, we have 3 hot data nodes and 4 warm, 3 masters and 3 clients. We have k8s cronjobs which run prune, rollover and shrink. Rollover is run hourly on the half hour, shrink it run nightly and prune is run on a sunday at the same time as the sunday shrink.

It looks like things broke when the prune and shrink ran on the Sunday.

When I got in to work on the Monday, there was one damaged index which looked like

active-cyberprobe-002842   1 r STARTED    
active-cyberprobe-002842   1 p STARTED    
active-cyberprobe-002842   2 r STARTED    
active-cyberprobe-002842   2 p STARTED    
active-cyberprobe-002842   0 p RELOCATING 
active-cyberprobe-002842   0 r UNASSIGNED NODE_LEFT


$ kubectl exec $esclient -- curl -XGET $esip:$esport/active-cyberprobe-002842/_settings?pretty
  "active-cyberprobe-002842" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "box_type" : "warm,hot"
            "require" : {
              "_name" : "es-data-warm-0"
            "total_shards_per_node" : "3"
        "mapping" : {
          "total_fields" : {
            "limit" : "5000"
        "number_of_shards" : "3",
        "provided_name" : "active-cyberprobe-002842",
        "creation_date" : "1536944408316",
        "number_of_replicas" : "1",
        "uuid" : "BGOrErYVSA26X5RmSax5qg",
        "version" : {
          "created" : "6020399"

So it looks like it was trying to move all the the shards of 002842 to es-data-warm-0 for the shrink operation, but it then stalled.

So I removed the index.routing.allocation.require._name setting, set index.routing.allocation.include.box_type to warm only, and hoped it would reallocate the shards across the cluster and it did start doing this.

$ kubectl exec $esclient -- curl -XGET $esip:$esport/_cat/shards | grep 2842

active-cyberprobe-002842   1 r STARTED        2134427 912.5mb  es-data-warm-1
active-cyberprobe-002842   1 p STARTED        2134427     3gb es-data-warm-0
active-cyberprobe-002842   2 r STARTED        2132858 910.4mb  es-data-warm-1
active-cyberprobe-002842   2 p STARTED        2132858   2.8gb es-data-warm-0
active-cyberprobe-002842   0 p RELOCATING     2134552   3.7gb  es-data-hot-0 -> mwK56S8VQRqFld7CzgDVnQ es-data-warm-1
active-cyberprobe-002842   0 r INITIALIZING          es-data-warm-2

Time passed and the cluster remained Yellow. So I looked in the logs and found:

[2018-09-17T12:01:27,417][WARN ][o.e.i.c.IndicesClusterStateService] [es-data-warm-1] [[active-cyberprobe-002842][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [active-cyberprobe-002842][0]: Recovery failed from {es-data-hot-0}{kn0J013QQx69T9H0hSZGuQ}{ghAS5hs1Q3m5E3kv-5Lm3w}{}{}{box_type=hot} into {es-data-warm-1}{mwK56S8VQRqFld7CzgDVnQ}{utLoaBMkTA2kJrwmMGKBIA}{}{}{box_type=warm} (no activity after [30m])
        at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun( [elasticsearch-6.2.3.jar:6.2.3]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun( [elasticsearch-6.2.3.jar:6.2.3]
        at [elasticsearch-6.2.3.jar:6.2.3]
        at java.util.concurrent.ThreadPoolExecutor.runWorker( [?:1.8.0_161]
        at java.util.concurrent.ThreadPoolExecutor$ [?:1.8.0_161]
        at [?:1.8.0_161]
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
        ... 6 more

Now the cluster is under some load, but not particularly excessive and there is headroom on the memory and disks on all the nodes, and the network should easily be able to cope.

After some time I cancelled the relocation of the p0 shard and removed all the allocation requires and includes and we ended up in this state.

$ kubectl exec $esclient -- curl -XGET $esip:$esport/_cat/shards | grep 2842

active-cyberprobe-002842   1 p STARTED      2134427 911.6mb  es-data-hot-0
active-cyberprobe-002842   1 r STARTED      2134427 911.4mb  es-data-hot-2
active-cyberprobe-002842   2 r STARTED      2132858     1gb  es-data-hot-0
active-cyberprobe-002842   2 p STARTED      2132858 910.2mb  es-data-hot-2
active-cyberprobe-002842   0 p STARTED      2134552   3.8gb  es-data-hot-0
active-cyberprobe-002842   0 r UNASSIGNED  

I've tried disabling and reenabling replication, and when it's disabled the cluster will come green, but enable it again and shard 0 r remains in initialising for a long time before it gives up and moves back to unassigned.

So is there anything I can do to resolve this situation?
Can I run any health checks on the files on disk which are the index which refuses to replicate?
Would it be worth setting the index to un-replicated and trying to shrink it?

Was this caused by shrink and prune jobs running concurrently and do i need to serialise them to prevent it happening again?

Thanks for any help you can offer.

(Michael Day) #2

So we fixed this by reindexing the 002842 index onto a new one (I used 009998, as we used 009999 to fix something else the other day, if you care).

I'm still not entirely clear as to what went wrong or why we couldn't replicate shard 0 in the original index, but with the reindex complete all my data is safe so I stopped worrying about it :slight_smile:

(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.