Over the weekend my ES cluster went Yellow.
I'm running 6.2.3 in a Kubernetes cluster on GKE, with 3 hot data nodes, 4 warm data nodes, 3 masters and 3 clients. We have k8s CronJobs which run prune, rollover and shrink: rollover runs hourly on the half hour, shrink runs nightly, and prune runs on Sundays at the same time as the Sunday shrink.
It looks like things broke when the prune and shrink ran on the Sunday.
When I got in to work on the Monday, there was one damaged index, which looked like this:
active-cyberprobe-002842 1 r STARTED
active-cyberprobe-002842 1 p STARTED
active-cyberprobe-002842 2 r STARTED
active-cyberprobe-002842 2 p STARTED
active-cyberprobe-002842 0 p RELOCATING
active-cyberprobe-002842 0 r UNASSIGNED NODE_LEFT
and
$ kubectl exec $esclient -- curl -XGET $esip:$esport/active-cyberprobe-002842/_settings?pretty
"active-cyberprobe-002842" : {
"settings" : {
"index" : {
"routing" : {
"allocation" : {
"include" : {
"box_type" : "warm,hot"
},
"require" : {
"_name" : "es-data-warm-0"
},
"total_shards_per_node" : "3"
}
},
"mapping" : {
"total_fields" : {
"limit" : "5000"
}
},
"number_of_shards" : "3",
"provided_name" : "active-cyberprobe-002842",
"creation_date" : "1536944408316",
"number_of_replicas" : "1",
"uuid" : "BGOrErYVSA26X5RmSax5qg",
"version" : {
"created" : "6020399"
}
}
}
}
}
So it looks like it was trying to move all the shards of 002842 to es-data-warm-0 for the shrink operation, but then stalled.
So I removed the index.routing.allocation.require._name setting and set index.routing.allocation.include.box_type to warm only, hoping it would reallocate the shards across the warm nodes, and it did start doing this:
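For reference, the settings change was roughly the following (a sketch using the same $esclient/$esip/$esport variables as the commands above; in 6.x, setting a value to null removes it):

```shell
# Drop the node pin left over from the stalled shrink, and restrict
# allocation to the warm tier only.
kubectl exec $esclient -- curl -XPUT "$esip:$esport/active-cyberprobe-002842/_settings" \
  -H 'Content-Type: application/json' -d '{
  "index.routing.allocation.require._name": null,
  "index.routing.allocation.include.box_type": "warm"
}'
```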
$ kubectl exec $esclient -- curl -XGET $esip:$esport/_cat/shards | grep 2842
active-cyberprobe-002842 1 r STARTED 2134427 912.5mb 10.0.16.5 es-data-warm-1
active-cyberprobe-002842 1 p STARTED 2134427 3gb 10.0.26.20 es-data-warm-0
active-cyberprobe-002842 2 r STARTED 2132858 910.4mb 10.0.16.5 es-data-warm-1
active-cyberprobe-002842 2 p STARTED 2132858 2.8gb 10.0.26.20 es-data-warm-0
active-cyberprobe-002842 0 p RELOCATING 2134552 3.7gb 10.0.24.4 es-data-hot-0 -> 10.0.16.5 mwK56S8VQRqFld7CzgDVnQ es-data-warm-1
active-cyberprobe-002842 0 r INITIALIZING 10.0.16.10 es-data-warm-2
Time passed and the cluster remained Yellow. So I looked in the logs and found:
[2018-09-17T12:01:27,417][WARN ][o.e.i.c.IndicesClusterStateService] [es-data-warm-1] [[active-cyberprobe-002842][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [active-cyberprobe-002842][0]: Recovery failed from {es-data-hot-0}{kn0J013QQx69T9H0hSZGuQ}{ghAS5hs1Q3m5E3kv-5Lm3w}{10.0.24.4}{10.0.24.4:9300}{box_type=hot} into {es-data-warm-1}{mwK56S8VQRqFld7CzgDVnQ}{utLoaBMkTA2kJrwmMGKBIA}{10.0.16.5}{10.0.16.5:9300}{box_type=warm} (no activity after [30m])
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:286) [elasticsearch-6.2.3.jar:6.2.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.3.jar:6.2.3]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.3.jar:6.2.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
... 6 more
Now the cluster is under some load, but nothing excessive: there is headroom on memory and disk on all the nodes, and the network should easily be able to cope.
After some time I cancelled the relocation of the p0 shard and removed all the allocation requires and includes, and we ended up in this state:
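This is roughly what I ran (a sketch; the shard/node for the cancel come from the _cat/shards output above, where p0 was relocating to es-data-warm-1):

```shell
# Cancel the stuck relocation of primary shard 0 on its target node,
# then clear both allocation filters on the index.
kubectl exec $esclient -- curl -XPOST "$esip:$esport/_cluster/reroute" \
  -H 'Content-Type: application/json' -d '{
  "commands": [
    { "cancel": { "index": "active-cyberprobe-002842", "shard": 0,
                  "node": "es-data-warm-1", "allow_primary": true } }
  ]
}'
kubectl exec $esclient -- curl -XPUT "$esip:$esport/active-cyberprobe-002842/_settings" \
  -H 'Content-Type: application/json' -d '{
  "index.routing.allocation.require._name": null,
  "index.routing.allocation.include.box_type": null
}'
```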
$ kubectl exec $esclient -- curl -XGET $esip:$esport/_cat/shards | grep 2842
active-cyberprobe-002842 1 p STARTED 2134427 911.6mb 10.0.24.4 es-data-hot-0
active-cyberprobe-002842 1 r STARTED 2134427 911.4mb 10.0.11.8 es-data-hot-2
active-cyberprobe-002842 2 r STARTED 2132858 1gb 10.0.24.4 es-data-hot-0
active-cyberprobe-002842 2 p STARTED 2132858 910.2mb 10.0.11.8 es-data-hot-2
active-cyberprobe-002842 0 p STARTED 2134552 3.8gb 10.0.24.4 es-data-hot-0
active-cyberprobe-002842 0 r UNASSIGNED
I've tried disabling and re-enabling replication: with replicas disabled the cluster goes green, but when I enable them again shard 0's replica sits in INITIALIZING for a long time before it gives up and moves back to UNASSIGNED.
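The toggle itself was just the replica count on the index settings, something like:

```shell
# Replicas off: cluster goes green.
kubectl exec $esclient -- curl -XPUT "$esip:$esport/active-cyberprobe-002842/_settings" \
  -H 'Content-Type: application/json' -d '{ "index.number_of_replicas": 0 }'
# Replicas back on: shard 0's copy initialises, times out after ~30m,
# and returns to UNASSIGNED.
kubectl exec $esclient -- curl -XPUT "$esip:$esport/active-cyberprobe-002842/_settings" \
  -H 'Content-Type: application/json' -d '{ "index.number_of_replicas": 1 }'
```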
So is there anything I can do to resolve this situation?
Can I run any health checks on the on-disk files for the index that refuses to replicate?
Would it be worth setting the index to un-replicated and trying to shrink it?
Was this caused by the shrink and prune jobs running concurrently, and do I need to serialise them to prevent it happening again?
Thanks for any help you can offer.