Hello,
I'm running Elasticsearch 5.6.2 on a cluster with 16 nodes, more than 800 indices, and about 20,000 shards.
If I change the configuration and restart Elasticsearch, the following log messages continue for a long time (more than 2 hours), and I can't open Kibana because it times out.
Master node log:
[2017-10-05T13:36:22,710][WARN ][o.e.c.a.s.ShardStateAction] [somehost] [somelog-2017.10.05][11] received shard failed for shard id [[somelog-2017.10.05][11]], allocation id [r8Z9jnqlTf6PfLzOLEleYQ], primary term [4], message [mark copy as stale]
[2017-10-05T13:36:22,711][WARN ][o.e.c.a.s.ShardStateAction] [somehost] [somelog-2017.10.05][5] received shard failed for shard id [[somelog-2017.10.05][5]], allocation id [z3-KDmd-SMytd547Gj7P_Q], primary term [2], message [mark copy as stale]
[2017-10-05T13:36:22,710][WARN ][o.e.c.a.s.ShardStateAction] [somehost] [somelog-2017.10.05][8] received shard failed for shard id [[somelog-2017.10.05][8]], allocation id [3qeFDJAZTyy0ActIVAKKFA], primary term [2], message [mark copy as stale]
[2017-10-05T13:36:22,841][DEBUG][o.e.a.a.i.m.p.TransportPutMappingAction] [somehost] failed to put mappings on indices [[[somelog4-2017.10.05/QgSiWzbLTxWUiNJTyLjGpw]]], type [fluentd]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.6.2.jar:5.6.2]
        at java.util.ArrayList.forEach(ArrayList.java:1249) ~[?:1.8.0_131]
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254) ~[elasticsearch-5.6.2.jar:5.6.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.6.2.jar:5.6.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
[2017-10-05T13:36:23,390][WARN ][o.e.c.a.s.ShardStateAction] [somehost] [somelog2-2017.10.05][14] received shard failed for shard id [[somelog2-2017.10.05][14]], allocation id [TiSxoFI_Q56FcbXRtcoOUw], primary term [3], message [mark copy as stale]
[2017-10-05T13:36:23,397][WARN ][o.e.c.a.s.ShardStateAction] [somehost] [somelog2-2017.10.05][2] received shard failed for shard id [[somelog2-2017.10.05][2]], allocation id [z0e9WxrnSiqRP0ccuFW12g], primary term [5], message [mark copy as stale]
[2017-10-05T13:36:23,421][WARN ][o.e.c.a.s.ShardStateAction] [somehost] [somelog3-2017.10.05][13] received shard failed for shard id [[somelog3-2017.10.05][13]], allocation id [NIV27Oe2RpOZSTAA82WdQA], primary term [5], message [mark copy as stale]
[2017-10-05T13:36:23,422][WARN ][o.e.c.a.s.ShardStateAction] [somehost] [somelog3-2017.10.05][9] received shard failed for shard id [[somelog3-2017.10.05][9]], allocation id [ADnldsuiQ6O29CHpGDTI5Q], primary term [2], message [mark copy as stale]
[2017-10-05T13:36:23,424][WARN ][o.e.c.a.s.ShardStateAction] [somehost] [somelog3-2017.10.05][1] received shard failed for shard id [[somelog3-2017.10.05][1]], allocation id [CvUkq-VyR9eOcXreVFynOQ], primary term [3], message [mark copy as stale]
[2017-10-05T13:36:23,424][WARN ][o.e.c.a.s.ShardStateAction] [somehost] [somelog3-2017.10.05][0] received shard failed for shard id [[somelog3-2017.10.05][0]], allocation id [dh5xKeCxToieP2UCsZ7hyw], primary term [4], message [mark copy as stale]
After waiting more than 2 hours, the cluster state becomes green and I can see Kibana again.
However, this problem makes maintaining Elasticsearch a real struggle.
I know that Elasticsearch's cluster state management is single-threaded for simplicity.
Still, are there any ideas for reducing this maintenance time?
For example, increasing cluster.routing.allocation.node_concurrent_recoveries, or something similar.
I tried to send a PUT request to increase it, but it failed with a timeout error, so I couldn't apply the change.
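For reference, the request I tried looked roughly like this (a sketch rather than the exact command; the value 10 is only an example, I used a transient setting but persistent would also work, and I have not verified whether a longer master_timeout actually helps while the master is busy):

curl -s -XPUT 'localhost:9200/_cluster/settings?master_timeout=300s' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 10
  }
}'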
The cluster allocation explain API result is as follows:
#curl -s -XGET 'localhost:9200/_cluster/allocation/explain' | python -m json.tool
{
    "allocate_explanation": "allocation temporarily throttled",
    "can_allocate": "throttled",
    "current_state": "unassigned",
    "index": "...-2017.10.05",
    "node_allocation_decisions": [
        {
            "deciders": [
                {
                    "decider": "throttling",
                    "decision": "THROTTLE",
                    "explanation": "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
                }
            ],
            "node_attributes": {
                "ml.enabled": "true",
                "ml.max_open_jobs": "10"
            },
            "node_decision": "throttled",
            "node_id": "...",
            "node_name": "...",
            "transport_address": "...:9300"
        },
        {
            "deciders": [
                {
                    "decider": "throttling",
                    "decision": "THROTTLE",
                    "explanation": "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
                }
            ],
            "node_attributes": {
                "ml.enabled": "true",
                "ml.max_open_jobs": "10"
            },
            "node_decision": "throttled",
            "node_id": "...",
            "node_name": "...",
            "transport_address": "...:9300"
        },
        ...