Hello!
Today I found the cluster in "yellow" status: one shard of the system index .kibana_task_manager_1 was Unassigned.
Looking at it with
GET /_cluster/allocation/explain
I found this error:
obtaining shard lock timed out after 5000ms
(it affected only this one index, .kibana_task_manager_1)
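For reference, the same explain call can be narrowed down to the affected shard (shard 0 and "primary": false are taken from the master log entry further below; adjust them as needed):

GET /_cluster/allocation/explain
{
  "index": ".kibana_task_manager_1",
  "shard": 0,
  "primary": false
}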
I immediately ran
POST /_cluster/reroute?retry_failed
and the cluster went green again.
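A quick sanity check afterwards, using just the standard health and cat endpoints (shown here as an example):

GET /_cluster/health
GET /_cat/shards/.kibana_task_manager_1?v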
In the logs of the elected master I found the following for a number of indices:
[2020-10-15T06:14:57,913][ERROR][o.e.c.a.s.ShardStateAction] [es-tracker2b.node.ru] [event_log_tracker_2020_10-000001][3] unexpected failure while failing shard [shard id [[event_log_tracker_2020_10-000001][3]], allocation id [fYW_RhIGQtmaBLs_bmyxcQ], primary term [3], message [failed to perform indices:data/write/bulk[s] on replica [event_log_tracker_2020_10-000001][3], node[TBCq9WTdTQaAYTkufG9s2Q], [R], s[STARTED], a[id=fYW_RhIGQtmaBLs_bmyxcQ]], failure [RemoteTransportException[[es-tracker5b.node.ru][192.168.60.71:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[event_log_tracker_2020_10-000001][3] operation primary term [3] is too old (current [4])]; ], markAsStale [true]]
org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [3] did not match current primary term [4]
    at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:365) ~[elasticsearch-7.8.1.jar:7.8.1]
    at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.8.1.jar:7.8.1]
    at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.8.1.jar:7.8.1]
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [elasticsearch-7.8.1.jar:7.8.1]
    at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.8.1.jar:7.8.1]
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.8.1.jar:7.8.1]
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.8.1.jar:7.8.1]
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.8.1.jar:7.8.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636) [elasticsearch-7.8.1.jar:7.8.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.8.1.jar:7.8.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.8.1.jar:7.8.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
    at java.lang.Thread.run(Thread.java:832) [?:?]
And in the same logs on the master, specifically for the .kibana_task_manager_1 index:
[2020-10-15T06:17:14,771][WARN ][o.e.c.r.a.AllocationService] [es-tracker2b.node.ru] failing shard [failed shard, shard [.kibana_task_manager_1][0], node[E6wviW6_RweNcQZMQeYMhw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=zEqj95CaR3qmGq0xht9jBw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-10-15T03:17:04.328Z], failed_attempts[4], failed_nodes[[E6wviW6_RweNcQZMQeYMhw]], delayed=false, details[failed shard on node [E6wviW6_RweNcQZMQeYMhw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[.kibana_task_manager_1][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]], expected_shard_size[31336], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[.kibana_task_manager_1][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], markAsStale [true]]
java.io.IOException: failed to obtain in-memory shard lock
The version is 7.8.1; the cluster has 6 nodes, each holding both the master and data roles.
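For completeness, this is roughly how the roles can be confirmed via the cat API (the columns listed are standard ones in 7.x):

GET /_cat/nodes?v&h=name,node.role,master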
Judging by the graphs, I have not seen a node drop out of the cluster. How serious is this problem?
I have a feeling that something deeper and more serious is lurking here and that it will flare up worse than a single unassigned shard.
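In case it matters, a simple way to keep an eye on unassigned shards is something like this (the column list here is just an example):

GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state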