Using Elasticsearch with remote storage

We have an AKS cluster in the US running an Elasticsearch pod (version 7.16.0) backed by an Azure File Share storage account in Europe. Even after a clean startup, the Elasticsearch logs continuously show warnings and errors like the ones below, and the pod is essentially unresponsive (e.g. creating an index also times out):

{"type": "server", "timestamp": "2022-04-29T15:50:22,390Z", "level": "WARN", "component": "o.e.g.PersistedClusterStateService", "cluster.name": "search-123", "node.name": "search-123-es-master-0", "message": "writing cluster state took [15422ms] which is above the warn threshold of [10s]; wrote full state with [1] indices" }

{"type": "server", "timestamp": "2022-04-29T15:51:00,933Z", "level": "ERROR", "component": "o.e.x.m.e.l.LocalExporter", "cluster.name": "search-123", "node.name": "search-123-es-master-0", "message": "failed to set monitoring template [.monitoring-alerts-7]", "cluster.uuid": "lIkSuRkeRoucLcK7W3tvBw", "node.id": "vBeM0YOUSLOQR4wi7eyRIg",
"stacktrace": ["**org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [.monitoring-alerts-7], cause [api]) within 30s**",
"at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0( [elasticsearch-7.16.0.jar:7.16.0]",
"at java.util.ArrayList.forEach( [?:?]",
"at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1( [elasticsearch-7.16.0.jar:7.16.0]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ [elasticsearch-7.16.0.jar:7.16.0]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker( [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$ [?:?]",
"at [?:?]"] }

Is this caused by our use of remote storage for Elasticsearch? Is this setup not valid?

Other notes: we are running a single-node cluster, we've tried the SMB workaround for Azure File Shares, and we've upgraded to Azure's premium storage tier for lower latency.
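For anyone hitting the same symptom: one way to confirm that slow storage is behind the `PersistedClusterStateService` warning is to time small `fsync`'d writes against the data path from inside the pod. This is a minimal sketch (the probe function and file names are ours, not part of Elasticsearch); cluster-state persistence does something roughly similar, many small durable writes, which is exactly the pattern that suffers over SMB to a remote region.

```python
import os
import tempfile
import time

def fsync_latency(path, size=4096, iterations=50):
    """Average the cost of a small write followed by fsync, roughly the
    I/O pattern Elasticsearch uses when persisting cluster state."""
    payload = os.urandom(size)
    fname = os.path.join(path, "latency-probe.tmp")
    start = time.monotonic()
    with open(fname, "wb") as f:
        for _ in range(iterations):
            f.seek(0)
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write to durable storage
    elapsed = time.monotonic() - start
    os.remove(fname)
    return elapsed / iterations  # seconds per fsync'd write

# Probe a local directory here; on the pod you would point this at the
# Elasticsearch data path (assumed: /usr/share/elasticsearch/data).
print(f"avg fsync latency: {fsync_latency(tempfile.gettempdir()) * 1000:.2f} ms")
```

On local SSD this is typically well under a millisecond; if the same probe against the mounted share takes tens or hundreds of milliseconds per write, that lines up with the 15s cluster-state write in the log above.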

Welcome to our community! :smiley:

Yes, and yes. This is not supported and likely to cause a bunch of issues.


Thanks @warkolm for the welcome and quick response! We will look into alternative configurations.
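For completeness, here is a hedged sketch of one common alternative on AKS: backing the data path with zone-local Azure managed disks (block storage) instead of an Azure Files share. The node-pool naming suggests the cluster is managed by the ECK operator, so this uses an ECK `Elasticsearch` resource; the storage class, sizes, and names are illustrative assumptions, not taken from the original cluster.

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: search-123
spec:
  version: 7.16.0
  nodeSets:
    - name: master
      count: 1
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data   # claim name ECK expects for the data path
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 100Gi         # illustrative size
            # AKS built-in Premium SSD (block) storage class; keeps the
            # disk in the same region/zone as the node, unlike a
            # cross-region file share.
            storageClassName: managed-csi-premium
```

The key point is `ReadWriteOnce` block storage attached to the node, rather than an SMB/NFS file share, which matches what the answer above calls supported.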
