We ran into a situation after migrating to Elasticsearch 7.3. We run 3 dedicated masters and 4 data nodes on Google Cloud Platform; each data node has 24 vCPUs and 30 GB of RAM and is used primarily for ingesting logs. One of the data nodes was constantly being dropped from the cluster, and it only ever happened to this single node. We initially suspected either a load issue that prevented the data node from acknowledging cluster state updates from the master, or a networking issue, but we couldn't find any evidence of the latter, and adding CPUs to the data nodes didn't help.
We think we found the issue (or at least a symptom), but it's a bit puzzling, and I would like to get some feedback/opinions if possible. Let me provide a bit of context:
After the migration to v7.3, one of our index templates was not set up properly, and a large index was created with a single shard (>200 GB). That shard was allocated to the node that kept disconnecting from the cluster (a rough sketch of the corrected template is at the end of this post). After checking the logs we noticed the following:
[2019-08-07T02:13:24,055][WARN ][o.e.i.e.Engine ] [elastic-data4-prod-us-central1-55q5] [hsc-searchresultlogs-prod_2019.08.06][0] failed engine [exception during primary-replica resync]
org.elasticsearch.transport.RemoteTransportException: [elastic-data4-prod-us-central1-55q5][10.248.24.143:9300][internal:index/seq_no/resync[p]]
Caused by: org.elasticsearch.action.UnavailableShardsException: [hsc-searchresultlogs-prod_2019.08.06][0] Not enough active copies to meet shard count of [DEFAULT] (have 0, needed DEFAULT). Timeout: [1m], request: [TransportResyncReplicationAction.Request{shardId=[hsc-searchresultlogs-prod_2019.08.06][0], timeout=1m, index='hsc-searchresultlogs-prod_2019.08.06', trimAboveSeqNo=22351547, maxSeenAutoIdTimestampOnPrimary=1565137505204, ops=0}]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:102) [elasticsearch-7.3.0.jar:7.3.0]
...
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2580) [elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:864) [elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:312) [elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:275) [elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:703) [elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) [elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0.jar:7.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
[2019-08-07T02:13:24,088][WARN ][o.e.t.ThreadPool ] [elastic-data4-prod-us-central1-55q5] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@20dd1994] on thread pool [same]
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
...
at org.elasticsearch.indices.IndexingMemoryController.getShardWritingBytes(IndexingMemoryController.java:182) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.runUnlocked(IndexingMemoryController.java:310) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:290) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:225) [elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0.jar:7.3.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
[2019-08-07T02:13:29,090][WARN ][o.e.t.ThreadPool ] [elastic-data4-prod-us-central1-55q5] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@20dd1994] on thread pool [same]
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:681) ~[lucene-core-
It seems like the translog of that shard was corrupted. After we deleted the index, everything has been stable.
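In case it's useful to anyone hitting something similar, here is a rough sketch of how the per-shard translog stats and shard layout for that index could be inspected via the standard index-stats and _cat APIs. The host is a placeholder and this is illustrative, not the exact procedure we followed:

```python
# Rough sketch: inspect per-shard translog stats and shard sizes for the
# suspect index. Host is a placeholder; endpoints are the stock
# index-stats and _cat/shards APIs.
import requests

ES = "http://localhost:9200"  # placeholder: any node in the cluster
INDEX = "hsc-searchresultlogs-prod_2019.08.06"

# Per-shard translog stats (operation count and size on disk).
stats = requests.get(f"{ES}/{INDEX}/_stats/translog",
                     params={"level": "shards"}).json()
for shard_id, copies in stats["indices"][INDEX]["shards"].items():
    for copy in copies:
        routing = copy["routing"]      # which node holds this copy
        translog = copy["translog"]
        print(shard_id,
              "primary" if routing["primary"] else "replica",
              routing["node"],
              translog["operations"],
              translog["size_in_bytes"])

# Shard sizes and allocation, to spot oversized single-shard indices.
print(requests.get(f"{ES}/_cat/shards/{INDEX}",
                   params={"v": "true",
                           "h": "index,shard,prirep,state,store,node"}).text)
```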
Is this a plausible scenario: a single shard's corrupted translog causing the entire data node to get kicked out of the cluster? We normally run with 5 shards and 1 replica per index, but it still seems strange that the data node didn't OOM or show any other signs of distress; it just kept dropping out of the cluster.
Anyhow, I would appreciate any input or comments on this.
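For completeness, here is a rough sketch of what the corrected template looks like. The template name and index pattern are illustrative, and the shard/replica counts just reflect the 5-shard / 1-replica layout we normally use. (In 7.x the default number_of_shards dropped from 5 to 1, which, as far as we can tell, is why the index ended up as a single huge shard when the template didn't apply.)

```python
# Rough sketch of the template fix (name and pattern are illustrative).
# Uses the legacy _template API available in 7.3.
import requests

ES = "http://localhost:9200"  # placeholder: any node in the cluster

template = {
    "index_patterns": ["hsc-searchresultlogs-prod_*"],
    "settings": {
        "number_of_shards": 5,   # 7.x defaults to 1 if no template matches
        "number_of_replicas": 1,
    },
}

resp = requests.put(f"{ES}/_template/hsc-searchresultlogs-prod", json=template)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```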