Elasticsearch Version: 7.3.0
Index: metricbeat-7.3.0 (noting that data rates have been consistent for the past month)
We noticed that two out of our three Elasticsearch nodes (m4.4xlarge) crashed. Looking at the logs, this was the last message after a series of garbage collection INFO logs:
[2019-09-28T16:47:12,831][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ops-elk-1] fatal error in thread [elasticsearch[ops-elk-1][write][T#2]], exiting
java.lang.AssertionError: noop operation should never fail at document level
at org.elasticsearch.index.engine.InternalEngine.innerNoOp(InternalEngine.java:1519) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:918) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:792) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:764) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:721) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:256) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:159) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:191) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:116) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:77) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:923) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:108) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:398) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:316) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$15(IndexShard.java:2606) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:112) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:269) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:236) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2580) ~[elasticsearch-7.3.0.jar:7.3.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:864) ~[elasticsearch-7.3.0.jar:7.3.0]
at
...
Caused by: java.lang.IllegalArgumentException: number of documents in the index cannot exceed 2147483519
at org.apache.lucene.index.DocumentsWriterPerThread.reserveOneDoc(DocumentsWriterPerThread.java:225) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
at org.elasticsearch.index.engine.InternalEngine.innerNoOp(InternalEngine.java:1516) ~[elasticsearch-7.3.0.jar:7.3.0]
... 31 more
We have an ILM policy set for the metricbeat alias like this:
{
"metricbeat-7.3.0" : {
"version" : 1333,
"modified_date" : "2019-09-17T03:20:42.608Z",
"policy" : {
"phases" : {
"warm" : {
"min_age" : "10d",
"actions" : {
"forcemerge" : {
"max_num_segments" : 1
}
}
},
"cold" : {
"min_age" : "30d",
"actions" : {
"freeze" : { }
}
},
"hot" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_size" : "50gb",
"max_age" : "10d",
"max_docs" : 1000000000
}
}
},
"delete" : {
"min_age" : "60d",
"actions" : {
"delete" : { }
}
}
}
}
}
}
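For reference, the policy is attached through our index template, roughly like this (a sketch, not the exact template: the template name and index pattern below are illustrative, but rollover depends on index.lifecycle.name and index.lifecycle.rollover_alias being set):

# Illustrative template; the name and pattern are assumptions about our setup
PUT _template/metricbeat-7.3.0
{
  "index_patterns" : ["metricbeat-7.3.0-*"],
  "settings" : {
    "index.number_of_shards" : 1,
    "index.lifecycle.name" : "metricbeat-7.3.0",
    "index.lifecycle.rollover_alias" : "metricbeat-7.3.0"
  }
}

# Verify the live index actually carries those settings
GET metricbeat-7.3.0-*/_settings/index.lifecycle*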
Based on the policy, we should never have reached that per-shard document limit: at under 4,000 docs/s (roughly 345 million docs/day), the rollover condition of max_docs: 1000000000 should trigger within about three days, well before a single shard could accumulate Lucene's 2,147,483,519-document maximum.
The indexing rate has stayed below 4,000/s consistently for the last month, and we have made no significant additions to the set of Metricbeat modules we collect.
The index only has one shard.
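For what it's worth, this is roughly the check I would run to see how close each shard gets to that cap (a sketch; as far as I understand, the 2,147,483,519 limit counts Lucene documents per shard, so nested documents and deletes that have not yet been merged away also count toward it):

# Per-shard document counts, largest first
GET _cat/shards/metricbeat-*?v&h=index,shard,prirep,docs,store&s=docs:desc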
I tried looking for ILM logs but couldn't find any.
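The only other place I know to look is the ILM explain and status APIs; these are the calls I'm planning to run against the current metricbeat indices (assuming the endpoints behave the same on 7.3.0):

# Per-index ILM state: current phase, action, step, and any step error info
GET metricbeat-7.3.0-*/_ilm/explain

# Whether the ILM service is RUNNING, STOPPING, or STOPPED cluster-wide
GET _ilm/status

If there is also a way to raise ILM logging verbosity via the cluster settings loggers, I'd appreciate the exact logger name for 7.3.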
How do I find more information about this failure scenario? We had to delete the affected index just to get the nodes to start up again and continue with the rest of the indexing.
Also, if a single index exceeds the document limit, the entire node isn't expected to crash outright, right?