Node crashes due to max number of docs exceeded, but ILM shouldn't have let it happen

Elasticsearch Version: 7.3.0
Index: metricbeat-7.3.0 (noting this because the data rates have been consistent for the past month)

We noticed that two out of our three Elasticsearch nodes (m4.4xlarge) crashed. Looking at the logs, this was the last message, after a series of garbage collection info logs:

[2019-09-28T16:47:12,831][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ops-elk-1] fatal error in thread [elasticsearch[ops-elk-1][write][T#2]], exiting
java.lang.AssertionError: noop operation should never fail at document level
	at org.elasticsearch.index.engine.InternalEngine.innerNoOp(InternalEngine.java:1519) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:918) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:792) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:764) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:721) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:256) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:159) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:191) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:116) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:77) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:923) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:108) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:398) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:316) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$15(IndexShard.java:2606) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:112) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:269) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:236) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2580) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:864) ~[elasticsearch-7.3.0.jar:7.3.0]
...
Caused by: java.lang.IllegalArgumentException: number of documents in the index cannot exceed 2147483519
	at org.apache.lucene.index.DocumentsWriterPerThread.reserveOneDoc(DocumentsWriterPerThread.java:225) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.elasticsearch.index.engine.InternalEngine.innerNoOp(InternalEngine.java:1516) ~[elasticsearch-7.3.0.jar:7.3.0]
	... 31 more

We have an ILM policy set for the metricbeat alias, like this:

{
  "metricbeat-7.3.0" : {
    "version" : 1333,
    "modified_date" : "2019-09-17T03:20:42.608Z",
    "policy" : {
      "phases" : {
        "warm" : {
          "min_age" : "10d",
          "actions" : {
            "forcemerge" : {
              "max_num_segments" : 1
            }
          }
        },
        "cold" : {
          "min_age" : "30d",
          "actions" : {
            "freeze" : { }
          }
        },
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "50gb",
              "max_age" : "10d",
              "max_docs" : 1000000000
            }
          }
        },
        "delete" : {
          "min_age" : "60d",
          "actions" : {
            "delete" : { }
          }
        }
      }
    }
  }
}
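
(For reference, the policy JSON above is what the get-lifecycle-policy API returns; assuming the policy is indeed named metricbeat-7.3.0, it can be pulled with a request like this:)

GET _ilm/policy/metricbeat-7.3.0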

Based on the policy, we should never have reached that per-shard limit.
The indexing rate is less than 4000 docs/s. That has been the case for the last month, and we have made no significant additions to what metricbeat monitors.
The index only has one shard.
I tried looking for ILM logs but couldn't find any.
How do I find more information related to this failure scenario? We had to delete the reported index to get the nodes to at least start up and continue with the rest of the indexing.
Also, even if one index exceeds the document limit, the entire node isn't expected to crash, right?
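
(One way to get more detail, sketched here rather than confirmed for this exact release: ILM step execution can be logged more verbosely through the cluster settings API. The logger package below is an assumption; depending on the 7.x version the ILM classes live under org.elasticsearch.xpack.indexlifecycle or org.elasticsearch.xpack.ilm, so the level may need to be set on the other logger instead.)

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.indexlifecycle": "trace"
  }
}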

That's correct, there was a bug in 7.3.0 fixed in 7.3.2 via #46083 that would explain this.

I don't have a good answer for why this index grew so large. Do you have other indices that have not rolled over when you expected? Does the ILM explain API or a dry-run rollover help work out why?
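
(Concretely, those two checks might look something like this, assuming metricbeat-7.3.0 is the write alias for these indices:)

GET metricbeat-7.3.0-*/_ilm/explain

POST metricbeat-7.3.0/_rollover?dry_run=true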

We have been running 7.3 for a month now, and looking at the ILM explain results, the indices have been rolling over at 50GB every time. Even the last index, metricbeat-7.3.0-0000050, rolled over correctly from 49, and the 51st rollover also worked as expected.
I can't do a dry run on that index now, as I had to delete it to get the nodes started. But given that previous rollovers were successful, it might not be a problem with rollover itself?
After deleting the old 51 index, I created a new 51 index to give the metricbeat-7.3.0 alias a write index and continued indexing data. The 52nd rollover then completed successfully this morning.
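
(A sketch of how such a replacement write index can be created by hand; the exact index name here is an assumption, and any settings from the metricbeat index template would be applied on creation:)

PUT metricbeat-7.3.0-000051
{
  "aliases": {
    "metricbeat-7.3.0": {
      "is_write_index": true
    }
  }
}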
I previously saw the same problem with an APM index (apm-7.3.0-span-0000029), and hence edited the default lifecycle policy packaged with Beats to also roll over on number of docs (the default policy rolls over at 50GB), because I believed that was the problem at the time.
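
(That policy edit would have been something along these lines; the policy name is an assumption based on the index name, and since PUT _ilm/policy replaces the whole policy, the warm/cold/delete phases would need to be resubmitted too - they are left out here only for brevity:)

PUT _ilm/policy/apm-7.3.0-span
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "10d",
            "max_docs": 1000000000
          }
        }
      }
    }
  }
}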
In any case, I'll get the ES cluster upgraded to the 7.3.2 patch so that the nodes don't crash again.

Thanks. After you upgrade to 7.3.2 you might see this happen again; the PR I linked above doesn't change anything in that regard, it just makes the failure fatal to the shard rather than to the node. If it does recur, it should be easier to investigate.

Hi,
Sorry to bump this older topic, but I observed yet another case where ILM did not work.
This time it was the apm-transaction index.

Lifecycle Policy:

{
  "apm-7.3.0-transaction": {
    "version": 848,
    "modified_date": "2019-10-17T07:51:44.658Z",
    "policy": {
      "phases": {
        "warm": {
          "min_age": "10d",
          "actions": {
            "forcemerge": {
              "max_num_segments": 1
            }
          }
        },
        "cold": {
          "min_age": "30d",
          "actions": {
            "freeze": {}
          }
        },
        "hot": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_size": "50gb",
              "max_age": "10d",
              "max_docs": 1000000000
            }
          }
        },
        "delete": {
          "min_age": "60d",
          "actions": {
            "delete": {}
          }
        }
      }
    }
  }
}

However, after running _cat/shards, I found that one rolled-over index has grown to nearly 300GB:

apm-7.3.0-transaction-000004            0 p STARTED      769005887  297.7gb 10.0.196.65  elk-1
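
(For reference, that line came from a request along these lines; the index pattern filter is just a convenience:)

GET _cat/shards/apm-7.3.0-transaction-*?v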

Yes, the lifecycle policy does seem to be attached to this index via the default template:

"settings": {
      "index": {
        "mapping": {
          "total_fields": {
            "limit": "2000"
          }
        },
        "refresh_interval": "5s",
        "blocks": {
          "read_only_allow_delete": "false"
        },
        "provided_name": "<apm-7.3.0-transaction-000004>",
        "query": {},
        "creation_date": "1567073623157",
        "priority": "100",
        "number_of_replicas": "1",
        "uuid": "54DSnyYSS2yl6v78u9QzYQ",
        "version": {
          "created": "7030099"
        },
        "lifecycle": {
          "name": "apm-7.3.0-transaction",
          "rollover_alias": "apm-7.3.0-transaction"
        },
        "codec": "best_compression",
        "number_of_shards": "1"
      }
    }
  }
}

As you can see, the index was also created two months ago, so it has exceeded not only the size threshold but also the age at which it should have rolled over.
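
(One thing that could be checked here, offered as a sketch rather than a confirmed diagnosis: whether the rollover_alias above still designates this index as its write index, since ILM's rollover step can end up stuck in an error state if it does not:)

GET _alias/apm-7.3.0-transaction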

What does ILM explain say about this index?
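
(i.e. something along the lines of:)

GET apm-7.3.0-transaction-000004/_ilm/explain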

I can show a different configuration where similar behaviour is observed.
In this case there is a simple policy that deletes indices after one day and rolls over on size (30GB in the version shown below).

{
  "dataitems" : {
    "version" : 2,
    "modified_date" : "2019-11-01T12:35:24.764Z",
    "policy" : {
      "phases" : {
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "30gb",
              "max_age" : "1d"
            }
          }
        },
        "delete" : {
          "min_age" : "1d",
          "actions" : {
            "delete" : { }
          }
        }
      }
    }
  }
}

Note: an earlier version of this policy did not have a delete phase.
Here's the ILM explain output for dataitems-2019.10.31-000004.
The creation date of this index is newer than the latest policy changes.

{
  "indices" : {
    "dataitems-2019.10.31-000004" : {
      "index" : "dataitems-2019.10.31-000004",
      "managed" : true,
      "policy" : "dataitems",
      "lifecycle_date_millis" : 1572584614067,
      "phase" : "completed",
      "phase_time_millis" : 1572584614513,
      "action" : "completed",
      "action_time_millis" : 1572584614513,
      "step" : "completed",
      "step_time_millis" : 1572584614513,
      "phase_execution" : {
        "policy" : "dataitems",
        "version" : 1,
        "modified_date_in_millis" : 1572506204442
      }
    }
  }
}

I see that the new policy version has not been applied to the dataitems-2019.10.31-000004 index.
Does this mean that existing indices do not automatically pick up newer versions of a policy?
