Elasticsearch Docker restart

Hi everyone,

We have an Elasticsearch instance that we feed with data from CI jobs.
It is a single-node instance.
Here is some data from the cluster:

  "cluster_name" : "docker-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 998,
  "active_shards" : 998,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 931,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 51.73665111456713

Sometimes we see the Elasticsearch Docker container restart.
Today I managed to find the possible culprit in the logs, but I can't quite work out what exactly went wrong.
Here is the stack trace:

2022-09-16T13:51:41.553888557Z {"type": "server", "timestamp": "2022-09-16T13:51:41,553Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000047] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"update-rollover-lifecycle-date\"}] to [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"set-indexing-complete\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:51:41.916359627Z {"type": "server", "timestamp": "2022-09-16T13:51:41,915Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000048] from [{\"phase\":\"hot\",\"action\":\"set_priority\",\"name\":\"set_priority\"}] to [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"wait-for-indexing-complete\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:51:42.014259216Z {"type": "server", "timestamp": "2022-09-16T13:51:42,013Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000047] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"set-indexing-complete\"}] to [{\"phase\":\"hot\",\"action\":\"complete\",\"name\":\"complete\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:51:42.360122383Z {"type": "server", "timestamp": "2022-09-16T13:51:42,359Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000048] from [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"wait-for-indexing-complete\"}] to [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"wait-for-follow-shard-tasks\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:58:23.057165197Z {"type": "server", "timestamp": "2022-09-16T13:58:23,054Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "fatal error in thread [elasticsearch[elasticsearch][search][T#24]], exiting", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA" , 
2022-09-16T13:58:23.057217798Z "stacktrace": ["java.lang.AssertionError: unexpected higher total ops [41] compared to expected [40]",
2022-09-16T13:58:23.057221480Z "at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:403) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057224203Z "at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:70) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057226443Z "at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:258) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057228695Z "at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057231077Z "at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057233348Z "at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:408) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057235517Z "at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:640) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057237557Z "at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057239739Z "at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1290) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057241780Z "at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1251) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057244677Z "at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1229) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057260466Z "at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:52) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057262610Z "at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:43) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057264825Z "at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:27) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057267483Z "at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057269903Z "at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057271869Z "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057274418Z "at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057277553Z "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057279401Z "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057282028Z "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]",
2022-09-16T13:58:23.057283931Z "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]",
2022-09-16T13:58:23.057285936Z "at java.lang.Thread.run(Thread.java:832) [?:?]",
2022-09-16T13:58:23.057287760Z "Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException: Shard failures",

What jumps out at me is the following:
java.lang.AssertionError: unexpected higher total ops [41] compared to expected [40]

Is it because Elasticsearch was overwhelmed with operations?
It seems it was doing some housekeeping with index lifecycle management at the time.
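
For what it's worth, this is roughly how I check what ILM is currently doing with an index. It is only a sketch with the Python client; the host and the index name (taken from the logs above) are assumptions:

  # Sketch: ask ILM which phase/action/step an index is in
  # (equivalent to GET test-report-index-000048/_ilm/explain).
  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://localhost:9200")

  explain = es.ilm.explain_lifecycle(index="test-report-index-000048")
  for name, info in explain["indices"].items():
      print(name, info.get("phase"), info.get("action"), info.get("step"))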

Thanks

You probably need to look at reducing the shard count quite heavily. Ideally we would recommend fewer than 700 shards on a single node, and your current level is likely to be putting the node under pressure.
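
As a starting point, something like the sketch below shows which indices the shards belong to, and on a single node you can also drop the replica shards, which can never be assigned anyway (that is where the 931 unassigned shards in your cluster health come from). This is only a sketch with the Python client; the host and the index pattern are assumptions, so adjust them to your setup:

  # Sketch: list indices by primary shard count, then set replicas to 0,
  # since replicas cannot be allocated on a single-node cluster.
  # Host, missing auth and the index pattern are assumptions.
  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://localhost:9200")

  # Equivalent to GET _cat/indices?h=index,pri,rep,docs.count&s=pri:desc
  for line in es.cat.indices(h="index,pri,rep,docs.count", s="pri:desc").splitlines()[:20]:
      print(line)

  # Equivalent to PUT test-report-index-*/_settings with "number_of_replicas": 0
  es.indices.put_settings(index="test-report-index-*",
                          body={"index": {"number_of_replicas": 0}})

Note that dropping replicas only clears the yellow status; it is the roughly 1000 primary shards that put heap pressure on the node, so you still want to shrink, delete, or let ILM clean up older indices to get the count down.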