Elasticsearch Docker restart

Hi everyone,

We have an Elasticsearch instance that we feed with data from CI jobs.
It is a single-node instance.
Here is some data from the cluster health API:

  "cluster_name" : "docker-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 998,
  "active_shards" : 998,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 931,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 51.73665111456713

Sometimes we see the Elasticsearch Docker container restart.
Today I managed to find a possible culprit in the logs, but I can't quite work out what exactly went wrong.
Here is the stack trace:

2022-09-16T13:51:41.553888557Z {"type": "server", "timestamp": "2022-09-16T13:51:41,553Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000047] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"update-rollover-lifecycle-date\"}] to [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"set-indexing-complete\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:51:41.916359627Z {"type": "server", "timestamp": "2022-09-16T13:51:41,915Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000048] from [{\"phase\":\"hot\",\"action\":\"set_priority\",\"name\":\"set_priority\"}] to [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"wait-for-indexing-complete\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:51:42.014259216Z {"type": "server", "timestamp": "2022-09-16T13:51:42,013Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000047] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"set-indexing-complete\"}] to [{\"phase\":\"hot\",\"action\":\"complete\",\"name\":\"complete\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:51:42.360122383Z {"type": "server", "timestamp": "2022-09-16T13:51:42,359Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000048] from [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"wait-for-indexing-complete\"}] to [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"wait-for-follow-shard-tasks\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:58:23.057165197Z {"type": "server", "timestamp": "2022-09-16T13:58:23,054Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "fatal error in thread [elasticsearch[elasticsearch][search][T#24]], exiting", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA" , 
2022-09-16T13:58:23.057217798Z "stacktrace": ["java.lang.AssertionError: unexpected higher total ops [41] compared to expected [40]",
2022-09-16T13:58:23.057221480Z "at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:403) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057224203Z "at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:70) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057226443Z "at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:258) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057228695Z "at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057231077Z "at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057233348Z "at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:408) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057235517Z "at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:640) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057237557Z "at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057239739Z "at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1290) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057241780Z "at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1251) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057244677Z "at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1229) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057260466Z "at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:52) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057262610Z "at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:43) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057264825Z "at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:27) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057267483Z "at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057269903Z "at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057271869Z "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057274418Z "at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057277553Z "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057279401Z "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057282028Z "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]",
2022-09-16T13:58:23.057283931Z "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]",
2022-09-16T13:58:23.057285936Z "at java.lang.Thread.run(Thread.java:832) [?:?]",
2022-09-16T13:58:23.057287760Z "Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException: Shard failures",

What jumps out at me is the following:
java.lang.AssertionError: unexpected higher total ops [41] compared to expected [40]

Is it because Elasticsearch was overwhelmed with operations?
It seems it was doing some housekeeping with index lifecycle management at the time.

Thanks

You probably need to look at reducing the shard count pretty heavily. Ideally we would recommend fewer than 700 shards on a single node, and your current level is likely putting the node under pressure.
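
If it helps, the cat allocation and cat indices APIs give a quick view of how many shards the node is carrying and which indices are tiny, along these lines (the localhost:9200 endpoint is just an example, adjust to your setup):

  # shard count and disk usage per node
  curl -s 'http://localhost:9200/_cat/allocation?v'
  # indices sorted by store size, to spot the very small ones
  curl -s 'http://localhost:9200/_cat/indices?v&s=store.size'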

OK, I see, thanks for the input!
Since this is a single node only, do the replica shards contribute to the shard count overhead?
I'm taking action to reduce the shard count, as some index policies were too loose and would create a new index with just a few MB of data.
I then read this post from Elastic recommending that a shard hold around 20-50 GB:
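
For example, I'm thinking of tightening the rollover conditions in the policy to something roughly like this (just a sketch with only the hot phase shown; the 50gb/30d thresholds are placeholders I would still need to tune):

  # sketch only: roll over when an index approaches the recommended shard size
  curl -s -X PUT 'http://localhost:9200/_ilm/policy/lifetime' \
    -H 'Content-Type: application/json' -d '
  {
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": { "max_size": "50gb", "max_age": "30d" }
          }
        }
      }
    }
  }'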

Given it's a single node, the replicas are not allocated, so they will not count here.
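
You can confirm that with the cat shards API; on a single node the unassigned shards should all show up as replicas (prirep "r"), e.g.:

  # list unassigned shards; prirep "r" means replica
  curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state' | grep UNASSIGNED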
