Elasticsearch Docker restart

Hi everyone,

We have an Elasticsearch instance that we feed with data from CI jobs.
It is a single-node instance.
Here is some data from the cluster:

  "cluster_name" : "docker-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 998,
  "active_shards" : 998,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 931,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 51.73665111456713

Sometimes we see the Elasticsearch Docker container restart.
Today I managed to find the possible culprit in the logs, but I can't quite work out what exactly went wrong.
Here is the stack trace:

2022-09-16T13:51:41.553888557Z {"type": "server", "timestamp": "2022-09-16T13:51:41,553Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000047] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"update-rollover-lifecycle-date\"}] to [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"set-indexing-complete\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:51:41.916359627Z {"type": "server", "timestamp": "2022-09-16T13:51:41,915Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000048] from [{\"phase\":\"hot\",\"action\":\"set_priority\",\"name\":\"set_priority\"}] to [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"wait-for-indexing-complete\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:51:42.014259216Z {"type": "server", "timestamp": "2022-09-16T13:51:42,013Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000047] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"set-indexing-complete\"}] to [{\"phase\":\"hot\",\"action\":\"complete\",\"name\":\"complete\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:51:42.360122383Z {"type": "server", "timestamp": "2022-09-16T13:51:42,359Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "moving index [test-report-index-000048] from [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"wait-for-indexing-complete\"}] to [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"wait-for-follow-shard-tasks\"}] in policy [lifetime]", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA"  }
2022-09-16T13:58:23.057165197Z {"type": "server", "timestamp": "2022-09-16T13:58:23,054Z", "level": "ERROR", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "fatal error in thread [elasticsearch[elasticsearch][search][T#24]], exiting", "cluster.uuid": "go2F4QctRnuC0TnMLmhfSg", "node.id": "gI3QYOf5S4ma7Cz75F79rA" , 
2022-09-16T13:58:23.057217798Z "stacktrace": ["java.lang.AssertionError: unexpected higher total ops [41] compared to expected [40]",
2022-09-16T13:58:23.057221480Z "at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:403) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057224203Z "at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:70) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057226443Z "at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:258) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057228695Z "at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057231077Z "at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057233348Z "at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:408) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057235517Z "at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:640) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057237557Z "at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057239739Z "at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1290) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057241780Z "at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1251) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057244677Z "at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1229) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057260466Z "at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:52) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057262610Z "at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:43) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057264825Z "at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:27) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057267483Z "at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057269903Z "at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057271869Z "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057274418Z "at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057277553Z "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057279401Z "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.10.1.jar:7.10.1]",
2022-09-16T13:58:23.057282028Z "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]",
2022-09-16T13:58:23.057283931Z "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]",
2022-09-16T13:58:23.057285936Z "at java.lang.Thread.run(Thread.java:832) [?:?]",
2022-09-16T13:58:23.057287760Z "Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException: Shard failures",

What jumps out at me is the following:
java.lang.AssertionError: unexpected higher total ops [41] compared to expected [40]

Is it because Elasticsearch was overwhelmed with operations?
It seems it was doing some housekeeping with index lifecycle management at the time.
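
For what it's worth, this is roughly how I check what ILM is currently doing with an index. It is only a sketch with the Python client; the host and the index name (taken from the logs above) are assumptions:

  # Sketch: ask ILM which phase/action/step an index is in
  # (equivalent to GET test-report-index-000048/_ilm/explain).
  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://localhost:9200")

  explain = es.ilm.explain_lifecycle(index="test-report-index-000048")
  for name, info in explain["indices"].items():
      print(name, info.get("phase"), info.get("action"), info.get("step"))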

Thanks

You probably need to look at reducing the shard count quite heavily. Ideally we would recommend fewer than 700 shards on a single node, and your current level is likely to be putting the node under pressure.
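
As a starting point, something like the sketch below shows which indices the shards belong to, and on a single node you can also drop the replica shards, which can never be assigned anyway (that is where the 931 unassigned shards in your cluster health come from). This is only a sketch with the Python client; the host and the index pattern are assumptions, so adjust them to your setup:

  # Sketch: list indices by primary shard count, then set replicas to 0,
  # since replicas cannot be allocated on a single-node cluster.
  # Host, missing auth and the index pattern are assumptions.
  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://localhost:9200")

  # Equivalent to GET _cat/indices?h=index,pri,rep,docs.count&s=pri:desc
  for line in es.cat.indices(h="index,pri,rep,docs.count", s="pri:desc").splitlines()[:20]:
      print(line)

  # Equivalent to PUT test-report-index-*/_settings with "number_of_replicas": 0
  es.indices.put_settings(index="test-report-index-*",
                          body={"index": {"number_of_replicas": 0}})

Note that dropping replicas only clears the yellow status; it is the roughly 1000 primary shards that put heap pressure on the node, so you still want to shrink, delete, or let ILM clean up older indices to get the count down.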