Waiting for nodes to stop - timeout

iremmats · May 1, 2017, 7:25am

While resizing a three-node cluster, the configuration change never shows as completed because it times out while stopping the old nodes.

Change memory per node from 4 GB to 8 GB
Timeout for configuration change: 5h 33m
Unexpected error during step: [wait-until-stopped]: [no.found.constructor.models.TimeoutException: Timeout]

Starting step: [wait-until-stopped]:
instances=List(ElasticsearchInstance(ElasticsearchCluster(d2d1c12716a04716baadc3e6795d2ecc),instance-0000000011), ElasticsearchInstance(ElasticsearchCluster(d2d1c12716a04716baadc3e6795d2ecc),instance-0000000010), ElasticsearchInstance(ElasticsearchCluster(d2d1c12716a04716baadc3e6795d2ecc),instance-0000000009))

The new nodes show up perfectly fine and it works to index data and search.

What should I do here? Kill the docker containers manually?

uricohen · May 1, 2017, 9:10am

Hi @iremmats

This looks like a Docker-related issue. Which version of ECE are you running?
In addition, can you please do the following:

Fetch all the logs from the relevant allocator hosts and post them here. Use this command to zip them on each host:
docker ps -a > /mnt/data/elastic/docker.out && tar czvf ece-logs.tgz $(find /mnt/data/elastic -name "*.log" -o -name "*.out")
Try to resume the stopped instances, and then stop them manually.
If that doesn't help, kill the old 4 GB containers on each host (the relevant container names are of the form fac-{cluster-id}-{instance-id}).
If the plan is not running (failed because of a timeout), try to resubmit it so it cleans up the old instances.
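
To find which containers the kill step refers to, the instance list from the [wait-until-stopped] log line can be turned into container names of the form fac-{cluster-id}-{instance-id}. The following is only a rough Python sketch under that naming assumption. The helper names (container_names, kill_commands) are made up for illustration and are not part of ECE, and the script only prints the docker kill commands rather than running them.

```python
import re
from typing import List

# Matches entries such as
#   ElasticsearchInstance(ElasticsearchCluster(<cluster-id>),<instance-id>)
# as they appear in the [wait-until-stopped] step log quoted in this thread.
INSTANCE_RE = re.compile(
    r"ElasticsearchInstance\(ElasticsearchCluster\(([^)]+)\),(instance-\d+)\)"
)


def container_names(step_log: str) -> List[str]:
    """Build container names using the fac-{cluster-id}-{instance-id} form."""
    return [
        f"fac-{cluster_id}-{instance_id}"
        for cluster_id, instance_id in INSTANCE_RE.findall(step_log)
    ]


def kill_commands(step_log: str) -> List[str]:
    """Return 'docker kill <name>' command strings; nothing is executed here."""
    return [f"docker kill {name}" for name in container_names(step_log)]


if __name__ == "__main__":
    log = (
        "instances=List(ElasticsearchInstance(ElasticsearchCluster("
        "d2d1c12716a04716baadc3e6795d2ecc),instance-0000000011))"
    )
    for command in kill_commands(log):
        print(command)
```

Printing the commands first makes it possible to check each name against docker ps -a on the allocator host before killing anything.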

In any case, it would be good to see those logs so we can investigate further.

HTH,
Uri

iremmats · May 1, 2017, 5:38pm

We are running Beta2.

I got the logs. They are linked to my dropbox.
I resumed the instances in the admin console and then stopped them in the admin console. It seems one of the allocators wasn't able to stop the container.
Neither stopping nor killing the container worked. I found a few similar Docker-related issues. https://github.com/moby/moby/issues/208

I also tried attaching to the container but nothing happens.

From our perspective as potential ECE customers, it is very important that ECE's Docker handling, and Docker itself, work flawlessly.

iremmats · May 1, 2017, 5:44pm

2637b394b726 docker.elastic.co/cloud-enterprise/elasticsearch:5.3.0-1 "/sbin/entry-point" 4 days ago Restarting (1) 30 hours ago 0.0.0.0:18438->18438/tcp, 0.0.0.0:19296->19296/tcp fac-d2d1c12716a04716baadc3e6795d2ecc-instance-0000000010

It seems to be stuck in some restarting state.
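
For anyone wanting to check an allocator host for the same symptom, here is a small Python sketch that lists containers Docker reports as restarting. It relies on the standard docker ps options --filter status=restarting and --format. The function names and the "fac-" prefix check (taken from the container naming mentioned earlier in the thread) are assumptions for illustration, not an ECE tool.

```python
import subprocess
from typing import List, Tuple

# docker ps flags used here are standard CLI options:
#   -a                          include non-running containers
#   --filter status=restarting  only containers in the restarting state
#   --format                    Go template selecting name and status columns
DOCKER_PS = [
    "docker", "ps", "-a",
    "--filter", "status=restarting",
    "--format", "{{.Names}}\t{{.Status}}",
]


def parse_ps_output(output: str) -> List[Tuple[str, str]]:
    """Split tab-separated 'name<TAB>status' lines into (name, status) pairs."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        rows.append((name.strip(), status.strip()))
    return rows


def stuck_ece_containers(output: str) -> List[str]:
    """Keep only names starting with 'fac-' (assumed ECE instance prefix)."""
    return [name for name, _ in parse_ps_output(output) if name.startswith("fac-")]


def list_restarting() -> List[str]:
    """Run docker ps on this host and return the matching container names."""
    result = subprocess.run(DOCKER_PS, capture_output=True, text=True, check=True)
    return stuck_ece_containers(result.stdout)
```

Calling list_restarting() on each allocator host would show whether any instance container is caught in a restart loop like the one above. It does not explain why the container ended up in that state.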

iremmats · May 1, 2017, 7:45pm

A restart of the server solved the docker issue. Would be interesting to know what caused it though.

system · May 15, 2017, 7:58pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
