Bootstrap monitoring process did not exit as expected

Hi there,

I try to install a "large ". I started a few month ago with a small Installation and now we need to expand.
For now I just try to install 2.4.2 on a new VM and it is not exiting without errors. But there are no errors. I used "debug" and evertything but there are no errors.

I get a few warnings:

-- Verifying Prerequisites --
Checking runner container does not exist... PASSED
Checking host storage root volume path is not root... PASSED
Checking host storage path is accessible... PASSED
Checking host storage path contents matches whitelist... PASSED
Checking Docker version... PASSED
Checking Docker file system... PASSED
Checking Docker storage driver... PASSED
 - The installation with overlay2 can proceed; however, we recommend using overlay
Checking whether 'setuser' works inside a Docker container... PASSED
Checking memory settings... PASSED
Checking runner ip connectivity... PASSED
Checking OS IPv4 IP forward setting... PASSED
Checking OS max map count setting... PASSED
Checking OS kernel version... PASSED
 - OS kernel version is 3.10.0-1062.4.1.el7.x86_64 but we recommend 4.4.
Checking minimum required memory... PASSED
Checking OS kernel cgroup.memory... PASSED
 - OS setting 'cgroup.memory' should be set to cgroup.memory=nokmem
Checking OS minimum ephemeral port... PASSED
Checking OS max open file descriptors per process... PASSED
Checking OS max open file descriptors system-wide... PASSED
Checking OS file system and Docker storage driver compatibility... PASSED
Checking OS file system storage driver permissions... PASSED
-- Completed Verifying Prerequisites --

And the exit notification is:

- Exiting bootstrapper {}
[2019-11-20 10:06:32,958][INFO ][no.found.util.LogApplicationExit$] Application is exiting {}


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Errors have caused Elastic Cloud Enterprise installation to fail
Bootstrap monitoring process did not exit as expected
Traceback (most recent call last):
  File "/elastic_cloud_apps/bootstrap-initiator/initiator.py", line 88, in <module>
    exitcode = monitor.logging_and_bootstrap_monitor(bootstrap_properties, bootstrap_container_id, enable_debug)
  File "/elastic_cloud_apps/bootstrap-initiator/bootstrap_initiator/monitor.py", line 24, in logging_and_bootstrap_monitor
    raise BootstrapMonitoringException('Bootstrap monitoring process did not exit as expected')
bootstrap_initiator.monitor.BootstrapMonitoringException: Bootstrap monitoring process did not exit as expected

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now I don't know what to do. I tried removing everything and reinstalling from scratch: Docker, ECE, all permissions, etc.

Are there any logs, e.g. in /mnt/data/elastic/logs or under /mnt/data/elastic/:runner-id/services/**?

Yes, but nothing with more information.

in /mnt/data/elastic/logs/bootstrap-logs/bootstrap.log:

[2019-11-20 14:51:53,289][INFO ][no.found.bootstrap.BootstrapInitial] Creating Admin Console Elasticsearch backend {}
[2019-11-20 14:51:53,629][INFO ][no.found.bootstrap.ServiceLayerBootstrap] Waiting for [ensuring-plan] to complete. Retrying every [1 second] (cause: [org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /clusters/9af21b816f2b4281a922e35fca73d7ca/plans/status]) {}
[2019-11-20 14:51:57,712][INFO ][no.found.bootstrap.ServiceLayerBootstrap] Waiting for [ensuring-plan] to complete. Retrying every [1 second] (cause: [java.lang.Exception: not yet started]) {}
[2019-11-20 14:52:30,373][INFO ][no.found.bootstrap.BootstrapInitial] Applying Admin Console Elasticsearch index templates {}
[2019-11-20 14:52:30,649][INFO ][no.found.bootstrap.BootstrapInitial] Updating dynamic Admin Console settings with Elasticsearch cluster information {}
[2019-11-20 14:52:30,653][INFO ][no.found.bootstrap.BootstrapInitial] Starting reindexing Admin Console data {}
[2019-11-20 14:52:31,170][INFO ][no.found.bootstrap.BootstrapInitial] Creating Logging and Metrics Cluster {}
[2019-11-20 14:52:31,445][INFO ][no.found.bootstrap.ServiceLayerBootstrap] Waiting for [ensuring-plan] to complete. Retrying every [1 second] (cause: [org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /clusters/b528fc722db44831bf856e5346cd89f8/plans/status]) {}
[2019-11-20 14:52:32,464][INFO ][no.found.bootstrap.ServiceLayerBootstrap] Waiting for [ensuring-plan] to complete. Retrying every [1 second] (cause: [java.lang.Exception: not yet started]) {}
[2019-11-20 14:53:04,203][INFO ][no.found.bootstrap.BootstrapInitial] Enabling Kibana for Logging and Metrics Cluster {}
[2019-11-20 14:53:04,621][INFO ][no.found.bootstrap.ServiceLayerBootstrap] waiting for cluster [KibanaCluster(1df65f827d3149cfb737114161517799)] {}
[2019-11-20 14:53:04,624][INFO ][no.found.bootstrap.ServiceLayerBootstrap] Waiting for [waiting-for-cluster] to complete. Retrying every [1 second] (cause: [org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /kibanas/1df65f827d3149cfb737114161517799/plans/status]) {}
[2019-11-20 14:53:05,644][INFO ][no.found.bootstrap.ServiceLayerBootstrap] Waiting for [waiting-for-cluster] to complete. Retrying every [1 second] (cause: [java.lang.Exception: not yet started]) {}
[2019-11-20 14:53:31,155][INFO ][no.found.bootstrap.BootstrapInitial] Updating dynamic Admin Console settings with metrics cluster information {}
[2019-11-20 14:53:31,163][INFO ][no.found.bootstrap.BootstrapInitial] Creating security cluster {}
[2019-11-20 14:53:32,111][INFO ][no.found.bootstrap.SystemClustersBootstrap$] Waiting for the security cluster to be created. {}
[2019-11-20 14:54:08,674][INFO ][no.found.bootstrap.BootstrapInitial] Security cluster was created successfully {}
[2019-11-20 14:54:08,676][INFO ][no.found.bootstrap.BootstrapInitial] Shutting down bootstrapper {}
[2019-11-20 14:54:08,676][INFO ][org.apache.curator.framework.imps.CuratorFrameworkImpl] backgroundOperationsLoop exiting {}
[2019-11-20 14:54:08,693][INFO ][no.found.bootstrap.BootstrapInitial] Exiting bootstrapper {}
[2019-11-20 14:54:08,695][INFO ][no.found.util.LogApplicationExit$] Application is exiting {}

There are no warnings or errors. When I run it with --debug, I also get no errors.

I think everything is installed almost correctly, because it is working. But I would still like to know why this is happening.

Interesting, I found a couple of instances of this in our knowledge base.

One of them was caused by permissions problems on the ECE directory, but that one is quite old (reported in 1.x) and in theory was fixed.

I think we can check whether that one is (re-)occurring by running docker logs elastic-cloud-enterprise-bootstrap-<correct-version>

The other one was a bit more mysterious, but looked like it was related to the use of libvirtd. A useful log dump for that would be along the lines of:

journalctl --no-pager |grep "Elastic Cloud Enterprise Installer" -B 50

(i.e. looking at journalctl around the time ECE was installed)

So docker logs ... shows the same as the logs in the first post.

[2019-11-20 14:53:31,163][INFO ][no.found.bootstrap.BootstrapInitial] Creating security cluster {}
[2019-11-20 14:53:32,111][INFO ][no.found.bootstrap.SystemClustersBootstrap$] Waiting for the security cluster to be created. {}
[2019-11-20 14:54:08,674][INFO ][no.found.bootstrap.BootstrapInitial] Security cluster was created successfully {}
[2019-11-20 14:54:08,676][INFO ][no.found.bootstrap.BootstrapInitial] Shutting down bootstrapper {}
[2019-11-20 14:54:08,676][INFO ][org.apache.curator.framework.imps.CuratorFrameworkImpl] backgroundOperationsLoop exiting {}
[2019-11-20 14:54:08,693][INFO ][no.found.bootstrap.BootstrapInitial] Exiting bootstrapper {}
[2019-11-20 14:54:08,695][INFO ][no.found.util.LogApplicationExit$] Application is exiting {}
*** Shutting down runit daemon (PID 8)...
*** Killing all processes...

I did not know that it is not possible to block HTTP ports, like 12300 or 2180-2181, before the installation.
But one thing at a time.

To explain further: we use dedicated servers for the allocators and VMs for the rest, all on RHEL 7.7.

But we are not able to use overlay; we have overlay2.

Well, at least I now know I'm not the only one with a RHEL 7 installation that gets this issue; I have logged this multiple times over the past year as well.

When I get that error, I wait 30 minutes and then remove the upgrade container that's left behind; it's always the one that's in a stopped state.

So far I've never had a specific answer as to why it fails, but I've always felt it was due to the script not waiting and checking properly.

I haven't upgraded production to 2.4.2 yet, as we didn't have IPv6 enabled and I got sidetracked.

I'm keen to hear the root cause of this error as well.


But it seems to run as it should.
What I noticed is that the installer usually lists some information at the end, like the admin password and the tokens for adding more nodes.

But since I get the error, there is no such output. I have to look in the bootstrap-secrets.json to get the admin password.

Maybe that is a small hint for finding the error.
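For anyone else stuck at this point, here is a minimal sketch for reading that file. The default path below is the usual ECE host storage location and the key name in the comment is an assumption; inspect your actual file to see which keys it contains:

```python
import json

def read_bootstrap_secrets(
        path="/mnt/data/elastic/bootstrap-state/bootstrap-secrets.json"):
    # Load the secrets the installer writes during bootstrap. The default
    # path assumes the standard ECE data directory; adjust it if you
    # installed with a custom --host-storage-path.
    with open(path) as f:
        return json.load(f)

# Hypothetical key name; check the file for the real one in your version:
# print(read_bootstrap_secrets()["adminconsole_root_password"])
```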

Yeah, I had to do the same when I struck this on a new ECE install on a test server.

I've got used to quirks with RHEL and the Elastic documentation.

Hmm interesting, thanks for chiming in. I'll re-open the internal issue that discussed this (it had been closed out since we had no reported incidents for a while) and we'll put some more effort into reproducing it.

@logger could you run the journalctl command (or a similar one) I posted?

Thanks a lot for bringing this issue to our attention!

Although I failed to reproduce this issue myself (RHEL 7.4, Docker 1.12.6), I've analysed the code and come up with a hypothesis I'd like to share.
What happens is that the ECE installer uses the Python Docker client to attach to the log stream of the bootstrap container.
I can see that the bootstrap container did its job perfectly fine, and most probably you've ended up with a complete, working ECE installation.
When the bootstrap container shuts down, the ECE installer waits 5 seconds for the log stream handler to shut itself down and clean up.
It appears that in your case that didn't happen, and an exception was thrown.
There are two cases we’d need to look closer at:

  1. The timeout period is too strict (although we've never run into it in our testing environments)
  2. The log stream handler process waits forever, or doesn't exit in the way we expect it to, because of an underlying Docker client issue
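If the hypothesis holds, the failing step can be sketched in a few lines of Python. This is purely illustrative (the real initiator.py/monitor.py code differs); only the join-with-timeout pattern matters here:

```python
import threading

class BootstrapMonitoringException(Exception):
    """Raised when the log-stream handler does not exit in time."""

def await_monitor_exit(monitor_thread, timeout=5.0):
    # After the bootstrap container stops, give the log-stream handler
    # `timeout` seconds to wind down; if it is still alive afterwards,
    # give up and raise, which matches the error the installer reports.
    monitor_thread.join(timeout)
    if monitor_thread.is_alive():
        raise BootstrapMonitoringException(
            'Bootstrap monitoring process did not exit as expected')

# A handler that finishes promptly passes without incident:
quick = threading.Thread(target=lambda: None)
quick.start()
await_monitor_exit(quick, timeout=5.0)

# A handler stuck on the log stream trips the exception:
stuck = threading.Thread(target=lambda: threading.Event().wait(), daemon=True)
stuck.start()
try:
    await_monitor_exit(stuck, timeout=0.1)
except BootstrapMonitoringException:
    print("monitor did not exit in time")
```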

@logger, @Jugsofbeer - could you share some details about your setup? Specifically, the RHEL and Docker versions.

My test setup (on which I failed reproducing this issue):

[elastic@ip-192-168-44-10 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
[elastic@ip-192-168-44-10 ~]$ docker --version
Docker version 1.12.6, build 78d1802
[elastic@ip-192-168-44-10 ~]$

Hi,

I don't have remote access to run the commands, but I know the details.

My system is RHEL 7.6 with Docker 1.13.

Most times that it's happened to me, I've had to Ctrl-C to force an exit after waiting up to 30 minutes. In a parallel SSH session I've seen that the Docker instance for the upgrade has already stopped.

Hi,

we are using
Docker version 1.13.1, build 4ef4b30/1.13.1
and
Red Hat Enterprise Linux Server release 7.7 (Maipo)

I get no output. I also tried grepping for just "Elastic Cloud Enterprise", but there is nothing.

I've run the ECE bootstrap on a similar setup (RHEL 7.7 and Docker 1.13.1-cs9) and still had no luck reproducing your error. This makes me think we should just increase the 5-second timeout I mentioned earlier, as that appears to be the actual problem.

However, I'd also like to ask you two additional questions:

  1. Is this failing for you every time, or does one in N installations succeed on your setup?
    (That would match the strict-timeout scenario.)
  2. Could you please provide the full docker version output? That way we could also see the Docker API version; let's try to really narrow this down a little more.
    (For the record, my RHEL tests were on API versions 1.27 and 1.24.)

Again, thanks a lot for your collaboration!

It happened the last 5 times. Next week I am going to install it again in another environment; then I can give you more information.

I am not in the office this week, but I will try to get the information; otherwise it will be next week.
Regards
Malte

We're having exactly the same issue on RHEL 7.7 (kernel 3.10.0-1062.1.1.el7.x86_64) and the following Docker version:

Client:
Version: 1.13.1
API version: 1.26
Package version: docker-1.13.1-103.git7f2769b.el7.x86_64
Go version: go1.10.8
Git commit: 7f2769b/1.13.1
Built: Fri Aug 2 10:19:53 2019
OS/Arch: linux/amd64

Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Package version: docker-1.13.1-103.git7f2769b.el7.x86_64
Go version: go1.10.8
Git commit: 7f2769b/1.13.1
Built: Fri Aug 2 10:19:53 2019
OS/Arch: linux/amd64
Experimental: false

We tried installing ECE 2.4.2 on the first node and it keeps failing with that error even though the admin console is running fine and we can get in if we dig the passwords out of the bootstrap-secrets.json file manually.

We did the setup with one of the Principal Consulting Architects of Elastic.co, so we know it wasn't just an error on our part.

One thing we noticed during our debugging is that when we ran the ece-support-diagnostics script, it would stall on the iptables -L step and take several minutes to dump the Docker rules for some reason.

On another server with Docker installed (but not for ECE), we don't have any of these issues with the command... Not sure if it's related, but the consulting architect thought it was relevant to mention it.

We still suffer the same issues from time to time; it can be intermittent.

Our fleet runs the following across the board.

[root@testserver123 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)

[root@testserver123 ~]# uname -a
Linux testserver123.prod.asbgroup.co.nz 3.10.0-1062.4.1.el7.x86_64 #1 SMP Wed Sep 25 09:42:57 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

[root@testserver123 ~]# uname -mrs
Linux 3.10.0-1062.4.1.el7.x86_64 x86_64

[root@testserver123 ~]# docker info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 13
Server Version: 1.13.1
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: /usr/libexec/docker/docker-init-current
containerd version: (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: 290a33602b16ff2d1cc5339bc0297f0e094462ce (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: fec3683b971d9c3ef73f284f176672c44b448662 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
seccomp
WARNING: You're not using the default seccomp profile
Profile: /etc/docker/seccomp.json
Kernel Version: 3.10.0-1062.4.1.el7.x86_64
Operating System: Red Hat Enterprise Linux
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 64
Total Memory: 251.4 GiB
Name: testserver123.prod.asbgroup.co.nz
ID: AASD:IG6U:EGYR:ZI6J:D4BP:DDFE:N26I:MG4R:XPI2:OMSP:W6DJ:RHYF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.access.redhat.com/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Registries: registry.access.redhat.com (secure), docker.io (secure), registry.fedoraproject.org (secure), quay.io (secure), registry.centos.org (secure), docker.io (secure)
[root@testserver123 ~]#

Hi
my docker version is:

docker info
Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 1.13.1
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: /usr/libexec/docker/docker-init-current
containerd version:  (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: 9c3c5f853ebf0ffac0d087e94daef462133b69c7 (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: fec3683b971d9c3ef73f284f176672c44b448662 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  WARNING: You're not using the default seccomp profile
  Profile: /etc/docker/seccomp.json
 selinux
Kernel Version: 3.10.0-1062.4.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.7 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 4
Total Memory: 23.37 GiB
ID: SM4W:IBAE:7HTA:ISDJ:JHCT:P6YQ:HOQU:W4Q3:JGCZ:7Y5L:GZVM:KYYA
Docker Root Dir: /opt/elastic/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.access.redhat.com/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Registries: registry.access.redhat.com (secure), docker.io (secure), registry.fedoraproject.org (secure), quay.io (secure), registry.centos.org (secure), docker.io (secure)

Thank you all for your input!

I'm happy to announce that today we've released ECE 2.4.3, which not only ships multiple bug fixes but also includes an increased timeout for the bootstrap/upgrade process to exit. I hope this solves some of the problems you're facing.

You can find the 2.4.3 release notes here.

Unfortunately, the 2.4.3 release does not fix the error reported above; it just takes 55 seconds more before exiting with the same error now. I checked the initiator.py script that calls the monitor.py script, and it seems it's not checking the right thing to detect the status of the elastic-cloud-enterprise-bootstrap-2.4.3 container. When the initiator is at the "Waiting for bootstrap monitor process to exit" step, the container is actually already long gone and no longer running. I even tried to kill it using docker kill and got an error that it was not running, confirming this. Yet the script still went through the 60-second timeout and produced the error.
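A state-based check like the one described above could be sketched as follows. This is purely an illustration, not the actual installer code: `get_status` stands in for whatever the installer would use to query the container (e.g. a docker inspect call), and the function name is made up:

```python
import time

def wait_for_container_exit(get_status, timeout=60.0, interval=1.0):
    # Poll the container's state directly instead of waiting only for the
    # log stream to close. Treat "already removed" (None) the same as
    # "exited", so an installer using this would return promptly when the
    # bootstrap container is long gone.
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()  # e.g. 'running', 'exited', or None if removed
        if status in (None, 'exited', 'dead'):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# With the container already gone, this returns immediately:
print(wait_for_container_exit(lambda: None))  # prints: True
```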