I try to install a "large ". I started a few month ago with a small Installation and now we need to expand.
For now I just try to install 2.4.2 on a new VM and it is not exiting without errors. But there are no errors. I used "debug" and evertything but there are no errors.
I get a few warnings:
-- Verifying Prerequisites --
Checking runner container does not exist... PASSED
Checking host storage root volume path is not root... PASSED
Checking host storage path is accessible... PASSED
Checking host storage path contents matches whitelist... PASSED
Checking Docker version... PASSED
Checking Docker file system... PASSED
Checking Docker storage driver... PASSED
- The installation with overlay2 can proceed; however, we recommend using overlay
Checking whether 'setuser' works inside a Docker container... PASSED
Checking memory settings... PASSED
Checking runner ip connectivity... PASSED
Checking OS IPv4 IP forward setting... PASSED
Checking OS max map count setting... PASSED
Checking OS kernel version... PASSED
- OS kernel version is 3.10.0-1062.4.1.el7.x86_64 but we recommend 4.4.
Checking minimum required memory... PASSED
Checking OS kernel cgroup.memory... PASSED
- OS setting 'cgroup.memory' should be set to cgroup.memory=nokmem
Checking OS minimum ephemeral port... PASSED
Checking OS max open file descriptors per process... PASSED
Checking OS max open file descriptors system-wide... PASSED
Checking OS file system and Docker storage driver compatibility... PASSED
Checking OS file system storage driver permissions... PASSED
-- Completed Verifying Prerequisites --
And the exit notification is:
- Exiting bootstrapper {}
[2019-11-20 10:06:32,958][INFO ][no.found.util.LogApplicationExit$] Application is exiting {}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Errors have caused Elastic Cloud Enterprise installation to fail
Bootstrap monitoring process did not exit as expected
Traceback (most recent call last):
File "/elastic_cloud_apps/bootstrap-initiator/initiator.py", line 88, in <module>
exitcode = monitor.logging_and_bootstrap_monitor(bootstrap_properties, bootstrap_container_id, enable_debug)
File "/elastic_cloud_apps/bootstrap-initiator/bootstrap_initiator/monitor.py", line 24, in logging_and_bootstrap_monitor
raise BootstrapMonitoringException('Bootstrap monitoring process did not exit as expected')
bootstrap_initiator.monitor.BootstrapMonitoringException: Bootstrap monitoring process did not exit as expected
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now I dont know what to do. I tried removing and reinstalling everything new. So docker, ece, all permissions, etc.
But it seems to run as it should.
What I noticed is that usually you get listed a few information like admin password and tokens for adding more nodes.
But since I get the error there is no notification. I have to look in the bootstrap-secrets.json to get the admin pw.
Hmm interesting, thanks for chiming in. I'll re-open the internal issue that discussed this (it had closed out since we had no reported incidences for a while) and we'll put some more effort into reproducing it.
@logger could you run the journalctl command (or a similar one) I posted?
Thanks a lot for bringing this issue to our attention!
Although I failed reproducing this issue myself (RHEL 7.4,Docker 1.12.6 ) I’ve analysed the code and come up with a hypothesis I’d like to share
What happens is that the ECE installer uses python Docker client to attach log stream from bootstrap container.
I can see that bootstrap container did its job perfectly fine and most probably you’ve ended up with a complete, working ECE installation.
When bootstrap container shuts down, ECE installer waits 5 seconds for the log stream handler to shutdown itself and clean up.
It appears that in your case, that didn’t happen and exception was thrown.
There are two cases we’d need to look closer at:
The timeout period is too strict (although we’ve never ran in our testing environments)
The log stream handler process waits forever or doesn’t exit in the way we expect it to exit because of underlying Docker client issue
@logger , @Jugsofbeer - could you share some details about your setup? Specifically, the RHEL and Docker version.
My test setup (on which I failed reproducing this issue):
[elastic@ip-192-168-44-10 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
[elastic@ip-192-168-44-10 ~]$ docker --version
Docker version 1.12.6, build 78d1802
[elastic@ip-192-168-44-10 ~]$
I dont have access remotely to run the commands but i know the details
my system is RHEL 7.6 with docker 1.13
most times that its happenned to me , ive had to control c to force exit, ive waited up to 30minutes most times.. in a second ssh session in parallel ive seen the docker instance for the upgrade has stopped
I've ran ECE bootstrap on similar setup (RHEL 7.7 and Docker 1.13.1-cs9) and still no luck in reproducing your error. This makes me think that we should just increase the 5 second timeout I've mentioned earlier as this appears to be the actual problem.
However, I'd also like to ask you two additional questions:
Is this constantly failing for you, or maybe one of N installations succeeds on your setup?
(that'd match the strict timeout scenario)
Could you please provide us with full docker version? That way we could also see the Docker API version. Let's try to really narrow this down a little more...
(for the record, my RHEL tests were on 1.27 and 1.24)
We're having exactly the same issue on RHEL 7.7 (kernel 3.10.0-1062.1.1.el7.x86_64) and the following Docker version:
Client:
Version: 1.13.1
API version: 1.26
Package version: docker-1.13.1-103.git7f2769b.el7.x86_64
Go version: go1.10.8
Git commit: 7f2769b/1.13.1
Built: Fri Aug 2 10:19:53 2019
OS/Arch: linux/amd64
Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Package version: docker-1.13.1-103.git7f2769b.el7.x86_64
Go version: go1.10.8
Git commit: 7f2769b/1.13.1
Built: Fri Aug 2 10:19:53 2019
OS/Arch: linux/amd64
Experimental: false
We tried installing ECE 2.4.2 on the first node and it keeps failing with that error even though the admin console is running fine and we can get in if we dig the passwords out of the bootstrap-secrets.json file manually.
We did the setup with one the the Principal Consulting Architects of Elastic.co, so we know it wasn't just an error on our part.
One thing we noticed during our debugging, is that we tried to run the ece-support-diagnostics script and it would stall on the iptables -L step and take several minutes to dump the Docker rules for some reasons.
On another server with Docker installed (but not for ECE), we don't have any of these issues with the command... Not sure if it's related, but the consulting architect thought it was relevant to mention it.
I'm happy to announce that today we've released ECE 2.4.3 which not only ships multiple bug fixes but also includes increased timeout for bootstrap/upgrade process to exit. I hope this solves some of the problems you're facing.
Unfortunately, the 2.4.3 release does not fix the error reported above. It just takes 55 seconds more before it exits with the same error now. I checked the initiator.py script that is calling the monitor.py script and it seems that it's not checking the right thing to detect the status of the elastic-cloud-enterprise-bootstrap-2.4.3 container. When the initiator is at the "Waiting for bootstrap monitor process to exit" step, the container is actually already long gone and no longer running. I even tried to kill it using docker kill and got an error that it was not running, confirming it. Yet, the script still went through the 60 seconds timeout and produced the error.
Just confirming that I attempted a clean install of ECE v2.4.3 and just like Yanick says, it failed to install properly and aborts at "Waiting for bootstrap monitor process to exit".
I had a second ssh session open looking at the docker processes and the the docker container had stopped fairly quickly so perhaps it really is as simple as the script is looking for the wrong item to be stopped and never succeeds.
Begs the question... hows anyone able to install this properly on RHEL then ?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.