Installation failure

Hi there,
I am testing ece beta 2 on ubuntu 16.04. I have the phenomena that on one test machine the first installation does not work, then I remove the docker images, then I restart the installation script and afterwards it works. I would like to ask in a first step / high level if someone else had this behaviour and if it makes sense. To do so I attach some non-debug level information:

STEP 1:
...
-- Verifying Prerequisites --
Checking host storage path... PASSED
Checking docker version... PASSED
Checking internal ip connectivity... PASSED
Checking OS settings... PASSED
Checking OS kernel version... PASSED
Checking Minimum required memory... PASSED
-- Completed Verifying Prerequisites --

  • Running Bootstrap container
  • Monitoring bootstrap process
  • Loaded bootstrap settings {}
  • Pulling Elastic Stack images {}
  • Starting local runner {}
  • Started local runner {}
  • Waiting for runner container node {}
  • Runner container node detected {}
  • Waiting for coordinator candidate {}
  • Detected coordinator candidate {}
  • Detected pending coordinator, promoting coordinator {}
  • Coordinator accepted {}
  • Storing current platform version: 1.0.0-beta2 {}
  • Storing Elastic Stack versions: [2.4.4,5.2.2,5.3.0] {}
  • Creating Admin Console Elasticsearch backend {}
Errors have caused Elastic Cloud Enterprise installation to fail
UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
  • Applying Admin Console Elasticsearch index templates {}
  • Updating dynamic Admin Console settings with Elasticsearch cluster information {}
  • Starting reindexing Admin Console data {}
  • Creating Logging and Metrics Cluster {}
  • Enabling Kibana for Logging and Metrics Cluster {}
  • Shutting down bootstrapper {}
  • Exiting bootstrapper {}
    Error response from daemon: Driver devicemapper failed to remove root filesystem 59c6dcf119e0667a3a0d4aad5384136a95fa439163448f58ba2de1aed64acd22: Device is Busy
    ubuntu@ece-big:~$

STEP 2:
docker rm -f frc-runners-runner frc-allocators-allocator $(docker ps -a -q); sudo rm -rf /mnt/data/elastic/* && docker ps -a

STEP 3:
/tmp/elastic-cloud-enterprise-installer.sh --debug
and it works.

Memory: 30GB
CPU: 16 CPU

Please share some idea or explenation (e.g. can I influence timeout behaviour) - I can provide more data and make new tests too.
Thank you.

kind regards

Hi @smm

It looks like docker is timing out while pulling down or starting some of the images (it's a commonly seen problem, eg: Issues · docker/compose · GitHub), it probably gets overloaded at some point.

The common solution is to try setting (eg) export DOCKER_CLIENT_TIMEOUT=120 .. I'm not 100% sure where the error is occurring and whether having it in your environment when running the script is sufficient, or if it needs to get pushed into the install container itself. But I'd start there.

Incidentally, we released 1.0.0 GA yesterday! Unfortunately it doesn't have a fix for the issue you report, which we've never seen before (*), but it does have many other improvements and fixes: Installing Elastic Cloud Enterprise | Elastic Cloud Enterprise Reference [3.6] | Elastic

(*) What version of docker are you running incidentally? I didn't see that anywhere. (All my testing was on 1.11.2 build b9f10c9 fwiw)

Alex

Good morning Alex,
thank you for your insight. I will test it as recommanded and using version 1.0 just realeased.
I am using the recommanded docker version, meaning Docker version 1.11.2, build b9f10c9.
I will start from the scatch again with debug on and the new docker client timeout.
Let's see how it works out.
cheers and keep you informed,
Stefano

Hi everyone,
I am still having problems with the installation with the newest version of ece this time (1.0.1).
I went this way again on an ubuntu 16.04, after the preparation steps from the documentation
and also added this parameter as I was told to do: export DOCKER_CLIENT_TIMEOUT=120

The results look like this. Can anyone give me a hint how to get an installation that works from the beginning? Some tips from Elastic side perhaps? Thanks!

Now here is the output:
ubuntu@ece-big:~$ bash <(curl -fsSL https://download.elastic.co/cloud/elastic-cloud-enterprise.sh) install
Unable to find image 'docker.elastic.co/cloud-enterprise/elastic-cloud-enterprise:1.0.1' locally
1.0.1: Pulling from cloud-enterprise/elastic-cloud-enterprise
439fe6ccfef0: Pull complete
a050c6188f94: Pull complete
ea722b27bfe9: Pull complete
173e7923dc42: Pull complete
dd22238d8d5b: Pull complete
af29aaa82fca: Pull complete
d0ea989ac0af: Pull complete
c1f41cd3093e: Pull complete
09b373cc2d1c: Pull complete
b4cf9c358be9: Pull complete
5f317c65cf01: Pull complete
64fa2e2cc9a4: Pull complete
007d851f4170: Pull complete
1c8955d70407: Pull complete
35f77b6ea327: Pull complete
81234cdb3d74: Pull complete
Digest: sha256:d05c3fde7c1ab31a8ea9791f9a1c8001cc47b467084b73ceb9e4aeaa2b411912
Status: Downloaded newer image for docker.elastic.co/cloud-enterprise/elastic-cloud-enterprise:1.0.1

Elastic Cloud Enterprise Installer 

Start setting up a new Elastic Cloud Enterprise installation by installing the software on your first host. 
This first host becomes the initial coordinator and provides access to the Cloud UI, where you can manage your installation. 
To learn more about the options you can specify, see the documentation. 

NOTE: If you want to add this host to an existing installation, please specify the --coordinator-host and --roles-token flags

-- Verifying Prerequisites --
Checking host storage path... PASSED
Checking docker version... PASSED
Checking internal ip connectivity... PASSED
Checking OS settings... PASSED
Checking OS kernel version... PASSED
Checking Minimum required memory... PASSED
Checking Kernel cgroup.memory setting... PASSED

  • OS setting 'cgroup.memory' should be set to cgroup.memory=nokmem
    -- Completed Verifying Prerequisites --

  • Running Bootstrap container

  • Monitoring bootstrap process

  • Loaded bootstrap settings {}

  • Starting local runner {}

  • Started local runner {}

  • Waiting for runner container node {}

  • Runner container node detected {}

  • Waiting for coordinator candidate {}

  • Detected coordinator candidate {}

  • Detected pending coordinator, promoting coordinator {}

  • Coordinator accepted {}

  • Storing current platform version: 1.0.1 {}

  • Storing Elastic Stack versions: [2.4.5,5.4.1] {}

  • Creating Admin Console Elasticsearch backend {}

  • Applying Admin Console Elasticsearch index templates {}

  • Unhandled error. {}
    -- An error has occurred in bootstrap process. Please examine logs --
    Exception in thread Thread-4:
    Traceback (most recent call last):
    File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
    File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
    File "/usr/lib/python3.5/site-packages/sh.py", line 1484, in output_thread
    done = stream.read()
    File "/usr/lib/python3.5/site-packages/sh.py", line 1974, in read
    self.write_chunk(chunk)
    File "/usr/lib/python3.5/site-packages/sh.py", line 1949, in write_chunk
    self.should_quit = self.process_chunk(chunk)
    File "/usr/lib/python3.5/site-packages/sh.py", line 1867, in process
    return handler(chunk)
    File "/usr/lib/python3.5/site-packages/sh.py", line 1106, in fn
    return handler(chunk, *args)
    File "/elastic_cloud_apps/bootstrap-initiator/bootstrap_initiator/logging.py", line 34, in process_info_log_output
    puts("- {0}".format(line.split("]",3)[3].strip()))
    IndexError: list index out of range

  Errors have caused Elastic Cloud Enterprise installation to fail - Please check logs 
  Node type - initial
Elastic Cloud Enterprise installation completed successfully
...

As always I can give more input or / and make a new test from scratch and / or share more details.

kind regards,
Stefano

Hi @smm - sorry you're still having problems, at least they seem to be different problems now!

There is a known issue with stack trace parsing in our installer, I see from our github feed that we're fixing it at the moment.

In the meantime you should be able to see the actual root cause by heading over to /mnt/data/elastic/logs/bootstrap-logs/bootstrap.log - there should be an obvious stack trace at the end of that

Alex

Hi Alex, thank you for your insight. I will take a look soon!
cheers,
Stefano

@smm

We've seen something similar looking a few times internally (*). It appears that the combination of the docker pull for Elasticsearch and Kibana images and the subsequent cluster provisioning can take more than the (annoyingly hardwired!) 10 minute timeout, particularly on slower machines or if the docker repo is under heavy load. We have an issue to fix this.

(*) (If in the bootstrap.log the exception is a 600 second timeout)

This is only a problem when provisoning the genesis node - annoying though it is, it can be worked around by removing the failed installed as per here and then rerunning the install. The cleansing step doesn't purge the docker images so the provisioning step will go faster

Alex

Dear Alex,
thank you for your inputs! I will follow up this topic tomorrow. The last week(s) I had quite much to do in / for my central elastic environment here at Metro.
I keep you updated.
kind regards,
Stefano

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.