Bootstrap error adding a runner to ECE (v2.6.2): install script variable "HOST_STORAGE_DEVICE_PATH" breaks when AWS EFS (NFS4) is used for "--host-storage-path"

Hello, first post here. I tried to look up existing topics but found none that matched the problem I was having. I manually downloaded the ECE script to my runner host and attempted to run it to add my runner with the allocator and proxy roles.

I attempted to use an AWS Elastic File System mounted on my host at /mnt/efs/data/elastic as the location for persisted container data, specified by --host-storage-path, and ran into an issue. I was ultimately able to find a workaround, so I wanted to report it here for feedback and to get advice on any downstream impact it may have on the resiliency of my platform.

For full transparency, /etc/fstab of my runner host contains the appropriate entry for the root device storage path of the mounted nfs4 volume:

<AWS-EFS-FILE-SYSTEM-ID>.efs.<MY-AWS-REGION>.amazonaws.com:/ /mnt/efs/data nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev 0 0

To ensure the directory /mnt/efs/data/elastic exists, I created it with the required permission mode and owner/group before the install.

[elastic@ip-XXX-XXX-XXX-XXX ~]$ sudo install -o $USER -g $USER -d -m 700 /mnt/efs/data/elastic

Note that I am using one of the pre-configured Elastic CentOS 8 community AMIs listed in current-amis.json.

The run command of my EC2 instance contains the necessary logic (auto-generated by AWS) which mounts the nfs4 volume on the target host and is executed/tested on instance start. I ran the following install command and received a Java/Scala parsing exception error:

[elastic@ip-XXX-XXX-XXX-XXX ~]$ bash <(curl -fsSL https://download.elastic.co/cloud/elastic-cloud-enterprise.sh) install --debug \
  --overwrite-existing-image \
  --roles "proxy,allocator" \
  --memory-settings '{"runner":{"xms":"1G","xmx":"1G"},"allocator":{"xms":"4G","xmx":"4G"},"proxy":{"xms":"8G","xmx":"8G"},"zookeeper":{"xms":"4G","xmx":"4G"},"director":{"xms":"1G","xmx":"1G"},"constructor":{"xms":"4G","xmx":"4G"},"admin-console":{"xms":"4G","xmx":"4G"}}' \
  --cloud-enterprise-version "2.6.2" \
  --availability-zone "<MY_ECE_ZONE>" \
  --coordinator-host "<MY_COORDINATOR_HOST_IP>" \
  --host-ip "<MY_RUNNER_HOST_IP>" \
  --host-docker-host /var/run/docker.sock \
  --host-storage-path /mnt/efs/data/elastic \
  --roles-token \'${ECE_INSTALL_ROLES_TOKEN}\' \
  --external-hostname "<MY_EXTERNAL_HOSTNAME>" \
  --api-base-url "<MY_API_BASE_URL>"

As I understand it, the error is caused by a colon (":") appended to the resolved value of HOST_STORAGE_DEVICE_PATH, which is referenced in the bootstrap container config file /elastic_cloud_apps/bootstrap/bootstrap.conf at line 20:

1 include "reference"
2 include "application-bootstrap"
3 include file("/elastic_cloud_apps/additional.conf")
          .
          .
          .
17   host {
18     storage-path = ${HOST_STORAGE_PATH}
19     storage-root-volume-path = ${HOST_STORAGE_ROOT_VOLUME_PATH}
20     storage-device-path = ${HOST_STORAGE_DEVICE_PATH} <============== X
21     docker-config-path = ${?HOST_DOCKER_CONFIG_PATH}
22   }
23
          .
          .
          .
41 }

When I set the volume mount to /mnt/efs/data/elastic, HOST_STORAGE_DEVICE_PATH resolves to the Filesystem name reported by the df ("disk free") command used to obtain it (truncated output below).

[elastic@ip-XXX-XXX-XXX-XXX data]$ df -hT
Filesystem                                                   Type      Size  Used Avail Use% Mounted on
/dev/mapper/lxc-data                                         xfs       119G   16G  104G  13% /mnt/data
<AWS-EFS-FILE-SYSTEM-ID>.efs.<MY-AWS-REGION>.amazonaws.com:/ nfs4      8.0E     0  8.0E   0% /mnt/efs/data
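
As a minimal sketch of how that value comes out, the snippet below applies the same `sed 1d` step the installer uses to a simulated df result. The df output is faked so it runs without an EFS mount, and the file system ID and region (fs-12345678, us-east-1) are made-up placeholders, not values from this install.

```shell
#!/usr/bin/env bash
# Sketch: how `df --output=source <path> | sed 1d` yields the device path.
# The df output is simulated; fs-12345678 and us-east-1 are placeholders.
simulated_df_output() {
  printf '%s\n' 'Filesystem' 'fs-12345678.efs.us-east-1.amazonaws.com:/'
}

# sed 1d drops the "Filesystem" header line and keeps the source value.
HOST_STORAGE_DEVICE_PATH=$(simulated_df_output | sed 1d)
echo "HOST_STORAGE_DEVICE_PATH=${HOST_STORAGE_DEVICE_PATH}"
```

For an NFS mount the source is a "host:/export" string rather than a /dev path, so the value carries a colon wherever it is used afterwards.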

During install, bootstrap appears to work fine until an "invalid volume specification" error occurs while creating the runner container [frc-runners-runner], as shown in the log output below, where <INSTALL-HOST-IP>, <MY-COORDINATOR-HOST-IP>, <AWS-EFS-FILE-SYSTEM-ID>, and <MY-AWS-REGION> stand for their respective values:

                             .
                             .
                             .
[2021-06-09 01:51:11,015][INFO ][org.apache.curator.framework.state.ConnectionStateManager] State change: CONNECTED {}
[2021-06-09 01:51:11,056][INFO ][no.found.bootstrap.BootstrapAdditional] Starting local runner {}
[2021-06-09 01:51:11,059][INFO ][no.found.bootstrap.containers.RunnerContainerBootstrap] Bootstrapping container [runners-runner] {}
[2021-06-09 01:52:41,382][INFO ][no.found.bootstrap.BootstrapAdditional$] Api Exception:  {}
no.found.docker.DockerApiException: Unable to create container [frc-runners-runner]
Docker API request: [HttpRequest(HttpMethod(POST),http://localhost:2375/v1.22/containers/create?name=frc-runners-runner,-,-,HttpProtocol(HTTP/1.1)) [...] List(Api-Version: 1.40, Docker-Experimental: false, Ostype: linux, Server: Docker/19.03.13 (linux), X-Content-Type-Options: nosniff, Date: Wed, 09 Jun 2021 01:52:41 GMT), invalid volume specification: '<AWS-EFS-FILE-SYSTEM-ID>.efs.<MY-AWS-REGION>.amazonaws.com::<AWS-EFS-FILE-SYSTEM-ID>.efs.<MY-AWS-REGION>.amazonaws.com:'
, HttpProtocol(HTTP/1.1)))]
        at no.found.docker.DockerApiException$.apply(DockerApiException.scala:93)
        at no.found.docker.DockerApiException$.apply(DockerApiException.scala:97)
        at no.found.docker.DockerApi.$anonfun$createContainer$1(DockerApi.scala:272)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
        at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
        at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:92)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
        at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:92)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[2021-06-09 01:52:41,405][ERROR][scala.Predef$            ] Uncaught throwable occurred on thread: [main], calling System.exit(1) {}
no.found.docker.DockerApiException: Unable to create container [frc-runners-runner]
Docker API request: [HttpRequest(HttpMethod(POST),http://localhost:2375/v1.22/containers/create?name=frc-runners-runner,-,-,HttpProtocol(HTTP/1.1)) [...] List(Api-Version: 1.40, Docker-Experimental: false, Ostype: linux, Server: Docker/19.03.13 (linux), X-Content-Type-Options: nosniff, Date: Wed, 09 Jun 2021 01:52:41 GMT), invalid volume specification: '<AWS-EFS-FILE-SYSTEM-ID>.efs.<MY-AWS-REGION>.amazonaws.com::<AWS-EFS-FILE-SYSTEM-ID>.efs.<MY-AWS-REGION>.amazonaws.com:'
, HttpProtocol(HTTP/1.1)))]
        at no.found.docker.DockerApiException$.apply(DockerApiException.scala:93)
        at no.found.docker.DockerApiException$.apply(DockerApiException.scala:97)
        at no.found.docker.DockerApi.$anonfun$createContainer$1(DockerApi.scala:272)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
        at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
        at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:92)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
        at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:92)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[2021-06-09 01:52:41,411][INFO ][no.found.util.LogApplicationExit$] Application is exiting {}

To debug the issue, I downloaded a local copy of the ECE script and made a small edit to the function that sets HOST_STORAGE_DEVICE_PATH, createAndValidateHostStoragePath(), so that it uses the same path as HOST_STORAGE_PATH. This is specific to my use case, but with the change in place I was able to complete the install successfully.

createAndValidateHostStoragePath() {
  uid=`id -u`
  gid=`id -g`

  if [[ ! -e ${HOST_STORAGE_PATH} ]]; then
    mkdir -p ${HOST_STORAGE_PATH}
    chown -R $uid:$gid ${HOST_STORAGE_PATH}
  fi

  if [[ ! -r ${HOST_STORAGE_PATH} ]]; then
    printf "${RED}%s${NC}\n" "Host storage path ${HOST_STORAGE_PATH} exists but doesn't have read permissions for user '${USER}'."
    printf "${RED}%s${NC}\n" "Please supply the correct permissions for the host storage path."
    exit $GENERAL_ERROR_EXIT_CODE
  fi

  if [[ ! -w ${HOST_STORAGE_PATH} ]]; then
    printf "${RED}%s${NC}\n" "Host storage path ${HOST_STORAGE_PATH} exists but doesn't have write permissions for user '${USER}'."
    printf "${RED}%s${NC}\n" "Please supply the correct permissions for the host storage path."
    exit $GENERAL_ERROR_EXIT_CODE
  fi
  
  # ORIGINAL
  # export HOST_STORAGE_DEVICE_PATH=$(df --output=source ${HOST_STORAGE_PATH} | sed 1d)

  # ********** MY EDIT **********
  export HOST_STORAGE_DEVICE_PATH=${HOST_STORAGE_PATH}
}
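
A narrower variant of that edit could keep the df-derived value for local block devices and fall back to the storage path only when the source is not an absolute path. The sketch below is an assumption, not part of the ECE installer: choose_device_path is a hypothetical helper, the sample values are placeholders, and it has not been run against a real ECE install.

```shell
#!/usr/bin/env bash
# Sketch: keep the df source for local devices, fall back to the storage
# path for anything that is not an absolute path (e.g. NFS "host:/export").
# choose_device_path is a hypothetical helper, not part of the ECE script.
choose_device_path() {
  local df_source="$1" storage_path="$2"
  if [[ "$df_source" == /* ]]; then
    # Local device such as /dev/mapper/lxc-data: keep the original behavior.
    printf '%s\n' "$df_source"
  else
    # Network or pseudo file system source: use the storage path instead.
    printf '%s\n' "$storage_path"
  fi
}

choose_device_path "/dev/mapper/lxc-data" "/mnt/data/elastic"
choose_device_path "fs-12345678.efs.us-east-1.amazonaws.com:/" "/mnt/efs/data/elastic"
```

Inside the installer function, the first argument would come from the original `df --output=source ${HOST_STORAGE_PATH} | sed 1d` pipeline and the second from HOST_STORAGE_PATH. Whether ECE behaves correctly when the device path equals the storage path is a separate question this sketch does not answer.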

Although the docker run arguments of the bootstrap initiator container (elastic-cloud-enterprise-installer) use -v ${HOST_STORAGE_PATH}:${HOST_STORAGE_PATH} to bind mount the volume, the subsequent frc-runners-runner container appears to use -v ${HOST_STORAGE_DEVICE_PATH}:${HOST_STORAGE_DEVICE_PATH}, which results in the "invalid volume specification" error reported above when the device path is an EFS DNS name.
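
To illustrate why such a spec is rejected: Docker's -v syntax is colon-separated with at most three fields (source, destination, options), so a valid spec has at most two colons. The sketch below only counts colons; count_colons is a hypothetical helper and the EFS name is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: count the colons in a `-v source:destination` volume spec.
# A spec with more than two colons cannot be split into the
# source/destination/options fields that -v expects.
count_colons() {
  local spec="$1"
  local only_colons="${spec//[^:]/}"   # delete every character except ':'
  echo "${#only_colons}"
}

local_spec="/mnt/efs/data/elastic:/mnt/efs/data/elastic"
efs_source="fs-12345678.efs.us-east-1.amazonaws.com:/"   # placeholder name
efs_spec="${efs_source}:${efs_source}"

echo "$(count_colons "$local_spec") colon(s) in ${local_spec}"
echo "$(count_colons "$efs_spec") colon(s) in ${efs_spec}"
```

The path-based spec has one colon, while the spec built from the EFS source has three, which is consistent with the error in the log.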

To retain the original intent of the exported variable, I also tried the following command substitution, which keeps the EFS DNS name in the device path for the install checks (see line 20 of bootstrap.conf). This also proved unsuccessful:

HOST_STORAGE_DEVICE_PATH=$(df --output=source ${HOST_STORAGE_PATH} | sed 1d | awk -v efs_path=${HOST_STORAGE_PATH} '{split($0,a,":"); print a[1]"\042:"efs_path"\042"}')
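
For reference, this is what that awk rewrite emits for a sample source. The df step is replaced by a fixed string so the snippet runs anywhere, and the file system ID and region are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: output of the awk rewrite for a sample df source value.
# fs-12345678 and us-east-1 are made-up placeholder values.
HOST_STORAGE_PATH="/mnt/efs/data/elastic"
df_source="fs-12345678.efs.us-east-1.amazonaws.com:/"

rewritten=$(printf '%s\n' "$df_source" |
  awk -v efs_path="${HOST_STORAGE_PATH}" \
    '{split($0,a,":"); print a[1]"\042:"efs_path"\042"}')

# \042 is the octal escape for a double quote, so the quotes are literal.
printf '%s\n' "$rewritten"
```

The result still contains a colon and now also carries literal double quotes, so it is still not a plain path. That may be why it did not help, although the post does not confirm the exact failure.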

From what I have observed, HOST_STORAGE_DEVICE_PATH appears to be a redundant environment variable in this context.

I have a premium orchestration license but posed the question here for the sake of anyone else who might have a similar use-case.

Any input would be appreciated.

Update on this: abandon this route if you end up here. Amazon EFS is a no-go because of disk quota errors. Once you generate enough daily indices for a single cluster, you will eventually reach a point where neither primary nor replica shards can be allocated.

Even though NFS itself permits millions of file-process pairs to hold locks, AWS allows only 256 of them on EFS.

You can determine how many file locks are held on a particular server/EC2 instance by running lslocks | grep java | wc -l. If the resulting number is at or approaching 256 (for entries under the EFS mount path used for ES data stores), stop provisioning additional ECE nodes on that host. The Java runtime used by ES won't be able to allocate additional shards, and you'll be left scratching your head wondering why.

AWS also does not allow users to request an increase to this limit.
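
The lslocks check can be wrapped in a small script for periodic monitoring. The sketch below is an illustration only: check_lock_headroom is a hypothetical helper, the warning threshold of 200 is arbitrary, and it reads lslocks-style text on stdin so it can be tried without an EFS mount. Like the one-liner, it counts every line mentioning java rather than only locks under the EFS mount path.

```shell
#!/usr/bin/env bash
# Sketch: compare the number of java-held locks with the 256 figure
# quoted in this post. Input is lslocks-style text on stdin.
LOCK_LIMIT=256

check_lock_headroom() {
  local warn_at="${1:-200}"   # arbitrary warning threshold, adjust to taste
  local count
  count=$(grep -c 'java' || true)   # grep -c prints 0 but exits 1 on no match
  if (( count >= LOCK_LIMIT )); then
    echo "CRITICAL: ${count} java locks (limit ${LOCK_LIMIT})"
  elif (( count >= warn_at )); then
    echo "WARNING: ${count} java locks (limit ${LOCK_LIMIT})"
  else
    echo "OK: ${count} java locks (limit ${LOCK_LIMIT})"
  fi
}

# On a real host: lslocks | check_lock_headroom
# Simulated single lock line for demonstration:
printf 'java 101 POSIX 0B WRITE 0 0 0 /mnt/efs/data/elastic/a.lock\n' | check_lock_headroom
```

Filtering the lslocks output by the EFS mount path before piping it in would give a count closer to what the limit actually applies to.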
