Troubleshooting ECE failure after restart

(D.J) #1

I have installed ECE 1.1.3 and used the UI to create a 3-node cluster.
I wanted to check how it behaves after a host failure (reboot).
Now I see Docker is up, and "docker ps" shows many containers running.
I was able to open the ECE UI but don't see anything there! No cluster, nada.

I ran the curl API call and it returns output with all clusters (2 internal) plus my cluster, but most are not healthy.

I am rather new to the ECE concept and to Docker.
In the past, with standalone, I knew to check the elastic/kibana processes.
But now, the documentation is not that good for troubleshooting such issues.

Questions:

  1. What should I check is running on the host machine?
  2. Is there any way to validate that ECE started well?
  3. I have about 50 log files that were written. Which log should I start with?
    Is there any documentation about the logs or directory structure?

Thank you

(Yuri Tceretian) #2

Hi there,
what's your OS and version of Docker?

  1. Make sure that all containers on the host you restarted are running (check docker ps -a). Sometimes not all containers are started because of a bug in the older Docker version (1.11.x), which is the only one we support. If you see that some containers are not running, try removing them (do not worry, data will not go away) using docker rm {container name or id}, and in a few seconds check that the container has been re-created.
  2. Basically, there are a few indicators: the allocators page, which displays the statuses of allocators; the runners page, which displays the statuses of runners; and platform info, which displays the status of the ZooKeeper ensemble.
  3. The first place you should go is the logging-and-metrics cluster, which accumulates logs from all services. If you prefer using Unix tools, you can go to {host-storage-dir}/{runner-id}/services/{service}/logs and pick the most recent log.
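
The Unix-tools route in point 3 could be sketched in Python roughly as follows. This is only a sketch: newest_log is a hypothetical helper name, and the directory you pass in must be the real path, with the {host-storage-dir}, {runner-id} and {service} placeholders filled in for your installation. It assumes nothing about log file names and simply picks the most recently modified regular file.

```python
from pathlib import Path
from typing import Optional


def newest_log(log_dir: str) -> Optional[Path]:
    """Return the most recently modified regular file in log_dir, or None if it has no files.

    Hypothetical helper for illustration; log_dir is whatever the logs
    directory of the service resolves to on your host.
    """
    # Only regular files are considered; subdirectories are skipped.
    files = [p for p in Path(log_dir).iterdir() if p.is_file()]
    if not files:
        return None
    # The file with the largest modification time is the most recent one.
    return max(files, key=lambda p: p.stat().st_mtime)
```

Calling newest_log on the logs directory of one service returns the path of the file to open first.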

(D.J) #3

Thank you Yuri,
I'm using RHEL 7.3 with Docker 17.12 (although I know it was certified with version 1.11).
You mentioned checking that all containers are running. I have about 20 rows, so how do I know if one is missing?
Is there any documentation of what should exist?
I have, for example:
fac-.....-instance-000000000X x 8 rows
Is there any documentation of what I should have or expect to have?

Thank you

(Yuri Tceretian) #4

Hi DJ,
Unfortunately, I do not see any documentation for it. However, unless you explicitly delete containers, they will not disappear after the restart, but they can be stopped, and something could prevent them from starting.
What are the statuses of the containers? Are there any statuses other than Up ... or Started?
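
One rough way to answer that question is to filter the docker ps -a output for anything whose status does not start with "Up". A minimal Python sketch, assuming Python 3.7 or later, the docker CLI on the PATH, and a Docker version whose ps command supports the --format option with the .Names and .Status fields. The function names are made up for illustration.

```python
import subprocess
from typing import List, Tuple


def parse_not_up(ps_output: str) -> List[Tuple[str, str]]:
    """Return (name, status) pairs for containers whose status does not start with 'Up'.

    Expects one container per line, formatted as name<TAB>status.
    """
    result = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, status = line.partition("\t")
        if not status.startswith("Up"):
            result.append((name, status))
    return result


def containers_not_up() -> List[Tuple[str, str]]:
    """Run `docker ps -a` and return the containers that are not reported as Up."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_not_up(out)
```

An empty list from containers_not_up() means every container on the host is reported as Up. Anything else is a candidate for the docker rm step described earlier in the thread.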
