We tried installing the operator several times, starting with 0.8.0 and then 0.8.1, and always hit the same issue.
Whenever we reboot one or a few machines, the cluster never recovers: all masters become unavailable and it keeps trying to recover but fails indefinitely. Below is a snippet of logs from kubetail'ing all pods. At this point we're close to giving up on the operator; it looks like we don't have enough technical knowledge to run this in production, and if issues occur we have no way to recover.
Not sure if these events are of any help:
LAST SEEN TYPE REASON OBJECT MESSAGE
1s Warning BackOff pod/kibana-kibana-c98867586-4594g Back-off restarting failed container
1s Warning BackOff pod/kibana-kibana-c98867586-4594g Back-off restarting failed container
1s Warning Unhealthy pod/kibana-kibana-c98867586-4594g Readiness probe failed: HTTP probe failed with statuscode: 503
114s Warning FailedToUpdateEndpoint endpoints/elastic-es-discovery Failed to update endpoint ops/elastic-es-discovery: Operation cannot be fulfilled on endpoints "elastic-es-discovery": the object has been modified; please apply your changes to the latest version and try again
114s Warning FailedToUpdateEndpoint endpoints/elastic-es Failed to update endpoint ops/elastic-es: Operation cannot be fulfilled on endpoints "elastic-es": the object has been modified; please apply your changes to the latest version and try again
1s Normal Killing pod/elastic-es-qqtm956plr Stopping container elasticsearch
0s Normal Killing pod/elastic-es-k2bcb29q68 Stopping container elasticsearch
115s Warning FailedToUpdateEndpoint endpoints/elastic-es Failed to update endpoint ops/elastic-es: Operation cannot be fulfilled on endpoints "elastic-es": the object has been modified; please apply your changes to the latest version and try again
0s Normal Killing pod/elastic-es-pzthtpb4mh Stopping container elasticsearch
0s Normal Killing pod/elastic-es-9hdbl2tzj7 Stopping container elasticsearch
0s Normal Killing pod/elastic-es-4vkcnm5kxv Stopping container elasticsearch
0s Warning Unhealthy pod/elastic-es-9hdbl2tzj7 Readiness probe failed:
0s Warning Unhealthy pod/elastic-es-pzthtpb4mh Readiness probe failed:
0s Warning Unhealthy pod/elastic-es-9hdbl2tzj7 Readiness probe failed:
0s Warning Unhealthy pod/elastic-es-9hdbl2tzj7 Readiness probe failed:
1s Warning Unhealthy pod/kibana-kibana-c98867586-4594g Readiness probe failed: HTTP probe failed with statuscode: 503
Or these logs:
[elastic-es-655t98ckzr cert-initializer] 2019-07-10T16:46:32.499Z INFO certificate-initializer No private key found on disk, will create one {"reason": "open /mnt/elastic/private-key/node.key: no such file or directory"}
[elastic-es-655t98ckzr prepare-fs] at org.elasticsearch.cli.Command.main(Command.java:90)
[elastic-es-655t98ckzr prepare-fs] at org.elasticsearch.plugins.PluginCli.main(PluginCli.java:47)
[elastic-es-655t98ckzr cert-initializer] 2019-07-10T16:46:32.499Z INFO certificate-initializer Creating a private key on disk
[elastic-es-655t98ckzr prepare-fs] Installed plugins:
[elastic-es-655t98ckzr cert-initializer] 2019-07-10T16:46:32.815Z INFO certificate-initializer Generating a CSR from the private key
[elastic-es-655t98ckzr prepare-fs] Plugins installation duration: 52 sec.
[elastic-es-655t98ckzr cert-initializer] 2019-07-10T16:46:32.818Z INFO certificate-initializer Serving CSR over HTTP {"port": 8001}
[elastic-es-655t98ckzr cert-initializer] 2019-07-10T16:46:32.818Z INFO certificate-initializer Watching filesystem for cert update
Can you share your Elasticsearch cluster yaml specification?
Are you using PersistentVolumes? Which storage class implementation?
Nodes should normally come back to life by reusing persistent volumes.
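For reference, a minimal spec of the kind we are asking about looks roughly like the one below. This is only a sketch: the API version and field names follow the 0.8/0.9 quickstart as far as I recall, the cluster name and namespace are taken from your events above, and the Elasticsearch version, storage class and size are placeholders to replace with your real values.

apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: elastic                  # matches the elastic-es-* pods in your events
  namespace: ops
spec:
  version: 7.2.0                 # placeholder, use your actual Elasticsearch version
  nodes:
  - nodeCount: 3
    config:
      node.master: true
      node.data: true
    # Persistent storage: the operator reuses these volumes when a pod comes back,
    # so the storage class matters for recovery after a machine reboot.
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        storageClassName: standard   # placeholder, tell us which class you use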
Based on the logs, it looks like some pods are waiting for certificates that should be provided by the operator; there could be a bug here. In the upcoming 0.9 release we changed the way certificates are delivered to the pod, which will likely fix this issue.
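In the meantime, the cert-initializer init container logs and the operator's own logs are the first places to look when a pod appears stuck waiting for a certificate. Something along these lines, using the pod and namespace names from your output above (the operator namespace and name assume a default all-in-one install, adjust if yours differs):

# Init container logs for one of the stuck pods
kubectl logs -n ops elastic-es-655t98ckzr -c cert-initializer

# Container statuses and recent events for the same pod
kubectl describe pod -n ops elastic-es-655t98ckzr

# Operator logs (a default install runs it in elastic-system as elastic-operator)
kubectl logs -n elastic-system statefulset/elastic-operator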
Tried installing 0.9.0 and it appears to have the same issue. Could you elaborate on ways to troubleshoot? Basically, on a fresh 0.9.0 install, a single node won't rejoin the cluster after the machine it runs on is restarted.
Unlike the demo docs, we use rancher/local-path-provisioner as the local-disk StorageClass. Not sure if there's any conflicting functionality there, but otherwise the setup is pretty much the same.
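For reference, the StorageClass is essentially the provisioner's default, roughly like this (reproduced from memory of the local-path-provisioner defaults, so treat the exact values as approximate):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer   # each PV is created on one specific node's disk
reclaimPolicy: Delete

If I understand local volumes correctly, a PV created this way lives on a single node's disk, so the pod that owns it can only come back on that same node once it is up again.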