Hi,
Lately we have been running a couple of tests against Elasticsearch clusters on Kubernetes.
Unfortunately, one of our tests didn’t go as we anticipated.
A little bit of practical information about our cluster:
- The deployment of ECK is on the OpenShift platform, version 4.6.1
- ECK operator version - 1.2.0
- Elasticsearch version - 7.6.2
- Three bare-metal OpenShift servers on which the pods are deployed
- The configuration we work with is based on Availability Zone Awareness, as presented in the documentation here: Advanced Elasticsearch node scheduling | Elastic Cloud on Kubernetes [1.3] | Elastic
- In our deployment, we chose to distribute the pods into two zones (a rough sketch of the manifest follows this list)
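For context, here is a rough sketch of how our nodeSets are split across the two zones, following the documentation example linked above. The cluster name, node counts, zone values, and the node label key are placeholders rather than our exact values:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: my-cluster                 # placeholder cluster name
spec:
  version: 7.6.2
  nodeSets:
  - name: zone-a
    count: 1                       # placeholder count
    config:
      node.attr.zone: zone-a
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone   # placeholder zone label on our bare-metal nodes
                  operator: In
                  values:
                  - zone-a
  - name: zone-b
    count: 1                       # placeholder count
    config:
      node.attr.zone: zone-b
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - zone-b
```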
Basically, our tests were designed to check high availability.
This specific test simulates a network failure on one server in one zone and goes as follows:
Expectation - the cluster should remain available for data ingestion and searches.
In practice, the opposite happened.
The cluster didn’t respond to client requests, and we saw latency of between 20 and 70 seconds in both searches and data ingestion.
We came up with another theory - perhaps the cluster wasn’t responding externally, but internally it was still working.
That didn’t hold up either - we ran a local curl request to the _cat/nodes API from the CLI of one of the master-node pods, and still there was no response until around 40 seconds later.
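For reference, the in-pod check was essentially the following (cluster name, namespace, and pod name are placeholders); even run locally like this, the response took around 40 seconds:

```sh
# Read the elastic user's password from the secret ECK creates (placeholder names)
PASSWORD=$(oc get secret my-cluster-es-elastic-user -n my-namespace \
  -o go-template='{{.data.elastic | base64decode}}')

# Run the request from inside one of the master pods and time it
oc exec -n my-namespace my-cluster-es-master-nodes-0 -- \
  bash -c "time curl -sk -u \"elastic:$PASSWORD\" https://localhost:9200/_cat/nodes?v"
```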
We are out of ideas right now.
If anyone here is using ECK in a production environment and knows what’s going on, we would love to hear about it.