Hello all,
We have an ES cluster running in our pre-production environment, using official Docker images and Swarm.
The cluster consists of 3 nodes that each act as both data and master nodes, plus a single agent node. The stack YAML file is attached: link
The Docker configuration is attached as well: link
Environment:
• Single bare-metal server, 256 GB RAM, 72 cores
• Kernel 4.15.0-29-generic #31~16.04.1-Ubuntu
• Docker 18.06.0-ce
• docker-compose version 1.21.0
We are seeing one-directional connectivity loss between services (the ES nodes report "Node not connected").
That is, ES node 01 can reach node 02, but node 02 cannot reach node 01.
This causes cluster stability issues: master re-elections, unassigned shards, and so on.
Log samples are attached below. link-1 link-2 link-3
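A minimal sketch of how the one-way reachability could be checked from inside each ES container. It assumes the nodes listen on the default Elasticsearch transport port 9300 and that Python 3 is available in the container. The service names `es01`, `es02`, `es03` are hypothetical placeholders, not the names from the attached stack file. Running it in every container and comparing the results would show which direction fails.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_peers(peers, port=9300, timeout=3.0):
    """Map each peer name to whether its transport port is reachable from here."""
    return {peer: can_connect(peer, port, timeout) for peer in peers}


if __name__ == "__main__":
    # Hypothetical service names; replace with the names used in the stack file.
    peers = ["es01", "es02", "es03"]
    local = socket.gethostname()
    for peer, ok in probe_peers(peers).items():
        status = "reachable" if ok else "NOT reachable"
        print(f"{local} -> {peer}:9300 {status}")
```

This only tests whether a new TCP connection can be opened. It does not show whether an already established transport connection has been silently dropped, so a passing check does not rule out the problem.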
The issue occurs both when no data is being streamed (except monitoring) and when we are streaming a significant amount of data (10-15K pps).
The first two log samples are from when the cluster was idle (almost no data ingested, only a couple of MB stored), while the third is from when the cluster was populated with 1.1TB.
Once the cluster is populated, each glitch causes a painful recovery.
The Docker log is free of errors.
Note: the same behavior was observed when the swarm was running on top of a number of VMs, and with ES versions 5.6.4, 5.6.9 and 5.6.10.
Any thoughts on how to proceed with troubleshooting?
Thanks,
Daniel