Some technical background: I have 5 nodes in the cluster, all configured as both master and data nodes, with around 2TB of data. All of them run in Docker.
I just moved my cluster from AWS EC2 to Google Compute, and judging by the logs there seem to be constant network issues.
It starts with the error:
master left (reason = failed to ping, tried  times, each with maximum [30s] timeout), current nodes: nodes:
Judging by the logs, the nodes are not restarting (they run in Docker), just disconnecting and reconnecting.
They come back online after a few seconds, but once they are back I get the error
obtaining shard lock timed out after 5000ms
If I understand correctly, this causes the nodes to forget that they have data.
When I look at the master node's logs, I see one specific server disconnecting and reconnecting, so I think the problem is with that server.
- What should I check?
- Is there any hotfix to improve the time it takes to reconnect?
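
One thing that could be checked first is how often each node actually logs the two messages quoted above, to confirm whether a single server is responsible. Below is a minimal sketch, assuming each node's log is available as plain text lines and that the messages contain the substrings quoted in this post. The function name `count_events` and the sample lines are hypothetical, not taken from any real tool or log.

```python
# Substrings copied from the two error messages quoted in this post.
PATTERNS = {
    "master_left": "master left (reason = failed to ping",
    "shard_lock_timeout": "obtaining shard lock timed out",
}


def count_events(lines):
    """Count how many log lines contain each known error substring."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; real log lines would also carry
    # timestamps and node names, which this sketch ignores.
    sample = [
        "[node-3] master left (reason = failed to ping, ...",
        "[node-3] obtaining shard lock timed out after 5000ms",
        "[node-3] some unrelated message",
    ]
    print(count_events(sample))
```

Running this over each node's log separately (for example, on the output of `docker logs` saved to a file) and comparing the counts should show whether the disconnects really cluster on one server.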