Elasticsearch on Google Compute - Network issues and shard lock failures

eranhirs · August 21, 2018, 9:03pm

Some technical background: the cluster has 5 nodes, all configured as both master and data nodes, with around 2TB of data. All of them run in Docker.

I just moved my cluster from AWS EC2 to Google Compute, and looking at the logs it seems there are constant network issues.

It starts with the error:

master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:

Judging by the logs, the nodes (running in Docker) are not restarting, just disconnecting and reconnecting.

After a few seconds they are back online, but then I get the error:

obtaining shard lock timed out after 5000ms

If I understand correctly, this causes them to forget they have data.

In the master node's logs I see one specific server disconnecting and reconnecting, so I think the problem is with that server.

What should I check?
Is there any quick fix to shorten the time it takes to reconnect?
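
One possible first check, as a rough sketch only: after a disconnect, list every shard copy that is not in the STARTED state together with the reason the cluster reports for it. The snippet below uses the `_cat/shards` API with JSON output. The address `http://localhost:9200` (no authentication) and the helper names `fetch_shards`, `not_started` and `report` are assumptions made for illustration, not something taken from this cluster.

```python
import json
import urllib.request

# Assumption: a node of the cluster is reachable here without authentication.
ES_URL = "http://localhost:9200"

# Columns requested from the _cat/shards API.
COLUMNS = "index,shard,prirep,state,node,unassigned.reason"


def fetch_shards(base_url=ES_URL):
    """Fetch one JSON row per shard copy from the _cat/shards API."""
    url = f"{base_url}/_cat/shards?format=json&h={COLUMNS}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def not_started(shards):
    """Keep only the shard copies whose state is not STARTED."""
    return [s for s in shards if s.get("state") != "STARTED"]


def report(base_url=ES_URL):
    """Print every shard copy that is not started, with the reported reason."""
    for s in not_started(fetch_shards(base_url)):
        print(s["index"], s["shard"], s["prirep"], s["state"],
              s.get("node"), s.get("unassigned.reason"))
```

Calling `report()` right after a node drops out and again a minute later would show whether the affected shards recover on their own or stay unassigned.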

eranhirs · August 23, 2018, 7:18am

I have some more information:

master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:

It is strange that it supposedly couldn't ping for up to 90 seconds (3 tries with a maximum 30s timeout each), because the node already reconnects only 3 seconds after the disconnect.
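
To make the arithmetic explicit, here is a minimal sketch that reads the retry count and per-ping timeout out of the message quoted above and multiplies them. The function name `worst_case_seconds` is hypothetical. Note that the result is only an upper bound: the message says "maximum", so a ping that fails immediately (for example on a closed connection) presumably does not use up its full timeout, which might be why the whole episode can be over within a few seconds.

```python
import re

# Message text copied from the log quoted above (the node list is omitted).
MESSAGE = ("master left (reason = failed to ping, tried [3] times, "
           "each with maximum [30s] timeout)")

# Captures the number of tries and the per-try timeout in seconds.
PATTERN = re.compile(
    r"tried \[(\d+)\] times, each with maximum \[(\d+)s\] timeout"
)


def worst_case_seconds(message: str) -> int:
    """Return tries * per-ping timeout, the longest the ping phase could last."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like a failed-ping report")
    tries, timeout_s = (int(group) for group in match.groups())
    return tries * timeout_s


print(worst_case_seconds(MESSAGE))  # 3 * 30 = 90
```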

In these 3 seconds, the master node receives the message

"mark copy as stale"

I guess this is what causes the shard lock error.
