Cluster failures

vinh · July 19, 2013, 4:42pm

Got a test cluster of 6+ data nodes and 3 masters (dataless). About 100+ indexes, each about 20GB in size. Deployed in AWS using zen discovery. Using mostly default configs for discovery and recovery. Most activity is indexing and at fairly modest rates.

Am starting to see issues with nodes dropping out of the cluster. Some stay out, and some are able to eventually rejoin. Applies to both data and master nodes, although I have not quite seen the elected master dropping out yet.

Logs seem to show zen ping timeouts on both master and data nodes. This is puzzling because the default fault detection (fd) settings are a 30s timeout and 3 retries. Increasing these to 60s and 6 retries, respectively, seems to help a little, but really just delays the issue.
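
For a rough sense of what those numbers mean, here is a back-of-the-envelope sketch. It assumes the settings in question are `discovery.zen.fd.ping_timeout` and `discovery.zen.fd.ping_retries`, and that a node is dropped only after every retry has run to its full timeout. It ignores the interval between pings, so treat the result as an approximation, not the exact behavior of zen fault detection. The function name is made up for illustration.

```python
def worst_case_detection_seconds(ping_timeout_s: float, ping_retries: int) -> float:
    """Approximate time before an unresponsive node is dropped.

    Assumption: each of the ping_retries attempts waits the full
    ping_timeout_s before failing; the interval between pings is ignored.
    """
    return ping_timeout_s * ping_retries


# Defaults mentioned in the post: 30s timeout, 3 retries.
default_window = worst_case_detection_seconds(30, 3)

# Raised values tried in the post: 60s timeout, 6 retries.
raised_window = worst_case_detection_seconds(60, 6)

print(f"default: ~{default_window:.0f}s, raised: ~{raised_window:.0f}s")
```

Under these assumptions a node already has to be silent for about a minute and a half with the defaults before it is removed, which is why raising the values looks more like masking the cause than fixing it.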

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

vinh · July 19, 2013, 4:52pm

Pressed send too early.

Is anyone seeing similar issues, or does anyone have possible solutions? Is there any way to prevent nodes, particularly masters, from failing to respond to zen pings without having to set the timeouts so high? 30s should ideally be more than sufficient. This seems like a very serious issue if basic, built-in cluster management operations begin to fail.

I'm seeing this with both 0.90.1 and 0.90.2.

Thanks,
-Vinh

