Cluster failures

I have a test cluster of 6+ data nodes and 3 dedicated (dataless) master nodes. It holds 100+ indices, each about 20GB in size. It is deployed in AWS using zen discovery, with mostly default configs for discovery and recovery. Most activity is indexing, at fairly modest rates.

I am starting to see nodes dropping out of the cluster. Some stay out, and some eventually rejoin. This applies to both data and master nodes, although I have not yet seen the elected master drop out.

Logs seem to show zen ping timeouts on both master and data nodes. This is puzzling because the default fault detection (fd) settings are a 30s ping timeout and 3 retries. Increasing these to 60s and 6 retries, respectively, seems to help a little, but really just postpones the problem.
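
To put rough numbers on those settings, here is a back-of-the-envelope sketch in Python. It assumes each retry waits out the full ping timeout and ignores the interval between pings, so treat the results as approximate upper bounds rather than exact behavior. The setting names are the zen fault detection ones (discovery.zen.fd.ping_timeout, discovery.zen.fd.ping_retries); the function name is made up for illustration.

```python
# Approximate time zen fault detection waits before giving up on a node.
# Assumption: every retry waits the full ping timeout, and the delay
# between pings (discovery.zen.fd.ping_interval) is ignored.

def worst_case_detection_seconds(ping_timeout_s: int, ping_retries: int) -> int:
    """Hypothetical helper: upper bound on seconds before a node is dropped."""
    return ping_timeout_s * ping_retries


# Values mentioned in this post: the defaults and the raised settings.
configs = {
    "default": {"discovery.zen.fd.ping_timeout": 30,
                "discovery.zen.fd.ping_retries": 3},
    "raised": {"discovery.zen.fd.ping_timeout": 60,
               "discovery.zen.fd.ping_retries": 6},
}

for label, cfg in configs.items():
    seconds = worst_case_detection_seconds(
        cfg["discovery.zen.fd.ping_timeout"],
        cfg["discovery.zen.fd.ping_retries"],
    )
    print(f"{label}: up to ~{seconds}s before the node is considered failed")
```

Under these assumptions, a node has to be unresponsive for well over a minute even with the defaults, and for several minutes with the raised values, which is why raising them only delays the dropouts.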

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Pressed send too early.

Is anyone seeing similar issues, and are there possible solutions? Is there any way to prevent a node, particularly a master, from failing to respond to zen pings, without having to set the timeouts so high? 30s should ideally be more than sufficient. This seems like a very serious issue if basic, built-in cluster management operations begin to fail.

I'm seeing this with both 0.90.1 and 0.90.2.

Thanks,
-Vinh

On Jul 19, 2013, at 9:42 AM, Vinh Nguyen vinh@loggly.com wrote:
