Slow cluster startup with zen discovery and large number of nodes

Hi,

In Elasticsearch 1.0, starting a cluster with 100 nodes takes half an
hour just for the nodes to join. In version 0.19.8 the nodes joined
very quickly. The delay seems to come from the master node publishing
the updated cluster state to every node after each single node
addition, and then waiting for the nodes to acknowledge the update
before adding the next node (zen-disco-receive).

Setting discovery.zen.publish_timeout: 0 seems to resolve the issue
during startup, because the master node no longer blocks, but I am
not sure whether something can go wrong later while running the
cluster with the timeout set to 0.
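
To put rough numbers on this, here is a back-of-the-envelope sketch. It
assumes joins really are handled strictly one at a time and that the
master waits for acknowledgements, but never longer than the publish
timeout, before taking the next join. The function names are made up for
illustration, and the 30 second timeout is an arbitrary comparison
value, not something taken from our configuration.

```python
def average_wait_per_join(total_seconds: float, joins: int) -> float:
    """Average time spent per join if joins are processed one after another."""
    if joins <= 0:
        raise ValueError("joins must be positive")
    return total_seconds / joins


def estimated_join_phase(joins: int, ack_seconds: float, publish_timeout: float) -> float:
    """Total duration of a serialized join phase, in seconds.

    Assumption: each join triggers one cluster state publish, and the master
    waits for acks for at most publish_timeout seconds before moving on.
    """
    return joins * min(ack_seconds, publish_timeout)


if __name__ == "__main__":
    # Observed: roughly 100 joins spread over half an hour.
    observed = average_wait_per_join(30 * 60, 100)
    print(f"average wait per join: {observed:.1f} s")

    # Illustrative timeouts only: 30 s (arbitrary) versus 0 s (our workaround).
    for timeout in (30.0, 0.0):
        minutes = estimated_join_phase(100, observed, timeout) / 60
        print(f"publish_timeout={timeout:>4} s -> join phase of about {minutes:.1f} min")
```

Under these assumptions the half hour works out to an average of 18
seconds of waiting per join, which is why removing the wait makes the
startup delay disappear.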

I also tried raising the kernel connection limits, but it did not
make a difference:
sysctl -w net.ipv4.tcp_max_syn_backlog=20480
sysctl -w net.core.somaxconn=8192
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_synack_retries=1

So the question is whether it is safe to run the cluster with
discovery.zen.publish_timeout set to 0, and whether it is expected
that zen discovery does not perform well with a larger number of
nodes. Or might there still be something wrong with the setup?

Thanks in advance,
Michel

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAH0sEYisBxUpRksMbYox06sbONjtOPtRbc%2Btqzzg%2Bu5%3DVrbrcw%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

You may be interested in some settings that help a full cluster restart:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html#recover-after

There is also a webinar that talks about some of the above:

http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist/


Hi,

I'm working with Michel on that issue:

The cluster is completely empty and has no indexes at all, so it
certainly is not related to recovery.
The old Elasticsearch version (0.19.8) does not have the code that
waits for replies; that wait is what causes the very slow startup in 1.0.

Thanks,
Thibaut

On Fri, Feb 21, 2014 at 6:11 PM, Binh Ly binh@hibalo.com wrote:

You may be interested in some settings that help a full cluster restart:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html#recover-after

There is also a webinar that talks about some of the above:

http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist/



Also, the kernel complains about too many connections being made at
once on the joining node (this seems to start once 30 nodes have
joined the cluster):

TCP: TCP: Possible SYN flooding on port 9300. Sending cookies. Check SNMP counters.
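
The "Check SNMP counters" hint points at the kernel's TCP statistics.
Below is a minimal sketch for reading them on Linux. It assumes
/proc/net/netstat is available and that the interesting TcpExt counters
are SyncookiesSent, ListenOverflows and ListenDrops (counter names can
differ between kernel versions). parse_netstat and syn_counters are
hypothetical helper names, not part of any existing tool.

```python
import os
from typing import Dict

# Counters assumed to be relevant for SYN backlog overflow; adjust per kernel.
WATCHED = ("SyncookiesSent", "ListenOverflows", "ListenDrops")


def parse_netstat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse /proc/net/netstat-style text into {section: {counter: value}}.

    The file consists of line pairs: a header line with counter names and a
    line with the matching values, both prefixed with the same "Section:".
    """
    lines = [line for line in text.splitlines() if line.strip()]
    sections: Dict[str, Dict[str, int]] = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        name, _, names = header.partition(":")
        value_name, _, numbers = values.partition(":")
        if name != value_name:
            raise ValueError(f"mismatched lines: {name!r} / {value_name!r}")
        sections[name] = dict(zip(names.split(), map(int, numbers.split())))
    return sections


def syn_counters(text: str) -> Dict[str, int]:
    """Return only the watched TcpExt counters that are present in the text."""
    tcp_ext = parse_netstat(text).get("TcpExt", {})
    return {key: tcp_ext[key] for key in WATCHED if key in tcp_ext}


if __name__ == "__main__":
    path = "/proc/net/netstat"
    if os.path.exists(path):  # Linux only
        with open(path) as handle:
            for key, value in syn_counters(handle.read()).items():
                print(f"{key}: {value}")
```

Running this on a joining node before and after a cluster start and
comparing the values should show whether the listen queue on port 9300
is actually overflowing, or whether the kernel message is a side effect.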

On Fri, Feb 21, 2014 at 6:44 PM, Thibaut Britz t.britz@trendiction.com wrote:

Hi,

I'm working with Michel on that issue:

The cluster is completely empty and has no indexes at all, so it
certainly is not related to recovery.
The old Elasticsearch version (0.19.8) does not have the code that
waits for replies; that wait is what causes the very slow startup in 1.0.

Thanks,
Thibaut

