Strange issue after upgrading from ES version 1.1.0 to 1.4.1

Hi,

We recently upgraded one of our ES clusters from version 1.1.0 to 1.4.1.

We run a dedicated master/data/search deployment in AWS. Cluster settings
are the same for all of our clusters.

Strangely, in only one cluster we are seeing nodes constantly lose their
connection to the master node and then rejoin. It happens all the time,
even during idle periods (when there are no reads or writes).

We keep seeing the following exception in the logs:
org.elasticsearch.transport.NodeNotConnectedException

Because of this, the cluster has slowed down considerably.

We use the kopf plugin for monitoring, and it keeps popping up the message
"Loading cluster information is taking too long".

There is not much data on the individual nodes; almost 80% of the disk is
free, and CPU and heap are doing fine.

The only difference between this cluster and the others is the number of
indices and shards. The other clusters have shards in the hundreds and
indices in the double digits, while this cluster has around 5000 shards
and close to 250 indices.

But we are not sure whether the number of shards or indices can cause
reconnection issues between nodes.
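In case it helps with that question, shard distribution is easy to eyeball from the `_cat/shards` API. The sketch below runs on a made-up sample; on a live cluster the file would instead come from something like `curl -s localhost:9200/_cat/shards > /tmp/shards.txt` (the host, port, and node names here are assumptions, not from this cluster):

```shell
# Hypothetical sample of `_cat/shards` output; the default columns are
# index, shard, prirep, state, docs, store, ip, node.
cat <<'EOF' > /tmp/shards.txt
logs-2015.02.24 0 p STARTED 12345 1.2mb 10.0.0.1 node-1
logs-2015.02.24 0 r STARTED 12345 1.2mb 10.0.0.2 node-2
logs-2015.02.25 0 p STARTED 999 0.5mb 10.0.0.2 node-2
EOF

# Count shards per node (last column) to spot uneven distribution.
awk '{count[$NF]++} END {for (n in count) print n, count[n]}' /tmp/shards.txt | sort
```

On the sample above this prints one line per node with its shard count, which makes a hot node with a disproportionate share of the ~5000 shards stand out quickly.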

We're not sure if it's really related to 1.4.1 or something else, but in
that case the other clusters should have been affected too.

Any help will be appreciated!

Thanks,

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/12e9b4ab-7600-4d22-a347-c03edaebe4f3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

+1

I have a similar story: after around six months on the v1.3.x series, I
upgraded from v1.3.4 to v1.4.4. I've had monitoring and metrics in place
for a while now, and compared to baseline I'm seeing occasional periods
where nodes appear to drop out of (and usually back into) the cluster,
with NodeNotConnectedException. During these periods the cluster status is
red on and off for maybe 15-30 minutes, with anywhere from 30 minutes to 2
hours in between. The issue is worse in larger clusters with larger shard
counts (thousands, tens of thousands).

Resource utilization is still good. The number of shards (and the amount
of data) is essentially constant. I'm confident the upgrade was the only
change; I have strict controls on the clusters.

On Wed, Feb 25, 2015 at 5:18 PM, sagarl <sagarit2@gmail.com> wrote:


We have seen a similar issue in our setup too, but we are running 1.3.6. I
think it occurs with large index and shard counts; we have approximately
12000 shards in total. I think it's a bug in Elasticsearch, as having a
large shard count should not mean that a node simply drops out of the
cluster. The shard-count stats and the cluster-membership threads should
not depend on each other.

On Thursday, February 26, 2015 at 7:36:20 AM UTC+1, Sean Clemmer wrote:


12000 shards across how many nodes?

Don't forget that a shard is a Lucene instance; it needs resources to
operate, and a node only has so many resources. This is why scaling is
important.
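To make the resource point concrete, here is the back-of-envelope arithmetic for the two clusters described in this thread (shard and node counts are taken from the posts; the per-node figures are just integer division):

```shell
# Each shard is a full Lucene index with its own file handles, buffers,
# and merge activity, so per-node shard counts multiply that overhead.
echo "original cluster: 5000 shards / 6 data nodes  = $((5000 / 6)) shards per node"
echo "this cluster:    12000 shards / 10 data nodes = $((12000 / 10)) shards per node"
```

Both figures are far above the low hundreds per node that the healthy clusters in this thread run with.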

On 27 February 2015 at 10:10, emkt84@gmail.com wrote:


We have 10 data nodes storing the data, plus separate master and client
nodes. So each node has approximately 1200 shards and 110 indices, which I
don't think is much.

On Fri, Feb 27, 2015 at 12:45 AM, Mark Walkom <markwalkom@gmail.com> wrote:


That's excessive. You don't need that many shards for 10 nodes.
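For new indices, the shard count can be capped going forward with an index template (PUT to `/_template/<name>` on the 1.x series). The template name, index pattern, and numbers below are illustrative placeholders, not a prescription; one primary per data node plus one replica is a common starting point:

```json
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 1
  }
}
```

Existing over-sharded indices would still need to be reindexed to benefit, since a shard count cannot be changed after index creation.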

On 27 February 2015 at 18:31, Em Kt <emkt84@gmail.com> wrote:


There are two different points we are trying to figure out:

  1. Why did it start giving the above-mentioned error only in 1.4.1? It was working fine with no issues in 1.3.x.

  2. As I mentioned, the disks have 80% free space and we have 5000 shards across 200 indices in a 6-node cluster, but even in an idle cluster with no reads and no writes we are seeing this issue.
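One thing that might be worth ruling in or out (an assumption on my part, not a confirmed fix): zen fault detection is what decides a node has dropped, and its knobs are configurable in elasticsearch.yml on the 1.x series. Temporarily loosening them should tell you whether slow ping round-trips are behind the NodeNotConnectedException flapping; the values below are illustrative, not recommendations:

```yaml
# elasticsearch.yml -- experiment only; values here are illustrative.
discovery.zen.fd.ping_interval: 5s   # how often nodes and master ping each other (default 1s)
discovery.zen.fd.ping_timeout: 60s   # how long to wait for a ping reply (default 30s)
discovery.zen.fd.ping_retries: 6     # failed pings tolerated before a node is dropped (default 3)
```

If the flapping stops with looser settings, the problem is likely slow cluster-state handling under the large shard count rather than the network itself.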


I have created issue
https://github.com/elasticsearch/elasticsearch/issues/10003 on GitHub.

@Sean, @Em, please feel free to comment on it.

Thanks,

On Tuesday, March 3, 2015 at 8:33:29 AM UTC-8, sagarl wrote:

