We recently upgraded one of our ES clusters from version 1.1.0 to 1.4.1. We have a dedicated master/data/search deployment in AWS, and the cluster settings are the same for all the clusters.
Strangely, in only one cluster we are seeing nodes constantly failing to connect to the master node and then rejoining. It happens all the time, even during idle periods (when there are no reads or writes).
We keep seeing the following exception in the logs:
org.elasticsearch.transport.NodeNotConnectedException
Because of this, the cluster has slowed down considerably.
We use the kopf plugin for monitoring, and it keeps popping up the message "Loading cluster information is taking too long".
There is not much data on the individual nodes; almost 80% of the disk is free. CPU and heap are doing fine.
The only difference between this cluster and the others is the number of indices and shards. The other clusters have shards in the hundreds and indices in the double digits, while this cluster has around 5000 shards and close to 250 indices. But we are not sure whether the number of shards or indices can cause reconnection issues between nodes.
Not sure if it's really related to 1.4.1 or something else, but in that case the other clusters should have been affected too.
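As an illustration of the kind of thing worth checking with a shard count that high: in 1.x the master republishes the full cluster state to every node on each change, so a large state plus frequent node joins/leaves can back up the master. A rough sketch of such a check (Python with the requests package; http://localhost:9200 is just a placeholder for any node with HTTP enabled):

import requests

ES = "http://localhost:9200"  # placeholder; point at any node with HTTP enabled

# Overall health: status, node count, shard totals.
health = requests.get(ES + "/_cluster/health").json()
print("status=%s nodes=%d active_shards=%d unassigned=%d" % (
    health["status"], health["number_of_nodes"],
    health["active_shards"], health["unassigned_shards"]))

# Tasks queued on the master; a long queue suggests it is falling behind.
pending = requests.get(ES + "/_cluster/pending_tasks").json()["tasks"]
print("pending cluster tasks: %d" % len(pending))
for task in pending[:5]:
    print("  %s (queued %s ms)" % (task["source"], task.get("time_in_queue_millis", "?")))

# Size of the serialized cluster state, which grows with indices and shards.
state = requests.get(ES + "/_cluster/state")
print("cluster state size: ~%d KB" % (len(state.content) // 1024))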
I have a similar story: After around six months using the v1.3.x series, I
upgraded from v1.3.4 to v1.4.4. I've had monitoring and metrics in place
for a while now, and compared to baseline I'm seeing occasional periods
where nodes appear to drop out of, and usually back into, the cluster (with
NodeNotConnectedException). During these periods the cluster status is red
on-and-off for maybe 15-30 minutes with anywhere from 30 minutes to 2 hours
in-between. The issue is worse in larger clusters with larger shard counts
(thousands, tens of thousands).
Resource utilization is still good. The number of shards (and the amount of
data) is essentially constant. I'm confident the upgrade was the only
change; I have strict controls on the clusters.
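For what it's worth, those windows are easy to capture with a dumb poller along these lines (sketch only, not what my monitoring actually runs; Python with requests, and the endpoint and interval are arbitrary):

import time
import requests

ES = "http://localhost:9200"  # placeholder

last = None
while True:
    try:
        health = requests.get(ES + "/_cluster/health", timeout=10).json()
        current = (health["status"], health["number_of_nodes"])
    except requests.RequestException:
        current = ("unreachable", 0)
    if current != last:
        # Log only transitions, so drop-out and recovery times stand out.
        print("%s status=%s nodes=%s" % (
            time.strftime("%Y-%m-%dT%H:%M:%S"), current[0], current[1]))
        last = current
    time.sleep(5)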
We have seen a similar issue in our setup too, but we are running 1.3.6. I think it occurs with large index and shard counts; we have approx 12000 shards in total. I think it's a bug in Elasticsearch, as having a large shard count should not mean that a node simply drops out of the cluster. The shard-count stats and the cluster-membership threads should not depend upon each other.
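For reference, the membership side is governed by the zen fault-detection settings (discovery.zen.fd.ping_interval, discovery.zen.fd.ping_timeout, discovery.zen.fd.ping_retries; the 1.x defaults are 1s, 30s, and 3 retries if I remember right). A quick sketch to see what, if anything, each node has explicitly overridden (Python with requests; note the nodes info API only returns explicitly set values, not defaults):

import json
import requests

ES = "http://localhost:9200"  # placeholder

nodes = requests.get(ES + "/_nodes/settings").json()["nodes"]
for node_id, info in nodes.items():
    settings = info.get("settings", {})
    # Settings may come back nested ({"discovery": {"zen": ...}}) or flattened
    # ("discovery.zen.fd.ping_timeout": "..."), so handle both.
    discovery = settings.get("discovery") or {
        k: v for k, v in settings.items() if str(k).startswith("discovery.")}
    print(info.get("name", node_id), json.dumps(discovery, indent=2, sort_keys=True))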
We have 10 data nodes storing the data, plus separate master and client nodes. So each data node has approx 1200 shards and 110 indices, which I don't think is much.
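A quick way to double-check that distribution is the allocation cat API, whose shards column is the per-node shard count. Rough sketch (Python with requests; endpoint is a placeholder):

import requests

ES = "http://localhost:9200"  # placeholder

# Ask only for the shard-count and node-name columns.
resp = requests.get(ES + "/_cat/allocation", params={"h": "shards,node"})
for line in resp.text.splitlines():
    cols = line.split(None, 1)  # node names can contain spaces
    if len(cols) == 2:
        print("%6s shards on %s" % (cols[0], cols[1].strip()))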
There are two different points we are trying to figure out:
1. Why did it start giving the above-mentioned error only in 1.4.1? It works fine with no issues in 1.3.x.
2. As I mentioned, the disks have 80% of their space free and we have 5000 shards across 200 indices in a 6-node cluster, but even in an idle cluster with no reads and no writes we are seeing this issue.
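One thing that can cause this even on an idle cluster is long old-generation GC pauses making a node miss fault-detection pings, so it may be worth comparing per-node GC time around the disconnects. A rough sketch of pulling those numbers (Python with requests; endpoint is a placeholder; take a snapshot before and after an incident and diff them):

import requests

ES = "http://localhost:9200"  # placeholder

nodes = requests.get(ES + "/_nodes/stats/jvm").json()["nodes"]
for node_id, stats in nodes.items():
    jvm = stats["jvm"]
    gc = ", ".join(
        "%s: %d collections / %d ms" % (
            name, c["collection_count"], c["collection_time_in_millis"])
        for name, c in sorted(jvm["gc"]["collectors"].items()))
    print("%s heap=%d%% %s" % (
        stats.get("name", node_id), jvm["mem"]["heap_used_percent"], gc))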