Hung node, cluster state green


(davie) #1

Hi,
We recently had a situation where one of our nodes was hanging, http
requests to 9200 did not return, and indexing/search requests were timing
out.

I'm still looking into the cause of this, but I'm slightly surprised that
the cluster state stayed green.
When I shut down the hung node, the other node processed the requests fine.
Presumably this is a bug: that node should have been disconnected from the
cluster, right?

Thanks,
Davie


(Shay Banon) #2

Yes, assuming it was not responding to pings from the master as well. It
takes time though, depending on the ping timeout settings, which are pretty
conservative.
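
To get a feel for how long "it takes time" can be, here is a rough back-of-the-envelope sketch. It assumes the zen fault-detection settings `discovery.zen.fd.ping_interval`, `discovery.zen.fd.ping_timeout` and `discovery.zen.fd.ping_retries` with defaults of 1s, 30s and 3; those names and defaults are assumptions to check against the docs for your version. The formula itself is a simplification, not the actual implementation.

```python
def worst_case_detection_seconds(ping_interval, ping_timeout, ping_retries):
    """Rough upper bound on the time to notice a node that never answers pings.

    Simplified model (an assumption): wait one ping interval, then let
    every retry run to its full timeout before the node is dropped.
    """
    return ping_interval + ping_retries * ping_timeout


# Assumed defaults: 1s interval, 30s timeout, 3 retries.
print(worst_case_detection_seconds(1, 30, 3))  # 91

# A hypothetical, more aggressive LAN configuration.
print(worst_case_detection_seconds(1, 5, 3))   # 16
```

Note that this only applies when the node really stops answering pings. A node that hangs on HTTP but still answers pings would stay in the cluster regardless of the timeouts.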

In reply to Davie Moston's message of Wed, Jun 13, 2012 at 11:55 AM.


(Daniel Schnell) #3

I am having the exact same problem at the moment. So far it has been going very well with ES.

Since yesterday, however, I have been running a cleanup script that uses the 'more like this' query to find duplicate entries, walking through all documents in descending id order (btw: is there a better way?).
After 30,000-50,000 docs the same problem you described happens: HTTP requests to this machine time out, but the cluster status stays green.
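
On the side question of whether there is a better way to find duplicates, one common alternative is to fingerprint each document and group by fingerprint. This is only a sketch under assumptions: the field names and sample documents are hypothetical, and it catches only exact duplicates after normalization, not the near-duplicates that a 'more like this' query can find.

```python
import hashlib
from collections import defaultdict


def fingerprint(doc, fields):
    """Hash the chosen fields after trimming and lowercasing them."""
    normalized = "\x1f".join(str(doc.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs, fields):
    """Return lists of ids that share a fingerprint, in iteration order."""
    groups = defaultdict(list)
    for doc in docs:
        groups[fingerprint(doc, fields)].append(doc["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical sample documents, iterated in descending id order.
docs = [
    {"id": 3, "title": "Hung node", "body": "cluster green"},
    {"id": 2, "title": "hung node ", "body": "Cluster green"},
    {"id": 1, "title": "Other", "body": "text"},
]
print(find_duplicates(docs, ["title", "body"]))  # [[3, 2]]
```

This needs one pass over the documents and no query per document, so it puts far less load on the cluster than issuing one search per doc.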

My gut feeling is that it has to do with the number of open file descriptors. I have already set the limit to 32000, but ulimit -a still shows the Ubuntu 10.04 default of 1024, although bigdesk shows 32000.
This would also explain why the cluster state is still green. As I understand it, the ES servers use persistent connections between them, so they would only hit the problem when they create new index files or rebalance shards, right?
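
A possible explanation for the mismatch (an assumption, not confirmed in this thread): `ulimit -a` reports the limit of the shell it is run in, while bigdesk reports what the ES process itself sees. On Linux, the limits of a running process can be read from `/proc/<pid>/limits`. A minimal sketch for pulling out the open-files limit follows; the sample text is made up for illustration.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits, or None.

    Values are kept as strings because a limit can be "unlimited".
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return soft, hard
    return None


def read_max_open_files(pid):
    """Linux only: read the limits of a running process by pid."""
    with open("/proc/%d/limits" % pid) as f:
        return parse_max_open_files(f.read())


# Made-up sample in the layout of /proc/<pid>/limits.
sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max processes             15000                15000                processes\n"
    "Max open files            1024                 4096                 files\n"
)
print(parse_max_open_files(sample))  # ('1024', '4096')
```

Calling `read_max_open_files` with the pid of the ES JVM shows the limit that actually applies to it, independent of what any shell reports.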

This problem is reproducible. It takes a while, but it happens even on my development MacBook Pro. I am using ES 0.19.3.

Regards,
Daniel.

In reply to Shay Banon's message of 14.06.2012, 23:29.


(Daniel Schnell) #4

Hi,

After investigating during the day, I would rule out file descriptors. I set all of the limits to at least 32000 and couldn't see any dramatic increase in open descriptors.
But I noticed a different JVM on one of the machines: JDK 1.7-04 instead of the 1.6.20 on the other two, so I replaced it with the official Ubuntu OpenJDK 1.6.20. Furthermore, I increased the JVM memory from 1 GB to 8 GB on each of the 3 nodes of the cluster, which is half of what is available.
With these settings it took quite a while longer until the HTTP timeouts happened. Some time after the timeouts I can see one of the nodes disconnect from the cluster. There is a java_pid31477.hprof file in the elasticsearch/ directory of that node, with a timestamp close to when the problem happened. Does this hint at anything obvious?

My setup is the standard 5 shards / 1 replica setup. I have 1 index with about 360,000 docs.

Sorry for hijacking your thread; I thought my problem might well be related to yours.

Best regards,

Daniel.

In reply to Daniel Schnell's message of 15.06.2012, 09:21.


(Daniel Schnell) #5

Hi,

Testing with 0.19.4: still no go. The JVM is at 100%, and I have to kill it with -9 and restart ES. There are no hints in the logs; only an hprof file is dumped. The other nodes do notice the failing node, but only after quite a while.
If I analyze the hprof file with MAT, what should I look for?

In a LAN with sub-millisecond ping times, what is a reasonable value to configure for the timeouts? Which settings should I change? As I understand it, the other cluster nodes have to notice pretty quickly when a node goes down; otherwise they keep routing queries to the failing node for too long, right?

Regards,

Daniel.

In reply to Daniel Schnell's message of 16.06.2012, 00:01.


(Daniel Schnell) #6

Hi,

Updated to version 0.19.6, and it works now!

I could finally iterate over all >400,000 docs with my script and executed >350,000 'more like this' queries to clean up the duplicated docs while another process was inserting new docs. This was not possible before.

Thanks Shay and folks: this gives me much more confidence in the technology now :wink:

Regards,

Daniel.

In reply to Daniel Schnell's message of 21.06.2012, 09:50.

