Finding the reason behind random node shutdowns


(Donald Piret-2) #1

Hello,

We're currently setting up a cluster of 3 ES nodes running on EC2 with auto
discovery.
Everything seems to be working fine except for regular, seemingly random
node restarts, which make the master move around quite a bit and make the
nodes sometimes unavailable for a few seconds (quite annoying, as we're
using Tire, which doesn't natively support falling back to alternate nodes
when queries fail).

The log files don't indicate anything specific and look like this:

[2013-08-29 06:54:41,310][INFO ][node ] [elasticsearch3] version[0.90.3], pid[7919], build[5c38d60/2013-08-06T13:18:31Z]
[2013-08-29 06:54:41,310][INFO ][node ] [elasticsearch3] initializing ...
[2013-08-29 06:54:41,401][INFO ][plugins ] [elasticsearch3] loaded [analysis-kuromoji, cloud-aws], sites [bigdesk, head, paramedic]
[2013-08-29 06:54:45,562][INFO ][node ] [elasticsearch3] initialized
[2013-08-29 06:54:45,563][INFO ][node ] [elasticsearch3] starting ...
[2013-08-29 06:54:45,790][INFO ][transport ] [elasticsearch3] bound_address {inet[/10.158.2.117:9300]}, publish_address {inet[/10.158.2.117:9300]}
[2013-08-29 06:54:49,826][INFO ][cluster.service ] [elasticsearch3] new_master [elasticsearch3][0-r3DpIaS4aehPk9BQcrAQ][inet[/10.158.2.117:9300]], reason: zen-disco-join (elected_as_master)
[2013-08-29 06:54:49,835][INFO ][discovery ] [elasticsearch3] elasticsearch/0-r3DpIaS4aehPk9BQcrAQ
[2013-08-29 06:54:49,861][INFO ][http ] [elasticsearch3] bound_address {inet[/10.158.2.117:9200]}, publish_address {inet[/10.158.2.117:9200]}
[2013-08-29 06:54:49,861][INFO ][node ] [elasticsearch3] started
[2013-08-29 06:54:51,089][INFO ][gateway ] [elasticsearch3] recovered [8] indices into cluster_state
[2013-08-29 06:54:55,446][INFO ][cluster.service ] [elasticsearch3] added {[elasticsearch2][6e4q3UuASI-m08ZvrUoGFw][inet[/10.215.41.203:9300]],}, reason: zen-disco-receive(join from node[[elasticsearch2][6e4q3UuASI-m08ZvrUoGFw][inet[/10.215.41.203:9300]]])
[2013-08-29 06:55:06,613][INFO ][cluster.service ] [elasticsearch3] added {[elasticsearch1][ZwxSqqehRXutdqtWz-THKw][inet[/10.31.146.12:9300]],}, reason: zen-disco-receive(join from node[[elasticsearch1][ZwxSqqehRXutdqtWz-THKw][inet[/10.31.146.12:9300]]])
[2013-08-29 07:10:57,523][INFO ][node ] [elasticsearch3] stopping ...
[2013-08-29 07:10:57,611][INFO ][node ] [elasticsearch3] stopped
[2013-08-29 07:10:57,615][INFO ][node ] [elasticsearch3] closing ...
[2013-08-29 07:10:57,623][INFO ][node ] [elasticsearch3] closed
[2013-08-29 07:10:59,383][INFO ][node ] [elasticsearch3] version[0.90.3], pid[10127], build[5c38d60/2013-08-06T13:18:31Z]
[2013-08-29 07:10:59,384][INFO ][node ] [elasticsearch3] initializing ...
[2013-08-29 07:10:59,418][INFO ][plugins ] [elasticsearch3] loaded [analysis-kuromoji, cloud-aws], sites [bigdesk, head, paramedic]
[2013-08-29 07:11:03,110][INFO ][node ] [elasticsearch3] initialized
[2013-08-29 07:11:03,111][INFO ][node ] [elasticsearch3] starting ...

As you can see, at 07:10:57 the node simply stops, out of the blue and
without any apparent reason.

Why would this happen? How can I get more details about the reason behind
the shutdown, and how can I prevent it?
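
The INFO-level log cannot explain the shutdown, but it can at least show how often it happens and how long the node stayed up each time. The following is a minimal sketch, not an official tool: it assumes each log entry sits on a single line in the format shown above, and the names `ENTRY`, `parse` and `find_shutdowns` are made up for illustration. It pairs every `started` entry with the next `stopping ...` entry from the `node` logger.

```python
import re
from datetime import datetime

# Matches one log entry on a single line, in the format shown in the post, e.g.
# [2013-08-29 07:10:57,523][INFO ][node ] [elasticsearch3] stopping ...
ENTRY = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[(?P<level>\w+)\s*\]"
    r"\[(?P<logger>[\w.]+)\s*\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*(?P<msg>.*)$"
)


def parse(line):
    """Return (timestamp, logger, message), or None if the line is not an entry."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group("logger"), m.group("msg")


def find_shutdowns(lines):
    """Return (started_at, stopping_at, uptime) for each start/stop cycle."""
    cycles = []
    started_at = None
    for line in lines:
        entry = parse(line)
        if entry is None:
            continue  # continuation lines, stack traces, blank lines
        ts, logger, msg = entry
        if logger != "node":
            continue
        if msg == "started":
            started_at = ts
        elif msg == "stopping ..." and started_at is not None:
            cycles.append((started_at, ts, ts - started_at))
            started_at = None
    return cycles


# Three entries copied from the log above.
sample = [
    "[2013-08-29 06:54:49,861][INFO ][node ] [elasticsearch3] started",
    "[2013-08-29 06:54:51,089][INFO ][gateway ] [elasticsearch3] recovered [8] indices into cluster_state",
    "[2013-08-29 07:10:57,523][INFO ][node ] [elasticsearch3] stopping ...",
]

for started_at, stopping_at, uptime in find_shutdowns(sample):
    print(started_at, "->", stopping_at, "uptime", uptime)
```

For the excerpt above this reports a single cycle with an uptime of about 16 minutes. Running it over several days of logs from all three nodes might show whether the restarts follow a fixed interval, which would hint at an external process restarting the service rather than a fault inside Elasticsearch.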

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(simonw-2) #2

I have seen these in some tests too, but I never figured it out. Can
you set your logging level to TRACE and maybe gimme a log?

simon

On Thursday, August 29, 2013 9:35:12 AM UTC+2, Donald Piret wrote:

[...]


(Donald Piret-2) #3

Hey Simon,

Setting the rootLogger to TRACE seems to generate massive amounts of
output; I'm not sure where I'd even start looking to find these errors.
Is there a specific logger I could set to TRACE that would provide more
targeted output, or will I just have to plow through it?
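
Independently of which logger is raised to TRACE, the output can be cut down after the fact by keeping only the entries written shortly before each `stopping ...` line. This is a rough sketch under stated assumptions: entries start with the bracketed timestamp shown in the original post, the function name `lines_before_stop` and the 30-second default window are arbitrary choices, and lines without a timestamp (such as stack traces) are treated as belonging to the previous entry.

```python
import re
from collections import deque
from datetime import datetime, timedelta

# Leading timestamp of a log entry, e.g. [2013-08-29 07:10:57,523]
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def lines_before_stop(lines, window_seconds=30):
    """Yield, for each 'stopping ...' entry of the node logger, the list of
    log lines written during the preceding window_seconds (inclusive of the
    'stopping ...' line itself)."""
    window = timedelta(seconds=window_seconds)
    recent = deque()  # (timestamp, line) pairs still inside the window
    last_ts = None
    for line in lines:
        line = line.rstrip("\n")
        m = TS.match(line)
        if m:
            last_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if last_ts is None:
            continue  # nothing timestamped seen yet
        # Untimestamped lines inherit the timestamp of the previous entry.
        recent.append((last_ts, line))
        # Drop everything that has fallen out of the window.
        while recent and last_ts - recent[0][0] > window:
            recent.popleft()
        if m and "][node" in line and line.endswith("stopping ..."):
            yield [text for _, text in recent]
```

The function accepts any iterable of lines, so an open file handle on the TRACE log can be passed in directly; each yielded list can then be printed or written to a smaller file for inspection.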

On Thursday, August 29, 2013 5:36:52 PM UTC+8, simonw wrote:

[...]


#4

Did you find the reason, or a solution?

