How to send zen-disco-join after zen-disco-node_failed

Jae · June 28, 2013, 7:53pm

Hi

I am using an internally implemented Discovery module based on Netflix Eureka
in EC2. Netflix Eureka provides a discovery service, like Zookeeper, but
it is more narrowly dedicated to service discovery. The serious problem is
that after a node is marked as failed due to a networking problem, it cannot
rejoin the cluster.

The following are the error logs on the master:

stdout.log.2013-06-28:2013-06-28 06:59:06,359 INFO
org.elasticsearch.common.logging.log4j.Log4jESLogger:104
[elasticsearch[i-c2f4c2ac][clusterService#updateTask][T#1]] [internalInfo]
[i-c2f4c2ac] removed
{[i-aa5fb8c6][odcwXfj0TJq-aAA_9lTfZw][inet[/xxxxxxxx]]{rack_id=us-east-1c},},
reason:
zen-disco-node_failed([i-aa5fb8c6][odcwXfj0TJq-aAA_9lTfZw][inet[/xxxxxxxx]]{rack_id=us-east-1c}),
reason failed to ping, tried [3] times, each with maximum [1m] timeout

The following are the error logs on the failed node:

[2013-06-28 06:59:48,325][INFO ][org.elasticsearch.discovery.eureka]
[i-aa5fb8c6] master_left
[[i-c2f4c2ac][BvGnnVNGRbaSlOAwMHFH8w][inet[/xxxxxxxx]]{rack_id=us-east-1d}],
reason [do not exists on master, act as master failure]
[2013-06-28 06:59:48,342][INFO ][org.elasticsearch.cluster.service]
[i-aa5fb8c6] master
{new
[i-9c5fb8f0][CPM84y8wQU6cRPfQr8_uzw][inet[/xxxxxxxx]]{rack_id=us-east-1c},
previous
[i-c2f4c2ac][BvGnnVNGRbaSlOAwMHFH8w][inet[/xxxxxxxx]]{rack_id=us-east-1d}},
removed
{[i-c2f4c2ac][BvGnnVNGRbaSlOAwMHFH8w][inet[/xxxxxxxx]]{rack_id=us-east-1d},},
reason: zen-disco-master_failed
([i-c2f4c2ac][BvGnnVNGRbaSlOAwMHFH8w][inet[/xxxxxxxx]]{rack_id=us-east-1d})
...
[2013-06-28 07:02:48,424][INFO ][org.elasticsearch.discovery.eureka]
[i-aa5fb8c6] master_left
[[i-bc8665d9][DjdIsf0KTK6WWbjX8TSgLg][inet[/xxxxxxxx]]{rack_id=us-east-1e}],
reason [no longer master]
[2013-06-28 07:02:48,426][INFO ][org.elasticsearch.cluster.service]
[i-aa5fb8c6] master {new
[i-b48665d1][IO5Kc0S6SQWxrn4HP9qNqQ][inet[/xxxxxxxx]]{rack_id=us-east-1e},
previous
[i-bc8665d9][DjdIsf0KTK6WWbjX8TSgLg][inet[/xxxxxxxx]]{rack_id=us-east-1e}},
removed
{[i-bc8665d9][DjdIsf0KTK6WWbjX8TSgLg][inet[/xxxxxxxx]]{rack_id=us-east-1e},},
reason: zen-disco-master_failed
([i-bc8665d9][DjdIsf0KTK6WWbjX8TSgLg][inet[/xxxxxxxx]]{rack_id=us-east-1e})

The problem is that the master_left messages above kept repeating even after
the networking problem was resolved.

I implemented EurekaDiscovery, which extends ZenDiscovery, and a
UnicastHostsProvider that adds a DiscoveryNode for each instance registered
with the Eureka server with UP status.

Could you let me know how to make the failed node rejoin the cluster
automatically, without restarting the process? I think this behavior
already exists in the other Discovery implementations, so what am I missing
in EurekaDiscovery?

If you want to look at the source code, I can share it, since Netflix Eureka
is also open source.

Thank you
Best, Jae

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jae · June 28, 2013, 8:40pm

To add more information: the node stopped trying to find the master after the
following error:

2013-06-28 07:18:48,849 WARN
org.elasticsearch.common.logging.log4j.Log4jESLogger:119
[elasticsearch[i-aa5fb8c6][generic][T#101]] [internalWarn] [i-aa5fb8c6]
failed to get _meta from [kafka]/[kafka1]
java.lang.NullPointerException
    at org.elasticsearch.cluster.routing.IndexShardRoutingTable.preferAttributesActiveShardsIt(IndexShardRoutingTable.java:416)
    at org.elasticsearch.cluster.routing.IndexShardRoutingTable.preferAttributesActiveShardsIt(IndexShardRoutingTable.java:399)
    at org.elasticsearch.cluster.routing.operation.plain.PlainOperationRouting.preferenceActiveShardIterator(PlainOperationRouting.java:192)
    at org.elasticsearch.cluster.routing.operation.plain.PlainOperationRouting.getShards(PlainOperationRouting.java:81)
    at org.elasticsearch.action.get.TransportGetAction.shards(TransportGetAction.java:79)
    at org.elasticsearch.action.get.TransportGetAction.shards(TransportGetAction.java:42)
    at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.<init>(TransportShardSingleOperationAction.java:120)
    at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.<init>(TransportShardSingleOperationAction.java:94)
    at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:71)
    at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:46)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:61)
    at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:92)
    at org.elasticsearch.client.support.AbstractClient.get(AbstractClient.java:179)
    at org.elasticsearch.action.get.GetRequestBuilder.doExecute(GetRequestBuilder.java:112)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:62)
    at org.elasticsearch.river.RiversService$ApplyRivers.riverClusterChanged(RiversService.java:264)
    at org.elasticsearch.river.cluster.RiverClusterService$1.run(RiverClusterService.java:131)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)

Jae · July 2, 2013, 5:02pm

No update yet? Is this question not worth answering?

Andrew_Spyker · October 15, 2013, 3:16pm

Jae,

Did you resolve this issue?

Also, are you willing to share the code so we can collaborate? I am
interested in doing something similar.

Jae · October 15, 2013, 8:23pm

Hi Andrew

I haven't resolved this issue yet. I didn't implement the Eureka-based
Elasticsearch discovery as a plugin, so I am not ready to open the source
code, but it is really simple.

Write EurekaDiscovery.java, EurekaDiscoveryModule.java and
EurekaUnicastHostsProvider.java.

EurekaDiscovery extends ZenDiscovery and adds
EurekaUnicastHostsProvider as its HostsProvider. EurekaDiscoveryModule
extends ZenDiscoveryModule and implements the bindDiscovery method,
binding Discovery.class to EurekaDiscovery.

EurekaUnicastHostsProvider implements the UnicastHostsProvider interface
and its buildDynamicNodes method. In buildDynamicNodes(), you can list
whichever DiscoveryNodes you want from whatever information you have.
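
As a rough illustration of that last step, the filtering that buildDynamicNodes()
would perform (keep only instances registered with UP status) could be sketched
as below. This is not the actual EurekaUnicastHostsProvider code, and it does not
use the Elasticsearch or Eureka APIs: Instance and upHosts are hypothetical
stand-ins, and a real provider would build DiscoveryNode objects rather than
"host:port" strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EurekaHostsSketch {

    /** Hypothetical stand-in for one instance record returned by the registry. */
    public static final class Instance {
        final String host;
        final int port;
        final String status; // e.g. "UP" or "DOWN" (assumed status values)

        public Instance(String host, int port, String status) {
            this.host = host;
            this.port = port;
            this.status = status;
        }
    }

    /** Keeps only the instances whose status is UP, formatted as "host:port". */
    public static List<String> upHosts(List<Instance> instances) {
        List<String> hosts = new ArrayList<String>();
        for (Instance instance : instances) {
            if ("UP".equals(instance.status)) {
                hosts.add(instance.host + ":" + instance.port);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Made-up registry contents, for illustration only.
        List<Instance> registry = Arrays.asList(
                new Instance("10.0.0.1", 9300, "UP"),
                new Instance("10.0.0.2", 9300, "DOWN"),
                new Instance("10.0.0.3", 9300, "UP"));
        System.out.println(upHosts(registry)); // prints [10.0.0.1:9300, 10.0.0.3:9300]
    }
}
```

Because the list is rebuilt from the registry on every call, a node that returns
to UP status would show up again the next time the hosts provider is asked for
nodes. Whether that alone is enough for the failed node to rejoin is the open
question in this thread.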

Thank you
Best, Jae

PS: I am not working for NHN (Korean ISP) now.
