Hi
I am using an internally implemented discovery module based on Netflix Eureka
in EC2. Netflix Eureka provides a discovery service similar to ZooKeeper, but
it is more dedicated to service discovery. The serious problem is that after a
node is marked as failed due to a networking problem, it cannot rejoin the
cluster.
The following are the error logs on the master:
stdout.log.2013-06-28:2013-06-28 06:59:06,359 INFO
org.elasticsearch.common.logging.log4j.Log4jESLogger:104
[elasticsearch[i-c2f4c2ac][clusterService#updateTask][T#1]] [internalInfo]
[i-c2f4c2ac] removed
{[i-aa5fb8c6][odcwXfj0TJq-aAA_9lTfZw][inet[/xxxxxxxx]]{rack_id=us-east-1c},},
reason:
zen-disco-node_failed([i-aa5fb8c6][odcwXfj0TJq-aAA_9lTfZw][inet[/xxxxxxxx]]{rack_id=us-east-1c}),
reason failed to ping, tried [3] times, each with maximum [1m] timeout
The following are the error logs on the failed node:
[2013-06-28 06:59:48,325][INFO ][org.elasticsearch.discovery.eureka]
[i-aa5fb8c6] master_left
[[i-c2f4c2ac][BvGnnVNGRbaSlOAwMHFH8w][inet[/xxxxxxxx]]{rack_id=us-east-1d}],
reason [do not exists on master, act as master failure]
[2013-06-28 06:59:48,342][INFO ][org.elasticsearch.cluster.service]
[i-aa5fb8c6] master {new
[i-9c5fb8f0][CPM84y8wQU6cRPfQr8_uzw][inet[/xxxxxxxx]]{rack_id=us-east-1c},
previous
[i-c2f4c2ac][BvGnnVNGRbaSlOAwMHFH8w][inet[/xxxxxxxx]]{rack_id=us-east-1d}},
removed
{[i-c2f4c2ac][BvGnnVNGRbaSlOAwMHFH8w][inet[/xxxxxxxx]]{rack_id=us-east-1d},},
reason: zen-disco-master_failed
([i-c2f4c2ac][BvGnnVNGRbaSlOAwMHFH8w][inet[/xxxxxxxx]]{rack_id=us-east-1d})
...
[2013-06-28 07:02:48,424][INFO ][org.elasticsearch.discovery.eureka]
[i-aa5fb8c6] master_left
[[i-bc8665d9][DjdIsf0KTK6WWbjX8TSgLg][inet[/xxxxxxxx]]{rack_id=us-east-1e}],
reason [no longer master]
[2013-06-28 07:02:48,426][INFO ][org.elasticsearch.cluster.service]
[i-aa5fb8c6] master {new
[i-b48665d1][IO5Kc0S6SQWxrn4HP9qNqQ][inet[/xxxxxxxx]]{rack_id=us-east-1e},
previous
[i-bc8665d9][DjdIsf0KTK6WWbjX8TSgLg][inet[/xxxxxxxx]]{rack_id=us-east-1e}},
removed
{[i-bc8665d9][DjdIsf0KTK6WWbjX8TSgLg][inet[/xxxxxxxx]]{rack_id=us-east-1e},},
reason: zen-disco-master_failed
([i-bc8665d9][DjdIsf0KTK6WWbjX8TSgLg][inet[/xxxxxxxx]]{rack_id=us-east-1e})
The problem is that the master_left messages above kept repeating even after
the networking problem was resolved.
I implemented EurekaDiscovery, which extends ZenDiscovery, and a
UnicastHostsProvider that adds the DiscoveryNodes registered with the Eureka
server in UP status.
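
To make the host-filtering step concrete, here is a simplified, self-contained sketch. It is not the actual EurekaDiscovery code: the Instance class, the upAddresses method, and the addresses are hypothetical stand-ins, and the real provider would return DiscoveryNode objects rather than strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EurekaHostsSketch {
    // Hypothetical stand-in for one instance entry returned by a Eureka lookup.
    static final class Instance {
        final String id;
        final String address;
        final String status;

        Instance(String id, String address, String status) {
            this.id = id;
            this.address = address;
            this.status = status;
        }
    }

    // Keeps only the addresses of instances whose status is UP.
    // The list is rebuilt from the registry contents on every call.
    static List<String> upAddresses(List<Instance> registered) {
        List<String> hosts = new ArrayList<>();
        for (Instance instance : registered) {
            if ("UP".equals(instance.status)) {
                hosts.add(instance.address);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Made-up registry contents, for illustration only.
        List<Instance> registered = Arrays.asList(
                new Instance("node-a", "10.0.0.1:9300", "UP"),
                new Instance("node-b", "10.0.0.2:9300", "DOWN"),
                new Instance("node-c", "10.0.0.3:9300", "UP"));
        System.out.println(upAddresses(registered));
    }
}
```

In this sketch the host list is recomputed on each call, so an instance that goes back to UP in the registry shows up again the next time the list is built.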
Could you let me know how to make the failed node rejoin the cluster
automatically without restarting the process? I think this behavior already
exists in the other discovery implementations, so what am I missing in
EurekaDiscovery?
If you want to look at the source code, I can share it, since Netflix Eureka
is also open source.
Thank you
Best, Jae