Nothing showed up on Google, so here we go:
Currently 3 nodes won't join a cluster of 15+ nodes.
One of the rogue nodes logs:
[2012-11-20 07:17:35,623][DEBUG][discovery.zen.fd ] [es-028]
[master] starting fault detection against master
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], reason [initial_join]
[2012-11-20 07:17:36,650][DEBUG][discovery.zen.fd ] [es-028]
[master] pinging a master
[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none} but we do not exists on it, act as if its master
failure
[2012-11-20 07:17:36,651][DEBUG][discovery.zen.fd ] [es-028]
[master] stopping fault detection against master
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], reason [master failure, do not exists on
master, act as master failure]
[2012-11-20 07:17:36,652][INFO ][discovery.zen ] [es-028]
master_left
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], reason [do not exists on master, act as master
failure]
The master reports:
[2012-11-20 07:16:35,257][TRACE][discovery.zen.ping.multicast] [es-030] [1]
received ping_request from
[[es-028][33TXhg8vTR6FrLdlwjwMvw][inet[/10.32.0.135:29300]]{name_attr=es-028,
master=true, river=none}], sending ping_response{target
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], master
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], cluster_name[bc]}
[2012-11-20 07:17:05,259][TRACE][discovery.zen.ping.multicast] [es-030] [1]
received ping_request from
[[es-028][33TXhg8vTR6FrLdlwjwMvw][inet[/10.32.0.135:29300]]{name_attr=es-028,
master=true, river=none}], sending ping_response{target
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], master
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], cluster_name[bc]}
No network issues as far as I can tell: telnetting from node to node on the transport port works fine.
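To dig further, the next sanity check I'd run (nothing below is from the logs themselves — the IPs come from the log lines above and 29200 is http.port from the config below) is to compare the cluster state as each side sees it over HTTP:

```shell
# Compare the cluster state as seen by a rogue node vs. the elected master.
# IPs are taken from the log excerpts above; 29200 is http.port from the config.
ROGUE=10.32.0.135    # es-028, one of the nodes that won't join
MASTER=10.32.0.137   # es-030, the current master
HTTP_PORT=29200

# Print the two commands to run (and diff their output afterwards):
for host in "$ROGUE" "$MASTER"; do
  echo "curl -s http://$host:$HTTP_PORT/_cluster/state?pretty=true"
done
```

If the master's node list is missing es-028 while es-028 still answers on its own port, that would be consistent with the "do not exists on master" message above.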
Config from one of the nodes (filters and similar settings removed):
cluster:
name: bc
node:
name: ${SHORT_HOSTNAME}
name_attr: ${SHORT_HOSTNAME}
river: "none"
master: true
index:
number_of_shards: 29
number_of_replicas: 3
gateway.type: local
gateway.recover_after_nodes: 9
gateway.recover_after_time: 5m
gateway.expected_nodes: 10
cluster.routing.allocation.node_initial_primaries_recoveries: 10
cluster.routing.allocation.node_concurrent_recoveries: 5
cluster.routing.allocation.cluster_concurrent_rebalance: 20
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.timeout: 60s
discovery.zen.fd.ping_timeout: 60s
transport.tcp.port: 29300
http.port: 29200
index.search.slowlog.level: TRACE
index.search.slowlog.threshold.query.warn: 60s
index.search.slowlog.threshold.query.info: 10s
index.search.slowlog.threshold.query.debug: 5s
index.search.slowlog.threshold.query.trace: 2s
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms
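One thing I'm considering trying next, in case multicast discovery itself is flaky on this network: disabling multicast and listing nodes explicitly as unicast hosts. Just a sketch — the two hosts below are examples taken from the logs, not our full list:

```yaml
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-030:29300", "es-028:29300"]
```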
I've already tried restarting the rogue nodes, to no avail. Any help would be
appreciated!