Neither ES clients nor nodes can join the cluster while it performs maintenance or shard relocation

Hi all

I have the following situation. We run a 20-node cluster with replication
set to 1. We have 5 masters, and the minimum master count
(minimum_master_nodes) is set to 3. While the cluster is initializing,
relocating shards, or creating a new index, the master node uses 100% of
one CPU core, and during that time other nodes cannot join the cluster.
The joining node logs the following:

[2013-03-20 10:50:20,815][DEBUG][discovery.zen ] [srvd1573] using ping.timeout [3s], master_election.filter_client [true], master_election.filter_data [false]
[2013-03-20 10:50:20,816][DEBUG][discovery.zen.elect ] [srvd1573] using minimum_master_nodes [3]
[2013-03-20 10:50:20,817][DEBUG][discovery.zen.fd ] [srvd1573] [master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2013-03-20 10:50:20,822][DEBUG][discovery.zen.fd ] [srvd1573] [node ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2013-03-20 10:50:21,115][DEBUG][gateway.local ] [srvd1573] using initial_shards [quorum], list_timeout [30s]
[2013-03-20 10:50:21,246][DEBUG][gateway.local.state.meta ] [srvd1573] using gateway.local.auto_import_dangled [YES], with gateway.local.dangling_timeout [2h]
[2013-03-20 10:50:21,821][DEBUG][gateway.local.state.meta ] [srvd1573] took 575ms to load state
[2013-03-20 10:50:23,207][DEBUG][gateway.local.state.shards] [srvd1573] took 1.3s to load started shards state
[2013-03-20 10:50:23,211][INFO ][node ] [srvd1573] {0.20.4}[12314]: initialized
[2013-03-20 10:50:23,211][INFO ][node ] [srvd1573] {0.20.4}[12314]: starting ...
[2013-03-20 10:50:23,352][INFO ][transport ] [srvd1573] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/x.x.x.x:9300]}
[2013-03-20 10:50:26,381][DEBUG][discovery.zen ] [srvd1573] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}]
[2013-03-20 10:50:29,394][DEBUG][discovery.zen ] [srvd1573] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}]
[2013-03-20 10:50:32,405][DEBUG][discovery.zen ] [srvd1573] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}]
[2013-03-20 10:50:35,416][DEBUG][discovery.zen ] [srvd1573] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}]
[2013-03-20 10:50:38,424][DEBUG][discovery.zen ] [srvd1573] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}]
[2013-03-20 10:50:41,431][DEBUG][discovery.zen ] [srvd1573] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}]
[2013-03-20 10:50:44,437][DEBUG][discovery.zen ] [srvd1573] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}]
[2013-03-20 10:50:47,444][DEBUG][discovery.zen ] [srvd1573] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}]
[2013-03-20 10:50:50,483][DEBUG][discovery.zen ] [srvd1573] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[srvk758][8jPTUZBKQr-OW3qRjNL3uA][inet[/x.x.x.x:9300]]{hosting=KV, max_local_storage_nodes=1, master=true}], master [null]
--> target [[srvd1582][ZSLewNkzS8iEqn3q2FSyHw][inet[/x.x.x.x:9300]]{hosting=DL, max_local_storage_nodes=1, master=true}], master [null]
--> target [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}]
[2013-03-20 10:50:51,205][DEBUG][discovery.zen.fd ] [srvd1573] [master] starting fault detection against master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], reason [initial_join]
[2013-03-20 10:50:52,235][DEBUG][discovery.zen.fd ] [srvd1573] [master] pinging a master [srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true} but we do not exists on it, act as if its master failure
[2013-03-20 10:50:52,236][DEBUG][discovery.zen.fd ] [srvd1573] [master] stopping fault detection against master [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], reason [master failure, do not exists on master, act as master failure]
[2013-03-20 10:50:52,236][INFO ][discovery.zen ] [srvd1573] master_left [[srvk768][0sAOH4RuQYWa3XraCuSOkg][inet[/x.x.x.x:9300]]{data=false, hosting=M100, max_local_storage_nodes=1, master=true}], reason [do not exists on master, act as master failure]
[2013-03-20 10:50:52,238][INFO ][discovery ] [srvd1573] graylog2/icsr0HBQRZKhCsLp4QM8Nw
[2013-03-20 10:50:52,256][INFO ][http ] [srvd1573] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/x.x.x.x:9200]}
[2013-03-20 10:50:52,256][INFO ][node ] [srvd1573] {0.20.4}[12314]: started

Once the new index has been created or the shard relocation has finished,
the node joins the cluster without any problem. How can this be solved?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Could you run hot_threads
(http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-hot-threads.html)
on the master when it goes into this 100% CPU state and post the results
here?
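
If it helps, here is a rough sketch of how the hot threads output could be
captured from a script while the master is busy. It assumes the
/_nodes/hot_threads endpoint with its `threads` and `interval` parameters
is available on your version, and that the node can be addressed by name
(srvk768 is the master shown in the log above). The base URL and the two
function names are placeholders of mine, not anything from your setup.

```python
import urllib.parse
import urllib.request


def hot_threads_url(base_url, node_id=None, threads=3, interval="500ms"):
    """Build the hot_threads URL; node_id=None asks every node in the cluster."""
    if node_id is None:
        path = "/_nodes/hot_threads"
    else:
        # Escape the node name/id so it is safe to place in the URL path.
        path = "/_nodes/%s/hot_threads" % urllib.parse.quote(node_id, safe="")
    query = urllib.parse.urlencode({"threads": threads, "interval": interval})
    return base_url.rstrip("/") + path + "?" + query


def fetch_hot_threads(base_url, node_id=None, timeout=30):
    """Fetch the plain-text hot threads report over HTTP (assumed port 9200)."""
    url = hot_threads_url(base_url, node_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling print(fetch_hot_threads("http://localhost:9200", node_id="srvk768"))
a few times while the CPU core is pegged should show which threads are
busy on the master.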

On Thursday, March 21, 2013 8:54:05 AM UTC-4, Konstantins Trusins wrote:
