Hi all,
I am using ES 1.7.3. I was trying to reallocate shards using shard allocation filtering, and after a couple of nodes failed to start I began seeing a huge number of tasks piling up on the master nodes:
2072096 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-96d0f116][GeU0XNjcQXudnSg5jr3m9w][anon-polloi-famin-seventhreetwosrcc-polloi-44-9][inet[/10.100.44.9:29300]]{data=false, client=true}), reason transport disconnected
2072194 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-dc090774][RTolJcurQ1quNoC7mHjLSw][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2072842 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-c9cf794d][T4lAoXYOSgSF8awZhcX33Q][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2073587 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-c9cf794d][T4lAoXYOSgSF8awZhcX33Q][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2073738 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-ac6c5562][Srl7SEs_QkOYyG0USzyfUg][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2074334 7.6h IMMEDIATE zen-disco-node_failed([polloi-node-c9cf794d][T4lAoXYOSgSF8awZhcX33Q][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2075085 7.6h IMMEDIATE zen-disco-node_failed([polloi-node-8f766892][L-bdY30xRWy6T64MnIV3Rw][anon-polloi-famin-seventhreetwosrcc-polloi-44-9][inet[/10.100.44.9:29300]]{data=false, client=true}), reason transport disconnected
2075270 7.6h IMMEDIATE zen-disco-node_failed([polloi-node-91112d2e][H8_CQKTBTjaBAsX_LyUhyw][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2075831 7.5h IMMEDIATE zen-disco-node_failed([polloi-node-96d0f116][GeU0XNjcQXudnSg5jr3m9w][anon-polloi-famin-seventhreetwosrcc-polloi-44-9][inet[/10.100.44.9:29300]]{data=false, client=true}), reason transport disconnected
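Many of these entries are repeats for the same node; in the nine lines above, polloi-node-c9cf794d alone shows up three times. A rough sketch for counting the repeats across the whole queue is below. It assumes the listing has the layout shown above (insert order, time in queue, priority, source); the function name `count_node_failed` and the regular expression are my own illustration, not anything provided by ES.

```python
import re
import sys
from collections import Counter

# Captures the node name and node id at the start of a node_failed source, e.g.
# "zen-disco-node_failed([polloi-node-96d0f116][GeU0XNjcQXudnSg5jr3m9w]..."
NODE_FAILED = re.compile(r"zen-disco-node_failed\(\[([^\]]+)\]\[([^\]]+)\]")


def count_node_failed(lines):
    """Count pending node_failed tasks per (node name, node id) pair."""
    counts = Counter()
    for line in lines:
        match = NODE_FAILED.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Reads the pending-task listing from stdin and prints the most
    # frequently repeated nodes first.
    for (name, node_id), n in count_node_failed(sys.stdin).most_common():
        print(f"{n}\t{name}\t{node_id}")
```

Saved under a hypothetical name such as count_failed.py, it could be fed the pending-task listing on stdin to show which departed nodes account for most of the queue.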
These nodes have been terminated and are no longer part of the cluster, yet ES somehow still thinks they are.
Also, after the relocations were kicked off, I started seeing this message:
marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started
After that the relocations make no progress; they just stay stuck. Trying to change cluster settings also fails, because the update gets queued at the end of the pending tasks. Here is what the cluster health output looks like:
curl 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "xxxx_elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 120,
  "number_of_data_nodes" : 30,
  "active_primary_shards" : 8603,
  "active_shards" : 25809,
  "relocating_shards" : 360,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 288770,
  "number_of_in_flight_fetch" : 0
}
Any clues on what's going on?