Tons of IMMEDIATE Tasks piling up in cluster state after node failures


(Madhav Kelkar) #1

Hi All,
I am using ES 1.7.3. I was trying to reallocate shards using shard allocation filtering, and after a couple of nodes failed to start I began seeing tons of tasks piling up on the master nodes:

2072096  7.7h IMMEDIATE zen-disco-node_failed([polloi-node-96d0f116][GeU0XNjcQXudnSg5jr3m9w][anon-polloi-famin-seventhreetwosrcc-polloi-44-9][inet[/10.100.44.9:29300]]{data=false, client=true}), reason transport disconnected

2072194 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-dc090774][RTolJcurQ1quNoC7mHjLSw][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2072842 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-c9cf794d][T4lAoXYOSgSF8awZhcX33Q][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2073587 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-c9cf794d][T4lAoXYOSgSF8awZhcX33Q][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2073738 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-ac6c5562][Srl7SEs_QkOYyG0USzyfUg][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2074334 7.6h IMMEDIATE zen-disco-node_failed([polloi-node-c9cf794d][T4lAoXYOSgSF8awZhcX33Q][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2075085 7.6h IMMEDIATE zen-disco-node_failed([polloi-node-8f766892][L-bdY30xRWy6T64MnIV3Rw][anon-polloi-famin-seventhreetwosrcc-polloi-44-9][inet[/10.100.44.9:29300]]{data=false, client=true}), reason transport disconnected
2075270 7.6h IMMEDIATE zen-disco-node_failed([polloi-node-91112d2e][H8_CQKTBTjaBAsX_LyUhyw][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected
2075831 7.5h IMMEDIATE zen-disco-node_failed([polloi-node-96d0f116][GeU0XNjcQXudnSg5jr3m9w][anon-polloi-famin-seventhreetwosrcc-polloi-44-9][inet[/10.100.44.9:29300]]{data=false, client=true}), reason transport disconnected

These nodes have been terminated and are no longer part of the cluster, yet ES somehow still thinks they are.
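
One way to check how many distinct nodes these queued entries refer to is to group them by node name. Below is a rough sketch in Python, not a definitive tool. It assumes the lines are the plain-text pending tasks listing (the format matches what `_cat/pending_tasks` prints) and that the node name is the first bracketed field after `zen-disco-node_failed(`. The helper name `count_node_failed` is made up for illustration; the sample lines are copied from the listing above.

```python
import re
from collections import Counter

# Assumption: the node name is the first [...] field right after
# "zen-disco-node_failed(" in the task source.
NODE_FAILED = re.compile(r"zen-disco-node_failed\(\[([^\]]+)\]")


def count_node_failed(lines):
    """Count queued zen-disco-node_failed tasks per node name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = NODE_FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three entries copied verbatim from the pending tasks listing.
sample = [
    "2072096  7.7h IMMEDIATE zen-disco-node_failed([polloi-node-96d0f116][GeU0XNjcQXudnSg5jr3m9w][anon-polloi-famin-seventhreetwosrcc-polloi-44-9][inet[/10.100.44.9:29300]]{data=false, client=true}), reason transport disconnected",
    "2072842 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-c9cf794d][T4lAoXYOSgSF8awZhcX33Q][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected",
    "2073587 7.7h IMMEDIATE zen-disco-node_failed([polloi-node-c9cf794d][T4lAoXYOSgSF8awZhcX33Q][anon-polloi-famin-seventhreetwosrcc-polloi-39-139][inet[/10.100.39.139:29300]]{data=false, client=true}), reason transport disconnected",
]

counts = count_node_failed(sample)
# Print the nodes with the most queued node_failed tasks first.
for name, n in counts.most_common():
    print(n, name)
```

Run against the full listing, this would show whether the queue is dominated by repeated failures of the same few client nodes.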

Also, after the relocations were kicked off, I started seeing this:

marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started

After that the relocations make no progress; they just get stuck. Trying to change cluster settings also fails, because the settings update is queued at the end of the pending tasks. Here is what the cluster health output looks like:

curl 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "xxxx_elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 120,
  "number_of_data_nodes" : 30,
  "active_primary_shards" : 8603,
  "active_shards" : 25809,
  "relocating_shards" : 360,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 288770,
  "number_of_in_flight_fetch" : 0
}

Any clues on what's going on?


(Mark Walkom) #2

Do you have a lot of client nodes, or...?

You should upgrade to 2.X. With a node count this high on 1.X, every cluster state update has to be sent in full to all nodes, whereas 2.X can publish just the changes (diffs).

