Shard assignment takes a long time after a node is disconnected and rejoins the cluster

Hi There,

We have a 10-node cluster: 3 master nodes, 4 data nodes, and 3 client nodes. Each node has 56 GB of memory and 8 cores, with 28 GB allocated to the heap. All are Azure virtual machines.
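For context, the node roles and heap are set in the standard way for this version; the snippets below are a sketch of our configuration, not a verbatim copy of our files:

# elasticsearch.yml on a data node
node.master: false
node.data: true

# elasticsearch.yml on a client node
node.master: false
node.data: false

# heap size, set via the environment (e.g. /etc/default/elasticsearch)
ES_HEAP_SIZE=28g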

We have around 12 indexes, each with 20 shards and 3 replicas, so every data node should hold a copy of every shard.
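For reference, each index is created with settings along these lines (a sketch only; the create call is illustrative and localhost:9200 stands in for one of our client nodes):

curl -XPUT 'http://localhost:9200/custom' -d '{
  "settings": {
    "number_of_shards": 20,
    "number_of_replicas": 3
  }
}'

The current values can be confirmed with:

curl 'http://localhost:9200/custom/_settings?pretty'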

We are seeing the log entries below, and shards are becoming unassigned:
[2017-01-31 09:38:20,815][INFO ][cluster.service ] [ITTESPROD-DATA3] removed {{ITTESPROD-DATA4}{a3vFGxwST8i4eKZig5kuzg}{10.158.36.208}{10.158.36.208:9300}{master=false},}, reason: zen-disco-receive(from master [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}])
[2017-01-31 09:39:33,161][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][young][99342][1301] duration [15.5s], collections [1]/[16.1s], total [15.5s]/[8.9m], memory [25.6gb]->[18.4gb]/[27gb], all_pools {[young] [7.4gb]->[19.4mb]/[7.4gb]}{[survivor] [884.7mb]->[702.5mb]/[955.6mb]}{[old] [17.3gb]->[17.7gb]/[18.6gb]}
[2017-01-31 09:46:48,976][WARN ][monitor.jvm ] [ITTESPROD-DATA3] [gc][old][99533][34] duration [4m], collections [1]/[4m], total [4m]/[54.6m], memory [25.8gb]->[17.1gb]/[27gb], all_pools {[young] [7.4gb]->[21.2mb]/[7.4gb]}{[survivor] [702.5mb]->[0b]/[955.6mb]}{[old] [17.7gb]->[17.1gb]/[18.6gb]}
[2017-01-31 09:46:48,991][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2777725]
[2017-01-31 09:46:48,994][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2777726]
[2017-01-31 09:46:49,354][WARN ][discovery.zen ] [ITTESPROD-DATA3] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{ITTESPROD-CLIENT0}{cGF5_yyrRZOh9D0IyqgD3Q}{10.158.36.220}{10.158.36.220:9300}{data=false, master=false},{ITTESPROD-CLIENT2}{0fDvB6DMSgCRDfCXxE0TBg}{10.158.36.209}{10.158.36.209:9300}{data=false, master=false},{ITTESPROD-DATA3}{7m7OdCyORaKSsGEXduh55g}{10.158.36.204}{10.158.36.204:9300}{master=false},{ITTESPROD-MSTR1}{fnPHHE1xREaNnz4kA6rrSA}{10.158.36.201}{10.158.36.201:9300}{data=false, master=true},{ITTESPROD-CLIENT1}{lQ5OG-qySBmnLpshqTxxfQ}{10.158.36.199}{10.158.36.199:9300}{data=false, master=false},{ITTESPROD-MSTR2}{WtSucYmHRUCUX_Ld7R7fBA}{10.158.36.202}{10.158.36.202:9300}{data=false, master=true},{ITTESPROD-DATA1}{XNRK5gWBR2SnyIvD8Wnz6w}{10.158.36.211}{10.158.36.211:9300}{master=false},{ITTESPROD-DATA2}{vJzgp0a0Q-WXnAyFKdHcKw}{10.158.36.212}{10.158.36.212:9300}{master=false},}
[2017-01-31 09:46:49,354][INFO ][cluster.service ] [ITTESPROD-DATA3] removed {{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true},}, reason: zen-disco-master_failed ({ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true})
[2017-01-31 09:46:54,051][INFO ][cluster.service ] [ITTESPROD-DATA3] detected_master {ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}, added {{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true},}, reason: zen-disco-receive(from master [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}])
[2017-01-31 09:46:54,556][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2784963]
[2017-01-31 09:46:54,556][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2784962]
[2017-01-31 09:46:55,092][WARN ][transport ] [ITTESPROD-DATA3] Transport response handler not found of id [2784966]
[2017-01-31 09:51:49,367][INFO ][cluster.service ] [ITTESPROD-DATA3] added {{ITTESPROD-DATA4}{a3vFGxwST8i4eKZig5kuzg}{10.158.36.208}{10.158.36.208:9300}{master=false},}, reason: zen-disco-receive(from master [{ITTESPROD-MSTR0}{E5JCBHhrQnKBs99HSZOH8Q}{10.158.36.200}{10.158.36.200:9300}{data=false, master=true}])

So around 09:46:49 this node (DATA3) lost contact with the master, and at 09:46:54 it rejoined the cluster.
In the same window, DATA4 was removed from the cluster at 09:38:20 and added back at around 09:51:49.
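While the shards are unassigned, this is roughly what we check (host and port are placeholders for one of our client nodes):

# overall cluster state and number of unassigned shards
curl 'http://localhost:9200/_cluster/health?pretty'

# which shards of which index are still unassigned
curl 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED

# progress of ongoing recoveries for the big index (files and bytes copied, percent done)
curl 'http://localhost:9200/_cat/recovery/custom?v'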

Within a few minutes, the shards of all indexes get assigned again, except for an index called custom, which is about 2 TB in size and has 3 replicas.

For that 2 TB index, it takes around 2 hours for the shards to get back to the assigned state.
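We are aware of a couple of settings that influence this behaviour; the values below are examples we are considering, not what is currently configured:

# keep shards of a departed node unassigned for a while instead of rebuilding
# them elsewhere, so a quickly returning node can reuse its local copy
# (index.unassigned.node_left.delayed_timeout, available since 1.7, default 1m)
curl -XPUT 'http://localhost:9200/custom/_settings' -d '{
  "index.unassigned.node_left.delayed_timeout": "10m"
}'

# raise the recovery throttle (default 40mb per node) so replica copies rebuild faster
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "200mb"
  }
}'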

Since 2 copies of a shard were impacted, our indexing operations are failing due to QUORUM unavailability.
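Our understanding of the quorum math: with 3 replicas each shard has 4 copies, so the default write consistency of quorum needs int(4 / 2) + 1 = 3 active copies; with 2 copies unassigned only 2 are active and index requests are rejected. A typical request looks like this (index, type and document are illustrative):

# default is consistency=quorum, i.e. 3 of the 4 copies must be active
curl -XPUT 'http://localhost:9200/custom/doc/1?consistency=quorum' -d '{"field": "value"}'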

Please help us here: why does it take so much time for the shards to go from unassigned back to assigned, even though a copy of the data is already present on those data nodes?
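A related question: would a synced flush before planned restarts help, so that a returning node's local copy is recognised via its sync_id and reused instead of being copied over the network? As far as we understand, this only helps shards that are idle at the time of the flush:

curl -XPOST 'http://localhost:9200/_flush/synced?pretty'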
