Small intro about our cluster.
2 dedicated master nodes, 30 data nodes, 15 client nodes, 5000 indices, 28000 shards.
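To put those numbers in perspective, here is a quick back-of-the-envelope calculation of shard density. It uses only the figures quoted above; whether the 28000 shards include replicas is not stated, so treat the results as rough averages.

```python
# Figures taken from the cluster description above.
indices = 5000
shards = 28000      # unclear whether this counts replicas; assumed to be the total
data_nodes = 30

# Average number of shards belonging to each index.
shards_per_index = shards / indices
# Average number of shards hosted by each data node.
shards_per_data_node = shards / data_nodes

print(f"shards per index: {shards_per_index:.1f}")
print(f"shards per data node: {shards_per_data_node:.0f}")
```

Every one of these shards is tracked in the cluster state that the master has to update and publish, which is why the totals matter for the problem described below.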
We initially had 30 data nodes, each with 32 GB RAM, 2 TB disk and a 16-core CPU. We hit OOM frequently, so we bought 26 high-end machines (128 GB RAM, 4 x 1 TB disks, 32-core CPU) and added them to our ES cluster. The cluster ran perfectly at ~40 TB in size with 56 data nodes. Our aim is to replace the low-end machines with high-end machines, so we decommissioned the low-end machines one by one (removing 2 or 3 machines per day). We have removed 26 machines so far, so we now have 30 data nodes (26 high-end and 4 low-end). We then decided to migrate from es-1.3.2 to es-1.5.2 and upgraded the cluster. There were no issues for 3 days.
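For readers unfamiliar with draining a data node before removing it: one common approach is the `cluster.routing.allocation.exclude._ip` setting, sent to `PUT /_cluster/settings`. The sketch below only builds the request body and does not contact a cluster. The helper name `build_exclude_request` and the IP addresses are made up for illustration, and this is not necessarily the procedure that was used on this cluster.

```python
import json


def build_exclude_request(ips):
    """Build the body for PUT /_cluster/settings that asks Elasticsearch
    to move shards off the nodes with the given IP addresses."""
    if not ips:
        raise ValueError("at least one IP address is required")
    return {
        "transient": {
            # Comma-separated list of node IPs to exclude from allocation.
            "cluster.routing.allocation.exclude._ip": ",".join(ips)
        }
    }


# Placeholder addresses, not real nodes from this cluster.
body = build_exclude_request(["10.0.0.11", "10.0.0.12"])
print(json.dumps(body, indent=2))
```

Once the excluded nodes hold no shards, they can be shut down without the cluster having to recover their data from replicas.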
Since yesterday we have been unable to create or delete indices. The master logs show only ProcessClusterEventTimeoutException for every task. We have only 200 pending tasks. We normally create 1000 shards (200 indices) per day, but this time only 200 shards were created, and that took 3+ hours, with the master still logging nothing but ProcessClusterEventTimeoutException.
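The pending-task count mentioned above comes from `GET /_cluster/pending_tasks`. As a rough sketch of how that response could be summarized to see what is clogging the master queue, the function below groups tasks by priority and reports the longest wait. The field names follow the documented response format; the helper name `summarize_pending_tasks` and the sample data are invented for illustration and are not taken from this cluster.

```python
from collections import Counter


def summarize_pending_tasks(response):
    """Summarize a parsed GET /_cluster/pending_tasks response."""
    tasks = response.get("tasks", [])
    # How many queued tasks there are at each priority level.
    by_priority = Counter(task["priority"] for task in tasks)
    # Longest time any task has been waiting, in milliseconds.
    oldest_ms = max((task["time_in_queue_millis"] for task in tasks), default=0)
    return {
        "count": len(tasks),
        "by_priority": dict(by_priority),
        "oldest_ms": oldest_ms,
    }


# Made-up sample in the shape of the API response.
sample = {
    "tasks": [
        {"insert_order": 101, "priority": "URGENT",
         "source": "create-index [example-1], cause [api]",
         "time_in_queue_millis": 86000},
        {"insert_order": 102, "priority": "HIGH",
         "source": "shard-started (example)",
         "time_in_queue_millis": 1200},
        {"insert_order": 103, "priority": "URGENT",
         "source": "delete-index [example-0]",
         "time_in_queue_millis": 45000},
    ]
}

summary = summarize_pending_tasks(sample)
print(summary)
```

The `source` field of the oldest tasks is usually the first thing to look at, since a single slow cluster-state update holds up everything queued behind it.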
Our Zen properties are
Any suggestions welcome.