Looking for some suggestions on how to figure out what is bottlenecking our cluster.
I have 18 servers (>128 GB RAM, 20 cores each) running 22 nodes (3 client, 5 master, 14 data nodes), with all JVMs at a 30 GB heap.
~340 indexes with ~20K primary shards (replica count 1, so 40K shards total), for a total of 15 TB on EMC storage. We index about 1.2M events a minute, all day long.
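For context, here is the back-of-the-envelope math on those numbers (assuming shards are spread evenly across the 14 data nodes):

```python
# Rough cluster math from the figures above.
# Assumptions: even shard distribution across the 14 data nodes,
# one 30 GB heap per data node.
total_shards = 20_000 * 2        # 20K primaries, replica 1 -> 40K shards
data_nodes = 14
heap_gb = 30

shards_per_node = total_shards / data_nodes
heap_mb_per_shard = heap_gb * 1024 / shards_per_node

print(f"shards per data node: {shards_per_node:.0f}")       # ~2857
print(f"heap per shard:       {heap_mb_per_shard:.1f} MB")  # ~10.8 MB
```

I've seen rules of thumb suggesting far fewer shards per node than this, and cluster-state work (which every shard contributes to) is exactly what spikes during snapshots, rebalancing, and restarts, so I suspect the shard count is relevant, but I'd like to confirm it before acting.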
When everything is stable I have no issue, but if I do any kind of maintenance (snapshots, rebalancing, restarts, etc.) I get a lot of 503 cluster timeouts or command timeouts.
What I don't get is that CPU utilization is very low, load average is about 2.7, and there is only about 500 Mb/s of disk I/O on the EMC, which is capable of a lot more (per benchmarks we have done, >4 Gb/s write).
So, I know there are APIs for hot threads, pending tasks, and the cat "tasks" endpoint, but I have no clue how to read their output, nor what to look for that would be askew.
I have:
- Tuned my index settings to reduce the shard count on small indexes
- We are planning a 5.2 upgrade, but that is probably still 1 to 2 months away
- I have tuned the Networking and Disk I/O based on RHEL best practices.
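For the shard tuning in the first bullet, this is the shape of the template I applied (the index pattern and numbers here are illustrative, not our real values):

```python
import json

# Sketch of an index template capping shards on small indexes.
# Pre-5.x template syntax uses the "template" key for the index pattern;
# "smallindex-*" and the shard counts are placeholders, not our real config.
small_index_template = {
    "template": "smallindex-*",
    "settings": {
        "number_of_shards": 1,       # down from the cluster default
        "number_of_replicas": 1,
    },
}

# Body that would be applied with: PUT _template/smallindex
print(json.dumps(small_index_template, indent=2))
```

This only helps new indexes as they roll over, so the existing ~40K shards are still in play.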
I am really baffled as to what the bottleneck is. Any insight on exactly what to look for, or links, would be really helpful.