Hi all, I ran into a query performance issue while performance testing against Elasticsearch v2.4.3:
1. When running the query performance test against a cluster holding 64 shard copies of one index (16 primary shards with 3 replicas each, so 16 * 4 = 64 copies) on 64 data nodes, so that each data node holds exactly one shard copy, everything works as expected and we get a QPS capacity of about 1800/sec (the corresponding index settings are sketched below).
2. Keeping everything else the same as #1, I stopped the ES service on one of the data nodes and watched the shard copy on it get automatically re-allocated to another data node (the re-allocation took about 20 minutes in our test). So now:
   a. the 64 shard copies are allocated across 63 data nodes
   b. 62 data nodes hold a single shard copy each, and 1 data node holds two shard copies
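For reference, the layout in #1 corresponds roughly to the following index settings (the index name "logs", the host, and the Python/requests snippet itself are placeholders for illustration, not our exact setup):

```python
# Sketch of the index layout described in #1; "logs" and the host are placeholder names.
import requests

settings = {
    "settings": {
        "number_of_shards": 16,    # 16 primary shards
        "number_of_replicas": 3    # 3 replica copies of each primary
    }
}
# 16 primaries * (1 primary + 3 replicas) = 64 shard copies in total,
# so with 64 data nodes each node ends up holding exactly one shard copy.
resp = requests.put("http://es-node-1:9200/logs", json=settings)
print(resp.json())
```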
Under the conditions of #2, I re-ran the test against the cluster with the same test settings and the same query data as #1, and we now get a QPS capacity of only about 1000/sec.
From the Marvel dashboard, I saw CPU usage on the data node holding the two shards spike to over 95%, while the other nodes behaved just as they did in #1, so I assume the data node with the two shards is the bottleneck in test #2.
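For what it's worth, outside of Marvel the doubled-up node can also be spotted from the APIs; a minimal sketch (host and index name are placeholders, and the exact stats field names can differ slightly between ES versions):

```python
import requests

ES = "http://es-node-1:9200"   # placeholder host

# Which node ended up with two shard copies after the re-allocation:
print(requests.get(ES + "/_cat/shards/logs?v").text)

# Per-node CPU from the node stats API (field layout may vary slightly by version):
stats = requests.get(ES + "/_nodes/stats/os,process").json()
for node in stats["nodes"].values():
    os_cpu = node.get("os", {}).get("cpu_percent")
    proc_cpu = node.get("process", {}).get("cpu", {}).get("percent")
    print(node["name"], "os cpu:", os_cpu, "process cpu:", proc_cpu)

# What the busy threads on the hot node are doing:
print(requests.get(ES + "/_nodes/hot_threads?threads=5").text)
```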
The situation is:
We deployed the ES cluster on Azure, and Azure performs maintenance such as OS upgrades from time to time, so some machines will simply stop working. Since the automatic re-allocation takes about 20 minutes, we will regularly end up in situations like test #2, and a capacity drop this large is unacceptable for our customer traffic.
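For context, the ~20-minute re-allocation can be watched while it runs with something like the sketch below (host and index name are placeholders; this is only to show where the time is spent, not a fix):

```python
import requests, time

ES = "http://es-node-1:9200"   # placeholder host

# Poll cluster health until the shard copy lost with the stopped node is rebuilt elsewhere.
while True:
    health = requests.get(ES + "/_cluster/health").json()
    print(health["status"],
          "initializing:", health["initializing_shards"],
          "relocating:", health["relocating_shards"],
          "unassigned:", health["unassigned_shards"])
    if health["status"] == "green":
        break
    time.sleep(30)

# Per-shard recovery progress during the re-allocation:
print(requests.get(ES + "/_cat/recovery/logs?v").text)
```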
1. Are we missing anything, such as an ES configuration setting, that is causing this performance issue? How can we avoid it? Any advice is welcome.
2. We cannot understand why just one extra shard on a single data node causes such a big capacity drop. We have 8 cores on each data node.
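A rough back-of-envelope (this assumes ES picks one of the 4 copies of each shard round-robin and that the data nodes were already close to CPU saturation at 1800/sec; both are assumptions on my part, not measurements):

```python
# Back-of-envelope only; assumes round-robin over the 4 copies of each shard and that
# every node was already near CPU saturation at the 1800/sec measured in test #1.

qps_test1 = 1800.0            # measured cluster-wide QPS in test #1
copies_per_shard = 4          # 1 primary + 3 replicas

# Every query touches one copy of each of the 16 shards, so a node holding ONE copy
# serves its shard for about 1/4 of all queries:
shard_queries_per_node = qps_test1 / copies_per_shard      # ~450 shard queries/sec at saturation

# The node that picked up a SECOND copy now serves shard queries for ~1/2 of all queries,
# so it hits the same ~450 shard-queries/sec ceiling at roughly half the cluster QPS,
# even with all 8 cores busy:
qps_when_hot_node_saturates = shard_queries_per_node * copies_per_shard / 2
print(qps_when_hot_node_saturates)   # 900.0 -- in the ballpark of the ~1000/sec from test #2
```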
Thanks in advance,