ES performance reliability issue
After fixing the issue above, I found another strange one.
Our ES query path is:
[user application] -> [nginx load balancer] -> [coordinating nodes] -> [data nodes]
The search latency of most queries is less than 50ms. However, the nginx logs still show some queries that take more than 200ms.
If I filter the log by upstream_addr, I can see that each coordinating node has slow queries like this every 20 minutes. For example, the issue occurs on node A at minutes 13, 33, and 53 of each hour, and on node B at minutes 5, 25, and 45.
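The grouping described above can be sketched in Python. This is only a rough sketch: the log layout is an assumption (a hypothetical log_format that writes `[$time_local]`, `upstream_addr=$upstream_addr`, and `request_time=$request_time`), and the function name and the 0.2s threshold are illustrative, so the regex would need adjusting to the real format.

```python
import re
from collections import Counter, defaultdict

# Assumed (hypothetical) nginx log_format, not taken from the real config:
#   '[$time_local] ... upstream_addr=$upstream_addr ... request_time=$request_time'
# $time_local looks like 10/Oct/2023:13:55:36 +0000; $request_time is in seconds.
LINE_RE = re.compile(
    r"\[\d{2}/\w{3}/\d{4}:\d{2}:(?P<minute>\d{2}):\d{2} [+-]\d{4}\]"
    r".*?upstream_addr=(?P<upstream>\S+)"
    r".*?request_time=(?P<seconds>\d+(?:\.\d+)?)"
)


def slow_minutes_by_upstream(lines, threshold_s=0.2):
    """Count slow requests per upstream, bucketed by minute of the hour."""
    buckets = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        if float(match.group("seconds")) > threshold_s:
            buckets[match.group("upstream")][int(match.group("minute"))] += 1
    return buckets


if __name__ == "__main__":
    import sys

    # Usage: python slow_minutes.py < access.log
    for upstream, counts in sorted(slow_minutes_by_upstream(sys.stdin).items()):
        print(upstream, sorted(counts.items()))
```

If the slow requests really recur every 20 minutes, each upstream should show three dominant minute buckets spaced 20 apart.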
If I replace the 8C32G coordinating node with a 16C32G server, the number of high-latency queries is reduced by half.
My guess is that the issue is related to the search thread pool: the 16-core ES node has a larger search thread pool, so a problem affecting a single thread has a smaller impact. But I can't prove it, and I want to know what is actually happening.
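One way to test this guess would be to compare the search thread pool counters on the coordinating nodes around the slow minutes. The sketch below assumes the documented default search pool size of int((allocated processors * 3) / 2) + 1, which gives 13 threads for 8 cores and 25 for 16 (the real value depends on the ES version and on any `node.processors` or thread pool overrides), and it reads the `GET _nodes/stats/thread_pool` response. The function names and the base URL are hypothetical.

```python
import json
import urllib.request


def default_search_pool_size(allocated_processors):
    """Documented default: int((allocated_processors * 3) / 2) + 1."""
    return (allocated_processors * 3) // 2 + 1


def search_pool_stats(nodes_stats):
    """Extract per-node search thread pool counters from a parsed
    GET _nodes/stats/thread_pool response."""
    rows = {}
    for node in nodes_stats.get("nodes", {}).values():
        pool = node.get("thread_pool", {}).get("search", {})
        rows[node.get("name", "?")] = {
            key: pool.get(key) for key in ("threads", "active", "queue", "rejected")
        }
    return rows


def fetch_nodes_stats(base_url):
    """Fetch thread pool stats; base_url is an assumption,
    for example "http://localhost:9200"."""
    with urllib.request.urlopen(base_url + "/_nodes/stats/thread_pool") as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Default pool sizes for the two server types mentioned above.
    print(default_search_pool_size(8), default_search_pool_size(16))
```

Polling `search_pool_stats(fetch_nodes_stats(...))` every few seconds and checking whether `active` reaches `threads`, or `queue` grows, in the same minutes as the slow requests would support or weaken the thread pool hypothesis.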