The path of our ES queries is:
[user application] -> [nginx load balancer] -> [coordinating nodes] -> [data nodes]
Most queries have a search latency of less than 50 ms, but the nginx logs still show some queries that take more than 200 ms.
If I filter the logs by upstream_addr, I can see that each coordinating node gets queries like this every 20 minutes. For example, the issue occurs on node A at the 13th, 33rd, and 53rd minutes of each hour, and on node B at the 5th, 25th, and 45th minutes.
If I replace the 8C32G coordinating node with a 16C32G server, the number of high-latency queries drops by about half.
My guess is that the issue is related to the search thread pool: the 16-core ES node has a larger thread pool, so a problem affecting a single thread impacts a smaller share of queries. But I can't prove it, and I want to know what actually happened.
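To check whether the search thread pool is the bottleneck, I look at its size, queue length, and rejection counts on each coordinating node. The host, port, and credentials below are placeholders for my cluster, and the column list is just my selection:

```
# Show search thread pool size, active threads, queue length, and rejections per node.
curl -u elastic:changeme 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,size,active,queue,rejected,completed'
```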
I have confirmed that the native realm cache causes the issue.
If I call the _xpack/security/realm/*/_clear_cache API, the same issue occurs immediately.
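For reference, this is roughly how I trigger it from the command line (host, port, and credentials are placeholders):

```
# Clearing the realm cache reproduces the latency spike on the next queries.
curl -u elastic:changeme -XPOST 'http://localhost:9200/_xpack/security/realm/*/_clear_cache'
```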
Now I use the following setting (ES 5.6.3) to reduce the number of high-latency queries:
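If I remember correctly, the native realm's cache.ttl defaults to 20m, which matches the 20-minute pattern in the nginx logs: when the cached credentials expire, the next requests have to re-authenticate against the security index. The realm cache settings in elasticsearch.yml look roughly like this; the realm name native1 and the values below are placeholders, not necessarily what I run:

```yaml
# elasticsearch.yml on the coordinating nodes (ES 5.6.x)
xpack.security.authc.realms:
  native1:                   # placeholder realm name
    type: native
    order: 0
    cache.ttl: 2h            # placeholder: longer than the default 20m
    cache.max_users: 100000  # placeholder
```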