Hi,
I have an Elasticsearch cluster of 6 data nodes running on AWS m4.2xlarge instances, each with 8 cores and 32 GB of memory. Whenever load increases, the cluster serves requests fine up to a certain number, which matches the formula

((no. of cores) * 3 / 2) + 1 = (8 * 3 / 2) + 1 = 13 threads per node,

a lot less than its search thread pool limit. After that point, all requests start going to the same node, pile up in that node's queue, and the response time of everything that follows increases by about 5x.

My question is: why are the requests not being distributed uniformly across the other nodes? I have not put any load balancer in front, so it's the default Elasticsearch routing. I am also not able to see anything useful in the slow logs, since there is no specific type of query that is taking the time.
Please let me know what is going wrong here. Below are some stats for that node.
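For reference, the per-node search queue numbers can be pulled from the `_cat` thread pool API. A minimal sketch of that call using Python's `requests` (the hostname is a placeholder, not my actual node address):

```python
import requests

# Placeholder hostname -- any node in the cluster can answer _cat APIs.
ES_HOST = "http://es-node-1:9200"

# Per-node search thread pool stats: active threads, queued requests, rejections.
resp = requests.get(
    ES_HOST + "/_cat/thread_pool/search",
    params={"v": "true", "h": "node_name,active,queue,rejected"},
)
print(resp.text)
```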
It seems the client is distributing requests based on some factor, because the thread pool graph above shows there are still free threads on some nodes that could service requests, yet all the requests end up in the queue of a single node. How can I find out on what basis the client is sending requests to the nodes?
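My understanding is that the official clients round-robin across whatever host list they were given, optionally refreshed by sniffing the cluster, so a stale or single-entry host list could funnel everything to one node. A minimal sketch of what I mean, assuming the pre-8.x Python `elasticsearch` client (the hostnames and sniffing settings are illustrative, not my actual config):

```python
from elasticsearch import Elasticsearch

# Hostnames are placeholders. By default the client round-robins requests
# across this list; sniffing refreshes the list from the cluster itself.
es = Elasticsearch(
    ["http://es-node-1:9200", "http://es-node-2:9200", "http://es-node-3:9200"],
    sniff_on_start=True,            # discover all nodes at startup
    sniff_on_connection_fail=True,  # re-discover if a node drops out
    sniffer_timeout=60,             # re-sniff at most every 60 seconds
)

print(es.info())  # sanity check that the client can reach the cluster
```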
Below is the distribution of some of my indices across the 6 data nodes. Most of them have 3 shards and 3 replicas each.
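(The same distribution can be reproduced with the `_cat/shards` API; a small sketch, again with `requests`, a placeholder hostname, and a hypothetical index pattern:)

```python
import requests

ES_HOST = "http://es-node-1:9200"  # placeholder hostname

# Shard layout: which node holds each primary (p) and replica (r) copy.
resp = requests.get(
    ES_HOST + "/_cat/shards/my-index-*",  # hypothetical index pattern
    params={"v": "true", "h": "index,shard,prirep,state,node"},
)
print(resp.text)
```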