Hi Elastic ninjas,
We have taken our cluster live with a hot/warm architecture, as described in How hot/warm architecture enhances query performance?. We are now in the rollout phase, sending controlled live traffic to our new 6.2 cluster. The cluster has 5 data nodes [r4.xlarge], 3 master nodes [c5.large], and 2 client nodes [c5.large].
Under live traffic the cluster performed well up to 10,000 requests per minute, though some requests took more than 2.5 seconds. Average request latency was around 10-100 ms; the maximum was around 2.5 sec or more.
We are trying to debug two things:
- Since all the queries are identical except for the user_id, why do some of them take more than 1-2 sec? (see the profiling sketch just below this list)
- Why, beyond 10,000 requests per minute, do most requests take more than 1-2 sec to complete?
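For the first question, one thing we plan to try is re-running a slow query with `"profile": true` and comparing per-shard timings. A minimal sketch in Python, assuming the cluster is reachable on localhost:9200; `my_index` and the user_id value are placeholders for our real names:

```python
import requests

# Re-run one of the identical queries with profiling turned on and see which
# shard the time goes to. "my_index" and the user_id value are placeholders.
body = {
    "profile": True,
    "query": {"term": {"user_id": "12345"}},
}
resp = requests.post("http://localhost:9200/my_index/_search", json=body).json()

print("took (ms):", resp["took"])
# One slow shard among fast ones usually points at a single node
# (GC pause, cold filesystem cache, warm-node disk) rather than the query itself.
for shard in resp["profile"]["shards"]:
    total_ns = sum(q["time_in_nanos"] for q in shard["searches"][0]["query"])
    print(shard["id"], round(total_ns / 1e6, 2), "ms")
```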
Though we could always add more data nodes and the cluster would handle the traffic, we are trying to pinpoint which resource becomes the bottleneck after 10,000 requests/minute.
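To catch the bottleneck in the act, we can also snapshot hot threads on the nodes while latency is high. A rough sketch, again assuming localhost:9200:

```python
import time
import requests

# Capture hot threads on every node a few times while the cluster is at
# ~10k requests/minute. CPU-bound search shows up as busy search threads;
# if threads are mostly waiting, the bottleneck is more likely I/O or locking.
for i in range(5):
    text = requests.get(
        "http://localhost:9200/_nodes/hot_threads",
        params={"threads": 3, "interval": "500ms"},
    ).text
    with open(f"hot_threads_{i}.txt", "w") as f:
        f.write(text)
    time.sleep(10)
```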
At around 10k requests/minute:
- CPU utilization on all instances is around 30-40%
- memory used for cache is around 50%
- there is no increase in GC count
- no increase in search and bulk threadpool queue sizes
- no rejections in the search or bulk threadpools
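These numbers are from spot checks, so to line a latency spike up with the exact moment a counter moves we could poll node stats continuously. A sketch of that (same localhost assumption; the counters printed are the ones listed above):

```python
import time
import requests

# Poll node stats every 5 s and log the counters we care about, so latency
# spikes can be correlated with the exact moment a counter starts moving.
while True:
    stats = requests.get(
        "http://localhost:9200/_nodes/stats/jvm,thread_pool,http"
    ).json()
    for node_id, node in stats["nodes"].items():
        gc = node["jvm"]["gc"]["collectors"]["young"]
        search_tp = node["thread_pool"]["search"]
        print(
            node["name"],
            "young_gc:", gc["collection_count"],
            "search_queue:", search_tp["queue"],
            "search_rejected:", search_tp["rejected"],
            "http_open:", node["http"]["current_open"],
            "http_total:", node["http"]["total_opened"],
        )
    time.sleep(5)
```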
So far the only metric we have found that correlates with the increased latencies is the number of HTTP connections opened per second. Beyond 10,000 requests/minute this metric goes up, but my guess is that the rise is a result of the increased latencies, not the other way round.
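One thing the sketch below should help rule out is client-side connection churn: if our load generator is not reusing connections (no keep-alive), every request pays TCP setup, which would show up as exactly this metric rising with load. It compares the growth of total_opened against the request rate over a minute (same localhost assumption):

```python
import time
import requests

# If total_opened grows at roughly the request rate, clients are opening a
# fresh connection per request (no keep-alive). If current_open is stable and
# total_opened barely moves, connections are reused and the metric is only
# a symptom of slow requests, not a cause.
def total_opened():
    stats = requests.get("http://localhost:9200/_nodes/stats/http").json()
    return sum(n["http"]["total_opened"] for n in stats["nodes"].values())

before = total_opened()
time.sleep(60)
print("connections opened in the last minute:", total_opened() - before)
```

If the per-minute delta comes out close to 10,000, the clients are opening one connection per request, and fixing keep-alive would be worth trying before adding nodes.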
Any ideas or insight into how I can find which resource is becoming the bottleneck in our cluster would be super helpful.