Finding the resource bottleneck in an Elasticsearch cluster

Hi Elastic ninjas,
We have made our cluster live with a hot/warm architecture, as mentioned in "How hot/warm architecture enhances query performance?". We are now in the rollout phase, where we are sending controlled live traffic to our new 6.2 cluster. The cluster has 5 data nodes [r4.xlarge], 3 master nodes [c5.large], and 2 client nodes [c5.large].

While we were sending live traffic to the new cluster, it performed well up to 10,000 requests per minute, although some requests took more than 2.5 seconds. The average request took around 10-100 ms; the maximum request time was around 2.5 seconds or more.

We are trying to debug two things:

  1. Since all the queries are the same except for a different user_id, why are some of these identical requests taking more than 1-2 seconds?
  2. Why, beyond 10,000 requests per minute, do most requests take more than 1-2 seconds to complete?

Though we could always add more data nodes so that the cluster can handle the traffic, we are trying to work out which resource becomes the bottleneck in our cluster beyond 10,000 requests per minute.

At around 10k requests/minute:

  - CPU utilization on all instances is around 30-40%
  - memory used minus cache is around 50%
  - there is no increase in GC count
  - no increase in the search and bulk threadpool queue sizes
  - no rejections in the search and bulk threadpools
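
For reference, a minimal sketch of the kind of calls that expose these counters (assuming the standard 6.x nodes stats and cat thread pool APIs, and the same es_proxy endpoint used in the query further down):

# per-node JVM, thread pool and HTTP counters (http.total_opened covers connections opened)
curl -XGET 'es_proxy/_nodes/stats/jvm,thread_pool,http?pretty'
# active/queued/rejected search and bulk threads per node
curl -XGET 'es_proxy/_cat/thread_pool/search,bulk?v&h=node_name,name,active,queue,rejected'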

So far we have not been able to find any metric that correlates with the increased latencies, except for the number of HTTP connections opened per second.

Beyond 10,000 requests per minute this metric [number of HTTP connections opened per second] goes up, but my guess is that the increase is a result of the higher latencies, not the other way round.

Any ideas or insight into how I can find which resource is becoming the bottleneck in our cluster would be super helpful.

Can you describe your use case and the data you are indexing? What is the indexing rate into the cluster? How are you indexing data? Are you using time-based indices?

How are you querying data? How many concurrent queries are you running?


Sure, thanks for taking the time.

We are indexing at around 6,000 documents per minute.
The data is indexed from Kafka via a Logstash pipeline: bulk JSON is pushed into Kafka, Logstash filters out the required fields, and the Logstash Elasticsearch output plugin pushes the result into Elasticsearch.

Each ES document contains 39 key-value pairs.

We are using indices that are calculated/rolled over in the following way: there is a unique auto-increment id, say id, and the index for each id is calculated as [ indexPrefix + {Math.floor(id / 50000000) * 5} ], which puts approximately 50 million documents in each index. Each index is around 34 GB in size, with 4 shards.
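
As a hypothetical illustration of that calculation (the id value here is made up), shell integer division gives the same floor behaviour as Math.floor:

# hypothetical example of the index-name calculation described above
id=5512345678                      # made-up auto-increment id
suffix=$(( id / 50000000 * 5 ))    # integer division floors, like Math.floor
echo "indexprefix-$suffix"         # prints indexprefix-550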

We are running searches at around 10,000 requests per minute on the cluster as a whole, hitting 3 indices at a time. A sample query is given below:

curl -H "Content-Type: application/json" -XGET 'es_proxy/alias-indexprefix-550,alias-indexprefix-545,alias-indexprefix-540/_search?size=0&routing=1234&ignore_unavailable=true' -d'{
"aggs": {
    "group_by_order_id": {
        "aggs": {
            "order_id": {
                "max": {
                    "field": "order_id"
                }
            }
        },
        "terms": {
            "field": "order_id",
            "order": {
                "order_id": "desc"
            },
            "size": 11
        }
    }
},
"query": {
    "bool": {
        "filter": {
            "bool": {
                "must": [
                    {
                        "term": {
                            "customer_id": 1234
                        }
                    },
                    {
                        "term": {
                            "is_subscription": false
                        }
                    },
                    {
                        "bool": {
                            "should": [
                                {
                                    "range": {
                                        "payment_status": {
                                            "lt": 4
                                        }
                                    }
                                },
                                {
                                    "range": {
                                        "created_at": {
                                            "gt": "2018-06-25T18:30:00Z"
                                        }
                                    }
                                }
                            ]
                        }
                    }
                ],
                "must_not": [
                    {
                        "terms": {
                            "vertical_id": [
                                106
                            ]
                        }
                    }
                ],
                "should": []
            }
        }
    }
}

}
'
Please let me know if any more information is required from my side.

After struggling for multiple days, we tried changing the Elastic Load Balancer (ELB) in front of our Elasticsearch client instances. Once we replaced the ELB, the erratic latency spikes went away and performance became smooth.

Once the cluster started to behave normally, we made it live with actual production traffic. There were multiple performance bottlenecks, the major one being CPU utilisation.
