Setup:
We have a 6-node Elasticsearch 5.2.2 cluster with 3 dedicated masters and 3 data nodes spread evenly across 3 AWS availability zones (see details below). Our indexes use the default 5 primary shards, each around 1 GB in size, and are configured with 2 replica copies spread across the cluster via cluster.routing.allocation.awareness.attributes: aws_availability_zone
and the cloud auto attributes configured for the cluster (a quick way to verify the node attributes is shown after the node list below).
Dedicated masters:
- es-master-01 (az: us-east-1b)
- es-master-02 (az: us-east-1c)
- es-master-03 (az: us-east-1d) (elected master)
Data nodes:
- es-data-01 (az: us-east-1b)
- es-data-02 (az: us-east-1c)
- es-data-03 (az: us-east-1d)(problematic node)
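For reference, the zone attribute each node ends up with can be confirmed with the cat nodeattrs API (a quick check; the column selection here is just one way to slice the output):

curl -X GET "localhost:9200/_cat/nodeattrs?v&h=node,attr,value"

Each data node should list an aws_availability_zone attribute matching the zone shown above.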
Background:
The configuration overall is not perfect and we occasionally see CPU usage alerts on busier days, but nothing impactful. We recently had a high-traffic day of both writes and reads that led to CPU alerts, but this time we also saw significant slowdowns, to the point where almost all search traffic was unresponsive.
At this point we took a closer look at the settings and traffic behavior of the cluster and noticed the following:
- es-data-03 saw 3-5x more network traffic than the other 2 data nodes
- at one point Elasticsearch was restarted on es-data-03 as a troubleshooting step, and while it was down both the network traffic and the CPU usage spike shifted to es-data-01
- once es-data-03 recovered and the cluster returned to green, the traffic and high CPU resumed on es-data-03 and es-data-01 returned to its previous values
- Digging into the cluster statistics, we saw even indexing traffic across all 3 data nodes, but roughly 3x more search traffic hitting es-data-03 than the other 2 nodes
- we also noted that while es-data-03 was out of the cluster the search traffic was split evenly between es-data-01 and es-data-02, but again preferred es-data-03 when it came back online
- looking at the high-usage indexes, we noticed that all of our major indexes had their primary shards located on es-data-02, while es-data-01 and es-data-03 held only replicas of those indexes (checks along these lines are sketched after this list)
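For context, checks along these lines show the shard layout and the per-node search counters we compared (the index pattern my-index-* is a placeholder for our actual index names):

# which node holds each primary (p) and replica (r) copy
curl -X GET "localhost:9200/_cat/shards/my-index-*?v&h=index,shard,prirep,node"

# per-node search counters (indices.search.query_total), to compare search load across data nodes
curl -X GET "localhost:9200/_nodes/stats/indices/search?pretty"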
Our findings:
Digging through the 5.2.2 documentation and getting a better understanding of how writes and reads are routed, we found that while cluster.routing.allocation.awareness.attributes: aws_availability_zone
is set, the coordinating nodes will prefer local shards, which I take to mean that the master node will send search/GET traffic to the data node in the same availability zone, assuming it has a primary or replica copy of the shard the query needs access to.
In our setup I believe that means that if the elected master (es-master-03) is in us-east-1d, it will send all or most of the search/GET traffic to a data node that is also in us-east-1d, which ends up being es-data-03. If no data node is available in the same zone, or that node does not hold a primary or replica of the shard the request requires, requests are round-robined to the other data nodes.
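As one way to see which shard copies a given search can be served from, the search shards API lists the candidate copies and their nodes (a sketch; my-index is a placeholder, and the output shows eligible copies rather than the one the coordinating node will actually prefer):

curl -X GET "localhost:9200/my-index/_search_shards?pretty"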
This seems to be confirmed by unsetting aws_availability_zone from cluster.routing.allocation.awareness.attributes via the cluster settings API today (see command below). Based on the traffic at the time, we still saw a preference for es-data-03 prior to the update; after unsetting the awareness attribute, traffic on es-data-01 and es-data-02 picked up, and the search requests now appear to be round-robined across the data nodes.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "cluster.routing.allocation.awareness.attributes" : ""
  }
}'
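The resulting transient setting can then be read back to confirm the change:

curl -X GET "localhost:9200/_cluster/settings?pretty"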
Question(s):
Assuming our findings are correct and the cluster.routing.allocation.awareness.attributes: aws_availability_zone
setting is the sole cause of the uneven search traffic, is there a way we can ensure the prefer-local-shards behavior does not overwhelm a single node? I would prefer not to leave cluster.routing.allocation.awareness.attributes
unset in the event any new shards need to be created or existing ones need to be moved, so the plan is to set it back (see the command below).
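For reference, restoring it would be the same cluster settings API call as above, just setting the value back:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "cluster.routing.allocation.awareness.attributes" : "aws_availability_zone"
  }
}'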
Also, could the fact that all the primary shards of our larger indexes live on es-data-02 be a driving factor for the uneven traffic distribution? More generally, could uneven primary shard placement play a role in this behavior?
Thanks in advance for any advice you may have.