Allocation awareness seems to prefer local shards even when preferred node is at 100% CPU


(Greg Nilchee) #1

Setup:
We have a 6 node Elasticsearch 5.2.2 cluster with 3 dedicated masters and 3 data nodes evenly spread across 3 AWS availability zones (see details below). Our indexes have the default 5 primary shards all around 1GB in size and configured to have 2 replica copies spread across the cluster leveraging cluster.routing.allocation.awareness.attributes: aws_availability_zone and cloud auto attributes configured for the cluster.

Dedicated masters:

  • es-master-01 (az: us-east-1b)
  • es-master-02 (az: us-east-1c)
  • es-master-03 (az: us-east-1d)(primary)

Data nodes:

  • es-data-01 (az: us-east-1b)
  • es-data-02 (az: us-east-1c)
  • es-data-03 (az: us-east-1d)(problematic node)

Background:
The configuration overall is not perfect and we occasionally see CPU usage alerts on busier days but nothing impactful. We recently had a high traffic day in the cluster of both writes and reads that led to CPU alerts but significant slowdowns were also seen to a point where almost all search traffic was unresponsive.

At this point we took a closer look at the settings and traffic behavior of the cluster and noticed the following:

  • es-data-03 saw 3-5x more network traffic that the other 2 data nodes
  • at one point elasticsearch was restarted on es-data-03 as a troubleshooting step and we saw both cpu usage spike to es-data-01 while it was down.
    • Once es-data-03 recovered and the clustered returned to green the traffic resumed along with high cpu on es-data-03 and es-data-01 returned to previous values
  • Digging into statistics of the cluster we saw even index traffic across all 3 data nodes. On the other hand we saw 3x more search traffic hitting es-data-03 than the other 2 nodes
    • we also noted that while es-data-03 was out of the cluster the search traffic picked evenly between es-data-01 and es-data-02 but again preferred es-data-03 when it came back online
  • looking at the high usage indexes we noticed that for some reason all our major indexes had their primary shards located on es-data-02 while es-data-01 and es-data-03 had only the replicas equally of those indexes

Our findings:
Digging through 5.2.2 documentation and getting a better understand on how writes and reads occur we found that while cluster.routing.allocation.awareness.attributes: aws_availability_zone is set the coordinator nodes will prefer local shards which I take means that the master node will send search/GET traffic to the data node in the same availability zone assuming it has a primary or replica copy of the shard the query needs access to.

In our setup I believe that means if the primary master (es-master-03) is in us-east-1d it will send all or most of the search/GET traffic to a data node that is also in us-east-1d which ends up being es-data-03. If a data node is not available in the same zone or it does not contain a primary/copy of the shard the request requires it will round robin requests to other data nodes.

This seems to be confirmed by unsetting the aws_availability_zone for cluster.routing.allocation.awareness.attributes using the cluster settings api (see command below) today and based on the traffic at the time we still saw a preference for es-data-03 prior to the update. After unsetting the awareness attribute we saw the traffic on es-data-01 and es-data-02 pick up and now traffic appears to be round robin the search requests.

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
    "transient" : {
        "cluster.routing.allocation.awareness.attributes" : ""
    }
}'

Question(s):
Assuming our findings are correct and the cluster.routing.allocation.awareness.attributes: aws_availability_zone config setting is the sole cause of the uneven search traffic is there a way we can ensure the prefer local shards behavior does not overwhelm a single node? I would prefer to not leave cluster.routing.allocation.awareness.attributes unset in the event any new shards need to be created or existing ones need to be moved.

Also could the fact that all the primary shards on our larger indexes live on es-data-02 be a driving factor for the uneven traffic distribution? Or could the fact that we have uneven primary shard distribution play a role in this behavior?

Thanks in advanced for any advice you may have.


(Christian Dahlqvist) #2

The purpose of this is usually to ensure that primary and replica shards are not all located to the same availability zone. As you have only one data node per availability zone I do not see why you need this set.

If you are just indexing new data, primary and replica shards do the same amount of work, so it should not matter much. If you however update documents more load could be placed on primary shards.

Dedicated master nodes should ideally be left to manage the cluster and not handle request. This allows them to be small and improve cluster stability. I would recommend directly sending requests directly to the data nodes instead.


(Greg Nilchee) #3

@Christian_Dahlqvist Thanks for the feedback. It makes sense why we don't need to have this setting unless we plan to have multiple nodes per availability zone.

One last question I have is if we do scale to have multiple nodes per availability zone and don't set aws_availability_zone awareness are we able to set it after the fact and Elasticsearch would rebalance the shards or does this need to be set ahead of time before we add nodes into a single availability zone?