Uneven load distribution for search


I have a weird issue after migrating from Elasticsearch 2.4 to 6.3.
My cluster consists of 10 nodes:
3 master nodes (m3.medium, 1 CPU/4 GB RAM)
3 client nodes running ES, Logstash, and Kibana (c5.xlarge, 4 CPU/8 GB RAM)
4 data nodes (r4.2xlarge, 8 CPU/61 GB RAM)

Data inflow looks like this:

  • Load balancer sends data to logstash TCP listener
  • Kibana and Logstash send data and queries to the localhost client node, which is then supposed to load-balance them across the data nodes.
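For reference, the relevant Logstash output is a minimal sketch along these lines (the host and port are illustrative; the real pipeline may set more options):

```conf
# Logstash pipeline output: send events to the co-located client
# (coordinating-only) node, which fans the requests out to the data nodes.
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```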

The problem is that queries take much longer than they did in 2.4, and Datadog reports that one of the nodes receives more queries than the others.


I don't understand why... Can anybody help here?
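You can cross-check what Datadog shows straight from the cluster: `GET _nodes/stats/indices/search` reports a cumulative `query_total` per node. Here's a small sketch that computes each node's share of the search load; the node names and counts below are made up to stand in for a real response:

```python
# Sketch: spot an uneven search load across nodes using the shape of the
# GET _nodes/stats/indices/search response. The payload here is invented.
sample_stats = {
    "nodes": {
        "aaa": {"name": "node-0", "indices": {"search": {"query_total": 980_000}}},
        "bbb": {"name": "node-1", "indices": {"search": {"query_total": 210_000}}},
        "ccc": {"name": "node-2", "indices": {"search": {"query_total": 205_000}}},
        "ddd": {"name": "node-3", "indices": {"search": {"query_total": 214_000}}},
    }
}

def query_share(stats):
    """Return (node_name, fraction_of_total_queries), sorted descending."""
    totals = {
        n["name"]: n["indices"]["search"]["query_total"]
        for n in stats["nodes"].values()
    }
    grand = sum(totals.values())
    return sorted(
        ((name, count / grand) for name, count in totals.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

for name, share in query_share(sample_stats):
    print(f"{name}: {share:.0%}")
```

If one node's share is far above 1/N of the total, that node is the hotspot worth digging into.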

I'm continuing to check for any differences in configuration.

Any hotspots due to uneven shard distribution?

I've checked; shards are distributed evenly across all data nodes.
I recently switched from 2 shards per index to 4, about a week ago, but timeouts are happening even on shorter time ranges.
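In case it helps anyone else verify the same thing: `GET _cat/shards?h=index,shard,node` lists every shard with its node, and counting the last column shows the per-node balance. The lines below are invented sample output standing in for the real API response:

```python
# Sketch: count shards per node from _cat/shards?h=index,shard,node output.
# The sample text is made up; pipe in the real API output instead.
from collections import Counter

cat_shards_output = """\
logs-2018.07.01 0 node-0
logs-2018.07.01 1 node-1
logs-2018.07.01 2 node-2
logs-2018.07.01 3 node-3
logs-2018.07.02 0 node-0
logs-2018.07.02 1 node-1
logs-2018.07.02 2 node-2
logs-2018.07.02 3 node-3
"""

shards_per_node = Counter(
    line.split()[-1] for line in cat_shards_output.splitlines()
)
print(shards_per_node)  # even here: 2 shards on every node
```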

Do you use routing or any other features that can cause hotspots?

I'm using routing awareness based on availability zone, and I just realized that node-0 and node-3 are in the same AZ, while the others are each in their own... I will re-deploy a new instance in a 4th zone and see if that helps.
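For anyone following along, zone-based awareness is a sketch like this in `elasticsearch.yml` (the attribute name and zone value are illustrative; the discovery-ec2 plugin can also set an `aws_availability_zone` attribute automatically):

```yaml
# elasticsearch.yml on each data node: tag the node with its zone,
# then tell the allocator to balance copies across that attribute.
node.attr.zone: us-east-1a
cluster.routing.allocation.awareness.attributes: zone
```

With awareness on, two nodes sharing an AZ form one "zone" for allocation purposes, which can skew which nodes hold the copies that serve searches.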

And now I'm stuck with shard re-balancing :)
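If rebalancing is the bottleneck, the cluster settings API lets you tune how aggressively it proceeds. A sketch (the values here are illustrative, not recommendations):

```console
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}
```

Raising these speeds up the move at the cost of more I/O on the data nodes; transient settings revert on a full cluster restart.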
