Hi,
We have a cluster hosted on EC2 with the following configuration:
3 Master nodes - t2.medium
3 Client nodes with Kibana - m4.large
12 'Hot' Data nodes - i3.xlarge
12 'Warm' Data nodes - d2.xlarge
Log data is sent from Logstash to an ELB in front of the 3 client nodes.
We create a daily index with 9 primary shards and 1 replica - 18 shards in total per index. Shard size is at most 30 GB for a full day. Indices are created on the hot nodes; after 2 days the allocation routing is changed and the shards are moved to the warm nodes (roughly as in the sketch below). Indices are deleted after 10 days.
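For reference, this is roughly how the setup works - a minimal sketch, not our exact config. The `box_type` attribute name, the `logs-*` index pattern, the template name and the localhost:9200 endpoint are all assumptions for illustration:

```python
import requests

ES = "http://localhost:9200"  # assumption: cluster reachable at this endpoint

# Daily index template: 9 primaries, 1 replica, new indices pinned to hot nodes.
# 'box_type' is an assumed node attribute (set via node.attr.box_type on each node).
template = {
    "index_patterns": ["logs-*"],  # assumed index naming pattern
    "settings": {
        "number_of_shards": 9,
        "number_of_replicas": 1,
        "index.routing.allocation.require.box_type": "hot",
    },
}
requests.put(f"{ES}/_template/daily_logs", json=template).raise_for_status()

# After 2 days the routing requirement is flipped and the shards relocate to warm nodes.
warm_settings = {"index.routing.allocation.require.box_type": "warm"}
requests.put(f"{ES}/logs-2018.01.01/_settings", json=warm_settings).raise_for_status()
```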
We're currently running 3 extra hot nodes and 3 extra warm nodes (12 of each) because some older indices still have 12 primary shards; once those age out we will scale down to 9 of each.
The cluster is used for non-prod log data. The indexing rate ranges between 4k/s and 13k/s, and the cluster remains relatively responsive to searches during this time.
The issue is that 6 of the hot nodes are running at high CPU, around 80-90%, while the remaining 6 are running at lower CPU, typically around 20-30%. All warm nodes are consistently at around 20-25%.
Can anyone tell me why the load seems to be uneven across the hot nodes, or point me to anything that would help me diagnose the issue further?
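Happy to post output if it helps - this is the kind of check I can run to see how shards are spread across the nodes (a sketch, assuming the REST API is reachable on localhost:9200):

```python
import requests
from collections import Counter

ES = "http://localhost:9200"  # assumption: cluster reachable at this endpoint

# Count shards per node; an uneven spread of the current day's shards across
# the 12 hot nodes would explain the uneven CPU.
shards = requests.get(
    f"{ES}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,node"},
).json()

per_node = Counter(s["node"] for s in shards if s.get("node"))
for node, count in per_node.most_common():
    print(f"{node}: {count} shards")

# Also worth a look (plain GETs): /_cat/allocation?v and /_nodes/hot_threads
```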