Hey everybody!
We have a "tiered" setup where T1 has the warmest data and T2 keeps the rest.
The cluster has 24 data nodes plus 3 master nodes (which are not being queried).
T1 has 6 machines and T2 has the rest (18). They are all m4.4xlarge EC2 instances, and every data node has a 1000gb General Purpose SSD EBS disk.
- m4.4xlarge
- 64gb ram
- 16 cores
Elasticsearch has 30gb allocated for the heap with mlockall enabled.
We have 6 shards and 1 replica per shard for each index.
Our indices are split up by weeks and T1 currently serves the last 4 weeks, so the topology looks like this right now:
T1:
- index-w-2015.31
  - shard size: 42.3gb
  - shard size: 42.2gb
  - shard size: 42.8gb
  - shard size: 42.5gb
  - shard size: 42.1gb
  - shard size: 42.2gb
- index-w-2015.32
  - shard size: 47gb
  - shard size: 46.8gb
  - shard size: 45.8gb
  - shard size: 47.1gb
  - shard size: 47gb
  - shard size: 46.7gb
- index-w-2015.33
  - shard size: 49.5gb
  - shard size: 48.8gb
  - shard size: 48.3gb
  - shard size: 48.5gb
  - shard size: 48.6gb
  - shard size: 48.7gb
- index-w-2015.34 # Week 34 is not over yet so it's considerably smaller
  - shard size: 27.9gb
  - shard size: 25.8gb
  - shard size: 25gb
  - shard size: 25.3gb
  - shard size: 26.6gb
  - shard size: 26gb
T2:
- index-w-2015.30
- ...
- index-w-2015.00
- index-percolator
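To put the T1 numbers above in one place, here is a rough Python sketch that adds them up. It assumes the six sizes listed per index are the primary shards only, that each replica is the same size as its primary, and that shards are spread evenly over the 6 T1 nodes. None of that is confirmed, and the function names are just made up for illustration.

```python
# Shard sizes in GB, copied from the T1 listing above.
# Assumption: these are primary shards only; replicas are not listed.
T1_SHARDS_GB = {
    "index-w-2015.31": [42.3, 42.2, 42.8, 42.5, 42.1, 42.2],
    "index-w-2015.32": [47.0, 46.8, 45.8, 47.1, 47.0, 46.7],
    "index-w-2015.33": [49.5, 48.8, 48.3, 48.5, 48.6, 48.7],
    "index-w-2015.34": [27.9, 25.8, 25.0, 25.3, 26.6, 26.0],
}
REPLICAS = 1   # 1 replica per shard, as stated in the post
T1_NODES = 6   # T1 has 6 machines


def tier_total_gb(indices, replicas):
    """Total storage for a tier: primaries plus equally sized replicas."""
    primaries = sum(sum(shards) for shards in indices.values())
    return primaries * (1 + replicas)


def shard_copies(indices, replicas):
    """Number of shard copies (primaries + replicas) in the tier."""
    return sum(len(shards) for shards in indices.values()) * (1 + replicas)


def per_node(value, nodes):
    """Average per node, assuming a perfectly even spread (an assumption)."""
    return value / nodes


if __name__ == "__main__":
    total = tier_total_gb(T1_SHARDS_GB, REPLICAS)
    copies = shard_copies(T1_SHARDS_GB, REPLICAS)
    print(f"T1 total: {total:.1f} GB in {copies} shard copies")
    print(f"Per T1 node: {per_node(total, T1_NODES):.1f} GB, "
          f"{per_node(copies, T1_NODES):.0f} shard copies")
```

Under those assumptions T1 holds roughly 1967 GB in 48 shard copies, which is about 328 GB and 8 shard copies per T1 node.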
Indices are getting larger (both in doc count and in storage) every week. The current plan is to add a node to T2 every month and slowly add more disks when needed. We also plan to keep a year's worth of data in the cluster (that will probably change in January, though).
I have tried putting 3 search nodes (data:false, master:false) in front of the cluster, but that didn't improve response times, heap usage or GC times.
I am trying to figure out whether having more shards per index would decrease heap usage and GC times. Could the shards be too big?
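For reference, this is the arithmetic behind the "more shards per index" idea, as a small Python sketch. It uses the week 33 shard sizes from the listing above and assumes data would be split evenly across shards; the shard counts tried (12, 18, 24) and the function names are only examples, not a recommendation.

```python
# Primary shard sizes in GB for index-w-2015.33, copied from the listing above.
WEEK_33_SHARDS_GB = [49.5, 48.8, 48.3, 48.5, 48.6, 48.7]


def avg_shard_size_gb(index_total_gb, num_shards):
    """Average shard size if the index were split evenly into num_shards."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return index_total_gb / num_shards


def what_if(index_total_gb, shard_counts):
    """Map each candidate shard count to the resulting average shard size."""
    return {n: round(avg_shard_size_gb(index_total_gb, n), 1)
            for n in shard_counts}


if __name__ == "__main__":
    total = sum(WEEK_33_SHARDS_GB)
    # 6 is the current setting; the other counts are hypothetical.
    for n, size in what_if(total, [6, 12, 18, 24]).items():
        print(f"{n:2d} shards -> about {size} GB per shard")
```

Note that the total amount of data per index stays the same whatever the shard count is; only the size of each individual shard changes. Whether smaller shards actually help heap usage and GC is the open question.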
T1 is always GCing while T2 rarely does, which leaves the T2 heaps flatlining in the high 80s to mid 90s (percent).
We're getting really slow response times from our app, and they correlate with GC/CPU spikes in the Elasticsearch cluster.
I've attached some of the graphs that I've been looking at for a long time now.
The durations and counters for GC are turned into something more useful with the derivative
function from Graphite: http://graphite.readthedocs.org/en/latest/functions.html#graphite.render.functions.derivative
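In case it helps when reading the graphs, this is roughly what that derivative step does, written as a small Python sketch: each point becomes the difference from the previous point, so an ever-growing counter turns into a per-interval delta. This is my own re-implementation for illustration, not Graphite's code, and the sample values are made up.

```python
from typing import List, Optional


def derivative(series: List[Optional[float]]) -> List[Optional[float]]:
    """Delta between consecutive points of a running-total series.

    The first point has no predecessor, so its delta is None. A missing
    (None) sample yields None for itself and for the point after it.
    """
    result: List[Optional[float]] = []
    prev: Optional[float] = None
    for value in series:
        if value is None or prev is None:
            result.append(None)
        else:
            result.append(value - prev)
        prev = value
    return result


if __name__ == "__main__":
    # Hypothetical cumulative GC collection counts, one sample per interval.
    gc_count = [100, 104, 104, 110]
    print(derivative(gc_count))
```

A flat stretch in the counter shows up as 0 and a burst of collections as a spike, which is what makes the derived series easier to line up with the response time graphs.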