Hey everyone, writing in to ask for advice regarding scaling an Elasticsearch cluster being used for analytics. That is - aggregations only, no searches. The TL;DR is we're seeing high CPU usage during a cardinality aggregation query and are trying to scale it.
The cluster currently consists of 5 m3.2xlarge instances; each instance has the following spec:
- 8 CPUs
- 30 GB of RAM (15 GB allocated to the JVM heap)
- an EBS volume
Regarding dataset sizes: a new index is created each week with 6 primary shards and 1 replica per primary. Right now there are around 30,000 active shards on the cluster - 15k primaries, 15k replicas. The active dataset for the main analytics scenarios spans about 42 indices with 213,803,714 documents, totalling about 26 GB.
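For reference, each weekly index is created with settings along these lines (the index name is just an illustration, not our real naming scheme):

```json
PUT /analytics-2015-w42
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}
```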
The query I'm trying to optimize is based on a cardinality aggregation over a high-cardinality string field. The documents being aggregated are filtered first.
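The query is shaped roughly like this - the field names and the filter clauses are placeholders rather than our real mapping, and the exact DSL depends on the ES version (older 1.x clusters would use a `filtered` query instead of `bool`/`filter`):

```json
POST /analytics-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "customer_id": "acme" } },
        { "range": { "timestamp": { "gte": "now-7d" } } }
      ]
    }
  },
  "aggs": {
    "unique_visitors": {
      "cardinality": {
        "field": "visitor_id",
        "precision_threshold": 3000
      }
    }
  }
}
```

(`precision_threshold": 3000` is just the documented default made explicit; raising it, up to a maximum of 40000, trades memory for accuracy.)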
Here are some metrics on the performance:
- Queries running in parallel - between 5 and 10
- 95th percentile latency - 25 seconds
- CPU load hovers around 90% on all 5 nodes when the queries are being run
- Heap usage is between 40-60% on all 5 nodes
- Almost no read operations on the EBS volumes, and CPU iowait is constantly 0
These metrics lead me to believe that both the fielddata and the filters are being cached properly. I'd love some advice on where to go from here, short of just adding more nodes.