We've recently stood up a new cluster for our logging pipeline, and we're seeing some performance issues. We have 3 nodes running Elasticsearch 6.2.4 on EC2 c5.4xlarge instances.
Normally the cluster runs fine, and viewing most data in Kibana is usually fine as well. But for one of our index patterns, the query that Kibana's Discover page issues (when looking at 30 days' worth of logs) times out very frequently with the default 30s Kibana timeout. While I could just bump that timeout, I'd rather identify any underlying issues with the indices.
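For reference, I believe the timeout in question is this `kibana.yml` setting, which defaults to 30 seconds:

```yaml
# kibana.yml — time (in ms) to wait for responses from Elasticsearch
elasticsearch.requestTimeout: 30000
```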
The indices have ~950 million records. We have multi-fields with both text and keyword (for values under 256 characters) for most incoming strings, and we're logging our request and response payloads, so there are a decent number of sparse fields.
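To clarify what I mean by multi-fields: most of our string fields are mapped roughly like this (the field name here is just an example; this is essentially the ES 6.x default dynamic mapping for strings):

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "message": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        }
      }
    }
  }
}
```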
While the indices/mappings are probably not ideal (we're working on standardizing them a bit more), I'm curious why the CPU spikes so high when doing a basic `date_histogram` aggregation. We have other indices that contain many more documents, but querying them doesn't spike the CPU. I also increased the replica count from 1 to 2 (so there are 3 total copies of the data, including the primary) so the query load could spread across all the nodes, and that didn't seem to help. That makes me think adding more nodes won't necessarily help either. Do I just need to scale our nodes up vertically?

Here is a gist with the output of `hot_threads` captured while the CPU was spiking, though I'm not too sure what to make of it. Any help on how to read this output would be greatly appreciated.
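For reference, this is roughly the shape of the query I believe Discover issues for the histogram (the `@timestamp` field name and daily interval are assumptions on my part; reproducing it directly against the cluster spikes the CPU the same way):

```python
import json

# Roughly the Discover-style query: a 30-day range filter plus a
# date_histogram aggregation, with size 0 since only buckets are needed.
query = {
    "size": 0,
    "query": {
        "range": {"@timestamp": {"gte": "now-30d", "lte": "now"}}
    },
    "aggs": {
        "per_day": {
            # ES 6.x uses "interval" (not the 7.x "calendar_interval")
            "date_histogram": {"field": "@timestamp", "interval": "1d"}
        }
    },
}

print(json.dumps(query, indent=2))
```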