Random 100% CPU spikes on a staging cluster

We have a 2-node (r3.large.elasticsearch) cluster running filter-based queries against a small index (around 9 million documents total). From time to time, with no particular query pattern triggering it, CPU across the cluster spikes to 100%, along with a memory spike above 75%. We are not sure whether this is caused by filter-cache evictions or simply by under-provisioned instances. We do have high-cardinality search fields, though, so it might make sense to disable the filter cache for those and keep it enabled for the low-cardinality fields (rough sketch below). We are wondering which is the right approach: disable the filter cache or increase the instance size?
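For reference, this is roughly what we had in mind for the "disable the filter cache on high-cardinality filters" option. It assumes a 1.x-era filter DSL where individual filters accept a `_cache` flag (that flag was removed in 2.0, where caching is only controllable per index via `index.queries.cache.enabled`); the endpoint, index, and field names below are placeholders:

```python
import requests

ES_ENDPOINT = "https://our-domain.es.amazonaws.com"  # placeholder endpoint

# Filtered query where the high-cardinality terms filter opts out of the
# filter cache, while the low-cardinality term filter stays cacheable.
# The per-filter "_cache" flag only exists on 1.x-era clusters.
query = {
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        # high-cardinality field: skip the filter cache
                        {"terms": {"user_id": ["u123", "u456"], "_cache": False}},
                        # low-cardinality field: let Elasticsearch cache it
                        {"term": {"status": "active"}},
                    ]
                }
            }
        }
    }
}

resp = requests.post(f"{ES_ENDPOINT}/our_index/_search", json=query)
resp.raise_for_status()
print(resp.json()["hits"]["total"])
```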

Until you know what it is, doing anything will be guessing.
I'd install Monitoring to get better insight into what's going on.

Unfortunately we are on an AWS-managed Elasticsearch cluster, so there is no extra monitoring we can install beyond the usual CPU/disk/RAM usage graphs, which are attached here.
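The cluster's stats APIs do still respond on the managed service, though, so one thing we can do is poll cache stats ourselves and see whether evictions line up with the CPU spikes. A minimal sketch of that (endpoint is a placeholder; on pre-2.0 versions the metric is called `filter_cache` rather than `query_cache`, and if the domain's access policy requires IAM the request would also need SigV4 signing):

```python
import time
import requests

ES_ENDPOINT = "https://our-domain.es.amazonaws.com"  # placeholder endpoint

def cache_stats():
    # Cluster-wide query cache (filter cache) and shard request cache stats.
    resp = requests.get(f"{ES_ENDPOINT}/_stats/query_cache,request_cache")
    resp.raise_for_status()
    totals = resp.json()["_all"]["total"]
    return {
        "query_cache_bytes": totals["query_cache"]["memory_size_in_bytes"],
        "query_cache_evictions": totals["query_cache"]["evictions"],
        "request_cache_evictions": totals["request_cache"]["evictions"],
    }

# Poll once a minute and log, to line the numbers up with the CloudWatch graphs.
while True:
    print(time.strftime("%H:%M:%S"), cache_stats())
    time.sleep(60)
```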
