Getting sudden bursts of CPU

Hi there,

We just started using Elasticsearch Service and we're facing a few issues we could use some help with.

Our setup and use cases are quite simple:

  • One index containing 14M documents (orders), used for plain text search on our site (across ~20 fields)
  • One index containing 2M documents used for autocomplete suggestions
    Both indexes are split into 5 shards, with one replica

Most of the traffic is on the first index, with around 2 requests/second at peak (requests made from our app to the cluster, not per shard).
A bit less than 1 per second on the second one.
Nothing crazy, and the traffic is quite regular across the day.

We tried 2 configurations: 3 nodes with 8GB of memory, and 2 nodes with 15GB.
Under load, we don't seem to have any memory issues, and CPU usage stays below 20%.

In both configurations, we regularly saw one node's CPU go up to 100% and then stay stuck there for several minutes. We sometimes had to restart the cluster to unblock things.

  • During those episodes, we could see that the search queue of this node was full
  • The CPU spike is very sudden: it goes from 20% to 100% in a couple of seconds
  • Kibana doesn't report any search/index metrics during those episodes
  • We tried reproducing this behaviour on a perf cluster with the same data, without success. When we replay what we think is the Prod traffic and increase the rps, response times degrade progressively, but the cluster doesn't 'break' like that
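
For reference, the "search queue is full" observation above can be checked per node with the cat thread pool API, e.g. `GET /_cat/thread_pool/search?format=json&h=node_name,name,active,queue,rejected`. Below is a minimal Python sketch that summarises such a response; the function name and the `min_queue` parameter are made up for illustration, and it only assumes the JSON rows carry the columns requested in that `h=` list.

```python
import json


def busiest_search_queues(cat_json, min_queue=1):
    """Summarise the `search` thread pool rows of a _cat/thread_pool JSON response.

    `cat_json` is the raw response body as text. The cat API returns every
    value as a string, so the counters are converted to int here.
    Returns the nodes whose search queue holds at least `min_queue` entries,
    sorted with the longest queue first.
    """
    rows = json.loads(cat_json)
    summary = [
        {
            "node": row["node_name"],
            "active": int(row["active"]),
            "queue": int(row["queue"]),
            "rejected": int(row["rejected"]),
        }
        for row in rows
        if row["name"] == "search"
    ]
    busy = [entry for entry in summary if entry["queue"] >= min_queue]
    return sorted(busy, key=lambda entry: entry["queue"], reverse=True)
```

Polling this every few seconds and logging the result would at least show whether a single node's queue fills up before or after its CPU hits 100%, and whether `rejected` keeps climbing during an episode.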

We are unsure about the nature of the traffic during those episodes. We see some traces of spikes in searches/second after the CPU burst, but it's hard to tell whether that's a real cause or a glitch in Kibana's reporting, as some data points seem to be missing.

We must be missing something obvious, but can't see what it is...

How can we tell what's causing those CPU bursts? I know ES can be set up to provide slow query logs, but on Elasticsearch Service I can't figure out how to get to those logs.
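
For context, the slow log thresholds themselves are dynamic index settings that can be changed through the update index settings API (`PUT /<index>/_settings`). Here is a minimal Python sketch of building that request; the index name `orders`, the base URL and the threshold values are placeholders, and it says nothing about where a hosted deployment exposes the resulting log lines.

```python
import json
import urllib.request


def slowlog_settings(warn="2s", info="500ms"):
    """Search slow log thresholds; the values are illustrative, not recommendations."""
    return {
        "index.search.slowlog.threshold.query.warn": warn,
        "index.search.slowlog.threshold.query.info": info,
        "index.search.slowlog.threshold.fetch.warn": warn,
    }


def build_settings_request(base_url, index, settings):
    """Build (without sending) a PUT /<index>/_settings request."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical usage; add authentication before sending to a real cluster:
# request = build_settings_request("http://localhost:9200", "orders", slowlog_settings())
# urllib.request.urlopen(request, timeout=30)
```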

We're upgrading to 3 x 15GB nodes in the meantime, but it's frustrating not to be able to get to the bottom of the problem.

We're using ES 6.8.8, and for context our app is running on Rails, using the searchkick gem.

Any help greatly appreciated!

It may help to look at the hot threads API when the CPU spikes. This will show you which threads are busy and what they're up to. If you need help interpreting what you see, share the output here.
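
A rough sketch of how that could be scripted, so the output can be captured as soon as a spike starts. It uses the nodes hot threads endpoint (`GET /_nodes/hot_threads`, or `GET /_nodes/<node_id>/hot_threads` for a single node) with its `threads`, `interval` and `type` parameters; the function names, the base URL and the `auth_header` argument are assumptions for illustration only.

```python
import urllib.parse
import urllib.request


def hot_threads_url(base_url, node_id=None, threads=5, interval="500ms"):
    """Build the hot threads URL, for all nodes or for one node."""
    if node_id is None:
        path = "/_nodes/hot_threads"
    else:
        path = f"/_nodes/{node_id}/hot_threads"
    query = urllib.parse.urlencode(
        {"threads": threads, "interval": interval, "type": "cpu"}
    )
    return f"{base_url.rstrip('/')}{path}?{query}"


def fetch_hot_threads(base_url, auth_header=None, **kwargs):
    """Return the plain-text hot threads report.

    `auth_header` is a hypothetical pre-built Authorization header value,
    needed when the cluster requires credentials.
    """
    request = urllib.request.Request(hot_threads_url(base_url, **kwargs))
    if auth_header:
        request.add_header("Authorization", auth_header)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_hot_threads(...)` a few times in a row during an episode and saving each report to a file gives something concrete to share here.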

Thanks David,
After bumping the cluster up to 3 x 15GB nodes we have not seen any new issues, so "unfortunately" there has been no further investigation. I'll try to get a better understanding of our traffic so we can reproduce the problem on a separate cluster.
