Very high CPU usage on Elasticsearch nodes

Hi,

We are running an Elasticsearch cluster with the following specs:

  • We run on AWS EC2 instances (self-managed, not their SaaS offering).
  • 3 nodes (all of them do everything: data, master).
  • Each node is an r4.large (2 vCPUs, 15 GB RAM), with 8 GB allocated to the heap.
  • SSD disks (AWS EBS gp2 type).
  • Elasticsearch 5.6.0
  • JVM 1.8.0_151

We have a single index with 4 primary shards and no replicas.
The index has around 3,000,000 documents that weigh around 1 GB in total (I know it's not much at all... that is why I'm surprised at the performance issues).

When we run some load testing, at about 40 requests/second, the CPU of the nodes goes near 100% (not all the nodes every time: sometimes one or two, sometimes all of them. I guess it depends on the queries or searches we perform at any given moment).

Any idea what the issue could be?
What can we check to better troubleshoot?
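
In case it helps, these are the standard diagnostic calls we can run while the load test is going on (the index name below is just a placeholder for ours):

    # busiest threads per node (the same API that produced the output below)
    GET _nodes/hot_threads

    # activity, queueing and rejections of the search thread pool on each node
    GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected

    # per-node CPU, load, GC and search statistics
    GET _nodes/stats/os,jvm,indices

    # shard sizes and how they are spread across the 3 nodes
    GET _cat/shards/my_index?v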

Another piece of info that I've seen many people ask for is the hot_threads output, so here it is from one of the nodes (the most overloaded one):

::: {elasticsear6060}{}{}
   Hot threads at 2018-02-28T15:17:27.866Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   41.9% (209.3ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsear6060][search][T#3]'
     4/10 snapshots sharing following 7 elements
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:644)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     2/10 snapshots sharing following 12 elements
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:114)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:252)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:267)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:340)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:337)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1533)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     4/10 snapshots sharing following 7 elements
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1533)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
   41.8% (208.7ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsear6060][search][T#1]'
     5/10 snapshots sharing following 7 elements
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:644)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     5/10 snapshots sharing following 7 elements
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1533)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

I think you should share a typical query, a sample document and the mapping.

This is a sample aggregation query (I know it's quite big...):

And this is another sample query:

This is the mapping:

I looked at the first query.

I don't understand why you are repeating the bool part inside the agg, since aggs are applied by default to the result set (which has already been filtered).
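
For example, something along these lines is usually enough, since the terms aggregation only runs over the documents matched by the top-level query (heavily simplified, with made-up index and field names):

    GET my_index/_search
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": [
            { "term": { "status": "active" } }
          ]
        }
      },
      "aggs": {
        "by_category": {
          "terms": { "field": "category" }
        }
      }
    }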

When I looked at the main query: you have a lot of should clauses there. That is, I guess, why you are spending so much time running the queries.
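
If some of those should clauses only need to match and don't need to contribute to scoring, one thing worth trying is moving them into the filter context, so Elasticsearch can skip scoring them and cache them. A rough sketch with made-up fields (note the explicit minimum_should_match so that at least one clause still has to match):

    {
      "query": {
        "bool": {
          "filter": {
            "bool": {
              "minimum_should_match": 1,
              "should": [
                { "term": { "tag": "foo" } },
                { "term": { "tag": "bar" } }
              ]
            }
          }
        }
      }
    }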

You can try the Search Profiler tool, available with X-Pack Basic in the Kibana interface, to get more insight into where the time is being spent.
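
If you prefer the raw API, you can get the same breakdown by adding "profile": true to the search body; a minimal sketch (index and field names are placeholders):

    GET my_index/_search
    {
      "profile": true,
      "query": {
        "match": { "title": "foo" }
      }
    }

The response then contains, for every shard, how long each query component and collector took to execute.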

cc @jpountz, who might have ideas on how to optimize your queries.

As you said, your index is not especially large, but on the other hand you are using quite expensive features like parent/child and nested. You might also not be using aggregations optimally, as David noted.
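
To give an idea of what "expensive" means here (the type, path and field names below are made up, only the shape of the clauses matters), queries like:

    {
      "query": {
        "has_child": {
          "type": "comment",
          "query": { "match": { "text": "foo" } }
        }
      }
    }

or:

    {
      "query": {
        "nested": {
          "path": "comments",
          "query": { "match": { "comments.text": "foo" } }
        }
      }
    }

have to do a join-like step across documents at search time, which costs a lot more CPU than a plain match or term query on a flattened document.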
