we are running a cluster of elasticsearch with the following specs:
we run on AWS EC2's (self managed, not their Saas).
3 nodes (all do everything... data, master)
each node is r4.large type (2 CPU, 15GB RAM), 8GB is for heap.
SSD disks (AWS EBS gp2 type)
Elasticsearch 5.6.0
JVM 1.8.0_151
we have a single index with 4 primary shards and no replicas.
the index has around 3,000,000 documents that weigh around 1GB in total (I know its not much at all... this is why i'm surprised at the performance issues).
and when we run some load testing, with about 40 r\s , the CPU of the nodes goes near 100% (not all the nodes, sometimes one or two, and sometimes all. I guess it depends on the queries or searches we perform at any given moment...)
any idea what can be the issue?
what can we check to better troubleshoot?
another piece of info I saw that many people ask to see is the hot_threads output, so here it is from one of the nodes (the most overloaded):
::: {elasticsear6060}{}{}
Hot threads at 2018-02-28T15:17:27.866Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
41.9% (209.3ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsear6060][search][T#3]'
4/10 snapshots sharing following 7 elements
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:644)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
2/10 snapshots sharing following 12 elements
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:114)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:252)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:267)
org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:340)
org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:337)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1533)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
4/10 snapshots sharing following 7 elements
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1533)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
41.8% (208.7ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsear6060][search][T#1]'
5/10 snapshots sharing following 7 elements
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:644)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
5/10 snapshots sharing following 7 elements
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1533)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
I don't understand why you are repeating again the bool part inside the agg as aggs are applied by default on the resultset (which has been filtered yet).
When I looked at the main query, well you have a lot of should clauses here. That's is I guess why you are spending a lot of time of running the queries.
You can try to use the profiler tool available in X-Pack Basic in the Kibana interface to get more insights about the performance you are seeing.
cc @jpountz who might have ideas to optimize your queries may be.
Like you said your index is not especially large, but on the other hand you are using quite expensive features like parent/child and nested. You might also not be using aggregations optimally like David noted.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.