Very high CPU usage on Elasticsearch nodes

Hi,

We are running an Elasticsearch cluster with the following specs:

  • We run on AWS EC2 instances (self-managed, not their SaaS offering).
  • 3 nodes (all of them do everything: data, master).
  • Each node is an r4.large (2 vCPUs, 15 GB RAM), with 8 GB allocated to the heap.
  • SSD disks (AWS EBS gp2 type).
  • Elasticsearch 5.6.0
  • JVM 1.8.0_151

We have a single index with 4 primary shards and no replicas.
The index has around 3,000,000 documents that weigh around 1 GB in total (I know it's not much at all... that is why I'm surprised at the performance issues).

When we run some load testing, at about 40 requests/second, the CPU of the nodes goes near 100% (not all the nodes every time: sometimes one or two, sometimes all of them. I guess it depends on the queries or searches we perform at any given moment).

Any idea what the issue could be?
What can we check to better troubleshoot?
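
In case it helps, these are the standard diagnostic calls we can run while the load test is going on (the index name below is just a placeholder for ours):

    # busiest threads per node (the same API that produced the output below)
    GET _nodes/hot_threads

    # activity, queueing and rejections of the search thread pool on each node
    GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected

    # per-node CPU, load, GC and search statistics
    GET _nodes/stats/os,jvm,indices

    # shard sizes and how they are spread across the 3 nodes
    GET _cat/shards/my_index?v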

Another piece of info that I've seen many people ask for is the hot_threads output, so here it is from one of the nodes (the most overloaded one):

::: {elasticsear6060}{}{}
   Hot threads at 2018-02-28T15:17:27.866Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   41.9% (209.3ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsear6060][search][T#3]'
     4/10 snapshots sharing following 7 elements
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:644)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     2/10 snapshots sharing following 12 elements
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:114)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:252)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:267)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:340)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:337)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1533)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     4/10 snapshots sharing following 7 elements
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1533)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
   41.8% (208.7ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsear6060][search][T#1]'
     5/10 snapshots sharing following 7 elements
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:644)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     5/10 snapshots sharing following 7 elements
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1533)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

I think you should share a typical query, a sample document and the mapping.

This is a sample aggregation query (I know it's quite big...):

And this is another sample query:

This is the mapping:

I looked at the first query.

I don't understand why you are repeating the bool part inside the agg, since aggs are applied by default to the result set (which has already been filtered).
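
For example, something along these lines is usually enough, since the terms aggregation only runs over the documents matched by the top-level query (heavily simplified, with made-up index and field names):

    GET my_index/_search
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": [
            { "term": { "status": "active" } }
          ]
        }
      },
      "aggs": {
        "by_category": {
          "terms": { "field": "category" }
        }
      }
    }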

When I looked at the main query: you have a lot of should clauses there. That is, I guess, why you are spending so much time running the queries.
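
If some of those should clauses only need to match and don't need to contribute to scoring, one thing worth trying is moving them into the filter context, so Elasticsearch can skip scoring them and cache them. A rough sketch with made-up fields (note the explicit minimum_should_match so that at least one clause still has to match):

    {
      "query": {
        "bool": {
          "filter": {
            "bool": {
              "minimum_should_match": 1,
              "should": [
                { "term": { "tag": "foo" } },
                { "term": { "tag": "bar" } }
              ]
            }
          }
        }
      }
    }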

You can try the Search Profiler tool, available with X-Pack Basic in the Kibana interface, to get more insight into where the time is being spent.
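
If you prefer the raw API, you can get the same breakdown by adding "profile": true to the search body; a minimal sketch (index and field names are placeholders):

    GET my_index/_search
    {
      "profile": true,
      "query": {
        "match": { "title": "foo" }
      }
    }

The response then contains, for every shard, how long each query component and collector took to execute.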

cc @jpountz, who might have ideas on how to optimize your queries.

As you said, your index is not especially large, but on the other hand you are using quite expensive features like parent/child and nested. You might also not be using aggregations optimally, as David noted.
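
To give an idea of what "expensive" means here (the type, path and field names below are made up, only the shape of the clauses matters), queries like:

    {
      "query": {
        "has_child": {
          "type": "comment",
          "query": { "match": { "text": "foo" } }
        }
      }
    }

or:

    {
      "query": {
        "nested": {
          "path": "comments",
          "query": { "match": { "comments.text": "foo" } }
        }
      }
    }

have to do a join-like step across documents at search time, which costs a lot more CPU than a plain match or term query on a flattened document.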
