Need help to overcome 100% CPU

Gang · April 12, 2017, 9:29am

Hi everybody, I'm stuck on cluster performance tuning/scaling and need help.

I'm implementing a solution on top of Elasticsearch. I've now started load testing, and 5 parallel request threads bring my cluster down.
In short, I have this configuration:

1 node - all roles, 2x6 cores, 64 GB RAM.
ES_HEAP_SIZE=30g
bootstrap.mlockall: true
indices.fielddata.cache.size: 20%
network.tcp.blocking: true
Others by default

My main index is now 5+ million documents and about 80 GB.
For future expansion, it's laid out with 12 shards.

The basic query is quite heavy. It filters on nothing but 2 types (there are only 2 of them for now) but runs several (5-7) aggregations over the whole set of documents. With 1 thread the query time is acceptable, about 350-700 ms. But in multithreaded test mode the CPU immediately shoots up to 100%.
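
A rough way to see why a handful of parallel requests can saturate this node: each search fans out into one shard-level task per shard, and each task keeps a search thread busy while it runs. The sketch below only does that arithmetic with the numbers from this post (2x6 cores, 12 shards, 5 client threads). It assumes that every request touches all 12 shards and that the aggregation work is CPU-bound; the function names are made up for illustration.

```python
def shard_tasks(concurrent_requests: int, shards_per_request: int) -> int:
    """Shard-level search tasks in flight if every request hits every shard."""
    return concurrent_requests * shards_per_request


def tasks_per_core(concurrent_requests: int, shards_per_request: int, cores: int) -> float:
    """How many CPU-bound shard tasks compete for each core."""
    return shard_tasks(concurrent_requests, shards_per_request) / cores


if __name__ == "__main__":
    # Numbers from the post: 2x6 cores, 12 shards, 5 parallel client threads.
    print(shard_tasks(5, 12))         # 60
    print(tasks_per_core(5, 12, 12))  # 5.0
```

Under these assumptions a single request can already keep all 12 cores busy, so 5 concurrent requests mean roughly 5 runnable tasks per core.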

In hot_threads I see:
100.1% (500.4ms out of 500ms) cpu usage by thread 'elasticsearch[node-4][search][T#9]
94.7% (473.6ms out of 500ms) cpu usage by thread 'elasticsearch[node-4][search][T#15]
92.8% (463.9ms out of 500ms) cpu usage by thread 'elasticsearch[node-4][search][T#25]
(Can provide more details if needed)
And I even see EsRejectedExecutionException in els.log.
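
To summarize output like the above, here is a minimal sketch that sums the reported CPU percentages per thread pool. It assumes the text comes from `GET /_nodes/hot_threads` and that thread names follow the `elasticsearch[<node>][<pool>][T#<n>]` pattern seen in the lines above; `busy_by_pool` is a hypothetical helper, not part of any Elasticsearch client.

```python
import re

# Summary line printed for each busy thread in the hot threads output.
HOT_LINE = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)% \((?P<busy>[\d.]+)ms out of (?P<interval>[\d.]+)ms\) "
    r"cpu usage by thread '(?P<thread>[^']+)"
)
# Thread names look like elasticsearch[<node>][<pool>][T#<n>].
POOL = re.compile(r"\[(?P<node>[^\]]+)\]\[(?P<pool>[^\]]+)\]\[T#\d+\]")


def busy_by_pool(hot_threads_text: str) -> dict:
    """Sum the reported CPU percentages per thread pool name."""
    totals = {}
    for line in hot_threads_text.splitlines():
        m = HOT_LINE.search(line)
        if not m:
            continue  # stack frames and headers are ignored
        p = POOL.search(m.group("thread"))
        pool = p.group("pool") if p else "other"
        totals[pool] = totals.get(pool, 0.0) + float(m.group("pct"))
    return totals


if __name__ == "__main__":
    sample = """\
100.1% (500.4ms out of 500ms) cpu usage by thread 'elasticsearch[node-4][search][T#9]
94.7% (473.6ms out of 500ms) cpu usage by thread 'elasticsearch[node-4][search][T#15]
92.8% (463.9ms out of 500ms) cpu usage by thread 'elasticsearch[node-4][search][T#25]
"""
    for pool, pct in busy_by_pool(sample).items():
        print(pool, round(pct, 1))
```

If nearly all of the hot time lands in the `search` pool, as in the three lines quoted here, the rejections most likely come from that pool's queue filling up.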

If I profile the query, I see that most of the time (and apparently the CPU cost) goes to the aggregations.
"took": 462,
.....
"query": [
{
"query_type": "ConstantScoreQuery",
"lucene": "ConstantScore((ConstantScore(_type:bidutp) ConstantScore(_type:prgos))~1)",
"time": "81.26062800ms",
.....
"name": "MultiCollector",
"reason": "search_multi",
"time": "348.6118540ms",
(Can post full if needed)
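
To put a number on that split, here is a small sketch that parses the two `time` strings from the profile excerpt above and computes the share attributed to the collector. Treating the query time and the collector time as additive is an assumption made for this estimate, and the figures describe a single profiled search, not the whole load test. The helper names are hypothetical.

```python
def parse_ms(value: str) -> float:
    """Turn a profile time string such as '81.26062800ms' into milliseconds."""
    if not value.endswith("ms"):
        raise ValueError(f"unexpected time format: {value!r}")
    return float(value[:-2])


def collector_share(query_time: str, collector_time: str) -> float:
    """Fraction of the profiled time spent in the collector (where aggregations run)."""
    q = parse_ms(query_time)
    c = parse_ms(collector_time)
    return c / (q + c)


if __name__ == "__main__":
    # Values copied from the profile excerpt in this post.
    share = collector_share("81.26062800ms", "348.6118540ms")
    print(f"collector share: {share:.0%}")
```

With the values above, the collector accounts for roughly four fifths of the profiled time.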

So now we come to the questions.
What am I doing wrong?
Is it a matter of shard count, or should I reduce the heap, or monitor GC?
Or should I investigate something else?
I will certainly add some nodes to my cluster (2-3, I don't have tons of them in my pocket), but I need to understand whether this will be enough.

I would be grateful for any advice.

Gang · April 12, 2017, 3:00pm

Guys, please give me a hint on where to dig further.

Gang · April 14, 2017, 3:04pm

I've added 2 nodes with 8 cores each. Now the cluster handles 9 requests/sec before hitting 100% CPU. It seems I need tuning more than hardware expansion, but I don't know what to tune (((

jpountz · April 14, 2017, 3:12pm

Can you share the output of the nodes hot threads API under load?

Gang · April 14, 2017, 3:23pm

In 2 hours. On my way home now

Gang · April 14, 2017, 5:45pm

Hello again! Gist of the hot threads output: https://gist.github.com/anonymous/8b0da9373cdf0a6e87be67fb46c9419c

Gang · April 14, 2017, 5:52pm

And the query: https://gist.github.com/anonymous/919b5d2059d6543f9475ebd68dc2184b

Gang · April 14, 2017, 5:57pm

I'm on 2.4.4, if that matters.

Gang · April 17, 2017, 11:11am

Now I have nginx with a POST cache in front of Elasticsearch. It helps a bit against mindless F5 refreshes on the main page, but I still have trouble with the CPU cost of the query. Can someone help?

Christian_Dahlqvist · April 17, 2017, 11:15am

What do disk I/O and iowait look like? What type of storage do you have?
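
One way to get an iowait figure on a Linux host is to compare two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal) and takes the two lines as strings so it can be tried without touching the node; `iowait_percent` is a hypothetical helper.

```python
def cpu_fields(stat_line: str) -> list:
    """Parse the aggregate 'cpu' line of /proc/stat into integer tick counters."""
    parts = stat_line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:]]


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two samples, in percent."""
    a, b = cpu_fields(before), cpu_fields(after)
    # Only the first 8 counters are summed; guest time is already
    # included in user/nice and would otherwise be counted twice.
    deltas = [y - x for x, y in zip(a[:8], b[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample lines; on a real host read the first line of /proc/stat twice.
    first = "cpu 100 0 50 800 50 0 0 0 0 0"
    second = "cpu 200 0 100 1600 100 0 0 0 0 0"
    print(iowait_percent(first, second))
```

A value that stays low while user CPU is pinned at 100% points at a CPU-bound workload rather than a storage bottleneck.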

Gang · April 17, 2017, 11:31am

I do not see any change in the disk load. It stays below 5% regardless of my tests. I think my 30 GB/node cache prevents the load from reaching the disk.

Gang · April 17, 2017, 11:34am

CPU under test (not the heaviest one)

Gang · April 17, 2017, 11:36am

And disk utilization at the same time

Gang · April 17, 2017, 2:34pm

Could hashed fields be handy for my cardinality aggregation?
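
For reference, a sketch of what hashed fields look like on 2.x, assuming the mapper-murmur3 plugin is installed: the string field gets a `murmur3` sub-field, and the cardinality aggregation targets that sub-field. The field name `customer_id` and the helper names are made up for illustration, existing documents would have to be reindexed before the sub-field is populated, and whether this helps for the query in this thread is untested.

```python
def hashed_field_mapping(field: str) -> dict:
    """Mapping fragment: a string field with a murmur3 'hash' sub-field."""
    return {
        "properties": {
            field: {
                "type": "string",
                # Sub-field computed at index time by the mapper-murmur3 plugin.
                "fields": {"hash": {"type": "murmur3"}},
            }
        }
    }


def cardinality_on_hash(field: str, agg_name: str = "distinct_values") -> dict:
    """Aggregation body that counts distinct values via the precomputed hash."""
    return {"aggs": {agg_name: {"cardinality": {"field": field + ".hash"}}}}


if __name__ == "__main__":
    import json

    # "customer_id" is a hypothetical field name.
    print(json.dumps(hashed_field_mapping("customer_id"), indent=2))
    print(json.dumps(cardinality_on_hash("customer_id"), indent=2))
```

Precomputed hashes mainly pay off on long or high-cardinality string values, so the gain on short or numeric fields may be small.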

Gang · April 20, 2017, 7:50am

If anybody is interested, I'm still facing the problem. Any advice, please?

jpountz · April 20, 2017, 1:50pm

The hot threads suggest your node is simply busy running aggregations, and I can't think of a way to speed this up significantly. I think you just need to add more processing power. One thing in the hot threads surprised me: you seem to be using the niofs directory. Did you opt in to it explicitly? Switching to mmapfs might help by reading directly from the FS cache rather than copying memory from the FS cache into Java. But I don't expect it to bring a significant speedup.
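
For illustration, a minimal sketch of the settings body that selects a store type when an index is created, using the `index.store.type` setting. As far as I know this is a static setting on 2.x (set at index creation or in elasticsearch.yml), so treat the sketch as an assumption to verify against the documentation for your version. The helper name is hypothetical, and only the two store types mentioned in this thread are accepted.

```python
# Store types mentioned in this thread; other values exist but are not covered here.
THREAD_STORE_TYPES = ("niofs", "mmapfs")


def create_index_settings(store_type: str) -> dict:
    """Request body for creating an index with an explicit store type."""
    if store_type not in THREAD_STORE_TYPES:
        raise ValueError(f"store type not covered by this sketch: {store_type!r}")
    return {"settings": {"index.store.type": store_type}}


if __name__ == "__main__":
    import json

    # Body for a create-index request (PUT /<new index name>).
    print(json.dumps(create_index_settings("mmapfs"), indent=2))
```

Checking the current index settings and elasticsearch.yml first would show whether niofs was chosen explicitly or picked up as a default.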

Gang · April 20, 2017, 4:00pm

Adrien, thanks for your reply! After several days of research, I have come to the same conclusion. I was only hoping that I had missed something in the configuration. I will discuss your remark about the FS with our OS administrators. Thanks again.

Gang · April 27, 2017, 9:00am

Can someone explain the procedure for switching from niofs to mmapfs? Is the setting dynamic, or will I need to close/open or rebuild my index?

system · May 25, 2017, 9:02am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
