Many slow queries and high CPU load during load test (50 req/sec)

Hello everyone,
I have some problems running Elasticsearch in production.

Before deploying the application to production, we want to run load tests on a beta machine to make sure it can handle the traffic.

The beta machine's hardware specs are as follows:

CPU vendor: Intel
CPU model: Xeon (2001 MHz)
CPU total logical cores: 24
CPU cache: 15kb

VM name: Java HotSpot(TM) 64-Bit Server VM
VM vendor: Oracle Corporation
VM version: 25.31-b07
Java version: 1.8.0_31

Index Settings and Mappings:

Settings: http://pastebin.com/Juin26k8

Mappings: http://pastebin.com/W3CqtiFU

Sample Query

http://pastebin.com/W5PgRVf7

After running these queries with similar terms at around 50-60 requests per second, search requests suddenly get slower, jumping from about 60 ms to 2-3 seconds.

One of the lines from slow_search.log:

[metadatav3][1] took[1.9s], took_millis[1996], types[track], stats[], search_type[DFS_QUERY_THEN_FETCH], total_shards[5], source[{"from":0,"size":20,"timeout":5000,"query":{"function_score":{"query":{"bool":{"must":{"match":{"fullString":{"query":"Charlie Feathers","type":"boolean","operator":"OR","minimum_should_match":"6","cutoff_frequency":0.01}}},"must_not":{"term":{"isHidden":"true"}},"should":[{"match":{"fullString.token":{"query":"Charlie Feathers","type":"phrase"}}},{"match":{"fullString.edgeNGNoSplit":{"query":"Charlie Feathers","type":"phrase"}}}]}},"field_value_factor":{"field":"popularity"}}},"explain":false}], extra_source[],
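For readability, here is the query body from that slow-log entry, re-indented (same JSON, unchanged):

  {
    "from": 0,
    "size": 20,
    "timeout": 5000,
    "query": {
      "function_score": {
        "query": {
          "bool": {
            "must": {
              "match": {
                "fullString": {
                  "query": "Charlie Feathers",
                  "type": "boolean",
                  "operator": "OR",
                  "minimum_should_match": "6",
                  "cutoff_frequency": 0.01
                }
              }
            },
            "must_not": {
              "term": { "isHidden": "true" }
            },
            "should": [
              { "match": { "fullString.token": { "query": "Charlie Feathers", "type": "phrase" } } },
              { "match": { "fullString.edgeNGNoSplit": { "query": "Charlie Feathers", "type": "phrase" } } }
            ]
          }
        },
        "field_value_factor": { "field": "popularity" }
      }
    },
    "explain": false
  }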

My main questions, which I do not fully understand:

1. We are going to have two nodes, but the index is configured with 5 shards. We are leaving it at 5 shards because we might add more nodes in the future.
Could the shard configuration cause this CPU load?

2. The analyzers and the query do not seem complex to me, but I am not really an expert.
Could the query and analyzer settings cause this CPU load?

I also found this thread, but it did not actually help me:
https://groups.google.com/forum/#!searchin/elasticsearch/load$20test/elasticsearch/8wZgMdEzHvk/i0PxqRNYYm4J

Maybe have a look at the hot threads before and after the slowdown to try to understand what changes.
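Something along these lines should capture them (a sketch; adjust host/port as needed):

  # capture hot threads while requests are still fast
  curl -s 'localhost:9200/_nodes/hot_threads?threads=3' > hot_threads_before.txt
  # ...run the load test until latency degrades, then capture again...
  curl -s 'localhost:9200/_nodes/hot_threads?threads=3' > hot_threads_after.txt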

Hey Adrien,
Here is the hot thread dump.

Also, if you want, I can send a copy of the full thread dump.

Could the shard configuration cause this CPU load?

In this case it is likely not the shard configuration that is causing the CPU load.

Could the query and analyzer settings cause this CPU load?

It's possible. A good way to test this is to split your query into its separate parts: for example, run the field_value_factor as one query, the fullString match as another, the fullString.token phrase match as another, and so on.

Then you can run each query by itself and see which is taking the bulk of the time.
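For example, the pieces could be benchmarked individually along these lines (a sketch based on the slow-log query above; index and type names taken from it):

  # 1. just the main fullString match
  curl -s 'localhost:9200/metadatav3/track/_search' -d '{
    "size": 20,
    "query": {
      "match": {
        "fullString": { "query": "Charlie Feathers", "cutoff_frequency": 0.01 }
      }
    }
  }'

  # 2. just the fullString.token phrase match
  curl -s 'localhost:9200/metadatav3/track/_search' -d '{
    "size": 20,
    "query": { "match_phrase": { "fullString.token": "Charlie Feathers" } }
  }'

  # 3. just the function_score / field_value_factor part on a match_all
  curl -s 'localhost:9200/metadatav3/track/_search' -d '{
    "size": 20,
    "query": {
      "function_score": {
        "query": { "match_all": {} },
        "field_value_factor": { "field": "popularity" }
      }
    }
  }'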

The hot threads look sane. When the throughput suddenly decreases, is your machine still fully utilized (CPU and/or I/O at 100%) or does it look idle? Can you correlate the slowdown with garbage collection activity?
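For example, GC collector counts and times can be pulled from the node stats API before and after a run (a sketch; adjust host/port):

  # per-node heap usage and GC collector counts/times
  curl -s 'localhost:9200/_nodes/stats/jvm?pretty'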

I will try these queries separately and share the results.

What do you mean by throughput decreases? Do you mean that responses start taking longer? If so, then yes, the CPU stays fully utilized during the load test; when I check with the "top" command I can see that %CPU is around 2200-2300.
If you want, I can also grab a screenshot from the Marvel plugin during the load test for garbage collection monitoring.

That could be helpful, thanks.

Hello Adrien, I am attaching query.json, a Marvel screenshot taken during a load test, and detailed Gatling output for the load test.

I do not know whether you are familiar with the Gatling tool, but there is an index.html under the folder (a2-1446748547160); you can view the details there if you would like.
I will be online for the next 4-5 hours, so you can ask me anything or request another load test with different parameters.

Hey Adrien, do you need any additional info?
Also, I might have found the reason for the really high load average:
the search threadpool type (threadpool.search.type) was set to cached. I guess this type is unbounded, which is why it generated almost 2K threads; I think context switching might have been occupying the CPU instead of the actual search work.

However, we now get EsRejectedExecutionException, probably due to the fixed threadpool and its bounded queue size.
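For reference, the override looks roughly like this in elasticsearch.yml (a sketch; the exact size and queue values here are assumptions, not necessarily the ones we used):

  threadpool.search.type: fixed        # was: cached (unbounded)
  threadpool.search.size: 24           # e.g. one thread per logical core (assumption)
  threadpool.search.queue_size: 1000   # requests beyond this get rejected (assumption)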

I am still confused; does that mean I need new machines if I still want to use ngrams?
Mehmet -

Wow, indeed cached is not a good type for the search threadpool, and 2k threads can easily cause issues with your scheduler. We recently changed thread pool types to not be updateable to avoid such problems: https://github.com/elastic/elasticsearch/pull/14367

The EsRejectedExecutionException tells you that you are sending more requests than Elasticsearch can handle: it means that not only are all threads of the search thread pool busy, but the queue is also full. When this happens, you need to stop sending requests for a while so that space frees up in the queue, and then start again. The same mechanism is used at index time when indexing too many documents concurrently.
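If it helps, you can watch the search pool's active/queue/rejected counters while the test runs (a sketch; column names per the cat API):

  # shows active threads, queued requests and rejections for the search pool
  curl -s 'localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected'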

So what is your suggestion about ngrams?
I am actually shocked, and a few possible ideas came to mind:

  1. Everyone using ngrams adds machines to support heavy loads.
  2. Nobody cares about more than 50 req/sec.
  3. I am really missing a point.

Have you checked the GC graphs?
I saw some young-generation GC activity during these loads; is that normal?