After increasing heap size, my search failed requests suddenly increased significantly, why?


(Cherry Zhang) #1

Initially, I set heap size to 2g, with 500 concurrent users in jmeter test scripts, the result is acceptable:

summary + 3750 in 00:00:30 = 125.0/s Avg: 3914 Min: 0 Max: 7517 Err: 7 (0.19%) Active: 500 Started: 500 Finished: 0
summary = 85013 in 00:10:01 = 141.5/s Avg: 3214 Min: 0 Max: 27116 Err: 25 (0.03%)
summary + 4714 in 00:00:30 = 156.9/s Avg: 3217 Min: 0 Max: 8294 Err: 2 (0.04%) Active: 500 Started: 500 Finished: 0
summary = 89727 in 00:10:31 = 142.2/s Avg: 3214 Min: 0 Max: 27116 Err: 27 (0.03%)

but after I expand heap size to 4g, the failure rate suddenly increased a lot:

summary = 101099 in 00:10:05 = 167.1/s Avg: 2729 Min: 0 Max: 10645 Err: 4816 (4.76%)
summary + 5844 in 00:00:30 = 194.4/s Avg: 2603 Min: 0 Max: 6303 Err: 421 (7.20%) Active: 500 Started: 500 Finished: 0
summary = 106943 in 00:10:35 = 168.4/s Avg: 2722 Min: 0 Max: 10645 Err: 5237 (4.90%)
summary + 5746 in 00:00:30 = 192.7/s Avg: 2579 Min: 0 Max: 6456 Err: 423 (7.36%) Active: 500 Started: 500 Finished: 0
summary = 112689 in 00:11:05 = 169.5/s Avg: 2714 Min: 0 Max: 10645 Err: 5660 (5.02%)

and I set back to 2g, the failure rate kept growing:
summary + 9938 in 00:00:30 = 331.5/s Avg: 1547 Min: 0 Max: 4150 Err: 2039 (20.52%) Active: 500 Started: 500 Finished: 0
summary = 123983 in 00:07:53 = 261.9/s Avg: 1698 Min: 0 Max: 7726 Err: 27478 (22.16%)
summary + 9930 in 00:00:30 = 331.0/s Avg: 1465 Min: 0 Max: 4117 Err: 1987 (20.01%) Active: 500 Started: 500 Finished: 0
summary = 133913 in 00:08:23 = 266.0/s Avg: 1681 Min: 0 Max: 7726 Err: 29465 (22.00%)
summary + 10212 in 00:00:30 = 340.5/s Avg: 1456 Min: 0 Max: 3628 Err: 2201 (21.55%) Active: 500 Started: 500 Finished: 0
summary = 144125 in 00:08:53 = 270.2/s Avg: 1665 Min: 0 Max: 7726 Err: 31666 (21.97%)

there is no OOM error,

[2016-11-04 02:36:31,065][DEBUG][action.search ] [Hazmat] [test][0], node[fq4R4SE9QBaWLLRU3BvI3Q], [P], v[18], s[STARTED], a[id=L6TvrcpAST2YrpsDE3nsRg]: Failed to execute [org.elasticsearch.action.search.SearchRequest@2b02bfa5] lastShard [true]
RemoteTransportException[[Hazmat][127.0.0.1:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@56ef01c5 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1cab2829[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 303297]]];
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@56ef01c5 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1cab2829[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 303297]]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
at org.elasticsearch.transport.TransportService.sendLocalRequest(TransportService.java:372)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:327)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:299)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:142)
at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.sendExecuteFirstPhase(SearchQueryThenFetchAsyncAction.java:66)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.performFirstPhase(AbstractSearchAsyncAction.java:144)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:126)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:115)

I want to what might cause this kind of situation? How can I make my test resume to its initial status, with 500 concurrent users but only 0.03 failure rate?

P.S my machine is ubuntu16.04 os, with 20g memory and 4cpus.


(Mark Walkom) #2

Looks like you are exceeding your queue capacity.
What does the jmeter config look like?


(Cherry Zhang) #3

with 500 threads, each send 2 query document requests. My concern is why the failed requests increase when expanding the heap size, is there a clue for me to figure out the root cause?


(Runar Myklebust) #4

Since this happened after the heap-size-increase; consider logging the garbage-collection times just to check if the CG's are stalling the system.


(system) #5