After increasing heap size, my search failed requests suddenly increased significantly, why?

Initially, I set heap size to 2g, with 500 concurrent users in jmeter test scripts, the result is acceptable:

summary + 3750 in 00:00:30 = 125.0/s Avg: 3914 Min: 0 Max: 7517 Err: 7 (0.19%) Active: 500 Started: 500 Finished: 0
summary = 85013 in 00:10:01 = 141.5/s Avg: 3214 Min: 0 Max: 27116 Err: 25 (0.03%)
summary + 4714 in 00:00:30 = 156.9/s Avg: 3217 Min: 0 Max: 8294 Err: 2 (0.04%) Active: 500 Started: 500 Finished: 0
summary = 89727 in 00:10:31 = 142.2/s Avg: 3214 Min: 0 Max: 27116 Err: 27 (0.03%)

but after I expand heap size to 4g, the failure rate suddenly increased a lot:

summary = 101099 in 00:10:05 = 167.1/s Avg: 2729 Min: 0 Max: 10645 Err: 4816 (4.76%)
summary + 5844 in 00:00:30 = 194.4/s Avg: 2603 Min: 0 Max: 6303 Err: 421 (7.20%) Active: 500 Started: 500 Finished: 0
summary = 106943 in 00:10:35 = 168.4/s Avg: 2722 Min: 0 Max: 10645 Err: 5237 (4.90%)
summary + 5746 in 00:00:30 = 192.7/s Avg: 2579 Min: 0 Max: 6456 Err: 423 (7.36%) Active: 500 Started: 500 Finished: 0
summary = 112689 in 00:11:05 = 169.5/s Avg: 2714 Min: 0 Max: 10645 Err: 5660 (5.02%)

and I set back to 2g, the failure rate kept growing:
summary + 9938 in 00:00:30 = 331.5/s Avg: 1547 Min: 0 Max: 4150 Err: 2039 (20.52%) Active: 500 Started: 500 Finished: 0
summary = 123983 in 00:07:53 = 261.9/s Avg: 1698 Min: 0 Max: 7726 Err: 27478 (22.16%)
summary + 9930 in 00:00:30 = 331.0/s Avg: 1465 Min: 0 Max: 4117 Err: 1987 (20.01%) Active: 500 Started: 500 Finished: 0
summary = 133913 in 00:08:23 = 266.0/s Avg: 1681 Min: 0 Max: 7726 Err: 29465 (22.00%)
summary + 10212 in 00:00:30 = 340.5/s Avg: 1456 Min: 0 Max: 3628 Err: 2201 (21.55%) Active: 500 Started: 500 Finished: 0
summary = 144125 in 00:08:53 = 270.2/s Avg: 1665 Min: 0 Max: 7726 Err: 31666 (21.97%)

there is no OOM error,

[2016-11-04 02:36:31,065][DEBUG][action.search ] [Hazmat] [test][0], node[fq4R4SE9QBaWLLRU3BvI3Q], [P], v[18], s[STARTED], a[id=L6TvrcpAST2YrpsDE3nsRg]: Failed to execute [org.elasticsearch.action.search.SearchRequest@2b02bfa5] lastShard [true]
RemoteTransportException[[Hazmat][127.0.0.1:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@56ef01c5 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1cab2829[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 303297]]];
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@56ef01c5 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1cab2829[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 303297]]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
at org.elasticsearch.transport.TransportService.sendLocalRequest(TransportService.java:372)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:327)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:299)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:142)
at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.sendExecuteFirstPhase(SearchQueryThenFetchAsyncAction.java:66)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.performFirstPhase(AbstractSearchAsyncAction.java:144)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:126)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:115)

I want to what might cause this kind of situation? How can I make my test resume to its initial status, with 500 concurrent users but only 0.03 failure rate?

P.S my machine is ubuntu16.04 os, with 20g memory and 4cpus.

Looks like you are exceeding your queue capacity.
What does the jmeter config look like?

with 500 threads, each send 2 query document requests. My concern is why the failed requests increase when expanding the heap size, is there a clue for me to figure out the root cause?

Since this happened after the heap-size-increase; consider logging the garbage-collection times just to check if the CG's are stalling the system.