[search_phase_execution_exception Caused by: es_rejected_execution_exception: rejected execution

I am using the Grafana k6 stress tool to generate some load on Elasticsearch, and sooner or later I always get the error es_rejected_execution_exception. All requests are search operations that fetch 30 docs.

  • On small indices of 20-30 MB it works quite well and response time is about 1 s, until it reaches 500 virtual users and starts giving the error.

  • On other indices of around 300 MB it fails quite early, after just a few requests; response time climbs to 14 s and the requests fail.

The same happens if we just do 3000 sequential requests instead of parallel users, which is not much; I think this is a very low load for Elasticsearch. Once it gets saturated, Elasticsearch starts rejecting queries and shows this error in monitoring: Node xxx is reporting 10800 search rejections at day XXX

Elasticsearch is 8.9.1 on Kubernetes, 1 node with 8 GB of RAM and a 4 GB heap.

In monitoring I don't see any metric going very high.

Can anyone explain what this error means and what causes it? Memory? CPU? Is there any optimization to do?

Error:

[search_phase_execution_exception Caused by: es_rejected_execution_exception: rejected execution of TimedRunnable{original=org.elasticsearch.action.search.FetchSearchPhase$1@115599a3, creationTimeNanos=36855012795314, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = elasticsearch-es-master-0/search, queue capacity = 1000, task execution EWMA = 164ms, total task execution time = 18m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@20706e70[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 144548]] Root causes: es_rejected_execution_exception: rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda$7936/0x00007f344d306668@33107600], creationTimeNanos=36855012420891, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = elasticsearch-es-master-0/search, queue capacity = 1000, task execution EWMA = 164ms, total task execution time = 18m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@20706e70[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 144548]]: search_phase_execution_exception Caused by: es_rejected_execution_exception: rejected execution of TimedRunnable{original=org.elasticsearch.action.search.FetchSearchPhase$1@115599a3, creationTimeNanos=36855012795314, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = elasticsearch-es-master-0/search, queue capacity = 1000, task execution EWMA = 164ms, total task execution time = 18m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@20706e70[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 144548]] Root causes: es_rejected_execution_exception: rejected execution of TimedRunnable{original=ActionRunnable#wrap[org.elasticsearch.search.SearchService$$Lambda$7936/0x00007f344d306668@33107600], creationTimeNanos=36855012420891, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} on TaskExecutionTimeTrackingEsThreadPoolExecutor[name = elasticsearch-es-master-0/search, queue capacity = 1000, task execution EWMA = 164ms, total task execution time = 18m, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@20706e70[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 144548]]]:
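
The important part of that stack trace is the state of the search thread pool: pool size = 4, active threads = 4, queued tasks = 1000 against a queue capacity of 1000, so any additional search task is rejected. A quick way to watch this while the load test runs is the cat thread pool API; below is a minimal sketch in TypeScript for Node 18+ (the ES_URL value and the lack of authentication are assumptions, adjust to your setup):

// Poll the search thread pool so you can see the queue fill up and rejections grow.
// Assumes Node 18+ (global fetch) and an Elasticsearch endpoint without auth.
const ES_URL = process.env.ES_URL ?? 'http://localhost:9200';

async function searchPoolSnapshot(): Promise<void> {
  const res = await fetch(
    `${ES_URL}/_cat/thread_pool/search?format=json&h=node_name,name,active,queue,rejected,completed`
  );
  const rows: Array<Record<string, string>> = await res.json();
  // "queue" > 0 means searches are waiting; "rejected" grows once the 1000-slot queue is full.
  console.table(rows);
}

// Run it every few seconds during the k6 test:
setInterval(() => searchPoolSnapshot().catch(console.error), 5000);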

What is the full output of the cluster stats API?

What is the CPU allocation?

This is the output of /_cluster/stats while running the test queries:

The CPU request and limit are 2 CPUs.

{
"_nodes": {
"total": 1,
"successful": 1,
"failed": 0
},
"cluster_name": "elasticsearch",
"cluster_uuid": "xxx",
"timestamp": 1693514057160,
"status": "yellow",
"indices": {
"count": 190,
"shards": {
"total": 190,
"primaries": 190,
"replication": 0,
"index": {
"shards": {
"min": 1,
"max": 1,
"avg": 1
},
"primaries": {
"min": 1,
"max": 1,
"avg": 1
},
"replication": {
"min": 0,
"max": 0,
"avg": 0
}
}
},
"docs": {
"count": 37647365,
"deleted": 2764735
},
"store": {
"size_in_bytes": 14779611603,
"total_data_set_size_in_bytes": 14779611603,
"reserved_in_bytes": 0
},
"fielddata": {
"memory_size_in_bytes": 110056,
"evictions": 0,
"global_ordinals": {
"build_time_in_millis": 580
}
},
"query_cache": {
"memory_size_in_bytes": 8539942,
"total_count": 1163459,
"hit_count": 422535,
"miss_count": 740924,
"cache_size": 5563,
"cache_count": 13079,
"evictions": 7516
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 1178,
"memory_in_bytes": 0,
"terms_memory_in_bytes": 0,
"stored_fields_memory_in_bytes": 0,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 0,
"points_memory_in_bytes": 0,
"doc_values_memory_in_bytes": 0,
"index_writer_memory_in_bytes": 317696,
"version_map_memory_in_bytes": 179,
"fixed_bit_set_memory_in_bytes": 961688,
"max_unsafe_auto_id_timestamp": 1693502487357,
"file_sizes": {}
},
"mappings": {
"total_field_count": 16758,
"total_deduplicated_field_count": 7840,
"total_deduplicated_mapping_size_in_bytes": 76399,
"field_types": [
{
"name": "alias",
"count": 76,
"index_count": 1,
"script_count": 0
},
{
"name": "binary",
"count": 2,
"index_count": 2,
"script_count": 0
},
{
"name": "boolean",
"count": 327,
"index_count": 69,
"script_count": 0
},
{
"name": "constant_keyword",
"count": 6,
"index_count": 2,
"script_count": 0
},
{
"name": "date",
"count": 779,
"index_count": 168,
"script_count": 0
},
{
"name": "date_range",
"count": 6,
"index_count": 6,
"script_count": 0
},
{
"name": "double",
"count": 93,
"index_count": 19,
"script_count": 0
},
{
"name": "flattened",
"count": 39,
"index_count": 6,
"script_count": 0
},
{
"name": "float",
"count": 168,
"index_count": 43,
"script_count": 0
},
{
"name": "geo_point",
"count": 24,
"index_count": 3,
"script_count": 0
},
{
"name": "half_float",
"count": 56,
"index_count": 14,
"script_count": 0
},
{
"name": "integer",
"count": 648,
"index_count": 101,
"script_count": 0
},
{
"name": "ip",
"count": 53,
"index_count": 5,
"script_count": 0
},
{
"name": "keyword",
"count": 7235,
"index_count": 173,
"script_count": 0
},
{
"name": "long",
"count": 2221,
"index_count": 107,
"script_count": 0
},
{
"name": "match_only_text",
"count": 270,
"index_count": 3,
"script_count": 0
},
{
"name": "nested",
"count": 385,
"index_count": 80,
"script_count": 0
},
{
"name": "object",
"count": 2153,
"index_count": 51,
"script_count": 0
},
{
"name": "scaled_float",
"count": 24,
"index_count": 6,
"script_count": 0
},
{
"name": "short",
"count": 89,
"index_count": 44,
"script_count": 0
},
{
"name": "text",
"count": 1991,
"index_count": 157,
"script_count": 0
},
{
"name": "version",
"count": 13,
"index_count": 13,
"script_count": 0
},
{
"name": "wildcard",
"count": 100,
"index_count": 37,
"script_count": 0
}
],
"runtime_field_types":
},
"analysis": {
"char_filter_types": ,
"tokenizer_types": [
{
"name": "path_hierarchy",
"count": 124,
"index_count": 124
}
],
"filter_types": [
{
"name": "lowercase",
"count": 124,
"index_count": 124
}
],
"analyzer_types": [
{
"name": "custom",
"count": 372,
"index_count": 124
}
],
"built_in_char_filters": ,
"built_in_tokenizers": [
{
"name": "standard",
"count": 124,
"index_count": 124
},
{
"name": "whitespace",
"count": 124,
"index_count": 124
}
],
"built_in_filters": ,
"built_in_analyzers": [
{
"name": "ana_path_lc",
"count": 77,
"index_count": 27
},
{
"name": "ana_s_lc",
"count": 1353,
"index_count": 123
},
{
"name": "ana_w_lc",
"count": 1243,
"index_count": 123
}
]
},
"versions": [
{
"version": "8.4.0",
"index_count": 82,
"primary_shard_count": 82,
"total_primary_bytes": 2946839713
},
{
"version": "8.6.2",
"index_count": 71,
"primary_shard_count": 71,
"total_primary_bytes": 7233327463
},
{
"version": "8.9.0",
"index_count": 33,
"primary_shard_count": 33,
"total_primary_bytes": 4247081827
},
{
"version": "8.9.1",
"index_count": 4,
"primary_shard_count": 4,
"total_primary_bytes": 352362600
}
],
"search": {
"total": 94783,
"queries": {
"regexp": 2,
"match_phrase": 5,
"bool": 82800,
"terms": 10696,
"prefix": 420,
"match": 6630,
"match_phrase_prefix": 1,
"exists": 12470,
"range": 26036,
"term": 91921,
"nested": 183,
"simple_query_string": 387
},
"sections": {
"highlight": 19,
"runtime_mappings": 16,
"stored_fields": 38,
"query": 94348,
"script_fields": 38,
"pit": 39,
"_source": 16025,
"fields": 54,
"collapse": 3659,
"aggs": 18119
}
}
},
"nodes": {
"count": {
"total": 1,
"coordinating_only": 0,
"data": 1,
"data_cold": 0,
"data_content": 0,
"data_frozen": 0,
"data_hot": 0,
"data_warm": 0,
"index": 0,
"ingest": 1,
"master": 1,
"ml": 0,
"remote_cluster_client": 1,
"search": 0,
"transform": 0,
"voting_only": 0
},
"versions": [
"8.9.1"
],
"os": {
"available_processors": 2,
"allocated_processors": 2,
"names": [
{
"name": "Linux",
"count": 1
}
],
"pretty_names": [
{
"pretty_name": "Ubuntu 20.04.6 LTS",
"count": 1
}
],
"architectures": [
{
"arch": "amd64",
"count": 1
}
],
"mem": {
"total_in_bytes": 8589934592,
"adjusted_total_in_bytes": 8589934592,
"free_in_bytes": 2310959104,
"used_in_bytes": 6278975488,
"free_percent": 27,
"used_percent": 73
}
},
"process": {
"cpu": {
"percent": 100
},
"open_file_descriptors": {
"min": 2062,
"max": 2062,
"avg": 2062
}
},
"jvm": {
"max_uptime_in_millis": 11587715,
"versions": [
{
"version": "20.0.2",
"vm_name": "OpenJDK 64-Bit Server VM",
"vm_version": "20.0.2+9-78",
"vm_vendor": "Oracle Corporation",
"bundled_jdk": true,
"using_bundled_jdk": true,
"count": 1
}
],
"mem": {
"heap_used_in_bytes": 1405540880,
"heap_max_in_bytes": 4294967296
},
"threads": 58
},
"fs": {
"total_in_bytes": 20957446144,
"free_in_bytes": 6121500672,
"available_in_bytes": 6104723456
},
"plugins": ,
"network_types": {
"transport_types": {
"security4": 1
},
"http_types": {
"security4": 1
}
},
"discovery_types": {
"multi-node": 1
},
"packaging_types": [
{
"flavor": "default",
"type": "docker",
"count": 1
}
],
"ingest": {
"number_of_pipelines": 11,
"processor_stats": {
"date": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"dot_expander": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"foreach": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"geoip": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"json": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"pipeline": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"remove": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"rename": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"script": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"set": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"set_security_user": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"uri_parts": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
},
"user_agent": {
"count": 0,
"failed": 0,
"current": 0,
"time_in_millis": 0
}
}
},
"indexing_pressure": {
"memory": {
"current": {
"combined_coordinating_and_primary_in_bytes": 0,
"coordinating_in_bytes": 0,
"primary_in_bytes": 0,
"replica_in_bytes": 0,
"all_in_bytes": 0
},
"total": {
"combined_coordinating_and_primary_in_bytes": 0,
"coordinating_in_bytes": 0,
"primary_in_bytes": 0,
"replica_in_bytes": 0,
"all_in_bytes": 0,
"coordinating_rejections": 0,
"primary_rejections": 0,
"replica_rejections": 0
},
"limit_in_bytes": 0
}
}
},
"snapshots": {
"current_counts": {
"snapshots": 0,
"shard_snapshots": 0,
"snapshot_deletions": 0,
"concurrent_operations": 0,
"cleanups": 0
},
"repositories": {}
}
}

Even though the index sizes you mentioned are very small, the total size of the data set is around 14GB, which is more than can fit in the operating system page cache. Depending on how you query your data, it is therefore possible that disk I/O could be a bottleneck.
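
If you want to verify whether disk I/O is the limiting factor, the node stats API exposes per-node I/O counters on Linux. A hedged sketch, reusing the same ES_URL assumption as above (io_stats may be missing in some container setups, hence the optional chaining):

// Snapshot disk I/O counters and the CPU usage reported by the node itself.
const ES_URL = process.env.ES_URL ?? 'http://localhost:9200';

async function ioSnapshot(): Promise<void> {
  const res = await fetch(`${ES_URL}/_nodes/stats/fs,os`);
  const body: any = await res.json();
  for (const node of Object.values<any>(body.nodes)) {
    console.log(node.name, {
      // cumulative read/write operations across the data path devices (Linux only)
      read_ops: node.fs.io_stats?.total?.read_operations,
      write_ops: node.fs.io_stats?.total?.write_operations,
      // CPU usage as seen by the Elasticsearch process
      cpu_percent: node.os.cpu.percent,
    });
  }
}

ioSnapshot().catch(console.error);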

When you run your benchmark, how many indices are you querying in each request? How many indices are queried in total across all requests?

That does not sound like a lot for the number of concurrent queries you are sending. Does the behaviour change if you double this allocation?

I increased the CPU limit to 4 CPUs and got the same result; I don't see a difference.

I run two load tests: the first against just one index (200-300 MB) and the second against just one small index. I don't mix them.

What I notice is that when I run the 500-user test over the bigger index, where every request is the same search query over and over, it reports latency issues. Since it is always the same query, shouldn't it be cached?

When I do the same test over the smaller index I don't get any latency issues and it runs fine.
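
On the caching question: the shard request cache only caches searches with size=0 by default, so a query that returns 30 hits is not served from it, and the node query cache only caches pieces of the filter context, not whole responses. You can check whether either cache is actually being hit for an index, for example the tasks index used in the first test below (a sketch with the same ES_URL assumption as earlier):

// Check shard request cache and node query cache hit/miss counters for one index.
const ES_URL = process.env.ES_URL ?? 'http://localhost:9200';

async function cacheStats(index: string): Promise<void> {
  const res = await fetch(`${ES_URL}/${index}/_stats/request_cache,query_cache`);
  const body: any = await res.json();
  const total = body._all.total;
  console.log('request_cache hit/miss:',
    total.request_cache.hit_count, '/', total.request_cache.miss_count);
  console.log('query_cache   hit/miss:',
    total.query_cache.hit_count, '/', total.query_cache.miss_count);
}

cacheStats('tasks').catch(console.error);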

First test: same simple search query over the tasks index (200-300 MB). Latency spikes to 100 ms and 78.30% of requests fail.

data_received..................: 8.0 MB 265 kB/s
data_sent......................: 668 kB 22 kB/s
http_req_blocked...............: avg=313.14ms min=0s med=0s max=1.63s p(90)=1.17s p(95)=1.44s
http_req_connecting............: avg=66.66ms min=0s med=0s max=1.05s p(90)=184.18ms p(95)=189.13ms
http_req_duration..............: avg=5.1s min=39.94ms med=281.84ms max=20.83s p(90)=20.37s p(95)=20.65s
{ expected_response:true }...: avg=20.06s min=17.54s med=20.34s max=20.83s p(90)=20.71s p(95)=20.74s
http_req_failed................: 78.30% ✓ 1061 ✗ 294
http_req_receiving.............: avg=8.91ms min=0s med=0s max=194.04ms p(90)=6.25ms p(95)=81.78ms
http_req_sending...............: avg=38.28µs min=0s med=0s max=1ms p(90)=0s p(95)=515.7µs
http_req_tls_handshaking.......: avg=240.64ms min=0s med=0s max=1.44s p(90)=959.08ms p(95)=1.26s
http_req_waiting...............: avg=5.1s min=39.94ms med=281.84ms max=20.83s p(90)=20.32s p(95)=20.62s
http_reqs......................: 1355 45.152482/s
iteration_duration.............: avg=6.42s min=1.04s med=1.28s max=22.47s p(90)=22.01s p(95)=22.22s
iterations.....................: 1355 45.152482/s
vus............................: 500 min=500 max=500
vus_max........................: 500 min=500 max=500

Second test: same simple query over the small index (20-30 MB). Latency stays under 0.5 ms.

data_received..................: 75 MB 2.5 MB/s
data_sent......................: 1.1 MB 36 kB/s
http_req_blocked...............: avg=62.41ms min=0s med=0s max=1.61s p(90)=0s p(95)=503.03ms
http_req_connecting............: avg=14.85ms min=0s med=0s max=1.04s p(90)=0s p(95)=145.61ms
http_req_duration..............: avg=1.45s min=96.47ms med=1.49s max=2.01s p(90)=1.72s p(95)=1.81s
{ expected_response:true }...: avg=1.46s min=211.94ms med=1.5s max=2.01s p(90)=1.72s p(95)=1.81s
http_req_failed................: 0.57% ✓ 34 ✗ 5829
http_req_receiving.............: avg=537.65µs min=0s med=0s max=201.51ms p(90)=380.84µs p(95)=562.59µs
http_req_sending...............: avg=26.25µs min=0s med=0s max=1.03ms p(90)=0s p(95)=192.3µs
http_req_tls_handshaking.......: avg=47.5ms min=0s med=0s max=1.43s p(90)=0s p(95)=345.57ms
http_req_waiting...............: avg=1.45s min=96.47ms med=1.49s max=2.01s p(90)=1.72s p(95)=1.8s
http_reqs......................: 5863 195.33764/s
iteration_duration.............: avg=2.52s min=1.1s med=2.5s max=4.35s p(90)=2.8s p(95)=3.31s
iterations.....................: 5720 190.573307/s
vus............................: 500 min=500 max=500
vus_max........................: 500 min=500 max=500

I attach a picture of the monitoring during both tests 1 and 2; you can see that while running test 1 the latency increases a lot.

Why is there this performance difference between indices?

It looks to me like you are saturating CPU in both benchmarks, albeit for a shorter period in the second test.

I would recommend running multiple tests, gradually increasing the number of concurrent requests. Start low and keep track of latency and failure rate at each throughput level. This way you will see at what point problems start (failures or too-high latencies), and that will tell you what your cluster is able to handle. I would also recommend running a realistic query load; rerunning the exact same query is unlikely to give you useful results.
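
A ramp-up profile along these lines is easy to express in k6; this is only a sketch, and the URL, credentials and query body are placeholders you would replace with your own:

// k6 script that ramps virtual users up in steps so you can see where latency/failures start.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 25 },   // warm up
    { duration: '2m', target: 25 },   // hold and observe
    { duration: '1m', target: 50 },
    { duration: '2m', target: 50 },
    { duration: '1m', target: 100 },
    { duration: '2m', target: 100 },
  ],
  thresholds: {
    http_req_failed: ['rate<0.01'],     // mark the test failed if >1% of requests fail
    http_req_duration: ['p(95)<1000'],  // 95th percentile under 1 s
  },
};

export default function () {
  // Placeholder endpoint and query: fetch 30 docs from the "tasks" index.
  const res = http.post(
    'https://elasticsearch:9200/tasks/_search',
    JSON.stringify({ size: 30, query: { match_all: {} } }),
    { headers: { 'Content-Type': 'application/json' } }
  );
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}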

With just 10 virtual users I get the same results, but now not all of the 4 CPUs are consumed, only about 50%, so it is not a CPU bottleneck. Yet latency for requests to index 1 goes up a lot, to 60 ms, while the same requests to the other index show no problem.

Also, why does the heap usage constantly swing from 3 GB down to 1 GB and back up again? Any help on how to check what is causing the latency?
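
On the heap question: a repeated climb from about 1 GB to 3 GB followed by a drop back to 1 GB is usually just the JVM's normal allocate-then-collect sawtooth (young-generation garbage collection), not a leak. You can confirm it by watching heap usage together with GC counts; a sketch with the same ES_URL assumption as the earlier snippets:

// Watch heap usage and GC activity; frequent young collections with a stable baseline are normal.
const ES_URL = process.env.ES_URL ?? 'http://localhost:9200';

async function heapSnapshot(): Promise<void> {
  const res = await fetch(`${ES_URL}/_nodes/stats/jvm`);
  const body: any = await res.json();
  for (const node of Object.values<any>(body.nodes)) {
    const { mem, gc } = node.jvm;
    console.log(node.name,
      'heap_used_percent:', mem.heap_used_percent,
      'young gc:', gc.collectors.young.collection_count,
      'old gc:', gc.collectors.old.collection_count);
  }
}

setInterval(() => heapSnapshot().catch(console.error), 10000);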

Can anyone help me understand this latency issue? The disk is the same for all the data.

I have found that this only happens when the queries are sent through the NEST client; if the requests are sent as plain HTTP requests there is no problem.

You can see the difference between stress loading with HTTP requests and with the NEST client.
