ThreadPoolExecutor overused

Hello everyone,
We are trying to migrate from our old 1.4 cluster to our new 2.2.1 cluster.
Currently, half of our web site uses the new cluster and we are running into very big problems.

Here is the error:

Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$2@2238365d on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@3911c81[Running, pool size = 13, active threads = 13, queued tasks = 1000, completed tasks = 71304]]]
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
        ... 31 more

I read this topic: Courier Fetch: X of Y shards failed

It is not advisable to increase the size of the thread pool, but then how do I handle the load? Why did I have no problem with my old cluster? Is Elasticsearch 2.2.1 less efficient?

For information, on my old cluster:

stats/thread_pool/search:

 "search" : {
          "threads" : 27,
          "queue" : 0,
          "active" : 3,
          "rejected" : 71529,
          "largest" : 27,
          "completed" : 19144205488
        },

and on my new cluster:

"search" : {
          "threads" : 13,
          "queue" : 0,
          "active" : 0,
          "rejected" : 13352,
          "largest" : 13,
          "completed" : 679045
        },
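
For reference, both snippets above come from the node stats API; localhost:9200 is just a placeholder for one of our nodes:

curl 'localhost:9200/_nodes/stats/thread_pool?pretty'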

I do not understand why, with the same number of CPUs, I have half as many threads in the search thread pool...

Best regards,
Alexandre.

It's hard to tell without knowing exactly what you are doing. Could you share with us the documents you have and the typical query you are running?

Ideally, what was the response time on 1.x and then on 2.x with the exact same query?

Also, could you add profile: true to your query so we might be able to get more information?

Look at https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-profile.html
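
In case it helps, here is a minimal sketch of what that would look like; the host and the stripped-down body are placeholders, the only point is the top-level "profile" flag:

curl -XPOST 'localhost:9200/prod_geopoints/_search?pretty' -d '{
	"profile": true,
	"size": 1,
	"query": {
		"filtered": {
			"filter": {
				"term": { "terre": "2" }
			}
		}
	}
}'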

This is the mapping part for our geopoints:

{
	"prod_geopoints": {
		"mappings": {
			"pt_fra4": {
				"_id": {
					"store": true,
					"index": "not_analyzed"
				},
				"properties": {
					"altitude": {
						"type": "long"
					},
					"distance_max": {
						"type": "long"
					},
					"echeance_max": {
						"type": "long"
					},
					"elasticsearch_maj": {
						"type": "date",
						"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd",
						"locale": "fr"
					},
					"geoloc": {
						"type": "geo_point",
						"store": true,
						"lat_lon": true
					},
					"num_zone": {
						"type": "string"
					},
					"priorite": {
						"type": "long"
					},
					"terre": {
						"type": "long"
					},
					"type_previs_model": {
						"type": "long"
					}
				}
			}
		}
	}
}

And this is what the query looks like:

{
	"from": 0,
	"size": 1,
	"sort": [],
	"fields": ["_type", "_id", "altitude", "num_zone", "terre", "type_previs_model"],
	"query": {
		"filtered": {
			"query": [{
				"match_all": {}
			}],
			"filter": {
				"and": [{
					"term": {
						"terre": "2"
					}
				}, {
					"range": {
						"echeance_max": {
							"gte": "0"
						}
					}
				}, {
					"geo_distance": {
						"distance": "80km",
						"geoloc": {
							"lat": 48.8594,
							"lon": 2.34056
						}
					}
				}]
			}
		}
	},
	"aggs": {
		"par_types": {
			"terms": {
				"field": "_type"
			},
			"aggs": {
				"les_plus_proches": {
					"top_hits": {
						"sort": [{
							"_geo_distance": {
								"geoloc": {
									"lat": 48.8594,
									"lon": 2.34056
								},
								"order": "asc",
								"unit": "km"
							}
						}],
						"_source": {
							"include": ["_type", "_id", "altitude", "num_zone", "terre", "type_previs_model"]
						},
						"size": 1
					}
				}
			}
		}
	}
}

The response is:

{
	"took": 17,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"failed": 0
	},
	"hits": {
		"total": 20273,
		"max_score": 1,
		"hits": [{
			"_index": "dev_geopoints",
			"_type": "pt_climato",
			"_id": "200060",
			"_score": 1,
			"fields": {
				"altitude": [-999],
				"type_previs_model": [2],
				"num_zone": ["climato_decade_MF"],
				"terre": [2]
			}
		}]
	},
	"aggregations": {
		"par_types": {
			"doc_count_error_upper_bound": 0,
			"sum_other_doc_count": 0,
			"buckets": [{
				"key": "pt_climato",
				"doc_count": 20233,
				"les_plus_proches": {
					"hits": {
						"total": 20233,
						"max_score": null,
						"hits": [{
							"_index": "dev_geopoints",
							"_type": "pt_climato",
							"_id": "263151",
							"_score": null,
							"_source": {
								"altitude": -999,
								"terre": 2,
								"type_previs_model": 2,
								"num_zone": "climato_decade_MF"
							},
							"sort": [0.24342125975685]
						}]
					}
				}
			}, {
				"key": "pt_cfs_france",
				"doc_count": 40,
				"les_plus_proches": {
					"hits": {
						"total": 40,
						"max_score": null,
						"hits": [{
							"_index": "dev_geopoints",
							"_type": "pt_cfs_france",
							"_id": "2309",
							"_score": null,
							"_source": {
								"altitude": -999,
								"terre": 2,
								"type_previs_model": 2,
								"num_zone": "cfs_france"
							},
							"sort": [13.844561395009]
						}]
					}
				}
			}]
		}
	}
}

We have run some tests using this query with random lat/lon values.

It would be lovely if you could prettify your response...

The number of threads in the search thread pool was decreased in https://github.com/elastic/elasticsearch/pull/9165. One common reason for this issue is that your indices are over-sharded, since Elasticsearch will use one thread for every queried shard. How many nodes do you have in your cluster and how many shards do you have in total?
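
If it helps, those totals can be read straight from the cat APIs (localhost:9200 being any node of your cluster):

curl 'localhost:9200/_cat/indices?v'   # shard and replica count per index
curl 'localhost:9200/_cat/shards?v'    # every shard copy and the node it is allocated on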

Thanks for your response.
The cluster is made of 3 nodes, with 7 indices, 5 shards per index and 2 replicas. That makes 23 shards per node... I think you have explained my problem, I'm going to change that.

Note that replicas do not play a role here; the more relevant figure is that there are 5 * 7 = 35 primary shards, so each node has to run about 35 / 3 ≈ 12 search threads per search request. The thread pool is therefore almost filled by a single request.

Yet since you manage to fill up the thread pool queue, this means you are sending hundreds of concurrent requests. Is that expected?

We have about 1800 queries/sec on this cluster. But some of my indices are very small, so I can use fewer shards for them.
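
As a sketch, here is what that change could look like for one of the small indices; the index name is made up, and since number_of_shards can only be set at index creation, the existing data would have to be reindexed into it:

curl -XPUT 'localhost:9200/prod_geopoints_small' -d '{
	"settings": {
		"number_of_shards": 1,
		"number_of_replicas": 2
	}
}'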