ThreadPoolExecutor overused

Hello everyone,
We are trying to migrate from our old 1.4 cluster to our new 2.2.1 cluster.
Currently, half of our web site uses the new cluster and we are running into very big problems.

Here is the error:

Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$2@2238365d on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@3911c81[Running, pool size = 13, active threads = 13, queued tasks = 1000, completed tasks = 71304]]]
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
        ... 31 more

I read this topic: Courier Fetch: X of Y shards failed

It is not advisable to increase the size of the thread pool, but then how do I handle the load? Why did I have no problem with my old cluster? Is Elasticsearch 2.2.1 less efficient?

For information, on my old cluster:

stats/thread_pool/search:

 "search" : {
          "threads" : 27,
          "queue" : 0,
          "active" : 3,
          "rejected" : 71529,
          "largest" : 27,
          "completed" : 19144205488
        },

and on my new cluster:

"search" : {
          "threads" : 13,
          "queue" : 0,
          "active" : 0,
          "rejected" : 13352,
          "largest" : 13,
          "completed" : 679045
        },
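
For reference, both snippets above come from the node stats API; localhost:9200 is just a placeholder for one of our nodes:

curl 'localhost:9200/_nodes/stats/thread_pool?pretty'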

I do not understand why, with the same number of CPUs, I have half as many threads in the search thread pool...

Best regards,
Alexandre.

It's hard to tell without knowing exactly what you are doing. Could you share with us the documents you have and the typical query you are running?

Ideally, what was the response time on 1.x and then on 2.x with the exact same query?

Also, could you add profile: true to your query so we might be able to get more information?

Look at https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-profile.html
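
In case it helps, here is a minimal sketch of what that would look like; the host and the stripped-down body are placeholders, the only point is the top-level "profile" flag:

curl -XPOST 'localhost:9200/prod_geopoints/_search?pretty' -d '{
	"profile": true,
	"size": 1,
	"query": {
		"filtered": {
			"filter": {
				"term": { "terre": "2" }
			}
		}
	}
}'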

This is the mapping part for our geopoints:

{
	"prod_geopoints": {
		"mappings": {
			"pt_fra4": {
				"_id": {
					"store": true,
					"index": "not_analyzed"
				},
				"properties": {
					"altitude": {
						"type": "long"
					},
					"distance_max": {
						"type": "long"
					},
					"echeance_max": {
						"type": "long"
					},
					"elasticsearch_maj": {
						"type": "date",
						"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd",
						"locale": "fr"
					},
					"geoloc": {
						"type": "geo_point",
						"store": true,
						"lat_lon": true
					},
					"num_zone": {
						"type": "string"
					},
					"priorite": {
						"type": "long"
					},
					"terre": {
						"type": "long"
					},
					"type_previs_model": {
						"type": "long"
					}
				}
			}
		}
	}
}

And this is what the query looks like:

{
	"from": 0,
	"size": 1,
	"sort": [],
	"fields": ["_type", "_id", "altitude", "num_zone", "terre", "type_previs_model"],
	"query": {
		"filtered": {
			"query": [{
				"match_all": {}
			}],
			"filter": {
				"and": [{
					"term": {
						"terre": "2"
					}
				}, {
					"range": {
						"echeance_max": {
							"gte": "0"
						}
					}
				}, {
					"geo_distance": {
						"distance": "80km",
						"geoloc": {
							"lat": 48.8594,
							"lon": 2.34056
						}
					}
				}]
			}
		}
	},
	"aggs": {
		"par_types": {
			"terms": {
				"field": "_type"
			},
			"aggs": {
				"les_plus_proches": {
					"top_hits": {
						"sort": [{
							"_geo_distance": {
								"geoloc": {
									"lat": 48.8594,
									"lon": 2.34056
								},
								"order": "asc",
								"unit": "km"
							}
						}],
						"_source": {
							"include": ["_type", "_id", "altitude", "num_zone", "terre", "type_previs_model"]
						},
						"size": 1
					}
				}
			}
		}
	}
}

The response is:

{
	"took": 17,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"failed": 0
	},
	"hits": {
		"total": 20273,
		"max_score": 1,
		"hits": [{
			"_index": "dev_geopoints",
			"_type": "pt_climato",
			"_id": "200060",
			"_score": 1,
			"fields": {
				"altitude": [-999],
				"type_previs_model": [2],
				"num_zone": ["climato_decade_MF"],
				"terre": [2]
			}
		}]
	},
	"aggregations": {
		"par_types": {
			"doc_count_error_upper_bound": 0,
			"sum_other_doc_count": 0,
			"buckets": [{
				"key": "pt_climato",
				"doc_count": 20233,
				"les_plus_proches": {
					"hits": {
						"total": 20233,
						"max_score": null,
						"hits": [{
							"_index": "dev_geopoints",
							"_type": "pt_climato",
							"_id": "263151",
							"_score": null,
							"_source": {
								"altitude": -999,
								"terre": 2,
								"type_previs_model": 2,
								"num_zone": "climato_decade_MF"
							},
							"sort": [0.24342125975685]
						}]
					}
				}
			}, {
				"key": "pt_cfs_france",
				"doc_count": 40,
				"les_plus_proches": {
					"hits": {
						"total": 40,
						"max_score": null,
						"hits": [{
							"_index": "dev_geopoints",
							"_type": "pt_cfs_france",
							"_id": "2309",
							"_score": null,
							"_source": {
								"altitude": -999,
								"terre": 2,
								"type_previs_model": 2,
								"num_zone": "cfs_france"
							},
							"sort": [13.844561395009]
						}]
					}
				}
			}]
		}
	}
}

We have run some tests using this query with random lat/lon values.

It would be lovely if you could prettify your response...

The number of threads in the search thread pool was decreased in https://github.com/elastic/elasticsearch/pull/9165. One common reason for this issue is that your indices are over-sharded, since Elasticsearch will use one thread for every queried shard. How many nodes do you have in your cluster and how many shards do you have in total?
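
If it helps, those totals can be read straight from the cat APIs (localhost:9200 being any node of your cluster):

curl 'localhost:9200/_cat/indices?v'   # shard and replica count per index
curl 'localhost:9200/_cat/shards?v'    # every shard copy and the node it is allocated on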

Thanks for your response.
The cluster is made of 3 nodes, with 7 indices, 5 shards per index and 2 replicas. That makes 23 shards per node... I think you have explained my problem, I'm going to change that.

Note that replicas do not play a role here; the more relevant figure is that there are 5 * 7 = 35 primary shards, so each node has to run about 35 / 3 ≈ 12 search threads per search request. The thread pool is therefore almost filled by a single request.

Yet since you manage to fill up the thread pool queue, this means you are sending hundreds of concurrent requests. Is that expected?

We have about 1800 queries/sec on this cluster. But some of my indices are very small, so I can use fewer shards for them.
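
As a sketch, here is what that change could look like for one of the small indices; the index name is made up, and since number_of_shards can only be set at index creation, the existing data would have to be reindexed into it:

curl -XPUT 'localhost:9200/prod_geopoints_small' -d '{
	"settings": {
		"number_of_shards": 1,
		"number_of_replicas": 2
	}
}'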