I am building out my first single-node ELK stack server and all is working well until I backfill old data (about 6 months' worth) and try to use a Kibana dashboard to display more than 30 days of data. When I do, I get the following:
In the Kibana screen: Courier Fetch: X of 185 shards failed
In the Elasticsearch log: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@6cc27f99 on EsThreadPoolExecutor[search, queue capacity = 1000
First I went to the sizing guide and increased the Java heap from the default of 1 GB to 4 GB. This did not appear to have any effect. Next I doubled the search queue size setting (reference below for anyone who may come across this).
This solved the above issue. However, can anyone tell me whether it is wise to increase this value from the default, or is there a better way to approach this?
I've also tried switching to 2 shards, which lets me search back to around 45 days, but the problem persists until I increase the thread pool search queue size.
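For reference, the change was roughly the following in elasticsearch.yml. This is ES 2.x-style syntax and the value is simply double the default of 1000 shown in the error above; newer versions spell the setting thread_pool.search.queue_size instead, so treat this as a sketch and check the docs for your version.

    # elasticsearch.yml - double the search thread pool queue capacity
    # (the default shown in the rejection error above is 1000)
    threadpool.search.queue_size: 2000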
The issue here is how Elasticsearch performs queries like the ones you are executing. A Dashboard behaves differently than Discover, but that is a separate question.
For queries like this, ES has to issue a query to each shard involved. If you have a new index per day and 6 months of data, that is roughly 180 days and 180 indices. If each index has 5 shards, that is 900 shards in total. If you are using a dashboard with many visualizations on it, you could have (number of visualizations) * 900 total queries to execute.
How many can you handle at once? ES 2.x uses, I believe, a formula along the lines of 1.5 * CPUs * queue size. Say you have one node with 4 CPUs; then you can handle (1.5 * 4 * queue size) queries at once - I think the default queue size is 500, so that would be 3000 in total. But if you had 6 visualizations each needing 900 shards, that would be 5400 queries, which will overflow your search queue.
Obviously, if you increase your queue size, you no longer overflow the queue. The next question that comes to my mind is: how long do those queries take? If search performance is good, it may not hurt to have a larger thread pool search queue. Kibana will only wait 30 seconds by default, but that timeout can be configured.
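If you go that route, raising the Kibana timeout looks roughly like this. The exact key depends on your Kibana version (an assumption on my part): Kibana 4.x takes request_timeout in kibana.yml, while Kibana 5.x and later use elasticsearch.requestTimeout, both in milliseconds.

    # kibana.yml - wait up to 2 minutes for Elasticsearch responses
    request_timeout: 120000
    # Kibana 5.x and later instead use:
    # elasticsearch.requestTimeout: 120000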
Since Elasticsearch is horizontally scalable, the most obvious alternative is to add another node. The cluster will distribute your shards evenly across both, giving you 2x the CPUs, 2x the search queue capacity and 2x the search performance. With replication, a second node also gives you data redundancy and fault tolerance.
If you have small amounts of data, changing to a single shard per index instead of the default 5, and possibly even switching to weekly or monthly indices, would also help and would most likely allow you to handle larger data volumes; see the sketch below.
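As a rough sketch of that last suggestion, on ES 2.x you can set a node-level default in elasticsearch.yml (an index template is the more flexible way to do the same thing). Note that it only applies to indices created after the change; existing indices keep their shard count unless reindexed. Weekly or monthly indices are usually configured on the shipping side, e.g. the index pattern in the Logstash elasticsearch output, rather than here.

    # elasticsearch.yml - default shard count for newly created indices (ES 2.x style)
    index.number_of_shards: 1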