I'm tuning my cluster and I believe I need to increase threadpool.search.queue_size based on https://github.com/elastic/kibana/issues/3221, heavy query use, and our large number of shards. However, before changing this I want to make sure I understand which nodes it should be applied to. Which nodes manage this threadpool? Can I only apply this to the client nodes, or should this be applied to the data nodes?
Increasing the threadpool is a short term fix and likely to cause other issues. If you have high shard count you should reduce that for starters.
However all nodes have threadpools.
Well, I have a large number of shards because I have a large number of indexes because I have a large amount of data (close to a billion documents). I create a new index hourly, and before I did that I had other problems due to large index size degrading the cluster. So I don't think reducing the shard count is the right solution. I only have 5+1 shards per index.
Based on my query load and shard count I don't think it's a short term fix, I have a large number of developers running a large number of complex visualizations against a large number of shards. I'd like to up the queue_size to 10,000, but only on the client nodes where the incoming queries are directed. I do not want to increase the memory footprint of the data nodes as they're already heavily loaded with indexing. Can I increase the thread pool on the client nodes, or are the client nodes simply forwarding on to the data nodes requiring a change to the data nodes as well?
How many shards, how many indices, how much (GB) data?
2,362 shards (5 primaries and 5 replicas per data index, plus 2 for .kibana)
1.32 TB of data
So about half a gig per shard, that's really pretty wasteful to be honest.
If you reduced the count then you'd reduce the requirement to increase this threadpool by an order of magnitude.
So what would you recommend as the "sweet spot" for shard size, then? In researching it I've found a fair amount of hearsay and conflicting information. We used to run about 6-24GB/shard but that had issues with the cluster being yellow for hours if it had to move the shards for some reason. Currently my larger indexes are about 1GB/shard and, with the exception of this issue which just cropped up today, the cluster is much more stable.
We recommend keeping them under 50GB max size. Obviously there are multiple factors though, as you have found.
Was it causing issues? Yellow shouldn't be a problem.
The ideal shard size will depend on the use case as well as the environment. Which version of Elasticsearch are you using? How many nodes do you have in the cluster? What type of hardware are you using? What is the specification of the nodes?
Yes, it was. While it was recovering our write rate would be reduced causing message loss as the write rate was lower than the incoming data rate. It sounds like the recommendation is the shard size we used to run and for our setup I don't wish to go back to the larger shard size, I've had fewer issues with a larger number of smaller shards. I believe this issue started today due to increasing visualization use, otherwise my ES cluster has been quite healthy and responsive with the smaller shards.
What other issues would increasing the queue size cause? Based on
https://github.com/elastic/kibana/issues/3221#issuecomment-170871441 my current impression is the risk is in the memory footprint, which is why I was asking about which nodes will be impacted. My tribe nodes which Kibana hits are using less than 50% of their memory, so I'm comfortable increasing the footprint there. On my data nodes I'm much more concerned about memory issues.
For our cluster we're running ES 2.4 with 10 data nodes, 3 masters, 3 clients, and 3 tribes. Each node has 8 cpus and 32GB of ram.
Having hourly indices for this volume of data seem excessive, even though the total shard count for a cluster of this size is not. If you instead e.g. used daily indices with 10 primary and 10 replica shards, you would still be able to spread out indexing across all nodes in the cluster and generate 20 shards per day instead of 240. If I calculate correctly this should give you an average shard size around 5 or 6 GB, which for most use cases I come across is fine with respect to search performance as well as recovery speed.
As I said I used to run daily indexes and it was not satisfactory in our environment. Going back to that config will reopen a whole host of other problems with backups and recovery from disruptions. That is not a good fix for our environment, for us it's a short term fix that will cause other problems.
As such I'd really like to go back to my original question as I would like to understand the implications of tuning the queue size. I've read it can cause problems if tuned improperly but have not been able to find details about what those problems would be. I'd also like to understand which nodes hold this queue, as the answer I've gotten so far is "all nodes" which doesn't completely make sense to me with the topology of queries going to the client nodes.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.