We have a 7 node ES cluster in EC2, and query performance is far worse than I would expect given the load. I have tried rewriting the query syntax from filters to queries (since our retrieval is all exact term matches against not_analyzed fields), but that has made no real improvement. We frequently see spikes of 100% CPU. Recently one node's search thread pool queue climbed to 1000 (while at 100% CPU) and stayed there for about 15 minutes, with searches in the slow log taking 8 to 10 seconds to complete. We seem to go through periods where a tiny bit of additional load pushes the cluster to the brink of tipping over.
Here is a snippet of the search syntax => https://gist.github.com/jaydanielian/e374a401560f3e3b1812#file-gistfile1-txt
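For context, each individual search is roughly shaped like the sketch below (field names, values, and the "size" are placeholders; the gist above has the real syntax):

```python
# Rough shape of one lookup (names/values are placeholders; see the gist
# above for the actual syntax). Every clause is an exact term match against
# a not_analyzed field, so no two requests are likely to repeat.
query_body = {
    "query": {
        "bool": {
            "must": [
                {"term": {"customer_id": "12345"}},
                {"term": {"record_key": "abc-def-001"}},
            ]
        }
    },
    "size": 1,
}
```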
Here is a snapshot from hot_threads =>
- We are using EBS for the disk storage, and I do see a fair amount of disk reads, but I'm not sure why that would also cause CPU to spike and hold at 100%
- I see several cache evictions per second (4 or 5), even though I don't use filters anymore since our searches are all specific (not going to be reused), so I'm not sure why the cache is still churning
- We have custom routing enabled, so we only hit one shard per query
- Our index is optimized down to 1 segment, as it is read only
- We have six shards with 820 million docs total in the index (118 GB)
- We use the multi-search API to batch our requests together, usually in chunks of 50 or 100 (see the sketch after this list)
- During these CPU spikes it is not uncommon for 2 to 3 nodes to be at 100% CPU while the others are virtually idle
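Here is a minimal sketch of how we batch the lookups through _msearch with custom routing, using the Python client (host, index name, and the routing value are placeholders, not our real setup):

```python
import json
from elasticsearch import Elasticsearch

# Placeholder host; in production we point at the cluster's client nodes.
es = Elasticsearch(["http://es-node-1:9200"])

def msearch_chunk(lookups, chunk_size=100):
    """lookups: list of (routing_value, query_body) pairs, where query_body
    is shaped like the term query sketched earlier in this post."""
    lines = []
    for routing_value, query_body in lookups[:chunk_size]:
        # Header line carries the custom routing, so each search in the
        # batch only touches a single shard.
        lines.append(json.dumps({"index": "records", "routing": routing_value}))
        lines.append(json.dumps(query_body))
    # _msearch expects newline-delimited JSON, terminated by a newline.
    return es.msearch(body="\n".join(lines) + "\n")
```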
Any gurus out there who can offer some guidance? We are at the end of our rope dealing with this, and may have to consider moving off ES if we can't get the cluster to handle the current load more efficiently.