We are on ES 5.2.2, just recently upgraded from ES 2.3, and are looking to reduce our shard count. We have been told numerous times by people involved with ES that our shard count seems very high. However, when we try to decrease it we run into horrible search performance and always end up back at our original high number.
For example we have one index(our indexes are split up by users) that is 385GB and we have 40 shards allocated for it. We tried to cut that number in half to 20 but found that search times doubled and were at a level that was too slow for users.
Here are our mappings. Our parent to child ratio is around 225/1. To do the search testing we ran 10 searches in a row with aggregations and benchmarked the overall time it took to return the results.
Any thoughts as to why we are unable to lower our shard counts without taking a big hit to search performance?
this post needs more context for anyone to weigh in, like the kind of queries you are executing (using special features like routing), and all those nitty gritty details like the same test environment (for the filesystem cache to kick in), if you reindexed fully, before you compared old vs. new. A little explanation how your server setup looks like.
Also, ten searches are not really a lot, for example filter caches dont kick in at that point in time (the caching behaviour might be different). Also those ten searches span across different replicas, so maybe some parts are not cached (I still assume you did proper warm ups before running those tests).
Also, does 5.2.2 have the exact same behaviour than 2.3, if you just upgrade?
All of these EC2 instances have SSD-backed instance storage. All of our ES configs use 50% of the RAM on each of the above instances for ES. The data nodes do not go over the recommended 31.5Gb RAM limit.
Here are the searches we used for benchmarking. We do not use any routing besides the parent/child relationship. We have not been worried about the cacheing bc we tested the same searches on both indexes which had the same settings and still got drastically different results. If the cache did not kick in on one then I would assume it would not kick in on the other and the results would be comparable. If there is a better way to go about testing this we would love to know.
Both indexes before testing had the exact same data in them. We took the original and used the reindex API to reindex it to a new index with 20 shards then search tested it.
ES 2.3 gave us the same results as ES 5.2.2. This is a problem we have been trying solve for a while but due to other priorities it has been slipping to the way side. Our scale is really starting to ramp up now so it is something we would like to fix or at least understand sooner rather than later.
That is almost 10GB per shard, which in my opinion for many use cases is a good size. The ideal shard size will however depend on your data and queries, so unless you have very small shards, e.g. less than 1GB, I would always recommend benchmarking before making any changes.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.