I am new to ES and although I have read several blog posts and articles, I still cannot find the answers I am looking for. I need to ensure that shards in my ES database will have the optimal size (performance-wise).
I understand that the optimal size of a shard depends on multiple factors and (I suppose) can only be determined by testing with real data.
What I don't understand is how data is distributed into shards within an index. For example, if there are five (primary) shards in an index and I set the maximal index size to 100 GB, will each shard be 20 GB when the index reaches its limit? So, if I find out the optimal shard size I can just multiply it by the number of shards in an index and use this value as the maximal index size? Or is it not this simple?
And more generally, why is there no tool for simply specifying the maximal shard size instead of the maximal index size (such as the Rollover API)?
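For reference, this is the kind of condition the Rollover API accepts, as far as I can tell: a cap on total index size, not per-shard size (a sketch; the alias name `logs_write` is made up):

```
POST /logs_write/_rollover
{
  "conditions": {
    "max_size": "100gb"
  }
}
```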
I will most certainly go through the resources you suggested. However, I am a bit perplexed by you saying
because in the very article you recommended (and that I have already read), "How many shards should I have in my Elasticsearch cluster?", they say that "a shard size of 50GB is often quoted as a limit that has been seen to work". So doesn't it make sense to limit the size of indices then (because they consist of shards)? Or how else do I control the size of shards? I feel like I am missing something here...
Yes. That's a theoretical limit. If you send more data, Elasticsearch will still accept and index it.
So there is no "limit" as an index setting or anything like that.
I'd say that the only limit I know for now is the available disk space.
So doesn't it make sense to limit the size of indices
Yes. But Elasticsearch does not have that built-in setting, as I said.
OK, just to clarify that I am on the same page: the way to control shard size (i.e. keep the data under a given limit per shard) is to control the index size and the number of shards per index. Do you agree with that?
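In other words, something like this sketch (assuming a target of 50 GB per primary shard; the index and alias names are made up): create the index with a fixed number of primary shards, then roll over when the index reaches shards × target, i.e. 5 × 50 GB = 250 GB here:

```
PUT /logs-000001
{
  "settings": {
    "number_of_shards": 5
  },
  "aliases": {
    "logs_write": {}
  }
}

POST /logs_write/_rollover
{
  "conditions": {
    "max_size": "250gb"
  }
}
```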
Thank you very much, your answers are very helpful to me!