We're running ES 6.8 and planning to rethink how our data is partitioned across indices, as we've acknowledged, month after month, that the original decisions about the indices were not properly planned.
We now have 3 TB of data (6 TB with replicas) in 145 indices, but a single one of them stores 99.9% of all the data on its own. Since we're reworking them, we could use some advice.
We have a cluster of 3 identical nodes (also a questionable choice, but we may ask for suggestions on that later), each with 16 CPUs, 64 GB RAM and a 2 TB disk. We're planning to create a certain number of indices to spread the load of the single big one, but we're not sure how many indices we can use without causing a drop in performance.
Can you suggest the right order of magnitude (10, 100, 1000, more?) for that number?
Here is more information:
Our indices have roughly 100 mapped fields per document.
Our single “super-index” uses 40 shards (which should be properly sized according to the guidelines), while the other 144 indices have the default (5 for this version of ES).
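For what it's worth, when we create any new indices we plan to set the shard count explicitly rather than relying on the default. A minimal sketch with the Python client (index name, shard count and connection details are just placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection details assumed

# Hypothetical new index: set the primary shard count explicitly instead of
# relying on the 6.x default of 5.
es.indices.create(
    index="events-2021",
    body={
        "settings": {
            "number_of_shards": 4,
            "number_of_replicas": 1,
        }
    },
)
```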
A single 3 TB index split into 40 shards doesn't seem totally unreasonable. If you're not adding nodes, are you sure that more/smaller indices will bring benefits? What specific problem are you actually trying to solve?
Here's an article about shard sizing that might help:
Note that it recommends shards in the tens-of-GBs range, so you'd typically need fewer than 100 shards for 3 TB of data, however you end up dividing them across indices. If there are lots of different kinds of data in there then you might see some benefit from a few more indices, but tiny shards are inefficient.
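If you want to see where your current shards sit relative to that range, the cat shards API lists per-shard store sizes. A minimal sketch with the Python client (connection details assumed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection details assumed

# List every shard with its store size in GB, largest first.
shards = es.cat.shards(h="index,shard,prirep,store", bytes="gb", s="store:desc", format="json")
for s in shards:
    if s["store"] is not None:  # unassigned shards report no store size
        print(s["index"], s["shard"], s["prirep"], f'{s["store"]} gb')
```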
This is exactly the information I was asking for.
Apparently 3 TB is a perfectly reasonable size for 40 shards, but we're talking about 4,629,375,631 documents, which is why we were thinking about splitting.
Furthermore, your suggestion gave us clues about the proper sizing of a document. We're now thinking about reshaping the information into a smaller number of bigger documents.
This should help us improve performance.
We've experienced several errors like "Too many requests" or "Socket timeout" when we perform high-load operations (massive bulks with massive parallelism). If the partitioning is not so bad, then it's probably related to our cluster architecture (as I said, a cluster of 3 identical nodes can't be the best solution).
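One thing we're planning to try is throttling the bulk requests and letting the client retry on 429s. A rough sketch of what we have in mind, using the Python bulk helpers (index name and tuning values are only placeholders):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")  # connection details assumed

def actions(docs):
    for doc in docs:
        # 6.8 still requires a mapping type in bulk actions; index name is hypothetical.
        yield {"_index": "events-2021", "_type": "_doc", "_source": doc}

def load(docs):
    indexed = 0
    for ok, _ in streaming_bulk(
        es,
        actions(docs),
        chunk_size=1000,      # smaller chunks put less pressure on the cluster
        max_retries=5,        # re-send documents rejected with 429 Too Many Requests
        initial_backoff=2,    # seconds; doubles on each retry, capped at max_backoff
        max_backoff=60,
        request_timeout=120,  # avoid premature socket timeouts on large bulks
    ):
        indexed += ok
    return indexed
```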
Can you give us some guidelines about cluster configuration? Maybe drilling down into the differences among master, data and client (coordinating-only) nodes could help us reshape the overall node architecture.
The max number of documents permitted in a single shard is 2,147,483,519, and you only have just over twice that, so your document count doesn't indicate that you need more than 40 shards either. If it were just about the document count, you'd be better off thinking about fewer shards, not more.
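To put rough numbers on it (figures taken from this thread):

```python
# ~4.63 billion documents spread over 40 primary shards.
total_docs = 4_629_375_631
primary_shards = 40
lucene_limit = 2_147_483_519  # max documents per shard

docs_per_shard = total_docs / primary_shards
print(f"{docs_per_shard:,.0f} docs per shard")              # ≈ 115,734,391
print(f"{docs_per_shard / lucene_limit:.1%} of the limit")  # ≈ 5.4%
```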
The biggest-impact thing you can do is upgrade to 7.x: you're missing out on 2½ years of new development since 6.8 was released, much of which materially improves Elasticsearch's performance or behaviour under load.
There's nothing inherently wrong with running 3 identical nodes. Adding nodes certainly makes things more complicated, and your dataset is pretty small, so you may not see many tangible benefits from a larger cluster.
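If you want to check how roles are currently spread across your three nodes, the cat nodes API shows it at a glance. A minimal sketch with the Python client (connection details assumed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection details assumed

# node.role is a compact string such as "mdi" (master-eligible, data, ingest);
# "*" in the master column marks the currently elected master.
print(es.cat.nodes(h="name,node.role,master,heap.percent,disk.used_percent", v=True))
```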