Let’s say the cluster is empty and we have the choice between:
1 index with five 50 GB shards
5 indices, each with one 50 GB shard
Is there any difference between these two options purely from a performance standpoint, assuming that we index into and search across all shards concurrently in both cases?
I think I'll benchmark this anyway, but I'd just like to know, from a theoretical perspective, what would cause any potential performance differences here.
If you have five machines, the five shards will end up on separate systems; in the same way, the five indices will end up on five different systems.
However, if the indices are written one after another, all of the first-loaded data sits on one machine and all of the last-loaded data on the last machine (the last index), which means you have a hotspot.
With five shards in one index, documents are distributed randomly, so you hit different systems (assuming data is being written to different shards at the same time), much like RAID 5 on storage.
Thanks for the responses. Yeah, my assumption was as you say, David, that there shouldn't be any performance impact but I'll test it out.
So I largely just wanted to understand the impact of moving from daily indices to ILM and how many shards I should be assigning to indices when we do so.
We currently have some daily indices which we configure with 20-50 primaries, 1 replica, which results in approximately 50 GB shards once the day is done. From previous experience, we’re pretty sure that by no longer indexing into all of these shards at once, we should decrease the stress that indexing puts on the cluster. And if we go with 1 shard for the write index and roll over at 50 GB, then we’ll be in a similar place to my original question:
Before: 1 index, 50 shards
After: 50 indices, 1 shard each
I realise that I originally asked about indexing into all shards/indices concurrently, which won’t actually be the case, but I was just curious about that scenario too. We will, however, be searching concurrently.
We’ll also need to check whether indexing into 1 shard gives us sufficient throughput. Although, even if it does, I’m guessing we could end up in a situation where some of the busier indices are all located on the same node, which could be troublesome.
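For reference, my rough sketch of the rollover side of this as an ILM policy (the policy name and the max_age value are placeholders; max_primary_shard_size needs a reasonably recent Elasticsearch version, older ones can use max_size on the whole index instead):

```
# Hypothetical policy name and max_age; 50 GB matches the shard size we're aiming for.
PUT _ilm/policy/my-rollover-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          }
        }
      }
    }
  }
}
```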
Switching to rollover with a smaller number of primary shards should be positive for indexing speed as bulk requests would be less spread out. As the older indices will no longer be written to, they can be optimised (e.g. forcemerge) earlier, which may also help with resource usage and search speed.
If you have index patterns of varying sizes, you may want to give the larger ones multiple primary shards in order to better spread out the write load. If you have indices with multiple primary shards, you can force these to be spread out across different nodes using the index.routing.allocation.total_shards_per_node setting. Be careful with this, as it may prevent shards from being reallocated after node failures if you are too aggressive.
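As a rough sketch of what that could look like in an index template (the template name, index pattern and shard counts below are just examples):

```
# Hypothetical template for one of the busier index patterns.
PUT _index_template/busy-logs
{
  "index_patterns": ["busy-logs-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 2,
      "index.number_of_replicas": 1,
      "index.routing.allocation.total_shards_per_node": 1
    }
  }
}
```

With 2 primaries and 1 replica that is 4 shard copies, so a limit of 1 per node means the index needs at least 4 nodes to be fully allocated, which is where the warning about being too aggressive comes in.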
Thanks @Christian_Dahlqvist . Yeah, we use that setting currently with our daily indices. So we have about 5-10 index patterns with tens of shards, then about 100 or so with 1 or 2 shards. I'm wondering if it's worth applying some allocation include/exclude settings to the larger index patterns to prevent a bunch of the write indices from all being allocated on the same node.
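Roughly what I have in mind for that (the node attribute name, value and index name here are made up; the built-in _name or _ip attributes could be used instead of a custom one):

```
# Hypothetical: tag a subset of nodes in elasticsearch.yml with
#   node.attr.ingest_tier: heavy
# then pin a busy write index to those nodes:
PUT busy-logs-000001/_settings
{
  "index.routing.allocation.include.ingest_tier": "heavy"
}
```

index.routing.allocation.exclude.* works the same way in reverse, keeping an index off the named nodes.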