If we set up an Elasticsearch cluster with 2 nodes, and an index configured for 4 shards with 1 replica, and then we begin indexing data, I understand our data will be placed into shards based on the hash of the routing ID I give it (in our case, we use the account ID, so that each account's data is kept in a single shard), and I understand how replica shards function.
But, if later down the road, we decide to add 2 more nodes (rather than increasing the CPU/memory size of the existing 2 nodes) to double our resources by scaling out instead of up. IOW, we'd have something like 2 nodes, each with 4 CPU cores and 16 GB of RAM, and then rather than increasing that to 8 cores and 32 GB of RAM, we'd simply add 2 more nodes, each with 4 CPU cores and 16 GB of RAM.
If we do this sort of upgrade, will Elasticsearch automatically move the indexed data around to different shards based on the routing IDs originally given with each object? Is there something I would need to do manually during this process? Is this the completely wrong way of thinking about all of this?
If you don't use routing at all, it's just based on the object ID, right?
We were using routing because each account will only ever search for objects in their own account, so my understanding (based on a blog post, albeit) was that it would be more efficient to have such a query be against a single shard. So, for instance, if account 123 has 5000 objects and the routing is based on the 123 ID, all 5000 of those objects will reside in a single shard, so when I send a query request in, I can also route that based on 123 and be automatically only looking in the node that has all of that account's 5000 objects. Is this incorrect / a bad idea?
Oh... I did not expect that this was the case (re: "No."). So, increasing the number of nodes (aka scaling out vs up) is not a great idea, unless I'm prepared to re-index everything or manually split the index. Is that a good understanding?
This can make sense if you have a search heavy use case with very high query concurrency or a very large data set, but is generally not needed. If you have accounts of very varying sizes you can easily end up with imbalanced shard sizes. I would avoid using it unless you experience performance problems.
It depends on how the split index API handles this, which I do not know. I suspect it should be fine though as I believe it keeps track of this behind the scenes, but someone with more knowledge would need to confirm this. If you do not use this API you need to reindex to change number of primary shards.
Thanks for this advice. We do have a very search-heavy use case with relatively high query concurrency (if I had to guess, I'd say 80/20 read/write - possibly even 90/10 once we spend the time to eliminate some of the extraneous "ok now let's reindex this object" events that we currently have baked into the system).
We also have a large number of objects (6+ million) spread over thousands of accounts, but the total data size is small (since we only index key fields). Only ~15 GB source data for now, but likely double in about 1.5 years.
I also asked a separate question where I described the workload in a little more detail. I didn't want to pile that question onto this one, but your response earlier inspired me to rethink the initial setup strategy.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.