In my data model I store data per language, so I am thinking of distributing documents with one language per shard. Let's say I have a product with id 10, with fields such as label, cost, and color.
So I store the data per language as 10_1, 10_2, 10_3, etc., where 1, 2, 3 are language ids.
10_1 => label: english_label, color: color of the product in English
10_2 => label: japanese_label, color: color of the product in Japanese
My concern is that the routing formula Elasticsearch currently uses is
shard no. = hash(routing) % no. of primary shards
and in my case routing = _id = 10_1, and so on.
So please help me find a formula that sends all products of the same language to a single shard, because no matter what value I pass, the hash() function changes the final value internally.
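To make the formula above concrete, here is a minimal Python sketch. It is only an illustration under assumptions: `zlib.crc32` stands in for the hash (Elasticsearch uses its own Murmur3-based hash, so real shard numbers will differ), and `shard_for` is a hypothetical helper name. It shows that when routing defaults to `_id`, the shard is chosen from the hash of the whole id, not from the language suffix.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    # Stand-in hash for illustration only; NOT Elasticsearch's real hash function.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


NUM_SHARDS = 3  # assumed number of primary shards

# Ids in the "<product>_<language>" form used in the post.
doc_ids = [f"{product}_{lang}" for product in (10, 11, 12) for lang in (1, 2, 3)]

for doc_id in doc_ids:
    # With default routing, the routing value is the document id itself.
    print(doc_id, "-> shard", shard_for(doc_id, NUM_SHARDS))
```

The same id always maps to the same shard, but nothing in the formula groups ids that share a language suffix.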
That's not a great idea, because language use is lumpy. A lot of people speak English (either natively or as an alternate language), so your shard for that will be huge.
Not many people speak Australian, so that shard would be small.
And then how do you manage shards when you want to add/remove languages? Do you just start a single index with 500 shards and hope you use them all? (Note, no, you should never do that, it's a huge waste).
So the question is, what value do you see in having everything from the same language in the same shard?
Yes, I understand your point about wasted space and poor architecture, but here are my answers to your points and questions:
I will ensure that every language has data for every product.
When a language is added, I will put its data into one of the existing shards. I will write logic so that if the number of languages exceeds the number of shards by a certain percentage, I reindex the data with more shards (again one shard per language). Not Elasticsearch's actual reindex, but something similar.
The value I see is that all the data for a given language sits in one place. As I understand it, the Elasticsearch node first broadcasts the query to all shards, then gathers the results and performs sorting, relevance scoring, and so on. Instead, I would tell Elasticsearch to go to one particular shard, which saves time and work on the Elasticsearch side.
I hope you get the idea now.
So if you know how to adapt the routing mechanism for this, please let me know.
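The idea of sending each language to one shard can be sketched by using the language id, rather than `_id`, as the routing value (the index and search APIs accept a `routing` parameter). The snippet below is an assumption-laden illustration: `zlib.crc32` is a stand-in hash, and `shard_for` and `language_of` are hypothetical helpers, not Elasticsearch functions.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    # Stand-in hash for illustration only; NOT Elasticsearch's real hash function.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def language_of(doc_id: str) -> str:
    # Ids look like "<product>_<language>", e.g. "10_1" -> "1".
    return doc_id.rsplit("_", 1)[1]


NUM_SHARDS = 3  # assumed number of primary shards
doc_ids = ["10_1", "11_1", "12_1", "10_2", "11_2", "12_2"]

for doc_id in doc_ids:
    # Route by language id instead of by the full document id.
    routing = language_of(doc_id)
    print(doc_id, "routing =", routing, "-> shard", shard_for(routing, NUM_SHARDS))
```

Every document with the same language id now lands on the same shard, and a search that passes the same routing value only needs to visit that shard. Which shard that is still depends on the hash, so it cannot be chosen by hand.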
Routing ensures that all data related to a specific routing key ends up in the same shard, but it does not make a specific shard hold data for just a single routing key. If this is what you are looking for, you MAY be better off having multiple indices with a single primary shard each.
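A rough Python sketch of both halves of this point, under stated assumptions: `zlib.crc32` is a stand-in for the real hash, and the helper names and the `products_lang_` index prefix are hypothetical. With more routing keys than shards, some keys must share a shard, whereas one single-shard index per language keeps each language on its own.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    # Stand-in hash for illustration only; NOT Elasticsearch's real hash function.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def index_for_language(language_id: int, prefix: str = "products_lang_") -> str:
    # Hypothetical naming scheme: one index (with one primary shard) per language.
    return f"{prefix}{language_id}"


NUM_SHARDS = 3  # assumed number of primary shards
languages = [1, 2, 3, 4, 5]

# Option 1: one index, routing by language id. Keys can share a shard.
shard_by_language = {lang: shard_for(str(lang), NUM_SHARDS) for lang in languages}
print("shard per language:", shard_by_language)
print("distinct shards used:", len(set(shard_by_language.values())))

# Option 2: one single-shard index per language. No sharing by construction.
print("index per language:", [index_for_language(lang) for lang in languages])
```

With five languages and three shards, at least two languages end up together no matter which hash is used. Even with as many shards as languages, the hash gives no guarantee of a one-to-one mapping.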
No, I have not tested it, because currently I am not able to distribute data by language, and one shard contains data for more than one language. It would be really great if you could tell me how to do this, either with some value for "routing" or by designing our own formula.
Yes, I have considered your comment and am thinking it over. I will propose it to my team and discuss it, but before that I want to see whether the formula can be changed in some way.