Need a hep on how to choose a particular shard to index a document to. I know about the routing concept that we can let similar type of documents to reside in same shard. But this is what happening with our production environment of elasticsearch:
A 3 data cluster node which has 32 primary shards. The data is basically time series data and we try to group together data to a month in a single shard using the routing principle. This works absolutely fine but what we are observing is that certain shards contains most of months whereas certain shards just contain a single month or so.
The result is if we want to work on certain months of data, then we cannot utilize the power of sharding since the query goes to a single node only.
Is there a way where in along with routing value I can also specify this is the shard I want my data to reside in or at least some functionality where ES distributes data according to our need.
Hi, why don't you use time-based indexes (monthly)? Using it you can always be sure that data in it equally distributed between data nodes and contains only month you needed. As for querying it you can use wildcards in index name or aliases.
Thanks @warkolm and @rusty for the answers. I agree that the data model we choose was not at all good and later on going through different blogs etc we came to know about time based indices. Agreed that the time based indices are the perfect use case here but my question is more general.
"Is there a way while using specific routing, I can actually divide the data equally among all the shards and does not cause all the different route values to reside on a single shard?"
Let me think of an example. There are million of documents which are characterized by two columns i and j. For value of i ranging from 1 to 10, j value would range for 1 to 1 million. Now what I want is I want to have my data residing in 20 shards such that for i, I would divide my j into two halves and each half is persisted on a different shard. This way, I know exactly where my data is going. Reason, my queries are such that this is the best model I could think of. One example query is every 1 hour, I have to get 10,000 rows from each shard. This way atleast I know my data is divided equally and hence my query will spread to each shard properly. Had it been on one big bulky shard and other shards sitting idle does not necessarily boost the query performance.
Any hints?
UPDATED QUESTION - @warkolm - Mark, going back to our first use case what if I have a monthly index and I want to have 5 shards in it. With routing I want to put a logic that every 6 days of data must reside in a single shard. Do I have the guarantee that all the 5 different route values will force data to reside on different shards? If yes, hows that possible?
The reasons for an ask of data getting distributed equally on all shards while using routing is:
1.) If I am using Spark, for each shard, Spark creates it own task. Imagine one task doing a lot of work and other kind of sitting idle because of very less data over it. (Think of if we are operating on a whole data set in an index).
2.) During shard reallocation, my heavy shards takes a comparative high duration of time as compared to shards with less data volume.
3.) If because of my above stated problem, only one shard is doing most of the work where other shards could have been used if the data distribution was proper then I could have gained more(I am thinking in terms of distributed computing which noSQL provides).
So the question again boils down to if I do routing, how is it possible to not to clutter the data with different routing values in a single shard only and divide the data equally.
If I talk about the use case of i and j, what should be a good data modelling recommendations from ES team.
Help me @warkolm. Time series I got make month based indexes and do something..but I guess what I am asking for is a valid use case. Should it reach the github??
Apart from it, this i and j use case is troubling me a lot since I was actually counting on proper distribution of data in case of using routing. Need some recommendations.. Help!! Help!!
Sorry, but this is not a recommended application of the product or best practise, so I won't help you, because eventually it'll fall over and you will come back asking for help to fix it. So it's better to do it right the first time.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.