We are ingesting data into Elasticsearch using 3 routing values (namely 3g, 4g and 5g). The index has 3 primary shards and 1 replica, and rollover is configured at 1 GB (max per primary shard) for testing.
When we send only 3g data, a single shard fills up completely while the other two stay empty; the index rolls over once that one shard reaches 1 GB, with the remaining two shards still empty.
We also tried configuring index.routing_partition_size, but the data isn't getting distributed as expected.
Why are you using routing in the first place? If you do not use it, data will be evenly distributed across shards. If you use routing, all data for a specific routing value goes to a single shard, and several different routing values can also hash to the same shard. With only a few routing values you are therefore likely to cause imbalances.
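To make the mechanics concrete, here is a minimal sketch of how routing maps documents to shards. Note this is illustrative only: Elasticsearch actually hashes the routing value with murmur3, and md5 is used here purely as a deterministic stand-in to show the hash-then-modulo behaviour.

```python
import hashlib

def shard_for(routing_value: str, num_primary_shards: int = 3) -> int:
    """Illustrative only: Elasticsearch uses murmur3 on the routing value;
    md5 stands in here to show the hash-modulo mapping to a shard."""
    digest = hashlib.md5(routing_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_primary_shards

# Every document with the same routing value lands on the same shard,
# and two distinct routing values can collide on the same shard.
for value in ("3g", "4g", "5g"):
    print(value, "-> shard", shard_for(value))
```

With only three routing values and three primary shards, nothing guarantees the three values map to three different shards, which is why one shard can fill while others stay empty.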
Currently we are receiving a large volume of data, and any query executed searches all the shards in our data stream. For that reason, we are using the routing approach to decrease query response time (by decreasing the number of shards queried).
We might have 3-4 routing values in our current implementation. Any suggestions on how to configure routing so that the shards are fully utilized for each routing value?
But we have dependencies on how the input to Elastic is configured if we separate indices based on user1/user2/user3.... It would also create many data streams, and index templates would have to be configured separately for each of them.
For that reason we want to use routing and limit the number of shards that will be queried.
Routing is primarily useful when you have a large number of routing values, so I would not recommend it for your use case. Separate data streams would be better, as each stream can roll over independently based on its data volume.
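For reference, per-stream rollover is typically driven by an ILM policy. A sketch of what that might look like follows; the policy name `per-user-policy` is made up, and the 50 GB threshold is just the upper end of the range suggested below:

```
PUT _ilm/policy/per-user-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb"
          }
        }
      }
    }
  }
}
```

Each data stream attached to such a policy rolls over on its own schedule, so a high-volume user's stream creates new backing indices more often than a low-volume one, without any manual rebalancing.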
If we calculate and split the data using user1, user2 and user3, the number of data streams reaches 1482 over 2 years. Is this approach healthy for an Elasticsearch cluster?
Why would you have 1482 data streams for 3 users? Would you not have one data stream per user, with a single primary shard and a rollover size set somewhere between 25 GB and 50 GB?
I do not understand. Data streams are generally used for time-series data, so I do not see why you would create one per month. If you have 3 users and 20 different types of data that each need a separate stream (is that really the case?), you would end up with 60 streams. Each stream is in turn backed by a number of indices, each covering a specific time period. You could, e.g., configure each stream to generate a new backing index once it covers a full month or exceeds 50 GB in size.
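The distinction between streams and backing indices matters for the count. A quick sanity check, under the assumptions above (20 data types per user is hypothetical, as is one monthly backing index per stream):

```python
users = 3
types_per_user = 20          # separate stream per data type, if truly needed
streams = users * types_per_user
print(streams)               # 60 data streams in total

# Backing indices accumulate over time, but the stream count stays fixed:
months = 24                  # 2 years of monthly rollover
backing_indices = streams * months
print(backing_indices)       # up to 1440 backing indices, not 1440 streams
```

So even over 2 years, the cluster would manage 60 data streams; the larger number counts backing indices, which ILM can delete as they age out.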
It would probably help if you explained your use case in more detail.