I have an Elasticsearch 6.8.6 cluster containing about 1200 day-wise indices. Indexing happens only on the current day's index, and there are no updates. The cluster holds about 45 TB of data, preserving historical data for up to the last 7 years.
Each day-wise index is configured with 1 primary shard and 1 replica. The primary averages around 16 GB, so with the replica the total size of a day-wise index averages 32 GB.
Currently, a search over the past 18 months of data hits 18 * 30 = 540 primary shards, and even with powerful data nodes it takes about 22 seconds to return a response. There are very few fields in the mapping.
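For reference, this is roughly how each daily index is set up (a minimal sketch; the template and index pattern names like `daily_logs` and `logs-*` are just placeholders for my actual naming scheme):

```
PUT _template/daily_logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}
```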
Is the response time better if I search a smaller number of bigger shards versus a large number of smaller shards?
I was thinking of combining the past 16-GB, 1-shard, day-wise indices into 480-GB, 10-shard monthly indices. E.g. today is 6-Nov-2020, so 18 months back would be 6-May-2019. So, re-index the 30 day-wise indices for May-2019 into a single monthly index with 10 shards, and do the same for June-2019 all the way up to Oct-2020. For Nov-2020, I can keep the indices day-wise and, once the month is over, re-index them into a monthly index.
With this approach, a search over the last 18 months of data would hit at most - say on 30th Nov - (17 months * 10 shards) + (1 current month * 30 daily shards) = 200 shards. On 6th Nov, it would hit (17 * 10) + 6 = 176 shards.
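Something like the following is what I have in mind per month (a rough sketch; index names such as `logs-2019.05.*` and `logs-monthly-2019.05` are placeholders for my actual naming scheme):

```
# Create the monthly index with 10 primary shards
PUT logs-monthly-2019.05
{
  "settings": {
    "index.number_of_shards": 10,
    "index.number_of_replicas": 1
  }
}

# Copy all of May-2019's daily indices into it as a background task
POST _reindex?wait_for_completion=false
{
  "source": { "index": "logs-2019.05.*" },
  "dest":   { "index": "logs-monthly-2019.05" }
}
```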
If this approach works, then I can look to replace day-wise indices with monthly indices altogether.
I would not be surprised if it were faster, but it may depend on the data and queries. I would recommend reindexing one month and comparing the latency of querying the monthly index against the corresponding daily ones. If it is better, or even close, I suspect switching to monthly indices would be worthwhile.
Thanks @Christian_Dahlqvist. That makes sense. And I suppose having a monthly index is much better than force-merging all the day-wise indices? All past day-wise indices are effectively read-only, since no updates happen.
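For example, something along these lines, running the same query against the reindexed monthly index and the original daily indices and comparing the `took` time (in ms) in each response (index names and the query field are only placeholders):

```
# Query the reindexed monthly index...
GET logs-monthly-2019.05/_search
{
  "query": { "match": { "message": "error" } }
}

# ...then the equivalent daily indices, and compare "took" in the two responses
GET logs-2019.05.*/_search
{
  "query": { "match": { "message": "error" } }
}
```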
Understood. I was just wondering whether, instead of re-indexing daily to monthly, I could force-merge the daily indices to improve response time. I agree that force-merging can also be performed on the read-only monthly indices.
@Christian_Dahlqvist - I reindexed 6 months of daily indices into 6 monthly indices and was able to reduce the response time to less than 5 seconds, even with 0 replicas. Thanks a bunch!
One more question: I am planning to force-merge the monthly indices now, since they are read-only. The average size of a monthly index is 550 GB, and I chose 12 shards so that each shard stays within the recommended limit of 50 GB.
I am thinking of setting max_num_segments = 5 for the force merge. Is that okay, or should I set it to 1?
If I set it to 1, each segment is likely to be a whopping ~50 GB in size. Even with 2 segments, each segment will be about 25 GB, which also seems high. So I thought of going with 5, which would mean each segment is approximately 10 GB.
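For the force merge itself, this is roughly what I intend to run against each read-only monthly index (sketch only; the index name is a placeholder). As I understand it, max_num_segments is a per-shard limit, so with 12 shards of roughly 46 GB each, 5 segments per shard works out to segments of about 9-10 GB:

```
# Merge the read-only monthly index down to at most 5 segments per shard
POST logs-monthly-2019.05/_forcemerge?max_num_segments=5
```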