I work for a software company that indexes & archives millions of Word, PDF, and Excel documents.
How would you time-shard the data in a way that would scale to 10 billion documents over 10 years?
- Index per year => Index sizes would be too large
- Index per month => Indexes would still be very large, and the number of indexes would be excessive (a sketch of this layout follows the list).
- Index per data type => Indexes would be too large
- Index per week => Number of indexes would be excessive.
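For reference, here's roughly what the index-per-month option looks like in practice. This is a minimal sketch assuming the 8.x elasticsearch-py client; the endpoint URL, the `docs-YYYY-MM` naming pattern, and the document fields are hypothetical placeholders, not anything we actually run:

```python
# Minimal sketch of monthly time-sharding with elasticsearch-py 8.x.
# The endpoint, index pattern, and fields below are hypothetical.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

def monthly_index(doc_date: datetime) -> str:
    # One index per calendar month, e.g. "docs-2016-03".
    return f"docs-{doc_date:%Y-%m}"

# Write: route each document to the index for its creation month.
created = datetime(2016, 3, 14, tzinfo=timezone.utc)
doc = {"title": "quarterly-report.docx", "created": created.isoformat()}
es.index(index=monthly_index(created), document=doc)

# Read: a date-bounded query only touches the matching monthly
# indices -- e.g. all of 2016 via a wildcard pattern.
resp = es.search(
    index="docs-2016-*",
    query={"match": {"title": "quarterly-report"}},
)
print(resp["hits"]["total"])
```

At 10 billion documents over 10 years, this works out to roughly 120 monthly indexes averaging about 80 million documents each, which is where the "still very large" objection above comes from.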
How else could you scale Elasticsearch so that it could handle searches across billions of documents?
Our current best idea is to create a separate Elasticsearch cluster per year. If a customer is looking for a Word document from 2016, our application would query the 2016 Elasticsearch cluster (a rough sketch of this routing is below).
This would keep our indexes small & the number of indexes per cluster manageable. There are obvious downsides to this approach: multiple clusters are hard to automate & manage.
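To make the idea concrete, this is a rough sketch of the per-year routing, again assuming the 8.x elasticsearch-py client; the hostnames, year range, and `docs` index name are hypothetical:

```python
# Rough sketch of cluster-per-year routing (elasticsearch-py 8.x).
# Hostnames, year range, and index name are hypothetical.
from elasticsearch import Elasticsearch

# One client per yearly cluster, e.g. es-2016.internal holds 2016 data.
CLUSTERS = {
    year: Elasticsearch(f"http://es-{year}.internal:9200")
    for year in range(2010, 2020)
}

def search_year(year: int, text: str):
    # Route the query to the single cluster that holds that year.
    es = CLUSTERS[year]
    return es.search(index="docs", query={"match": {"content": text}})

hits = search_year(2016, "quarterly report")
print(hits["hits"]["total"])
```

One consequence worth noting: any query spanning a year boundary would have to fan out to multiple clusters and merge results in the application layer.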
Is there a better way?