How would you 'time shard' documents by year?

I work for a software company that indexes & archives millions of Word, PDF, and Excel documents.

How would you time shard data in a way that would scale to 10 billion documents over 10 years?

  • Index per year => Index sizes would be too large.
  • Index per month => Indexes would still be very large, and the number of indexes would be excessive.
  • Index per data type => Indexes would be too large.
  • Index per week => The number of indexes would be excessive.
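To put rough numbers on the index-count concern, here is a quick back-of-envelope sketch (plain Python; the 2016–2025 date range is illustrative, not from the original post):

```python
from datetime import date

def index_count(start: date, end: date, period: str) -> int:
    """Rough count of time-based indexes needed between two dates."""
    days = (end - start).days
    if period == "year":
        return end.year - start.year + 1
    if period == "month":
        return (end.year - start.year) * 12 + (end.month - start.month) + 1
    if period == "week":
        return days // 7 + 1
    if period == "day":
        return days + 1
    raise ValueError(f"unknown period: {period}")

start, end = date(2016, 1, 1), date(2025, 12, 31)
for period in ("year", "month", "week", "day"):
    print(period, index_count(start, end, period))
# Over 10 years: 10 yearly, 120 monthly, ~522 weekly, ~3653 daily indexes.
```

Multiply each count by the number of document types and you can see how quickly "index per week per type" grows.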

How else could you scale elasticsearch so that it could handle searches across millions of documents?

Our current best idea is to create a separate Elasticsearch cluster per year. If a customer is looking for a Word document from 2016, our application would query the 2016 Elasticsearch cluster.

This would keep our indexes small and the number of indexes per cluster manageable. There are obvious downsides to this approach (multiple clusters are hard to automate and manage).
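The application-side routing for a cluster-per-year layout is simple to sketch. This is a minimal illustration, not the poster's actual setup; the cluster URL scheme and year range are assumptions:

```python
# Hypothetical routing table: one Elasticsearch cluster per year of data.
# URL naming is an assumption for illustration.
CLUSTERS = {year: f"https://es-{year}.internal:9200" for year in range(2016, 2026)}

def cluster_for(year: int) -> str:
    """Return the base URL of the cluster holding documents from `year`."""
    try:
        return CLUSTERS[year]
    except KeyError:
        raise ValueError(f"no cluster provisioned for {year}") from None
```

A search for "a Word document from 2016" would then go to `cluster_for(2016)`; a query spanning several years would fan out to multiple clusters and merge results in the application.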

Is there a better way?

10 years is long enough that you need to think about upgrade cycles. A cluster per year is a fairly good way of approaching it, because you can upgrade each one as you need to.

If you did that, you could have an index per day without any trouble, provided you didn't subdivide further. If you have different document types, then consider an index per week per type, but be careful that the number of types doesn't grow out of control. The trade-off between sparse indexes and an explosion in the number of indexes is worth watching, but I think your plan is good.
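The "index per week per type" idea above comes down to a naming convention your indexer and your search layer both agree on. A minimal sketch, assuming a hypothetical `docs-<type>-<year>.w<week>` naming scheme (not from the original post), using ISO week numbers:

```python
from datetime import date

def index_name(doc_type: str, day: date) -> str:
    """Weekly index per document type inside a per-year cluster.

    The 'docs-<type>-<year>.w<week>' pattern is illustrative only.
    ISO weeks keep every day assigned to exactly one week/year pair.
    """
    iso_year, iso_week, _ = day.isocalendar()
    return f"docs-{doc_type}-{iso_year}.w{iso_week:02d}"
```

With this scheme, the search layer can expand a date range into a wildcard like `docs-pdf-2016.w*` (or an explicit list of weekly indexes) and the per-year cluster keeps each index small.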
