How would you 'time shard' documents by year?

I work for a software company that indexes & archives millions of Word, PDF, and Excel documents.

How would you time shard data in a way that would scale to 10 billion documents over 10 years?

  • Index per year => Index sizes would be too large.
  • Index per month => Indexes would still be very large, and the number of indexes would be excessive.
  • Index per data type => Indexes would be too large.
  • Index per week => The number of indexes would be excessive.
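To put rough numbers on the index-count concern, here is a quick back-of-envelope sketch (plain Python; the 2016–2025 date range is illustrative, not from the original post):

```python
from datetime import date

def index_count(start: date, end: date, period: str) -> int:
    """Rough count of time-based indexes needed between two dates."""
    days = (end - start).days
    if period == "year":
        return end.year - start.year + 1
    if period == "month":
        return (end.year - start.year) * 12 + (end.month - start.month) + 1
    if period == "week":
        return days // 7 + 1
    if period == "day":
        return days + 1
    raise ValueError(f"unknown period: {period}")

start, end = date(2016, 1, 1), date(2025, 12, 31)
for period in ("year", "month", "week", "day"):
    print(period, index_count(start, end, period))
# Over 10 years: 10 yearly, 120 monthly, ~522 weekly, ~3653 daily indexes.
```

Multiply each count by the number of document types and you can see how quickly "index per week per type" grows.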

How else could you scale elasticsearch so that it could handle searches across millions of documents?

Our current best idea is to create a separate Elasticsearch cluster per year. If a customer is looking for a Word document from 2016, our application would query the 2016 Elasticsearch cluster.

This would keep our indexes small and the number of indexes per cluster manageable. There are obvious downsides to this approach (multiple clusters are hard to automate and manage).
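The application-side routing for a cluster-per-year layout is simple to sketch. This is a minimal illustration, not the poster's actual setup; the cluster URL scheme and year range are assumptions:

```python
# Hypothetical routing table: one Elasticsearch cluster per year of data.
# URL naming is an assumption for illustration.
CLUSTERS = {year: f"https://es-{year}.internal:9200" for year in range(2016, 2026)}

def cluster_for(year: int) -> str:
    """Return the base URL of the cluster holding documents from `year`."""
    try:
        return CLUSTERS[year]
    except KeyError:
        raise ValueError(f"no cluster provisioned for {year}") from None
```

A search for "a Word document from 2016" would then go to `cluster_for(2016)`; a query spanning several years would fan out to multiple clusters and merge results in the application.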

Is there a better way?

10 years is long enough that you need to think about upgrade cycles. A cluster per year is a fairly good way of approaching it, because you can upgrade each one as you need to.

If you did that, you could have an index per day without any trouble, provided you didn't subdivide further. If you have different document types, then consider an index per week per type, but be careful that the number of types doesn't grow out of control. The trade-off between sparse indexes and an explosion in the number of indexes is worth watching, but I think your plan is good.
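The "index per week per type" idea above comes down to a naming convention your indexer and your search layer both agree on. A minimal sketch, assuming a hypothetical `docs-<type>-<year>.w<week>` naming scheme (not from the original post), using ISO week numbers:

```python
from datetime import date

def index_name(doc_type: str, day: date) -> str:
    """Weekly index per document type inside a per-year cluster.

    The 'docs-<type>-<year>.w<week>' pattern is illustrative only.
    ISO weeks keep every day assigned to exactly one week/year pair.
    """
    iso_year, iso_week, _ = day.isocalendar()
    return f"docs-{doc_type}-{iso_year}.w{iso_week:02d}"
```

With this scheme, the search layer can expand a date range into a wildcard like `docs-pdf-2016.w*` (or an explicit list of weekly indexes) and the per-year cluster keeps each index small.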
