Reindexing a large collection into time-based indices


(Yoitsro) #1

Assuming I have a 30-shard index with over 200 million documents in it and I wanted to split these out into time-based indices, how would I do this without affecting response times? The other issue is storage space, but I could easily scale up the instances before reindexing.

Cheers,
Ro.


(Nik Everett) #2

200 million is usually fine. Splitting it into smaller indexes will help
if you can write your queries so they only target the indexes that contain
the docs. In 5.0 we rewrite the queries on the target shards, so if an
index doesn't have any docs in the time range the query becomes a
match_none and is cheap.
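
To illustrate targeting only the indexes that can contain matching docs, here is a minimal sketch that works out which monthly indexes cover a time range. The `events-YYYY.MM` naming scheme and the `indices_for_range` helper are assumptions for illustration, not something from this thread.

```python
from datetime import date
from typing import List


def indices_for_range(start: date, end: date, prefix: str = "events-") -> List[str]:
    """Return the monthly index names covering start..end (inclusive).

    The "events-YYYY.MM" naming scheme is a hypothetical convention.
    """
    if end < start:
        raise ValueError("end must not be before start")
    names = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}.{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


# The names can be joined with commas to form the index part of a search path.
search_target = ",".join(indices_for_range(date(2016, 11, 15), date(2017, 1, 3)))
```

The search is then sent only to those indexes, so indexes outside the range are never touched.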

Anyway, yeah, your best bet is to reindex using the time ranges in the
filter. I'd add more space to the cluster rather than try to juggle things:
delete-by-query isn't a good way to free space, so you can't easily juggle
the free space.
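
A rough sketch of what "reindex using the time ranges in the filter" could look like: one `_reindex` request body per month, each with a range query on the time field. The source index name `events`, the field name `timestamp`, the `events-YYYY.MM` destination naming, and the `reindex_body` helper are all assumptions for illustration. Each body would be POSTed to `_reindex`; throttling with the `requests_per_second` URL parameter may help limit the impact on a live cluster.

```python
from datetime import date
from typing import Any, Dict


def next_month(d: date) -> date:
    """Return the first day of the month after the one containing d."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def reindex_body(source_index: str, month_start: date,
                 field: str = "timestamp",
                 dest_prefix: str = "events-") -> Dict[str, Any]:
    """Build a _reindex body copying one month of docs into a monthly index.

    Field and index names are hypothetical; adjust to your mapping.
    """
    return {
        "source": {
            "index": source_index,
            # Half-open range [month_start, next month) so months never overlap.
            "query": {
                "range": {
                    field: {
                        "gte": month_start.isoformat(),
                        "lt": next_month(month_start).isoformat(),
                    }
                }
            },
        },
        "dest": {"index": f"{dest_prefix}{month_start:%Y.%m}"},
    }


body = reindex_body("events", date(2016, 12, 1))
```

Running one month at a time keeps each reindex task small and makes it easy to pause or retry a single range.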


(Yoitsro) #3

Hey Nik,

Thanks very much for this. The other issue is that the index is live, with full read/write access across it. How would I ensure there's no data loss? And wouldn't there be a latency increase across the cluster while I was reindexing the documents?


(Nik Everett) #4

We don't really have a thing for live indexes. Sadly, that is a thing
you'll have to work out.

Do you have any restrictions on your access patterns? Sometimes that helps.

You could have the index write to both, but that can be difficult
depending.


(Yoitsro) #5

Ahh! That would be perfect actually! I think that's possible using the system we have.


(Nik Everett) #6

Oh no! I mistyped. Misphoned. Something. ES doesn't have a thing to have
the write forked to two indexes. That'd be a thing you'd have to do in your
application. Sorry!
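
For what it's worth, an application-side dual write could be sketched roughly like this. Everything here is an assumption for illustration: `index_fn` stands in for whatever call your application uses to index a document, and the `events` / `events-YYYY.MM` names and `timestamp` field are hypothetical.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List

# Hypothetical signature: index_fn(index_name, doc_id, document)
IndexFn = Callable[[str, str, Dict[str, Any]], None]


def dual_write(index_fn: IndexFn, doc_id: str, doc: Dict[str, Any],
               legacy_index: str = "events",
               prefix: str = "events-",
               field: str = "timestamp") -> List[str]:
    """Write one document to the old index and to its monthly index.

    Returns the list of index names written to, in order.
    """
    ts = datetime.fromisoformat(doc[field])
    targets = [legacy_index, f"{prefix}{ts:%Y.%m}"]
    for target in targets:
        # Same ID in both places, so a copy made by reindexing and a live
        # write land on the same document instead of creating duplicates.
        index_fn(target, doc_id, doc)
    return targets


# Example with a stand-in index function that just records the calls.
calls = []
written = dual_write(lambda idx, _id, _doc: calls.append((idx, _id)),
                     "1", {"timestamp": "2016-12-05T10:00:00"})
```

This only forks the writes; deciding which copy wins when reindexing meets a newer live write, and handling a failure of one of the two writes, is still up to the application.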


(Yoitsro) #7

No, that's all good actually. We can do this application side without much fuss. Thank you!

