Assuming I have a 30-shard index with over 200 million documents in it and I want to split it out into time-based indices, how would I do this without affecting response times? The other issue is storage space, but I could easily scale up the instances before reindexing.
200 million is usually fine. Splitting it into smaller indexes will help
if you can write your queries so they only target the indexes that contain
the docs. In 5.0 we rewrite the query on each target shard, so if an
index doesn't have any docs in the time range the query becomes a match_none,
which is cheap.
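To make that concrete, here's a rough sketch with the Python client. The index names, the @timestamp field, and the date range are all invented for illustration, not something from this thread:

```python
# Rough sketch (untested): index names, the "@timestamp" field, and the
# monthly naming pattern are assumptions for illustration only.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

time_range = {
    "range": {"@timestamp": {"gte": "2016-10-15", "lt": "2016-11-15"}}
}

# Cheapest: only hit the monthly indexes that can contain matching docs.
narrow = es.search(
    index="events-2016.10,events-2016.11",
    body={"query": time_range},
)

# In 5.0 you can be lazier and hit them all: on shards of an index whose
# docs can't fall in the range, the query is rewritten to match_none.
broad = es.search(index="events-*", body={"query": time_range})
```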
Anyway, yeah, your best bet is to reindex using time ranges in the
filter. I'd add more space to the cluster rather than try to juggle things:
delete-by-query isn't a good way to free space, so you can't easily reclaim
room as you go.
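For the reindex itself, something along these lines with the _reindex API is what I have in mind. Again the index and field names are made up, and the throttle value is just a placeholder you'd tune for your cluster:

```python
# Rough sketch (untested): carve one month out of the big index into a
# monthly index with _reindex. "big-index", "events-2016.11" and
# "@timestamp" are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.reindex(
    body={
        "source": {
            "index": "big-index",
            "query": {
                "range": {
                    "@timestamp": {"gte": "2016-11-01", "lt": "2016-12-01"}
                }
            },
        },
        "dest": {"index": "events-2016.11"},
    },
    wait_for_completion=False,   # run as a background task
    requests_per_second=500,     # throttle to limit impact on live traffic
)
```

Repeat per time range until the old index is fully copied, then swap your queries (or an alias) over to the new indexes.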
Thanks very much for this. The other issue is that the index is live, with full read/write access across it. How would I ensure there's no data loss? And wouldn't there be a latency increase across the cluster while I was reindexing the documents?
Oh no! I mistyped. Mis-phoned. Something. ES doesn't have a built-in way to
fork writes to two indexes. That'd be something you'd have to do in your
application. Sorry!
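If you do fork the writes yourself, it's just your application indexing each document twice while the migration runs: once into the existing index and once into the new monthly index it belongs in. A made-up sketch (index names, field, and doc_type are assumptions, written against the 2.x/5.x-era client):

```python
# Rough sketch (untested): dual-write each document to the old live index
# and to the new time-based index during the migration window.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def index_event(doc_id, doc):
    ts = datetime.now(timezone.utc)
    doc = dict(doc, **{"@timestamp": ts.isoformat()})
    # write to the existing live index, as before
    es.index(index="big-index", doc_type="event", id=doc_id, body=doc)
    # and to the monthly index the doc belongs in
    es.index(index=ts.strftime("events-%Y.%m"), doc_type="event",
             id=doc_id, body=doc)

index_event("42", {"message": "hello"})
```

Once the backfill reindex catches up with the point where dual-writing started, the new indexes have everything and you can stop writing to the old one.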