We rely heavily on the parent-child structure and want to migrate from a 1.7 cluster to a 2.3 cluster.
I have researched this a bit and can see that we need to reindex our data to take advantage of the new way parent-child is implemented in 2.x. I was planning to do this with a fairly straightforward scan and scroll, but I have recently seen mention of a Reindex API.
Could someone tell me what is now the best way to accomplish this goal? (Sorry to ask a question that I should probably be able to answer from my own research, but things are moving so fast that it can be hard to be sure of having found the most up-to-date information.)
If you're upgrading to a version recent enough to have the Reindex API (it was added in 2.3), that's probably the easiest way to go about it. Under the covers, the Reindex API is essentially doing a scan + bulk, just like you would. It just manages the process for you and ties into the new Task Management API, so you can cancel the process if you want.
It also has some niceties, such as being able to run basic scripts that update docs on the fly.
But if you already have scan + bulk scripts written and are more comfortable using those, they will give you an identical end result. The Reindex API doesn't really do anything special.
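For reference, here is a minimal sketch of what a Reindex call could look like from Python, using only the standard library. It assumes that both indices live in the same 2.3+ cluster, that the destination index already exists with the parent-child mapping you want, and that inline scripting is enabled if you pass a script. The helper names (`build_reindex_body`, `run_reindex`), the index names and the URL are made up for illustration.

```python
import json
import urllib.request


def build_reindex_body(source_index, dest_index, script=None):
    """Build the JSON body for a POST /_reindex request."""
    body = {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }
    if script is not None:
        # Assumption: Elasticsearch 2.x expects the script source under "inline".
        body["script"] = {"inline": script}
    return body


def run_reindex(base_url, body):
    """POST the body to the cluster's _reindex endpoint and return the parsed reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage (index names and URL are placeholders):
# body = build_reindex_body("products_v1", "products_v2")
# print(run_reindex("http://localhost:9200", body))
```

The body is built separately from the HTTP call so you can inspect or log it before sending anything to the cluster.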
Awesome - thanks for the quick, comprehensive answer. I don't have the scripts written; I'm just getting started on the process. OK, I'll look at the Reindex API. Thanks again.
Np! Make sure you take a backup/snapshot before starting. I'm not aware of any outstanding bugs in the Reindex API, but it's relatively new, so when dealing with big operations like this I try to be cautious. Better to have a backup and not need it than to run into a bug and not have a much-needed backup.
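A rough sketch of taking that snapshot from Python, assuming a snapshot repository has already been registered on the cluster. The repository name, snapshot name, URL and helper names (`snapshot_url`, `take_snapshot`) are placeholders for illustration, not anything from your setup.

```python
import json
import urllib.parse
import urllib.request


def snapshot_url(base_url, repository, snapshot, wait=True):
    """Build the URL for PUT /_snapshot/<repository>/<snapshot>."""
    path = "/_snapshot/{}/{}".format(
        urllib.parse.quote(repository, safe=""),
        urllib.parse.quote(snapshot, safe=""),
    )
    # wait_for_completion=true makes the call block until the snapshot is done.
    query = "?wait_for_completion=true" if wait else ""
    return base_url.rstrip("/") + path + query


def take_snapshot(base_url, repository, snapshot):
    """Send the snapshot request and return the parsed JSON reply."""
    request = urllib.request.Request(
        snapshot_url(base_url, repository, snapshot), method="PUT"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage (names and URL are placeholders):
# print(take_snapshot("http://localhost:9200", "my_backup", "pre_reindex"))
```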