If we have hundreds of TBs of data in the cluster and want to apply the model to all of it, is reindexing with the pipeline (and deleting the old indexes) the best approach?
And if so, if we need to iteratively experiment with different models side by side, would each new model mean reindexing the whole cluster with an updated ingest pipeline?
You would indeed need to reindex when adding new types of embeddings.
I'd recommend using a subset of your data in a separate index for experimenting and comparing results rather than reindexing TBs of data with multiple models.
You can also add multiple models (processors) with different NLP techniques in the same pipeline so you can generate multiple embeddings for the same data in a single reindex.
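As a rough sketch of what that experiment setup could look like (assuming an unsecured local cluster at localhost:9200, hypothetical deployed model IDs `model_a` and `model_b`, a source index called `my-source-index`, and a document field called `content` — adjust all of these to your setup, and note the exact inference processor options vary by Elasticsearch version):

```python
import requests

ES = "http://localhost:9200"  # assumption: unsecured local cluster

# One pipeline, two inference processors: each candidate model writes its
# embedding to its own target field so results can be compared side by side.
pipeline = {
    "description": "Embeddings from two candidate models for comparison",
    "processors": [
        {
            "inference": {
                "model_id": "model_a",                   # hypothetical deployed model ID
                "target_field": "embedding_a",
                "field_map": {"content": "text_field"},  # map our field to the model's input field
            }
        },
        {
            "inference": {
                "model_id": "model_b",                   # hypothetical deployed model ID
                "target_field": "embedding_b",
                "field_map": {"content": "text_field"},
            }
        },
    ],
}
requests.put(f"{ES}/_ingest/pipeline/embedding-experiment", json=pipeline).raise_for_status()

# Reindex only a sample of the source data through that pipeline into a
# separate experiment index, rather than touching the full dataset.
reindex = {
    "max_docs": 10000,  # keep the experiment small
    "source": {"index": "my-source-index"},
    "dest": {"index": "embedding-experiment", "pipeline": "embedding-experiment"},
}
requests.post(f"{ES}/_reindex?wait_for_completion=false", json=reindex).raise_for_status()
```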
Once you settle on a model, you can then reindex your entire original data, and set up the ingest pipeline so the processor is applied automatically to whatever new data comes in from that point onwards.
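A minimal sketch of that last step, assuming a hypothetical final pipeline named `embedding-final` and a destination index `my-new-index` that already exists with an appropriate mapping: setting `index.default_pipeline` means new documents are enriched automatically, and the same default pipeline is applied during the full reindex.

```python
import requests

ES = "http://localhost:9200"  # assumption: unsecured local cluster

# Make the chosen pipeline the index default so any new documents written
# from now on are run through the inference processor automatically.
requests.put(
    f"{ES}/my-new-index/_settings",
    json={"index": {"default_pipeline": "embedding-final"}},
).raise_for_status()

# Reindex the full original data; with no explicit dest pipeline, the
# destination's default_pipeline is applied to the reindexed documents too.
resp = requests.post(
    f"{ES}/_reindex?wait_for_completion=false",
    json={
        "source": {"index": "my-source-index"},
        "dest": {"index": "my-new-index"},
    },
)
resp.raise_for_status()
print(resp.json()["task"])  # poll GET _tasks/<task id> to track progress
```

Running the reindex with `wait_for_completion=false` is worth it at this scale: the request returns a task ID immediately instead of holding the connection open for the duration of the job.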