Best way to introduce multiple new NLP models to existing indexes?

I am following this documentation: Add NLP inference to ingest pipelines | Machine Learning in the Elastic Stack [8.14] | Elastic, and I learned that I can set up an ingest pipeline with an inference processor and use it to reindex existing indexes into a new one.

If we have hundreds of TBs of data in the cluster and want to apply the model to all of it, is reindexing with the pipeline (and then deleting the old indexes) the best approach?

And if so, if we need to experiment iteratively with different models side by side, does introducing each new model mean reindexing the whole cluster with an updated ingest pipeline?

Hi!

You would indeed need to reindex when adding new types of embeddings.

I'd recommend experimenting and comparing results on a subset of your data in a separate index, rather than reindexing TBs of data with multiple models.
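For example, you could copy a limited sample into an experiment index with the Reindex API's `max_docs` parameter, running it through your pipeline on the way in. The index and pipeline names here are placeholders; substitute your own:

```
POST _reindex
{
  "max_docs": 10000,
  "source": {
    "index": "my-data"
  },
  "dest": {
    "index": "my-data-experiment",
    "pipeline": "my-nlp-pipeline"
  }
}
```

That gives you a small, representative corpus to evaluate each model against without touching the full dataset.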

You can also add multiple inference processors (one per model) to the same pipeline, so a single reindex generates multiple embeddings for the same data.
Once you settle on a model, you can reindex your original data in full and set the pipeline as the index's default pipeline, so the processor runs automatically on whatever new data comes in from that point onwards.
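A sketch of such a pipeline, with two inference processors writing to separate target fields so the embeddings don't overwrite each other (the pipeline name, field names, and model IDs are illustrative — use the IDs of the models you've actually deployed):

```
PUT _ingest/pipeline/my-nlp-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
        "target_field": "ml.minilm",
        "field_map": {
          "body": "text_field"
        }
      }
    },
    {
      "inference": {
        "model_id": "my-second-model",
        "target_field": "ml.second_model",
        "field_map": {
          "body": "text_field"
        }
      }
    }
  ]
}
```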
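Concretely, the final cutover could look something like this (names are again placeholders; `wait_for_completion=false` runs the large reindex as a background task):

```
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "my-data"
  },
  "dest": {
    "index": "my-data-embeddings",
    "pipeline": "my-nlp-pipeline"
  }
}

PUT my-data-embeddings/_settings
{
  "index.default_pipeline": "my-nlp-pipeline"
}
```

With `index.default_pipeline` set, new documents indexed without an explicit pipeline parameter are routed through the pipeline automatically.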
