Vectorizing documents - very slow

I am trying to create an NLP pipeline using Charangan/MedBERT · Hugging Face. Ingesting documents with this model through an Elasticsearch ML ingest pipeline is running very slowly: with a dockerized setup on my local machine with 10 GB of RAM (8 GB dedicated just to Elasticsearch via MEM_LIMIT), 1.5 GB of swap, and 4 CPUs, I'm getting roughly 1 document every 22 seconds. We have ~1M documents, which means vectorization alone would take months. Indexing all the docs without vectorizing is reasonably fast (under 30 minutes). Does anyone have suggestions for speeding up the vectorization process, or anything I might be missing? I also got this info log, which made me think this model isn't optimized for the text_embedding task type I'm trying to use:

Some weights of the model checkpoint at Charangan/MedBERT were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
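
For context, the ingest side is set up roughly like this (a minimal sketch; the model id, field names, and target field are placeholders from my setup, not exact values):

```python
# Illustrative ingest pipeline body with an inference processor, as it
# would be sent to PUT _ingest/pipeline/<name>. Names are placeholders.
pipeline_body = {
    "description": "Vectorize documents with a trained model",
    "processors": [
        {
            "inference": {
                "model_id": "charangan__medbert",        # placeholder deployment id
                "target_field": "ml",                    # where results are written
                "field_map": {"body": "text_field"},     # doc field -> model input
            }
        }
    ],
}
```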

@Cole_Crawford what is your Elasticsearch version?

By default, I think model deployments only use a single allocation with a single thread. You'll need to increase the number of allocations for better throughput: Start trained model deployment API | Elasticsearch Guide [8.6] | Elastic
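As a sketch (the model id is a placeholder), those are query parameters on the start trained model deployment API:

```python
# Hypothetical example of the start trained model deployment call:
# POST _ml/trained_models/<model_id>/deployment/_start
# The model id below is a placeholder for whatever eland imported.
model_id = "charangan__medbert"
endpoint = f"_ml/trained_models/{model_id}/deployment/_start"
params = {
    "number_of_allocations": 2,   # independent copies of the model (throughput)
    "threads_per_allocation": 2,  # threads per copy (per-request latency)
}
```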

Additionally, that model is not a text_embedding model. It is a base model with the fill_mask task. You should either adapt this model for text embedding yourself or use a different model.
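For reference, when importing a model that does support embeddings, the task type is set at import time via eland's CLI; a minimal sketch (the cluster URL and model id are examples, assembled here as a string so the flags are visible):

```python
# Illustrative eland import command; --task-type is what controls how
# Elasticsearch uses the model. URL and model id are examples only.
import shlex

cmd = (
    "eland_import_hub_model"
    " --url http://localhost:9200"
    " --hub-model-id pritamdeka/S-PubMedBert-MS-MARCO"
    " --task-type text_embedding"
    " --start"
)
args = shlex.split(cmd)
```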

Also @Cole_Crawford, PyTorch model inference is done off the JVM heap, so you should probably decrease the JVM heap size that Elasticsearch uses.

Thanks @BenTrent. I decreased the JVM heap relative to the total available RAM; the recommendation for ML nodes looks to be 35-40% of available memory: Sizing for Machine Learning with Elasticsearch | Elastic Blog
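As a rough back-of-the-envelope (assuming the 8 GB container from my setup above):

```python
# Rough sizing per the 35-40% guideline for ML nodes: keep the JVM heap
# small so off-heap (native PyTorch) inference has room to work.
total_ram_gb = 8
heap_gb = total_ram_gb * 0.40        # ~3.2 GB for the JVM heap
native_gb = total_ram_gb - heap_gb   # ~4.8 GB left for inference and OS cache
```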

I switched to pritamdeka/S-PubMedBert-MS-MARCO · Hugging Face, which looks to be a better fit for this application, and increased the number of allocations and threads per allocation. The way I understand it, the number of allocations * threads per allocation can't exceed the total number of CPUs allocated to the container (or available on the host)? So if I have 6 CPUs dedicated to Docker, and a couple are used by a web app and Kibana, then I should only allocate 4 CPUs to the ML node? As 2 allocations with 2 threads per allocation?
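The constraint I'm describing, as a quick sanity check (CPU counts taken from my setup; this is just my understanding, not an official formula):

```python
# allocations * threads_per_allocation should not exceed available CPUs.
available_cpus = 4  # 6 for Docker minus ~2 for the web app and Kibana

def fits(allocations, threads_per_allocation, cpus=available_cpus):
    """True if this deployment shape fits in the CPU budget."""
    return allocations * threads_per_allocation <= cpus

print(fits(2, 2))  # 2 x 2 = 4 CPUs: fits
print(fits(4, 2))  # 4 x 2 = 8 CPUs: oversubscribed
```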

2 allocations with 2 threads per allocation?

That would be a good place to start, or 4 allocations with 1 thread each. At indexing time you are usually concerned with throughput, so more allocations is better.