I am trying to build an NLP pipeline using the Charangan/MedBERT model from Hugging Face. Ingesting documents through this model with an Elasticsearch ML ingest pipeline is running very slowly: with a dockerized setup on my local machine with 10GB of RAM (8GB dedicated just to Elasticsearch via MEM_LIMIT), 1.5GB of swap, and 4 CPUs, I'm getting roughly 1 document every 22s. We have ~1M documents, so at that rate the vectorization alone would take months. Indexing all the docs without vectorizing is reasonably fast (< 30 minutes). Does anyone have suggestions for speeding up the vectorization? Anything I might be missing here? I also got this info log, which made me think this model isn't optimized for task type text_embedding, which is what I'm trying to use:
Some weights of the model checkpoint at Charangan/MedBERT were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
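For what it's worth, that log appears to be the standard transformers message when a BERT pretraining checkpoint (with its `cls.*` heads) is loaded as a plain `BertModel`, so by itself it may not mean the model is unusable for embeddings. For context, this is roughly the shape of the ingest pipeline I'm using; the model id and field names below are illustrative placeholders, not my exact config:

```python
# Illustrative sketch only: an inference processor for text_embedding.
# The model id and field names are placeholders (when imported via eland,
# the Hugging Face id is typically lowercased with "/" replaced by "__").
pipeline_body = {
    "description": "Vectorize documents with MedBERT",
    "processors": [
        {
            "inference": {
                "model_id": "charangan__medbert",       # placeholder id
                "target_field": "text_embedding",       # where the vector lands
                "field_map": {"body": "text_field"},    # my source field -> model input
            }
        }
    ],
}
```

In my actual setup this body is registered as an ingest pipeline and referenced at index time via the `pipeline` parameter.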
I switched to pritamdeka/S-PubMedBert-MS-MARCO from Hugging Face, which looks to be a better fit for this application, and I increased the number of allocations and the threads per allocation. The way I understand it, number of allocations × threads per allocation can't exceed the total number of CPUs allocated to the container (or available on the host). So if I have 6 CPUs dedicated to Docker, and a couple are used by a web app and Kibana, then I should only give 4 CPUs to the ML node — e.g. 2 allocations with 2 threads per allocation?
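To sanity-check my own understanding of that sizing rule, here is a small illustrative helper (just my assumption that allocations × threads must fit within the CPUs left for the ML node, not anything from the docs verbatim):

```python
def deployment_options(available_cpus):
    """All (allocations, threads_per_allocation) pairs that fit on the node,
    under the assumed rule: allocations * threads <= available CPUs."""
    return [
        (allocations, threads)
        for allocations in range(1, available_cpus + 1)
        for threads in range(1, available_cpus + 1)
        if allocations * threads <= available_cpus
    ]

# With 4 CPUs left for the ML node, these combinations use all 4:
full_use = [(a, t) for (a, t) in deployment_options(4) if a * t == 4]
# e.g. (4, 1), (2, 2), (1, 4) -- my understanding is that more allocations
# favors ingest throughput, while more threads favors per-request latency,
# so for bulk vectorization I'd lean toward 4 allocations x 1 thread.
```

The deployment itself would then be started (or restarted) with those values, e.g. via `POST _ml/trained_models/<model_id>/deployment/_start?number_of_allocations=4&threads_per_allocation=1` — happy to be corrected if I have the rule or the tradeoff backwards.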