We are planning to build a vector-based search application using a pretrained machine learning model based on mBERT.
I wrote some code to check how Elasticsearch works, and I found that the pytorch_inference process randomly disappears during the vector computation step (reindex).
There is no stack trace or Elasticsearch WARN/ERROR log entry about it.
Our program keeps polling to check whether the reindex task has finished, and the "created" number in the progress stops increasing once the pytorch_inference process is gone.
The timing of the pytorch_inference disappearance seems very random: sometimes after 1,000 documents have been computed, sometimes after 48,000...
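For reference, this is roughly how we start the reindex and poll its progress (the index and pipeline names below are placeholders, not our real ones):

POST _reindex?wait_for_completion=false
{
  "source": { "index": "articles" },
  "dest": { "index": "articles-vectors", "pipeline": "mbert-embedding-pipeline" }
}

GET _tasks/<task_id>

The response to the POST contains the task id, and the "created" counter we watch is under task.status in the GET _tasks response.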
I run Elasticsearch in a Docker container (docker-compose) and gave it 12 cores to make the ML inference faster.
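We start the model deployment with explicit thread settings, something like this (a sketch; the numbers are just what we happened to pick to fill the 12 cores):

POST _ml/trained_models/<model_id>/deployment/_start?number_of_allocations=3&threads_per_allocation=4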
The pytorch_inference process is where the model is evaluated, so if that has disappeared the reindex task cannot progress. I am very surprised there is nothing in the logs; try searching the logs for the model ID of your model.
Depending on the cause of the failure, the GET _stats endpoint may report something interesting.
GET _ml/trained_models/<model_id>/_stats
Look for the fields deployment_stats.allocation_status and deployment_stats.nodes.routing_state. What is the status?
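For illustration, the interesting part of the response looks roughly like this (heavily trimmed, and from memory, so treat the exact shape as approximate):

"deployment_stats": {
  "allocation_status": {
    "allocation_count": 1,
    "target_allocation_count": 1,
    "state": "fully_allocated"
  },
  "nodes": [
    { "routing_state": { "routing_state": "failed", "reason": "..." } }
  ]
}

If the process has died, I would expect a routing_state of failed (possibly with a reason) rather than started.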
I will try to recreate the problem of pytorch_inference disappearing. What version of Elasticsearch are you using, and are you using the Elasticsearch Docker image from https://www.docker.elastic.co/ or did you create it yourself?
The pytorch_inference process is where the model is evaluated, so if that has disappeared the reindex task cannot progress. I am very surprised there is nothing in the logs; try searching the logs for the model ID of your model.
I searched the container stdout output for the model ID and other machine learning related entries, and I looked in /var/log/ and /usr/share/elasticsearch inside the container, but I could not find an Elasticsearch log file.
Does the Elasticsearch Docker image write logs to a file as well as to stdout?
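For reference, this is roughly what I ran (the container name es01 is just an example):

docker logs es01 2>&1 | grep -i "<model_id>"
docker exec es01 ls /usr/share/elasticsearch/logs /var/log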
Look for the fields deployment_stats.allocation_status and deployment_stats.nodes.routing_state. What is the status?
OK, I will try it during my working hours tomorrow (I live in Japan, so it is about to be night here...)
I will try to recreate the problem of pytorch_inference disappearing. What version of Elasticsearch are you using, and are you using the Elasticsearch Docker image from https://www.docker.elastic.co/ or did you create it yourself?
Here is our Dockerfile. It is based on the official Elasticsearch 8.7.0 image, and I think the additions are not that big.
FROM docker.elastic.co/elasticsearch/elasticsearch:8.7.0
# set the specific password for Elasticsearch
ENV ELASTIC_PASSWORD=elastic
RUN bin/elasticsearch-plugin install analysis-kuromoji
RUN bin/elasticsearch-plugin install analysis-icu
RUN sysctl -w vm.max_map_count=262144
It is quite likely that the pytorch_inference process is being terminated by the Out Of Memory Killer (the process runs at a lower priority, so the OOM killer will choose to terminate pytorch_inference rather than Elasticsearch).
Can you run the container with more memory? Note this is not the same as the JVM heap memory used by Elasticsearch; there must be memory in the container outside of the JVM for the pytorch_inference process.
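For example (a rough sketch with made-up numbers, using Compose v2-style resource settings), you could raise the container memory limit and cap the JVM heap so there is room left over for pytorch_inference:

services:
  elasticsearch:
    mem_limit: 16g
    environment:
      - ES_JAVA_OPTS=-Xms8g -Xmx8g

The important thing is that the container memory limit is comfortably larger than the JVM heap, because the native pytorch_inference process lives in that extra space.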