ELSER v2 model inference getting stuck

Hello Team,

We need your expert advice on the issue of ELSER v2 inference getting stuck.

Elasticsearch version: 8.11.1
ML nodes: 2
Allocations are attached below

We have two separate model deployments, one for queries and one for ingestion. But we have noticed inference getting stuck: the pending request count keeps increasing and inference requests stop completing.
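For context, this is roughly how we check the queue when it happens. A sketch, assuming the built-in `.elser_model_2` model id (the counters named in the comments are the ones we watch):

```
GET _ml/trained_models/.elser_model_2/_stats

// The response lists every deployment of the model with per-node counters.
// The ones that matter when inference stalls:
//   deployment_stats.nodes[].number_of_pending_requests  (grows while stuck)
//   deployment_stats.nodes[].error_count
//   deployment_stats.nodes[].timeout_count
//   deployment_stats.nodes[].rejected_execution_count
```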

At the time of the issue, ML node CPU and memory utilisation were normal, and we could not find any logs explaining the cause. As an interim fix we restarted the deployment and it started working again. But with the issue recurring frequently in production, we want to find the root cause.
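By "restarted the deployment" we mean the standard stop/start cycle. A sketch, assuming a hypothetical deployment id `elser-ingest` and a sizing of two allocations (your ids and sizes will differ):

```
// Stop the stuck deployment; force because requests are hung
POST _ml/trained_models/elser-ingest/deployment/_stop?force=true

// Start it again under the same deployment id and sizing
POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=elser-ingest&number_of_allocations=2&threads_per_allocation=2
```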

Hence, we need your guidance on how to observe this. If an error occurs, where could we find its root cause? Your suggestions around this would help us.

note:

Ingestion goes through an ingest pipeline that creates the embeddings, and the pipeline has an on_failure step. Even so, documents are not getting ingested, which leaves our ingestion in a hung state.
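To make the setup concrete, a sketch of the pipeline shape (the pipeline name, field names, and dead-letter index here are placeholders, not our real configuration; `elser-ingest` is the hypothetical ingest deployment id):

```
PUT _ingest/pipeline/elser-embeddings
{
  "processors": [
    {
      "inference": {
        "model_id": "elser-ingest",
        "input_output": [
          {
            "input_field": "content",
            "output_field": "content_embedding"
          }
        ]
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Route failed documents to a dead-letter index",
        "field": "_index",
        "value": "failed-{{{_index}}}"
      }
    },
    {
      "set": {
        "description": "Record why the pipeline failed",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}
```

The on_failure handler only fires when the inference processor actually returns an error; if the request simply hangs, nothing fails, which would match the hung state we see.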

On the query side, we are using a text_expansion query, which is timing out after 10 s.
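The query is roughly shaped like this (the index and field names are placeholders; `elser-search` is the hypothetical id of the search deployment). The 10 s we see also happens to match the default ML inference timeout, though we have not confirmed that is what trips:

```
GET my-index/_search
{
  "timeout": "10s",
  "query": {
    "text_expansion": {
      "content_embedding": {
        "model_id": "elser-search",
        "model_text": "the end-user query text"
      }
    }
  }
}
```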

There is a known issue with Elasticsearch 8.11 where the model will freeze or stop working. The cause appeared to be the IPEX library, which was added in 8.11 to enhance model inference speed on Intel hardware. Unfortunately there were side effects, and the library was removed in 8.12.

I recommend upgrading to the latest version, or 8.12 at a minimum; that should fix the timeouts.

Thanks @dkyle for your suggestion. Could you please advise us on what aspects to observe and how to implement observability for ELSER inference?