Hello Team,
We need your expert advice on an issue where ELSER v2 inference gets stuck.
Elasticsearch version: 8.11.1
ML nodes: 2
Allocations are attached below
We run two separate deployments of the model, one for search and one for ingest. We have noticed inference getting stuck: the number of pending requests keeps growing and inference stops making progress.
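For context, the pending request count we are watching is the `number_of_pending_requests` field in the trained model deployment stats, along these lines (the model id shown is the built-in ELSER v2 id; ours may differ):

```
GET _ml/trained_models/.elser_model_2/_stats
```

The per-node deployment stats in the response also report `error_count`, `rejected_execution_count` and `timeout_count` alongside `number_of_pending_requests`.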
At the time of the issue, CPU and memory utilisation on the ML nodes looked normal, and we could not find any logs pointing to the cause. As an interim fix we restarted the deployment and it started working again, but since this keeps happening in production we want to find the root cause.
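The restart we do as the interim fix is essentially a stop/start of the deployment (the deployment id `elser-ingest` here is a placeholder for ours):

```
# stop the stuck deployment, forcing past any queued requests
POST _ml/trained_models/elser-ingest/deployment/_stop?force=true

# start it again under the same deployment id
POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=elser-ingest
```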
Hence, we would like your guidance on the observability side: when something like this happens, where should we look to find the root cause? Any suggestions here would help us.
Notes:
- Ingestion goes through an ingest pipeline that creates the embeddings, and the inference processor has an on_failure handler (a sketch of our pipeline is below). Even so, documents stop getting ingested and our ingestion stays in a hung state.
- For search we use a text_expansion query (also sketched below), which is timing out after 10s.
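Our pipeline looks roughly like this; the pipeline id and field names are simplified, and `elser-ingest` again stands in for our ingest-only deployment. The on_failure step only records the error message, so we would expect a failed document to still be indexed:

```
PUT _ingest/pipeline/elser-embeddings
{
  "processors": [
    {
      "inference": {
        "model_id": "elser-ingest",
        "input_output": [
          {
            "input_field": "content",
            "output_field": "content_embedding"
          }
        ],
        "on_failure": [
          {
            "set": {
              "field": "ingest_failure",
              "value": "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ]
}
```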
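And the query side looks roughly like this (index name, field name, and the `elser-search` deployment id are placeholders):

```
GET my-index/_search
{
  "query": {
    "text_expansion": {
      "content_embedding": {
        "model_id": "elser-search",
        "model_text": "example search terms"
      }
    }
  }
}
```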