The issue began when the Security AI Knowledge Base in Kibana remained stuck on the “Setup Knowledge Base” screen. After investigation, it was determined that the root cause was a failure in the Machine Learning native controller process on node my_node-2. The ELSER model deployment repeatedly failed with errors indicating that the PyTorch inference process could not start because the native controller process had stopped. This prevented Elastic’s semantic search model from running, which in turn caused the AI Knowledge Base setup to hang indefinitely.
The failure was not related to memory, disk space, or AppArmor restrictions. Instead, the ML native controller process had entered a broken state and could no longer spawn new inference processes. To resolve the issue safely in a production cluster with active shards and replicas, shard allocation was temporarily restricted to primaries only. Elasticsearch on the affected node was then restarted in a controlled manner. After the restart, the native controller process was properly recreated, the cluster recovered and reallocated shards successfully, and the overall cluster health returned to green. The ELSER deployment transitioned to a healthy “started” state, and the Knowledge Base interface began functioning normally again.