Deployment down because of an ml job

vmesis · July 2, 2025, 1:42pm

Hello everyone,
Recently, I had an issue with my deployment. I created an outlier detection job and it ran out of memory. However, before I could stop or reconfigure it, the deployment went down.

I already tried to restore a previous snapshot, but Kibana is also down, so I cannot access "Snapshot & Restore", and requests from the API console return nothing but errors.
I've also tried to temporarily increase the RAM, but the change failed.

I contacted the support team, but I only got an automatic response saying it will be looked into within 3 business days. I can't afford to wait that long, so if anyone has a suggestion, I'll gladly accept it!

Thank you for your understanding.

stephenb · July 2, 2025, 3:16pm

Hi @vmesis Welcome to the community.

Did you open the ticket as a Sev 1? I suspect you opened it as a Sev 3.

I would close that ticket and open a new Sev 1 ticket with Cluster Down / Red in the subject.

I would not try to go straight to Snapshot and Restore; your cluster needs to be healthy before you can do that.

We are not support here, so we cannot really help with underlying deployment issues.

vmesis · July 2, 2025, 3:34pm

Hi @stephenb, thanks for your reply.
Indeed, I opened it as a Sev 3, but not on purpose. I tried to open a case directly from the support page, but the technical support button is greyed out and disabled, even though my subscription includes it.
So instead, I sent an email directly to support, and it was automatically classified as Sev 3.

stephenb · July 2, 2025, 3:40pm

Perhaps someone else in your organization can use the support portal to properly open a ticket.

You may not have been added as a support contact.

stephenb · July 2, 2025, 8:37pm

I think you are getting a little help now...
