Transform stability

Hello

I've started to use transforms, and they are great: they save a lot of the pain of using external tools.

What concerns me is that a few times there was load on the cluster, which caused a node to leave the cluster and the transform to fail.

When I looked in the morning, it was in failed status, and when I stopped/started it, it filled the data successfully without issues.

Then I started thinking: what would happen if I took a few days off and didn't do that manually?
In my case, the client would see a gap in our front-end charts.

Is there any way to monitor a transform and start it automatically in such a case?


There's the cat transforms API | Elasticsearch Guide [7.13] | Elastic, but perhaps that's something we could also expose via Kibana(?). It might be worth creating a feature request around this :slight_smile:
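For a quick manual check from Kibana Dev Tools, something like this should do (a minimal sketch; the verbose flag only adds column headers):

GET _cat/transforms?v=true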

Can you share the error you saw? The reason field in the status should tell you why it failed.

Transform is designed to be fail-safe:

  • It distinguishes between permanent and recurring errors. Permanent errors are e.g. configuration errors, errors in scripts, mapping issues etc.; in other words, everything that won't fix itself with a retry.
  • If not permanent, a transform retries up to 10 times based on frequency. It must fail 10 times in a row to go into the failed state. Every successful operation resets the counter.

You mentioned cluster load, which indicates a non-permanent error. If you could share the error, I can verify transform picked the right category.

As a mitigation you can increase the number of retries by setting num_transform_failure_retries to a higher value.
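For example (a sketch only; I'm assuming the dynamic cluster setting key xpack.transform.num_transform_failure_retries here, and the value of 20 is arbitrary):

PUT _cluster/settings
{
  "persistent": {
    "xpack.transform.num_transform_failure_retries": 20
  }
}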

In addition, you can create a script that checks _stats (or the mentioned cat interface) and, based on the output, calls _start. You could also call _start regularly and ignore the error if it's already running.
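A minimal sketch of the two calls such a script would make (TRANSFORM_NAME is a placeholder; the _stats response contains the state and, for a failed transform, the failure reason):

GET _transform/TRANSFORM_NAME/_stats

POST _transform/TRANSFORM_NAME/_start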

In the future we plan to improve the way transform retries: today the interval between retries is static and based on frequency. We want to decouple retries from frequency and use exponential backoff.

Last but not least, it might be the transform itself that causes your cluster issues. You might want to check out our general guidance for optimizing transforms: Working with transforms at scale | Elasticsearch Guide [7.13] | Elastic.

Hi @Hendrik_Muhs
I just checked the transforms screen in Kibana and found a transform in a failed state. Unfortunately this happens a lot.
I think today I got this error when upgrading a version (Elastic Cloud).

This is the message from the messages tab:

Failed to index documents into destination index due to permanent error: [BulkIndexingException[Bulk index experienced [48] failures and at least 1 irrecoverable [pipeline with id [TRANSFORM_NAME-pipeline] does not exist]. Other failures: ]; nested: IllegalArgumentException[pipeline with id [TRANSFORM_NAME-pipeline] does not exist];; java.lang.IllegalArgumentException: pipeline with id [TRANSFORM_NAME-pipeline] does not exist]

Failed to start transform. Please stop and attempt to start again. Failure: Unable to start transform [TRANSFORM_NAME] as it is in a failed state with failure: [Failed to index documents into destination index due to permanent error: [BulkIndexingException[Bulk index experienced [48] failures and at least 1 irrecoverable [pipeline with id [TRANSFORM_NAME-pipeline] does not exist]. Other failures: ]; nested: IllegalArgumentException[pipeline with id [TRANSFORM_NAME-pipeline] does not exist];; java.lang.IllegalArgumentException: pipeline with id [TRANSFORM_NAME-pipeline] does not exist]]. Use force stop and then restart the transform once error is resolved.

Thanks for the error message.

Can you tell me more about the failing transform? Did you create this one (a transform might also be used as part of a solution)?

As the message says, this is a permanent error: The configuration refers to a pipeline that does not seem to exist. Restarting the transform won't help.

I wonder: this does not fit your initial post. Is this the same or another transform?

Can you check that the pipeline exists, e.g. by listing all your pipelines:

GET /_ingest/pipeline

Hey, thanks for replying so fast.
Yes, the pipeline exists.
I've seen this error before, but I suspect the error message doesn't reflect the error state of the cluster.
Yes, I did create it. I can't send the body here in the forum, but I can do that in private if you need.

I started it and it works fine now.

Yes, it's another transform, but a similar idea: it seems that (sometimes) whenever something changes in the cluster (load, a node leaving, a node extending, etc.) the transform just fails.

As an Elastic Cloud customer you can create a support request. The support engineer can request additional help from development (me) on demand; just refer to this conversation.

At least in this case, it seems to be an issue with ingest pipelines being unavailable. This might not be a transform problem but ingest instability. Still, I'd like to find out what's happening. Please always check whether it's this error or whether you see other failure reasons, and let me know.

Unfortunately the suggested num_transform_failure_retries won't help, because transform classifies this as a permanent error, which never gets retried. The only workaround for now is a script that regularly checks the state.

If I remember correctly, another user uses watcher to do something like this.

Thanks @Hendrik_Muhs.
Do you have an example of such a watcher script (or anything similar)?

I don't have an example at the moment, but I suggest having a look at the Watcher http input.
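A rough sketch of what such a watch could look like (all names, the interval, and the host/port are placeholders; authentication for the http input is omitted, and the action here only logs the failure rather than restarting the transform):

PUT _watcher/watch/transform_failed_check
{
  "trigger": { "schedule": { "interval": "10m" } },
  "input": {
    "http": {
      "request": {
        "host": "localhost",
        "port": 9200,
        "path": "/_transform/TRANSFORM_NAME/_stats"
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.transforms[0].state == 'failed'"
    }
  },
  "actions": {
    "log_failure": {
      "logging": {
        "text": "Transform TRANSFORM_NAME is in failed state: {{ctx.payload.transforms.0.reason}}"
      }
    }
  }
}

An automated restart would still need the force stop and start calls on top of this, e.g. via webhook actions.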


It just happened again, with the same pipeline error.
Unfortunately what I had to do is stop it and then start it again, so I'm not sure it'll help to just call start.
I'll try support, but it's very limited for standard customers.

Thanks for the heads up, I will investigate it. Unfortunately a failed transform requires a force stop and start. There is no force start. Nevertheless, the loss of pipelines should not happen.

Was there anything of interest happening before, e.g.

  • a node that dropped?
  • change of master?

How many ingest nodes do you have?
What version are you using?

I've attached the cluster architecture as an image. I hope it's easy to read :slight_smile:

No changes were made (not by us, anyway).
Version 7.13.1.

So to ensure stability, I need to:

  1. call _cat/transforms
  2. if the state is failed: force stop, then start
  3. else call start and ignore the "already started" error?

Thanks for the help.

Thanks!

Steps 1 and 2 should be sufficient; if you already know the state, there is no need to start it again.

Instead of using _cat you can use _stats; it might be easier to parse JSON instead of text.
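Put concretely, the sequence such a script would run is (a sketch; TRANSFORM_NAME is a placeholder, and a failed transform shows "state": "failed" together with a "reason" in the _stats response):

GET _transform/TRANSFORM_NAME/_stats

POST _transform/TRANSFORM_NAME/_stop?force=true

POST _transform/TRANSFORM_NAME/_start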

If I find out why your ingest pipelines drop, I'll let you know. It still might be good to open a support case and send diagnostics that way.

Do you also have transforms without ingest pipelines? They should not be affected, right?

Good point.

Correct.

I removed the pipeline from the transform and moved it to the index settings.
I have a feeling it will solve it (at least for me :slight_smile: )
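For anyone following along, moving the pipeline to the destination index settings could look roughly like this (a sketch with placeholder names, assuming the index.default_pipeline setting):

PUT TRANSFORM_DEST_INDEX/_settings
{
  "index.default_pipeline": "TRANSFORM_NAME-pipeline"
}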

Thanks for the update.

Did the error happen again after you switched to using the index template based pipeline?

(FYI I received the diagnostics from support, thanks for that.)

Thanks.
Correct. For now, no errors, but it didn't happen every day.

Hey @Hendrik_Muhs
It did happen again yesterday, without the pipeline definition.

Without the pipeline in the transform, but with the one in the index settings?
Do you have the error message?

  1. correct
  2. Failed to start transform. Please stop and attempt to start again. Failure: Unable to start transform [xxx] as it is in a failed state with failure: [Failed to index documents into destination index due to permanent error: [BulkIndexingException[Bulk index experienced [500] failures and at least 1 irrecoverable [pipeline with id [xxx_pipeline] does not exist]. Other failures: ]; nested: IllegalArgumentException[pipeline with id [xxx_pipeline] does not exist];; java.lang.IllegalArgumentException: pipeline with id [xxx_pipeline] does not exist]]. Use force stop and then restart the transform once error is resolved.

I just got this again when upgrading to the latest version using the Elastic Cloud interface.