Transform stability

Hello

I've started to use transforms, and they are great: they save a lot of the pain of using external tools.

What concerns me is that a few times there was load on the cluster, which caused a node to leave the cluster and the transform to fail.

When I looked in the morning, it was in failed status, and when I stopped/started it, it filled the data successfully without issues.

Then I started thinking: what would happen if I took a few days off and didn't do that manually?
In my case, the client would see a gap in our front-end charts.

Is there any way to monitor a transform and start it automatically in such a case?


There's the cat transforms API | Elasticsearch Guide [7.13] | Elastic, but perhaps that's something we could also expose via Kibana(?). It might be worth creating a feature request around this :slight_smile:
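For a quick manual check from Kibana Dev Tools, something like this should do (a minimal sketch; the verbose flag only adds column headers):

GET _cat/transforms?v=true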

Can you share the error you saw? The reason field in the status should tell you why it failed.

Transform is designed to be fail-safe:

  • It distinguishes between permanent and recurring errors. Permanent errors are e.g. configuration errors, errors in scripts, mapping issues etc.; in other words, everything that won't fix itself with a retry.
  • If not permanent, a transform retries up to 10 times based on frequency. It must fail 10 times in a row to go into the failed state. Every successful operation resets the counter.

You mentioned cluster load, which indicates a non-permanent error. If you could share the error, I can verify transform picked the right category.

As a mitigation you can increase the number of retries by setting num_transform_failure_retries to a higher value.
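For example (a sketch only; I'm assuming the dynamic cluster setting key xpack.transform.num_transform_failure_retries here, and the value of 20 is arbitrary):

PUT _cluster/settings
{
  "persistent": {
    "xpack.transform.num_transform_failure_retries": 20
  }
}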

In addition, you can create a script that checks _stats (or the mentioned cat interface) and, based on the output, calls _start. You could also call _start regularly and ignore the error if it's already running.
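A minimal sketch of the two calls such a script would make (TRANSFORM_NAME is a placeholder; the _stats response contains the state and, for a failed transform, the failure reason):

GET _transform/TRANSFORM_NAME/_stats

POST _transform/TRANSFORM_NAME/_start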

In the future we plan to improve the way transform retries: today the interval between retries is static and based on frequency. We want to decouple retries from frequency and use exponential backoff.

Last but not least, it might be the transform itself that causes your cluster issues. You might want to check out our general guidance for optimizing transforms: Working with transforms at scale | Elasticsearch Guide [7.13] | Elastic.

Hi @Hendrik_Muhs
I just checked the transforms screen in Kibana and found a transform in a failed state. Unfortunately this happens a lot.
I think today I got this error when upgrading a version (Elastic Cloud).

This is the message from the messages tab:

Failed to index documents into destination index due to permanent error: [BulkIndexingException[Bulk index experienced [48] failures and at least 1 irrecoverable [pipeline with id [TRANSFORM_NAME-pipeline] does not exist]. Other failures: ]; nested: IllegalArgumentException[pipeline with id [TRANSFORM_NAME-pipeline] does not exist];; java.lang.IllegalArgumentException: pipeline with id [TRANSFORM_NAME-pipeline] does not exist]

Failed to start transform. Please stop and attempt to start again. Failure: Unable to start transform [TRANSFORM_NAME] as it is in a failed state with failure: [Failed to index documents into destination index due to permanent error: [BulkIndexingException[Bulk index experienced [48] failures and at least 1 irrecoverable [pipeline with id [TRANSFORM_NAME-pipeline] does not exist]. Other failures: ]; nested: IllegalArgumentException[pipeline with id [TRANSFORM_NAME-pipeline] does not exist];; java.lang.IllegalArgumentException: pipeline with id [TRANSFORM_NAME-pipeline] does not exist]]. Use force stop and then restart the transform once error is resolved.

Thanks for the error message.

Can you tell me more about the failing transform? Did you create this one (a transform might also be used as part of a solution)?

As the message says, this is a permanent error: The configuration refers to a pipeline that does not seem to exist. Restarting the transform won't help.

I wonder: this does not fit your initial post. Is this the same or another transform?

Can you check that the pipeline exists, e.g. by listing all your pipelines:

GET /_ingest/pipeline

Hey, thanks for replying so fast.
Yes, the pipeline exists.
I've seen this error before, but I suspect the error message doesn't reflect the error state of the cluster.
Yes, I did create it. I can't send the body here in the forum, but I can do that in private if you need.

I started it and it works fine now.

Yes, it's another transform, but a similar idea: it seems that (sometimes) whenever something changes in the cluster (load, a node leaving, a node extending, etc.) the transform just fails.

As an Elastic Cloud customer you can create a support request. The support engineer can request additional help from development (me) on demand; just refer to this conversation.

At least in this case, it seems to be an issue with ingest pipelines being unavailable. This might not be a transform problem but ingest instability. Still, I'd like to find out what's happening. Please always check whether it's this error or whether you see other failure reasons, and let me know.

Unfortunately the suggested num_transform_failure_retries won't help, because transform classifies this as a permanent error, which never gets retried. The only workaround for now is a script that regularly checks the state.

If I remember correctly, another user uses watcher to do something like this.

Thanks @Hendrik_Muhs.
Do you have an example of such a watcher script (or anything similar)?

I don't have an example at the moment, but I suggest having a look at the Watcher http input.
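A rough sketch of what such a watch could look like (all names, the interval, and the host/port are placeholders; authentication for the http input is omitted, and the action here only logs the failure rather than restarting the transform):

PUT _watcher/watch/transform_failed_check
{
  "trigger": { "schedule": { "interval": "10m" } },
  "input": {
    "http": {
      "request": {
        "host": "localhost",
        "port": 9200,
        "path": "/_transform/TRANSFORM_NAME/_stats"
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.transforms[0].state == 'failed'"
    }
  },
  "actions": {
    "log_failure": {
      "logging": {
        "text": "Transform TRANSFORM_NAME is in failed state: {{ctx.payload.transforms.0.reason}}"
      }
    }
  }
}

An automated restart would still need the force stop and start calls on top of this, e.g. via webhook actions.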


It just happened again, with the same pipeline error.
Unfortunately what I had to do is stop it and then start it again, so I'm not sure it'll help to just call start.
I'll try support, but it's very limited for standard customers.

Thanks for the heads up, I will investigate it. Unfortunately a failed transform requires a force stop and start. There is no force start. Nevertheless, the loss of pipelines should not happen.

Was there anything of interest happening before, e.g.

  • a node that dropped?
  • change of master?

How many ingest nodes do you have?
What version are you using?

I've attached the cluster architecture as an image. I hope it's easy to read :slight_smile:

No changes were made (not by us, anyway).
Version 7.13.1.

So to ensure stability, I need to:

  1. call _cat/transforms
  2. if the state is failed: force stop, then start
  3. else call start and ignore the "already started" error?

Thanks for the help.

Thanks!

Steps 1 and 2 should be sufficient; if you already know the state, there is no need to start it again.

Instead of using _cat you can use _stats; it might be easier to parse JSON instead of text.
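Put concretely, the sequence such a script would run is (a sketch; TRANSFORM_NAME is a placeholder, and a failed transform shows "state": "failed" together with a "reason" in the _stats response):

GET _transform/TRANSFORM_NAME/_stats

POST _transform/TRANSFORM_NAME/_stop?force=true

POST _transform/TRANSFORM_NAME/_start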

If I find out why your ingest pipelines drop, I'll let you know. It still might be good to open a support case and send diagnostics that way.

Do you also have transforms without ingest pipelines? They should not be affected, right?

Good point.

Correct.

I removed the pipeline from the transform and moved it to the index settings.
I have a feeling it will solve it (at least for me :slight_smile: )
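For anyone following along, moving the pipeline to the destination index settings could look roughly like this (a sketch with placeholder names, assuming the index.default_pipeline setting):

PUT TRANSFORM_DEST_INDEX/_settings
{
  "index.default_pipeline": "TRANSFORM_NAME-pipeline"
}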

Thanks for the update.

Did the error happen again after you switched to using the index template based pipeline?

(FYI I received the diagnostics from support, thanks for that.)

Thanks.
Correct. For now, no errors, but it didn't happen every day.

Hey @Hendrik_Muhs
It did happen again yesterday, without the pipeline definition.

Without the pipeline in the transform, but with the one in the index settings?
Do you have the error message?

  1. correct
  2. Failed to start transform. Please stop and attempt to start again. Failure: Unable to start transform [xxx] as it is in a failed state with failure: [Failed to index documents into destination index due to permanent error: [BulkIndexingException[Bulk index experienced [500] failures and at least 1 irrecoverable [pipeline with id [xxx_pipeline] does not exist]. Other failures: ]; nested: IllegalArgumentException[pipeline with id [xxx_pipeline] does not exist];; java.lang.IllegalArgumentException: pipeline with id [xxx_pipeline] does not exist]]. Use force stop and then restart the transform once error is resolved.

I just got this again when upgrading to the latest version using the Elastic Cloud interface.