I've started to use transforms, and they are great: they save a lot of the pain of using external tools.
What concerns me is that a few times there was load on the cluster, which caused a node to leave the cluster and the transform to fail.
When I looked in the morning, it was in failed status, and after stopping/starting it, it filled the data successfully without issues.
Then I started thinking: what would happen if I took a few days off and didn't do that manually?
In my case, the client would see a gap in our front-end charts.
Is there any way to monitor a transform and start it automatically in such a case?
Can you share the error you saw? The reason field in the status should tell you why it failed.
Transform is designed to be fail-safe:
- It distinguishes between permanent and transient errors. Permanent errors are e.g. configuration errors, errors in scripts, mapping issues, etc.; in other words, everything that won't fix itself with a retry.
- If an error is not permanent, a transform retries up to 10 times, based on the configured frequency. It must fail 10 times in a row to go into failed state; every successful operation resets the counter.
You mentioned cluster load, which indicates a non-permanent error. If you can share the error, I can verify that transform picked the right category.
As a mitigation, you can increase the number of retries by setting `num_transform_failure_retries` to a higher value.
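For example, this setting can be changed at runtime via the cluster settings API. A sketch in Kibana Dev Tools console syntax (20 is an arbitrary example value, not a recommendation):

```
PUT _cluster/settings
{
  "persistent": {
    "xpack.transform.num_transform_failure_retries": 20
  }
}
```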
In addition, you can create a script that checks `_stats` (or the mentioned `_cat` interface) and, based on the output, calls `_start`. You could also call `_start` regularly and ignore the error if the transform is already running.
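A minimal sketch of such a check, using only the Python standard library. The endpoint `http://localhost:9200` and the transform name are placeholders for your own environment, and the paths assume the `_transform` APIs available since 7.5; a failed transform additionally needs `force=true` on the stop call:

```python
import json
import urllib.request

ES = "http://localhost:9200"  # assumption: adjust to your cluster endpoint

def is_failed(stats: dict) -> bool:
    """Return True if the first transform in a _stats response is in failed state."""
    transforms = stats.get("transforms", [])
    return bool(transforms) and transforms[0].get("state") == "failed"

def restart_if_failed(transform_id: str) -> None:
    """Check a transform's state and force-stop/start it if it has failed."""
    with urllib.request.urlopen(f"{ES}/_transform/{transform_id}/_stats") as resp:
        stats = json.load(resp)
    if is_failed(stats):
        # A failed transform needs a force stop before it can be started again.
        for endpoint in ("_stop?force=true", "_start"):
            req = urllib.request.Request(
                f"{ES}/_transform/{transform_id}/{endpoint}", method="POST")
            urllib.request.urlopen(req)
```

You could run something like `restart_if_failed("my-transform")` from cron every few minutes; secured clusters would also need authentication headers on each request.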
In the future we plan to improve the way transform retries: today the interval between retries is static and based on frequency. We want to decouple retrying from frequency and use exponential backoff.
Hi @Hendrik_Muhs
I just checked the transforms screen in Kibana and found a transform in failed state. Unfortunately, this happens a lot.
I think today I got this error when upgrading a version (Elastic Cloud).
This is the message from the Messages tab:
Failed to index documents into destination index due to permanent error: [BulkIndexingException[Bulk index experienced [48] failures and at least 1 irrecoverable [pipeline with id [TRANSFORM_NAME-pipeline] does not exist]. Other failures: ]; nested: IllegalArgumentException[pipeline with id [TRANSFORM_NAME-pipeline] does not exist];; java.lang.IllegalArgumentException: pipeline with id [TRANSFORM_NAME-pipeline] does not exist]
Failed to start transform. Please stop and attempt to start again. Failure: Unable to start transform [TRANSFORM_NAME] as it is in a failed state with failure: [Failed to index documents into destination index due to permanent error: [BulkIndexingException[Bulk index experienced [48] failures and at least 1 irrecoverable [pipeline with id [TRANSFORM_NAME-pipeline] does not exist]. Other failures: ]; nested: IllegalArgumentException[pipeline with id [TRANSFORM_NAME-pipeline] does not exist];; java.lang.IllegalArgumentException: pipeline with id [TRANSFORM_NAME-pipeline] does not exist]]. Use force stop and then restart the transform once error is resolved.
Can you tell me more about the failing transform? Did you create this one yourself (a transform might also be created as part of a solution)?
As the message says, this is a permanent error: The configuration refers to a pipeline that does not seem to exist. Restarting the transform won't help.
I wonder, though: this does not match your initial post. Is this the same transform or a different one?
Can you check that the pipeline exists, e.g. by listing all your pipelines:
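For example, in the Kibana Dev Tools console (a sketch; substitute your own pipeline name for the one from the error message):

```
GET _ingest/pipeline

GET _ingest/pipeline/TRANSFORM_NAME-pipeline
```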
Hey thanks for replying so fast.
Yes, the pipeline exists.
I've seen this error before, but I suspect the error message doesn't reflect the actual state of the cluster.
Yes I did create it. I can't send the body here in the forum, but I can do that in private if you need.
I started it and it works fine now.
Yes, it's another transform, but the same idea: it seems that (sometimes) whenever something changes in the cluster (load, a node leaving, a node being resized, etc.), the transform just fails.
As an Elastic Cloud customer you can create a support request. The support engineer can request additional help from development (me) on demand; just refer to this conversation.
At least in this case, it seems to be an issue with ingest pipelines being unavailable. This might not be a transform problem but ingest instability. Still, I'd like to find out what's happening. Please always check whether it's this error or another failure reason, and let me know.
Unfortunately, the suggested `num_transform_failure_retries` won't help here, because transform classifies this as a permanent error, which is never retried. The only workaround for now is a script that regularly checks the state.
If I remember correctly, another user uses watcher to do something like this.
It just happened again, with the same pipeline error.
Unfortunately, what I had to do was stop it and then start it again, so I'm not sure just calling `_start` will help.
I'll try support, but it's very limited for Standard customers.
Thanks for the heads-up, I will investigate it. Unfortunately, a failed transform requires a force stop followed by a start; there is no force start. Nevertheless, the loss of pipelines should not happen.
Was there anything of interest happening before, e.g.:
- a node that dropped?
- a change of master?
How many ingest nodes do you have?
What version are you using?
Failed to start transform. Please stop and attempt to start again. Failure: Unable to start transform [xxx] as it is in a failed state with failure: [Failed to index documents into destination index due to permanent error: [BulkIndexingException[Bulk index experienced [500] failures and at least 1 irrecoverable [pipeline with id [xxx_pipeline] does not exist]. Other failures: ]; nested: IllegalArgumentException[pipeline with id [xxx_pipeline] does not exist];; java.lang.IllegalArgumentException: pipeline with id [xxx_pipeline] does not exist]]. Use force stop and then restart the transform once error is resolved.