Something strange happened with our Elasticsearch 7.17 cluster: I can't start pivot transforms. Every attempt fails with this error: {"root_cause":[{"type":"status_exception","reason":"Could not start transform, allocation explanation [Not starting transform [test-transform], reasons []]"}],"type":"status_exception","reason":"Could not start transform, allocation explanation [Not starting transform [test-transform], reasons []]"}
I tried creating several pivot transforms with different sets of parameters and different sources, but always got the same error. I suspect this started after I removed the oldest and weakest Elasticsearch node from the cluster. I did that because, for reasons I don't understand, transforms always ran on that node, even though the cluster has several much more powerful nodes with the "transform" role. Before removing the old node, I migrated its data to other nodes and then used the node shutdown API with the "type": "remove" option. Later I tried adding the old node back to the cluster, but that didn't help.
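One thing that may be worth ruling out after using the node shutdown API: shutdown records stay registered in the cluster state until they are explicitly deleted (`DELETE _nodes/<node-id>/shutdown`), even if the node later rejoins. Whether a leftover record explains the empty `reasons []` here is only an assumption, not something confirmed in this thread. A minimal sketch for inspecting a saved `GET _nodes/shutdown` response follows; the function name, node IDs, and sample data are hypothetical.

```python
import json


def leftover_shutdowns(shutdown_response, live_node_ids):
    """Summarize shutdown records and flag those whose node is currently in the cluster.

    shutdown_response: parsed JSON body of `GET _nodes/shutdown`.
    live_node_ids: set of node IDs currently in the cluster (e.g. keys of `GET _nodes`).
    """
    records = []
    for entry in shutdown_response.get("nodes", []):
        node_id = entry.get("node_id")
        records.append({
            "node_id": node_id,
            "type": entry.get("type"),
            "status": entry.get("status"),
            # A record for a node that is back in the cluster is a candidate for cleanup.
            "node_is_live": node_id in live_node_ids,
        })
    return records


# Hypothetical sample data, shaped like a `GET _nodes/shutdown` response.
sample_response = {
    "nodes": [
        {"node_id": "old-node-id", "type": "REMOVE",
         "reason": "decommission", "status": "COMPLETE"}
    ]
}
records = leftover_shutdowns(sample_response, {"old-node-id", "new-node-id"})
print(json.dumps(records, indent=2))
```

If a record shows up for a node that is back in the cluster (or gone for good), deleting it with `DELETE _nodes/<node-id>/shutdown` and retrying the transform start would be a cheap experiment.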
I checked the Elasticsearch logs on the nodes but found nothing useful. I also went through the checks described in "Fix common cluster issues" (Elasticsearch Guide [7.17]), but found no obvious bottleneck.
I'm stuck and have no idea how to debug this further.
I don't know why, but before I removed the oldest node from the cluster, transforms complained on start with "Could not start transform, allocation explanation [Not starting transform [test-transform], reasons [node_id:not a transform node]]"},"status":429}, and that reason was reported for every node in the cluster that didn't yet have the "transform" role. So I was forced to assign this role to all nodes.
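To double-check which nodes the cluster actually sees as transform nodes, the `roles` array in the `GET _nodes` response can be compared against what is set in each elasticsearch.yml. A minimal sketch, assuming the response has already been fetched and parsed; the function name and the sample node names are hypothetical.

```python
def nodes_by_role(nodes_response, role="transform"):
    """Split node names into those that have `role` and those that do not.

    nodes_response: parsed JSON body of `GET _nodes`, where "nodes" maps
    node IDs to objects containing "name" and "roles".
    """
    with_role, without_role = [], []
    for node_id, info in nodes_response.get("nodes", {}).items():
        target = with_role if role in info.get("roles", []) else without_role
        # Fall back to the node ID if the name is missing.
        target.append(info.get("name", node_id))
    return sorted(with_role), sorted(without_role)


# Hypothetical sample data, shaped like a `GET _nodes` response.
sample_nodes = {
    "nodes": {
        "id-1": {"name": "es-new-1", "roles": ["data", "master", "transform"]},
        "id-2": {"name": "es-new-2", "roles": ["data", "ingest"]},
        "id-3": {"name": "es-old", "roles": ["data", "transform"]},
    }
}
transform_nodes, other_nodes = nodes_by_role(sample_nodes)
print("transform nodes:", transform_nodes)
print("nodes without the role:", other_nodes)
```

If the roles reported here differ from the configuration files, a node may not have been restarted since the role was added.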
Thank you for the advice, @Patrick_Whelan! Unfortunately, I found no signs of cluster instability: no repeated master elections (except when I restarted the master), no node join failures, and no flapping node connections. Your advice did prompt me to check the "discovery.seed_hosts" lists in each node's elasticsearch.yml, and they were indeed partially outdated. I removed the nodes that no longer exist, updated the lists to match the current set of nodes, and restarted the nodes accordingly. Nothing changed: the test transform still can't start, and the error is the same, with an empty reasons []. Are there any other ways to troubleshoot this transform failure?