Transform has failed; experienced: [task encountered irrecoverable failure: field name cannot be null.]

Hi all.

I created a transform on ES 7.7.0 from the Kibana GUI that uses scripts in some of its group_by entries to handle missing fields in the docs:

 "pivot": {
        "group_by": {
          "acquisition.utm_campaign.keyword": {
            "terms": {
              "script": {
                "source": "try {return doc['acquisition.utm_campaign.keyword'].value } catch (Exception e) {return '--';}",
                "lang": "painless"
              }
            }
          },
          "geoip.country_name.keyword": {
            "terms": {
              "script": {
                "source": "try {return doc['geoip.country_name.keyword'].value } catch (Exception e) {return '--';}",
                "lang": "painless"
              }
            }
          },
          "geoip.continent_code.keyword": {
            "terms": {
              "script": {
                "source": "try {return doc['geoip.continent_code.keyword'].value } catch (Exception e) {return '--';}",
                "lang": "painless"
              }
            }
          },
    ....
    ....

The transform is set to run in continuous mode.
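(By continuous I mean the config includes a sync section like the sketch below; the time field name and delay here are placeholders, not our exact values.)

    "sync": {
      "time": {
        "field": "@timestamp",
        "delay": "60s"
      }
    }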

After starting the transform, it goes to Indexing and then to Started, but after a while it goes to Failed.
In the ES logs we see this error:

    {"type": "server", "timestamp": "2020-05-28T15:44:13,120Z", "level": "WARN", "component": "o.e.x.t.t.TransformIndexer", "cluster.name": "XXXXX", "node.name": "elasticsearch", "message": "[sold_products-22] transform encountered an exception: ", "cluster.uuid": "B4prQaEHQi-852irJc1MkA", "node.id": "qQwxRFsESLqzYLVmn6z42Q" ,
                "stacktrace": ["java.lang.IllegalArgumentException: field name cannot be null.",
                "at org.elasticsearch.index.query.TermsQueryBuilder.<init>(TermsQueryBuilder.java:162) ~[elasticsearch-7.7.0.jar:7.7.0]",
                "at org.elasticsearch.xpack.core.transform.transforms.pivot.TermsGroupSource.getIncrementalBucketUpdateFilterQuery(TermsGroupSource.java:63) ~[x-pack-core-7.7.0.jar:7.7.0]",
    ....
    ....

and then

{"type": "server", "timestamp": "2020-05-28T15:44:13,124Z", "level": "ERROR", "component": "o.e.x.t.t.TransformTask", "cluster.name": "XXXXX", "node.name": "elasticsearch", "message": "[sold_products-22] transform has failed; experienced: [task encountered irrecoverable failure: field name cannot be null.].", "cluster.uuid": "B4prQaEHQi-852irJc1MkA", "node.id": "qQwxRFsESLqzYLVmn6z42Q"  }

In Kibana we can see the following stats:

pages_processed     25
documents_processed     12521
documents_indexed     9321
trigger_count     41
index_time_in_ms     5151
index_total     19
index_failures     0
search_time_in_ms     28695
search_total     25
search_failures     0
processing_time_in_ms     1163
processing_total     25
exponential_avg_checkpoint_duration_ms     34999
exponential_avg_documents_indexed     9321
exponential_avg_documents_processed     12521

Any hints on how to debug the issue?

Hi,

Can you go into the dev console and run the stats request from there?

GET _transform/sold_products-22/_stats

This will contain more information than what you copy-pasted out of Kibana. I suspect a problem with your use of scripts in the group_by in combination with continuous mode.

Any chance to fix your original data, so you don't need the script workaround for the missing fields?

(There is an open issue about adding support for missing_bucket, so it's a known limitation.)
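For context: without a script, a group_by is a plain terms on a field, for example:

    "acquisition.utm_campaign.keyword": {
      "terms": {
        "field": "acquisition.utm_campaign.keyword"
      }
    }

but then documents where the field is missing simply drop out of the result; missing_bucket would keep them in a default bucket once it is supported.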

Here is the output:

    {
      "count" : 1,
      "transforms" : [
        {
          "id" : "sold_products-22",
          "state" : "failed",
          "reason" : "task encountered irrecoverable failure: field name cannot be null.",
          "node" : {
            "id" : "qQwxRFsESLqzYLVmn6z42Q",
            "name" : "elasticsearch",
            "ephemeral_id" : "DWbF53mPS8-S-wFUaVQi9w",
            "transport_address" : "172.33.46.50:9300",
            "attributes" : { }
          },
          "stats" : {
            "pages_processed" : 26,
            "documents_processed" : 12521,
            "documents_indexed" : 9321,
            "trigger_count" : 42,
            "index_time_in_ms" : 5151,
            "index_total" : 19,
            "index_failures" : 0,
            "search_time_in_ms" : 28722,
            "search_total" : 26,
            "search_failures" : 0,
            "processing_time_in_ms" : 1163,
            "processing_total" : 26,
            "exponential_avg_checkpoint_duration_ms" : 34999.0,
            "exponential_avg_documents_indexed" : 9321.0,
            "exponential_avg_documents_processed" : 12521.0
          },
          "checkpointing" : {
            "last" : {
              "checkpoint" : 1,
              "timestamp_millis" : 1590607434542,
              "time_upper_bound_millis" : 1590607374542
            },
            "next" : {
              "checkpoint" : 2,
              "checkpoint_progress" : {
                "docs_indexed" : 0,
                "docs_processed" : 0
              },
              "timestamp_millis" : 1590685280173,
              "time_upper_bound_millis" : 1590685220173
            },
            "operations_behind" : 508,
            "changes_last_detected_at" : 1590685280171
          }
        }
      ]
    }

From what I can see, the first "run" of the transform completes smoothly, but then, when the trigger fires for the next checkpoint, something no longer works.

It's a bit complicated: the documents in the source index come from logs that contain some "optional" fields, the data model has changed a few times in the past, and I can't rule out that it will change again in the future.

If the problem is in continuous mode, maybe I can schedule a batch transform every n minutes.
What are your suggestions?
Thank you

Thanks for the feedback. I created an issue.

As described there, this is a bug: continuous mode and scripts in group_by cannot be combined. The change detection of a continuous transform builds a terms query on each group_by field to minimize updates; a scripted group_by has no field name, which is exactly the "field name cannot be null" failure in your stack trace.

I see you are working around https://github.com/elastic/elasticsearch/issues/48243 (you might want to subscribe to that issue, to get notified when it changes).

Until either of the two bugs is fixed, I see only three options:

  • do not use scripts, which would mean you lose all buckets that have a missing value in any of your three groupings; I guess this is not acceptable
  • run in batch mode, which unfortunately means a full re-run every time, and since you cannot re-run a batch transform (that is on the list of enhancements) you have to re-create the transform each time
  • fix your data: either re-index and set the missing values, or apply an ingest pipeline to your original incoming data with a set processor and override set to false (old data then has to be migrated with reindex); see the sketch after this list
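
For the ingest pipeline route, here is a minimal sketch; the pipeline id and index names are placeholders, and the field names are taken from your group_by (adjust them to your actual mapping):

    PUT _ingest/pipeline/fill-missing-fields
    {
      "description": "Set a default for optional fields so no bucket is lost",
      "processors": [
        { "set": { "field": "acquisition.utm_campaign", "value": "--", "override": false } },
        { "set": { "field": "geoip.country_name", "value": "--", "override": false } },
        { "set": { "field": "geoip.continent_code", "value": "--", "override": false } }
      ]
    }

With override set to false, the set processors only touch documents where the field is null or missing. Old data can then be migrated once:

    POST _reindex
    {
      "source": { "index": "sold_products_source" },
      "dest": { "index": "sold_products_source_fixed", "pipeline": "fill-missing-fields" }
    }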
