Transform Mapping Errors

Hello,

I've set up a transform that aggregates on process.name; however, it repeatedly fails at exactly the same date with the following error:

Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [1] failures and at least 1 irrecoverable [org.elasticsearch.xpack.transform.transforms.TransformException: Destination index mappings are incompatible with the transform configuration.; org.elasticsearch.index.mapper.MapperParsingException: failed to parse field [process.name.terms] of type [flattened] in document with id 'AEd3S3VLUMATGYPzlKmOr_-PAAAAAAAA'. Preview of field's value: 'null'; java.lang.IllegalArgumentException: field name cannot be an empty string].; org.elasticsearch.xpack.transform.transforms.TransformException: Destination index mappings are incompatible with the transform configuration.; org.elasticsearch.index.mapper.MapperParsingException: failed to parse field [process.name.terms] of type [flattened] in document with id 'AEd3S3VLUMATGYPzlKmOr_-PAAAAAAAA'. Preview of field's value: 'null'; java.lang.IllegalArgumentException: field name cannot be an empty string]

The destination mappings were created by the transform itself. Should I adjust the destination mappings? Should I change the query? The data is all coming from the endpoint process index, so it doesn't seem like there would be any null process names.

The JSON for my transform should ensure that organization.id and process.name always exist, but I'm still getting the null error.

How can I best handle these errors so they don't kill my transform? Also, how would I change the mapping so it's not a flattened field and I can aggregate on it in visualizations?

{
  "id": "brent-process6",
  "version": "8.3.3",
  "create_time": 1661870704986,
  "source": {
    "index": [
      "logs-endpoint.events.process*"
    ],
    "query": {
      "bool": {
        "filter": [
          {
            "bool": {
              "should": [
                {
                  "exists": {
                    "field": "organization.id"
                  }
                }
              ],
              "minimum_should_match": 1
            }
          },
          {
            "bool": {
              "should": [
                {
                  "exists": {
                    "field": "process.name"
                  }
                }
              ],
              "minimum_should_match": 1
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "brent-process2"
  },
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "organization.id": {
        "terms": {
          "field": "organization.id"
        }
      },
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1h",
          "missing_bucket": true
        }
      }
    },
    "aggregations": {
      "process.name.terms": {
        "terms": {
          "field": "process.name"
        }
      }
    }
  },
  "settings": {},
  "retention_policy": {
    "time": {
      "field": "@timestamp",
      "max_age": "3d"
    }
  }
}

Output of GET _transform/brent-process6/_stats

{
  "count": 1,
  "transforms": [
    {
      "id": "brent-process6",
      "state": "failed",
      "reason": "Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [1] failures and at least 1 irrecoverable [org.elasticsearch.xpack.transform.transforms.TransformException: Destination index mappings are incompatible with the transform configuration.; org.elasticsearch.index.mapper.MapperParsingException: failed to parse field [process.name.terms] of type [flattened] in document with id 'AEd3S3VLUMATGYPzlKmOr_-PAAAAAAAA'. Preview of field's value: 'null'; java.lang.IllegalArgumentException: field name cannot be an empty string].; org.elasticsearch.xpack.transform.transforms.TransformException: Destination index mappings are incompatible with the transform configuration.; org.elasticsearch.index.mapper.MapperParsingException: failed to parse field [process.name.terms] of type [flattened] in document with id 'AEd3S3VLUMATGYPzlKmOr_-PAAAAAAAA'. Preview of field's value: 'null'; java.lang.IllegalArgumentException: field name cannot be an empty string]",
      "node": {
        "id": "KFZRFzWxTpOJ8ikSDhIJyQ",
        "name": "06-prd-iad-elasticsearch",
        "ephemeral_id": "gUJjyJx0RimMRHI2T5GPQw",
        "transport_address": "x.x.x.x:9300",
        "attributes": {}
      },
      "stats": {
        "pages_processed": 21,
        "documents_processed": 177835953,
        "documents_indexed": 10000,
        "documents_deleted": 0,
        "trigger_count": 1,
        "index_time_in_ms": 2825,
        "index_total": 20,
        "index_failures": 1,
        "search_time_in_ms": 2919750,
        "search_total": 21,
        "search_failures": 0,
        "processing_time_in_ms": 70,
        "processing_total": 21,
        "delete_time_in_ms": 0,
        "exponential_avg_checkpoint_duration_ms": 0,
        "exponential_avg_documents_indexed": 0,
        "exponential_avg_documents_processed": 0
      },
      "checkpointing": {
        "last": {
          "checkpoint": 0
        },
        "next": {
          "checkpoint": 1,
          "position": {
            "indexer_position": {
              "@timestamp": 1651143600000,
              "organization.id": "xxxxxx"
            }
          },
          "checkpoint_progress": {
            "docs_remaining": 4519755623,
            "total_docs": 4697591576,
            "percent_complete": 3.785683581104923,
            "docs_indexed": 10500,
            "docs_processed": 177835953
          },
          "timestamp_millis": 1661961740994,
          "time_upper_bound_millis": 1661961600000
        },
        "changes_last_detected_at": 1661961740988,
        "last_search_time": 1661961740988
      }
    }
  ]
}

@Hendrik_Muhs I know you are the expert :slight_smile:

UPDATE: I tried adding a runtime field to the transform to account for the null errors, but it is still failing.

{
	"process_name": {
		"type": "keyword",
		"script": {
			"source": "if (doc.containsKey('process.name') && !doc['process.name'].empty() && doc['process.name'].value == '') { emit('emptystring'); }"
		}
	}
}

I also tried setting the mapping to keyword instead of flattened before starting the transform, and that ended up failing as well:

Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [500] failures and at least 1 irrecoverable [org.elasticsearch.xpack.transform.transforms.TransformException: Destination index mappings are incompatible with the transform configuration.; org.elasticsearch.index.mapper.MapperParsingException: failed to parse field [process.name.terms] of type [keyword] in document with id 'ADBoIkmrtMuFkDFCSy1p6Pt1AAAAAAAA'. Preview of field's value: '{dllhost={exe=2}, GoogleUpdate={exe=2}}'; java.lang.IllegalStateException: Can't get text on a START_OBJECT at 1:29].; org.elasticsearch.xpack.transform.transforms.TransformException: Destination index mappings are incompatible with the transform configuration.; org.elasticsearch.index.mapper.MapperParsingException: failed to parse field [process.name.terms] of type [keyword] in document with id 'ADBoIkmrtMuFkDFCSy1p6Pt1AAAAAAAA'. Preview of field's value: '{dllhost={exe=2}, GoogleUpdate={exe=2}}'; java.lang.IllegalStateException: Can't get text on a START_OBJECT at 1:29]

Can you try filtering in the terms agg:

"aggregations": {
      "process.name.terms": {
        "terms": {
          "field": "process.name",
          "exclude": ""
        }
      }

The root cause might be an empty key, e.g. something like this:

        {
          "key": "",
          "doc_count": 2
        },

Transform creates a flattened field from the agg output; however, empty strings aren't allowed as keys in a flattened field. A similar issue: Transform jobs can fail if there's a \0 in fields where we perform terms aggregation (stored in flattened fields) · Issue #75875 · elastic/elasticsearch · GitHub
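You can reproduce the restriction outside of transform with a minimal example (the index name test-flattened is hypothetical):

```
PUT test-flattened
{
  "mappings": {
    "properties": {
      "process.name.terms": { "type": "flattened" }
    }
  }
}

POST test-flattened/_doc
{
  "process.name.terms": { "": 2 }
}
```

The second request should be rejected with an error along the lines of "field name cannot be an empty string", which is exactly what happens when the terms agg produces an empty-string bucket key and transform writes it into the flattened field.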

Let me know if it works, so I can follow up with an issue.

@Hendrik_Muhs I added the exclude to the transform and it is still running without failing thus far, so it seems like that is working, thanks!

The flattened field type is still showing as Unknown within Kibana, even though it's mapped and listed as searchable and aggregatable. I'm unable to visualize on an unknown field type. Is there any way I can re-map it, or should I be able to visualize on a flattened field as well? I just want to show a graph of the top process names and counts from the aggregated transform.

I've added "exclude": [ "", "." ] to my transform and it is no longer failing.

The flattened mapping is still showing as Unknown within Kibana, so there is no way for me to create a visualization. An example document is below. Is there something else I would need to do to split up the data so I can visualize it @Hendrik_Muhs?

[
  {
    "_index": "brent-process21",
    "_id": "ADAO3_01CiX9cU5wsMOIdRQ5AAAAAAAA",
    "_version": 1,
    "_score": 0,
    "_source": {
      "process": {
        "name": {
          "terms": {
            "SearchFilterHost.exe": 77,
            "smartscreen.exe": 41,
            "SearchProtocolHost.exe": 88,
            "PING.EXE": 1040,
            "backgroundTaskHost.exe": 123,
            "conhost.exe": 54,
            "svchost.exe": 257,
            "SenseCncProxy.exe": 88,
            "cmd.exe": 120,
            "identity_helper.exe": 48
          }
        }
      },
      "@timestamp": "2022-09-06T20:00:00.000Z",
      "x": {
        "organization": {
          "id": "xxxxxx"
        }
      }
    },
    "fields": {
      "organization.id": [
        "xxxx"
      ],
      "@timestamp": [
        "2022-09-06T20:00:00.000Z"
      ],
      "process.name.terms": [
        {
          "SearchFilterHost.exe": 77,
          "smartscreen.exe": 41,
          "SearchProtocolHost.exe": 88,
          "PING.EXE": 1040,
          "backgroundTaskHost.exe": 123,
          "conhost.exe": 54,
          "svchost.exe": 257,
          "SenseCncProxy.exe": 88,
          "cmd.exe": 120,
          "identity_helper.exe": 48
        }
      ]
    }
  }
]

Sorry for the late reply, I was on vacation.

I guess you want to visualize e.g. process.name.terms.cmd.exe? For that, the flattened data type does not provide the necessary granularity. If mappings are not provided upfront, transform creates them on a best-effort basis; e.g. a terms aggregation gets mapped to flattened, because we don't know how many different field names will appear in the data, and the number of fields in an index is limited. However, you can customize/override transform's choices by creating the destination index upfront, or by creating an index template and disabling mapping deduction (see deduce_mappings).
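For example, after pre-creating the destination index (or an index template) with your own mappings, you can tell transform not to deduce any mappings via its settings object. A sketch showing only the relevant part (the transform id is hypothetical):

```
PUT _transform/my-transform
{
  ...
  "settings": {
    "deduce_mappings": false
  }
}
```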

If you want to visualize the individual process names, every possible field should be mapped to a numeric type, e.g. long. One way to achieve this is a dynamic template:

PUT brent-process21
{
  "mappings": {
    "dynamic_templates": [
      {
        "full_name": {
          "path_match": "process.name.terms.*",
          "mapping": {
            "type": "long"
          }
        }
      }
    ]
  }
}

With this template, a new field mapping is created whenever a new field name appears under process.name.terms, and each one is mapped to long.

Caveat 1:

If you choose to create the mappings yourself, you must do this for all fields, not just the ones you overwrite; e.g. you need to map @timestamp to date and organization.id to keyword. If you are unsure about the mappings, you can use the transform preview API to see the choices transform would make when creating the destination index.

Caveat 2:

Due to the creation of many sub-fields below process.name.terms instead of one flattened field, you not only increase memory and storage requirements, but you might also run into a mapping limit (a so-called "mapping explosion"). The default limit is 1000 fields; you can increase it:

PUT brent-process21
{
  "mappings": {
  ...
  },
  "settings": {
    "index.mapping.total_fields.limit": 20000
  }
}

Still, at some point you might run into the limit if your keys are arbitrary. If you only care about certain keys, you could create mappings only for those and let Elasticsearch ignore the others. For your use case it might also work to create daily indices; that way you get x mappings per day instead of a total number of mappings.
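One way to map only the keys you care about and let Elasticsearch ignore the rest is "dynamic": false on the containing object. A sketch (the two process names are just examples; unmapped keys remain in _source but are not indexed or aggregatable):

```
PUT brent-process21
{
  "mappings": {
    "properties": {
      "process.name.terms": {
        "type": "object",
        "dynamic": false,
        "properties": {
          "cmd.exe": { "type": "long" },
          "svchost.exe": { "type": "long" }
        }
      }
    }
  }
}
```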

Thanks @Hendrik_Muhs, that's very helpful.

Are Transforms the best solution for aggregating on a single field and then visualizing it, or is there a better option?

We decided not to go with Rollups because they have no retention policy and no ability to roll off the data.

The main use case is to speed up visualization load times by querying only the aggregated data instead of all the documents in the endpoint index.

It still seems like a good solution to me. Maybe you can elaborate a bit more on the use case and why you chose the terms aggregation. Is the count important, or do you only need to know whether something appears or not?

@Hendrik_Muhs the count is important. We are basically trying to get our visualization load times faster.

For a partner, we may want to showcase what their top process names were, or their top registry values, etc.

I basically want a count of the top values, per field I choose. The fields I am interested in are the terms fields unless there's another way.

Thanks for the details. I have 2 more ideas:

Vega

Kibana lets you write custom visualizations using Vega. This might be useful for visualizing the flattened data; however, I am not an expert in this. The challenge seems to be getting the data into the right shape, which is why I looked into another option:

Scripted metric

I think the main problem is the representation of the data. A terms agg writes the result as

  "SearchFilterHost.exe": 77,
  "smartscreen.exe": 41,
...

I wrote a scripted metric to instead output this as:

[
  {"key":"SearchFilterHost.exe", "value": 77},
  {"key":"smartscreen.exe", "value": 41},
...
]

You can use the following aggregation instead of your terms agg.

      "process.name.terms": {
        "scripted_metric": {
          "init_script": "state.map=new HashMap()",
          "map_script": """def key = doc['process.name'].value;
                           if (state.map.containsKey(key)) {
                             state.map.put(key, state.map.get(key) + 1); 
                           } else {
                             state.map.put(key, 1)
                           }""",
          "combine_script": "return state",
          "reduce_script": """def list = new ArrayList();
                             def joinedMap = states.stream().flatMap(s -> s.map.entrySet().stream())
       .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (a,b)-> a+b));
       
                              for (s in joinedMap.entrySet()) {
                                def e = new HashMap(); 
                                e.put('key', s.getKey()); 
                                e.put('value', s.getValue()); 
                                list.add(e);
                              } 
                              return list;"""
        }
      }

Note: transform does not create mappings for scripted_metric output, which means the data gets dynamically mapped. You might want to write your own mappings instead.
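For example, if you want each key/value pair to stay associated when you aggregate (so one key's count cannot be mixed up with another's), a nested mapping for the destination index is one option; this is a sketch, not the only possible choice:

```
PUT brent-process21
{
  "mappings": {
    "properties": {
      "process.name.terms": {
        "type": "nested",
        "properties": {
          "key": { "type": "keyword" },
          "value": { "type": "long" }
        }
      }
    }
  }
}
```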

Thanks @Hendrik_Muhs, that is working for me! Is there a way to add up the value count associated with the field? For example, if I now group by org id, it will not associate the value.

I hope I understand your question correctly.

You can run aggregations on the transform destination index[*], so although you grouped by org id, you can summarize the counts the same way you can summarize over time buckets.
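For example, assuming process.name.terms is mapped as nested objects with key as keyword and value as long (an assumption about your destination mapping), a query like this sums the counts per process name across all time buckets for one org:

```
GET brent-process21/_search
{
  "size": 0,
  "query": { "term": { "organization.id": "xxxxxx" } },
  "aggs": {
    "procs": {
      "nested": { "path": "process.name.terms" },
      "aggs": {
        "by_name": {
          "terms": { "field": "process.name.terms.key", "size": 10 },
          "aggs": {
            "total": { "sum": { "field": "process.name.terms.value" } }
          }
        }
      }
    }
  }
}
```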

If that doesn't help: can you post your current transform config[**]?

[*] That's the main difference to what we started with, the terms/flattened field combination wasn't aggregate-able and hence could not be visualized.
[**] We can also go over the support channel, support can help better than me when it comes to visualizations