Mapper_parsing_exception, illegal_state_exception, unpredictable data, dynamic mapping

I've been working on aggregating log data across many different microservices. I've got all of the developers logging in JSON format now. The logs show up in CloudWatch and are forwarded to a Lambda, which reformats them a bit and posts them directly to our AWS Elasticsearch endpoint. The services perform wildly different functions and are written in at least two different languages. The CloudWatch logs are stored in a new index each calendar day, and I am using dynamic mapping templates to try to avoid some errors.

I am running into (at least) two distinct errors that are preventing logs from being posted successfully. I have done a lot of searching online and tried different things, but I can't seem to get around either of these errors. Note that although the 'dialog' field is mentioned in both of these errors, it's not the only field having problems.

"status": 400,
"error": {
    "type": "mapper_parsing_exception",
    "reason": "object mapping for [dialog] tried to parse field [null] as object, but found a concrete value"
}

and

"status": 400,
"error": {
  "type": "mapper_parsing_exception",
  "reason": "failed to parse [dialog]",
  "caused_by": {
    "type": "illegal_state_exception",
    "reason": "Can't get text on a START_OBJECT at 1:97"
  }
}

Now, I am pretty sure I know what at least one problem is. In this specific field's case, one service is logging dialog in a pretty complicated way, something like this:

"dialog": [
  {
    "order": 1,
    "message": "Hey {user_name}."
  }
]

While other services, not having access to detailed metadata about the dialog object, are logging stuff like this:

"dialog": "Hey user1234."

I suspect the illegal_state_exception happens when Elasticsearch has already mapped the field as a string but then receives an object, and the mapper_parsing_exception happens the other way around: the field is already mapped as an object and a plain string value shows up. Working from there, I've come up with this template so far:

{
  "cloudwatch": {
    "order": 0,
    "index_patterns": [
      "*cwl*"
    ],
    "settings": {
      "index": {
        "mapping": {
          "ignore_malformed": "true"
        }
      }
    },
    "mappings": {
      "CloudWatchLogs": {
        "dynamic_templates": [
          {
            "deep_objects": {
              "match_mapping_type": "object",
              "path_match": "*.*.*",
              "mapping": {
                "type": "object",
                "enabled": false,
                "ignore_malformed": "true"
              }
            }
          },
          {
            "malformed": {
              "match_mapping_type": "*",
              "mapping": {
                "ignore_malformed": "true"
              }
            }
          }
        ]
      }
    },
    "aliases": {}
  }
}
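
In case anyone wants to try this template: the block above is in the shape that GET _template returns (wrapped in the template name), so the body for PUT _template/cloudwatch is the inner object. Pushing it looks something like the sketch below; the endpoint and file name are placeholders and request signing is left out.

import json
import requests

ES = "https://my-es-endpoint"  # placeholder

# Load the template in the GET _template shape shown above and strip the
# outer "cloudwatch" key before PUTting the inner object back.
with open("cloudwatch-template.json") as f:
    body = json.load(f)["cloudwatch"]

r = requests.put(f"{ES}/_template/cloudwatch", json=body)
print(r.status_code, r.text)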

As you can see, I've applied a template to stop mapping objects after a certain depth in order to prevent mapping explosion. That seems to have worked pretty well. After a bunch of searching, I thought I'd be able to recover from these mapping errors by setting every field to ignore_malformed via the index-wide setting, but that doesn't seem to have helped. Neither has the catch-all dynamic template that sets ignore_malformed. The indexes come out looking something like this:

{
  "stageName.cwl-2018.09.21": {
    "aliases": {},
    "mappings": {
      "CloudWatchLogs": {
        "dynamic_templates": [
          {
            "deep_objects": {
              "path_match": "*.*.*",
              "match_mapping_type": "object",
              "mapping": {
                "enabled": false,
                "ignore_malformed": "true",
                "type": "object"
              }
            }
          },
          {
            "malformed": {
              "match_mapping_type": "*",
              "mapping": {
                "ignore_malformed": "true"
              }
            }
          }
        ],
        "properties": {
          "@id": {
            "type": "text"
          },
          "@log_group": {
            "type": "text"
          },
          "@log_stream": {
            "type": "text"
          },
          "@message": {
            "type": "text"
          },
          "@owner": {
            "type": "text"
          },
          "@timestamp": {
            "type": "date",
            "ignore_malformed": true
          },
          "awsLambdaRequestId": {
            "type": "text"
          },
          "awsRequestId": {
            "type": "text"
          },
          "aws_request_id": {
            "type": "text"
          },
          ...

I can see that something is marking some fields with ignore_malformed, but my index-wide setting and catch-all don't seem to be having any effect at all.
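
In case it helps anyone spot what I'm missing, this is roughly how I've been inspecting what actually lands on a given daily index (again just a sketch, with a placeholder endpoint):

import json
import requests

ES = "https://my-es-endpoint"  # placeholder
INDEX = "stageName.cwl-2018.09.21"

# Pull the live settings and mappings to compare against the template.
for path in ("_settings", "_mapping"):
    r = requests.get(f"{ES}/{INDEX}/{path}")
    print(path, json.dumps(r.json(), indent=2))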

Now, to be honest, I couldn't care less what the value of 'dialog' is when it shows up in Elasticsearch. It's not incredibly important and I doubt it's something we will ever need to search on or run metrics over. My problem is that there's a LOT of other contextual data logged alongside these messages, and when the document is rejected all of that gets dropped with it.

I am beginning to think that the only way to get around these errors is to try to coordinate an entire team of developers working in different code bases in different languages to log their fields in the same format, which sounds like a nightmare. Can anyone help me out here? Am I doing something wrong with these dynamic mappings? Is there any other way around this?

Ok, circling back to answer my own question.

It turns out that what I was trying to do, keeping all services' logs in the same daily index, was unnecessary. I reworked the Lambda log forwarder to store each service's logs in a service-specific daily index. Now, instead of trying to force everyone to log common fields in the same format, I expect (somewhat optimistically) each service to at least be consistent with itself, and then create an index pattern over all of the indexes in Kibana. Some services are still logging conflicting formats and blocking their own logs from reaching Elasticsearch, but the problems have been greatly reduced.
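
For anyone doing something similar, the forwarder change boiled down to deriving the index name from the service instead of using one shared prefix. Here's a rough Python sketch; my real forwarder differs, and the log group naming scheme and index prefix below are made up.

from datetime import datetime, timezone

def index_for(log_group: str, timestamp_ms: int) -> str:
    """Build a service-specific daily index name from a CloudWatch log group."""
    # e.g. "/aws/lambda/orders-service" -> "orders-service" (naming scheme assumed)
    service = log_group.rstrip("/").split("/")[-1].lower()
    day = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).strftime("%Y.%m.%d")
    return f"stageName.{service}.cwl-{day}"

# index_for("/aws/lambda/orders-service", 1537488000000)
# -> "stageName.orders-service.cwl-2018.09.21"

Each service then gets its own mappings, so one service's 'dialog' object can no longer fight with another service's 'dialog' string.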
