XML filter causing ES mapping issues

At least, I think that is the problem.

I am getting a lot of rejected docs from Elasticsearch, and when I look at the reason in the dead letter queue file, it says "Can't get text on a START_OBJECT". The only thing I'm doing is using the XML filter and then sending the result to Elasticsearch.

I think what is happening is that my XML documents differ at one point: some of them have a single level of a text element:

<text>Here is some text.</text>

while others have nested levels:

<text>Here is some text.
  <text> Here is some more text.</text>
</text>

I think this means the first file's output will cause Elasticsearch to map the text field as text (a string), and then later docs will send it as an object, which conflicts with that mapping and gets them rejected.
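
To illustrate, I think the two shapes come out of the filter roughly like this (sketched from the snippets above; the exact keys for the nested case depend on how the filter represents mixed content):

First kind of document:

"message" => {
  "text" => "Here is some text."
}

Second kind of document:

"message" => {
  "text" => {
    "text" => "Here is some more text.",
    ...
  }
}

So message.text is a string in one document and an object in another, and whichever shape Elasticsearch indexes first fixes the mapping for the rest of that day's index.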

Is that correct? If so, how can I handle this issue?

Thank you!

What's your pipeline config?

input {
  s3 {
    bucket => "mybucketname"
    access_key_id => "removed"
    secret_access_key => "removed"
    exclude_pattern => "^((?!XML$).)*$"
    region => "us-east-2"
    sincedb_path => "/etc/logstash/conf.d/.sincedb_files"
    codec => multiline {
      pattern => "<rootElement>"
      negate => true
      what => "previous"
      max_lines => 10000
      max_bytes => "100 MiB"
    }
  }
}

filter {
  xml {
    source => "message"
    target => "message"
    force_array => false
  }
}

output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => "removed"
    index => "index_pattern-%{+YYYY.MM.dd}"
  }
}

Anyone have any ideas? If I'm right about what's happening, I'd imagine this is a common issue, but I have no idea how to handle it.

For anyone else experiencing this: I stopped trying to parse the entire XML file into one object and instead just selected each section I needed with xpaths.
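
For reference, the filter ended up along these lines (the xpath expressions and field names here are simplified stand-ins matching the snippets above, not my exact config):

filter {
  xml {
    source => "message"
    # skip building the full document object; only keep the xpath results
    store_xml => false
    xpath => {
      # placeholder expressions and field names; adjust to the real document structure
      "/rootElement/text[1]/text()" => "outer_text"
      "/rootElement/text/text/text()" => "inner_text"
    }
  }
}

With store_xml => false the whole-document object is never built, so the mapping conflict goes away. Just keep in mind that the xpath results come back as arrays of strings, so each field is e.g. ["Here is some text."] rather than a bare string.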
