Logstash XML file parsing: events are split unexpectedly

I'm running ELK 6.7.0 on Docker with the official images. This is my conf file:

input {
  file {
    path => "/usr/share/logstash/logs/*.xml"
    type => "xml"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "<root>"
      negate => "true"
      what => "previous"
    }
  }
}

filter {  
  xml {
    source => "message"
    store_xml => false
    xpath => [
        "/root/ChainId/text()", "ChainId",
        "/root/SubChainId/text()", "SubChainId",
        "/root/StoreId/text()", "StoreId",
        "/root/BikoretNo/text()", "BikoretNo",
        "/root/DllVerNo/text()", "DllVerNo"
    ]
  }
}

output {
  elasticsearch {
    hosts => "elasticsearch:9200"
    index => "xml_index"
  }

  stdout { 
    codec => rubydebug 
  }
}

My XML file is:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <ChainId>7290027600007</ChainId>
    <SubChainId>001</SubChainId>
    <StoreId>001</StoreId>
    <BikoretNo>9</BikoretNo>
    <DllVerNo>8.0.1.3</DllVerNo>
</root>

I'm trying to parse incoming XML files, but when a new file is created in the watched folder, Logstash parses it as follows:

logstash_1       | {
logstash_1       |           "path" => "/usr/share/logstash/logs/example10.xml",
logstash_1       |       "@version" => "1",
logstash_1       |        "message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
logstash_1       |           "type" => "xml",
logstash_1       |     "@timestamp" => 2019-04-02T04:42:59.248Z,
logstash_1       |           "host" => "a4f1bf64a3d5"
logstash_1       | }

However, when I reload my conf file, Logstash surprisingly parses my XML successfully:

logstash_1       | {
logstash_1       |        "StoreId" => [
logstash_1       |         [0] "001"
logstash_1       |     ],
logstash_1       |        "message" => "<root>\n    <ChainId>7290027600007</ChainId>\n    <SubChainId>001</SubChainId>\n    <StoreId>001</StoreId>\n    <BikoretNo>9</BikoretNo>\n    <DllVerNo>8.0.1.3</DllVerNo>",
logstash_1       |       "DllVerNo" => [
logstash_1       |         [0] "8.0.1.3"
logstash_1       |     ],
logstash_1       |           "type" => "xml",
logstash_1       |     "SubChainId" => [
logstash_1       |         [0] "001"
logstash_1       |     ],
logstash_1       |      "BikoretNo" => [
logstash_1       |         [0] "9"
logstash_1       |     ],
logstash_1       |           "path" => "/usr/share/logstash/logs/example10.xml",
logstash_1       |       "@version" => "1",
logstash_1       |        "ChainId" => [
logstash_1       |         [0] "7290027600007"
logstash_1       |     ],
logstash_1       |           "tags" => [
logstash_1       |         [0] "multiline"
logstash_1       |     ],
logstash_1       |     "@timestamp" => 2019-04-02T04:43:18.439Z,
logstash_1       |           "host" => "a4f1bf64a3d5"
logstash_1       | }
logstash_1       | {
logstash_1       |        "StoreId" => [
logstash_1       |         [0] "001"
logstash_1       |     ],
logstash_1       |        "message" => "<root>\n    <ChainId>7290027600007</ChainId>\n    <SubChainId>001</SubChainId>\n    <StoreId>001</StoreId>\n    <BikoretNo>9</BikoretNo>\n    <DllVerNo>8.0.1.3</DllVerNo>",
logstash_1       |       "DllVerNo" => [
logstash_1       |         [0] "8.0.1.3"
logstash_1       |     ],
logstash_1       |           "type" => "xml",
logstash_1       |     "SubChainId" => [
logstash_1       |         [0] "001"
logstash_1       |     ],
logstash_1       |      "BikoretNo" => [
logstash_1       |         [0] "9"
logstash_1       |     ],
logstash_1       |           "path" => "/usr/share/logstash/logs/example11.xml",
logstash_1       |       "@version" => "1",
logstash_1       |        "ChainId" => [
logstash_1       |         [0] "7290027600007"
logstash_1       |     ],
logstash_1       |           "tags" => [
logstash_1       |         [0] "multiline"
logstash_1       |     ],
logstash_1       |     "@timestamp" => 2019-04-02T04:43:18.440Z,
logstash_1       |           "host" => "a4f1bf64a3d5"
logstash_1       | }

The message fields of these events hold different parts of the file, so it seems Logstash is splitting the file before and after the pattern. Even so, it is not clear why the second part is emitted only on a conf reload.

This is working as expected. The codec reads the first line of the file. That line does not match the pattern, so it is added to an event (but the event is not flushed). Then it reads the next line. That line does match, so the codec flushes the previous event and starts a new event with line 2. It then reads every remaining line of the file and adds them to that event. It will not flush that event until it sees another line that matches <root> or the pipeline is reloaded.
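
To make that explanation concrete, here is a minimal Python sketch of the grouping logic for a multiline codec with what => "previous". It is only a simulation written for this thread, not Logstash code: the function name multiline_events is made up, and real features such as auto_flush_interval and max_lines are left out.

```python
import re

def multiline_events(lines, pattern, negate=True):
    """Group lines roughly the way 'what => previous' does.

    Returns (flushed, pending): events already emitted, and the lines
    still buffered while waiting for the next flush.
    """
    regex = re.compile(pattern)
    flushed, pending = [], []
    for line in lines:
        matched = bool(regex.search(line))
        # With negate => true, a line that does NOT match the pattern
        # belongs to the previous event.
        belongs_to_previous = (not matched) if negate else matched
        if not belongs_to_previous and pending:
            # A new event starts here, so the buffered one is flushed.
            flushed.append("\n".join(pending))
            pending = []
        pending.append(line)
    return flushed, pending

xml_lines = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<root>",
    "    <ChainId>7290027600007</ChainId>",
    "    <StoreId>001</StoreId>",
    "</root>",
]

flushed, pending = multiline_events(xml_lines, "<root>")
print("flushed:", flushed)   # only the XML declaration
print("pending:", pending)   # everything from <root> on, still buffered
```

With the pattern "<root>" the only event emitted right away is the XML declaration. The rest of the file stays in the buffer, which matches the output seen before the conf reload.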

So how do I solve this? I do not want events that contain none of the XPath fields, but when I use
pattern => "<\?xml version" with xpath => "/root/ChainId/text()", "ChainId" I don't get anything.

Does a file contain multiple root elements? If not, you can consume the entire file using a pattern that never matches and a timeout. For example:

codec => multiline { pattern => "^Spalanzani" what => "previous" negate => true auto_flush_interval => 1 }

OK. I changed my pattern to the suggested one, but now my XPath expressions are not taking effect. I get the following output:

logstash_1       | {
logstash_1       |     "@timestamp" => 2019-04-03T14:03:04.092Z,
logstash_1       |        "message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>\n    <ChainId>7290027600007</ChainId>\n    <SubChainId>001</SubChainId>\n    <StoreId>001</StoreId>\n    <BikoretNo>9</BikoretNo>\n    <DllVerNo>8.0.1.3</DllVerNo>",
logstash_1       |       "@version" => "1",
logstash_1       |           "host" => "a4f1bf64a3d5",
logstash_1       |           "tags" => [
logstash_1       |         [0] "multiline"
logstash_1       |     ],
logstash_1       |           "type" => "xml",
logstash_1       |           "path" => "/usr/share/logstash/logs/example8.xml"
logstash_1       | }

Your message does not contain </root>, which suggests your multiline codec is not what I wrote. No matter, even without closing the element the xpath expressions work for me. I don't know what else to suggest.

This is my exact conf file:

input {
  file {
    path => "/usr/share/logstash/logs/*.xml"
    type => "xml"
    # start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^Spalanzani"
      negate => "true"
      what => "previous"
      auto_flush_interval => 1
    }
  }
}

filter {  
  xml {
    source => "message"
    store_xml => false
    # target => "root"
    # remove_namespaces => true
    xpath => [
        "/root/ChainId/text()", "ChainId",
        "/root/SubChainId/text()", "SubChainId",
        "/root/StoreId/text()", "StoreId",
        "/root/BikoretNo/text()", "BikoretNo",
        "/root/DllVerNo/text()", "DllVerNo"
    ]
  }
}

output {
  elasticsearch {
    hosts => "elasticsearch:9200"
    index => "xml_index"
  }

  stdout { 
    codec => rubydebug 
  }
}

It works for me.

   "message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<root>\r\n    <ChainId>7290027600007</ChainId>\r\n    <SubChainId>001</SubChainId>\r\n    <StoreId>001</StoreId>\r\n    <BikoretNo>9</BikoretNo>\r\n    <DllVerNo>8.0.1.3</DllVerNo>\r\n</root>\r",
   "ChainId" => [
    [0] "7290027600007"
],

etc.

It's really strange! I don't understand what the cause could be.

Are you running on Docker?
Which Logstash version are you running?
Does the file use CRLF or LF line endings?

I am running 6.6.0 on Linux. Not docker.

The newlines have to be native to whatever platform you are running on. If you have Windows newlines on a UNIX platform, you need to use mutate with gsub to strip the \r characters out of the message.

I found the problem! My XML files were encoded as UTF-8 with a BOM instead of plain UTF-8.
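
For anyone who hits the same symptom, a quick way to check a file for a UTF-8 BOM (the bytes EF BB BF) or Windows line endings before handing it to Logstash is sketched below in Python. The helper names inspect_xml_bytes and strip_utf8_bom are hypothetical, and the sample bytes are made up for illustration. This only covers checking and cleaning the files themselves, not fixing it inside the pipeline.

```python
import codecs

def inspect_xml_bytes(data: bytes) -> dict:
    """Report the two things discussed in this thread: BOM and CRLF."""
    return {
        "has_utf8_bom": data.startswith(codecs.BOM_UTF8),
        "has_crlf": b"\r\n" in data,
    }

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Illustrative sample: a BOM followed by a tiny XML document.
sample = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?>\n<root>\n</root>\n'

print(inspect_xml_bytes(sample))
cleaned = strip_utf8_bom(sample)
print(inspect_xml_bytes(cleaned))
```

To run the same check on a real file, read it in binary mode (for example with open(path, "rb")) so the BOM bytes are not silently decoded away.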
