Logstash - Parse txt with xml

Hi guys,
I am having trouble reading a txt file that contains both free text (tab-separated) and XML that I would like to parse. The goal is to read only the content of the XML, but I have no idea how to get there; I only know how to read a file that is pure XML.

My log (txt):

REQ 1234 A 2022-05-30 12:34 CompanyA
RES 1234 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>....content...</xml
REQ 1235 A 2022-05-30 12:34 CompanyB
RES 1235 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>....content...</xml
REQ 1236 A 2022-05-30 12:34 CompanyC
RES 1236 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>....content...</xml

In order to read the xml I used the pipeline:

input {
  file {
    path => "C:/elastic_d/logstash/bin/data/file.txt"
    start_position => "beginning"
    sincedb_path => "NUL"
    type => "xml"
  }
}
filter {
  xml {
    source => "message"
    store_xml => true
    target => "theXML"
    force_array => false
    xpath => [ ....mapping of fields...]
  }
}

Any suggestions? How could I proceed?

Thanks in advance!
Ely

Hi @Ely_96

A couple of clarifying questions: do you want to drop the non-XML lines?

You state "tab" delimited; does the XML part use normal spaces, not tabs? (I assume so.)

Sample Data

REQ	1234	A 2022-05-30 12:34 CompanyA
RES	1234	<?xml version="1.0" encoding="UTF-8" standalone="yes"?><data><name>Belgian Waffles</name><price>$5.95</price></data>
REQ	1235	A 2022-05-30 12:34 CompanyB
RES	1235	<?xml version="1.0" encoding="UTF-8" standalone="yes"?><data><name>French Toast</name><price>$4.50</price></data>
REQ	1236	A 2022-05-30 12:34 CompanyC
RES	1236	<?xml version="1.0" encoding="UTF-8" standalone="yes"?><data><name>Homestyle Breakfast</name><price>$6.95</price></data>

Conf that reads the data with tabs separating the first three columns and then drops the non-XML lines:

input {
  file {
    path => "/Users/sbrown/workspace/sample-data/discuss/mixed-text-xml/mixed-txt-xml.txt"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    type => "xml"
  }
}


filter {
  grok {
    match => { "message" => "%{WORD:request_type}\t%{NUMBER:request_id}\t%{GREEDYDATA:msg_details}"}
  }

  xml {
    source => "msg_details"
    store_xml => true
    target => "xml_data"
    force_array => false
  }
  
  if  "_xmlparsefailure" in [tags] {
    drop { }
  }
}

output {
  stdout {codec => "rubydebug"}
}
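As a rough model of what that grok does, the pattern can be hand-expanded into a plain regular expression. This is a sketch using Python's `re` (grok actually uses the Oniguruma engine, and `%{NUMBER}` also matches signed decimals, so the expansion below is a simplification):

```python
import re

# Hand-expanded approximation of the grok pattern:
#   %{WORD} -> \w+, %{NUMBER} -> \d+ (simplified), %{GREEDYDATA} -> .*
PATTERN = re.compile(r"(?P<request_type>\w+)\t(?P<request_id>\d+)\t(?P<msg_details>.*)")

line = 'RES\t1234\t<?xml version="1.0" encoding="UTF-8" standalone="yes"?><data><name>Belgian Waffles</name><price>$5.95</price></data>'
m = PATTERN.match(line)
if m:
    print(m.group("request_type"))                      # -> RES
    print(m.group("request_id"))                        # -> 1234
    print(m.group("msg_details").startswith("<?xml"))   # -> True
```

Lines without the two tabbed prefix columns simply fail to match, which is what produces the `_grokparsefailure` tag in Logstash.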

Results (you can clean up and drop other fields, etc.):

{
        "@version" => "1",
        "xml_data" => {
        "price" => "$4.50",
         "name" => "French Toast"
    },
      "@timestamp" => 2022-05-30T16:07:32.522032Z,
    "request_type" => "RES",
         "message" => "RES\t1235\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>French Toast</name><price>$4.50</price></data>",
      "request_id" => "1235",
            "type" => "xml",
            "host" => {
        "name" => "hyperion.local"
    },
           "event" => {
        "original" => "RES\t1235\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>French Toast</name><price>$4.50</price></data>"
    },
             "log" => {
        "file" => {
            "path" => "/Users/sbrown/workspace/sample-data/discuss/mixed-text-xml/mixed-txt-xml.txt"
        }
    },
     "msg_details" => "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>French Toast</name><price>$4.50</price></data>"
}
{
        "@version" => "1",
        "xml_data" => {
        "price" => "$6.95",
         "name" => "Homestyle Breakfast"
    },
      "@timestamp" => 2022-05-30T16:07:32.522308Z,
    "request_type" => "RES",
         "message" => "RES\t1236\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Homestyle Breakfast</name><price>$6.95</price></data>",
      "request_id" => "1236",
            "type" => "xml",
            "host" => {
        "name" => "hyperion.local"
    },
           "event" => {
        "original" => "RES\t1236\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Homestyle Breakfast</name><price>$6.95</price></data>"
    },
             "log" => {
        "file" => {
            "path" => "/Users/sbrown/workspace/sample-data/discuss/mixed-text-xml/mixed-txt-xml.txt"
        }
    },
     "msg_details" => "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Homestyle Breakfast</name><price>$6.95</price></data>"
}
{
        "@version" => "1",
        "xml_data" => {
        "price" => "$5.95",
         "name" => "Belgian Waffles"
    },
      "@timestamp" => 2022-05-30T16:07:32.521740Z,
    "request_type" => "RES",
         "message" => "RES\t1234\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Belgian Waffles</name><price>$5.95</price></data>",
      "request_id" => "1234",
            "type" => "xml",
            "host" => {
        "name" => "hyperion.local"
    },
           "event" => {
        "original" => "RES\t1234\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Belgian Waffles</name><price>$5.95</price></data>"
    },
             "log" => {
        "file" => {
            "path" => "/Users/sbrown/workspace/sample-data/discuss/mixed-text-xml/mixed-txt-xml.txt"
        }
    },
     "msg_details" => "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Belgian Waffles</name><price>$5.95</price></data>"
}

Hi @stephenb thanks!
Yes, I would like to drop the non-XML lines, and yes, I can confirm the separator in my txt is a tab (\t), not a normal space.

If I run this against my source file I see a lot of

"tags": [
      "_grokparsefailure"
    ],

and I think the issue is

match => { "message" => "%{WORD:request_type}\t%{NUMBER:request_id}\t%{GREEDYDATA:msg_details}"}

because in my data I can see 3 types of lines:
1st type: a string like

#REQ:123-44aa-4fe1-b88a-123aa#REQ:1#VARNAME:VarValue#PROC:1234-aaaa-1234-aba1234

2nd type:

REQ	12222-aa	1	2022-05-30 18:38:49.609

3rd type:

RES	12222-aa-b1-12	<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Could you please help me configure the correct grok?
My REQ and RES IDs, like "12222-aa-b1-12", "12222-aa", and "123-44aa-4fe1-b88a-123aa", are always 36 chars of [a-z0-9-].

Thanks a lot!!!
Ely

In the future, if you provide a more representative sample, we can help quicker instead of iterating. Just something to think about.

In fact, at this point I am unclear what your data looks like: your samples have short IDs, but you say they are 36 chars long... the closer your sample is to your real data, the better. So I am still guessing.

Also, if you are going to do a lot of this, I would read up a bit on grok and dissect. tl;dr: grok is more flexible, dissect is more efficient/faster. You could probably use either here; I show both.

So test data with long ids etc...

#REQ:123-44aa-4fe1-b88a-123aa#REQ:1#VARNAME:VarValue#PROC:1234-aaaa-1234-aba1234
REQ	12222-aa	1	2022-05-30 18:38:49.609	CompanyA
RES	12222-aa-b1-12-alsdkjfh-salkdfjhas-kalaskdjfh	<?xml version="1.0" encoding="UTF-8" standalone="yes"?><data><name>Belgian Waffles</name><price>$5.95</price></data>
REQ	12223-aa	1	2022-05-30 18:38:49.609 CompanyB
RES	12223-aa-b1-12-sakdjfhsaldkjf-lsakdjfhsaldfkjh	<?xml version="1.0" encoding="UTF-8" standalone="yes"?><data><name>French Toast</name><price>$4.50</price></data>
REQ	12224-aa	1	2022-05-30 18:38:49.609 CompanyC
RES	12224-aa-b1-12-lasdkjfhsadlfkjhsadflkj	<?xml version="1.0" encoding="UTF-8" standalone="yes"?><data><name>Homestyle Breakfast</name><price>$6.95</price></data>

Pipeline: I gave you both grok and dissect... you can figure it out from here. Take a look at the docs; they will help.

input {
  file {
    path => "/Users/sbrown/workspace/sample-data/discuss/mixed-text-xml/mixed-txt-xml.txt"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    type => "xml"
  }
}


filter {

  # For Grok use \t for tabs
  # grok {
  #   match => { "message" => "%{WORD:request_type}\t%{DATA:request_id}\t%{GREEDYDATA:msg_details}"}
  # }


  # dissect you need to paste in actual tabs
  dissect {
    mapping => { "message" => "%{request_type}	%{request_id}	%{msg_details}"}
  }

  # you could put some if logic around this if you only want to parse the XML when the grok or dissect is successful
  xml {
    source => "msg_details"
    store_xml => true
    target => "xml_data"
    force_array => false
  }
  
  # For Grok
  # if  "_grokparsefailure" in [tags] or "_xmlparsefailure" in [tags] {
  #   drop {}
  # }

  # For Dissect
  if  "_dissectfailure" in [tags] or "_xmlparsefailure" in [tags] {
    drop {}
  }

}

output {
  stdout {codec => "rubydebug"}
}
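Dissect is just a positional split on literal delimiters, so the mapping plus the drop-on-failure conditionals can be modeled in a few lines. A sketch (`parse` is a hypothetical helper; the real xml filter does full parsing rather than this prefix check):

```python
def parse(line: str):
    """Model of the dissect mapping plus the drop-on-failure conditionals."""
    parts = line.split("\t", 2)  # dissect: %{request_type} %{request_id} %{msg_details}
    if len(parts) != 3:
        return None  # analogous to tagging _dissectfailure and dropping
    request_type, request_id, msg_details = parts
    if not msg_details.startswith("<?xml"):
        return None  # analogous to _xmlparsefailure -> drop
    return {"request_type": request_type,
            "request_id": request_id,
            "msg_details": msg_details}

# REQ lines split fine but fail the XML step, so they are dropped:
print(parse("REQ\t12223-aa\t1\t2022-05-30 18:38:49.609 CompanyB"))  # -> None
```

Note that a REQ line with extra tabs still dissects successfully (the last field swallows the remainder), which is why the `_xmlparsefailure` drop is what actually filters it out, just as in the pipeline above.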

results with long ids...

{
        "@version" => "1",
        "xml_data" => {
        "price" => "$6.95",
         "name" => "Homestyle Breakfast"
    },
      "@timestamp" => 2022-05-30T17:08:41.438868Z,
    "request_type" => "RES",
         "message" => "RES\t12224-aa-b1-12-lasdkjfhsadlfkjhsadflkj\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Homestyle Breakfast</name><price>$6.95</price></data>",
      "request_id" => "12224-aa-b1-12-lasdkjfhsadlfkjhsadflkj",
            "type" => "xml",
            "host" => {
        "name" => "hyperion.local"
    },
           "event" => {
        "original" => "RES\t12224-aa-b1-12-lasdkjfhsadlfkjhsadflkj\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Homestyle Breakfast</name><price>$6.95</price></data>"
    },
             "log" => {
        "file" => {
            "path" => "/Users/sbrown/workspace/sample-data/discuss/mixed-text-xml/mixed-txt-xml.txt"
        }
    },
     "msg_details" => "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Homestyle Breakfast</name><price>$6.95</price></data>"
}
{
        "@version" => "1",
        "xml_data" => {
        "price" => "$5.95",
         "name" => "Belgian Waffles"
    },
      "@timestamp" => 2022-05-30T17:08:41.438348Z,
    "request_type" => "RES",
         "message" => "RES\t12222-aa-b1-12-alsdkjfh-salkdfjhas-kalaskdjfh\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Belgian Waffles</name><price>$5.95</price></data>",
      "request_id" => "12222-aa-b1-12-alsdkjfh-salkdfjhas-kalaskdjfh",
            "type" => "xml",
            "host" => {
        "name" => "hyperion.local"
    },
           "event" => {
        "original" => "RES\t12222-aa-b1-12-alsdkjfh-salkdfjhas-kalaskdjfh\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Belgian Waffles</name><price>$5.95</price></data>"
    },
             "log" => {
        "file" => {
            "path" => "/Users/sbrown/workspace/sample-data/discuss/mixed-text-xml/mixed-txt-xml.txt"
        }
    },
     "msg_details" => "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>Belgian Waffles</name><price>$5.95</price></data>"
}
{
        "@version" => "1",
        "xml_data" => {
        "price" => "$4.50",
         "name" => "French Toast"
    },
      "@timestamp" => 2022-05-30T17:08:41.438604Z,
    "request_type" => "RES",
         "message" => "RES\t12223-aa-b1-12-sakdjfhsaldkjf-lsakdjfhsaldfkjh\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>French Toast</name><price>$4.50</price></data>",
      "request_id" => "12223-aa-b1-12-sakdjfhsaldkjf-lsakdjfhsaldfkjh",
            "type" => "xml",
            "host" => {
        "name" => "hyperion.local"
    },
           "event" => {
        "original" => "RES\t12223-aa-b1-12-sakdjfhsaldkjf-lsakdjfhsaldfkjh\t<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>French Toast</name><price>$4.50</price></data>"
    },
             "log" => {
        "file" => {
            "path" => "/Users/sbrown/workspace/sample-data/discuss/mixed-text-xml/mixed-txt-xml.txt"
        }
    },
     "msg_details" => "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><data><name>French Toast</name><price>$4.50</price></data>"
}

I am sure you can figure it out from here :slight_smile:
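One way to cover all three line types Ely described is an ID character class of `[a-z0-9-]` with a bounded length. A hypothetical sketch in Python (the sample IDs are shorter than the stated 36 chars, hence `{1,36}` rather than `{36}`; adapt the bounds to the real data, and in grok terms the ID capture would look like `(?<request_id>[a-z0-9-]{1,36})` in place of `%{NUMBER:request_id}`):

```python
import re

# Hypothetical patterns for the three observed line types.
HASH_LINE = re.compile(r"^#REQ:")  # type 1: '#'-joined key:value pairs
TAB_LINE = re.compile(r"^(?P<request_type>REQ|RES)\t(?P<request_id>[a-z0-9-]{1,36})\t(?P<msg_details>.*)$")

def classify(line: str) -> str:
    if HASH_LINE.match(line):
        return "hash"
    m = TAB_LINE.match(line)
    if m:
        return "xml" if m.group("msg_details").startswith("<?xml") else "plain"
    return "unknown"

print(classify("#REQ:123-44aa-4fe1-b88a-123aa#REQ:1"))                # -> hash
print(classify("REQ\t12222-aa\t1\t2022-05-30 18:38:49.609"))          # -> plain
print(classify('RES\t12222-aa-b1-12\t<?xml version="1.0"?><data/>'))  # -> xml
```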

Hi @stephenb yesss! Just solved it :slight_smile: thanks a lot for your time! I will provide more details in my future questions... sorry for now :slight_smile:

Thanks a lot
Ely

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.