Unstructured data source

I am trying to read data from an unstructured data source.

What is the best filter to start with? Grok?

Example:

The log details below should fall into one document.

Log file:

2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing

<UserDataRequest>   
<id>1</id>
---
--
</UserDataRequest>

Processing logs Details - .....

<UserDataResponse>
<id>1</id>
---
---
--
</UserDataResponse>

Use a multiline codec on the input to combine lines. Perhaps

    codec => multiline {
        pattern => "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2} "
        negate => true
        what => "previous"
        auto_flush_interval => 1
    }

Then you can parse it using something like

    dissect { mapping => { "message" => "%{[@metadata][ts]} ProcessId[%{processId}] TransactionId[%{tranId}]%{}" } }
    date { match => [ "[@metadata][ts]", "YYYY-MM-dd'T'HH:mm:ss" ] }
    grok {
        break_on_match => false
        match => {
            "message" => [
                "(?<[@metadata][request]>\<UserDataRequest>.*\</UserDataRequest>)",
                "(?<[@metadata][response]>\<UserDataResponse>.*\</UserDataResponse>)"
            ]
        }
    }
    xml { source => "[@metadata][request]" target => "request" force_array => false }
    xml { source => "[@metadata][response]" target => "response"  force_array => false}

Thank you for providing the right way to process my data.

My log data has an additional constraint.

As you can see below, a request and its respective response are not sequential. I need to keep parsing the log until I find the matching response for a given request, then send the pair to Logstash.

Will the aggregate filter help here?

2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing

<UserDataRequest>   
<id>5</id>
---
--
</UserDataRequest>

2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing
2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing
2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing

<UserDataRequest>   
<id>6</id>
---
--
</UserDataRequest>

2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing
2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing

<UserDataResponse>
<id>5</id>
---
---
--
</UserDataResponse>

Yes, you could use an aggregate filter and save the request for a TransactionId in the map until you see the corresponding response.
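A rough sketch of that idea, assuming the id inside the XML has already been extracted into an xmlId field (the field name and the timeout value are placeholders; in your sample the TransactionId is identical on every line, so the XML id looks like the actual correlation key). Note that the aggregate filter only works reliably with a single pipeline worker (-w 1):

    if [@metadata][request] {
        aggregate {
            task_id => "%{xmlId}"
            # store the request XML in the map until the response arrives
            code => "map['request'] = event.get('[@metadata][request]')"
            map_action => "create"
        }
    } else if [@metadata][response] {
        aggregate {
            task_id => "%{xmlId}"
            # attach the saved request to the response event and close the task
            code => "event.set('request', map['request'])"
            map_action => "update"
            end_of_task => true
            timeout => 120
        }
    }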

Thank you for the suggestion.

I am unable to read XML content from a file. The file contains nothing but XML data.

input {
    file {
        path => "C:/Ashok/logstash-7.1.1/files/xmlFile.txt"
        start_position => "beginning"
        sincedb_path => "NUL"
        codec => multiline {
            pattern => "^<UserDataRequest>"
            negate => true
            what => "previous"
        }
    }
}

filter {
    xml { source => "message" target => "request" force_array => false }
}

output {
    elasticsearch {
        hosts => "localhost:9200"
        index => "hackxml"
    }
    stdout {
        codec => rubydebug
    }
}

That will combine every line that does not start with <UserDataRequest> with the preceding line that does start with <UserDataRequest>. When it sees the next line that starts with <UserDataRequest>, it pushes whatever it has accumulated to the pipeline as an event. In other words, the last <UserDataRequest> in the file never gets pushed. You can fix that using the auto_flush_interval option on the codec.
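Applied to your codec, that might look something like this (the 2-second value is arbitrary):

    codec => multiline {
        pattern => "^<UserDataRequest>"
        negate => true
        what => "previous"
        # push a pending event if no new line arrives for 2 seconds,
        # so the last <UserDataRequest> block is not held back forever
        auto_flush_interval => 2
    }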

@Badger I am still confused about the multiline codec. How do pattern, negate, and what work?

I read the documentation, but I am still confused about how lines get attached to the "previous" line.

I doubt I can explain it better than the documentation. I suggest you create some dummy data and experiment with different configurations.
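For example, a throwaway config like this (stdin input is just for experimenting) lets you paste lines and watch how they get grouped:

    input {
        stdin {
            codec => multiline {
                # negate => true inverts the match, so every line that does
                # NOT start with <UserData> is appended to the event begun
                # by the last line that did ("previous")
                pattern => "^<UserData>"
                negate => true
                what => "previous"
                auto_flush_interval => 5
            }
        }
    }
    output { stdout { codec => rubydebug } }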

My conf file is:

input {
    file {
        path => "C:/Ashok/logstash-7.1.1/files/xmlFile.txt"
        start_position => "beginning"
        sincedb_path => "NUL"
        codec => multiline {
            pattern => "<UserData>"
            negate => true
            what => "previous"
            auto_flush_interval => 10
        }
    }
}

filter {
    xml { source => "message" target => "request" force_array => false }
}

output {
    elasticsearch {
        hosts => "localhost:9200"
        index => "hackxml"
    }
    stdout {
        codec => rubydebug
    }
}

XML file:

 <UserData>
	<name>Adam</name>
	<age>30</age>
	<address>
		<apt>11</apt>
		<streetname>Apricot Ave</streetname>
		<city>Boston</city>
	</address>
 </UserData>

Console output:

 [2019-06-20T14:17:11,000][WARN ][logstash.filters.xml     ] Error parsing xml with XmlSimple {:source=>"message", :value=>" <UserData>\r\n\t<name>Adam</name>\r\n\t<age>30</age>\r\n\t<address>\r\n\t\t<apt>11</apt>\r\n\t\t<streetname>Apricot Ave</streetname>\r\n\t\t<city>Boston</city>\r\n\t</address>\r", :exception=>#<REXML::ParseException: No close tag for /UserData
Line: 8
Position: 153
Last 80 unconsumed characters:
>, :backtrace=>["uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/treeparser.rb:28:in `parse'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/document.rb:288:in `build'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/document.rb:45:in `initialize'", "C:/Ashok/logstash-7.1.1/vendor/bundle/jruby/2.5.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:971:in `parse'", "C:/Ashok/logstash-7.1.1/vendor/bundle/jruby/2.5.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:164:in `xml_in'", "C:/Ashok/logstash-7.1.1/vendor/bundle/jruby/2.5.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:203:in `xml_in'", "C:/Ashok/logstash-7.1.1/vendor/bundle/jruby/2.5.0/gems/logstash-filter-xml-4.0.7/lib/logstash/filters/xml.rb:185:in `filter'", "C:/Ashok/logstash-7.1.1/logstash-core/lib/logstash/filters/base.rb:143:in `do_filter'", "C:/Ashok/logstash-7.1.1/logstash-core/lib/logstash/filters/base.rb:162:in `block in multi_filter'", "org/jruby/RubyArray.java:1792:in `each'", "C:/Ashok/logstash-7.1.1/logstash-core/lib/logstash/filters/base.rb:159:in `multi_filter'", "org/logstash/config/ir/compiler/AbstractFilterDelegatorExt.java:115:in `multi_filter'", "C:/Ashok/logstash-7.1.1/logstash-core/lib/logstash/java_pipeline.rb:235:in `block in start_workers'"]}
{
      "@version" => "1",
    "@timestamp" => 2019-06-20T21:17:10.483Z,
          "tags" => [
        [0] "multiline",
        [1] "_xmlparsefailure"
    ],
          "host" => "L-SJL-11016089",
          "path" => "C:/Ashok/logstash-7.1.1/files/xmlFile.txt",
       "message" => " <UserData>\r\n\t<name>Adam</name>\r\n\t<age>30</age>\r\n\t<address>\r\n\t\t<apt>11</apt>\r\n\t\t<streetname>Apricot Ave</streetname>\r\n\t\t<city>Boston</city>\r\n\t</address>\r"
}

Are you sure there is a line terminator on the </UserData> line? Try adding a blank line at the end of the file to be certain.

Perfect! After adding the blank line, it started working.

However, I am a bit confused about the message field and the request field in the response below.

Is it possible to save the XML as XML content in a field, rather than in JSON format?

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "hackxml",
        "_type" : "_doc",
        "_id" : "vWTTdmsBqOSNlZHlA_vL",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : "2019-06-20T21:37:48.559Z",
          "path" : "C:/Ashok/logstash-7.1.1/files/xmlFile.txt",
          "@version" : "1",
          "host" : "L-SJL-11016089",
          "message" : "\r"
        }
      },
      {
        "_index" : "hackxml",
        "_type" : "_doc",
        "_id" : "vmTTdmsBqOSNlZHlLfum",
        "_score" : 1.0,
        "_source" : {
          "host" : "L-SJL-11016089",
          "tags" : [
            "multiline"
          ],
          "@timestamp" : "2019-06-20T21:37:59.078Z",
          "path" : "C:/Ashok/logstash-7.1.1/files/xmlFile.txt",
          "@version" : "1",
          "message" : """
 <UserData>
	<name>Adam</name>
	<age>30</age>
	<address>
		<apt>11</apt>
		<streetname>Apricot Ave</streetname>
		<city>Boston</city>
	</address>
 </UserData>
 
""",
          "request" : {
            "name" : "Adam",
            "age" : "30",
            "address" : {
              "apt" : "11",
              "city" : "Boston",
              "streetname" : "Apricot Ave"
            }
          }
        }
      }
    ]
  }
}

You are looking at the data returned by elasticsearch, which is always JSON.
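If you want the raw XML string kept in its own field alongside the parsed version, it is already present in message; one option (a sketch, the rawXml field name is made up) is to copy it before the xml filter runs:

    mutate { copy => { "message" => "rawXml" } }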
