Unstructured data source

I am trying to read data from an unstructured data source.

What is the best filter to start with? Grok?

Example:

The log details below should fall into one document.

Log file:

2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing

<UserDataRequest>   
<id>1</id>
---
--
</UserDataRequest>

Processing logs Details - .....

<UserDataResponse>
<id>1</id>
---
---
--
</UserDataResponse>

Use a multiline codec on the input to combine lines. Perhaps

    codec => multiline {
        pattern => "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2} "
        negate => true
        what => "previous"
        auto_flush_interval => 1
    }

Then you can parse it using something like

    dissect { mapping => { "message" => "%{[@metadata][ts]} ProcessId[%{processId}] TransactionId[%{tranId}]%{}" } }
    date { match => [ "[@metadata][ts]", "YYYY-MM-dd'T'HH:mm:ss" ] }
    grok {
        break_on_match => false
        match => {
            "message" => [
                "(?<[@metadata][request]>\<UserDataRequest>.*\</UserDataRequest>)",
                "(?<[@metadata][response]>\<UserDataResponse>.*\</UserDataResponse>)"
            ]
        }
    }
    xml { source => "[@metadata][request]" target => "request" force_array => false }
    xml { source => "[@metadata][response]" target => "response"  force_array => false}

Thank you for providing the right way to process my data.

My log data has an additional constraint.

As you can see below, a request and its respective response are not sequential. I need to keep parsing the log until I find the matching response for a given request, then send the pair to Logstash.

Will the aggregate filter help here?

2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing

<UserDataRequest>   
<id>5</id>
---
--
</UserDataRequest>

2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing
2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing
2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing

<UserDataRequest>   
<id>6</id>
---
--
</UserDataRequest>

2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing
2019-06-17T00:00:01 ProcessId[1234] TransactionId[34566] - Processing

<UserDataResponse>
<id>5</id>
---
---
--
</UserDataResponse>

Yes, you could use an aggregate filter and save the request for a TransactionId in the map until you see the corresponding response.
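A rough sketch of that idea, assuming the id inside the XML has already been extracted into an xmlId field (the field name and the timeout value are placeholders; in your sample the TransactionId is identical on every line, so the XML id looks like the actual correlation key). Note that the aggregate filter only works reliably with a single pipeline worker (-w 1):

    if [@metadata][request] {
        aggregate {
            task_id => "%{xmlId}"
            # store the request XML in the map until the response arrives
            code => "map['request'] = event.get('[@metadata][request]')"
            map_action => "create"
        }
    } else if [@metadata][response] {
        aggregate {
            task_id => "%{xmlId}"
            # attach the saved request to the response event and close the task
            code => "event.set('request', map['request'])"
            map_action => "update"
            end_of_task => true
            timeout => 120
        }
    }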

Thank you for the suggestion.

I am unable to read XML content from a file. The file contains nothing but XML data.

input {
    file {
        path => "C:/Ashok/logstash-7.1.1/files/xmlFile.txt"
        start_position => "beginning"
        sincedb_path => "NUL"
        codec => multiline {
            pattern => "^<UserDataRequest>"
            negate => true
            what => "previous"
        }
    }
}

filter {
    xml { source => "message" target => "request" force_array => false }
}

output {
    elasticsearch {
        hosts => "localhost:9200"
        index => "hackxml"
    }
    stdout {
        codec => rubydebug
    }
}

That will combine every line that does not start with <UserDataRequest> with the preceding line that does start with <UserDataRequest>. When it sees the next line that starts with <UserDataRequest>, it pushes whatever it has accumulated to the pipeline as an event. In other words, the last <UserDataRequest> in the file never gets pushed. You can fix that using the auto_flush_interval option on the codec.
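Applied to your codec, that might look something like this (the 2-second value is arbitrary):

    codec => multiline {
        pattern => "^<UserDataRequest>"
        negate => true
        what => "previous"
        # push a pending event if no new line arrives for 2 seconds,
        # so the last <UserDataRequest> block is not held back forever
        auto_flush_interval => 2
    }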

@Badger I am still confused about the multiline codec. How do pattern, negate, and what work?

I read the documentation, but I am still confused about how lines get attached to the "previous" line.

I doubt I can explain it better than the documentation. I suggest you create some dummy data and experiment with different configurations.
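For example, a throwaway config like this (stdin input is just for experimenting) lets you paste lines and watch how they get grouped:

    input {
        stdin {
            codec => multiline {
                # negate => true inverts the match, so every line that does
                # NOT start with <UserData> is appended to the event begun
                # by the last line that did ("previous")
                pattern => "^<UserData>"
                negate => true
                what => "previous"
                auto_flush_interval => 5
            }
        }
    }
    output { stdout { codec => rubydebug } }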

My conf file is:

input {
    file {
        path => "C:/Ashok/logstash-7.1.1/files/xmlFile.txt"
        start_position => "beginning"
        sincedb_path => "NUL"
        codec => multiline {
            pattern => "<UserData>"
            negate => true
            what => "previous"
            auto_flush_interval => 10
        }
    }
}

filter {
    xml { source => "message" target => "request" force_array => false }
}

output {
    elasticsearch {
        hosts => "localhost:9200"
        index => "hackxml"
    }
    stdout {
        codec => rubydebug
    }
}

XML file:

 <UserData>
	<name>Adam</name>
	<age>30</age>
	<address>
		<apt>11</apt>
		<streetname>Apricot Ave</streetname>
		<city>Boston</city>
	</address>
 </UserData>

Console output:

 [2019-06-20T14:17:11,000][WARN ][logstash.filters.xml     ] Error parsing xml with XmlSimple {:source=>"message", :value=>" <UserData>\r\n\t<name>Adam</name>\r\n\t<age>30</age>\r\n\t<address>\r\n\t\t<apt>11</apt>\r\n\t\t<streetname>Apricot Ave</streetname>\r\n\t\t<city>Boston</city>\r\n\t</address>\r", :exception=>#<REXML::ParseException: No close tag for /UserData
Line: 8
Position: 153
Last 80 unconsumed characters:
>, :backtrace=>["uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/treeparser.rb:28:in `parse'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/document.rb:288:in `build'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/document.rb:45:in `initialize'", "C:/Ashok/logstash-7.1.1/vendor/bundle/jruby/2.5.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:971:in `parse'", "C:/Ashok/logstash-7.1.1/vendor/bundle/jruby/2.5.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:164:in `xml_in'", "C:/Ashok/logstash-7.1.1/vendor/bundle/jruby/2.5.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:203:in `xml_in'", "C:/Ashok/logstash-7.1.1/vendor/bundle/jruby/2.5.0/gems/logstash-filter-xml-4.0.7/lib/logstash/filters/xml.rb:185:in `filter'", "C:/Ashok/logstash-7.1.1/logstash-core/lib/logstash/filters/base.rb:143:in `do_filter'", "C:/Ashok/logstash-7.1.1/logstash-core/lib/logstash/filters/base.rb:162:in `block in multi_filter'", "org/jruby/RubyArray.java:1792:in `each'", "C:/Ashok/logstash-7.1.1/logstash-core/lib/logstash/filters/base.rb:159:in `multi_filter'", "org/logstash/config/ir/compiler/AbstractFilterDelegatorExt.java:115:in `multi_filter'", "C:/Ashok/logstash-7.1.1/logstash-core/lib/logstash/java_pipeline.rb:235:in `block in start_workers'"]}
{
      "@version" => "1",
    "@timestamp" => 2019-06-20T21:17:10.483Z,
          "tags" => [
        [0] "multiline",
        [1] "_xmlparsefailure"
    ],
          "host" => "L-SJL-11016089",
          "path" => "C:/Ashok/logstash-7.1.1/files/xmlFile.txt",
       "message" => " <UserData>\r\n\t<name>Adam</name>\r\n\t<age>30</age>\r\n\t<address>\r\n\t\t<apt>11</apt>\r\n\t\t<streetname>Apricot Ave</streetname>\r\n\t\t<city>Boston</city>\r\n\t</address>\r"
}

Are you sure there is a line terminator on the </UserData> line? Try adding a blank line at the end of the file to be certain.

Perfect! After adding the blank line, it started working.

However, I am a bit confused about the message field and the request field in the response below.

Is it possible to save the XML as XML content in a field, rather than in JSON format?

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "hackxml",
        "_type" : "_doc",
        "_id" : "vWTTdmsBqOSNlZHlA_vL",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : "2019-06-20T21:37:48.559Z",
          "path" : "C:/Ashok/logstash-7.1.1/files/xmlFile.txt",
          "@version" : "1",
          "host" : "L-SJL-11016089",
          "message" : "\r"
        }
      },
      {
        "_index" : "hackxml",
        "_type" : "_doc",
        "_id" : "vmTTdmsBqOSNlZHlLfum",
        "_score" : 1.0,
        "_source" : {
          "host" : "L-SJL-11016089",
          "tags" : [
            "multiline"
          ],
          "@timestamp" : "2019-06-20T21:37:59.078Z",
          "path" : "C:/Ashok/logstash-7.1.1/files/xmlFile.txt",
          "@version" : "1",
          "message" : """
 <UserData>
	<name>Adam</name>
	<age>30</age>
	<address>
		<apt>11</apt>
		<streetname>Apricot Ave</streetname>
		<city>Boston</city>
	</address>
 </UserData>
 
""",
          "request" : {
            "name" : "Adam",
            "age" : "30",
            "address" : {
              "apt" : "11",
              "city" : "Boston",
              "streetname" : "Apricot Ave"
            }
          }
        }
      }
    ]
  }
}

You are looking at the data returned by elasticsearch, which is always JSON.
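If you want the raw XML string kept in its own field alongside the parsed version, it is already present in message; one option (a sketch, the rawXml field name is made up) is to copy it before the xml filter runs:

    mutate { copy => { "message" => "rawXml" } }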
