Logstash - XML Filter not working properly


(Miguel Freitas) #1

Hi,
I'm trying to parse a complex XML document with nested arrays in Logstash, but for some reason the xml filter parses only 2 objects correctly and discards the rest.

Here is my Logstash configuration:

   input {
       file {
          path => "C:/ELK/results.xml"
          start_position => "beginning"
          sincedb_path => "nul"
          type => "xml"
          codec => multiline {
            pattern => "<CxXMLResults"
            negate => true
            what => "previous"
          }
       }
    }

    filter {
      xml {
        source => "message"
        store_xml => false
        xpath => ["CxXMLResults/@InitiatorName", "initiator_name"]
        xpath => ["CxXMLResults/@Owner", "owner"]
        xpath => ["CxXMLResults/@ScanId", "scan_id"]
        xpath => ["CxXMLResults/@ProjectId", "project_id"]
        xpath => ["CxXMLResults/@ProjectName", "project_name"]
        xpath => ["CxXMLResults/@TeamFullPathOnReportDate", "team_full_path"]
        xpath => ["CxXMLResults/@DeepLink", "scan_link"]
        xpath => ["CxXMLResults/@ScanStart", "scan_start"]
        xpath => ["CxXMLResults/@Preset", "preset"]
        xpath => ["CxXMLResults/@ScanTime", "scan_time"]
        xpath => ["CxXMLResults/@LinesOfCodeScanned", "loc"]
        xpath => ["CxXMLResults/@FilesScanned", "files_scanned"]
        xpath => ["CxXMLResults/@ReportCreationTime", "report_creation_date"]
        xpath => ["CxXMLResults/@Team", "team"]
        xpath => ["CxXMLResults/@CheckmarxVersion", "cx_version"]
        xpath => ["CxXMLResults/@ScanComments", "scan_comments"]
        xpath => ["CxXMLResults/@ScanType", "scan_type"]
        xpath => ["CxXMLResults/@SourceOrigin", "source_origin"]
        xpath => ["CxXMLResults/@Visibility", "visibility"]
        xpath => ["CxXMLResults/Query", "queries"]
      }
      split { 
        field => "queries"
      }
      xml {
        source => "queries"
        store_xml => false
        xpath => ["Query/@id", "query_id"]
        xpath => ["Query/@categories", "query_categories"]
        xpath => ["Query/@cweId", "query_cwe_id"]
        xpath => ["Query/@name", "query_name"]
        xpath => ["Query/@group", "query_group"]
        xpath => ["Query/@Severity", "query_severity"]
        xpath => ["Query/@Language", "query_language"]
        xpath => ["Query/@LanguageHash", "query_language_hash"]
        xpath => ["Query/@LanguageChangeDate", "query_language_change_date"]
        xpath => ["Query/@SeverityIndex", "query_severity_index"]
        xpath => ["Query/@QueryPath", "query_path"]
        xpath => ["Query/@QueryVersionCode", "query_version_code"]
        xpath => ["Query/Result", "results"]
      }
      split { 
        field => "results"
      }
      xml {
        source => "results"
        store_xml => false
        xpath => ["Result/@NodeId", "result_node_id"]
        xpath => ["Result/@FileName", "result_filename"]
        xpath => ["Result/@Status", "result_status"]
        xpath => ["Result/@Line", "result_line"]
        xpath => ["Result/@Column", "result_column"]
        xpath => ["Result/@FalsePositive", "result_false_positive"]
        xpath => ["Result/@Severity", "result_severity"]
        xpath => ["Result/@AssignToUser", "result_assigned_user"]
        xpath => ["Result/@state", "result_state"]
        xpath => ["Result/@Remark", "result_remark"]
        xpath => ["Result/@DeepLink", "result_link"]
        xpath => ["Result/@SeverityIndex", "result_severity_index"]
        xpath => ["Result/Path/@ResultId", "result_id"]
        xpath => ["Result/Path/@PathId", "result_path_id"]
        xpath => ["Result/Path/@SimilarityId", "result_similarity_id"]
        xpath => ["Result/Path/PathNode[1]/FileName/text()", "result_source_filename"]
        xpath => ["Result/Path/PathNode[1]/Line/text()", "result_source_line"]
        xpath => ["Result/Path/PathNode[1]/Column/text()", "result_source_column"]
        xpath => ["Result/Path/PathNode[1]/NodeId/text()", "result_source_node_id"]
        xpath => ["Result/Path/PathNode[1]/Name/text()", "result_source_name"]
        xpath => ["Result/Path/PathNode[1]/Type/text()", "result_source_type"]
        xpath => ["Result/Path/PathNode[1]/Length/text()", "result_source_length"]
        xpath => ["Result/Path/PathNode[1]/Snippet/Line/Number/text()", "result_source_snippet_line_number"]
        xpath => ["Result/Path/PathNode[1]/Snippet/Line/Code/text()", "result_source_snippet_line_code"]
        xpath => ["Result/Path/PathNode[last()]/FileName/text()", "result_dest_filename"]
        xpath => ["Result/Path/PathNode[last()]/Line/text()", "result_dest_line"]
        xpath => ["Result/Path/PathNode[last()]/Column/text()", "result_dest_column"]
        xpath => ["Result/Path/PathNode[last()]/NodeId/text()", "result_dest_node_id"]
        xpath => ["Result/Path/PathNode[last()]/Name/text()", "result_dest_name"]
        xpath => ["Result/Path/PathNode[last()]/Type/text()", "result_dest_type"]
        xpath => ["Result/Path/PathNode[last()]/Length/text()", "result_dest_length"]
        xpath => ["Result/Path/PathNode[last()]/Snippet/Line/Number/text()", "result_dest_snippet_line_number"]
        xpath => ["Result/Path/PathNode[last()]/Snippet/Line/Code/text()", "result_dest_snippet_line_code"]
      }
      mutate {
        remove_field => [ "message", "queries", "results" ]
      }
      if "_split_type_failure" in [tags] {
        drop {}
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
      file {
        path => "C:/ELK/ResultsXML.json"
      }
    }

Why does the xml filter parse only 2 objects correctly, while the others fail with the warning "Only String and Array types are splittable. field:queries is of type = NilClass"?

I expect 162 events in Logstash, but Kibana shows that only 2 arrived. The main goal is to emit one event per "Result" in the XML, carrying information from its parents ("Query", "CxXMLResults") and its children ("Path", "PathNode").

How can I fix this?

Thanks!


(Miguel Freitas) #2

Here is the rubydebug output:

[2019-04-14T03:57:24,585][INFO ][filewatch.observingtail  ] START, creating Discoverer, Watch with file and sincedb collections
[2019-04-14T03:57:24,592][INFO ][logstash.pipeline        ] Pipeline started successfully {:pipeline_id=>"main", :thread=>"#<Thread:0x2c8366e sleep>"}
[2019-04-14T03:57:24,594][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2019-04-14T03:57:24,796][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,797][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,832][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,836][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,857][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,860][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,837][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,837][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,863][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,863][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,863][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,863][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,864][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,863][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,864][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,863][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,864][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,864][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,864][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,864][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,864][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,865][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T03:57:24,864][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,865][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,865][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,865][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,865][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,865][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,866][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,867][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,867][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,867][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,867][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,867][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,960][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,962][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,962][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,961][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,963][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,964][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,964][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,964][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,964][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
[2019-04-14T03:57:24,965][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass

(Miguel Freitas) #3

rubydebug output continuation:

[2019-04-14T03:57:24,965][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
{ Object 1},    
{ Object 2}
[2019-04-14T03:57:25,038][INFO ][logstash.outputs.file    ] Opening file {:path=>"C:/ELK/ResultsXML.json"}

#4

In that case the xpath "CxXMLResults/Query" is not returning anything, so the queries field is not getting created.
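
One way to make that visible is to guard the split so it only runs when the xpath produced something. This is a minimal sketch, not a fix for the underlying cause: the xml filter is reduced to the one xpath that matters here, and the "no_queries" tag name is a hypothetical label chosen for illustration.

```
filter {
  xml {
    source => "message"
    store_xml => false
    xpath => ["CxXMLResults/Query", "queries"]
  }

  # Only split when the xpath actually created the field.
  if [queries] {
    split { field => "queries" }
  } else {
    # Hypothetical tag: marks events whose [message] had no Query nodes,
    # so they can be inspected in the rubydebug output.
    mutate { add_tag => ["no_queries"] }
  }
}
```

Events carrying the tag are the ones whose [message] did not contain a complete CxXMLResults document, which points at how the input assembled the event rather than at the xml filter.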


(Miguel Freitas) #5

Hi,

I have an XML file like this:

<?xml version="1.0" encoding="utf-8"?>
<CxXMLResults attr1="value1" ...>
    <Query attr="value" ...>
        <Result attr="value" ...>...</Result>
        <Result attr="value" ...>...</Result>
        ...
    </Query>
    <Query attr="value" ...>
        <Result attr="value" ...>...</Result>
        <Result attr="value" ...>...</Result>
        ...
    </Query>
    <Query attr="value" ...>
        <Result attr="value" ...>...</Result>
        <Result attr="value" ...>...</Result>
        ...
    </Query>
    ....
</CxXMLResults>

I want to show every "Result" with fields containing information from "Query" and "CxXMLResults". There are other elements inside "Result" that I'm also interested in, such as "Path" and "PathNode". For some reason only 2 objects are parsed correctly, and the others are discarded because of the error in the "split" filter.

How can I fix this?
Why does "CxXMLResults/Query" not contain anything?

Thanks in advance!


#6

The evidence says that is not correct. If I take this XML

<?xml version="1.0" encoding="utf-8"?>
<CxXMLResults attr1="value1">
    <Query attr="value">
        <Result attr="value">...</Result>
        <Result attr="value">...</Result>
    </Query>
    <Query attr="value">
        <Result attr="value">...</Result>
        <Result attr="value">...</Result>
    </Query>
    <Query attr="value">
        <Result attr="value">...</Result>
        <Result attr="value">...</Result>
    </Query>
</CxXMLResults>

and update your final XML filter to include

xpath => ["Result/text()", "lookHere"]

Then I get 6 events, each of which contains

  "lookHere" => [
    [0] "..."
],

I suggest you do not remove_field "message" and "queries", remove the drop filter, and use

output { stdout { codec => rubydebug } }

Then see what the actual structure is for the missing events. I do not believe they will have a [queries] field after the first xml filter parses [message].


(Miguel Freitas) #7

Hi badger,

That was only an example.
Unfortunately, I cannot upload the full XML here because replies are limited to 7000 characters and the full XML has about 32K characters.

Please see below a small example with the real fields; the output should be 4 final "Result" objects.


(Miguel Freitas) #8
<?xml version="1.0" encoding="utf-8"?>
<CxXMLResults InitiatorName="admin admin" Owner="admin@cx" ScanId="1080857" ProjectId="42" ProjectName=".NetGoat" TeamFullPathOnReportDate="CxServer\SP\Company\Users" DeepLink="http://localhost/CxWebClient/ViewerMain.aspx?scanid=1080857&amp;projectid=42" ScanStart="Friday, February 22, 2019 9:30:30 AM" Preset="Checkmarx Default" ScanTime="00h:03m:32s" LinesOfCodeScanned="47558" FilesScanned="139" ReportCreationTime="Friday, February 22, 2019 9:34:04 AM" Team="Users" CheckmarxVersion="8.9.0.117" ScanComments="" ScanType="Full" SourceOrigin="GIT" Visibility="Public">
  <Query id="431" categories="PCI DSS v3.2;PCI DSS (3.2) - 6.5.7 - Cross-site scripting (XSS),OWASP Top 10 2013;A3-Cross-Site Scripting (XSS),FISMA 2014;System And Information Integrity,NIST SP 800-53;SI-15 Information Output Filtering (P0),OWASP Top 10 2017;A7-Cross-Site Scripting (XSS),Test Custom Category;A03 - Cross Site Scripting (XSS),Test Custom Category1;A03 - Cross Site Scripting (XSS),Test Json Category;A03 - Cross Site Scripting (XSS)" cweId="79" name="Stored_XSS" group="CSharp_High_Risk" Severity="High" Language="CSharp" LanguageHash="8608329450395249" LanguageChangeDate="2018-10-08T00:00:00.0000000" SeverityIndex="3" QueryPath="CSharp\Cx\CSharp High Risk\Stored XSS Version:0" QueryVersionCode="431">
    <Result NodeId="10808570056" FileName="WebSite/BusinessLogic/Data/ProductRepository.cs" Status="Recurrent" Line="23" Column="51" FalsePositive="False" Severity="High" AssignToUser="" state="0" Remark="" DeepLink="http://localhost/CxWebClient/ViewerMain.aspx?scanid=1080857&amp;projectid=42&amp;pathid=56" SeverityIndex="3">
      <Path ResultId="1080857" PathId="56" SimilarityId="616201512">
        <PathNode>
          <FileName>WebSite/BusinessLogic/Data/ProductRepository.cs</FileName>
          <Line>23</Line>
          <Column>51</Column>
          <NodeId>1</NodeId>
          <Name>Orders</Name>
          <Type></Type>
          <Length>6</Length>
          <Snippet>
            <Line>
              <Number>23</Number>
              <Code>            var topProducts = (from o in _context.Orders</Code>
            </Line>
          </Snippet>
        </PathNode>
      </Path>
    </Result>
    <Result NodeId="10808570057" FileName="WebSite/BusinessLogic/Data/ProductRepository.cs" Status="Recurrent" Line="25" Column="52" FalsePositive="False" Severity="High" AssignToUser="" state="0" Remark="" DeepLink="http://localhost/CxWebClient/ViewerMain.aspx?scanid=1080857&amp;projectid=42&amp;pathid=57" SeverityIndex="3">
      <Path ResultId="1080857" PathId="57" SimilarityId="420485798">
        <PathNode>
          <FileName>WebSite/BusinessLogic/Data/ProductRepository.cs</FileName>
          <Line>25</Line>
          <Column>52</Column>
          <NodeId>1</NodeId>
          <Name>OrderDetails</Name>
          <Type></Type>
          <Length>12</Length>
          <Snippet>
            <Line>
              <Number>25</Number>
              <Code>                               join od in _context.OrderDetails on o.OrderId equals od.OrderId</Code>
            </Line>
          </Snippet>
        </PathNode>
      </Path>
    </Result>
  </Query>
  <Query id="427" categories="PCI DSS v3.2;PCI DSS (3.2) - 6.5.7 - Cross-site scripting (XSS),OWASP Top 10 2013;A3-Cross-Site Scripting (XSS),FISMA 2014;System And Information Integrity,NIST SP 800-53;SI-15 Information Output Filtering (P0),OWASP Top 10 2017;A7-Cross-Site Scripting (XSS),Test Custom Category;A03 - Cross Site Scripting (XSS),Test Custom Category1;A03 - Cross Site Scripting (XSS),Test Json Category;A03 - Cross Site Scripting (XSS)" cweId="79" name="Reflected_XSS_All_Clients" group="CSharp_High_Risk" Severity="High" Language="CSharp" LanguageHash="8608329450395249" LanguageChangeDate="2018-10-08T00:00:00.0000000" SeverityIndex="3" QueryPath="CSharp\Cx\CSharp High Risk\Reflected XSS All Clients Version:1" QueryVersionCode="54386807">
    <Result NodeId="10808570049" FileName="WebSite/AddUserTemp.aspx.cs" Status="Recurrent" Line="31" Column="86" FalsePositive="True" Severity="High" AssignToUser="" state="1" Remark="admin admin .NetGoat, [Friday, February 22, 2019 9:15:10 AM]: Changed status to Not Exploitable" DeepLink="http://localhost/CxWebClient/ViewerMain.aspx?scanid=1080857&amp;projectid=42&amp;pathid=49" SeverityIndex="3">
      <Path ResultId="1080857" PathId="49" SimilarityId="-365121898">
        <PathNode>
          <FileName>WebSite/AddUserTemp.aspx.cs</FileName>
          <Line>31</Line>
          <Column>86</Column>
          <NodeId>1</NodeId>
          <Name>Text</Name>
          <Type></Type>
          <Length>4</Length>
          <Snippet>
            <Line>
              <Number>31</Number>
              <Code>                lblErrorMessage.Text = string.Format("{0} was created.", txtUsername.Text);</Code>
            </Line>
          </Snippet>
        </PathNode>
      </Path>
    </Result>
    <Result NodeId="10808570050" FileName="WebSite/AddUserTemp.aspx.cs" Status="Recurrent" Line="31" Column="86" FalsePositive="True" Severity="High" AssignToUser="" state="1" Remark="admin admin .NetGoat, [Friday, February 22, 2019 9:15:10 AM]: Changed status to Not Exploitable" DeepLink="http://localhost/CxWebClient/ViewerMain.aspx?scanid=1080857&amp;projectid=42&amp;pathid=49" SeverityIndex="3">
      <Path ResultId="1080857" PathId="50" SimilarityId="-1211565865">
        <PathNode>
          <FileName>WebSite/BlogCreate.aspx.cs</FileName>
          <Line>23</Line>
          <Column>44</Column>
          <NodeId>1</NodeId>
          <Name>Text</Name>
          <Type></Type>
          <Length>4</Length>
          <Snippet>
            <Line>
              <Number>23</Number>
              <Code>                var contents = txtContents.Text;</Code>
            </Line>
          </Snippet>
        </PathNode>
      </Path>
    </Result>
  </Query>
</CxXMLResults>

(Miguel Freitas) #9

The post above is a small example; the original file has about 162 different "Result" objects in total across its "Query" objects.

Here is the config I used:

input {
   file {
      path => "C:/ELK/results.xml"
      start_position => "beginning"
      sincedb_path => "nul"
      type => "xml"
      codec => multiline {
        pattern => "<CxXMLResults"
        negate => true
        what => "previous"
      }
   }
}

filter {
  xml {
    source => "message"
    store_xml => false
    xpath => ["CxXMLResults/@InitiatorName", "initiator_name"]
    xpath => ["CxXMLResults/@Owner", "owner"]
    xpath => ["CxXMLResults/@ScanId", "scan_id"]
    xpath => ["CxXMLResults/@ProjectId", "project_id"]
    xpath => ["CxXMLResults/@ProjectName", "project_name"]
    xpath => ["CxXMLResults/@TeamFullPathOnReportDate", "team_full_path"]
    xpath => ["CxXMLResults/@DeepLink", "scan_link"]
    xpath => ["CxXMLResults/@ScanStart", "scan_start"]
    xpath => ["CxXMLResults/@Preset", "preset"]
    xpath => ["CxXMLResults/@ScanTime", "scan_time"]
    xpath => ["CxXMLResults/@LinesOfCodeScanned", "loc"]
    xpath => ["CxXMLResults/@FilesScanned", "files_scanned"]
    xpath => ["CxXMLResults/@ReportCreationTime", "report_creation_date"]
    xpath => ["CxXMLResults/@Team", "team"]
    xpath => ["CxXMLResults/@CheckmarxVersion", "cx_version"]
    xpath => ["CxXMLResults/@ScanComments", "scan_comments"]
    xpath => ["CxXMLResults/@ScanType", "scan_type"]
    xpath => ["CxXMLResults/@SourceOrigin", "source_origin"]
    xpath => ["CxXMLResults/@Visibility", "visibility"]
    xpath => ["CxXMLResults/Query", "queries"]
  }
  split { 
    field => "queries"
  }
  xml {
    source => "queries"
    store_xml => false
    xpath => ["Query/@id", "query_id"]
    xpath => ["Query/@categories", "query_categories"]
    xpath => ["Query/@cweId", "query_cwe_id"]
    xpath => ["Query/@name", "query_name"]
    xpath => ["Query/@group", "query_group"]
    xpath => ["Query/@Severity", "query_severity"]
    xpath => ["Query/@Language", "query_language"]
    xpath => ["Query/@LanguageHash", "query_language_hash"]
    xpath => ["Query/@LanguageChangeDate", "query_language_change_date"]
    xpath => ["Query/@SeverityIndex", "query_severity_index"]
    xpath => ["Query/@QueryPath", "query_path"]
    xpath => ["Query/@QueryVersionCode", "query_version_code"]
    xpath => ["Query/Result", "results"]
  }
  split { 
    field => "results"
  }
  xml {
    source => "results"
    store_xml => false
    xpath => ["Result/@NodeId", "result_node_id"]
    xpath => ["Result/@FileName", "result_filename"]
    xpath => ["Result/@Status", "result_status"]
    xpath => ["Result/@Line", "result_line"]
    xpath => ["Result/@Column", "result_column"]
    xpath => ["Result/@FalsePositive", "result_false_positive"]
    xpath => ["Result/@Severity", "result_severity"]
    xpath => ["Result/@AssignToUser", "result_assigned_user"]
    xpath => ["Result/@state", "result_state"]
    xpath => ["Result/@Remark", "result_remark"]
    xpath => ["Result/@DeepLink", "result_link"]
    xpath => ["Result/@SeverityIndex", "result_severity_index"]
    xpath => ["Result/Path/@ResultId", "result_id"]
    xpath => ["Result/Path/@PathId", "result_path_id"]
    xpath => ["Result/Path/@SimilarityId", "result_similarity_id"]
    xpath => ["Result/Path/PathNode[1]/FileName/text()", "result_source_filename"]
    xpath => ["Result/Path/PathNode[1]/Line/text()", "result_source_line"]
    xpath => ["Result/Path/PathNode[1]/Column/text()", "result_source_column"]
    xpath => ["Result/Path/PathNode[1]/NodeId/text()", "result_source_node_id"]
    xpath => ["Result/Path/PathNode[1]/Name/text()", "result_source_name"]
    xpath => ["Result/Path/PathNode[1]/Type/text()", "result_source_type"]
    xpath => ["Result/Path/PathNode[1]/Length/text()", "result_source_length"]
    xpath => ["Result/Path/PathNode[1]/Snippet/Line/Number/text()", "result_source_snippet_line_number"]
    xpath => ["Result/Path/PathNode[1]/Snippet/Line/Code/text()", "result_source_snippet_line_code"]
    xpath => ["Result/Path/PathNode[last()]/FileName/text()", "result_dest_filename"]
    xpath => ["Result/Path/PathNode[last()]/Line/text()", "result_dest_line"]
    xpath => ["Result/Path/PathNode[last()]/Column/text()", "result_dest_column"]
    xpath => ["Result/Path/PathNode[last()]/NodeId/text()", "result_dest_node_id"]
    xpath => ["Result/Path/PathNode[last()]/Name/text()", "result_dest_name"]
    xpath => ["Result/Path/PathNode[last()]/Type/text()", "result_dest_type"]
    xpath => ["Result/Path/PathNode[last()]/Length/text()", "result_dest_length"]
    xpath => ["Result/Path/PathNode[last()]/Snippet/Line/Number/text()", "result_dest_snippet_line_number"]
    xpath => ["Result/Path/PathNode[last()]/Snippet/Line/Code/text()", "result_dest_snippet_line_code"]
  }
}

output {
  stdout {
    codec=>rubydebug
  }
}

I got this output for the XML example above:

[2019-04-14T21:08:44,443][INFO ][logstash.pipeline        ] Pipeline started successfully {:pipeline_id=>"main", :thread=>"#<Thread:0x4b61c21d sleep>"}
[2019-04-14T21:08:44,443][INFO ][filewatch.observingtail  ] START, creating Discoverer, Watch with file and sincedb collections
[2019-04-14T21:08:44,443][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2019-04-14T21:08:44,647][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:queries is of type = NilClass
[2019-04-14T21:08:44,647][WARN ][logstash.filters.split   ] Only String and Array types are splittable. field:results is of type = NilClass
{
  "@version" => "1",
      "tags" => [
    [0] "_split_type_failure"
],
      "type" => "xml",
      "path" => "C:/ELK/results.xml",
"@timestamp" => 2019-04-14T20:08:44.522Z,
      "host" => "MiguelF-Laptop",
   "message" => "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r"
}

#10

With that XML in a generator input and that filter, I get 4 messages. It might be time to take a closer look at how your multiline pattern is performing.

Getting the first line of XML (the <?xml...) as a separate event, which of course will not parse, is expected. You will also be missing the last entry in the file, since there is no later line matching "<CxXMLResults" to flush the last event.

Using auto_flush_interval on the multiline codec might help.

If you just want to consume the entire file in a single event (which you can then split), I normally use a pattern that never matches.

codec => multiline {
  pattern => "^Spalanzani"
  what => "previous"
  negate => true
  auto_flush_interval => 1
}
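
Placed in the file input from the original post, that codec would look roughly like the sketch below. The path and sincedb_path are copied from the earlier config. The max_lines and max_bytes values are illustrative assumptions, not recommendations: the codec stops appending once either limit is hit (the defaults are 500 lines and 10 MiB), so a large file read as one event needs both raised.

```
input {
  file {
    path => "C:/ELK/results.xml"
    start_position => "beginning"
    sincedb_path => "nul"
    codec => multiline {
      # No line starts with this word, so with negate => true every line
      # is appended to the previous one and the file becomes one event.
      pattern => "^Spalanzani"
      negate => true
      what => "previous"
      # Emit the buffered event after 1 second without new lines.
      auto_flush_interval => 1
      # Assumed limits for illustration; size them to the largest file.
      max_lines => 200000
      max_bytes => "100 MiB"
    }
  }
}
```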

(Miguel Freitas) #11

Hi Badger,

Thank you very much for your help so far, but the parsing is still not working properly.

I don't want to consume the file as a single event, but as many events as there are "Result" tags.

After adding auto_flush_interval and using the original XML file (162 Results, more than 11k lines) I got this:

  • 25 events were parsed as expected
  • 137 (162-25) events were NOT parsed as expected, and carry the "multiline_codec_max_lines_reached" and "_split_type_failure" tags

#12

You realize that number is adjustable, right?


(Miguel Freitas) #13

Why do I get the "multiline_codec_max_lines_reached" tag?

The XML file sometimes has 1k or 2k lines, but it can also be much larger: 50k, 100k or more lines.

Which value should I set for max_lines?

And what should auto_flush_interval be set to in my case?

Thanks !


(Miguel Freitas) #14

Hi Badger,

After changing the codec to:

  codec => multiline {
    pattern => "<CxXMLResults"
    negate => true
    what => "previous"
    auto_flush_interval => 1
    max_lines => 10000000  ### Line added
  }

I got 163 events:

  • 162 correctly parsed results

  • 1 event NOT parsed correctly (<?xml version="1.0" encoding="utf-8"?>)

Now I want to make some improvements, such as:

  • Remove the "message", "queries" and "results" fields because they are no longer useful
  • Ignore the line that is NOT parsed correctly (<?xml version="1.0" encoding="utf-8"?>)

Any suggestions? Are there any other improvements you can recommend?

Thank you very much for your help !!!


#15

The documentation I linked to explains why. The max_lines option (the source of the multiline_codec_max_lines_reached tag) does not relate to the size of the file; it limits the number of lines that can be combined into a single CxXMLResults event, and defaults to 500. Having a half-million line file is not a problem if it is being parsed as 2,000 events of 250 lines. However, if it is 250 events of 2,000 lines, that option would need to be tuned.

You need auto_flush_interval to be longer than the time it takes the multiline codec to assemble the largest CxXMLResults event. It only affects the last event in the file, so I think it would be OK to err on the high side. You can then see how long it takes the codec to assemble events (based on the gaps in @timestamp) and adjust it.
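
For the two cleanup items raised in the previous post, one possible approach is sketched below. It assumes that every event worth keeping contains the root element name in [message], which holds for the sample shown in this thread but is an assumption about other files. Dropping the stray declaration event before any parsing also avoids the split warnings it would otherwise trigger.

```
filter {
  # The stand-alone XML declaration line has no root element, so drop it
  # before the xml and split filters ever see it.
  if [message] !~ /<CxXMLResults/ {
    drop { }
  }

  # ... the three xml filters and two split filters from the config above ...

  # Once every field has been extracted, the raw XML copies are redundant.
  mutate {
    remove_field => ["message", "queries", "results"]
  }
}
```

Keeping the mutate as the last filter matters: the second and third xml filters read [queries] and [results], so removing those fields any earlier would break the later stages.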