Help with Logstash XML parsing using XPath

Hello!

I need help parsing this kind of XML file:

<NewData>
  <Data ID="1234" OtherID="48" Description="This is a sample" Type="Inside" Instructions="Only Here" OtherInstructions="NONE">
    <EntryDate>
      <CCYY>2020</CCYY>
      <Month>4</Month>
      <Day>13</Day>
    </EntryDate>
    <OutDate>
      <CCYY>2020</CCYY>
      <Month>4</Month>
      <Day>13</Day>
    </OutDate>
    <ClientConfig>
      <ClientSetting name="Some_Value1">true</ClientSetting>
      <ClientSetting name="Some_Value2">true</ClientSetting>
      <ClientSetting name="Some_Value3">true</ClientSetting>
      <ClientSetting name="Some_Value4">true</ClientSetting>
      <ClientSetting name="Some_Value5">true</ClientSetting>
    </ClientConfig>
  </Data>
</NewData>

You would use an xml filter. Either store the whole XML

xml { source => "message" target => "theXML" force_array => false }

or pull parts of it out using xpath

    xml {
        source => "message"
        store_xml => false
        xpath => {
            "/NewData/Data/@ID" => "ID"
            "/NewData/Data/ClientConfig/ClientSetting/text()" => "Setting"
        }
    }

would get you

   "Setting" => [
    [0] "true",
    [1] "true",
    [2] "true",
    [3] "true",
    [4] "true"
],
        "ID" => [
    [0] "1234"
]
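
To see why both fields come back as arrays, the same two XPath expressions can be tried outside Logstash. This is a standalone sketch using REXML, which ships with Ruby. It is not the xml filter's own code, and the trimmed-down sample string and the extra `@name` expression are only there for illustration.

```ruby
require "rexml/document"

# Trimmed-down copy of the sample document, for illustration only.
xml = <<~XML
  <NewData>
    <Data ID="1234" OtherID="48">
      <ClientConfig>
        <ClientSetting name="Some_Value1">true</ClientSetting>
        <ClientSetting name="Some_Value2">true</ClientSetting>
      </ClientConfig>
    </Data>
  </NewData>
XML

doc = REXML::Document.new(xml)

# An XPath expression selects a set of nodes, so every result is a list,
# even when only one node matches.
ids = REXML::XPath.match(doc, "/NewData/Data/@ID").map(&:value)
settings = REXML::XPath.match(doc, "/NewData/Data/ClientConfig/ClientSetting/text()").map(&:to_s)

# Selecting @name as well gives a second list in the same document order,
# which can be zipped with the values.
names = REXML::XPath.match(doc, "/NewData/Data/ClientConfig/ClientSetting/@name").map(&:value)
by_name = names.zip(settings).to_h
```

In the filter, a second xpath entry for `@name` would produce a parallel array in the same way, and pairing the two arrays is the "merging" work that xpath leaves to you.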

Hello, thank you! How do I parse it if the ID is unique to a certain file, as well as the client settings? Thanks!

I do not understand the question.




<NewData>
  <Data ID="1234" OtherID="48" Description="This is a sample" Type="Inside" Instructions="Only Here" OtherInstructions="NONE">
    <EntryDate>
      <CCYY>2020</CCYY>
      <Month>4</Month>
      <Day>13</Day>
    </EntryDate>
    <OutDate>
      <CCYY>2020</CCYY>
      <Month>4</Month>
      <Day>13</Day>
    </OutDate>
    <ClientConfig>
      <ClientSetting name="Some_Value1">true</ClientSetting>
      <ClientSetting name="Some_Value2">true</ClientSetting>
      <ClientSetting name="Some_Value3">true</ClientSetting>
      <ClientSetting name="Some_Value4">true</ClientSetting>
      <ClientSetting name="Some_Value5">true</ClientSetting>
    </ClientConfig>
  </Data>

  <Document ID="1" Type="-1" Description="XXX" Instructions="THIS">
    <Text>        Some text here 
 </Text>
  </Document>
  <Document ID="2" Type="-1" Description="YYY" Instructions="THIS">
    <Text>           
Some Text here B         
 </Text>
    
  </Document>

</NewData>

Let me rephrase. How do I parse the text inside each Document element? The number of documents varies and would need to be looped over (e.g. Document 1 to 40), and each document has its own unique text.

Again, either use store_xml => true, or use xpath and deal with merging all the arrays of data.

Can you show me how it would be stored and merged?

For example, I want to show:

Data ID = 1234
Document ID 1 = Some text here
Document ID 2 = Some Text here B

You will need ruby to iterate over the documents

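    # Parse the whole document into [@metadata][theXML] and also pull out the Data ID.
    xml { source => "message" target => "[@metadata][theXML]" xpath => { "/NewData/Data/@ID" => "DataID" } }
    # xpath always produces an array, so replace it with its first entry.
    mutate { replace => { "DataID" => "%{[DataID][0]}" } }
    ruby {
        code => '
            # Each Document element becomes one hash in this array.
            docs = event.get("[@metadata][theXML][Document]")
            if docs.is_a? Array
                docs.each { |x|
                    id = x["ID"]
                    text = x["Text"][0]
                    event.set("documentId#{id}", text)
                }
            end
        '
    }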

will get you

"documentId1" => "        Some text here \n ",
     "DataID" => "1234",
"documentId2" => "           \nSome Text here B         \n ",

The metadata field looks like this:

     "theXML" => {
    "Document" => [
        [0] {
                      "ID" => "1",
            "Instructions" => "THIS",
             "Description" => "XXX",
                    "Type" => "-1",
                    "Text" => [
                [0] "        Some text here \n "
            ]
        },
        [1] {
                      "ID" => "2",
            "Instructions" => "THIS",
             "Description" => "YYY",
                    "Type" => "-1",
                    "Text" => [
                [0] "           \nSome Text here B         \n "
            ]
        }
    ],
        "Data" => [
        [0] {
                 "Instructions" => "Only Here",
                  "Description" => "This is a sample",
            "OtherInstructions" => "NONE",
                    "EntryDate" => [
                [0] {
                     "CCYY" => [
                        [0] "2020"
                    ],
                      "Day" => [
                        [0] "13"
                    ],
                    "Month" => [
                        [0] "4"
                    ]
                }
            ],
                           "ID" => "1234",
                 "ClientConfig" => [
                [0] {
                    "ClientSetting" => [
                        [0] {
                               "name" => "Some_Value1",
                            "content" => "true"
                        },
                        [1] {
                               "name" => "Some_Value2",
                            "content" => "true"
                        },
                        [2] {
                               "name" => "Some_Value3",
                            "content" => "true"
                        },
                        [3] {
                               "name" => "Some_Value4",
                            "content" => "true"
                        },
                        [4] {
                               "name" => "Some_Value5",
                            "content" => "true"
                        }
                    ]
                }
            ],
                      "OutDate" => [
                [0] {
                     "CCYY" => [
                        [0] "2020"
                    ],
                      "Day" => [
                        [0] "13"
                    ],
                    "Month" => [
                        [0] "4"
                    ]
                }
            ],
                      "OtherID" => "48",
                         "Type" => "Inside"
        }
    ]
}
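
The Text values keep the whitespace and newlines from the file. If that is unwanted, the loop can strip it before setting the field. Below is a plain Ruby sketch of the same iteration that runs outside Logstash. The `docs` array is a hand-copied stand-in for `[@metadata][theXML][Document]`, and the `fields` hash is a stand-in for the calls to `event.set`; both are assumptions made so the example is self-contained.

```ruby
# Stand-in for event.get("[@metadata][theXML][Document]"), copied from the
# output above. Text is an array there, so the sketch treats it as one.
docs = [
  { "ID" => "1", "Text" => ["        Some text here \n "] },
  { "ID" => "2", "Text" => ["           \nSome Text here B         \n "] }
]

# Stand-in for the event: in a ruby filter this would be event.set(key, value).
fields = {}

if docs.is_a?(Array)
  docs.each do |doc|
    id = doc["ID"]
    # Array(...) tolerates a missing Text element; strip removes the
    # leading and trailing whitespace, including newlines.
    text = Array(doc["Text"]).first.to_s.strip
    fields["documentId#{id}"] = text
  end
end
```

In the filter itself, the only change would be adding `.strip` to the line that reads the text.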

It looks like my texts are parsed separately.

If I have whole paragraphs inside the Text element, do I have to specify anything in the input?

Example Text:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nihil acciderat ei, quod nollet, nisi quod anulum, quo delectabatur, in mari abiecerat. Ne amores quidem sanctos a sapiente alienos esse arbitrantur. Quid censes in Latino fore? Is ita vivebat, ut nulla tam exquisita posset inveniri voluptas, qua non abundaret. Primum cur ista res digna odio est, nisi quod est turpis? Duo Reges: constructio interrete.

Quid dubitas igitur, inquam, summo bono a te ita constituto, ut id totum in non dolendo sit, id tenere unum, id tueri, id defendere? Sunt etiam turpitudines plurimae, quae, nisi honestas natura plurimum valeat, cur non cadant in sapientem non est facile defendere. Sed tempus est, si videtur, et recta quidem ad me. At iam decimum annum in spelunca iacet. Hinc ceteri particulas arripere conati suam quisque videro voluit afferre sententiam. Vobis autem, quibus nihil est aliud propositum nisi rectum atque honestum, unde officii, unde agendi principlum nascatur non reperietis. An quod ita callida est, ut optime possit architectari voluptates? Nam diligi et carum esse iucundum est propterea, quia tutiorem vitam et voluptatem pleniorem efficit. Sed mehercule pergrata mihi oratio tua. In enumerandis autem corporis commodis si quis praetermissam a nobis voluptatem putabit, in aliud tempus ea quaestio differatur. Ita finis bonorum existit secundum naturam vivere sic affectum, ut optime is affici possit ad naturamque accommodatissime.

Octavio fuit, cum illam severitatem in eo filio adhibuit, quem in adoptionem D. Ea, quae dialectici nunc tradunt et docent, nonne ab illis instituta sunt aut inventa sunt? Quod autem ratione actum est, id officium appellamus. Quis est enim aut quotus quisque, cui, mora cum adpropinquet, non refugiat timido sanguen átque exalbescát metu? Dat enim intervalla et relaxat. Est autem eius generis actio quoque quaedam, et quidem talis, ut ratio postulet agere aliquid et facere eorum. Sed quia studebat laudi et dignitati, multum in virtute processerat. Restatis igitur vos; Diodorus, eius auditor, adiungit ad honestatem vacuitatem doloris.


I do not understand how that relates to the <Text> element in the XML.

I mean, the texts are not parsed correctly when there is a lot of text.
Right now, I'm still getting errors like:

Error parsing xml with XmlSimple {:source=>"message", :value=>"</NewData>", :exception=>#<REXML::ParseException: Missing end tag for '' (got "NewData")
Line: 1

That suggests your input has broken up a single XML object into multiple events.
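
The failure is easy to reproduce outside Logstash. The sketch below uses REXML (the parser named in the exception) on a tiny stand-in document, assuming the event contains only the last line of the file. The `parse_error` helper is hypothetical, and the exact exception message may differ between REXML versions.

```ruby
require "rexml/document"

# Tiny stand-in for the real file, for illustration only.
whole_file = "<NewData>\n  <Data ID=\"1234\"/>\n</NewData>"

# Returns nil if the string parses as XML, or the exception class name if not.
def parse_error(xml)
  REXML::Document.new(xml)
  nil
rescue REXML::ParseException => e
  e.class.name
end

# The whole document as one string parses cleanly.
whole_result = parse_error(whole_file)

# Each line on its own is what a line-oriented input would hand to the filter.
# The last line, "</NewData>", is a closing tag with nothing open.
last_line_result = parse_error(whole_file.lines.last)
```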

Hmm. Any idea why it is being broken up?

You have said nothing about your inputs so I could not possibly say.

OK, sorry about that. My input comes from Google Cloud Storage. The files have a structure similar to the one above, including the Text value that is part of each <Document>.

Here's what it looks like:

input {
    google_cloud_storage {
        interval => 60
        bucket_id => "somebucket-id"
        json_key_file => "/etc/logstash/conf.d/serviceaccount.json"
        file_matches => ".*\.xml"
        type => "xml"
    }
}

The google_cloud_storage input appears to consume "files" a line at a time. If an XML object is split across multiple lines I would expect that to fail.

What's the best approach so that it does not get split? Should I use a normal input instead?

Not sure. I have never used Google cloud storage.

I mean, should I store the .xml files on a server instead of in a GCS bucket?

input {
    file {
        path => "/etc/logstash/source/*.xml"
        start_position => "beginning"
        codec => multiline { pattern => "</NewData>" negate => true what => "previous" }
        sincedb_path => "/dev/null"
    }
}

Is this correct?
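
One thing worth checking in that config is the `what` setting. As I understand the multiline codec documentation, with `negate => true` and `what => "previous"` every line that does not match the pattern is appended to the previous line, so a line that does match (here `</NewData>`) starts a new event instead of closing the current one. With `what => "next"`, non-matching lines are attached to the line that follows, which makes the closing tag the last line of the event. The sketch below is a simplified plain Ruby model of that grouping rule, not the codec's actual code. The `group_lines` helper and the sample lines are hypothetical.

```ruby
# Simplified model of multiline grouping with negate => true.
# Not the codec's implementation; group_lines is a hypothetical helper.
def group_lines(lines, pattern, what)
  events = []
  buffer = []
  lines.each do |line|
    matches = line.match?(pattern)
    if what == "previous"
      # A matching line does not belong to the previous one: it starts a new event.
      if matches && !buffer.empty?
        events << buffer.join("\n")
        buffer = []
      end
      buffer << line
    else
      # what == "next": non-matching lines wait for the next line,
      # and a matching line ends the event.
      buffer << line
      if matches
        events << buffer.join("\n")
        buffer = []
      end
    end
  end
  # The model simply emits whatever is left at the end of the input.
  events << buffer.join("\n") unless buffer.empty?
  events
end

lines = ["<NewData>", "  <Data ID=\"1\"/>", "</NewData>",
         "<NewData>", "  <Data ID=\"2\"/>", "</NewData>"]

with_previous = group_lines(lines, %r{</NewData>}, "previous")
with_next = group_lines(lines, %r{</NewData>}, "next")
```

In this model, `with_next` holds two complete documents, while `with_previous` holds three fragments: the first is missing its closing tag and the last is a lone `</NewData>`.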