Logstash XML Filter Plugin - XML Parsing General Question

Hi all

I need to parse all XML elements at every level of the hierarchy in millions of XML files with Logstash, like the one pasted below, and store them in separate JSON fields in Elasticsearch. One XML file should end up as one JSON document in ES. I do not want to use xpath and list all possible fields, because I cannot know them and they will change over time. The nesting of the elements can also change from file to file. E.g. in the sample below there are five elements, <AV9APIDATA>, <TRADE>, <INSTSPECIFIER>, <ANNOTATIONS> and <NOTE>, where <TRADE> is a child of <AV9APIDATA>, <INSTSPECIFIER> and <ANNOTATIONS> are children of <TRADE>, and <NOTE> is a child of <ANNOTATIONS>. A file can also arrive without <TRADE> and with only <ANNOTATIONS>, or vice versa.

So far I have not found a smart way to parse all elements, whether there are child elements or not, and whether there are 250 fields or only 3.

Any suggestions for a smart filter conf in this case?

Thx a lot and Kind regards!

<?xml version="1.0" encoding="utf-16"?>
<AV9APIDATA xmlns="av9api-platform-com">
  <TRADE EngineID="2" TradeID="0123456" RouteID="012" RouteName="House" Action="Update" DateTime="2024-06-05T08:02:47.624Z" DateTimeNanoSecondsPart="0" Price="11.000" Volume="22" AggressorCompany="CmpName" AggressorCompanyID="0" AggressorTrader="" AggressorTraderID="0" AggressorUser="" AggressorUserID="0" AggressorAction="Sell" AggressorBroker="BrokerName" AggressorBrokerID="3" InitiatorCompany="someOtherCmpName" InitiatorCompanyID="023" InitiatorTrader="Trader Name" InitiatorTraderID="0987" InitiatorUser="User Name" InitiatorUserID="9876" InitiatorAction="Buy" InitiatorBroker="Another Broker Name" InitiatorBrokerID="123" LastUpdate="2024-06-05T08:03:04.626Z" LastUpdateNanoSecondsPart="0" ForeignLastUpdate="2024-06-05T08:03:04.626Z" ManualDeal="false" VoiceDeal="false" InitSleeve="false" AggSleeve="false" PNC="false" ClearingStatus="Refused" ClearingID="0" InitiatorOwnedSpread="false" AggressorOwnedSpread="false" UnderInvestigation="false" ClearingHouse="Name of Clearinghouse" JTT="false" FromBrokenSpread="false" OtcGiveUp="false" ExecutionVenueID="SOME_STRING_VALUE" ForeignContractID="SOME_VALUE|Value|1234|5|6789|0" InitiatorTradingCapacity="SOMETHING" InitiatorDecisionMaker="01235" InitiatorExecutionMaker="9876" InitiatorDerivativeIndicator="false" InitiatorDEA="false" InitiatorLiquidityProvision="false" ProductClassification="Productname" IsMarketData="true" IsOwnData="true" VenueEntity="This is a text here">
    <INSTSPECIFIER InstID="9874563" InstName="Inst Name String" FirstSequenceID="1234569" SeqSpan="Single" FirstSequenceItemID="0123" SecondSequenceItemID="0" FirstSequenceItemName="Thu 06/06/24" SecondSequenceItemName="" TermFormatID="0123987456" ExternalInstID="7893125" />
    <ANNOTATIONS>
      <NOTE Label="CurrencyIsoCode">CHF</NOTE>
      <NOTE Label="ExecutionDT">2024-06-05T08:02:51.733</NOTE>
      <NOTE Label="ExecutionWorkflow">Some-Text</NOTE>
      <NOTE Label="NegotiationStatus">Another-Text</NOTE>
      <NOTE Label="Unit">WM</NOTE>
      <NOTE Label="UnitGUID">aa11b313-abc6-0123-4567-987c1e2f7q0t</NOTE>
      <NOTE Label="UnitID">98</NOTE>
      <NOTE Label="CPTY_Calendar">CH</NOTE>
      <NOTE Label="CPTY_CalendarID">01234f53-abc0-11e2-d98c-1f7123abc9g4</NOTE>
      <NOTE Label="CPTY_CurrencyScale">1.00</NOTE>
      <NOTE Label="CPTY_DealCreationDT">2024-06-05T08:02:47.624</NOTE>
      <NOTE Label="CPTY_Execution">Some note here</NOTE>
      <NOTE Label="CPTY_ExecutionVenue">Another note here</NOTE>
      <NOTE Label="CPTY_NegotiationBroker">This is a text field</NOTE>
      <NOTE Label="CPTY_NegotiationBrokerID">123456</NOTE>
      <NOTE Label="CPTY_SEGMENT_MIC">HELLO</NOTE>
      <NOTE Label="CPTY_UTI">9595989845645645454564QQ4545454545212112552</NOTE>
    </ANNOTATIONS>
  </TRADE>
</AV9APIDATA>

What do you not like about the product of an xml filter?

xml {
    source => "message"
    target => "theXML"
    remove_field => [ "message" ]
    force_array => false
}

which produces

    "theXML" => {
    "TRADE" => {
                  "InitiatorCompanyID" => "023",
                                 "PNC" => "false",
                         "ANNOTATIONS" => {
            "NOTE" => [
                [ 0] {
                      "Label" => "CurrencyIsoCode",
                    "content" => "CHF"
                },
                [ 1] {
                      "Label" => "ExecutionDT",
                    "content" => "2024-06-05T08:02:51.733"
                },
...
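
To illustrate why no xpath list is needed, here is a minimal Python sketch of the kind of generic tree walk that produces a structure like the one above. It is not the code the xml filter runs (that goes through XmlSimple); the `to_dict` helper and the shortened sample are assumptions made for illustration, and details such as text-only elements may differ from what Logstash emits.

```python
import xml.etree.ElementTree as ET

def to_dict(elem):
    """Attributes become keys, text goes under "content", children nest."""
    node = dict(elem.attrib)
    for child in elem:
        name = child.tag.split("}")[-1]       # drop the "{namespace}" prefix
        value = to_dict(child)
        if name not in node:
            node[name] = value                # first occurrence: plain value
        elif isinstance(node[name], list):
            node[name].append(value)          # third and later occurrences
        else:
            node[name] = [node[name], value]  # second occurrence: make a list
    text = (elem.text or "").strip()
    if text:
        node["content"] = text
    return node

# Shortened, hypothetical version of the TRADE sample from the question.
SAMPLE = """<AV9APIDATA xmlns="av9api-platform-com">
  <TRADE TradeID="0123456" PNC="false">
    <INSTSPECIFIER InstID="9874563" />
    <ANNOTATIONS>
      <NOTE Label="CurrencyIsoCode">CHF</NOTE>
      <NOTE Label="Unit">WM</NOTE>
    </ANNOTATIONS>
  </TRADE>
</AV9APIDATA>"""

the_xml = to_dict(ET.fromstring(SAMPLE))
print(the_xml["TRADE"]["ANNOTATIONS"]["NOTE"])
```

Because the walk never names a specific element or attribute, a file with 3 fields and a file with 250 fields go through the same code path.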

Hi Badger. Nothing! I like it very much, but with this one XML file as input to Logstash and the xml filter conf you proposed ...

<?xml version="1.0" encoding="utf-16"?>
<AV9APIDATA xmlns="av9api-platform-com">
  <ORDER EngineID="2" PersistentOrderID="6316763-2264455855" OrderID="6316763-226545545" OldEngineID="0" OldOrderID="0" Action="Insert" DateTime="2024-06-04T14:27:33.235291Z" DateTimeNanoSecondsPart="0" Price="8700" Volume="500" HiddenVolume="0" PriceDelta="0.000" Side="Bid" Status="Firm" Company="CMP NAME" CompanyID="123" Broker="Some Broker" BrokerID="50" RouteID="123" User="User Name" UserID="321456" Trader="Trader Name" TraderID="123456" OrderType="GoodForDay" AllOrNone="false" CounterPartyOk="true" ImpliedType="None" AccountID="132456789" AccountName="1325|SOME-STUFF" IsTradable="No" OrderDealt="false" Execution="" TradingCapacity="KTMNO" ExecutionMaker="3" DerivativeIndicator="false" DEA="true" DEAClientID="789456" LiquidityProvision="false" ProductClassification="RM - CERTIDF" ManualOrderIndicator="true" IsMarketData="false" IsOwnData="true">
    <INSTSPECIFIER InstID="132456789" InstName="Some Inst" FirstSequenceID="123456789" SeqSpan="Text" FirstSequenceItemID="11" SecondSequenceItemID="858" FirstSequenceItemName="QWERTZ 287" SecondSequenceItemName="Gugux" TermFormatID="789465123" ExternalInstID="132456789" />
  </ORDER>
</AV9APIDATA>

... I get six resulting JSON documents in ES instead of one.

Expanded, the third of those documents shows that not a single field was parsed, and it is tagged "_xmlparsefailure". I wonder why.

I see error messages like "Error parsing xml with XmlSimple {:source=>"message", :value=>" <ANNOTATIONS>\n", :exception=>#<REXML::ParseException: No close tag for /ANNOTATIONS" for other XML files (I can't find the log for the particular one above at the moment). So one cause seems to be XML files that are not well-formed. But the XML file I posted here seems to be okay, no? So I wonder why it doesn't get parsed.

KR d.

Your new example XML has six lines. A file input like

file { path => "/home/user/foo.txt" sincedb_path => "/dev/null" start_position => beginning }

will consume that as six separate events. For the second one:

<AV9APIDATA xmlns="av9api-platform-com">

the xml filter will complain ":exception=>#<REXML::ParseException: No close tag for /AV9APIDATA". That's because the closing /AV9APIDATA tag is in the sixth event, not the second.
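
The effect can be reproduced outside Logstash. The following Python sketch (an illustration only; it uses Python's ElementTree rather than the REXML parser Logstash uses, and a shortened, hypothetical document without the XML declaration) tries to parse each line by itself and then the document as a whole.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the ORDER file, one element per line.
DOC = """<AV9APIDATA xmlns="av9api-platform-com">
  <ORDER EngineID="2">
    <INSTSPECIFIER InstID="1" />
  </ORDER>
</AV9APIDATA>"""

def parses(text):
    """Return True if text is a well-formed XML document by itself."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One result per line, as if each line were its own event.
per_line = [parses(line) for line in DOC.splitlines()]
# The result for the complete document as a single event.
whole = parses(DOC)

print(per_line, whole)
```

Only the self-closing INSTSPECIFIER line is well-formed in isolation, which matches the mix of parsed and failed documents described above; every other line is missing its opening or closing counterpart.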

You need to use a multiline codec to consume the entire XML document as a single event. For example, if you need to consume the entire file as one event you could use

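file {
    path => "/home/user/foo.txt"
    sincedb_path => "/dev/null"
    start_position => beginning
    codec => multiline {
        # A pattern that never matches any line; with negate => true every
        # line is therefore appended to the previous one.
        pattern => "^Spalanzani"
        negate => true
        what => previous
        # Flush the accumulated event after 2 seconds without new lines.
        auto_flush_interval => 2
    }
}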

If you do that then the xml filter will parse it just fine.

Note, if you have two XML documents in a file, for example

<?xml version="1.0" encoding="utf-16"?>
<AV9APIDATA xmlns="av9api-platform-com"> <ORDER EngineID="2"> </ORDER>
</AV9APIDATA>
<?xml version="1.0" encoding="utf-16"?>
<AV9APIDATA xmlns="av9api-platform-com"> <ORDER EngineID="3"> </ORDER>
</AV9APIDATA>

then you will get a different exception: attempted adding second root element to document.

In that case, use a different pattern to consume documents

codec => multiline { 
    pattern => "^</" 
    negate => true 
    what => next  # Note previous changed to next
    auto_flush_interval => 2 
}

That will work provided that your XML is pretty-printed with indentation. If you have nested elements that are left aligned then it will break and you may have to resort to something like

codec => multiline { 
    pattern => "^</AV9APIDATA" 
    negate => true 
    what => next  # Note previous changed to next
    auto_flush_interval => 2 
}

which provides very little flexibility.
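
To make the grouping rules concrete, here is a small Python simulation of how pattern, negate and what interact. It is a sketch of the codec's behaviour as described above, not the codec itself; the `multiline` function and the shortened two-document input are assumptions made for illustration.

```python
import re

def multiline(lines, pattern, negate, what):
    """Group lines into events, mimicking pattern/negate/what."""
    regex = re.compile(pattern)
    events, buffer = [], []
    for line in lines:
        # With negate => true, a line that does NOT match joins its neighbour.
        joins = bool(regex.search(line)) != negate
        if what == "previous":
            if not joins and buffer:
                events.append("\n".join(buffer))
                buffer = []
            buffer.append(line)
        else:  # what == "next"
            buffer.append(line)
            if not joins:
                events.append("\n".join(buffer))
                buffer = []
    if buffer:  # stands in for auto_flush_interval
        events.append("\n".join(buffer))
    return events

TWO_DOCS = """<?xml version="1.0" encoding="utf-16"?>
<AV9APIDATA xmlns="av9api-platform-com">
  <ORDER EngineID="2">
  </ORDER>
</AV9APIDATA>
<?xml version="1.0" encoding="utf-16"?>
<AV9APIDATA xmlns="av9api-platform-com">
  <ORDER EngineID="3">
  </ORDER>
</AV9APIDATA>""".splitlines()

# Same content with the indentation stripped (nested tags left-aligned).
LEFT_ALIGNED = [line.strip() for line in TWO_DOCS]

whole_file = multiline(TWO_DOCS, r"^Spalanzani", True, "previous")
per_doc = multiline(TWO_DOCS, r"^</", True, "next")
broken = multiline(LEFT_ALIGNED, r"^</", True, "next")
strict = multiline(LEFT_ALIGNED, r"^</AV9APIDATA", True, "next")

print(len(whole_file), len(per_doc), len(broken), len(strict))
```

With indented input the "^</" pattern yields one event per document. Once the nested closing tags are left-aligned, each of them ends an event early, and only the pattern that names the root element still produces whole documents.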


This is the way, Badger! Thank you so much! I'll share my working config, as it sources the files from S3. That might help others:

input {
  s3 {
    bucket => "bucket-name"
    region => "eu-central-1"
    sincedb_path => "/etc/logstash/sincedb/sincedb"
    include_object_properties => true
    codec => multiline {
      pattern => "^<\?xml"
      negate => true
      what => "previous"
      auto_flush_interval => 1
    }
  }
}

filter {
  xml {
    source => "message"
    target => "theXML"
    store_xml => true
    remove_field => [ "message" ]
  }

  mutate {
    add_field => { "filename" => "%{[@metadata][s3][key]}" }
  }
}

output {
  elasticsearch {
    index => "index-name"
    cloud_id => "--:--"
    cloud_auth => "--:--"
    ssl_enabled => true
  }
}

Hi @Badger. In the meantime I have started receiving XML files like the one below. When I parse it with the current configuration

      pattern => "^<\?xml"
      negate => true
      what => "previous"
      auto_flush_interval => 2

I get multiple events in one document, and the fields (e.g. the price or the action) are saved as arrays in that one document. Would changing "what => previous" to "what => next" parse all orders in the file below as separate events, while still working for the first and second XML samples posted earlier in this thread?

This is the new XML file:

<AV9APIDATA  xmlns="av9api-platform-com">
  <ORDER EngineID="2" PersistentOrderID="Ident-02312345689,01010202,247]|34.555|A|O" OrderID="Ident-02312345689,01010202,247]|1235|A|O" OldEngineID="2" OldOrderID="Ident-02312345689,01010202,247]|34.555|A|O" Action="Update" DateTime="2024-06-18T10:34:46.8128529Z" DateTimeNanoSecondsPart="99" Price="1.555" Volume="1" HiddenVolume="0" PriceDelta="0.000" Side="Ask" Status="Firm" Company="" CompanyID="0" Broker="XYZ" BrokerID="0123" RouteID="341" OldBrokerID="0123" User="" UserID="0" Trader="" TraderID="0" OrderType="GoodTillCancelled" AllOrNone="false" CounterPartyOk="true" ImpliedType="None" IsTradable="No" OrderDealt="false" Execution="" IsMarketData="true" IsOwnData="false">
    <INSTSPECIFIER InstID="12345689" InstName="Some-Instrument-Name" FirstSequenceID="01010202" SeqSpan="Single" FirstSequenceItemID="247" SecondSequenceItemID="0" FirstSequenceItemName="Jul-24" SecondSequenceItemName="" TermFormatID="2930457764" ExternalInstID="12345689" />
  </ORDER>
  <ORDER EngineID="2" PersistentOrderID="Ident-02312345689,01010202,250]|37.05|A|O" OrderID="Ident-02312345689,01010202,250]|4569|A|O" OldEngineID="0" OldOrderID="0" Action="Insert" DateTime="2024-06-18T10:34:46.9628963Z" DateTimeNanoSecondsPart="94" Price="2.35" Volume="70" HiddenVolume="0" PriceDelta="0.000" Side="Ask" Status="Firm" Company="" CompanyID="0" Broker="XYZ" BrokerID="0123" RouteID="341" User="" UserID="0" Trader="" TraderID="0" OrderType="GoodTillCancelled" AllOrNone="false" CounterPartyOk="true" ImpliedType="None" IsTradable="No" OrderDealt="false" Execution="" IsMarketData="true" IsOwnData="false">
    <INSTSPECIFIER InstID="12345689" InstName="Some-Instrument-Name" FirstSequenceID="01010202" SeqSpan="Single" FirstSequenceItemID="250" SecondSequenceItemID="0" FirstSequenceItemName="Oct-24" SecondSequenceItemName="" TermFormatID="2930457764" ExternalInstID="12345689" />
  </ORDER>
  <ORDER EngineID="2" PersistentOrderID="Ident-02312345689,01010202,250]|37.045|A|O" OrderID="Ident-02312345689,01010202,250]|256|A|O" OldEngineID="0" OldOrderID="0" Action="Insert" DateTime="2024-06-18T10:34:46.9630666Z" DateTimeNanoSecondsPart="93" Price="2.36" Volume="70" HiddenVolume="0" PriceDelta="0.000" Side="Ask" Status="Firm" Company="" CompanyID="0" Broker="XYZ" BrokerID="0123" RouteID="341" User="" UserID="0" Trader="" TraderID="0" OrderType="GoodTillCancelled" AllOrNone="false" CounterPartyOk="true" ImpliedType="None" IsTradable="No" OrderDealt="false" Execution="" IsMarketData="true" IsOwnData="false">
    <INSTSPECIFIER InstID="12345689" InstName="Some-Instrument-Name" FirstSequenceID="01010202" SeqSpan="Single" FirstSequenceItemID="250" SecondSequenceItemID="0" FirstSequenceItemName="Oct-24" SecondSequenceItemName="" TermFormatID="2930457764" ExternalInstID="12345689" />
  </ORDER>
  <ORDER EngineID="2" PersistentOrderID="Ident-02312345689,01010202,250]|37.095|A|O" OrderID="Ident-02312345689,01010202,250]|123|A|O" OldEngineID="2" OldOrderID="Ident-02312345689,01010202,250]|37.095|A|O" Action="Update" DateTime="2024-06-18T10:34:46.9630666Z" DateTimeNanoSecondsPart="93" Price="2.96" Volume="70" HiddenVolume="0" PriceDelta="0.000" Side="Ask" Status="Firm" Company="" CompanyID="0" Broker="XYZ" BrokerID="0123" RouteID="341" OldBrokerID="0123" User="" UserID="0" Trader="" TraderID="0" OrderType="GoodTillCancelled" AllOrNone="false" CounterPartyOk="true" ImpliedType="None" IsTradable="No" OrderDealt="false" Execution="" IsMarketData="true" IsOwnData="false">
    <INSTSPECIFIER InstID="12345689" InstName="Some-Instrument-Name" FirstSequenceID="01010202" SeqSpan="Single" FirstSequenceItemID="250" SecondSequenceItemID="0" FirstSequenceItemName="Oct-24" SecondSequenceItemName="" TermFormatID="2930457764" ExternalInstID="12345689" />
  </ORDER>
  <ORDER EngineID="2" PersistentOrderID="Ident-02312345689,01010202,250]|37.1|A|O" OrderID="Ident-02312345689,01010202,250]|123|A|O" OldEngineID="0" OldOrderID="0" Action="Remove" DateTime="2024-06-18T10:34:46.9628963Z" DateTimeNanoSecondsPart="94" Price="2.39" Volume="70" HiddenVolume="0" PriceDelta="0.000" Side="Ask" Status="Firm" Company="" CompanyID="0" Broker="XYZ" BrokerID="0123" RouteID="341" User="" UserID="0" Trader="" TraderID="0" OrderType="GoodTillCancelled" AllOrNone="false" CounterPartyOk="true" ImpliedType="None" IsTradable="No" OrderDealt="false" Execution="" IsMarketData="true" IsOwnData="false">
    <INSTSPECIFIER InstID="12345689" InstName="Some-Instrument-Name" FirstSequenceID="01010202" SeqSpan="Single" FirstSequenceItemID="250" SecondSequenceItemID="0" FirstSequenceItemName="Oct-24" SecondSequenceItemName="" TermFormatID="2930457764" ExternalInstID="12345689" />
  </ORDER>
</AV9APIDATA >

This creates one single document in ES with an array for each field.

How can I parse all three XML samples I have posted so far with one single config?
In general I am most interested in the data within ORDER and TRADE. Thx in advance for help - very much appreciated!

I would suggest keeping the same multiline codec but using a split filter to create five events from the ORDER array.

 split { field => "[theXML][ORDER]" }
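
As a rough picture of what the split does to an event, here is a Python sketch that clones an event once per element of the ORDER array. The `split_event` helper, the field values and the `filename` field are hypothetical stand-ins chosen for illustration; this is not the filter's implementation.

```python
import copy

def split_event(event, outer, inner):
    """Yield one copy of the event per element of event[outer][inner]."""
    for item in event[outer][inner]:
        clone = copy.deepcopy(event)   # every other field is kept as-is
        clone[outer][inner] = item     # the array is replaced by one element
        yield clone

# Hypothetical event as it might look after the xml filter has run.
event = {
    "filename": "orders.xml",
    "theXML": {
        "ORDER": [
            {"Action": "Update", "Price": "1.555"},
            {"Action": "Insert", "Price": "2.35"},
            {"Action": "Insert", "Price": "2.36"},
        ]
    },
}

events = list(split_event(event, "theXML", "ORDER"))
for e in events:
    print(e["filename"], e["theXML"]["ORDER"])
```

Each resulting event carries a single ORDER object instead of the array, so price and action end up as scalar fields in their own documents. For files that contain only TRADE, it may be worth guarding the split with a conditional such as if [theXML][ORDER], so the filter is not applied to a field that does not exist; treat that as a suggestion to verify against your own data.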