XML to Elasticsearch via Logstash

What do you want to do, exactly? Please give an example of the event you currently have and what you'd like to have instead.

Copy and paste from my post above:

My problem right now is that I have too many fields, and too much data, to sift through manually to determine which ones sometimes contain arrays. Is there a way to do this through the ruby filter, perhaps? Would it be possible to write code which reads each field and, if it contains an array, flattens/splits it somehow to ensure Kibana can handle it properly?

As I'm using the first option you mentioned (where I set store_xml to true), I don't need to define the fields myself in the config file. Given that, I'm not sure how to grab the fields, check whether they contain an array, and then convert them to a more Kibana-friendly type.
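For instance, I imagine something along these lines (an untested sketch; it assumes the parsed document sits under a top-level "parsed" field and only looks one level deep):

    ruby {
        code => '
            parsed = event.get("parsed")
            if parsed.is_a?(Hash)
                parsed.each do |key, value|
                    # hypothetical: collapse array values into one comma-separated string
                    parsed[key] = value.join(",") if value.is_a?(Array)
                end
                event.set("parsed", parsed)
            end
        '
    }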

Yes, I read that paragraph but I was hoping for a concrete example expressed as JSON so there aren't any misunderstandings and so that I wouldn't have to make up test data myself.

To be honest, I'm not sure how I want it to look so I'll just give you an example from my current output:

{
    "parsed" => {
        "FIELD1" => {
            "DOC_TYPE" => "pdf",
            "NUMBER" => "123456789"
        },
        "UPDATE" => {
            "VALUE" => [
                [0] "1990-01-01",
                [1] "1990-01-02",
                [2] "1990-01-03"
            ]
        }
    }
}

Given something like this, what's the best way to display this data in Kibana (mainly looking at the UPDATE -> VALUE section, which contains an array)?
Would you say the best solution would be to have three separate value fields? I'm open to ideas here.

It depends on how you want to use the data. Do you want to aggregate on the date field to e.g. track the most common dates, or just search for documents from 1990-01-01 (or whatever the date field means)? Or do you just want to treat them (are there always three?) as separate entities since they have completely different meanings?

I'd like to be able to search for the date (or any other value with a similar format) and locate it in my database. :)

Then I don't think there's any point in changing what your events look like.

Oh, that's fair enough then.
I do have one more question, though, which may be easier to answer. When data in a date format is pushed to Elasticsearch, the following error occurs:

Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"somelist", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x6899c686>], :response=>{"index"=>{"_index"=>"somelist", "_type"=>"doc", "_id"=>"YSKF0WEBFHzExGa-KD9N", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [parsed.LISTED_ON]", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"Invalid format: \"2010-10-01-03:00\" is malformed at \"-03:00\""}}}}}

I'm guessing the issue is that the value is being read in as a date (as it's not in quote marks) but the '-' between the date and the time is throwing the whole thing off? How should I go about fixing this?

Thanks!

The mapping of the field doesn't match the data you're trying to push to it, so either adjust the mapping or change what the field contains.

I tried the following:

    date {
        match => ["LISTED_ON", "yyyy-MM-dd'-'HH:mm"]
        target => "LISTED_ON"
    }

but it isn't working. Just a note: not all "LISTED_ON" values have the time attached as in the error above ("2010-10-01-03:00"); some just look like this:
"LISTED_ON" => "2011-01-01"
Considering this, how would I manipulate the LISTED_ON field to remove the time from values which contain it? In other words, how would I keep the value of this field as always being just the date, "yyyy-MM-dd"?

but it isn't working

What does that mean? Is the date filter not working or is ES still rejecting the events?

Considering this, how would I manipulate the LISTED_ON variable to remove the time on values which contain it? In other words, how would I just keep the value for this field as always being the date "yyyy-MM-dd"?

You could list both "yyyy-MM-dd" and "yyyy-MM-dd-HH:mm" as date patterns. In the former case the hours and minutes would default to 00:00.
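For example (a sketch; patterns are tried in order, so the more specific one goes first):

    date {
        match => ["LISTED_ON", "yyyy-MM-dd-HH:mm", "yyyy-MM-dd"]
        target => "LISTED_ON"
    }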

If you don't want to use the date filter at all and only want the field to contain a yyyy-MM-dd date you can use a mutate filter's gsub option to strip the time from the field.

ES rejecting the events is what I meant.
I tried using the following gsub:
gsub => ["LISTED_ON", "\d{4}-\d{2}-\d{2}"]
and got the following error (it seems to be too long to indent, sorry):

[2018-02-28T21:12:32,817][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2018-02-28T21:12:32,834][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//localhost:9200"]}
[2018-02-28T21:12:35,118][ERROR][logstash.pipeline ] Error registering plugin {:pipeline_id=>"main", :plugin=>"#<LogStash::FilterDelegator:0x7ad631a0 @metric_events_out=org.jruby.proxy.org.logstash.instrument.metrics.counter.LongCounter$Proxy2 - namespace: [stats, pipelines, main, plugins, filters, bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795, events] key: out value:0, @metric_events_in=org.jruby.proxy.org.logstash.instrument.metrics.counter.LongCounter$Proxy2 - namespace: [stats, pipelines, main, plugins, filters, bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795, events] key: in value:0, @logger=#<LogStash::Logging::Logger:0x1e08a9ad @logger=#<Java::OrgApacheLoggingLog4jCore::Logger:0x15527a36>>, @metric_events_time=org.jruby.proxy.org.logstash.instrument.metrics.counter.LongCounter$Proxy2 - namespace: [stats, pipelines, main, plugins, filters, bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795, events] key: duration_in_millis value:0, @id="bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795", @klass=LogStash::Filters::Mutate, @metric_events=#<LogStash::Instrument::NamespacedMetric:0x702bd72d @metric=#<LogStash::Instrument::Metric:0x2bb55d44 @collector=#<LogStash::Instrument::Collector:0x2db001ce @agent=nil, @metric_store=#<LogStash::Instrument::MetricStore:0x4bbb8bc1 @store=#<Concurrent::map:0x00000000000fb8 entries=3 default_proc=nil>, @structured_lookup_mutex=#<Mutex:0x6507bf24>, @fast_lookup=#<Concurrent::map:0x00000000000fbc entries=68 default_proc=nil>>>>, @namespace_name=[:stats, :pipelines, :main, :plugins, :filters, :bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795, :events]>, @filter=<LogStash::Filters::Mutate remove_field=>["@timestamp", "message", "host", "@version", "path", "tags"], gsub=>["LISTED_ON", "\\d{4}-\\d{2}-\\d{2}"], id=>"bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795", enable_metric=>true, periodic_flush=>false>>", :error=>"translation missing: en.logstash.agent.configuration.invalid_plugin_register", :thread=>"#<Thread:0x494a5289@C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:290 run>"}
[2018-02-28T21:12:35,130][ERROR][logstash.pipeline ] Pipeline aborted due to error {:pipeline_id=>"main", :exception=>#<LogStash::ConfigurationError: translation missing: en.logstash.agent.configuration.invalid_plugin_register>, :backtrace=>["C:/ELK-Stack/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-filter-mutate-3.1.7/lib/logstash/filters/mutate.rb:198:in `block in register'", "org/jruby/RubyArray.java:1778:in `each_slice'", "C:/ELK-Stack/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-filter-mutate-3.1.7/lib/logstash/filters/mutate.rb:196:in `register'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:388:in `register_plugin'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:399:in `block in register_plugins'", "org/jruby/RubyArray.java:1734:in `each'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:399:in `register_plugins'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:801:in `maybe_setup_out_plugins'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:409:in `start_workers'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:333:in `run'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:293:in `block in start'"], :thread=>"#<Thread:0x494a5289@C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:290 run>"}
[2018-02-28T21:12:35,159][ERROR][logstash.agent ] Failed to execute action {:id=>:main, :action_type=>LogStash::ConvergeResult::FailedAction, :message=>"Could not execute action: LogStash::PipelineAction::Create/pipeline_id:main, action_result: false", :backtrace=>nil}

What should I put in the filter instead? Is my regex pattern not right?

gsub takes a three-element array. If you want to remove the time at the end you want something like ["name-of-field", "-\d+:\d+$", ""].

To add to MagnusBaeck's comment, the three-element array for gsub is [ "FieldName", "String/ExpressionToTarget", "ReplacementString/Expression" ].

Okay so I tried your suggestion:

    mutate {
        remove_field => ['@timestamp', 'message', 'host', '@version', 'path', 'tags']
        gsub => ["parsed.LISTED_ON", "-\d+:\d+$", ""]
    }

and this error keeps coming up:

[WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"removinghr", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x5f08e3f2>], :response=>{"index"=>{"_index"=>"removinghr", "_type"=>"doc", "_id"=>"4qIg42EBqZVRQ8uGLGHu", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [parsed.LISTED_ON]", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"Invalid format: \"2015-03-27-04:00\" is malformed at \"-04:00\""}}}}}

The only piece of code above mutate inside 'filter' is this:

    xml {
        store_xml => true
        source => "message"
        force_array => false
        target => "parsed"
    }

...

You're using the wrong syntax to reference the subfield, see https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html#logstash-config-field-references.
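That is, nested fields are written with bracket notation, so assuming the field really is LISTED_ON under parsed, something like:

    mutate {
        gsub => ["[parsed][LISTED_ON]", "-\d+:\d+$", ""]
    }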

Hmmm, I've tried the following options with no luck (same error):

        gsub => ["[INDIVIDUAL][LISTED_ON]", "-\d+:\d+$", ""]
        gsub => ["[INDIVIDUAL][parsed.LISTED_ON]", "-\d+:\d+$", ""]
        gsub => ["[INDIVIDUALS][INDIVIDUAL][parsed.LISTED_ON]", "-\d+:\d+$", ""]
        gsub => ["[CONSOLIDATED_LIST][INDIVIDUALS][INDIVIDUAL][parsed.LISTED_ON]", "-\d+:\d+$", ""]

What does the event look like? Copy/paste from Kibana's JSON tab or use a stdout { codec => rubydebug } output.
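For example:

    output {
        stdout { codec => rubydebug }
    }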

So when the date can't be parsed, the event isn't pushed into Elasticsearch, and the error displayed is always the same (as posted previously). I'm thinking now it might be because parsed.LISTED_ON doesn't exist in the XML, but LISTED_ON does (the 'parsed.' prefix has been added because that's the target name).

I tried to convert the variable to a string with this:

    mutate {
        convert => { "parsed.LISTED_ON" => "string" }
    }

But still got this error:

[2018-03-03T19:25:39,162][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"retry", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x5af6789d>], :response=>{"index"=>{"_index"=>"retry", "_type"=>"doc", "_id"=>"rvmJ6mEBsfYjCDuxjhJ1", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [parsed.LISTED_ON]", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"Invalid format: \"2015-04-07-04:00\" is malformed at \"-04:00\""}}}}}
[2018-03-03T19:25:39,202][FATAL][logstash.runner          ] SIGINT received. Terminating immediately..

My XML schema has the following format (LISTED_ON is one of the many fields under INDIVIDUAL):

<CONSOLIDATED_LIST xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ....>
    <INDIVIDUALS>
        <INDIVIDUAL>
            <LISTED_ON>

Could the cause possibly be this section in my input?

    codec => multiline {
        pattern => "<INDIVIDUAL>"
        negate => "true"
        what => "previous"
        max_lines => 50000
    }

I'm guessing it doesn't like the hyphen separating the date from the time. It looks like it may be possible to use gsub to replace that hyphen with a space, and then the date filter can convert it.
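Something like this, perhaps (an untested sketch; it assumes the field is [parsed][LISTED_ON] and that the time, when present, always has the yyyy-MM-dd-HH:mm shape):

    mutate {
        # hypothetical: turn "2015-03-27-04:00" into "2015-03-27 04:00"
        gsub => ["[parsed][LISTED_ON]", "^(\d{4}-\d{2}-\d{2})-(\d{2}:\d{2})$", "\1 \2"]
    }
    date {
        # the more specific pattern first; plain dates fall through to the second
        match => ["[parsed][LISTED_ON]", "yyyy-MM-dd HH:mm", "yyyy-MM-dd"]
        target => "[parsed][LISTED_ON]"
    }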