XML to Elasticsearch via Logstash

What do you want to do, exactly? Please give an example of the event you currently have and what you'd like to have instead.

Copy and paste from my post above:

My problem right now is that I have too many fields, and too much data, to sift through manually to determine which ones sometimes contain arrays. Is there a way to do this through the ruby filter, perhaps? Would it be possible to write code which reads each field and, if it contains an array, flattens/splits it somehow to ensure Kibana can handle it properly?

As I'm using the first option you mentioned (where I set store_xml to true), I don't need to define the fields myself in the config file. Given that, I'm not sure how to grab the fields, check whether they contain an array, and then convert them to a more Kibana-friendly type.
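For instance, I imagine something along these lines (an untested sketch; it assumes the parsed document sits under a top-level "parsed" field and only looks one level deep):

    ruby {
        code => '
            parsed = event.get("parsed")
            if parsed.is_a?(Hash)
                parsed.each do |key, value|
                    # hypothetical: collapse array values into one comma-separated string
                    parsed[key] = value.join(",") if value.is_a?(Array)
                end
                event.set("parsed", parsed)
            end
        '
    }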

Yes, I read that paragraph but I was hoping for a concrete example expressed as JSON so there aren't any misunderstandings and so that I wouldn't have to make up test data myself.

To be honest, I'm not sure how I want it to look so I'll just give you an example from my current output:

{
    "parsed" => {
        "FIELD1" => {
            "DOC_TYPE" => "pdf",
            "NUMBER" => "123456789"
        },
        "UPDATE" => {
            "VALUE" => [
                [0] "1990-01-01",
                [1] "1990-01-02",
                [2] "1990-01-03"
            ]
        }
    }
}

Given something like this, what's the best way to display this data in Kibana (mainly looking at the UPDATE -> VALUE section, which contains an array)?
Would you say the best solution would be to have three separate value fields? I'm open to ideas here.

It depends on how you want to use the data. Do you want to aggregate on the date field to e.g. track the most common dates, or just search for documents from 1990-01-01 (or whatever the date field means)? Or do you just want to treat them (are there always three?) as separate entities since they have completely different meanings?

I'd like to be able to search for the date (or any other value with a similar format) and locate it in my database. :)

Then I don't think there's any point in changing what your events look like.

Oh, that's fair enough then.
I do have one more question, though, which may be easier to answer. When data in a date format is pushed to Elasticsearch, the following error occurs:

Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"somelist", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x6899c686>], :response=>{"index"=>{"_index"=>"somelist", "_type"=>"doc", "_id"=>"YSKF0WEBFHzExGa-KD9N", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [parsed.LISTED_ON]", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"Invalid format: \"2010-10-01-03:00\" is malformed at \"-03:00\""}}}}}

I'm guessing the issue is that the value is being read in as a date (as it's not in quote marks) but the '-' between the date and the time is throwing the whole thing off? How should I go about fixing this?

Thanks!

The mapping of the field doesn't match the data you're trying to push to it, so either adjust the mapping or change what the field contains.

I tried the following:

    date {
        match => ["LISTED_ON", "yyyy-MM-dd'-'HH:mm"]
        target => "LISTED_ON"
    }

but it isn't working. Just a note: not all "LISTED_ON" values have the time attached as in the error above ("2010-10-01-03:00"); some just look like this:
"LISTED_ON" => "2011-01-01"
Considering this, how would I manipulate the LISTED_ON field to remove the time from values which contain it? In other words, how would I keep the value of this field as always being just the date, "yyyy-MM-dd"?

but it isn't working

What does that mean? Is the date filter not working or is ES still rejecting the events?

Considering this, how would I manipulate the LISTED_ON variable to remove the time on values which contain it? In other words, how would I just keep the value for this field as always being the date "yyyy-MM-dd"?

You could list both "yyyy-MM-dd" and "yyyy-MM-dd-HH:mm" as date patterns. In the former case the hours and minutes would default to 00:00.
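For example (a sketch; patterns are tried in order, so the more specific one goes first):

    date {
        match => ["LISTED_ON", "yyyy-MM-dd-HH:mm", "yyyy-MM-dd"]
        target => "LISTED_ON"
    }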

If you don't want to use the date filter at all and only want the field to contain a yyyy-MM-dd date you can use a mutate filter's gsub option to strip the time from the field.

ES rejecting the events is what I meant.
I tried using the following gsub:
gsub => ["LISTED_ON", "\d{4}-\d{2}-\d{2}"]
and got the following error (it seems to be too long to indent, sorry):

[2018-02-28T21:12:32,817][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2018-02-28T21:12:32,834][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//localhost:9200"]}
[2018-02-28T21:12:35,118][ERROR][logstash.pipeline ] Error registering plugin {:pipeline_id=>"main", :plugin=>"#<LogStash::FilterDelegator:0x7ad631a0 @metric_events_out=org.jruby.proxy.org.logstash.instrument.metrics.counter.LongCounter$Proxy2 - namespace: [stats, pipelines, main, plugins, filters, bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795, events] key: out value:0, @metric_events_in=org.jruby.proxy.org.logstash.instrument.metrics.counter.LongCounter$Proxy2 - namespace: [stats, pipelines, main, plugins, filters, bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795, events] key: in value:0, @logger=#<LogStash::Logging::Logger:0x1e08a9ad @logger=#<Java::OrgApacheLoggingLog4jCore::Logger:0x15527a36>>, @metric_events_time=org.jruby.proxy.org.logstash.instrument.metrics.counter.LongCounter$Proxy2 - namespace: [stats, pipelines, main, plugins, filters, bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795, events] key: duration_in_millis value:0, @id="bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795", @klass=LogStash::Filters::Mutate, @metric_events=#<LogStash::Instrument::NamespacedMetric:0x702bd72d @metric=#<LogStash::Instrument::Metric:0x2bb55d44 @collector=#<LogStash::Instrument::Collector:0x2db001ce @agent=nil, @metric_store=#<LogStash::Instrument::MetricStore:0x4bbb8bc1 @store=#<Concurrent::map:0x00000000000fb8 entries=3 default_proc=nil>, @structured_lookup_mutex=#<Mutex:0x6507bf24>, @fast_lookup=#<Concurrent::map:0x00000000000fbc entries=68 default_proc=nil>>>>, @namespace_name=[:stats, :pipelines, :main, :plugins, :filters, :bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795, :events]>, @filter=<LogStash::Filters::Mutate remove_field=>["@timestamp", "message", "host", "@version", "path", "tags"], gsub=>["LISTED_ON", "\\d{4}-\\d{2}-\\d{2}"], id=>"bec429198c63d5dcc77eea23444ae84a8546f3bf8bbf6b9078785731485a5795", enable_metric=>true, periodic_flush=>false>>", :error=>"translation missing: en.logstash.agent.configuration.invalid_plugin_register", :thread=>"#<Thread:0x494a5289@C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:290 run>"}
[2018-02-28T21:12:35,130][ERROR][logstash.pipeline ] Pipeline aborted due to error {:pipeline_id=>"main", :exception=>#<LogStash::ConfigurationError: translation missing: en.logstash.agent.configuration.invalid_plugin_register>, :backtrace=>["C:/ELK-Stack/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-filter-mutate-3.1.7/lib/logstash/filters/mutate.rb:198:in `block in register'", "org/jruby/RubyArray.java:1778:in `each_slice'", "C:/ELK-Stack/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-filter-mutate-3.1.7/lib/logstash/filters/mutate.rb:196:in `register'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:388:in `register_plugin'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:399:in `block in register_plugins'", "org/jruby/RubyArray.java:1734:in `each'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:399:in `register_plugins'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:801:in `maybe_setup_out_plugins'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:409:in `start_workers'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:333:in `run'", "C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:293:in `block in start'"], :thread=>"#<Thread:0x494a5289@C:/ELK-Stack/logstash/logstash-core/lib/logstash/pipeline.rb:290 run>"}
[2018-02-28T21:12:35,159][ERROR][logstash.agent ] Failed to execute action {:id=>:main, :action_type=>LogStash::ConvergeResult::FailedAction, :message=>"Could not execute action: LogStash::PipelineAction::Create/pipeline_id:main, action_result: false", :backtrace=>nil}

What should I put in the filter instead? Is my regex pattern not right?

gsub takes a three-element array. If you want to remove the time at the end you want something like ["name-of-field", "-\d+:\d+$", ""].

To add to MagnusBaeck's comment, the three-element array for gsub is [ "FieldName", "String/ExpressionToTarget", "ReplacementString/Expression" ].

Okay so I tried your suggestion:

    mutate {
        remove_field => ['@timestamp', 'message', 'host', '@version', 'path', 'tags']
        gsub => ["parsed.LISTED_ON", "-\d+:\d+$", ""]
    }

and this error keeps coming up:

[WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"removinghr", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x5f08e3f2>], :response=>{"index"=>{"_index"=>"removinghr", "_type"=>"doc", "_id"=>"4qIg42EBqZVRQ8uGLGHu", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [parsed.LISTED_ON]", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"Invalid format: \"2015-03-27-04:00\" is malformed at \"-04:00\""}}}}}

The only piece of code above mutate inside 'filter' is this:

    xml {
        store_xml => true
        source => "message"
        force_array => false
        target => "parsed"
    }

...

You're using the wrong syntax to reference the subfield, see https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html#logstash-config-field-references.
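That is, nested fields are written with bracket notation, so assuming the field really is LISTED_ON under parsed, something like:

    mutate {
        gsub => ["[parsed][LISTED_ON]", "-\d+:\d+$", ""]
    }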

Hmmm, I've tried the following options with no luck (same error):

        gsub => ["[INDIVIDUAL][LISTED_ON]", "-\d+:\d+$", ""]
        gsub => ["[INDIVIDUAL][parsed.LISTED_ON]", "-\d+:\d+$", ""]
        gsub => ["[INDIVIDUALS][INDIVIDUAL][parsed.LISTED_ON]", "-\d+:\d+$", ""]
        gsub => ["[CONSOLIDATED_LIST][INDIVIDUALS][INDIVIDUAL][parsed.LISTED_ON]", "-\d+:\d+$", ""]

What does the event look like? Copy/paste from Kibana's JSON tab or use a stdout { codec => rubydebug } output.
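For example:

    output {
        stdout { codec => rubydebug }
    }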

So when the date can't be parsed, the event isn't pushed into Elasticsearch, and the error displayed is always the same (as posted previously). I'm thinking now it might be because parsed.LISTED_ON doesn't exist in the XML, but LISTED_ON does (the 'parsed.' prefix has been added because that's the target name).

I tried to convert the variable to a string with this:

    mutate {
        convert => { "parsed.LISTED_ON" => "string" }
    }

But still got this error:

[2018-03-03T19:25:39,162][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"retry", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x5af6789d>], :response=>{"index"=>{"_index"=>"retry", "_type"=>"doc", "_id"=>"rvmJ6mEBsfYjCDuxjhJ1", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [parsed.LISTED_ON]", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"Invalid format: \"2015-04-07-04:00\" is malformed at \"-04:00\""}}}}}
[2018-03-03T19:25:39,202][FATAL][logstash.runner          ] SIGINT received. Terminating immediately..

My XML schema has the following format (LISTED_ON is one of the many fields under INDIVIDUAL):

<CONSOLIDATED_LIST xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ....>
    <INDIVIDUALS>
        <INDIVIDUAL>
            <LISTED_ON>

Could the cause possibly be this section in my input?

    codec => multiline {
        pattern => "<INDIVIDUAL>"
        negate => "true"
        what => "previous"
        max_lines => 50000
    }

I'm guessing it doesn't like the hyphen separating the date from the time. It looks like it may be possible to use gsub to replace that hyphen with a space, and then the date filter can convert it.
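Something like this, perhaps (an untested sketch; it assumes the field is [parsed][LISTED_ON] and that the time, when present, always has the yyyy-MM-dd-HH:mm shape):

    mutate {
        # hypothetical: turn "2015-03-27-04:00" into "2015-03-27 04:00"
        gsub => ["[parsed][LISTED_ON]", "^(\d{4}-\d{2}-\d{2})-(\d{2}:\d{2})$", "\1 \2"]
    }
    date {
        # the more specific pattern first; plain dates fall through to the second
        match => ["[parsed][LISTED_ON]", "yyyy-MM-dd HH:mm", "yyyy-MM-dd"]
        target => "[parsed][LISTED_ON]"
    }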