Parsing difficulties with filters/plugins

Hello everyone, I've been having a little difficulty with a collection of files I'm trying to get indexed into the ELK stack. With the current conf file everything is being parsed in and the appropriate tags are being added to the data; however, as you'll see, the original document contains multiple fields, and as currently configured each key/value pair produces its own document. I've concluded that I'm making this more difficult than it should be, so it's time to reach out to better minds for assistance.

In the interest of saving space I have removed most of the value fields. Most noteworthy is that the post_content field holds HTML markup, which is the reason for the regex expressions.

   [
	{
		"ID" : 12187,
		"post_author" : 24562,
		"post_date" : "2015-06-15 16:27:59",
		"post_date_gmt" : "2015-06-15 20:27:59",
		"post_content" : "contents of article containing html markup",
		"post_title" : "title of article plain text",
		"post_excerpt" : "",
		"post_status" : "publish",
		"comment_status" : "open",
		"ping_status" : "open",
		"post_password" : "",
		"post_name" : "post name, plain text",
		"to_ping" : "",
		"pinged" : "",
		"post_modified" : "2016-08-02 19:34:58",
		"post_modified_gmt" : "2016-08-02 23:34:58",
		"post_content_filtered" : "",
		"post_parent" : 0,
		"guid" : "https address",
		"menu_order" : 0,
		"post_type" : "msc",
		"post_mime_type" : "",
		"comment_count" : 22
	},
	{
		"ID" : 12544,
		"post_author" : 113708,
		"post_date" : "2015-06-17 16:20:01",
		"post_date_gmt" : "2015-06-17 20:20:01",
		"post_content" : "",
		"post_title" : "Automatas Finitos",
		"post_excerpt" : "",
		"post_status" : "publish",
		"comment_status" : "open",
		"ping_status" : "open",
		"post_password" : "",
		"post_name" : "automatas-finitos",
		"to_ping" : "",
		"pinged" : "",
		"post_modified" : "2016-02-24 19:48:23",
		"post_modified_gmt" : "2016-02-25 00:48:23",
		"post_content_filtered" : "",
		"post_parent" : 0,
		"guid" : " https web addresss",
		"menu_order" : 0,
		"post_type" : "msc",
		"post_mime_type" : "",
		"comment_count" : 0
	}
   ] 

Here is the configuration file. Nothing particularly interesting; I'm assuming I'm either using the wrong plugins or have them configured incorrectly.

input {
  s3 {
    access_key_id => "redacted"
    secret_access_key => "redacted"
    bucket => "bucket_name"
    region => "us-east-1"
    prefix => "/"
    delete => "true"
    enable_metric => "false"
  }
}
filter {
  # strip HTML tags and collapse whitespace in the raw message before splitting
  mutate {
    gsub => [ "message", "(?<=<post_content>[\\\",])|(<(.*?)>)|(\s)", " " ]
  }
  # break the message into key/value pairs on commas and colons
  kv {
    source => "message"
    target => "[dataclass][tags]"
    field_split => ","
    value_split => ":"
    #trim => "([[{\]\}])"
    trimkey => "(\\\")|(\s)"
    remove_field => ["message"]
    include_brackets => "true"
    include_keys => ["ID", "post_author", "post_date", "post_date_gmt", "post_content", "post_title", "post_excerpt", "post_status", "comment_status", "ping_status",
                     "post_password", "post_name", "to_ping", "pinged", "post_modified", "post_modified_gmt", "post_content_filtered", "post_parent", "guid", "menu_order",
                     "post_type", "post_mime_type", "comment_count"]
  }
}
output {
  elasticsearch {
    user => "redacted"
    password => "redacted"
    hosts => "localhost:9200"
    index => "logstash-articles"
    document_type => "articles"
  }
}

Currently it is returning a separate JSON document for each field; the fields themselves are being extracted correctly, though. An example:

{
  "_index": "logstash-articles",
  "_type": "articles",
  "_id": "AVrx64omhUtoxq_cTdpn",
  "_score": null,
  "_source": {
    "@timestamp": "2017-03-21T17:30:29.748Z",
    "dataclass": {
      "tags": {
        "ID": "85946"
      }
    },
    "@version": "1"
  },
  "fields": {
    "@timestamp": [
      1490117429748
    ]
  },
  "sort": [
    1490117429748
  ]
}

Thanks in advance for any advice.

So you are using the kv filter to split up a JSON array. Why not use the json codec and the split filter to convert your JSON objects into individual Logstash events?
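Something along these lines, roughly (a minimal sketch; it assumes each S3 object arrives as one payload, and `posts` is just a placeholder name for the field the array ends up in):

input {
  s3 {
    bucket => "bucket_name"
    region => "us-east-1"
    codec => "json"      # decode the payload as JSON instead of plain text
  }
}
filter {
  # only needed if the array is nested under a field, e.g. { "posts": [ ... ] };
  # a bare top-level array should already come out of the codec as separate events
  split {
    field => "posts"     # placeholder field name
  }
}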

Thanks for the quick reply; I will look into the filters you mentioned. I started off using the json codec for both input and output, but I was getting errors due to the HTML markup in the "post_content" field. I believe the HTML markup was causing the parsing to end prematurely, giving me a lot of documents that were just brackets and HTML tags.

Is the input data valid JSON? Can you check it with another tool? Perhaps there is an issue with the JSON parser in Logstash. Which version are you using?

I just put it into JSONLint and it seems valid according to them. I'll make some changes to the conf file and see if I can reproduce the error logs associated with using the json codec.

Sorry also I"m currently using version 5.2.2 for everything

This is the error returned:

 JSON parse error, original data now in message field {:error=>#<LogStash::Json::ParserError: incompatible json object type=java.lang.String , only hash map or arrays are suppoted>, :data=>"\t\t\"post_parent\" : 0,\n"}

Then it says something about the data going into the message field.
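Looking at it again, the :data value in that error is just one source line from the file, so maybe the codec is only ever being handed one line at a time rather than the whole document. Possibly a multiline codec to stitch the file back together before decoding would help; a rough, untested sketch:

input {
  s3 {
    bucket => "bucket_name"
    region => "us-east-1"
    codec => multiline {
      pattern => "^\["           # a document starts at the opening bracket
      negate => true
      what => "previous"         # every other line joins the previous event
      auto_flush_interval => 5   # flush the buffered event after 5s of quiet
    }
  }
}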

OK, I've been looking into the split plugin like you said. According to the documentation, the expected input shape is:

{ field1: ...,
 results: [
   { result ... },
   { result ... },
   { result ... },
   ...
] }
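So for that shape, presumably the filter would look something like:

filter {
  split {
    field => "results"   # each element of the array becomes its own event
  }
}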

I was still having difficulty with the json codec, so I'm wondering if this would work: basically keep the kv filter, which breaks the data down into a bunch of fields (`dataclass.tags.ID`, `dataclass.tags.post_author`, etc.).

Or would it be better to do it based off the message key?
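If going off the message key, maybe something like this (a rough sketch, assuming the whole array lands in message as a single event; `posts` is just a placeholder target):

filter {
  # parse the raw JSON array out of the message field
  json {
    source => "message"
    target => "posts"
  }
  # fan each array element out into its own Logstash event
  split {
    field => "posts"
  }
  # drop the raw text once it has been parsed
  mutate {
    remove_field => ["message"]
  }
}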
