Issues parsing JSON with deep nesting

The array looks like this:
data.children.data.replies.data.children.data.replies.data.children.data.replies.data.children.data.replies.data.children.....ETC

I have this nested JSON array, and I would like to get all of the data within each
[children][data]
so if [replies] is present in that, there will be another set of [children][data], and so on and so on.

An example data set is
[JSON data example](https://www.reddit.com/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/.json)

I am looking for a way to bring each [data] section under a single array, so that all comments are present as their own documents.

I found this, but I have no idea how to get it working for my data set, since I am a complete noob when it comes to Ruby:
Flattening an Array with Recursion

def recursive_flatten(array, results = [])
  array.each do |element|
    if element.class == Array
      recursive_flatten(element, results)
    else
      results << element
    end
  end
  results
end
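As a quick sanity check, note that this snippet only descends into nested arrays; hashes are treated as leaf elements, which is why it cannot be applied to this reddit JSON directly. A minimal run:

```ruby
# The snippet above, verbatim: flattens arbitrarily nested arrays
def recursive_flatten(array, results = [])
  array.each do |element|
    if element.class == Array
      recursive_flatten(element, results)
    else
      results << element
    end
  end
  results
end

p recursive_flatten([1, [2, [3, 4]], 5])
# => [1, 2, 3, 4, 5]

# Hashes are NOT flattened; they are appended whole:
p recursive_flatten([{ "a" => [1, 2] }])
# => [{"a"=>[1, 2]}]
```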

If you want these five ids

    [0] "ekizeo5",
    [1] "ekizbej",
    [2] "ekize5c",
    [3] "ekizc4b",
    [4] "ekizdfx"

then you could start with

    json { source => "message" target => "[@metadata][json]" remove_field => "message" }
    ruby {
        code => '
            a = []

            def processReplies(p, a)
                if  p["replies"] and
                    p["replies"]["data"] and
                    p["replies"]["data"]["children"] and
                    p["replies"]["data"]["children"].kind_of?(Array)
                        p["replies"]["data"]["children"].each { |v1|
                            processReplies(v1["data"], a)
                            v1["data"].delete("replies")
                            a << v1["data"]
                        }
                end
            end

            theData = event.get("[@metadata][json]")
            theData.each { |v1|
                v2 = v1["data"]["children"]
                v2.each { |v3|
                    processReplies(v3["data"], a)
                }
            }
            event.set("replies", a)
        '
    }
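In case you want to experiment with the recursion outside Logstash, here is the same processReplies as a standalone Ruby sketch, with the event API replaced by a plain hash (the sample hash below is made up for illustration, not real reddit output):

```ruby
# Walk nested replies depth-first, collecting each reply's "data" hash
# into the flat array a; "replies" is deleted after recursing into it.
def processReplies(p, a)
  if  p["replies"] and
      p["replies"]["data"] and
      p["replies"]["data"]["children"].kind_of?(Array)
    p["replies"]["data"]["children"].each { |v1|
      processReplies(v1["data"], a)
      v1["data"].delete("replies")
      a << v1["data"]
    }
  end
end

# Hypothetical comment: one reply ("r1") that itself has one reply ("r2").
# Reddit uses an empty string for "no replies"; the guard above handles
# that because ""["data"] is nil in Ruby.
comment = {
  "id"      => "c1",
  "replies" => { "data" => { "children" => [
    { "data" => {
      "id"      => "r1",
      "replies" => { "data" => { "children" => [
        { "data" => { "id" => "r2", "replies" => "" } }
      ] } }
    } }
  ] } }
}

a = []
processReplies(comment, a)
p a.map { |d| d["id"] }
# => ["r2", "r1"]   (innermost reply first)
```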

Apologies if my ruby coding style (or lack of it) makes your eye-balls bleed :slight_smile:

Thx, but the code you have there doesn't seem to be recursion. My JSON was just an example; there could be 5, or there could be 10 or more levels of nesting in different examples.

Look again.

It's only returning a single doc.

input {
  http_poller {
    urls => {
      test1 => "https://www.reddit.com/r/softwaretesting/comments/bbhytv/.json"
    }
    request_timeout => 60
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    schedule => { cron => "*/10 * * * * UTC"}
    # proxy => {host => "proxy.org", port => 80, scheme => 'http', user => 'username@host', password => 'password'}
    codec => "json"
    tags => ["testing"]
  }
}
filter {
if "testing" in [tags]
  {
    json 
    {
      source => "message"
      target => "[@metadata][json]" 
      remove_field => "message"
    }
    ruby {
        code => '
            a = []

            def processReplies(p, a)
                if  p["replies"] and
                    p["replies"]["data"] and
                    p["replies"]["data"]["children"] and
                    p["replies"]["data"]["children"].kind_of?(Array)
                        p["replies"]["data"]["children"].each { |v1|
                            processReplies(v1["data"], a)
                            v1["data"].delete("replies")
                            a << v1["data"]
                        }
                end
            end

            theData = event.get("[@metadata][json]")
            theData.each { |v1|
                v2 = v1["data"]["children"]
                v2.each { |v3|
                    processReplies(v3["data"], a)
                }
            }
            event.set("replies", a)
        '
    }
  }
}
output {
  if "testing" in [tags]
  { elasticsearch {
    hosts => ["http://server:9200"]
    index => "logstash-testing-reply-%{+YYYY.MM.dd}"
    }
  }
}

I know the JSON comes in as an array with a 0 and a 1 entry, and the comments are only in the 1; I don't need the 0. Could that be why?

I took the JSON you linked to in your first post, put it on a single line, and read it using a file input. What I get back from my code is

   "replies" => [
    [0] {
         [...]
                            "permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizeo5/",
         [...]
                            "body_html" => "&lt;div class=\"md\"&gt;&lt;p&gt;reply to 2nd comment&lt;/p&gt;\n&lt;/div&gt;",
         [...]
    },
    [1] {
         [...]
                            "permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekize5c/",
         [...]
                                 "body" => "&gt;here is a reply to the reply of the 1st comment\n\nanother  here is a reply to the reply of the 1st comment",
         [...]
    },
    [2] {
         [...]
                            "permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizdfx/", 
         [...]
                                 "body" => "another level deeper on the 1st comment",
         [...]
    },
    [3] {
         [...]
                            "permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizc4b/",
         [...]
                                 "body" => "here is a reply to the reply of the 1st comment",
         [...]
    },
    [4] {
         [...]
                            "permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizbej/",
         [...]
                                 "body" => "here is a reply to the 1st comment",
         [...]
    }
],
         [...]

}

Get rid of the json codec on the input. The filter expects the JSON to be serialized in [message].

Alternatively, if you have other reasons for using the codec, get rid of the json filter and change the ruby code. The codec appears to split the array into events, so this piece of code will need to change

        theData.each { |v1|
            v2 = v1["data"]["children"]
            v2.each { |v3|
                processReplies(v3["data"], a)
            }
        }

It's going to be something more like

        theData = event.get("[data][children]")
        if theData.kind_of?(Array)
            theData.each { |v1|
                processReplies(v1["data"], a)
            }
        end
        #event.remove("data")

but that isn't quite right yet. I'll take another look in about 14 hours.


Thx for the help, I really appreciate the time you are spending.

I did remove the json codec, but now I am receiving an error when it tries to index the data. I also ran stdout and it shows a _rubyexception.

[2019-04-10T17:30:01,523][ERROR][logstash.filters.ruby    ] Ruby exception occurred: undefined method `each' for nil:NilClass


[2019-04-10T17:30:01,529][ERROR][logstash.filters.ruby    ] Ruby exception occurred: undefined method `each' for nil:NilClass


[2019-04-10T17:30:01,677][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2ndtest-reply-2019.04.10", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x55abba13>], :response=>{"index"=>{"_index"=>"logstash-2ndtest-reply-2019.04.10", "_type"=>"_doc", "_id"=>"lcmWCWoBpdFPTkX2Ohre", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [data.children.data.edited] of type [boolean] in document with id 'lcmWCWoBpdFPTkX2Ohre'", "caused_by"=>{"type"=>"json_parse_exception", "reason"=>"Current token (VALUE_NUMBER_FLOAT) not of boolean type\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@655dde02; line: 1, column: 175676]"}}}}}

If you want to keep the codec try

ruby {
    code => '
        a = []

        def processReplies(p, a)
            if  p["replies"] and
                p["replies"]["data"] and
                p["replies"]["data"]["children"] and
                p["replies"]["data"]["children"].kind_of?(Array)
                    p["replies"]["data"]["children"].each { |v1|
                        processReplies(v1["data"], a)
                        # Cannot remove it until after iterating over it
                        v1["data"].delete("replies")
                        a << v1["data"]
                    }
            end
        end


        theData = event.get("[data][children]")
        if theData.kind_of?(Array)
            theData.each { |v1|
                processReplies(v1["data"], a)
            }
        end
        #event.remove("data")
        if a != []
            event.set("replies", a)
        end
    '
}

For the sample data that gets you two events, and the second one has an array of 5 replies.

It's got to be an issue with the http_poller and how it is bringing in the JSON. It still is not flattening the message with the comments; it's only bringing in that first object, which is just the post title and other values ("Test JSON output of comments").

input {
  http_poller {
    urls => {
      test1 => "https://www.reddit.com/r/softwaretesting/comments/bbhytv/.json"
    }
    request_timeout => 60
    schedule => { cron => "* * * * * UTC"}
    codec => "json"
  }
}
filter {
  ruby {
      code => '
          a = []

          def processReplies(p, a)
              if  p["replies"] and
                  p["replies"]["data"] and
                  p["replies"]["data"]["children"] and
                  p["replies"]["data"]["children"].kind_of?(Array)
                      p["replies"]["data"]["children"].each { |v1|
                          processReplies(v1["data"], a)
                          # Cannot remove it until after iterating over it
                          v1["data"].delete("replies")
                          a << v1["data"]
                      }
              end
          end


          theData = event.get("[data][children]")
          if theData.kind_of?(Array)
              theData.each { |v1|
                  processReplies(v1["data"], a)
              }
          end
          #event.remove("data")
          if a != []
              event.set("replies", a)
          end
      '
  }
}
output {
  elasticsearch {
    hosts => ["http://server:9200"]
    index => "logstash-test-reply-%{+YYYY.MM.dd}"
    }
}

Even with an http_poller input I get two events and the five replies on the second one.

I get this error in the logs.

[2019-04-11T12:26:00,500][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-tester-reply-2019.04.11", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x2d2d0b3d>], :response=>{"index"=>{"_index"=>"logstash-tester-reply-2019.04.11", "_type"=>"_doc", "_id"=>"3d-mDWoBqKIyXB-qQN-S", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Can't merge a non object mapping [data.children.data.replies] with an object mapping [data.children.data.replies]"}}}}

I think I see what the issue is. When the data is sent to Elasticsearch, both objects are sent in the same event, and that is why I am only getting the first one, which does not have the comments, only the post details.
Then the second object is failing to index.
I really only need the second object anyway; is there a way to remove that first object with just the post details?

Oh. That's not good. In the second event there are multiple entries in the [data][children] array. In the first one (Couldn't you do that in your own subreddit...) [data][replies] is

                                      "replies" => "",

That tells elasticsearch that it should expect [data.children.data.replies] to be a string. In the second one (Here is a 2nd comment) [data][replies] is an object

                                      "replies" => {
                    "data" => {
                           "after" => nil,
                         "modhash" => "",
                          "before" => nil,
                        "children" => [
                            [0] {
                                "data" => {
[...]

A field in elasticsearch can be a string, or it can be an object, but it has to consistently be one or the other. So when it is a string, we need to change it to be an object

      theData = event.get("[data][children]")
      if theData.kind_of?(Array)
          theData.each_index { |x|
              if theData[x]["data"]["replies"] == ""
                event.set("[data][children][#{x}][data][replies]", {})
              end
          }
          theData.each { |v1|
              processReplies(v1["data"], a)
          }
      end
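That normalization step can be checked on its own with plain hashes (sample data made up): reddit sends "" when a comment has no replies, and the loop replaces that with an empty object so the field type stays consistent.

```ruby
# Replace the empty-string "replies" placeholder with an empty hash so
# Elasticsearch always sees an object in that field.
children = [
  { "data" => { "id" => "c1", "replies" => "" } },
  { "data" => { "id" => "c2", "replies" => { "data" => { "children" => [] } } } }
]

children.each_index { |x|
  if children[x]["data"]["replies"] == ""
    children[x]["data"]["replies"] = {}
  end
}

p children[0]["data"]["replies"]
# => {}
```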

You could do that based on whether or not [data][children][0][data][replies] exists. The normal test for field existence does not work here, you have to do it in ruby.

if ! event.get("[data][children][0][data][replies]")
    event.cancel
end

Getting so close.
I used the solution to turn the empty strings into objects, and I also got rid of that first object, but now we are running into issues with the other comments in the replies.

Can't merge a non object mapping [data.children.data.replies.data.children.data.replies] with an object mapping [data.children.data.replies.data.children.data.replies]"}}}}

Do you need all that nested data to be kept on the event? If you have moved the replies up to the top level can you just remove the replies field?

Correct, I only need the replies as well as [data][children][data].
I just tried to do that, and also made a few changes with mutate, but it didn't work. All it did was remove the replies and data; it didn't leave the root-level fields that I defined in the mutate.

if [replies][body]
  {
  mutate
    {
      add_field => {
        "id" => "%{[replies][id]}"
        "replyid" => "%{[replies][id]}"
        "reference" => "%{[replies][subreddit]}"
        "text" => "%{[replies][body]}"
        "postid" => "%{[replies][link_id]}"
        "author" => "%{[replies][author]}"
        "score" => "%{[replies][score]}"
        "link" => "http://reddit.com%{[replies][permalink]}"
      }
    }
  }
if [data][children][data][body]
  {
  mutate
    {
      add_field => {
        "id" => "%{[data][children][data][id]}"
        "replyid" => "%{[data][children][data][id]}"
        "reference" => "%{[data][children][data][subreddit]}"
        "text" => "%{[data][children][data][body]}"
        "postid" => "%{[data][children][data][link_id]}"
        "author" => "%{[data][children][data][author]}"
        "score" => "%{[data][children][data][score]}"
        "link" => "http://reddit.com%{[data][children][data][permalink]}"
      }
    }
  }
mutate
  {
    remove_field => ["[data]","[replies]"]
  }

Unless you have added a split filter both [replies] and [data][children] are arrays, so you would have to add [0] to everything to reference the first entry.
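For example, to pull fields from the first reply you would write something like this (untested sketch, reusing the field names from your mutate block):

```
if [replies][0][body] {
  mutate {
    add_field => {
      "text" => "%{[replies][0][body]}"
      "link" => "http://reddit.com%{[replies][0][permalink]}"
    }
  }
}
```

Alternatively, split { field => "replies" } turns each entry of the array into its own event, after which your existing unindexed references like [replies][body] work as written.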