Issues parsing JSON with deep nesting

The array looks like this:
data.children.data.replies.data.children.data.replies.data.children.data.replies.data.children.data.replies.data.children.....ETC

I have this nested JSON array, and I would like to get all of the data within each
[children][data]
so if [replies] is present in that, there will be another set of [children][data], and so on and so on.

An example data set is
[JSON data example](https://www.reddit.com/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/.json)

I am looking for a way to bring each [data] section under a single array, so that all comments are present as their own documents.

I found this, but I have no idea how to get it working for my data set, since I am a complete noob when it comes to Ruby:
Flattening an Array with Recursion

def recursive_flatten(array, results = [])
  array.each do |element|
    if element.class == Array
      recursive_flatten(element, results)
    else
      results << element
    end
  end
  results
end
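As a quick sanity check, note that this snippet only descends into nested arrays; hashes are treated as leaf elements, which is why it cannot be applied to this reddit JSON directly. A minimal run:

```ruby
# The snippet above, verbatim: flattens arbitrarily nested arrays
def recursive_flatten(array, results = [])
  array.each do |element|
    if element.class == Array
      recursive_flatten(element, results)
    else
      results << element
    end
  end
  results
end

p recursive_flatten([1, [2, [3, 4]], 5])
# => [1, 2, 3, 4, 5]

# Hashes are NOT flattened; they are appended whole:
p recursive_flatten([{ "a" => [1, 2] }])
# => [{"a"=>[1, 2]}]
```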

If you want these five ids

    [0] "ekizeo5",
    [1] "ekizbej",
    [2] "ekize5c",
    [3] "ekizc4b",
    [4] "ekizdfx"

then you could start with

    json { source => "message" target => "[@metadata][json]" remove_field => "message" }
    ruby {
        code => '
            a = []

            def processReplies(p, a)
                if  p["replies"] and
                    p["replies"]["data"] and
                    p["replies"]["data"]["children"] and
                    p["replies"]["data"]["children"].kind_of?(Array)
                        p["replies"]["data"]["children"].each { |v1|
                            processReplies(v1["data"], a)
                            v1["data"].delete("replies")
                            a << v1["data"]
                        }
                end
            end

            theData = event.get("[@metadata][json]")
            theData.each { |v1|
                v2 = v1["data"]["children"]
                v2.each { |v3|
                    processReplies(v3["data"], a)
                }
            }
            event.set("replies", a)
        '
    }
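In case you want to experiment with the recursion outside Logstash, here is the same processReplies as a standalone Ruby sketch, with the event API replaced by a plain hash (the sample hash below is made up for illustration, not real reddit output):

```ruby
# Walk nested replies depth-first, collecting each reply's "data" hash
# into the flat array a; "replies" is deleted after recursing into it.
def processReplies(p, a)
  if  p["replies"] and
      p["replies"]["data"] and
      p["replies"]["data"]["children"].kind_of?(Array)
    p["replies"]["data"]["children"].each { |v1|
      processReplies(v1["data"], a)
      v1["data"].delete("replies")
      a << v1["data"]
    }
  end
end

# Hypothetical comment: one reply ("r1") that itself has one reply ("r2").
# Reddit uses an empty string for "no replies"; the guard above handles
# that because ""["data"] is nil in Ruby.
comment = {
  "id"      => "c1",
  "replies" => { "data" => { "children" => [
    { "data" => {
      "id"      => "r1",
      "replies" => { "data" => { "children" => [
        { "data" => { "id" => "r2", "replies" => "" } }
      ] } }
    } }
  ] } }
}

a = []
processReplies(comment, a)
p a.map { |d| d["id"] }
# => ["r2", "r1"]   (innermost reply first)
```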

Apologies if my ruby coding style (or lack of it) makes your eye-balls bleed :slight_smile:

Thx, but the code you have there doesn't seem to be recursion. My JSON was just an example; there could be 5, or there could be 10 or more levels of nesting in different examples.

Look again.

It's only returning a single doc.

input {
  http_poller {
    urls => {
      test1 => "https://www.reddit.com/r/softwaretesting/comments/bbhytv/.json"
    }
    request_timeout => 60
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    schedule => { cron => "*/10 * * * * UTC"}
    # proxy => {host => "proxy.org", port => 80, scheme => 'http', user => 'username@host', password => 'password'}
    codec => "json"
    tags => ["testing"]
  }
}
filter {
if "testing" in [tags]
  {
    json 
    {
      source => "message"
      target => "[@metadata][json]" 
      remove_field => "message"
    }
    ruby {
        code => '
            a = []

            def processReplies(p, a)
                if  p["replies"] and
                    p["replies"]["data"] and
                    p["replies"]["data"]["children"] and
                    p["replies"]["data"]["children"].kind_of?(Array)
                        p["replies"]["data"]["children"].each { |v1|
                            processReplies(v1["data"], a)
                            v1["data"].delete("replies")
                            a << v1["data"]
                        }
                end
            end

            theData = event.get("[@metadata][json]")
            theData.each { |v1|
                v2 = v1["data"]["children"]
                v2.each { |v3|
                    processReplies(v3["data"], a)
                }
            }
            event.set("replies", a)
        '
    }
  }
}
output {
  if "testing" in [tags]
  { elasticsearch {
    hosts => ["http://server:9200"]
    index => "logstash-testing-reply-%{+YYYY.MM.dd}"
    }
  }
}

I know the JSON comes in as an array with a 0 and a 1 entry, and the comments are only in the 1; I don't need the 0. Could that be why?

I took the JSON you linked to in your first post, put it on a single line, and read it using a file input. What I get back from my code is

   "replies" => [
    [0] {
         [...]
                            "permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizeo5/",
         [...]
                            "body_html" => "&lt;div class=\"md\"&gt;&lt;p&gt;reply to 2nd comment&lt;/p&gt;\n&lt;/div&gt;",
         [...]
    },
    [1] {
         [...]
                            "permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekize5c/",
         [...]
                                 "body" => "&gt;here is a reply to the reply of the 1st comment\n\nanother  here is a reply to the reply of the 1st comment",
         [...]
    },
    [2] {
         [...]
                            "permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizdfx/", 
         [...]
                                 "body" => "another level deeper on the 1st comment",
         [...]
    },
    [3] {
         [...]
                            "permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizc4b/",
         [...]
                                 "body" => "here is a reply to the reply of the 1st comment",
         [...]
    },
    [4] {
         [...]
                            "permalink" => "/r/softwaretesting/comments/bbhytv/test_json_output_of_comments/ekizbej/",
         [...]
                                 "body" => "here is a reply to the 1st comment",
         [...]
    }
],
         [...]

}

Get rid of the json codec on the input. The filter expects the JSON to be serialized in [message].

Alternatively, if you have other reasons for using the codec, get rid of the json filter and change the ruby code. The codec appears to split the array into events, so this piece of code will need to change

        theData.each { |v1|
            v2 = v1["data"]["children"]
            v2.each { |v3|
                processReplies(v3["data"], a)
            }
        }

It's going to be something more like

        theData = event.get("[data][children]")
        if theData.kind_of?(Array)
            theData.each { |v1|
                processReplies(v1["data"], a)
            }
        end
        #event.remove("data")

but that isn't quite right yet. I'll take another look in about 14 hours.


Thx for the help, I really appreciate the time you are spending.

I did remove the json codec, but now I am receiving an error when it tries to index the data. I also ran stdout and it shows a _rubyexception.

[2019-04-10T17:30:01,523][ERROR][logstash.filters.ruby    ] Ruby exception occurred: undefined method `each' for nil:NilClass


[2019-04-10T17:30:01,529][ERROR][logstash.filters.ruby    ] Ruby exception occurred: undefined method `each' for nil:NilClass


[2019-04-10T17:30:01,677][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2ndtest-reply-2019.04.10", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x55abba13>], :response=>{"index"=>{"_index"=>"logstash-2ndtest-reply-2019.04.10", "_type"=>"_doc", "_id"=>"lcmWCWoBpdFPTkX2Ohre", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [data.children.data.edited] of type [boolean] in document with id 'lcmWCWoBpdFPTkX2Ohre'", "caused_by"=>{"type"=>"json_parse_exception", "reason"=>"Current token (VALUE_NUMBER_FLOAT) not of boolean type\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@655dde02; line: 1, column: 175676]"}}}}}

If you want to keep the codec try

ruby {
    code => '
        a = []

        def processReplies(p, a)
            if  p["replies"] and
                p["replies"]["data"] and
                p["replies"]["data"]["children"] and
                p["replies"]["data"]["children"].kind_of?(Array)
                    p["replies"]["data"]["children"].each { |v1|
                        processReplies(v1["data"], a)
                        # Cannot remove it until after iterating over it
                        v1["data"].delete("replies")
                        a << v1["data"]
                    }
            end
        end


        theData = event.get("[data][children]")
        if theData.kind_of?(Array)
            theData.each { |v1|
                processReplies(v1["data"], a)
            }
        end
        #event.remove("data")
        if a != []
            event.set("replies", a)
        end
    '
}

For the sample data that gets you two events, and the second one has an array of 5 replies.

It's got to be an issue with the http_poller and how it is bringing in the JSON. It still is not flattening the message with the comments; it's only bringing in that first object, which is just the post title and other values ("Test JSON output of comments").

input {
  http_poller {
    urls => {
      test1 => "https://www.reddit.com/r/softwaretesting/comments/bbhytv/.json"
    }
    request_timeout => 60
    schedule => { cron => "* * * * * UTC"}
    codec => "json"
  }
}
filter {
  ruby {
      code => '
          a = []

          def processReplies(p, a)
              if  p["replies"] and
                  p["replies"]["data"] and
                  p["replies"]["data"]["children"] and
                  p["replies"]["data"]["children"].kind_of?(Array)
                      p["replies"]["data"]["children"].each { |v1|
                          processReplies(v1["data"], a)
                          # Cannot remove it until after iterating over it
                          v1["data"].delete("replies")
                          a << v1["data"]
                      }
              end
          end


          theData = event.get("[data][children]")
          if theData.kind_of?(Array)
              theData.each { |v1|
                  processReplies(v1["data"], a)
              }
          end
          #event.remove("data")
          if a != []
              event.set("replies", a)
          end
      '
  }
}
output {
  elasticsearch {
    hosts => ["http://server:9200"]
    index => "logstash-test-reply-%{+YYYY.MM.dd}"
    }
}

Even with an http_poller input I get two events and the five replies on the second one.

I get this error in the logs.

[2019-04-11T12:26:00,500][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-tester-reply-2019.04.11", :_type=>"_doc", :routing=>nil}, #<LogStash::Event:0x2d2d0b3d>], :response=>{"index"=>{"_index"=>"logstash-tester-reply-2019.04.11", "_type"=>"_doc", "_id"=>"3d-mDWoBqKIyXB-qQN-S", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Can't merge a non object mapping [data.children.data.replies] with an object mapping [data.children.data.replies]"}}}}

I think I see what the issue is. When the data is sent to Elasticsearch, both objects are sent in the same event, and that is why I am only getting the first one, which does not have the comments, only the post details.
Then the second object is failing to index.
I really only need the second object anyway; is there a way to remove that first object with just the post details?

Oh. That's not good. In the second event there are multiple entries in the [data][children] array. In the first one (Couldn't you do that in your own subreddit...) [data][replies] is

                                      "replies" => "",

That tells elasticsearch that it should expect [data.children.data.replies] to be a string. In the second one (Here is a 2nd comment) [data][replies] is an object

                                      "replies" => {
                    "data" => {
                           "after" => nil,
                         "modhash" => "",
                          "before" => nil,
                        "children" => [
                            [0] {
                                "data" => {
[...]

A field in elasticsearch can be a string, or it can be an object, but it has to consistently be one or the other. So when it is a string, we need to change it to be an object

      theData = event.get("[data][children]")
      if theData.kind_of?(Array)
          theData.each_index { |x|
              if theData[x]["data"]["replies"] == ""
                event.set("[data][children][#{x}][data][replies]", {})
              end
          }
          theData.each { |v1|
              processReplies(v1["data"], a)
          }
      end
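That normalization step can be checked on its own with plain hashes (sample data made up): reddit sends "" when a comment has no replies, and the loop replaces that with an empty object so the field type stays consistent.

```ruby
# Replace the empty-string "replies" placeholder with an empty hash so
# Elasticsearch always sees an object in that field.
children = [
  { "data" => { "id" => "c1", "replies" => "" } },
  { "data" => { "id" => "c2", "replies" => { "data" => { "children" => [] } } } }
]

children.each_index { |x|
  if children[x]["data"]["replies"] == ""
    children[x]["data"]["replies"] = {}
  end
}

p children[0]["data"]["replies"]
# => {}
```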

You could do that based on whether or not [data][children][0][data][replies] exists. The normal test for field existence does not work here, you have to do it in ruby.

if ! event.get("[data][children][0][data][replies]")
    event.cancel
end

Getting so close.
I used the solution to turn the empty strings into objects, and I also got rid of that first object, but now we are running into issues with the other comments in the replies.

Can't merge a non object mapping [data.children.data.replies.data.children.data.replies] with an object mapping [data.children.data.replies.data.children.data.replies]"}}}}

Do you need all that nested data to be kept on the event? If you have moved the replies up to the top level can you just remove the replies field?

Correct, I only need the replies as well as [data][children][data].
I just tried to do that, and also made a few changes with mutate, but it didn't work. All it did was remove the replies and data; it didn't leave the root-level fields that I defined in the mutate.

if [replies][body]
  {
  mutate
    {
      add_field => {
        "id" => "%{[replies][id]}"
        "replyid" => "%{[replies][id]}"
        "reference" => "%{[replies][subreddit]}"
        "text" => "%{[replies][body]}"
        "postid" => "%{[replies][link_id]}"
        "author" => "%{[replies][author]}"
        "score" => "%{[replies][score]}"
        "link" => "http://reddit.com%{[replies][permalink]}"
      }
    }
  }
if [data][children][data][body]
  {
  mutate
    {
      add_field => {
        "id" => "%{[data][children][data][id]}"
        "replyid" => "%{[data][children][data][id]}"
        "reference" => "%{[data][children][data][subreddit]}"
        "text" => "%{[data][children][data][body]}"
        "postid" => "%{[data][children][data][link_id]}"
        "author" => "%{[data][children][data][author]}"
        "score" => "%{[data][children][data][score]}"
        "link" => "http://reddit.com%{[data][children][data][permalink]}"
      }
    }
  }
mutate
  {
    remove_field => ["[data]","[replies]"]
  }

Unless you have added a split filter both [replies] and [data][children] are arrays, so you would have to add [0] to everything to reference the first entry.
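For example, to pull fields from the first reply you would write something like this (untested sketch, reusing the field names from your mutate block):

```
if [replies][0][body] {
  mutate {
    add_field => {
      "text" => "%{[replies][0][body]}"
      "link" => "http://reddit.com%{[replies][0][permalink]}"
    }
  }
}
```

Alternatively, split { field => "replies" } turns each entry of the array into its own event, after which your existing unindexed references like [replies][body] work as written.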